CiscoPress - Big Data Concepts Methodologies Tools and Applications (2016)
Copyright © 2016 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any
means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not
indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Big Data: Concepts, Methodologies, Tools and Applications is organized into eight distinct sections that provide comprehensive coverage of
important topics. The sections are:
1. Fundamental Concepts and Theories;
2. Development and Design Methodologies;
3. Tools and Technologies;
4. Utilization and Application;
5. Organizational and Social Implications;
6. Managerial Impact;
7. Critical Issues;
8. Emerging Trends.
The following paragraphs provide a summary of what to expect from this invaluable reference tool.
Section 1, “Fundamental Concepts and Theories,” serves as a foundation for this extensive reference tool by addressing crucial theories essential
to the understanding of Big Data. Introducing the book is Big Data Overview by Yushi Shen, Yale Li, Ling Wu, Shaofeng Liu, and Qian Wen; a
great foundation laying the groundwork for the basic concepts and theories that will be discussed throughout the rest of the book. Another
chapter of note in Section 1 is titled The Performance Mining Method: Extracting Performance Knowledge from Software Operation Data by
Stella Pachidi and Marco Spruit. Section 1 concludes, and leads into the following portion of the book with a nice segue chapter, Philosophising Data: A Critical Reflection On The 'Hidden' Issues by Jackie Campbell, Victor Chang, and Amin Hosseinian-Far.
Section 2, “Development and Design Methodologies,” presents in-depth coverage of the conceptual design and architecture of Big Data. Opening
the section is Big Data Architecture: Storage and Computation by Siddhartha Duggirala. Through case studies, this section lays excellent
groundwork for later sections that will get into present and future applications for Big Data, including, of note: Big Data Warehouse Automatic
Design Methodology by Francesco Di Tria, Ezio Lefons, and Filippo Tangorra. The section concludes with an excellent work by Erdem Kaya,
Mustafa Tolga Eren, Candemir Doger, and Selim Saffet Balcisoy, titled Building a Visual Analytics Tool for Location-Based Services.
Section 3, “Tools and Technologies,” presents extensive coverage of the various tools and technologies used in the implementation of Big Data.
Section 3 begins where Section 2 left off, though this section describes more concrete tools in place in the modeling, planning, and application of
Big Data. The first chapter, Data Intensive Cloud Computing: Issues and Challenges by Jayalakshmi D. S., R. Srinivasan, and K. G. Srinivasa, lays
a framework for the types of works that can be found in this section. Section 3 is full of excellent chapters like this one, including such titles as Big
Data in Telecommunications: Seamless Network Discovery and Traffic Steering with Crowd Intelligence by Yen Pei Tay, Vasaki Ponnusamy, and
Lam Hong Lee; and A Cloud-Aware Distributed Object Storage System to Retrieve Large Data via HTML5-Enabled Web Browsers by Ahmet Artu
Yıldırım and Dan Watson. This section concludes with Graph Mining and Its Applications in Studying Community-Based Graph under the
Preview of Social Network written by Bapuji Rao and Anirban Mitra.
Section 4, “Utilization and Application,” describes how the broad range of Big Data efforts has been utilized and offers insight on and important
lessons for their applications and impact. The first chapter in the section is titled A Survey of Big Data Analytics Systems: Appliances, Platforms,
and Frameworks written by M. Baby Nirmala. This section includes the widest range of topics because it describes case studies, research,
methodologies, frameworks, architectures, theory, analysis, and guides for implementation. The breadth of topics covered in this section is also
reflected in the diversity of its authors, from countries all over the globe. Some chapters to note include: Application of Big Data in Healthcare:
Opportunities, Challenges and Techniques by Md Rakibul Hoque and Yukun Bao; and Evaluation of Topic Models as a Preprocessing Engine for
the Knowledge Discovery in Twitter Datasets by Stefan Sommer, Tom Miller, and Andreas Hilbert. The section concludes with Big Data Analytics
on the Characteristic Equilibrium of Collective Opinions in Social Networks by Yingxu Wang and Victor J. Wiebe.
Section 5, “Organizational and Social Implications,” includes chapters discussing the organizational and social impact of Big Data. The section
opens with Blending Technology, Human Potential, and Organizational Reality: Managing Big Data Projects in Public Contexts by Jurgen
Janssens. Where Section 4 focused on the many broad applications of Big Data technology, this section focuses exclusively on how these
technologies affect human lives, either through the way they interact with each other, or through how they affect behavioral/workplace situations.
Another interesting chapter to note is Synchronizing Execution of Big Data in Distributed and Parallelized Environments by Gueyoung Jung and
Tridib Mukherjee. The section concludes with Here Be Dragons: Mapping Student Responsibility in Learning Analytics by Paul Prinsloo and
Sharon Slade.
Section 6, “Managerial Impact,” presents focused coverage of Big Data in a managerial perspective. The section begins with Business Process
Improvement through Data Mining Techniques: An Experimental Approach by Loukas K. Tsironis. This section serves as a vital resource for
developers who want to utilize the latest research to bolster the capabilities and functionalities of their processes. Chapters in this section offer
unmistakable value to managers looking to implement new strategies that work at larger bureaucratic levels. The section concludes with Big Data
Analytics Demystified by Pethuru Raj.
Section 7, “Critical Issues,” presents coverage of academic and research perspectives on Big Data tools and applications. The section begins
with What Does Learning Look Like?: Data Visualization of Art Teaching and Learning by Pamela G. Taylor. Chapters in this section, such
as Issues Related to Network Security Attacks in Mobile Ad Hoc Networks (MANET) by Rakesh Kumar Singh, will look into theoretical
approaches and offer alternatives to crucial questions on the subject of Big Data. The section concludes with From Data to Vision: Big Data in
Government by Rhoda Joseph.
Section 8, “Emerging Trends,” highlights areas for future research within the field of Big Data, opening with The New “ABC” of ICTs (Analytics +
Big Data + Cloud Computing): A Complex Trade-Off between IT and CT Costs by José Carlos Cavalcanti. This section contains chapters that look
at what might happen in the coming years that can extend the already staggering amount of applications for Big Data. The final chapter of the
book looks at an emerging field within Big Data, in the excellent contribution, Emerging Role of Big Data in Public Sector by Amir Manzoor.
Although the primary organization of the contents in this multi-volume work is based on its eight sections, offering a progression of coverage of
the important concepts, methodologies, technologies, applications, social issues, and emerging trends, the reader can also identify specific
contents by utilizing the extensive indexing system listed at the end of each volume. As a comprehensive collection of research on the latest
findings related to using technology to provide various services, Big Data: Concepts, Methodologies, Tools and Applications provides
researchers, administrators and all audiences with a complete understanding of the development of applications and concepts in Big Data. Given
the vast number of issues concerning usage, failure, success, policies, strategies, and applications of Big Data in countries around the world, Big
Data: Concepts, Methodologies, Tools and Applications addresses the demand for a resource that encompasses the most pertinent research in
technologies being employed to globally bolster the knowledge and applications of Big Data.
Section 1
Big Data Overview
Yale Li
Microsoft Corporation, USA
Ling Wu
EMC Corporation, USA
Shaofeng Liu
Microsoft Corporation, USA
Qian Wen
Endronic Corp, USA
ABSTRACT
This chapter provides an overview of big data and its environment and opportunities. It starts with a definition of big data and describes the
unique characteristics, structure, and value of big data, and the business drivers for big data analytics. It defines the role of the data scientist and
describes the new ecosystem for big data processing and analysis.
INTRODUCTION
Today, we hear a lot about Big Data. What is Big Data? (Press, 2013) Is there a definite size beyond which data becomes Big Data? Is it the
number of rows or the number of columns? Is a spreadsheet that contains a million rows Big Data? Is a database that has a billion records Big
Data? How big is Big Data?
Wikipedia defines big data as “a collection of data sets so large and complex that it becomes difficult to process using the available database
management tools. The challenges include how to capture, curate, store, search, share, analyze and visualize big data. The trend to larger data
sets is due to the additional information derivable from the analysis of a single large set of related data, as compared to separate smaller sets with
the same total amount of data, allowing such correlations to be found to spot business trends, determine the quality of research, prevent diseases,
link legal citations, combat crimes and determine real-time roadway traffic conditions. ” (Wikipedia, 2012)
In today's environment, we have access to more types of data. These data sources include online transactions, social networking activities, mobile
device services, internet gaming, etc. While public open data is growing, increasingly powerful ERPs also bring corporate data to a new
level. Big data is changing the world. Data sources are expanding, and data from Facebook, Twitter, YouTube, Google, etc., is expected to grow 50X in the
next 10 years.
An IDC study shows that in 2010, there were 1.2 zettabytes (1,200,000,000,000,000,000,000 bytes) of information, a trillion billion bytes of
information to be managed and analyzed. It is estimated that by 2020, there will be 35 zettabytes of information; the data deluge is expected to grow
44X in this decade. About 90% of the information being created is unstructured, like website clicks, mobile phone calls, Facebook posts, call
center conversations, tweets, videos and emails. (Gens, 2013) Where is all this big data going? It is going to be stored and processed in the cloud. When we talk
about cloud computing, we cannot miss big data.
BIG DATA DEFINITION
Big data is also defined in Wikipedia as "data sets that are too large for storage, management, processing and analysis, presenting challenges
beyond traditional IT techniques." (Wikipedia, 2012) BIG is a term that is relative to the size of the data and the scope of the IT infrastructure
that is in place. Transforming big data could benefit scientific discovery, environmental and biomedical research, and national security. To do that,
big data requires the use of new technical architectures and analytics tools to generate business value from the huge volume of data and to create insights.
Big Data comes in all kinds of forms: from highly structured ERP (Enterprise Resource Planning) data or CRM (Customer Relationship
Management) data, to multi-million-row text files, to video files and machine-generated sensor data. The common features are high data
volume and data complexity. Most big data is unstructured or semi-structured, and requires new techniques and tools to analyze.
Big Data examples are everywhere in our lives. With the popularity of mobile computing and self-expression tools, everyone has the ability to
share their thoughts and ideas worldwide. Smartphones carry sensors like GPS, accelerometer, microphone, camera and Bluetooth, which can
collect huge amounts of data and allow research in the behavioral and social sciences, using large-scale mobile data to characterize and understand
real-life phenomena. In 2011, there were 6 billion mobile phone subscribers, a figure that had grown 45 percent annually for the previous four years. (ITU,
2011) A quarter of them use smartphones. By 2014, mobile internet use should overtake desktop internet use.
There were more than 845 million active Facebook users by the end of 2011, 50 percent of whom log onto Facebook every day; 30 billion pieces of content
are shared every month. Every 60 seconds, there are 510,000 posted comments, 293,000 status updates and 135,000 uploaded photos. 20
million Facebook applications are installed per day. In just 20 minutes, over 1 million links are shared. (Protalinski, 2012)
Twitter has 100 million active users around the world; more than half of them log in to Twitter each day to follow their interests. The average user
has 115 followers. An average of 190,000,000 tweets are sent per day, and Twitter handles 1.6 billion queries per day. 34% of marketers have
generated leads using Twitter and 20% have closed deals. (Twitter, 2011)
YouTube has over 800 million unique visitors per month, who generate 92 billion page views. 72 hours of video are uploaded every minute
and over 4 billion hours of video are watched each month. More video is uploaded to YouTube in 60 days than the three major US TV networks
created in 60 years. (YouTube.com Statistics)
Google had 4.7 billion searches per day in 2011. Google Plus reached 10 million users in 16 days, and it has been reported to have reached 400 million
users in just one month. (Statistic Brain, 2012)
LinkedIn has 135,000,000 users; Wikipedia hosts over 17 million articles; Foursquare sees 2,000,000 check-ins a week; Instagram reached
13 million users in the 13 months after its launch and has had 150,000,000 photos uploaded; Flickr hosts over 5 billion images. (Decision Stats,
2011) (Pring, 2012)
BIG DATA CHARACTERISTICS
Aside from the ability to manage more data than ever before, we have access to more types of data. These data sources include Internet
transactions, social networking activity, automated sensors, mobile devices, scientific instrumentation and many others. In addition to static data
points, transactions can create a certain “velocity” to this data growth. As an example, the extraordinary growth of social media is generating new
transactions and records on the fly.
Big Data characteristics include 3 V’s: Volume, Velocity and Variety. (Baunach, 2012)
• Huge Data Volume refers to the size of big data, which is definitely huge. Big data is a relative term: for some organizations, 10 terabytes of
data is unmanageable, while other organizations may find 50 petabytes overwhelming. For instance, Twitter alone generates more than 7
terabytes of data every day, and Facebook generates 10 TB. From 2010 to 2020, data is expected to increase 44X, from 1.2 zettabytes (ZB) to 35.2 ZB.
Enterprises are facing massive volumes of data.
• As a ZDNet article reports, according to Oracle president Mark Hurd, the number of devices supplying data back to businesses and
enterprises is expected to boom to 50 billion by the end of the decade. He also mentions that "data is growing by 35 to 40 percent a year" and that
the world is "drowning" in vast amounts of data, which has grown eightfold in the past seven years, and companies are running out of
storage space. With more than nine billion existing devices connected to the Internet, businesses are struggling to cope with the storage of
the vast amounts of data they collect. (Whittaker, 2012) (Oracle, 2013)
• Velocity is the speed of data input and output. This becomes increasingly important as more and more machine-generated data
explodes by the millisecond: for example, clickstream data from websites, sensors that monitor movements, and cellphone providers that
generate GPS data. New and modified data must often be immediately available upon creation. Search and analysis of data on the move are
required as the data arrives. There might be hundreds of thousands of data events per second, and when a data event occurs, the handling speed
becomes increasingly more of a challenge (a minimal sketch of handling such event rates appears at the end of this section).
• Velocity is also about the rate of change in the data, and how quickly it can be used to generate value. The velocity of large data streams
powers the ability to parse text, detect sentiments and identify new patterns. Big Data systems need to be designed to handle large amounts of
new and updated data flowing into the system, by providing real-time metadata analysis, up-to-the-minute updates to existing data, real-time
activity streams delivered in context to users' applications, and the immediate incorporation of users' activity data and feedback. Real-time
offers require that promotions be aligned with geo-location data and customer purchase history. (Oracle, 2013)
• Variety refers to the various types of data that cannot easily be captured and managed in a traditional database. It also reflects the varying
degrees of data structure, including structured, semi-structured and unstructured data, as well as rich media and transactional data
with various data types, such as content, geo-spatial, machine, mobile, streaming, audio, video, text, weblog and social media data. (Halfon,
2012) (Clegg, 2012)
Big data implementations require processing, managing and analyzing the variety of data formats and types. The big data platform should be able
to analyze and ingest all of these various data types and fuse them in search results and analytics to produce insights.
The challenges of dealing with the three V's of big data have been taken to a new level by the growth of unstructured data sources. Social networks
and mobile devices generate more data, and big data solutions require support for a wide variety of unstructured data, vast volumes of
data, real-time analytics, and diverse data models and application platforms. (Lopez, 2012)
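The sketch promised in the velocity discussion above is shown here. It is only a minimal, single-process illustration of keeping up with a high event rate: events are counted over a one-second sliding window and an alert is raised when the rate crosses a threshold. The event source, the threshold and the field names are invented for illustration.

from collections import deque
import time


def monitor(events, window_seconds=1.0, alert_threshold=1000):
    """Consume (timestamp, payload) events and alert when the per-window rate exceeds the threshold."""
    window = deque()
    alerted = False
    for ts, _payload in events:
        window.append(ts)
        # Drop events that have fallen out of the sliding window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > alert_threshold and not alerted:
            print(f"{ts:.3f}: rate exceeded {alert_threshold} events/sec")
            alerted = True
        elif len(window) <= alert_threshold:
            alerted = False


if __name__ == "__main__":
    # Simulate a synthetic burst of click events arriving at roughly 2,000 per second.
    now = time.time()
    synthetic = ((now + i / 2000.0, {"click": i}) for i in range(5000))
    monitor(synthetic)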
BIG DATA VALUE
As a core business asset, the value of data lies within a company's strategy. The value of big data is reflected in leveraging big data solutions
that can actually solve real-world business problems, such as the complexity of product proliferation, and help companies better align their
product offerings and supply chains with consumer buying patterns.
Big data presents big challenges, but it also presents new opportunities for businesses to achieve unprecedented competitive advantages. From
enterprises to merchants, every organization can use big data, such as consumer information, transaction details, product inventory, as well as
web logs, Facebook content and YouTube videos, to improve its strategy by using insights from data that has previously been
discarded or could not be processed due to technology limitations.
Big data analytics examines large amounts of data to uncover hidden patterns, unknown correlations and other information, and to reveal
insights which can provide competitive advantages over rival organizations, or support better business decisions for more effective marketing and
customized services that increase revenue.
Big data analytics can be achieved through data mining and predictive analytics. Due to the size or the level of structure of the data, it cannot
be efficiently analyzed using traditional databases and related tools. New big data technologies such as NoSQL databases, Hadoop,
MapReduce and MPP have emerged to support big data analytics.
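To make the MapReduce idea mentioned above concrete, here is a minimal, single-process sketch of the pattern: a map step emits (key, value) pairs, a shuffle step groups them by key, and a reduce step aggregates each group. The sales records are made up; a real Hadoop job distributes the same three steps across a cluster.

from collections import defaultdict


def map_phase(record):
    # Emit one (product, revenue) pair per sale record.
    yield record["product"], record["price"] * record["quantity"]


def shuffle(pairs):
    # Group emitted values by key, as the framework's shuffle stage would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Aggregate all values observed for a key.
    return key, sum(values)


sales = [
    {"product": "coffee", "price": 4.0, "quantity": 2},
    {"product": "cereal", "price": 3.5, "quantity": 1},
    {"product": "coffee", "price": 4.0, "quantity": 1},
]

pairs = (pair for record in sales for pair in map_phase(record))
totals = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(totals)  # {'coffee': 12.0, 'cereal': 3.5}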
With Big Data analytics technology, by combining a large number of signals from users' activities and those of their friends, organizations can
offer better products, develop deeper customer relationships and build more predictive business strategies. Vendors are able to craft highly
personalized user experiences, and create new kinds of advertising businesses, products and services.
Empowered by Big Data, and by turning Big Data into valuable business assets, organizations become more efficient. They enjoy improved
responsiveness, such as faster time to market, better business strategies and timely execution of operations to gain competitive advantages.
A report issued by the McKinsey Global Institute (MGI), "Big data: the next frontier for innovation, competition and productivity," shows that
data are becoming a factor of production, like physical or human capital. Companies that can harness big data can trample data-incompetents.
Data equity, a newly coined phrase, is to become as important as brand equity. MGI believes that big data has already been widely adopted by
businesses, and is creating value.
Companies have realized the value of big data and use it to generate more detailed pictures of their customers. The British retailer Tesco
collects 1.5 billion pieces of data every month, and uses them to adjust its prices and promotions. The American retailer Williams-Sonoma
uses information such as the income, spending habits and house value of its 60 million customers to produce different iterations of its catalogue.
Online retailer Amazon claims that 30% of its sales are generated by its recommendation engine: when we purchase items from Amazon, we
always get products recommended through "you may also like xxxx", which tempt us to buy more. The mobile revolution adds a new dimension
to consumer targeting. Companies such as America's Placecast are developing technologies which allow them to track potential consumers, and
send them enticing offers when they are within a few yards of a Starbucks.
The big data revolution is also changing established industries and business models. Can you imagine IT firms involving themselves in the
healthcare markets? Google Health and Microsoft HealthVault allow consumers to track their health and record their treatments. Manufacturers
are transforming into service companies: IBM has changed itself from a hardware manufacturer to a service (solutions) provider; BMW uses
sensor data to tell its customers when their cars need to be serviced. Insurance firms monitor the driving style of their customers, and offer them
rates based on that data instead of their age and gender.
Even government agencies are changing their model of operations. Government agencies use big data to generate statistics, to help them
understand local and global patterns and trends, in order to improve their services. Tax authorities depend on big data analytics to identify
and predict fraudulent activities, estimate the tax gap, simulate the effectiveness of policy changes on tax behavior, and model financial
risk. For example, the German Federal Labor Agency managed to cut its annual spending by $14 billion over three years, while reducing the
length of time people spend out of work, through a detailed study of its clients.
MGI believes that big data could create a new wave of productivity growth. With proper use, big data can enable retailers to increase their
operating margins by 60%.
Some retailers are using data modeling to optimize their marketing expenditures. Retailers also use big data technologies to analyze in-store
camera video and create maps of consumer foot traffic throughout their stores. These, combined with sales data, help to optimize store
layout planning and product placement to attract consumers. Big data is also used to better utilize distribution networks, and to delight
customers with improved on-time delivery of their web orders.
Recently, the University of Pittsburgh Medical Center (UPMC) announced a five-year, $100 million investment to create a comprehensive data
warehouse, which is to bring together clinical, financial, administrative, genomic and other information from more than 200 sources across
UPMC, UPMC Health Plan and other affiliated entities. UPMC believes that enhancing the collection of big data, and translating it into
actionable insights, can drive greater efficiency, help personalize healthcare, and better define patient populations with a greater
level of granularity. The analysis of unstructured patient data and the mining of claims data for insights can improve wellness and patient
compliance, advance medical research, and also help government and insurance agencies better detect fraud, identify best care delivery
practices and improve bio-surveillance. (Lewis, 2012)
Big data is already starting to have an impact on patient care. In the paper "Supercomputer Speeds up Cancer Analysis" (Lewis, 2012),
oncologists are taking advantage of a supercomputer that reduces the time it takes to do a genomic analysis of a cancer tumor from eight weeks
down to 47 seconds per patient. This allows the oncologists to prescribe treatment based on the molecular pathways of the tumor, rather than on
its anatomical location, reducing the chances of wrong treatments.
The Centre for Economics and Business Research (CEBR) has conducted an independent economic study on big data. (CEBR, 2012) In the study, they
investigated how United Kingdom organizations could unlock the economic value of big data through big data analytics, based on the premise that,
with high-performance analytic solutions, organizations can analyze huge amounts of data quickly to reveal previously unseen patterns, and
make better and faster business decisions.
CEBR estimates that data equity was worth 25.1 billion pounds to the UK private and public sectors in 2011. The increasing adoption of big data
analytics technologies is expected to expand this number to 40.7 billion pounds on an annual basis by 2017. According to their research, enhanced
customer intelligence informed by big data is able to meet consumer demands and evaluate customer behavior more effectively, and is thus expected to produce
73.8 billion pounds in benefits over the years 2012-2017. They forecast that the supply chain is to gain 45.9 billion pounds from predictive
analytics forecasting demand, anticipating replenishment points, and optimizing stock and resource allocations to greatly reduce costs. They also
anticipate that the public sector is to save 2 billion pounds in fraud detection, and generate 4 billion pounds through better performance
management. They estimate that big data innovation can lead to 24.1 billion pounds in contributions, by assisting in the evolution of new
products and services and the creation of new business markets. In particular, the utilization of advanced analytics can lead to new product
development benefits of 8.1 billion pounds over the next 5-year period.
BIG DATA STRUCTURE
What does Big Data look like? What kind of data is Big Data? Big data has some common characteristics: millions if not billions of rows, data that
is too large to be stored on a single storage array, too large to be processed by a single machine, and growing at a high rate. The most important
feature of big data is its structure. Big data can be divided into two data types: structured and unstructured data. Unstructured data can be
subdivided into semi-structured, quasi-structured and completely unstructured data.
Structured data is data that contains a defined structure, type and format. Unstructured data is data that has no inherent structure and no data
type definition. Semi-structured data and quasi-structured data are in-between: Semi-Structured data includes textual data files with a
discernable pattern, while quasi-structured data is textual data with erratic data formats, which can be formatted with effort, tools and time.
Although the above data types are different, they can in reality be mixed in many cases. Consider, for example, a CRM application that records customer
support information for a call center. The backend database is Microsoft SQL Server, which stores call logs for the call center. The
SQL database can store structured data such as date and time, ticket number, customer name, account number, problem type, and the tier 1 support
person's name, which could be entered by the help desk person from a GUI. The application also stores unstructured data, such as call log
information from an email ticket, or an actual phone call description. Around 80-90% of future data growth will come from unstructured data
types. (Oracle, 2012)
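A minimal sketch of that mixed case follows: structured ticket fields stored side by side with an unstructured free-text call log. SQLite is used here purely for illustration, since the chapter's example backend is SQL Server; the table and column names are invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE support_ticket (
        ticket_number   INTEGER PRIMARY KEY,
        opened_at       TEXT,   -- structured: date/time
        customer_name   TEXT,   -- structured
        account_number  TEXT,   -- structured
        problem_type    TEXT,   -- structured, from a fixed list
        call_log        TEXT    -- unstructured free text from the call
    )
""")
conn.execute(
    "INSERT INTO support_ticket VALUES (?, ?, ?, ?, ?, ?)",
    (1001, "2012-07-09 14:32", "A. Customer", "ACCT-42", "billing",
     "Customer called about a duplicate charge; escalated to tier 2..."),
)

# Structured fields support precise queries; the free-text column needs
# text analytics (search, NLP) to be useful at scale.
for row in conn.execute(
        "SELECT ticket_number, problem_type FROM support_ticket"):
    print(row)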
Structured Data
Structured data (Figure 1) refers to a set of data that is identifiable and organized in a structure. Spreadsheets and databases are typical structured
data, because they are organized in rows and columns. Structured data is usually used for creating, storing and retrieving data. It is accessible
through applications, and managed by technology that allows for query and reporting against predetermined data types and relationships.
Figure 1. Structured Data
(Google search – Structured Data)
The real benefit of structured data has been the separation of content from format. Since structured data is easily understood and queried by
search engines, the benefit has been extended to improved metadata and data management.
Structured data examples include databases, data warehouses, spreadsheets, emails, reports, metadata, and enterprise ERP or CRM systems.
Semi-Structured Data
Semi-structured data is intermediate between structured data and unstructured data. It represents data that does not conform to a strict
schema, due to frequent changes in the structure of the data. It is also called schema-less data or self-describing data.
In semi-structured data, similar entities are grouped together, but entities that belong to the same class may have different attributes. Not all
attributes are required, and the size and type of the same attribute may differ within a group. The order of attributes is not necessarily
important.
Semi-structured data (Figure 2) needs to be provided electronically from file systems or databases, or via data exchange formats such as EDI and
XML. It includes textual data files with discernable patterns, such as XML data files that are self-describing and defined by an XML schema.
(Google Search on SemiStructure Data)
As the amount of on-line data grows, we can find more and more semi-structured data.
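The following is a minimal sketch of what "entities of the same class with different attributes" looks like in practice, using a small JSON document (JSON, like XML, is self-describing). The customer records and field names are invented.

import json

customers_json = """
[
  {"id": 1, "name": "Ann", "email": "ann@example.com"},
  {"id": 2, "name": "Bo",  "phones": ["555-0100", "555-0101"], "vip": true},
  {"id": 3, "name": "Cai"}
]
"""

for record in json.loads(customers_json):
    # Attributes are optional and vary per entity, so access must be defensive.
    contact = record.get("email") or ", ".join(record.get("phones", [])) or "n/a"
    print(f'{record["id"]}: {record["name"]} ({contact})')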
“Quasi” Structured Data
The term “Quasi” comes from Latin, which means “as if” or “almost”. “Quasi” structured data are data not exactly structured. “Quasi” structured
data consists of textual data with erratic data formats, and can be formatted with special technology.
An example of quasi-structured data (Figure 3) is web clickstream data, which may contain some inconsistencies in data values and formats.
(Google Search on QuasiStructured Data)
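A minimal sketch of "formatting with effort and tools": clickstream lines in slightly inconsistent layouts are coerced into a common structure with regular expressions. The log lines and field layouts below are fabricated examples, not a real log format.

import re

lines = [
    "2012-08-01 10:15:02 GET /products/cereal?id=77 user=123",
    "user=456 | 2012-08-01T10:15:03 | /products/coffee",
    "10:15:04 2012/08/01 user:789 path=/checkout",
]

# One pattern per observed layout; each yields the same named fields.
patterns = [
    re.compile(r"(?P<date>\S+) (?P<time>\S+) GET (?P<path>\S+) user=(?P<user>\d+)"),
    re.compile(r"user=(?P<user>\d+) \| (?P<date>\S+)T(?P<time>\S+) \| (?P<path>\S+)"),
    re.compile(r"(?P<time>\S+) (?P<date>\S+) user:(?P<user>\d+) path=(?P<path>\S+)"),
]

for line in lines:
    for pattern in patterns:
        match = pattern.match(line)
        if match:
            print(match.groupdict())  # now a uniform record
            break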
Unstructured Data
Compared to structured data, unstructured data has no identifiable structure. Unstructured data consists of any data stored in an unstructured
format, without any conceptual definition or data type definition. A large share of it consists of textual objects, which are language based, such as
Microsoft Word documents, Excel spreadsheets or Outlook emails.
Unstructured data examples include images, video/audio files, the contents of a Word document, the body of email messages and the data in each
cell of a spreadsheet.
BIG DATA ANALYTICS
Businesses have tried all kinds of ways to deliver relevant information and promotional offers to targeted consumer marketing channels, such as
television, direct mailings and online advertisements. All these methods are aimed at consumer interest, because for businesses the #1 rule of
marketing is: know your customer. How do you know your customer? Do you know your customer's name, age, income, hobbies, tastes, interests
and buying behaviors? Do you know what your customers watch, read and hear? What are their likes and dislikes? Who is your potential
customer?
If companies know their customers deeply enough to answer all the above questions, and use that information for marketing purposes and business
insights, they can be wildly successful. On the other hand, companies that lack customer information often fail, but getting to know your
customers is not easy. Especially with big data exploding, organizations have had to become more analytical and data driven. How do you retrieve useful
data and analyze consumer behavior patterns from vast amounts of data? Traditional data analytics technologies cannot handle
the volume and complexity of big data. As a result, new big data technologies are being developed, including NoSQL databases, Hadoop and
MapReduce.
Business Intelligence vs. Data Science
Companies, organizations and manufacturers combine and compare data in an effort to optimize business operations, lower costs, improve
quality, increase productivity and create new products. Businesses are becoming more data driven, and analytics are playing an increasingly
important role. When it comes to business intelligence and data mining, both technologies are based on the Enterprise Data Warehouse.
The data warehouse has been widely used in industry. It integrates corporate data from different sources, and uses business intelligence to analyze
data and streamline day-to-day business operations, with insight into how to optimize them. Simple reporting, spreadsheets and
some sophisticated drill-down analysis have become commonplace uses of business intelligence, using a consistent set of metrics to measure
past performance and inform business planning.
Data are managed by Database Administrators (DBAs), so a data analyst must depend on the DBA for access and for changes to the data schema.
Tight security also means longer lead times for the data analyst to get data, and a more complex process for schema changes. Another implication
is that data warehouse security restricts data analysts from building data sets or modifying data properties, which can cause shadow
systems to emerge within organizations that contain critical data for creating analytic data sets.
"Business Intelligence (BI)" refers to the use of company data to understand business operations and facilitate decision making. In traditional
data warehousing, business data are gathered, stored and analyzed in the data warehouse to develop business reporting. Traditional business
intelligence combines the tools and systems that are used for enterprise data storage and analysis. It provides a single data source through which
a company's critical data can be stored and analyzed, and then allows users to execute queries and generate reports. Typical BI tools include
features such as data accessibility, decision support and end-user guidance. Microsoft Business Intelligence Studio and Oracle Business
Intelligence Studio are common BI tools used with relational databases.
Data science refers to the tools and methods that are used to analyze large amounts of data, to make sense of big data, and to find out what the
data can tell us about how to do business better. By querying millions or billions of rows, and analyzing terabytes or even zettabytes of information,
businesses are able to predict prices and design future products that are more people oriented. The use of data science varies from detecting
risks to catching criminals, and it has the potential to impact every industry. "The future belongs to the companies and the people that can use
data science to turn data into products." (Lohr, 2012)
In traditional business intelligence, business users determine what questions to ask, such as monthly sales reports, customer surveys and
product profitability analysis; IT departments capture structured data from the source systems and analyze the data to answer the users' questions. With the
new data science approach, IT delivers a platform to store, refine and analyze various data sources that enables creative discovery; business users
then explore what questions can be asked, such as: how can we provide preventative health care? How do we maximize asset utilization? What is the
best product strategy? Rich data allows users to tackle questions that are impossible to answer with traditional BI systems.
Traditional business intelligence tools allow users to retrieve insights from historical data. Data science gathers data, massages it into a
tractable form, then tells the story and predicts future trends.
Data science is not just about the data, or about finding out what the data might mean. It is about testing hypotheses, and making sure that
the conclusions drawn from the data are accurate. Nor is data science just about processing vast amounts of big data. It is more about
connecting information that seems isolated, analyzing the patterns behind the data mess, and creating models or patterns for business
productivity. It connects the seemingly isolated dots, and retrieves meaningful information quickly.
For example, an employee's daily activities could seem totally unrelated: he passes a security camera to enter the building, replies to email in the office,
calls his boss or customers, browses the internet, and maybe reserves a business trip online. These activities are totally disjointed if taken alone.
But a company such as Cataphora can process this information, determine patterns and predict everything from a person's mood to their
skill.
We publish posts, review others' blogs and tweet on Twitter. Have we thought of analyzing the blogs, posts or tweets to find out: What is the
best day of the week to publish your post? What time of day gets your content the most page views and tweets? Are shorter posts better
than longer ones? These challenging questions are being asked by data scientists. (Feinleib, 2012) Obviously, it is very important for
marketers and blog owners to know who visits their website on a regular basis, and what the best time is to publish a post online. The answers can
help these online authors make strategic plans to figure out how to move their business forward. But it seems impossible to answer these
questions, with hundreds and thousands of public articles online. A big data analytics company named Datameer took up this challenge. Datameer
collected 30 days of big data articles on Forbes.com, from July 9, 2012 to August 8, 2012. By gathering the publication date, time, page views,
tweets, headline and full text of all these articles, the company came up with some compelling insights (a toy version of this kind of analysis follows the list):
• Wednesdays and Thursdays are the most popular days to publish, but content published on Mondays and Saturdays gets more views and
tweets;
• The most popular words for headlines don't lead to the most popular posts;
• Just after lunch on the East Coast is the best time to publish a post.
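As promised above, here is a toy version of this kind of analysis, assuming the article metadata has already been collected into rows of publish day, hour, page views and tweets. The numbers are fabricated and the result means nothing; only the grouping logic illustrates the approach.

import pandas as pd

articles = pd.DataFrame([
    {"weekday": "Mon", "hour": 13, "page_views": 5200, "tweets": 140},
    {"weekday": "Wed", "hour": 9,  "page_views": 1800, "tweets": 35},
    {"weekday": "Sat", "hour": 13, "page_views": 4700, "tweets": 120},
    {"weekday": "Thu", "hour": 16, "page_views": 2100, "tweets": 40},
])

# Which publication days and hours attract the most attention on average?
print(articles.groupby("weekday")[["page_views", "tweets"]].mean())
print(articles.groupby("hour")["page_views"].mean().sort_values(ascending=False))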
Interesting? If a company thrives on the number of its daily website visitors, this information is extremely critical for its survival. It could help
it manage its data in a way that leads to an increased number of hits. Data science becomes a vital tool in defining business strategies.
Google is a pioneer in data science. Google's PageRank algorithm was among the first to use data outside of the page itself. Tracking links
has made Google searches much more useful, and PageRank is a key ingredient of Google's success. Google search can detect spelling errors in
a search and suggest corrections for the misspelled search. Google has made huge strides by using the voice data it has collected, integrating
voice search into the search engine and providing speech recognition. (Loukides, 2010)
Facebook uses patterns of the friendship relation to suggest other people you might know; LinkedIn finds jobs or groups that a user might like to
know about; biotech companies can analyze gene sequences in billions of combinations to design and test new drugs; Amazon saves your searches,
correlates them with what other users search for, and uses the result to provide appropriate recommendations. All these companies track the consumer's data
trail, analyze it, mine it and create "data product" suggestions and recommendations to take advantage of big data. It is very common to be able
to collect vast amounts of data in today's information exploration era; the question is how to use data effectively, not just data owned by private
companies, but all data that is available and relevant. The core value of data science is to take the data collected from users, and provide added
value. Data is only useful when something can be done with it.
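As a minimal sketch of such a "data product", the following recommends items that co-occur in other users' baskets with what a customer has already bought. The purchase histories are invented, and real recommendation engines are far more sophisticated; only the idea of turning collected user data into added value is illustrated.

from collections import Counter
from itertools import combinations

purchases = {
    "u1": {"coffee", "filters", "mug"},
    "u2": {"coffee", "filters"},
    "u3": {"coffee", "mug", "grinder"},
}

# Count how often each ordered pair of items appears in the same basket.
co_occurrence = Counter()
for basket in purchases.values():
    for a, b in combinations(sorted(basket), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1


def recommend(item, top_n=2):
    scores = {b: n for (a, b), n in co_occurrence.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("coffee"))  # e.g. ['filters', 'mug'] for this toy data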
Data Analytical Architecture
Let's first review the typical data analytical architecture for structured data. Figure 4 (source: EMC Big Data Overview) shows a typical data
warehouse operation. Data from various sources goes through significant pre-processing and checkpoints, then enters a secured, controlled data
warehouse environment for data exploration and analytics.
• Data Extraction and Loading: Data from transactional or other operational databases is transferred into a data warehouse using integration
techniques. The data needs to be well understood, structured, and normalized with appropriate data type definitions. The ETL process
extracts data from all data sources, brings it into the data warehouse, then transforms the data into the database structures and internal
formats of the data warehouse's staging area.
• Data Staging: Data is cleansed in the staging area. Duplicates are removed to ensure data integrity and data quality. Another process
then loads the cleansed data into the data warehouse database.
• Data Aggregation: Data is pre-calculated with totals, averages, medians, groupings, etc., for statistical analysis. Data aggregation is a
key part of the data warehouse, and provides a cost-effective means of improving query performance.
• Data Reporting: Analysts get data provisioned for downstream analytics, which includes dashboards, reporting, BI applications,
summary and statistical queries, and visualization tools for high-density data. The data warehouse provides complete views of customers for
successful sales and services. For example, a BI dashboard or report provides the business manager with sales and inventory
information. (A minimal end-to-end sketch of this extract/stage/aggregate/report flow follows this list.)
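The sketch referenced in the last bullet is below: extract rows from a source, stage and cleanse them, load them into a warehouse table, and aggregate them for reporting. SQLite stands in for the warehouse, and the source rows and table names are hypothetical.

import sqlite3

source_rows = [  # extract: rows pulled from an operational system
    {"order_id": 1, "region": "EU", "amount": "100.0"},
    {"order_id": 2, "region": "eu", "amount": "250.5"},
    {"order_id": 2, "region": "eu", "amount": "250.5"},  # duplicate
    {"order_id": 3, "region": "US", "amount": "80.0"},
]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE staging (order_id INTEGER, region TEXT, amount REAL)")
warehouse.execute("CREATE TABLE sales_fact (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Transform and load into staging: normalize types and formats.
for row in source_rows:
    warehouse.execute("INSERT INTO staging VALUES (?, ?, ?)",
                      (row["order_id"], row["region"].upper(), float(row["amount"])))

# Cleanse: remove duplicates while loading from staging into the warehouse table.
warehouse.execute("""
    INSERT OR IGNORE INTO sales_fact
    SELECT DISTINCT order_id, region, amount FROM staging
""")

# Aggregate for reporting: pre-computed totals by region.
for region, total in warehouse.execute(
        "SELECT region, SUM(amount) FROM sales_fact GROUP BY region"):
    print(region, total)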
Growing volumes of unstructured data lead to new challenges for data analytics. New capabilities to access and analyze all available data,
enabling faster insight through self-service and collaboration, are in demand. It is challenging for traditional data warehouse and business
intelligence technology to meet the requirements of volume, velocity and variety. Advanced analytical techniques, especially distributed parallel
processing technologies, have been developed to provide real-time and scaled processing capabilities for big data (See Figure 5).
Big data techniques complement business intelligence tools to unlock the value of business information. They help businesses anticipate
consumer behavior, form advertising and marketing strategies, increase sales, improve market pricing, optimize business processes for efficiency,
grow profitably and derive more business value. They also help businesses reduce risk and customer churn rates, avoid fraud
and predict new business opportunities.
Big Data Analytics
Big data analytics is the process of analyzing various types of data in huge volumes to uncover hidden patterns, unknown correlations and other
useful information, in order to provide better business prediction and decision making. Big data can unlock hidden value by analyzing vast volumes of
information, allowing narrower segmentation of customers and much more precisely customized products and services, the development of innovative next-
generation products, minimized risks and improved decision making. In a word, big data analytics is a solution for getting the most advantage out
of big data. Big data analytic tools are developed for faster and more scalable processing, in order to extract value from the vast amounts of
unstructured data produced daily. Predictive analytics and data mining are the two main categories of big data analytics.
Big data analytics can be used in many areas, including: the analysis of customer segmentation and behaviors, optimized marketing campaigns,
identification of data-driven products, defining marketing strategies and risk management, the analysis of social networks and relationships,
detection and prevention of fraud, and attrition prediction. Big data analytic tools have the ability to converge data from multiple data sources, with
both structured and unstructured data types; databases are scaled out and distributed, with fast and scalable processing capabilities.
Big data analytics has quickly spread from retail and finance to every business field. It provides a way to understand the market, analyze
consumer interests and spending habits, mine product profitability and predict future sales and investments. Data-driven analysis can help to
make better decisions in just about anything.
A Forbes article, "Is Big Data Right for Your Business? President Obama Thinks So!" (Sabhlok, 2012), describes how President Obama's re-
election victory was propelled by a talented team of data scientists. Huge amounts of data were mined and analyzed to raise over one
billion dollars in campaign funds, an amount far exceeding any other election in history. Sophisticated analyses were used to perfect
fundraising e-mail campaigns and messaging that would yield the best results. Polling data was gathered in real time, to understand where
the campaign was losing ground and to allocate resources surgically. Television ads were data driven as well, enabling the campaign to
specifically target persuadable voters and run ads during programming that appealed to a particular demographic profile. History will look at the
2012 election as the "Big Data" election. (Sabhlok, 2012)
Big data analytics also played an important role in the recovery efforts after Hurricane Sandy on the East Coast. Direct Relief, an emergency
response organization, uses big data analytics tools to perform analysis and pattern detection. They pull data to see how the storm affected
different districts, and to determine which parts of the region might need equipment, medication, health resources, food and shelter. Big data analytics
can also forecast events, assess their impacts and prevent further damage.
In this large, data-driven business revolution, businesses can succeed if they can create business value from big data.
How can advanced analytical techniques applied to big data produce more effective analysis and open up new business opportunities? Let's look at the
Target story. (Hill, 2012)
Target is a big retail store chain, and for decades it has been collecting large amounts of customer data. Target's IT system gives each customer
profile an ID number, and attaches tons of relevant information to it: the credit card number, when the customer uses coupons, fills out
survey questionnaires, mails returns, calls customer service, places orders on the Target website, and so on. With this ID, the customer's demographic
information, such as age, marital status, whether they have children, home address, salary, and distance from home to Target stores, is also
recorded. In addition, Target purchases other information from data collectors, such as the customer's race, employment history, credit score,
credit history, purchase record, reading habits and so on. When worked over by a data scientist, the wonders hidden within these data can be readily
uncovered.
A study of customers' everyday purchases, like soap, toothpaste, trash bags and toilet paper, found that customers pay almost no
attention to product promotions or coupons for these items. Their purchasing habits are hard to change, regardless of any marketing campaign. However,
when they are going through a major life event, like getting a new job or moving to a new place, their shopping habits become more flexible. This is the
best time for retail merchants to predict and explore potential business opportunities. A precisely timed advertisement can
change someone's shopping pattern for years. For example: newly-weds are more likely to start buying a new type of coffee; when a couple move
into a new house, they are likely to start trying a different kind of cereal; when divorced, people start trying different brands of beer. The arrival of
a baby is one of the most important life events, and at that time new parents' shopping habits are very flexible. If businesses can identify pregnant
shoppers, they can earn millions of dollars' worth of new business opportunities.
Target has therefore tried to attract parents-to-be by figuring out whether an individual is having a baby, and when. Target's
marketing group asked Andrew Pole, a senior manager on the Guest Data Analytical Services team, to help. At Target's request,
he set out to build a model to identify pregnant women before they are in their second trimester. Target does not want to wait until the baby is born
to find out about the pregnancy, because as soon as the birth record comes out, the new mother is going to be flooded with product
promotions. If Target can identify the women who are pregnant in advance, the marketing department can send them tailored advertising earlier
on, and lock in valuable sales opportunities ahead of the competition.
However, pregnancy is very private and personal information, so how can Target determine which customers are pregnant? Andrew Pole started with
data modeling, based on Target's customer data from baby shower registrations. He discovered a lot of useful information with the data model. For
instance, the model suggests that many pregnant women start buying large packages of unscented hand cream, and large amounts of Centrum
supplements such as calcium, magnesium and zinc, in the first twenty weeks of pregnancy. Andrew Pole then selected about twenty-five typical
products and, from their consumer data, built a pregnancy prediction index. With this index, Target can forecast customer pregnancy within a small
error range, and then send out coupons timed to very specific stages of the customer's pregnancy.
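The chapter does not describe Pole's actual model, so the following is only a minimal sketch of what a purchase-based prediction index can look like: a weighted score over a basket of indicator products, with a threshold that flags likely members of the target segment. The products, weights and threshold are invented for illustration and are not Target's model.

# Hypothetical indicator products and weights.
INDICATOR_WEIGHTS = {
    "unscented_hand_cream_large": 0.30,
    "calcium_supplement":         0.25,
    "magnesium_supplement":       0.20,
    "zinc_supplement":            0.15,
    "unscented_soap":             0.10,
}
THRESHOLD = 0.5  # arbitrary cutoff for flagging a customer


def prediction_index(purchased_items):
    """Sum the weights of indicator products present in a customer's recent purchases."""
    return sum(w for item, w in INDICATOR_WEIGHTS.items() if item in purchased_items)


customer_baskets = {
    "guest_001": {"unscented_hand_cream_large", "calcium_supplement", "zinc_supplement"},
    "guest_002": {"unscented_soap"},
}

for guest, basket in customer_baskets.items():
    score = prediction_index(basket)
    print(guest, round(score, 2), "flag" if score >= THRESHOLD else "no flag")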
Based on Andrew Pole's big data model, Target developed a completely new marketing campaign, which resulted in explosive growth in the
sales of pregnancy supplies. The big data analysis technique has also expanded from targeting pregnant women to all other customer groups, by
age, income, gender, education level and so on. Target's revenue grew from $44 billion in 2002 to $67 billion in 2010, as it continued to
create and focus on items and categories that appeal to specific customer segments.
With the pregnancy prediction index, Target figured out a teenage girl's pregnancy before even her father had the slightest knowledge of it. One
day, an angry father came into a Target store in Minneapolis to complain about his high-school daughter receiving coupons for baby clothes,
cribs and maternity items from Target. In a fury, he wanted to know why the company was sending his teenage daughter such promotions. The store
manager apologized after verifying that the student was indeed receiving baby-related offers. A few days later, however, the father called in to
apologize to Target. On the phone, the father was somewhat abashed: "I had a talk with my daughter," he said. "It turns out there's been some
activities in my house I haven't been completely aware of. She's due in August. I owe you an apology."
It can be really shocking for some when businesses find out about private and personal information, such as a pregnancy, in advance. Big data
analytics makes it possible, through the collection of a vast volume of information that is sorted and analyzed for patterns and trends. Even with
something as personal as pregnancy, Target is able to identify and predict the trend, and can therefore benefit from highly tailored marketing
campaigns.
Big data is the driving force behind a commercial revolution. Massive consumer data has already gone beyond the scope of traditional
data storage and database management, and from it one can dig up tremendous commercial value. Businesses need to face the
big data challenge: they must decide whether they want to rise up in this revolution, or get buried and perish in it.
THE BIG DATA ECOSYSTEM
Big data is becoming mainstream. With the explosive growth in the types of data sources: applications, digital media, mobiles, sensors, emails,
blogs and videos, data can be complex and come in varied formats. Massive data storage, scalable infrastructure, analytical applications, robust
visualization tools and increasing bandwidth availability are all needed for the distributed and parallel processing of big data. Enterprises have
realized that the data they collect is very valuable, and they need a marketplace for the tools, skills and services to take advantage of big data. A new
big data ecosystem is being created. It comprises technology vendors, resellers and service providers, and enables the development and adoption
of business applications built on big data technology.
Companies use technologies such as Hadoop to conduct natural language processing and analysis of unstructured data from many different data
sources.
Some pioneers in the industry have already been using big data tools in their businesses (Rainmakers, 2012); a minimal Hadoop Streaming sketch in Python follows this list:
• Facebook uses Hadoop to store copies of internal logs and dimensional data sources, and as a source for reporting/analytics and machine
learning. There are two clusters, a 1100-machine cluster with 8800 cores and about 12 PB in raw storage, and a 300-machine cluster with
2400 cores and about 3 PB in raw storage;
• Yahoo deploys more than 100,000 CPUs in > 40,000 computers running Hadoop. The biggest cluster has 4500 nodes (2*4cpu boxes with
4*1TB disk & 16GB RAM). This is used to support research for Ad Systems and Web Search, and do scaling tests to support the development
of Hadoop on larger clusters;
• EBay uses a 532 node cluster (8*532 cores, 5.3PB), Java MapReduce, Pig, Hive and HBase;
• Twitter uses Hadoop to store and process tweets, log files and other data generated across Twitter. They use both Java and Scala to access
MapReduce APIs, as well as Pig, Avro, Hive and Cassandra;
• LinkedIn uses daily batch processing with Hadoop. For example, they pre-calculate data for the “People you may know” product by scoring
120 billion relationships per day, in a MapReduce pipeline of 82 Hadoop jobs that require 16TB of intermediate data. LinkedIn also builds an
index structure in the Hadoop pipeline, which creates a multi-terabyte lookup structure that uses perfect hashing. This process trades off
cluster computing resources for faster server responses; it takes LinkedIn about 90 minutes to build a 900GB data store on a 45-node
development cluster. (Bodkin, 2010)
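The sketch referenced before the list is below. The deployments above mostly use Java, Pig, Hive or Scala, so this is only a generic illustration of the pattern with Hadoop Streaming, which pipes raw input lines into a mapper over stdin, shuffles and sorts the emitted key/value pairs, and pipes them into a reducer. The job invocation, paths and jar name in the comment are hypothetical.

# An invocation might look like:
#   hadoop jar hadoop-streaming.jar -input /weblogs -output /status-counts \
#       -mapper "python streaming_count.py map" -reducer "python streaming_count.py reduce"
import sys
from itertools import groupby


def run_mapper():
    # Emit (HTTP status code, 1) for each web-log line; the field layout is assumed.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) >= 9:
            print(f"{fields[8]}\t1")


def run_reducer():
    # Reducer input arrives sorted by key, so consecutive grouping is enough.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for status, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{status}\t{sum(int(count) for _, count in group)}")


if __name__ == "__main__":
    run_reducer() if len(sys.argv) > 1 and sys.argv[1] == "reduce" else run_mapper()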
The Data Ecosystem Overview
With huge volumes of data, how are you going to treat your data? With the help of data analytic tools, some companies see Big Data as
trunks of petabytes from which meaningful business insights can be generated. But the big data ecosystem is more than just that; it is comprised of many
more complex layers. To dig out more value, the huge amount of raw data needs to be processed and refined by trained and skilled IT
professionals. Raw data can come from videos, images, texts, websites, PDF files, emails or cameras. Before generating all these data,
some important questions need to be answered: what do you want to do with these data? What value are you going to retrieve from them?
Enterprises demand tools, skills and services to take advantage of big data.
The foundation of the big data ecosystem is data, which includes ERP data and transactional data (See Figure 6).
ERP stands for Enterprise Resource Planning. According to Wikipedia, "Enterprise resource planning (ERP) systems integrate internal and
external management information across an entire organization. ERP systems automate this activity with an integrated software application.
Their purpose is to facilitate the flow of information between all business functions, inside the boundaries of the organization and manage the
connections to outside stakeholders." An ERP system is a shared database, which supports multiple functions used by different business units.
Typical ERP system vendors include Oracle and SAP, which share 50% of the market. ERP data are mostly operations management
and planning data, covering the areas of finance, accounting, HR, CRM, inventory management, supply chain, etc.
Transactional data is point-of-sale data and data from similar sources. It comes from OLTP (online transaction processing), which is
characterized by a large number of on-line transactions, fast query processing, and effectiveness measured in transactions per second.
In recent decades, transaction data has gone big: applications that once held megabytes of data have expanded to petabytes and zettabytes.
The increasing volume of transaction data is degrading enterprise application performance, and demands scalable storage hardware and
network resources.
Data from ERP or OLTP systems is transferred through ETL (Extraction, Transformation and Loading) into departmental warehouses or local data
marts before finally being loaded into the enterprise data warehouse. In order for the source data to be loaded correctly, it needs to be well
understood, structured and normalized with appropriate data types. Data passes through a staging area, where it is pre-processed and massaged
before the loading process starts. Some factors need to be taken into consideration: What is the data source quality and data integrity? How long
does it take the database to handle the ETL batch processing? What is the right tool for the ETL: Microsoft SQL Server Integration Services,
Informatica, or third-party software?
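As a rough illustration of this extract, transform and load flow (not tied to any specific ETL product), the sketch below uses Python’s built-in sqlite3 module as a stand-in for both the OLTP source and the departmental data mart; the table names, columns and values are hypothetical.

import sqlite3
from datetime import datetime

src = sqlite3.connect(":memory:")   # stand-in for the OLTP source system
mart = sqlite3.connect(":memory:")  # stand-in for the departmental data mart

# Extract: pull raw point-of-sale rows from the source system.
src.execute("CREATE TABLE pos_sales (sale_id, sold_at, amount_text)")
src.executemany("INSERT INTO pos_sales VALUES (?, ?, ?)",
                [(1, "2012-06-01 10:15", " 19.99"), (2, "2012-06-01 11:02", "5.50 ")])
rows = src.execute("SELECT sale_id, sold_at, amount_text FROM pos_sales").fetchall()

# Transform (staging): enforce data types and normalize values before loading.
staged = [(int(sale_id),
           datetime.strptime(sold_at, "%Y-%m-%d %H:%M").date().isoformat(),
           round(float(amount.strip()), 2))
          for sale_id, sold_at, amount in rows]

# Load: append the cleansed rows into the data mart's fact table.
mart.execute("CREATE TABLE fact_sales (sale_id INTEGER, sale_date TEXT, amount REAL)")
mart.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", staged)
print(mart.execute("SELECT sale_date, SUM(amount) FROM fact_sales GROUP BY sale_date").fetchall())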
Departmental data warehouses and local data marts have fewer security and structure constraints, and give business users more flexibility for
departmental business analysis. The next step in the data workflow is the Enterprise Data Warehouse (EDW), the central storage container
for the company’s most critical data. Online analytical processing (OLAP) technologies are used in the EDW, where data are consolidated into
multi-dimensional views of various kinds of business activity; OLAP enables users to analyze multidimensional data from multiple perspectives.
OLAP data is typically stored in a star schema or snowflake schema, whose main components are dimension tables and fact tables. Dimension
tables describe the hierarchical business entities of an enterprise, such as time, region, department, customer and product, and can be used as
reference (lookup) tables. Fact tables describe the transaction details and usually store the measures. An OLAP cube provides a means to
understand business performance: it aggregates the facts at each level of each dimension and contains all the data in an aggregated form. OLAP
calculation engines turn massive volumes of raw data into actionable business information, summarized, derived and projected from the source
data in the data warehouse. The analytic component needs to support sophisticated, autonomous end-user queries, quick calculations of key
business metrics, and planning and forecasting over large data volumes, including historical data. A data warehouse’s core activities are mostly
reads, so the analytic solution must support a large number of concurrent users hitting the EDW simultaneously to retrieve data or calculate
information for business reports. Analysts create data extracts from the EDW to analyze offline in analytical tools. Analysis tools include
spreadsheets (Microsoft Excel), query tools (MS Access and SQL), web browsers, statistical packages, visualization tools and report writers;
they provide a way to select, view, analyze, manipulate and navigate the data.
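A small, self-contained sketch of a star schema and the kind of roll-up an OLAP cube pre-computes, using Python’s sqlite3 module; the dimension tables, fact table and figures are hypothetical.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT, name TEXT);
CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, units INTEGER, revenue REAL);
INSERT INTO dim_date    VALUES (1, 2012, 'Q1'), (2, 2012, 'Q2');
INSERT INTO dim_product VALUES (10, 'Books', 'Novel'), (11, 'Books', 'Textbook');
INSERT INTO fact_sales  VALUES (1, 10, 3, 45.0), (1, 11, 1, 80.0), (2, 10, 5, 75.0);
""")

# Roll the facts up along the dimension hierarchies (year x quarter x category),
# which is the kind of aggregate an OLAP cube stores at each level.
for row in db.execute("""
    SELECT d.year, d.quarter, p.category, SUM(f.units), SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, d.quarter, p.category
"""):
    print(row)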
Enterprise data warehouses (EDW) are critical for reporting and business intelligence. They provide performance reporting, sales forecasting,
product inventory, customer relationship and marketing analysis. Historical data is analyzed and transformed into projections. As
unstructured data arrives, with social media, blobs, clickstream and call center data flowing into the EDW, massive volumes of information can
quickly grow beyond the performance capacity of a traditional data warehouse. Businesses then face the challenge of adopting new analytics
and architectures to manage big data and turn it into business value.
The Big Data Ecosystem
The typical data architecture is designed for storing mission-critical data, supporting enterprise applications and enabling business intelligence
reporting. Big data sources, however, generate enormous amounts of data that require advanced analytics technology. The big data ecosystem
(Figure 7) is a new approach to these analytics.
Big data ecosystem consists of data devices, data collectors, data aggregators and data users. Data devices collect data from multiple locations.
For each gigabyte of new data, a petabyte of data is created about that data.
For example, someone plays video games through a TV or PC at home. Data is captured about the levels and skills attained by the player.
Additional data are logged, such as the dates and times when the player plays certain games, the skill level reached, and the new game levels
unlocked based on the player’s proficiency. Users can also purchase new games or additional features to enhance the games, which are offered
via recommendation engines. All this information is stored on the local consoles and is also shared back with the manufacturers, who analyze
gaming habits, identify opportunities for up-sell and cross-sell, and build the user’s profile.
Smart phones are another rich source of data. A smart phone stores and transmits basic phone usage and text messages, as well as data about the
user’s internet usage, SMS usage and real-time location. Many grocery stores issue store loyalty cards to their customers, who scan the card
when shopping to get promotional prices or collect points for events. The loyalty cards record not only the amount the customer spends, but also
the store locations the customer frequently visits and the types of products the customer tends to buy. Data analysis tools can then reveal the
customer’s shopping habits and the likelihood that the customer can be targeted with certain types of retail promotions.
Data collectors include organizations that collect data from devices and users. Data are captured by various companies through web surfing logs,
the HTML text of web browsing history, web registration, questionnaires, interviews, etc. This can range from the phone company tracking
internet usage, the cable TV provider tracking the shows being watched and the prices one is willing to pay for premium TV content, and retail
stores tracking the path a customer takes for daily grocery shopping, to the IRS analyzing bank-reported information to check tax returns, and
the insurance company capturing a customer’s health records and diet habits to decide on a health and life insurance premium.
Facebook is the biggest social network collecting people’s personal information. Users fill out profiles with their name, age, gender and email
address, or even more detailed information such as address, phone number and relationship status. The newer profile pages even invite people to
add historical information, such as the places they have lived and worked. What can Facebook do with these data? Besides data mining for its own
business, Facebook sells the data to marketers who are looking for consumer information and trends. Marketers can then use the data to build
customized ad campaigns that target consumers’ interests. User data collected by phone companies can be used to analyze traffic patterns, by
measuring the density of smart phones in certain locations, and can also track the speed of cars and the relative traffic congestion on busy
roads. Collecting a customer’s driving behavior through a smartphone can help that customer obtain a better insurance premium for being a safe
driver. There are hundreds of companies specializing in collecting marketing data and creating consumer profiles across the country.
Data aggregators are organizations that analyze the data collected from various sources. Acxiom and ChoicePoint are two big companies that
compile information from the data collected on individuals and sell that information to others. They source data from public records, compile
data from device usage patterns or online activities, transform the data into information, package it into aggregated reports, and then sell the
data to brokers, who in turn market consumer lists as targets for specific ad campaigns or cross-selling opportunities. They also compile
personal data packages, containing a customer’s professional licenses, educational background, criminal history and length of residence,
and sell these to employers for employee background checks, or to insurers for use in insurance coverage and pricing, and so on.
Data users or data buyers are the direct beneficiaries of the information collected and aggregated in the data value chain. Big companies
purchase data that includes customers’ names, phone numbers, addresses, buying habits, merchants of interest, etc. According to the McKinsey
Global Institute, data is now a $300-billion-a-year industry and employs 3 million people in the United States alone. (Morris & Lavandera, 2012)
For instance, a retail bank may want to know which customers have the highest likelihood of applying for a new credit card or a home equity loan.
It can purchase data from a data aggregator showing the demographics of people living in specific locations and those who seem to have been
searching the web for related information. Drawing on these various sources, the retail bank can run more specifically targeted marketing
campaigns, which would not have been possible 10 years ago due to the lack of information. Big companies try to learn about their potential
customers and what they might be interested in buying, so that they can deliver more relevant offers, products and services to them.
Detailed marketing information is definitely critical to the success of businesses.
Data Scientist in the Big Data Ecosystem
In the data-driven era, the efficient operation of organizations relies on the effective use of big data. Enterprises are putting a lot of money and
effort into making sure they have the infrastructure for managing big data and the analytical tools to benefit from it. But finding people who
have big data skills and understand data science is even more important in solving the big data puzzle.
A May 2011 McKinsey report puts some numbers on the demand: “Now, the US government has earmarked $200 million to support
research in big data. But the advancement and use of big data technology can be inhibited by a lack of deep analytical talent. By 2018, the United
States alone can face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with
sufficient knowledge, to use the analysis of big data to make effective decisions.” (Manyika, 2011)
Data scientists are in big demand with the emergence of big data. A data scientist is a professional who has a deep technical background, can
expertly manipulate data, and has the analytical skills to work with complex data at massive scale. These professionals may have advanced
training in quantitative areas such as statistics, mathematics or economics. They need a combination of skills: the ability to handle raw and
unstructured data, strong technical knowledge, and a strong analytical background. In a word, the ability to straddle both the business and
technical sides of an organization is a must for a data scientist. But what makes a data scientist unique is the ability to use these technical
skillsets to solve actual real-world problems.
A data scientist should have a combination of complex skills: advanced quantitative knowledge, business skills, technical experience and problem-
solving capacity. He should be analytical, curious and skeptical, and able to tell a story with the data to overcome business challenges. A data
scientist combines the following qualities:
• A variety of academic backgrounds provide a good foundation for a data scientist: graduate or PhD students from computer science,
statistics, applied mathematics, physics or economics who have advanced proficiency in mathematics;
• Technical expertise is required: software engineers or programmers/developers who have computer programming skills;
• He needs to be passionate about data, always seeking creative ways to solve data problems and mine information from data. Instead of
asking “what happened,” he is more interested in “what is going to happen?” and “what can we do about it?”
• He needs to take a skeptical view of his work and always examine it critically. He needs to be willing to question old assumptions,
reanalyze business problems and come up with solutions;
• He needs to have an understanding and experience of business, knowing the value of analytics and how analytics fits into the
organization.
A data scientist needs business knowledge that ensures the correct positioning of projects, so that actionable insights can be derived from data,
and he knows how to make an impact on the business.
A data scientist needs to be passionate about manipulating and analyzing any data, even incomplete, disorganized and huge volumes of
unstructured data. He needs experience in a variety of domains and in working with different issues, so that he can brainstorm a suitable
machine learning algorithm or a new statistical model for a particular problem.
A data scientist needs to know programming and scripting languages and platforms, such as R, Python, Hadoop, HBase, Cassandra or SAS. He
needs to be able to discuss the differences between graph databases, document stores and key-value stores, where BigTable implementations are
becoming more and more important.
Besides the essential skillsets to do a good job, a data scientist needs to be creative and innovative. Data scientists must be able to innovate in
how data is collected, analyzed and thought about, so that the data can be put to the best advantage. A data scientist can create the tools used to
interpret and translate streams of data into innovative new products.
At Facebook, for instance, social analytics are becoming more important. When a data scientist looks at Facebook’s social behavior data, such
as click-through rates, social media comments, family photos, conversations and life achievements, he needs to identify differences in individual
attributes and interpersonal interactions, mine the insights, and create innovative, interactive features to encourage users to get interested
and stay with Facebook. One example is helping Facebook provide products such as identifying users that one may know but has not yet friended.
Through the data, he needs to reveal patterns of information and clusters of affinity, patterns in communication balance and network change
over time, and help Facebook best utilize network resources. Data scientists must be able to demonstrate actionable innovation and creativity.
Facebook’s Data Science Team, led by Cameron Marlow, has 12 researchers. They use math, statistics, computer programming skills and social
science to mine the data for insights that help Facebook’s business. One of their innovations is the “Like” button, which Facebook offers both on
the Facebook site and on external websites. On the Facebook site, one can click it after looking at a friend’s photo, comments or any other
content. Beyond Facebook’s own site, the Like button plays an even more important role: when a user visits a website with a Like button on it,
they can click it to let their friends know that they like the site. After clicking the button, a pop-up asks them to log in to Facebook; once they
have logged in, the button shows which of their friends like that page, along with the friends’ profile pictures. After clicking the Like button, a
story can be posted automatically to their Facebook page telling their friends that they like the site. For instance, if someone is looking at a
webpage about their favorite songs and clicks the Like button, the page they visited is posted to their profile, their listening activity can be
tracked across the internet, and their friends are invited to hit the Like button as well. Within the first five months after this feature’s launch,
Facebook catalogued more than five billion instances of people listening to songs online.
Google has its own data science teams as well. By tracking consumer clicks and capturing information from Gmail and Google Plus, Google has
assembled huge volumes of data and analyzed web traffic to predict patterns, an approach increasingly applied to other fields. This is a product-
driven role; the ideal candidate needs to be an expert with a full list of technical skills, and have strong communication skills. To qualify, you need:
• Experience working with large data sets using statistical software such as R, S-Plus and MATLAB;
• To be an excellent communicator who can collaborate with a multi-disciplinary team of engineers and analysts;
• The ability to draw conclusions from data using appropriate techniques and to recommend actions;
• Preferably, proficiency at processing data sets with programming languages used for machine learning, such as Python and JavaScript.
The right candidate for this job is able to find new ways to make money from online search data. The goal for the data scientist is to find
patterns in the data which can then be used by the product team. It is data scientists who have built the Google self-driving car, designed
Google Glass, and developed algorithms to automatically translate Google’s search site from English to Japanese.
Data science is a defining force shaping the evolution of every industry. Organizations are using data science to shift the balance of power,
and exploring how to optimize people and strategies to realize its full potential. The data scientist’s job has been called the “sexiest
job of the 21st century.”
This work was previously published in Enabling the New Era of Cloud Computing edited by Yushi Shen, Yale Li, Ling Wu, Shaofeng Liu, and
Qian Wen, pages 156-184, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Baunach, S. (2012). Three vs of big data: Volume, velocity, variety. Data Center Knowledge. Retrieved from
http://www.datacenterknowledge.com/archives/2012/03/08/three-vs-of-big-data-volume-velocity-variety/
Biddick, M. (2012). Feds face ‘big data’ storage challenge. Informationweek. Retrieved from
http://www.informationweek.com/government/information-management/feds-face-big-data-storage-challenge/240000958
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., & Burrows, M. … Gruber, R. E. (2006). Bigtable: A distributed storage system
for structured data. OSDI. Retrieved from
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
Clegg, D. (2012). Big data: The data variety discussion. IBM The Big Data Hub. Retrieved from http://www.ibmbigdatahub.com/blog/big-data-
data-variety-discussion
Davenport, T., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review. Retrieved from
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1
Davis, A. (2012). The government and big data: Use, problems and potential. Computerworld. Retrieved from
http://blogs.computerworld.com/19919/the_government_and_big_data_use_problems_and_potential
Evans, B. (2012). Big data set to explode as 40 billion new devices connect to Internet. Forbes. Retrieved from
http://www.forbes.com/sites/oracle/2012/11/06/big-data-set-to-explode-as-40-billion-new-devices-connect-to-internet/
Feinleib, D. (2012). The big data science behind today’s most popular content. Forbes. Retrieved from
http://www.forbes.com/sites/davefeinleib/2012/09/07/the-big-data-science-behind-todays-most-popular-content/
Flaster, M., Hillyer, B., & Ho, T. K. (n.d.). Exploratory analysis system for semistructured engineering logs. Retrieved from http://ect.bell-
labs.com/who/tkh/publications/papers/xlog.pdf
Gottlieb, I. (1986). Structured data flow: A quasi-synchronous Interpretation of data driven computations . New York: City University of New
York.
Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., & Venkatrao, M. (1996). Data cube: A relational aggregation operator
generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery , 1, 29–53. doi:10.1023/A:1009726021843
Hardy, Q. (2012a). Google ventures’ big data bet. The New York Times. Retrieved from http://bits.blogs.nytimes.com/2012/04/11/google-
ventures-big-data-bet/
Hardy, Q. (2012b). How big data gets real. The New York Times. Retrieved from http://bits.blogs.nytimes.com/2012/06/04/how-big-data-gets-
real/
Horowitz, B. T. (2012). Big data analytics, HIE could aid hurricane sandy recovery efforts. eWeek. Retrieved from
http://www.eweek.com/enterprise-apps/big-data-analytics-hie-could-aid-hurricane-sandy-recovery-efforts/
Inmon, W. H. (2005). Building the data warehouse . New Delhi: Wiley India Pvt.
Johnson, M. E. (2012). Hurricane Sandy: Big data predicted big power outages. InformationWeek. Retrieved from
http://www.informationweek.com/big-data/commentary/big-data-analytics/bog-data-helped-to-track-hurricane-sandy/240115312
Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., & Becker, B. (2008). The data warehouse lifecycle toolkit . Hoboken, NJ: Wiley Publishers.
Lewis, N. (2012). Pittsburgh healthcare systems invests $100M in big data. Information Week. Retrieved from
http://www.informationweek.com/healthcare/clinical-systems/pittsburgh-healthcare-system-invests-100/240008989
Lohr, S. (2012). The age of big data: Big data’s impact in the world. The New York Times. Retrieved from
http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation,
competition, and productivity. McKinsey & Company. Retrieved from
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
McDonnell, S. (2011). Big data challenges and opportunities. Spotfire Blogging Team. Retrieved from http://spotfire.tibco.com/blog/?p=6793
McGuire, T., Manyika, J., & Chui, M. (2012). Why big data is the new competitive advantage. Ivey Business Journal. Retrieved from
http://www.iveybusinessjournal.com/topics/strategy/why-big-data-is-the-new-competitive-advantage
Morris, J., & Lavandera, E. (2012). Why big companies buy, sell your data. CNN. Retrieved from
http://www.cnn.com/2012/08/23/tech/web/big-data-acxiom
Pring, C. (2012). 216 social media and internet statistics. The Social Skinny. Retrieved from http://thesocialskinny.com/216-social-media-and-
internet-statistics-september-2012
Protalinski, E. (2012). Facebook has over 845 million users. ZDNet. Retrieved from http://www.zdnet.com/blog/facebook/facebook-has-over-
845-million-users/8332
Sabhlok, R. (2012). Is big data right for your business? President Obama thinks so. Forbes. Retrieved from
http://www.forbes.com/sites/rajsabhlok/2012/11/15/is-big-data-right-for-your-business-president-obama-thinks-so/
SAS. (2012). Big data meets big data analytics: Three key technologies for extracting realtime business value from the big data that threatens
to overwhelm traditional computing architectures (White Paper). Retrieved from http://www.sas.com/resources/whitepaper/wp_46345.pdf
Savitz, E. (2012). The big value in big data. Forbes. Retrieved from http://www.forbes.com/sites/ciocentral/2012/09/25/the-big-value-in-big-
data-seeing-customer-buying-patterns/
Schumpeter. (2011, May 26). Building with big data. The Economist. Retrieved from http://www.economist.com/node/18741392
Spakes, G. (2012). Turning big data into value. SAS Voices. Retrieved from http://blogs.sas.com/content/sascom/2012/04/12/turning-big-data-
volume-variety-and-velocity-into-value/
Statistic Brain. (2012). Google annual search statistics. Retrieved from http://www.statisticbrain.com/google-searches/
Terry, K. (2012). Health IT execs urged to promote big data. InformationWeek. Retrieved from
http://www.informationweek.com/healthcare/clinical-systems/health-it-execs-urged-to-promote-big-dat/240009034
Utz, S., & Krammer, N. C. (2009). The privacy paradox on social network sites revisited: The role of individual characteristics and group
norms. Cyberpsychology (Brno) , 3(2).
ABSTRACT
Analytics tools are capable of suggesting the most favourable future plan by analyzing “Why” and “How” blended with What, Who, Where, and
When. Descriptive, Predictive, and Prescriptive analytics are the types of analytics currently in use. A clear understanding of these three will
enable an organization to chalk out the most suitable action plan, taking various probable outcomes into account. Currently, corporations are
flooded with structured, semi-structured, unstructured, and hybrid data. Hence, existing Business Intelligence (BI) practices are not sufficient to
harness the potential of this sea of data. This change in requirements has made the cloud-based “Analytics as a Service (AaaS)” the ultimate
choice. In this chapter, the recent trends in Predictive, Prescriptive, and Big Data analytics, and some AaaS solutions, are discussed.
INTRODUCTION
Business Analytics is a collection of techniques for collecting, analyzing and interpreting data to reveal meaningful information.
Business analytics focuses on five key areas of customer requirements (Lustig, Dietrich, Johnson & Dziekan, 2010):
1. Information Access.
2. Insight.
3. Foresight.
4. Business agility.
5. Strategic alignment.
The phases of Business Analytics are:
1. Descriptive Analytics: The first phase of business analytics. Descriptive analytics is commonly delivered through business intelligence
tools; it considers what has happened in order to improve decision making based on lessons learnt.
2. Predictive Analytics: Predictive analytics, widely used by insurers, evaluates what could happen by analyzing the past to predict future
outcomes.
3. Prescriptive Analytics: Prescriptive analytics not only focuses on Why, How, When and What, but also recommends how to act to take
advantage of the circumstances. Prescriptive analytics often serves as a benchmark for an organization’s analytics maturity; IBM has
described prescriptive analytics as “the final phase” and the future of business analytics (Rijmenam, 2013).
With a clear understanding of Descriptive, Predictive and Prescriptive analytics, an organization can chalk out the most suitable future plan,
taking probable future outcomes into account. This chapter discusses Predictive Analytics, Prescriptive Analytics, Big Data Analytics, In-database
analytics and Analytics as a Service (AaaS).
PREDICTIVE ANALYTICS
Predictive Analytics originated in AI (Artificial Intelligence), making predictions based on discovered and recognized patterns in datasets.
Historically, predictive analytics has been studied under the umbrella of Operations Research (OR) or management science. Predictive Analytics
is also known as “one-click data mining,” since it simplifies and automates the data mining process. Predictive analytics develops profiles,
discovers the factors that lead to certain outcomes, and accordingly predicts the most likely outcomes with degrees of confidence in the
predictions.
Predictive Analytics aims at optimizing the performance of a system, using sets of intelligent technologies to uncover the relationships and
patterns within large volumes of data and predict future events, i.e. what is likely to happen (Bertolucci, 2013).
The following are common predictive analytics modeling tasks (Underwood, 2013):
• Link Analysis: Relationship discovery.
• Web Mining: Mining relevant information from structured, semi-structured and unstructured data on the web.
Insurance companies lose millions of dollars every year to fraudulent claims, losses that are ultimately passed down to customers as higher
insurance premiums. Predictive analytics can be used for fraud detection and for speeding up the claims processing of insurance companies
(Gualtieri, 2013).
Typical business areas where predictive analytics is applied include:
• Supply Chain.
• Customer Selection.
• Pricing.
• Human Resources.
• Product and Service Quality.
• Financial Performance.
Predictive analytics will have a big impact on business organizations, particularly those dealing with huge volumes of structured, semi-structured
and hybrid data. New methods developed through predictive analytics will help in analyzing the five Vs of Big Data, i.e. Volume, Veracity,
Velocity, Variety and Value.
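A minimal sketch of the predictive-analytics workflow described above (building a model and predicting the most likely outcomes with degrees of confidence). It fits a logistic regression on synthetic, made-up “claim” features, assumes NumPy and scikit-learn are installed, and is not tied to any particular vendor tool.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Hypothetical features: scaled claim amount and scaled days since policy start.
X = rng.normal(size=(n, 2))
# Synthetic label: larger, earlier claims are more likely to be flagged as fraudulent.
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("hold-out accuracy:", round(model.score(X_test, y_test), 3))
# Degrees of confidence (fraud probabilities) for two new, hypothetical claims:
new_claims = np.array([[2.0, -1.0], [-0.5, 0.5]])
print("fraud probabilities:", model.predict_proba(new_claims)[:, 1].round(3))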
PRESCRIPTIVE ANALYTICS
Prescriptive analytics has been around since 2003. In its “Hype Cycle of Emerging Technologies” report, Gartner identified prescriptive
analytics as an “Innovation Trigger” that is likely to mature in the next 5-10 years.
Prescriptive analytics, or “optimization,” builds on the capabilities of descriptive and predictive analytics. The role of optimization includes:
• Achieving the “best” result while addressing the “uncertainty” in the decision, with the help of stochastic optimization.
Prescriptive analytics is one step beyond predictive analytics. It analyzes questions such as “How do we achieve the best outcome, including the
effects of variability?” using a combination of techniques and tools to examine the effect of decisions in advance, prior to execution. Such
precautionary measures save an organization from making harmful decisions.
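The sketch below illustrates the optimization step that separates prescriptive from predictive analytics: given predicted (here simply assumed) margins and constraints, linear programming prescribes the best action. It assumes SciPy is available, and every number is illustrative.

from scipy.optimize import linprog

# Decision variables: units of product A and product B to stock.
# Assumed predicted profit per unit: A = $4, B = $3  ->  maximize 4*A + 3*B.
c = [-4.0, -3.0]                       # linprog minimizes, so negate the profit

# Assumed constraints: shelf space A + B <= 100, budget 2*A + B <= 150.
A_ub = [[1.0, 1.0],
        [2.0, 1.0]]
b_ub = [100.0, 150.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print("recommended stock (A, B):", res.x.round(1))   # expected: 50 and 50
print("expected profit:", round(-res.fun, 2))        # expected: 350.0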
Prescriptive analytics operates on all types of datasets stored in public, private and corporate data banks. The data types dealt with by various
analytic tools are classified as structured and unstructured. Structured data includes numerical and categorical data, while text, images, video,
social media streams, machine data and audio are known as unstructured data. Unstructured data contains a wealth of mineable information,
but it is mostly ignored since such data are difficult to store and analyze. Since these data play a very important role in the decision making of
an organization, combining structured, unstructured and hybrid data sets is an excellent idea for business analytics, giving a complete view of
the issues and challenges for making the best possible decision.
Data may be stored in a conventional data warehouse, in the cloud, or in a combination of both. Data fed into computational models is processed
by various rules and algorithms to produce a result. The final and most important stage is the prescriptive analytics stage, where What-If, How,
When, Who and Where are applied to reach the final conclusion before the decision is used in the policy-making process.
Business rules define various intricate business processes. Other technicalities of prescriptive analytics are handled by Mathematical models,
Applied statistics, Operations research, Machine learning and natural language processing (Rose Business Technologies,
http://www.rosebt.com/1/post/2012/08/predictive-descriptive-prescriptive-analytics.html).
The fundamental factors of prescriptive analytics are “Why” and “How.” These two factors are blended with What, Who, Where and When. Using
“What-If” analysis, predictive analytic tools let the user change the input parameters and observe the resulting effects. However, when it comes
to rating the effects across a huge number of tables and hundreds of parameters, decision making with predictive analytics becomes labor
intensive.
To illustrate “What-If” analysis, let us consider an example of selling 100 books and predicting the profit (see Table 1).
Table 1. MS Excel “What-If” use case
Now, applying MS Excel’s “What-If” analysis predicts the benefit of selling the same number of books at varying percentages of the highest
price, with the corresponding benefits computed using formula (1) (see Table 2).
Table 2. What-If prescriptive analytics (scenario summary: percentage of books sold at the highest price under each scenario)
Hence, the observation is that if all the books are sold at the highest price, the maximum profit that can be earned is $5,000.
Now, let us use the same example to set the goal of earning a profit of $6,000 by selling the same quantity (100 copies) of the book, without
changing the highest and lowest prices, using the “Goal Seek” feature (see Table 3).
Table 3. Goal Seek What-If analysis
This “Goal” is impractical: to make a profit of $6,000, about 133.33% of the books would have to be sold at the highest price, which is not
possible.
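The same What-If and Goal Seek logic can be sketched in plain Python. The highest and lowest prices below are assumptions chosen only to reproduce the figures quoted above ($5,000 maximum profit, and roughly 133% of the books required at the highest price for a $6,000 goal); they are not taken from the original tables.

N_BOOKS, HIGH_PRICE, LOW_PRICE = 100, 50.0, 20.0   # assumed values

def profit(pct_at_high):
    """Profit from selling N_BOOKS when pct_at_high (0..1) go at the highest price."""
    sold_high = N_BOOKS * pct_at_high
    sold_low = N_BOOKS - sold_high
    return sold_high * HIGH_PRICE + sold_low * LOW_PRICE

# What-If analysis: vary the percentage of books sold at the highest price.
for pct in (0.25, 0.50, 0.75, 1.00):
    print(f"{pct:.0%} at highest price -> profit ${profit(pct):,.0f}")

# Goal Seek: what percentage would be needed to reach a $6,000 profit?
target = 6000.0
pct_needed = (target - N_BOOKS * LOW_PRICE) / (N_BOOKS * (HIGH_PRICE - LOW_PRICE))
print(f"Goal Seek: {pct_needed:.2%} of the books at the highest price -> impossible, above 100%")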
Prescriptive Analytics Use Cases
Prescriptive analytics tools help airlines set the highest possible price for air tickets by analyzing the itineraries of passengers during times of
high demand.
Google’s self-driving car takes into consideration various predictions about what is coming its way, and their probable effect on a possible
decision, before taking that decision to prevent an accident. The exponential growth of unstructured data from social networking activities
such as blogs, tweets and likes has created a huge market for prescriptive analytics. Corporations can use Facebook “likes” in prescriptive
analytics to estimate the demand for a particular commodity by scanning billions of blogs. Leading Big Data vendors are adopting prescriptive
analytics based on Operations Research (OR).
Prescriptive analytics will be the next evolution in business analytics, using an automated system that combines Big Data, business rules,
mathematical models and machine learning to deliver perceptive advice in a timely fashion. One of its proponents is Ayata, an Austin, Texas
developer of prescriptive analytics software, whose customers include major IT players such as Cisco, Dell and Microsoft (Rijmenam, 2013).
Challenges of Prescriptive Analytics
According to a Gartner survey, only 3% of organizations use prescriptive analytics, due to its inherent complexities and intense dependency on
current observations for predicting the future. Serious and careful examination of the past and present is vital for predicting the desired
probable outcomes. Apart from the above-mentioned limitations, prescriptive analytics faces several additional challenges (Rijmenam, 2013).
BIG DATA ANALYTICS
Cloud is sure to transform computing in business. On-premise Business Intelligence (BI) is complex and expensive: it requires costly consulting
and long implementation cycles, is inflexible, and is limited to large clients who can afford it. Apart from these disadvantages, existing analytic
and BI practices are not competent to deal with the exponentially increasing Big Data generated by social media, sensor data, spatial
coordinates and external data. This paradigm change in requirements has made “Cloud-Based Analytics” the ultimate choice. Since corporations
using cloud applications for CRM, ERP, HRM and other enterprise applications can easily put their databases in the cloud, both cloud-based and
traditional business analytics tools can access the data smoothly.
Social media, sensor data, spatial coordinates and social networking sites are some of the Big Data sources corporations must now address. There
is a lot of scope for innovative use of Big Data in the banking industry, for example enhancing the customer database by analyzing customers'
online activities, such as logs, to identify business potential. Predictive analytics solutions analyze patterns found in Big Data to predict potential
future outcomes. Predictive analytics capitalizes on all available data to provide unique insights, add smarter decisions to existing systems,
drive better outcomes and deliver greater value.
Big Data analytics platforms deal with fast-moving data sources, such as various types of sensor data, smartphone-generated data, customer
interaction data, online transaction data and web logs. Big Data is a business strategy for capitalizing on information resources, and Big Data
requires iterative and exploratory analysis.
Two important trends make the era of Big Data analytics different (IBM Institute for Business Value & Saïd Business School, University of
Oxford, Analytics: The Real-World Use of Big Data: How Innovative Enterprises Extract Value from Uncertain Data, ibm.com/iibv):
• The recent trend of digitizing all data has resulted in a lot of large, real-time data across a broad range of industries. Much of this
unstructured, semi-structured and hybrid data, such as streaming, geospatial and sensor-generated data, does not fit neatly into traditional
structured relational data warehouse models.
• Today’s advanced analytics technologies and techniques enable organizations to extract insights from data with previously unachievable
levels of sophistication, speed and accuracy.
Big Data business analytics use cases are categorized into the following categories:
1. Transactional:
• Fraud detection.
2. Sub-Transactional:
• Weblogs.
• Social/Online media.
• Telecoms events.
3. Non-Transactional:
• Documents.
• Physical events.
• Application events.
Big Data Analytics Technology
The following are some practical approaches for dealing with Big Data:
1. MapReduce (HADOOP)
2. BSP (HAMA)
3. STORM
“Apache Hama” is a pure Bulk Synchronous Parallel (BSP) computing framework on top of HDFS for huge scientific computations such as
matrix, graph and network algorithms (http://hama.apache.org/).
Storm is a free, open-source distributed realtime computation system (http://storm-project.net/). Storm uses clusters similar to Hadoop. The
difference is that, while Hadoop runs “MapReduce jobs,” Storm runs “topologies.” One key difference between “jobs” and “topologies” is that a
“MapReduce job” eventually terminates, while a Storm “topology” processes messages until it is killed.
A Storm cluster consists of two kinds of nodes: the master node and the worker nodes.
The master node runs a daemon called “Nimbus,” similar to Hadoop's “JobTracker.” Nimbus is responsible for distributing code around the
cluster, assigning tasks to machines, and monitoring for failures.
Each worker node runs a daemon called the “Supervisor.” The supervisor listens for work assigned to its machine, and starts and stops worker
processes based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many
worker processes spread across many machines.
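The contrast between a terminating batch job and a long-running topology can be shown conceptually in plain Python; this is an analogy only, not Storm’s actual API.

import itertools

def batch_job(records):
    """Batch (MapReduce-like): process a finite input, produce a result, terminate."""
    return sum(len(r.split()) for r in records)

def streaming_topology(message_source, stop_after=3):
    """Stream (topology-like): loop over an unbounded message source; a real
    topology keeps running until it is killed, so we stop artificially here."""
    for i, msg in enumerate(message_source):
        print("processed message:", msg)
        if i + 1 >= stop_after:
            break

print("batch result (total words):", batch_job(["log line a", "log line b"]))
streaming_topology(f"event-{n}" for n in itertools.count())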
Big Data Analytics Technology
Main Big Data analytics technologies are:
1. Hadoop: A low-cost, open-source, reliable scale-out architecture for distributed computing (see the word-count sketch after this list).
Hadoop's core components are:
a. HDFS: The Hadoop Distributed File System, which stores large data sets reliably across the nodes of the cluster;
b. MapReduce: Map distributes a computational problem across the cluster; Reduce collects the answers to all the sub-problems on the
master node and combines them. The four forms of MapReduce used in various distributed systems include:
i. Map Only
2. NoSQL Databases: Offer huge horizontal scaling and high availability, and are highly optimized for retrieval and appending. Many open-
source and proprietary NoSQL databases with Big Data handling capabilities are appearing; most of them are based on the Hadoop MapReduce
framework and can handle sophisticated analysis.
3. Analytic RDBMS: Optimized for bulk-load and fast aggregate query workloads. The features of an analytic RDBMS are:
a. Column-Oriented.
c. In-Memory Processing.
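As a concrete illustration of the Map and Reduce phases named in item 1, the following word-count sketch simulates both phases, plus the intermediate shuffle, in plain Python; on a real Hadoop cluster the same two functions would run in parallel over input splits stored in HDFS.

from collections import defaultdict

documents = ["big data needs hadoop", "hadoop splits big jobs", "map then reduce"]

def mapper(doc):
    """Map: turn one input record into intermediate (key, value) pairs."""
    for word in doc.split():
        yield word, 1

def reducer(word, counts):
    """Reduce: combine all values that share a key into a single result."""
    return word, sum(counts)

# Shuffle: group intermediate pairs by key (Hadoop does this between the phases).
grouped = defaultdict(list)
for doc in documents:
    for word, count in mapper(doc):
        grouped[word].append(count)

for word, counts in sorted(grouped.items()):
    print(reducer(word, counts))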
Corporations using Big Data analytics can better understand customers, unlock new revenue streams and overtake the competition (Hopkins &
Evelson, 2011). All types of corporations are finding innovative ways to engage with existing and prospective customers. By combining
traditional analytics with Big Data analytics, innovative organizations can unlock value to fuel performance (Business Analytics for Big Data:
Unlock Value to Fuel Performance, http://www.ibmbigdatahub.com/whitepaper/business-analytics-big-data-unlock-value-fuel-performance).
These efforts are dramatically improving organizations’ ability to compete. However, they require analytics that are tuned specifically to the
unique characteristics of Big Data: analytics that can deliver insights to all stakeholders so that decisions and actions can be consistently
optimized at every level of the organization (Business Analytics: Business Analytics for Big Data: Unlock Value to Fuel Performance, IBM
Corporation Software Group, http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=YTW03329USEN).
In fact, Big Data can be a two-way street between customers and organizations, and manifold benefits branch out from customer-focused Big Data
analytics. Decreases in the cost of both storage and computing power have made it feasible for every company to tap the power of Big Data to
uncover insights. The companies that figure out how to capture business value from Big Data will surely have a competitive advantage in a
global economy increasingly driven by pervasive computing. The most effective Big Data analytics identify the business requirements first, and
then mould the infrastructure, data sources and other resources to maximize profit through optimal utilization of resources.
Currently, many corporations are taking advantage of Big Data analytics by tapping the growing user bases of social networking sites such as
Facebook, LinkedIn, and Twitter. Most of them reach their customers directly through these social networking sites. Advanced analytics
tools can integrate social media data with traditional data sources to gather a complete picture of the consumer environment. For example, a
series of Facebook “Likes” or Twitter comments is only qualitative feedback; however, integrating the geospatial characteristics of the social
data with more concrete data from point-of-sale and customer loyalty programs can quantify the true value of those social media inputs.
The following section discusses Big Data analytics tools.
Big Data Analytics Tools
To maximize the benefits of Big Data analytics initiatives, it is critical for organizations to select the right analytics tools and to involve people
with adequate analytical skills in the project; both are recommended best practices for managing Big Data analytics programs.
Often Big Data is defined by the three Vs, i.e. Volume, Variety and Velocity, which is helpful but ignores other commonly cited characteristics of
Big Data, such as Veracity and Value.
Companies that have large amounts of information stored in different systems should begin a Big Data analytics project by considering the
interrelatedness of data and the amount of development work that will be needed to link various data sources. The Hadoop programming
framework, an open-source project inspired by Google's MapReduce and distributed file system papers, supports the development of applications
for processing huge data sets in distributed computing environments.
Table 4. Big Data predictive analytics tools
The Big Data predictive analytics solutions mentioned in Table 4 range from coding tools to specific business solutions.
Big Data technologies have matured rapidly. Some popular Big Data products are Netezza, SAP HANA, Vertica, etc.; most of them now make it
possible to store and process enormous amounts of data in a matter of seconds. Widely used open-source counterparts such as Hadoop, HBase,
Avro, Pig, ZooKeeper, Apache Commons and Lucene are also enterprise-ready Big Data analytics tools and applications.
IBM Big Data Platform
IBM provides a number of integrated technology components for end-to-end analytics on data in motion and data at rest. These components
include:
• A data warehouse platform supporting traditional analysis and reporting on structured data at rest.
• A range of analytical appliances optimized for specific advanced analytical workloads on Big Data.
• An integrated suite of self-service BI tools for ad hoc analysis and reporting including support for mobile BI.
• Search based technology for building analytic applications offering free form exploratory analysis of multi-structured and structured data.
• Pre-built templates for quick start analytical processing of popular Big Data sources.
• A suite of integrated information management tools to govern and manage data in this new extended analytical environment.
Together, this set of technologies constitutes the IBM Big Data Platform (Gualtieri, 2013). This platform includes three analytical engines to
support the broad spectrum of traditional and Big Data analytical workloads:
• Hadoop: IBM InfoSphere BigInsights.
IBM InfoSphere BigInsights is IBM's commercial distribution of the Apache Hadoop system. It has been designed for exploratory analysis of
large volumes of multi-structured data. IBM InfoSphere BigInsights ships with standard Apache Hadoop software; however, IBM has
strengthened this by adding a number of features to make it more robust, including a POSIX-compliant file system and storage security.
IBM InfoSphere BigInsights on IBM System zEnterprise
With respect to IBM System z, IBM has announced a version of InfoSphere BigInsights Enterprise Edition that will run within the zEnterprise on
the zEnterprise BladeCenter Extension (zBX) frame. This means that Hadoop can run on virtualized HX5 Linux blades using virtual disk.
IBM PureData System for Analytics (Powered by Netezza Technology)
IBM PureData System for Analytics powered by Netezza technology is the next generation Netezza Appliance optimized for advanced analytical
workloads for structured data.
IBM InfoSphere Warehouse, Smart Analytics System and IBM PureData System for Operational Analytics
The IBM PureData System for Operational Analytics is based on IBM Power System. The IBM Smart Analytics System is a modular, pre-
integrated real-time Enterprise Data Warehouse optimized for operational analytic data workloads available on IBM System x, IBM Power
System or IBM System z servers. Both the IBM PureData System for Operational Analytics and the IBM Smart Analytics System family include
IBM InfoSphere Warehouse 10 software running on DB2 Enterprise Server Edition 10. DB2 10 includes a new NoSQL Graph store. Also
automated optimized data placement leveraging Solid State Disk (SSD) is included in the new PureData System solution.
IBM DB2 Analytic Accelerator (IDAA)
IBM DB2 Analytics Accelerator is an IBM Netezza 1000™ Appliance specifically designed to offload complex analytical queries from operational
transaction processing systems running DB2 mixed workloads on IBM System z. It can also be used with IBM DB2 for z/OS based data
warehouses to accelerate complex query processing.
IBM Big Data Platform Accelerators
IBM has built over 100 sample applications, user defined toolkits, standard toolkits, industry accelerators and analytic accelerators to expedite
and simplify development on the IBM Big Data Platform.
IBM Information Management for the Big Data Enterprise
InfoSphere Information Server, InfoSphere Foundation Tools, InfoSphere Optim and InfoSphere Guardium provide an integrated suite of tools
for governing and managing data. IBM Information Management tools for Big Data analytics span traditional and contemporary analytical
platforms and provide capabilities such as:
• Defining.
• Modelling.
• Profiling.
• Cleaning.
• Integrating.
• Virtualizing.
These functionalities are supported by IBM BigInsights, IBM PureData System for Operational Analytics powered by Netezza technology, IBM
DB2 Analytics Accelerator as well as IBM InfoSphere Master Data Management. IBM InfoSphere Information Server also integrates with IBM
InfoSphere Streams to pump filtered event data into IBM InfoSphere BigInsights for further analysis.
Limitations of Big Data Analytics
The business opportunities of Big Data are accompanied by challenges in capturing, storing, and accessing information. Analytics platforms have
been tremendously effective at handling structured data originating in Customer Relationship Management (CRM) applications to optimize
functions throughout an enterprise. Although prescriptive analytics is both possible and powerful, there are many challenges in integrating
these techniques into the business decision-making process.
Big Data does not create value until it is used to solve business challenges. Dealing with Big Data requires access to more, and different kinds of,
data, as well as strong analytics capabilities that include both software tools and the requisite skills to use them. Organizations engaged in Big
Data activities report that they start with a strong core of analytics capabilities designed to address structured data, and gradually add
capabilities to deal with the Big Data coming into the organization.
In order to inspect intermediate results, final results and their derivatives, it is important to provide visualization tools to users; otherwise they
will not be able to understand the data, and may misinterpret results and draw wrong conclusions from Big Data, where cumulative errors can
result in massive faults. Handled well, however, Big Data prescriptive analytics can have a very large impact on business decision making and
help organizations become more effective and efficient.
To deal with Big Data, an organization needs technology and human resources with suitable analytical skills, at an affordable cost. Another
serious Big Data concern is that the rich sets of structured and hybrid data brought into a Big Data store for analysis could easily attract cyber
criminals. Unlocking the true value of massive amounts of Big Data will require new systems for centralizing, aggregating, analyzing, and
visualizing enormous data sets.
In general analyzing and understanding petabytes of structured and unstructured data poses the following challenges:
• Scalability.
• Robustness.
• Diversity.
• Analytics.
IN-DATABASE ANALYTICS
In-database processing, also referred to as in-database analytics, is the integration of data analytics with data warehousing functionality. In-
database processing eliminates the movement of data by embedding analytical functionality directly into the database. In-database processing is
similar to data mining to some extent, since the database is mined for required data using descriptive and predictive models to discover
meaningful patterns. For instance, in one in-database analytics offering, an extensive library of numerical and analytical functions, ANSI SQL
OLAP extensions, and new libraries of pluggable analytical algorithms have been embedded into a columnar analytics database.
In conventional data analytics applications, data is extracted from the database and processed on a separate analytics server; with in-database
analytics, this middle layer (the analytics server) is eliminated for better performance. All queries are issued directly against the database, and
the query results are available to visualization tools for analysis and decision making.
Traditionally, analytical programs performed larger computations by exporting data from the data warehouse; in-database analytics avoids this
data movement out of a sluggish decision support system. Together with the performance and scalability advantages stemming from database
platforms with parallelized, shared-nothing Massively Parallel Processing (MPP) architectures, database-embedded calculations can respond to
the growing demand for high-throughput operational analytics such as fraud detection, credit scoring and risk management (Grimes, 2008).
Data movement may take a long time before computations are finished, and it might also lead to inconsistencies if the most recent version of the
data is not available or selected. In-database processing can deal with such difficulties and inconsistencies, since all the data processing
activities are executed in the database itself instead of moving data out of the data warehouse.
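The idea can be illustrated with a small sketch that uses Python’s sqlite3 module as a stand-in for the warehouse: the aggregation and a simple, assumed screening rule run inside the database engine, so only the small result set ever leaves the database.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (account_id INTEGER, amount REAL)")
db.executemany("INSERT INTO transactions VALUES (?, ?)",
               [(1, 120.0), (1, 9500.0), (2, 40.0), (2, 55.0), (3, 20000.0)])

# Conventional approach (avoided here): SELECT * and score every row in a separate
# analytics layer. In-database approach: the scoring rule runs inside the engine.
suspicious = db.execute("""
    SELECT account_id, SUM(amount) AS total
    FROM transactions
    GROUP BY account_id
    HAVING SUM(amount) > 5000          -- simple, assumed screening threshold
""").fetchall()
print("accounts flagged for review:", suspicious)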
Easier programmability is an additional benefit of the in-database approach.
Most SAS execution makes use of Symmetric Multi-Processing (SMP) hardware, allowing many CPUs to share processing workloads with
coordination through a shared-memory infrastructure. An MPP architecture can combine many SMP servers, configured with a virtually
unlimited number of high-performance CPUs.
In-database processing responds to the growing demand for high-throughput operational analytics over ever-growing data volumes and
complexity (https://www.google.co.in/#q=in-database%2B processing). Use cases of in-database applications include large databases such as
credit card fraud detection and investment risk management systems. In-database analytics can provide significant performance improvements
over traditional disk-based Database Management Systems (What Is In-Database Processing?,
http://www.wisegeek.com/what-is-in-database-processing.htm). In-database processing not only speeds up business analytics by running the
application inside the database, avoiding time-consuming data movement and conversion; it is also more accurate and cost effective than
conventional data-intensive environments (Das, 2010).
IBM has multiple in-database options for its DB2 and Netezza databases. Emerging in-database analytics exploits the programmability and
parallel-processing capabilities of database engines from Teradata, Netezza, Greenplum, and Aster Data Systems, offering:
• Increased Parallelization,
• Higher performance,
• Manageability,
• Reliability, and
• Security.
Large, database-intensive applications such as fraud detection and stock market analysis use in-database analytics. The most important
application of in-database processing is likely to be predictive analytics for making quick predictions. In-database processing usually comes as
standard in large business solutions, since businesses need powerful data processing systems for effective and timely business decision making
(What Is In-Database Processing?, http://www.wisegeek.com/what-is-in-database-processing.htm). As data volumes grow, in-database processing
techniques are gaining adoption because they eliminate time-consuming data movements and take advantage of the scalability and processing
power of database and data warehouse hardware. Leading in-database analytical software modules include R, SAS, etc. The current in-database
mining trend is about scaling to the next generation of MPP databases. Advanced in-database analytics packages will help Oracle, EMC
Greenplum, IBM, Teradata and SAS push in-database technology further.
ANALYTICS AS A SERVICE (AaaS)
Recent trends in enterprise computing are defined by combined Big Data and cloud computing applications. Big Data analytics in the cloud
environment has enabled many Small and Medium Enterprises (SMEs) to adopt AaaS, due to its cost effectiveness and ease of deployment. AaaS is
a cloud service delivery mechanism in which data analytics is provided as a service to customers by the Cloud Service Provider (CSP).
According to Gartner, more than 30% of analytics projects by 2015 will provide insights based on structured and unstructured data. Another
dimension of AaaS is Model as a Service (MaaS), though this concept is not fully differentiated by the CSPs. In Model as a Service, CSPs provide
analytical models as a service to their customers. Customers in turn can use these models as a base to set up the infrastructure required for their
analytical applications. For ease of delivery, this service is typically bundled and offered with AaaS by the CSPs.
The first generation of cloud-based analytics consisted of data warehouse and data management appliances such as Netezza, Teradata and
Greenplum. The second generation of cloud-based analytics, AaaS or on-demand BI, is emerging as the business analytics approach that bridges
the gap between enterprise analytics and end-user analytics for data aggregation and self-service business analytics.
AaaS is emerging as a clear and compelling model. AaaS is likely to become commonplace in different forms, including analytics service providers
and analytics-focused SaaS. Gradually, existing IT service providers, system integrators and data providers are moving into value-added analytics
services. AaaS is being deployed in many emerging application areas of analytics; Google Web Analytics, Adobe Omniture web analytics,
marketing analytics (M-Factor), and hosted/on-demand business intelligence platforms such as Panorama and SAP Business Objects On Demand
are a few worth mentioning.
Widely used open source predictive analytics software includes RapidMiner, R, WEKA and Revolution R Enterprise (Revolution Analytics). Revolution Analytics and Teradata have jointly developed solutions for maximizing the value of Big Data by running R analytics inside the Teradata database.
AaaS is an extensible platform approach providing cloud-based analytics for a varied set of functionalities and use cases. From a functional viewpoint, AaaS covers the end-to-end facilities of an analytical application, from data acquisition to end-user visualization, reporting and interaction. Apart from traditional functionality, AaaS extends to innovative concepts such as analytical applications tailored to the needs of users in various corporate roles. AaaS also frees corporations from the expense of maintaining their own reporting infrastructures, making it possible to focus on data analytics.
In this section, ten analytics service providers offering varieties of AaaS are discussed (10 Enterprise Predictive Analytics Platforms Compared, http://butleranalytics.com/10-enterprise-predictive-analytics-compared-platforms/).
FICO
FICO offers predictive analytics solutions covering:
• Deployment,
• Monitoring, and
• Management.
FICO has been developing and deploying these solutions in large organizations for over three decades. Many of the algorithms and capabilities offered by FICO are specific to the types of business problems addressed, and are accessible through R integration. Text analytics is implemented through integration with Lucene. FICO's customer base is found in banking, retail, government, healthcare and insurance, although the technology and services have broad applicability in virtually every industry.
• Scores.
Recently FICO has launched its Analytic Cloud, a suite of analytics solutions and Big Data storage resources offered as a cloud-based service for corporations. Built using open source technologies, the objective of the FICO Analytic Cloud is to give application developers, business users and FICO partners worldwide direct access to FICO's analytics and decision management tools and technology.
The FICO Analytic Cloud also allows organizations to create their own applications and services. This is likely to relieve many large organizations from dealing with the inherent complexities associated with Big Data analytics. FICO Big Data analytics solutions can store and analyse Big Data efficiently to derive the specific meaningful information hidden in it (Big Data Analytics, http://www.fico.com/en/Products/Pages/Big-Data-Analytics.aspx).
IBM
The IBM analytics solutions serve the interests of large organizations looking for more than a point solution and wanting to create a viable, long-term analytics infrastructure and capability. IBM has a number of vertical solutions to offer in this line:
A. Analytic Applications:
• Business Intelligence.
• Predictive Analytics.
B. The Key Platform Capabilities Include:
• Hadoop-based analytics.
• Stream Computing.
• Data Warehousing.
• Application Development.
• Systems Management.
• Reference Architectures.
These platforms blend traditional technologies, which are well suited for structured, repeatable tasks, with complementary new technologies that address speed and flexibility and are ideal for ad hoc data exploration, discovery and unstructured analysis.
KXEN
KXEN is one of the leaders in predictive analytics, particularly in the communications, financial services and retail industries. The list of KXEN customers includes Barclays, Vodafone and Sears. KXEN's flagship product InfiniteInsight® is a cloud-based analytics platform.
Cloud Prediction™ is a KXEN predictive analytics engine delivering a multi-tenant, cloud-based service. Cloud Prediction™ is a powerful predictive platform for making cloud applications smarter.
• Predictive Offers™.
• Predictive Retention™.
Oracle Advanced Analytics
Oracle provides a full range of technologies to handle data mining and statistical analysis. They are all put under the Oracle Advanced Analytics umbrella and include:
• Predictive analytics,
• Text mining,
• Statistical analysis,
• Data mining,
• Visualization.
Being originally a database company, and still a leader in database solutions, Oracle introduced the important technology known as in-database processing, which enables data processing in the database itself, eliminating sluggish data extraction from disk-based databases and ultimately leading to high-speed processing.
• Oracle data mining tools, based on SQL and PL/SQL, focused on in-database data mining and predictive analytics.
• Oracle R Enterprise, which integrates open source “R” with the Oracle database.
• Significantly extends the Oracle Database's library of statistical functions and advanced analytical computations.
• Provides support for the complete R language and for statistical functions found in Base R and selected R packages, based on customer usage.
• Open source packages written entirely in the R language, using only functions for which SQL counterparts have been implemented, can be translated to execute in the database.
• Without any visible difference to R users, their R commands and scripts are often accelerated by a factor of 10-100x.
For organizations dealing with Big Data, Oracle provides the following two solutions:
1. A low-cost platform for Big Data software based on the Cloudera distribution of Hadoop and Oracle NoSQL Database Community Edition. They can be used individually or in combination.
2. The Exalytics platform, with in-memory processing for high-throughput analysis tasks and data visualization and exploration tools.
Oracle provides a predictive analytics add-in for Microsoft Excel, specifically for support vector machines (SVM). Oracle predictive analytics is the natural route for existing Oracle users, with full capability to move into Big Data. For those without an existing Oracle commitment, there are many alternative solutions to meet business analytics requirements.
Revolution Analytics
“R” is one of the most widely used and most powerful analytics software systems. Revolution R Enterprise is built on open source “R” and has been enhanced for performance, productivity and integration, with tools including a visual user interface. R is a free statistics and analysis package that is very widely used. Many faster, R-based, embedded proprietary analytics environments, with visual tools or database interoperability, are available from various AaaS providers.
The R language is mostly used in statistics, data analysis, data-mining algorithm development, stock trading, credit risk scoring, market basket analysis and in all forms of predictive analytics. Many organizations have recently been deploying R beyond research into production applications (Revolution R Enterprise for Big Data Analysis and Predictive Analytics, http://www.revolutionanalytics.com/products/enterprise-big-data.php). Features of R are:
• Huge library of algorithms for data access, data manipulation, analysis and graphics.
• R scripts can be modified to work with Hive for data analysis with minimal code modification.
Revolution Analytics has addressed the performance and scalability challenges of Big Data analysis with terabyte-class data sets through innovative and integrated solutions such as Revolution R Enterprise. Revolution R Enterprise uses enterprise data sources, particularly Apache Hadoop, for Big Data applications. Revolution Analytics support and training services are bundled on top of the technology to meet organizations' requirements. Revolution Analytics' add-on package called RevoScaleR™ provides unprecedented levels of performance and capacity for statistical analysis in the R environment. For the first time, R users can process, visualize and model their largest data sets in a fraction of the time of legacy systems, without the need to deploy expensive or specialized hardware (Rickert, 2011).
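RevoScaleR itself is an R package; as a rough, language-agnostic analogue of its external-memory approach, the Python sketch below computes summary statistics over a large file in chunks rather than loading it all into memory. The file and column names are assumptions for illustration only.

```python
# Rough analogue of external-memory (chunked) statistics, in the spirit of
# RevoScaleR's approach; "transactions.csv" and the "amount" column are assumptions.
import pandas as pd

count = 0
total = 0.0
total_sq = 0.0

# Stream the file in manageable chunks instead of loading it all into memory.
for chunk in pd.read_csv("transactions.csv", usecols=["amount"], chunksize=1_000_000):
    values = chunk["amount"].dropna()
    count += len(values)
    total += values.sum()
    total_sq += (values ** 2).sum()

mean = total / count
variance = total_sq / count - mean ** 2   # population variance
print(f"n={count}, mean={mean:.2f}, variance={variance:.2f}")
```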
Salford Systems
Salford Systems has products capable of both traditional descriptive analytics and predictive analytics. The following are the products from Salford Systems (http://www.salford-systems.com/company/the-company).
Salford Predictive Modeler (SPM) supports both traditional descriptive and predictive analytics.
The SPM (Salford Predictive Modeler®) software suite is an accurate and very fast analytics and data mining platform for building predictive, descriptive and analytical models from databases of any size and complexity, for any type of organization. The SPM data mining tools include the following components (http://www.salford-systems.com/products):
• MARS,
• TreeNet,
• Random Forests,
• RuleLearner.
The SPM software suite's automation accelerates the process of model building by conducting substantial portions of the model exploration and refinement process for the analyst. The latest version, SPM v7.0, is available in both 32-bit and 64-bit versions. A trial version of SPM v7.0 can be downloaded for a 10-day trial. The SPM 7 product versions are ULTRA, PROEX, PRO and BASIC. The prerequisite for the 32-bit version is Microsoft .NET Framework 4 Client Profile.
SAP
SAP has integrated with “R” for analytics, along with a variety of distributed databases such as Hadoop, for SAP Big Data analytics applications. The following are the various algorithms used in SAP analytics, based on R as well as developed by SAP themselves:
• Clustering.
• Decision tree.
• Neural network.
• Regression.
Resulting predictive models can be exported as PMML (Predictive Model Markup Language) for deployment in a production environment.
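PMML is a vendor-neutral XML format, so models built in one tool can be scored in another. As an illustration of the general workflow (not SAP's own tooling), the sketch below trains a decision tree in Python and exports it to PMML using the open source sklearn2pmml package, assuming that package and a Java runtime are available.

```python
# Illustrative PMML export using the open source sklearn2pmml package (assumed
# installed, along with a Java runtime); this is only a sketch of the general PMML
# workflow described in the text, not SAP's tooling.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

# Train a decision tree (one of the algorithm families listed above) inside a
# PMML-aware pipeline.
pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier(max_depth=3))])
pipeline.fit(X, y)

# Serialize the fitted model to PMML for deployment in a production scoring engine.
sklearn2pmml(pipeline, "decision_tree_model.pmml")
```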
A rich set of SAP client applications allows users to intuitively design complex predictive models, visualize, discover and share hidden insights, and harness the power of Big Data with SAP HANA (SAP Predictive Analysis):
• Real-time answers.
SAP Predictive Analysis is a complete data discovery, visualization, and predictive analytics solution designed to extend the current analytics capability and skill set of a corporation to a new high, regardless of its history with BI. SAP Predictive Analysis is simple enough to allow business analysts to conduct forward-looking analysis using departmental data from an Excel sheet.
SAS
With the SAS predictive analytics and data mining solution, users can derive useful insights for fact-based decision making (Predictive Analytics and Data Mining, Derive useful insights for fact-based decision making, http://www.sas.com/technologies/analytics/datamining/index.html).
SAS is widely accepted and used in a number of vertical solutions, mainly in financial applications. The main SAS financial solutions include:
• Fraud detection.
• Customer analytics.
Statsoft
Statsoft STATISTICA is a collection of analytics software providing a comprehensive array of data analysis tools such as:
• Data analysis,
• Data management,
Pharmaceutical companies and other healthcare-related businesses are already marketing drugs and soliciting consumer feedback from social networks about new drugs and disease management options using Statistica. Statistica also offers several vertical solutions for credit scoring and quality control. The Enterprise versions of Statistica support access to Big Data sources such as Hadoop, with associated multi-threading techniques. Statistica Enterprise is an enterprise solution for role-based and automated data analysis and information retrieval. The data mining products of Statistica can identify trends and patterns in unstructured data.
TIBCO
TIBCO (The Information Bus Company) has a full range of capabilities starting from simple graphing to real-time analytics. TIBCO provides a
full armoury of visual and computational analytics tools for delivering powerful analytical capabilities ranging from the preparation and
distribution of data visualisations to the development and implementation of sophisticated data mining models.
The Spotfire Analytics Platform is made up of three components each of which contributes to the speed of adaptability:
• Configure: the Spotfire analytics client tool for configuring analytic applications.
• Administer: the Spotfire Server, providing a centralized point of administration and integration tools for existing IT infrastructure.
• Extend: the Spotfire analytic Developer Kits for extending the capabilities of the Spotfire Analytics Platform.
TIBCO Silver™ Spotfire® is an enterprise-class, cloud-based data discovery tool for analyzing patterns and outliers and finding unanticipated relationships in data with speed and reliability.
• Clarity of Visualization.
• Freedom of Spreadsheets.
• Relevance of Applications.
• Confidence of Statistics.
• Reach of Reports.
Getting Started with Using AaaS
The following are the high-level steps to be performed by any customer before getting started with the usage of AaaS.
Step 1: Setting Up or Creating a Data Source
In this step, a data source is created. The data source refers to the source from which data needs to be pulled by the CSP's analytical applications to make predictions. The data source can be anything from a huge database to a small set of files. As a part of this step, the customer needs to ensure that the format in which data is fetched from the data source is compatible with the data types supported by the CSP. If it is not compatible, the customer should work with the CSP to define or develop appropriate interfaces or adapters for data conversion. This step is very important for accurate prediction results, because any error in the input data will adversely affect the predictions. Typically the user interface to create a data source will be user friendly and will have options for the customer to upload the data directly or fetch it from a remote source using some kind of protocol at fixed time intervals. Data input provided by the customer at this stage can be structured or unstructured, and it can also be in multiple formats.
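As a purely hypothetical sketch of what this step might look like through a CSP's API, the snippet below registers a data source using the requests library. The endpoint, field names and API key are invented for illustration, since each CSP documents its own interface.

```python
# Hypothetical sketch of Step 1: registering a data source with an AaaS provider.
# The endpoint, field names and API key are invented for illustration; a real CSP
# would document its own interface.
import requests

data_source = {
    "name": "retail-orders",
    "type": "jdbc",                          # could also be "file-upload", "s3", ...
    "connection": "jdbc:postgresql://db.example.com:5432/orders",
    "format": "csv",                          # format the CSP's applications expect
    "refresh_interval_minutes": 60,           # fetch from the remote source hourly
}

response = requests.post(
    "https://aaas.example-csp.com/api/v1/datasources",
    json=data_source,
    headers={"Authorization": "Bearer <API-KEY>"},
    timeout=30,
)
response.raise_for_status()
print("Registered data source:", response.json())
```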
The different types of data which are the main candidates for big data analytics are:
• M2M Data: Machine to Machine (M2M) is a term used to describe solutions that focus on remote collection and transfer of data from
embedded sensors or chips placed on remote assets which are fixed or mobile. Collected data when transferred across networks, integrated
and analyzed, results in intelligence that can augment business processes and transform an Enterprise to a Smarter Enterprise. For example
in a hospital, equipment such as an MRI scanner is critical to operations, so it is important to predict when it is likely to fail. For this purpose, the huge volume of signals it generates must be analyzed to understand its working pattern and hence predict the point in time when it is likely to fail. Using AaaS, it is possible to use the
analytical applications from the CSPs to collect and analyze this data to make useful inferences. But the point to be noted here is that in such
cases, it becomes necessary to not only analyze but also collect and store such information. Hence apart from using AaaS, it becomes
necessary to use other infrastructural components like network and storage. These components will be charged separately as a part of IaaS
cloud delivery model.
• Data from Web Resources like Social Media Sites: Data from social media networking sites are used by many organizations for a
variety of purposes which include tracking the shopping interests of specific age groups, analyzing the impact of advertisements and other
brand promotion activities and so on. In this case, instead of specifying any data source, website links of these social media networks along
with some specific keywords can be given as option to AaaS CSP.
When data source inputs are specified for AaaS applications, it is very important for the customer to evaluate whether real-time data needs to be captured for analytical purposes. If it is required, aspects such as the network bandwidth available to capture real-time data must be considered. In these cases, it is also essential to specify the permissible delay allowed in data capture. These aspects need to be carefully stated in the SLA with the CSP.
The data to be used for prediction purposes can also be stored in CSP infrastructure. But this involves paying for the usage of the infrastructural
components (storage, network) as well.
Step 2: Dataset Creation
In this step, data from the data sources defined by the customer is pulled out and normalized to a common format to be used to create a prediction model. In some cases, the customer may want only specific data fields to be used for analytical purposes. This step provides the option for the customer to define those fields. If no specific fields are chosen, data from all the fields in the data source defined by the customer is used for building the prediction models. This step converts all the different types of data from different sources into one specific format, with defined fields and data types for the data present in each field.
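A minimal sketch of this normalization step, assuming two differently formatted sources and a handful of illustrative column names, might look like the following (using pandas):

```python
# Minimal sketch of Step 2: pulling data from two differently formatted sources and
# normalizing them into one dataset with agreed field names and types. File names
# and columns are assumptions for illustration.
import pandas as pd

# Source 1: CSV export with its own column names.
web_orders = pd.read_csv("web_orders.csv").rename(
    columns={"order_ts": "timestamp", "cust": "customer_id", "amt": "amount"}
)

# Source 2: JSON records from another system.
store_orders = pd.read_json("store_orders.json").rename(
    columns={"sold_at": "timestamp", "customer": "customer_id", "total": "amount"}
)

# Keep only the fields the customer selected and enforce one schema.
fields = ["timestamp", "customer_id", "amount"]
dataset = pd.concat([web_orders[fields], store_orders[fields]], ignore_index=True)
dataset["timestamp"] = pd.to_datetime(dataset["timestamp"])
dataset["amount"] = dataset["amount"].astype(float)

dataset.to_csv("normalized_dataset.csv", index=False)  # common format for modeling
```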
Step 3: Prediction Model Creation
In this step, the data from the data sets will be used to generate a prediction model based on some underlying statistical concepts like regression
or correlation. The underlying statistical concept to be used can be defined by the customer while signing the Service Level Agreement (SLA) with
the CSP. The list of analytical applications available and the type of statistical concepts supported by each of the applications will be advertised by
the CSP in the service catalogue. The customer can use this catalogue to make appropriate decisions.
The output of this step will typically be in a visual, diagrammatic format to facilitate easy understanding by the customer. This output will be a predictive model that shows the most relevant patterns present in the customer's data set. The most commonly used visual representation format in AaaS is the tree structure. Apart from this, there are several other formats, such as spirals, bar graphs and pie charts, used by different AaaS CSPs. The visual format for display of the prediction model can be specified by the customer in the SLA. This step corresponds to the concept of MaaS described earlier in this chapter. Many customers opt only for this service; they tend to use the prediction models generated in this step to make inferences of their own. It is also possible to transfer these generated models to a visualization device of the customer's choice using the APIs provided by the CSPs.
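As a simple illustration of this step, the sketch below fits a linear regression model with scikit-learn on a normalized dataset like the one produced in Step 2. The feature columns are assumptions chosen only for the example; a CSP would run the equivalent process on its own platform.

```python
# Sketch of Step 3: building a simple regression-based prediction model from a
# normalized dataset. The feature columns are illustrative assumptions and would be
# chosen by the customer (or derived during dataset creation) in practice.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("normalized_dataset.csv")

# Predict order amount from a couple of customer-level features.
X = dataset[["customer_tenure_months", "orders_last_quarter"]]
y = dataset["amount"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```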
Step 4: Prediction Result Generation Using the Model
In this step, the customer can typically specify the parameters on which predictions need to be based. Choosing important parameters will help customers get accurate predictions based on the factors that matter to them. The customer can opt to view these prediction results as a form, report or any other convenient format.
Limitations of AaaS
AaaS platforms aggregate data from diverse sources to offer a unique, focused analytics service. Although the advantages of AaaS are enormous, AaaS is not free from limitations. Being a cloud-based service, AaaS inherits all the challenges of cloud computing. The challenges of AaaS can be summarized as follows:
• Real-time applications are not well suited to cloud-based analytics. Real-time access needs on-premise data, and transferring all that data to the cloud for analytics can become a burden. In that case cloud-based analytics might not be the best choice.
• Another very important issue before putting business analytics in the cloud is legal requirements with respect to data auditing. Those
requirements are usually easier to meet if the entire data and analysis chain is controlled.
• Security around Big Data is an issue. Big Data adds data-in-motion and new file based analytical data stores to the data landscape thereby
making it more complex to manage security.
• Big Data from social media helps organizations undertake sentiment analysis on their consumers and better tailor their market outreach
programs. The challenge is, the data is lightly or poorly structured at best and completely unstructured at worst.
In this section, some of the features of AaaS offered by leading business analytics service providers were briefly discussed. Most of the AaaS providers are focusing on delivering analytics services on a cloud platform. Considering the business potential of Big Data, they are focusing on cloud-based Big Data Analytics as a Service (AaaS). Most of the AaaS offerings have the capabilities to address the requirements of a broad user base.
CONCLUSION
Cloud computing and Big Data have forced corporations to define new requirements for data management, enterprise information protection technology, and people with the appropriate analytical skills. The predictions and prescriptions must be correct and in synergy for prescriptive analytics to produce an accurate forecast. Predictive analytics uses a set of intelligent technologies to uncover the relationships and patterns within large volumes of data and to predict future outcomes using current and historical facts, while prescriptive analytics goes one step further to assess the effect of future decisions so that they can be adjusted before they are actually implemented.
The exponentially growing deployment of cloud computing and Big Data analytics has motivated many business analytics solution providers to offer AaaS meeting the requirements of a broad range of users. AaaS enables corporations to utilize the advantages of social networking sites' growing user base to reach their customers directly. As the technology matures, Big Data AaaS is likely to gain maturity and widespread adoption.
This work was previously published in the Handbook of Research on Cloud Infrastructures for Big Data Analytics edited by Pethuru Raj and
Ganesh Chandra Deka, pages 370-391, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Bertolucci. (2013). Prescriptive analytics and big data: Next big thing? InformationWeek: Connecting the Business Technology
Community. Retrieved from http://www.informationweek.com/big-data/news/big-data-analytics/prescriptive-analytics-and-big-data-
nex/240152863
Das. (2010). Adding competitive muscle with in-database analytics: Next generation approach powers better, faster, more cost-effective
analytics. Database Trends and Applications.Retrieved from http://www.dbta.com/Articles/Editorial/Trends-and-Applications/Adding-
Competitive-Muscle-with-In-Database-Analytics-67126.aspx
Grimes. (2008). In-database analytics: A passing lane for complex analysis. InformationWeek: Connecting the Business Technology
Community. Retrieved from http://www.informationweek.com/software/business-intelligence/in-database-analytics-a-passing-lane-
for/212500351?cid=RSSfeed_IE_News
Gualtieri. (2013). Evaluating big data predictive analytics solutions. FORRESTER Research. Retrieved from
http://www.biganalytics2012.com/resources/Mike-Gualtieri-Forrester-Research.pdf
Lustig, Dietrich, Johnson, & Dziekan. (2010, November). An IBM view of the structured data analysis landscape: Descriptive, predictive and
prescriptive analytics. The Analytics Journey, 11-18.
van Rijmenam. (2013). Understanding your business with descriptive, predictive and prescriptive analytics. Big Data-Startups: The Online Big Data Knowledge Platform. Retrieved from http://www.bigdata-startups.com/understanding-business-descriptive-predictive-prescriptive-analytics/
KEY TERMS AND DEFINITIONS
Analytics: The primary technology that facilitates decision making, along with other components, such as search and reporting. The key
platform capabilities of Analytics are Data collection, large volume Data Storage, Data synchronization, Data analysis and Reporting.
Open Source Analytics Platforms: Adoption of open source analytics platforms has already gained momentum. Open source R has already emerged as a leading platform for statistical innovation and collaboration, both in academia and industry. Adoption is evident with commercial vendors of R, such as Revolution Analytics, focusing on scaling the R computing language.
Web Analytics Systems: These systems work by collecting data from the Web site that is being monitored. Web analytics requires the Web site to send data to the analytics system. The most commonly used mechanism for collecting this data is HTTP interfaces.
CHAPTER 3
Big Data Computing and the Reference Architecture
M. Baby Nirmala
Holy Cross College, India
Pethuru Raj
IBM India Pvt Ltd, India
ABSTRACT
Earlier, the transactional and operational data were maintained in tables and stored in relational databases. They have formal structures and
schemas. However, the recent production and flow of multi-structured data has inspired many to ponder about the new ways and means of
capturing, collecting, and stocking. E-mails, PDF files, social blogs, musings, tweets, still photographs, videos, office documents, phone call
records, sensor readings, medical electronics, smart grids, avionics data, real-time chats, and other varieties of data play a greater role in
presenting highly accurate and actionable, timely insights for executives and decision-makers. The chapter provides an insight into the big data
phenomenon, its usability and utility for businesses, the latest developments in this impactful concept, and the reference architecture.
INTRODUCTION
Big data is at the heart of the conversation in the current era: everyone is thinking and talking about it. Earlier, transactional data were maintained as tables and stored in relational databases and files, while most other unstructured data were kept for a few years and then thrown out. There is a lot of potential value in these kinds of non-traditional, less structured data: e-mail, social media, weblogs, photographs, videos, presentations, phone calls and chats play a greater role in Business Intelligence analysis of enterprise data. Between now and 2020, the amount of information in the digital universe will grow by an unimaginable 35 trillion gigabytes as all major forms of media (voice, TV, radio, print) complete the journey from analog to digital (IDC, Sponsored by EMC2, 2012).
BIG DATA ANALYTICS
What is Big Data Analytics?
Big Data Analytics is the process of examining large amounts of data of a variety of types (Big data) to uncover hidden patterns, unknown
correlations and other useful information. In other words, Big data Analytics is the use of advanced analytical techniques against very large
diverse data sets that includes different types such as Structured/Unstructured and Streaming/Batch and different sizes from terabytes to
zettabytes.
Figure 1 shows how Big Data processing is done. By enabling data scientists and other users to analyze huge volumes of transactional data, as well as data from other sources left untapped by conventional Business Intelligence (BI) programs, Big Data analytics helps organizations make better business decisions.
These other data sources may include Web server logs and Internet click stream data, Social media activity reports, Mobile-phone call detail
records and information captured by the sensors.
Some people exclusively associate Big Data and Big Data analytics with unstructured data. Consulting firms like Gartner Inc. and Forrester Research Inc., however, consider transactions and structured data to be valid forms of Big Data as well.
Big Data Analytics can be done with the software tools commonly used as part of advanced analytics discipline such as Predictive Analytics and
Data Mining.
Three Key Technologies for Extracting Business Value from Big Data
• Information Management: Manage data as a strategic, core asset, with ongoing process control for Big data analytics.
• High-Performance Analytics: Gain rapid insights from Big Data and the ability to solve increasingly complex problems using more
data.
• Flexible Deployment Options: Choose between options for on-premises or hosted, Software-as-a-Service (SaaS) approaches for Big
Data and Big data analytics.
Four Primary Technologies for Accelerating Processing of Huge Data Sets
• Grid Computing: A centrally managed grid infrastructure provides dynamic workload balancing, high availability and parallel processing
for data management, analytics and reporting. Multiple applications and users can share a grid environment for efficient use of hardware
capacity and faster performance, while IT can incrementally add resources as needed.
• In-Database Processing: Moving relevant data management, analytics and reporting tasks to where the data resides improves speed to
insight, reduces data movement and promotes better data governance. Using the scalable architecture offered by third-party Databases, in-
Database processing reduces the time needed to prepare data and build, deploy and update analytical models.
• In-Memory Analytics: Quickly solves complex problems using big data and sophisticated analytics in an unfettered manner. Use
concurrent, in-memory, multiuse access to data and rapidly run new scenarios or complex analytical computations. Instantly explore and
visualize data. Quickly create and deploy analytical models. Solve dedicated, Industry-specific business challenges by processing detailed
data in-memory within a distributed environment, rather than on a disk.
• Support for Hadoop: the power of Analytics can be brought to the Hadoop framework (which stores and processes large volumes of data
on commodity hardware).
• Visual Analytics: Using Visual Analytics, one can very quickly see correlations and patterns in big data, identify opportunities for further
analysis and easily publish reports and information to an iPad. Because it’s not just the fact that you have big data, it’s what you can do with
the data to improve decision making that will result in organizational gains.
BIG DATA TECHNOLOGY (ECKERSON, 2012)
Before handling Big Data, there are several priorities that need to be checked. The unstructured data sources used for Big Data analytics may not fit in traditional data warehouses, which may not be able to handle the processing demands posed by Big Data. As a result, a new class of big data technology has emerged and is being used in many Big Data analytics environments.
Director of Analytics, SAP, remarks that “Hadoop, an open-source Apache product, and Not Only SQL (NoSQL) Databases don’t require the
significant upfront license costs of traditional Systems, and that’s making setting up an analytics platform and seeing a return on the investment
(ROI) more accessible than ever before.” (SAP Solutions for Analytics, 2012)
The principal technologies in this class are:
1. NoSQL Databases
2. Hadoop
3. Map Reduce.
These technologies form the core of an open source software framework that supports the processing of large data sets across clustered Systems.
NoSQL
This is otherwise called Not-Only SQL. NoSQL is the name given to a broad set of databases whose only common thread is that they do not require SQL to process data, although some support both SQL and non-SQL forms of data processing.
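The following small sketch illustrates the schema-less, query-by-example style of a document-oriented NoSQL store, using MongoDB's Python driver (pymongo). It assumes a MongoDB instance is running locally, and the collection and field names are invented for the example.

```python
# Illustration of the schema-less, non-SQL access pattern using MongoDB's Python
# driver (pymongo); assumes a MongoDB instance is running locally. Collection and
# field names are invented for the example.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["click_events"]

# Documents in the same collection need not share a fixed schema.
events.insert_one({"user": "u42", "page": "/pricing", "ts": "2014-01-15T10:02:00Z"})
events.insert_one({"user": "u17", "page": "/docs", "referrer": "google", "device": "mobile"})

# Query by example instead of SQL.
for doc in events.find({"page": "/pricing"}):
    print(doc)

client.close()
```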
Hadoop and Map Reduce (Stephenson, 2013)
Hadoop is an open source software project of the Apache Foundation: a Java-based programming framework for processing data-intensive applications in a distributed environment with built-in parallelism and failover. The most important parts of Hadoop are the Hadoop Distributed File System (HDFS), which stores data in files on a cluster of servers, and MapReduce, a programming framework for building parallel applications that run on HDFS. The open source community is building numerous additional components to turn Hadoop into an enterprise-caliber data processing environment. The collection of these components is called a Hadoop distribution.
Today, in most Customer Installations, Hadoop serves as a staging area and online archive for unstructured and semi-structured data, as well as
an analytical sandbox for data scientists who query Hadoop files directly before the data is aggregated or loaded into the data warehouse. But this
could change. Hadoop will play an increasingly important role in the analytical ecosystem at most companies, either working in concert with an
enterprise DW or assuming most of its duties.
How Hadoop Works
1. Data is loaded into the Hadoop cluster's distributed file system (HDFS).
2. Hadoop breaks up and distributes the data across multiple machines. Hadoop keeps track of where the data resides, and can store data across thousands of servers.
3. Hadoop executes MapReduce to perform distributed queries on the data. It maps the queries to the servers, and then reduces the results back into a single result set (a minimal word-count sketch follows below).
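As referenced in step 3, the following is a minimal word-count sketch in the Hadoop Streaming style: the map function emits (word, 1) pairs and the reduce function sums them per word. The launch command is indicative only, since the streaming jar location and options vary by Hadoop distribution.

```python
# Minimal word-count sketch in the Hadoop Streaming style (assumed setup; the
# streaming jar path and options vary by Hadoop distribution):
#   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#       -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#       -file wordcount.py
import sys

def mapper():
    # Map phase: each node scans its local block of input and emits (word, 1) pairs.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key, so counts accumulate per word.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word and current_word is not None:
            print(f"{current_word}\t{count}")
            count = 0
        current_word = word
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```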
How Hadoop is Being Used
• For targeted marketing and fraud detection, financial service providers such as credit card providers use Hadoop.
• For predicting what customers want to buy, retailers use Hadoop. They compare and organize information about product availability, competitors' prices and local economic conditions.
• To support their talent management strategies, Human Resources departments are using Hadoop to understand people-related business performance, such as identifying top performers and predicting turnover in the organization.
BIG DATA ANALYSIS IN CLOUD: STORAGE, NETWORK AND SERVER CHALLENGES (SCARPATI, 2012)
Bandey, D. (2012), Doctor of Law, says: “When a corporation mines the Big Data within its IT infrastructure, a number of laws will automatically be in play. However, if that corporation wants to analyze the same Big Data in the cloud, a new tier of legal obligations and restrictions arise, some of them quite foreign to a management previously accustomed to dealing with its own data within its own infrastructure.”
Big Data Analytics is often associated with Cloud computing because the analysis of large data sets in Real-Time requires a framework like Map
Reduce to distribute the work among tens, hundreds or even thousands of Computers.
Following are the six key elements of analytics defined by Gartner (Kelly, 2010):
1. Data Sources,
2. Data Models,
3. Processing Applications,
4. Computing Power,
5. Analytic Models, and
6. Sharing or Storage of Results.
In its view, any analytics initiative “in which one or more of these elements is implemented in the cloud” qualifies as Cloud Analytics.
The elasticity of the cloud makes it ideal for Big Data analytics, the practice of rapidly crunching large volumes of unstructured data to identify patterns and improve business strategies, according to several cloud providers. At the same time, the cloud's distributed nature can be problematic for Big Data analysis.
Cloud Storage can Drag Down Big Data Analysis
The Cloud Storage challenges in big data analytics fall into two categories - Capacity and Performance.
Scaling Capacity, from a platform perspective, is something all cloud providers need to watch closely.
Data retention continues to double and triple year-over-year because [customers] are keeping more of it. That certainly impacts the enterprises, because they need to provide capacity. Storage performance in highly virtualized, distributed clouds can be tricky on its own, and the demands of Big Data analysis only magnify the issue, several cloud providers said.
The big problem with clouds is making the storage perform well, and this is the biggest reason why some people would not use the cloud for Big Data processing.
Cloud Networking and Architecture Considerations
The challenges of supporting customers demanding Big Data analysis in the cloud do not end with storage. Cloud providers say it requires a more holistic approach to the network and overall cloud architecture. Big Data analysis in the cloud also raises networking issues for service providers. By having all of its partners and customers in one cloud, CloudSigma makes the most of its ecosystem strategy of running a 10-Gigabit Ethernet network, which means that terabytes of data can be moved around very quickly and at a very low cost. Savvis, which CenturyLink acquired last year, is also considering the network implications of Big Data in the cloud.
There is no need to ship terabytes and petabytes around; instead, keep the data where it is and move the analytics to that data.
Security
But what happens when we translate the matrix of data into the cloud? The first matter is that of security. Breaches occasioning the loss of data
can cause an abundance of law-based difficulties: from breach of contract, fines under the Data protection Law, uncapped damages due to the
release of third-party secrets and so on.
Personal Identifying Information
Moving on from security, there is the matter generally referred to as the trans-border movement of PII. Many countries either restrict or prohibit the exporting of PII. To do so can even be a corporate crime, certainly exposing the wrongful exporter to the likelihood of a hefty fine, adverse publicity, and reputation loss.
Analytics
There is no reason, in law, why Big Data analytics cannot be performed lawfully in the cloud. However, in order to do so, significant attention needs to be directed to the actual software and hardware programming architectures to be employed, and to matching those to the matrix of laws which operate over the storage, use, processing, and movement of data. In order for Big Data analytics in the cloud to be lawful, the requirements of the law need to be accurately mapped onto the cloud computing technology at hand (Bandey, D., 2012).
A WIDE RANGE OF DATA ANALYTICS
• Behavior Analytics: deals with the analysis of an individual's behavior, such as buying behavior, by making meaningful predictions on past and present data.
• ClickStream Analyses: deals with the analysis of which parts of the screen a computer user clicks on while web browsing or using another software application. This is useful for analyzing web activity, software, market research, and employee productivity.
• Network Analyses: deals with the analysis of network information and the relationships between various nodes in the network. Analyzing social networks and computer networks are great examples.
• Customer Analytics: deals with the analysis of data from customer behavior which helps companies to make key business decisions
through market segmentation and predictive analytics.
• Compliance Testing: deals with the determination of a product or System to meet some specified standard that has been developed for
efficiency or interoperability.
• Loyalty Analysis: deals purely with the analysis of customer loyalty, in other words focusing on a customer's commitment to a product, company, or brand.
• Campaign Management: deals with the analysis of data used to conduct outbound marketing campaigns and to provide advanced
management capabilities.
• Promotional Testing: deals with analysis of data mainly associated with marketing and campaign management Systems to identify the
best criteria to be used for a particular marketing offer.
• Patient Records Analyses: deals with the analysis of medical records associated to patients to identify patterns to be used for improved
medical treatment.
• Fraud Monitoring: deals with the intentional deception made for personal gain or to damage another individual. Monitoring is the
process of identifying and predicting this activity.
• Financial Tracking: deals with ensuring regulatory and compliance with financial related data.
• Tick Data BackTesting: deals with the analysis of tick-by-tick historical market data to identify patterns compared to historical records.
(see Figure 2)
THE BIG DATA REFERENCE ARCHITECTURE (BDRA)
The big data discipline is a fast-growing one, and its adoption and adaptation across business verticals are steadily climbing. Its implications and impacts for executives, entrepreneurs, and engineers are being projected and presented as path-breaking and trend-setting. However, as the complexity of and changes in big data computing, especially big data analytics, keep growing, it makes sense to come out with a comprehensive and compact reference architecture for big data computing. A big data-specific reference architecture is definitely a tactical as well as a strategic need for data scientists and others who work with big data to bring out tangible business value with ease and elegance. In this section, we discuss the prerequisites for the big data reference architecture, the key contributing components and their specific contributions and capabilities for enabling new-generation big data analyses, how these components interact with one another seamlessly and spontaneously to implement any desired business goal, and so on.
There are a number of enabling technologies, products, platforms, tools, facilitating frameworks, runtime engines and design patterns emerging and evolving in this hot and happening space. The compute, storage and network infrastructures for big data computing are also fast maturing and stabilizing. A bevy of big data applications and services are being conceptualized and concretized. In the big data space, Hadoop and its allied technologies are being given prime importance. Hadoop is the big differentiator for the big data domain: it can work reliably on hundreds of commodity servers and has the innate ability to help organizations gain insights from huge volumes of high-velocity, multi-structured data in a cost-effective manner. Due to the exponential growth of data from myriad sources (internal as well as external) and its unique contribution to enriched business insights, enterprises are cautiously jumping onto the big data bandwagon. There are additional IT requirements to facilitate big data analytics, and there are questions regarding the leverage of existing IT assets and resources in order to quickly and affordably acquire big data capabilities. The various constituents and ingredients of the big data space are discussed here towards the reference architecture.
The Emerging Big Data Sources
Firstly there are new sources such as web and social sites, sensors and actuators, scientific experiments, manufacturing machines, biological
information, business transactions, etc. With digitization and distribution concepts gaining more relevance and rewards, data generation through
men and machines is bound to grow fast.
Big Data Analytics
This is the most sought-after affair in the big data space. Data has acquired asset status, and hence extracting all kinds of embodied patterns, associations, cues, clues, and other actionable insights has become mandatory for business behemoths to plan ahead and be competitive in their deals, deeds and decisions.
Streaming Analytics
To work with streaming data, a new computing discipline is coming up fast. Stream computing is being positioned as the approach for efficiently capturing streaming data and producing personal as well as professional insights in time. With sensors and actuators deployed in large numbers in places of importance, all kinds of event messages need to be gleaned and gathered. Event-driven architecture (EDA) and event processing systems that fully comply with the EDA standard are the latest innovations and ingredients for deftly capturing and capitalizing on a variety of fast-moving events. Parallel processing is the key trick and trait here.
There are both open source and commercial-grade products from different application infrastructure solution providers across the globe. IBM offers InfoSphere Streams for streaming data, and Apache offers Flume, which uses streaming data flows to collect, aggregate and move large volumes of data into HDFS.
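As a generic, tool-agnostic sketch of the stream-computing idea, the following Python snippet consumes a (simulated) sensor feed and maintains a rolling average over a short time window, raising an alert when a reading deviates sharply. A real deployment would read events from a platform such as Flume or InfoSphere Streams rather than this stand-in generator.

```python
# Generic sketch of stream computing: consume events as they arrive and maintain a
# rolling aggregate over a short time window, instead of storing first and analyzing
# later. The event source is simulated; a real system would subscribe to a streaming
# platform such as Flume or InfoSphere Streams.
import itertools
import random
import time
from collections import deque

WINDOW_SECONDS = 10
window = deque()          # (timestamp, reading) pairs inside the current window

def sensor_events():
    # Simulated sensor feed; stands in for a real streaming source.
    while True:
        yield time.time(), random.gauss(25.0, 2.0)   # e.g. a temperature reading
        time.sleep(0.5)

# Process a finite number of events so the sketch terminates.
for ts, reading in itertools.islice(sensor_events(), 100):
    window.append((ts, reading))
    # Evict readings that have fallen out of the window.
    while window and window[0][0] < ts - WINDOW_SECONDS:
        window.popleft()
    avg = sum(r for _, r in window) / len(window)
    if abs(reading - avg) > 5:                        # simple event-driven alert rule
        print(f"anomaly: reading={reading:.1f}, {WINDOW_SECONDS}s average={avg:.1f}")
```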
Real-Time Analytics
Big data analytics requires both batch processing and real-time processing. On many occasions, time really matters: if data is not utilized immediately and transformed into information and knowledge, it loses its sheen. A variety of real-time data sources, along with techniques to capture, transfer, aggregate, transform, filter, profile and disseminate them, are emerging these days. To obtain useful intelligence and make correct decisions to act upon, analytical techniques have to be applied instantaneously on emerging real-time data such as sensor, online, transaction, operational, security and financial data. To extract usable and reusable information in real time, there are special data analytics appliances and techniques such as in-memory, in-database and in-chip processing. In-memory database management systems rely on main memory for data storage; in-memory databases are optimized for speed compared to traditional database systems that store data on internal and external disks. In this era of big data, real-time analytics of big data is an important aspect not to be sidestepped, as data velocity is increasing significantly. Related to this phenomenon is data streaming. (see Figure 3)
Figure 3. Real-time big data processing
Text Analytics
IT operations logs, social sites, medical records, call centers, web content, etc. create a lot of textual data. Through the identification of core concepts, sentiments and trends, text analytics is a method for extracting usable knowledge from such unstructured textual data, and using those insights it supports decision-making.
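A toy sketch of this idea follows: it pulls frequent terms and a crude sentiment signal out of a couple of free-text snippets. The stop-word and sentiment word lists are illustrative assumptions; production text analytics relies on far richer linguistic models.

```python
# Tiny sketch of the text-analytics idea: extract core terms and a crude sentiment
# signal from unstructured text. The word lists are illustrative assumptions only.
import re
from collections import Counter

documents = [
    "Great support call today, the new billing portal is fast and reliable.",
    "Second outage this week, support ticket still unresolved. Very disappointed.",
]

STOPWORDS = {"the", "is", "and", "a", "this", "very", "new", "still", "today"}
POSITIVE = {"great", "fast", "reliable", "resolved"}
NEGATIVE = {"outage", "unresolved", "disappointed", "slow"}

for doc in documents:
    words = re.findall(r"[a-z']+", doc.lower())
    terms = Counter(w for w in words if w not in STOPWORDS)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    print("top terms:", terms.most_common(3), "| sentiment score:", score)
```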
Machine Analytics
As per reliable sources and statistics, machines generate more data than men. Another trend is that machines are getting interconnected with one another in their vicinity as well as integrated with remote cloud-based applications. This deeper and extreme connectivity leads to more data getting generated, transmitted and stored in high-end databases. Thus there is a need for extracting insights out of machine data heaps for people empowerment. A machine's performance is closely watched over, and any downtime is proactively identified in order to enable higher productivity.
Predictive and Prescriptive Analytics
These are direct derivatives of big data processing, mining, and analysis. There are a number of promising and pioneering statistical and mathematical methods and algorithms, such as clustering, classification, slicing and dicing, to extract details that perfectly and profusely predict what is to happen and prescribe what needs to be done to accomplish the desired goals. Big data is the base for all these appealing yet compelling aspects of next-generation enterprises.
The Hadoop Framework
The well-known Hadoop software framework and programming model is the key to the vociferous and overwhelming success of big data analytics. It has all that is required to efficiently and economically carry out increasingly complicated big data analysis. The Hadoop package comprises multiple technologies and tools to comprehensively and compactly accomplish the evolving big data requirements. The Hadoop technology has become the core and central factor for many implementations. Almost all the leading vendors are building their own data analytics products on top of Hadoop. Hadoop distributions are being made available in private as well as public clouds. Due to the insistence on high performance, there are appliances (hardware as well as software) in plenty to attract companies and corporations to tinker and tweak with big data (see Figure 4).
Figure 4. Big data architecture
Big Data Databases
Databases are very important in storing and managing data. Their role and responsibility in the emerging big data discipline is steadily growing and glowing. Traditional databases find it very difficult to cope with all the structural as well as behavioral requirements of big data analytics. Therefore, new types of databases such as NoSQL, NewSQL and hybrid models have firmed up in the recent past.
NoSQL databases, well known as Not Only SQL, are a category of new-generation database management systems (DBMSs) that do not use SQL as their primary query language. They do not support join operations, so they may not require fixed table schemas. They are optimized for highly scalable read-write operations rather than for consistency.
Remedied Database Management Systems
As the name denotes, it is not true that NoSQL databases are the only way to big data analytics. To be ready for the big data battle, legacy relational DB systems can be appropriately modernized; vendors increasingly re-configure these systems to intrinsically handle big data. For example, the IBM DB2 Analytics Accelerator leverages the IBM Netezza appliance to speed up queries issued against a mainframe-based data warehouse. All kinds of data stores, such as databases, data warehouses, cubes and marts, are being refactored to suit big data requirements. The well-known ETL (extract, transform and load) tools are also going through the necessary transitions to be usable in the new world of big data. New noteworthy approaches, such as store-and-analyze and vice versa, are being experimented with and expounded for the big data era.
Big Data Integration
Integration has been an important affair in the increasingly distributed world. There are tools, tips, techniques and technologies for enabling data
integration. Unlike the traditional ways of data integration, in the ensuing big data world, the data quantity will be comparatively higher and
hence big data movement middleware solutions and systems are getting enormous attention. That is, bulk data transfer is an important factor. In
the case of data warehousing, ETL is the bulk data transfer technology.
There are different approaches and tools for data integration. Enterprise application integration (EAI), data middleware, enterprise information
integration (EII), enterprise service bus (ESB), enterprise content management (ECM), and in the recent past, data Virtualization (data
federation) have become the principal mechanisms for data integration. Data federation is gaining more market and mind shares as far as the
data integration in the big data world is concerned.
Data Virtualization allows an application to issue SQL queries against a virtual view of data in distributed and disparate sources such as in
relational databases, XML documents, social sites, and other multi-structured data sources.
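As a simplified, local analogue of this pattern, the sketch below uses SQLite's ATTACH to expose two separate databases behind a single view that the application queries. Real data virtualization products federate far more diverse sources; the database and table names here are assumptions for illustration.

```python
# Simplified local analogue of a federated/virtual view: one SQL view spanning two
# separate SQLite databases via ATTACH. Database and table names are assumptions.
import sqlite3

conn = sqlite3.connect("crm.db")                       # local "customer" source
conn.execute("ATTACH DATABASE 'orders.db' AS sales")   # second, separate source

conn.execute("""
    CREATE TEMP VIEW customer_orders AS
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM customers AS c
    JOIN sales.orders AS o ON o.customer_id = c.customer_id
""")

# The application queries the virtual view, not the underlying sources.
for row in conn.execute("SELECT name, SUM(amount) FROM customer_orders GROUP BY name"):
    print(row)

conn.close()
```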
Big Data and Master Data Management (MDM)
MDM has been an important ingredient in the traditional data environment. Master data is the high-value and core information used to support
critical business processes across the enterprise. Having accurate data is one such vital aspect of data-driven enterprises. MDM is capable of
presenting the single version of the truth. Master data is the information about customers, suppliers, partners, products, employees, and more
and is at the heart of every business transaction, application and decision. The MDM concept can provide a compelling starting point for big data
analysis.
Integration points between MDM and big data include consuming and analyzing unstructured data, creating master data entities, and loading new profile information into the MDM system. As the basis for empowered big data analysis, it also includes sharing master data records or entities with the big data platform and reusing the MDM matching capabilities in the big data platform. Together, big data and MDM can help to extract valuable insights, in context and beyond what was previously possible, from data of increasing volume, velocity and variety and decreasing veracity.
Big Data Warehouses and Data Marts
Many organizations have made large investments in data warehouses and data marts, which may be based on relational databases. To facilitate business intelligence, there are data warehouse appliances. Organizations are seeking a seamless and spontaneous integration between what they already have and a litany of big data technologies such as Hadoop and NoSQL, because they want to leverage them for big data analytics.
Big Data Analytics and Reporting
Information visualization has become an indispensable requirement for decision-makers. All the insights extracted have to be presented to them in different forms and formats. Analytical tools support a variety of ways of knowledge discovery, and there are visualization and reporting tools for timely knowledge dissemination to authenticated and authorized entities.
Big Data Security
In the ensuing era of big data, data transition, persistence and usage levels will sharply go up. The security of data in motion/transit, in storage and in use therefore has to be ensured at any cost; otherwise the result will be irreparable, unpredictable and catastrophic. Besides data security, infrastructure, application, and network security are drawing the attention of security researchers and professionals. In the context of cloud-based big data analytics, the security scene is becoming more ruinous and more risky. Thus students, scholars and scientists are working collaboratively to come out with viable and value-adding security mechanisms that will strengthen businesses to embrace big data analytics and be relevant to their constituencies and customers.
Big Data Governance
This is an emerging approach for bringing the needed confidence and clarity to the big data domain. Policies play a very vital role here. Policy establishment and enforcement bring a kind of policy-based management to all kinds of business processes and interactions. If there is any slight violation, it can be identified and nipped in the bud. In a nutshell, governance is about doing the right things.
Big Data Lifecycle Management
Information Lifecycle management (ILM) is a process and methodology for managing information through its lifecycle, from creation through
disposal, including compliance with legal, regulatory, and privacy requirements. In a similar way, big data-centric ILM will be put in place so that any kind of misuse, wastage, and slippage can be avoided to the fullest.
The Cloud Infrastructures
As indicated above, the clouds are being espoused as the economic and elastic infrastructure for efficiently doing big data analysis. Other
differentiators include perceived flexibility, faster time-to-deployment, and reduced capital expenditure requirements. As there is a lot of data
getting posted on social sites, the web-scale big data analytics can be performed in public clouds. Big data analytical platforms are also placed in
private clouds considering the confidentiality and integrity of data and the business-criticality. Big data appliances are also very popular drawing
the attention of CIOs. (see Figure 5)
Figure 5. Big data integration
CHALLENGES AND OPPORTUNITIES WITH BIG DATA ANALYTICS (AGRAWAL, BARBARA, BERNSTEIN, BERTINO,
DAVIDSON, DAYAL, ET AL., 2012)
Big Data Analytics Earns High Scores in the Field
While industries vary greatly in what they need from their data, and even companies within the same industry are unalike, virtually every organization in every market has these two related problems: what to do with all the information pouring into data centers every second, and how to serve the growing number of users who want to analyze that data. We are awash in a flood of data today. In a broad range of application areas, data are being gathered at unmatched scale. Decisions that were made in the past were based on guesswork, but now they are made from insights extracted from the data itself. Every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences, is driven by such big data analytics. To name a few:
In Healthcare, the move to electronic medical records and the data analysis of patient information are being spurred by estimated annual savings
to providers in the tens of billions of dollars.
In the Manufacturing Sector, outsourcing a supply chain may save money, according to McKinsey & Company, but it has made it even more
critical for executives, business analysts, and forecasters to acquire, access, retain, and analyze as much information as possible about everything
from availability of raw materials to a partner’s inventory levels.
Challenges in Big Data Analysis
There are common challenges that underlie many, and sometimes all, of the phases of the big data analysis pipeline. Some of these are:
1. Heterogeneity and Incompleteness
2. Scale
3. Timeliness
4. Privacy
5. Human Collaboration
The challenges include not just the obvious issue of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains, and are therefore not cost-effective to address in the context of one domain alone. Furthermore, these challenges require transformative solutions, and will not be addressed naturally by the next generation of industrial products.
CONCLUSION
This introductory chapter has covered big data computing and a reference architecture for big data. We have explained the humble origins of big data and its dramatic contributions to the way businesses function, emboldening them to be resilient and to provide premium, versatile services to the world market.
This work was previously published in the Handbook of Research on Cloud Infrastructures for Big Data Analytics edited by Pethuru Raj and
Ganesh Chandra Deka, pages 2237, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Agrawal, D., Barbara, S., Bernstein, P., Bertino, E., Davidson, S., & Dayal, U. … Vaithyanathan, S. (2012). Challenges and opportunities with big
data: A community white paper developed by leading researchers across the United States.
Stephenson, D. (2013, January 30). Corporate & finance, industry trends. Big Data: 3 Open Source Tools to Know. Retrieved from
http://www.firmex.com/blog/big-data-3-open-source-tools-to-know/
KEY TERMS AND DEFINITIONS
Big Data Analytics: The process of inspecting huge amounts of varied data to uncover hidden patterns and unknown correlations, and to extract valuable information, using advanced analytic techniques and business intelligence tools.
Big Data: Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates, data
that would take too much time and cost to load into a relational Database for analysis.
Cloud Analytics: Any analytics initiative in which one or more of the following elements is implemented in the cloud qualifies as cloud analytics: Data Sources, Data Models, Processing Applications, Computing Power, Analytic Models, and Sharing or Storage of Results.
Hadoop: An open-source Apache framework for crunching large amounts of data, made up of two main components: HDFS and MapReduce.
HDFS: HDFS is an important part of Hadoop and is an abbreviation of Hadoop Distributed File System (HDFS), which stores data in files on a
cluster of servers.
MapReduce: MapReduce is an important part of Hadoop; it is a programming framework for building parallel applications that run on HDFS.
NoSQL: Also called "Not Only SQL." NoSQL is the name given to a broad set of databases whose only common thread is that they don't require SQL to process data, although some support both SQL and non-SQL forms of data processing.
CHAPTER 4
A Holistic View of Big Data
Won Kim
Gachon University, South Korea
OkRan Jeong
Gachon University, South Korea
Chulyun Kim
Gachon University, South Korea
ABSTRACT
Today there is much hype about big data. The discussions seem to revolve around data mining technology, social Web data, and the open source
platform of NoSQL and Hadoop. However, database, data warehouse and OLAP technologies are also integral parts of big data. Big data involves
data from all sources, not just social Web data. Further, big data requires not only technology, but also a painstaking process for identifying,
collecting, and preparing sufficient amounts of relevant data. This paper provides a holistic view of big data.
INTRODUCTION
Big data is one of the current IT buzzwords. It refers roughly to extraction of actionable intelligence from a large amount of data, including social
Web data, and applying it to some important needs of an organization. The data may be stored in the proprietary databases of an organization or
purchased from third-party data providers or may be gathered from the Internet.
Although there is much hype about big data today, big data has been around for at least three decades or even longer, depending on how it is
defined. From the 1970s, database systems, report generators and decision support systems were the technologies used for managing and
analyzing large amounts of data. In the 1990s data warehousing and data migration technologies made organization-wide decision making easier
over data across various data sources. At about the same time, data mining technology emerged to allow for semi-automatic extraction of
grouping and classification of data. In all these, relational database systems and file systems have been used for storing data. Recently, the
Hadoop open-source platform has become popular for storing and processing big data.
(For expositional simplicity, henceforth we will use the term “big data” to mean not just “a huge amount of data”, but also “storing, managing,
and analyzing big data”. Big data certainly requires technologies.) The current hype about big data in the trade press appears to make big data
seem like it is all about technologies, is a fully automated magic, and is a requirement for the survival of every organization. In reality, big data is
not all about technologies, it requires considerable expert human efforts, and it can give competitive advantages to an organization only if used
properly. In fact, big data requires the following three critical elements, besides technologies.
1. Big data requires data. This may sound obvious. However, the point is that the data must be of the right kinds, must be sufficient in quantity, and must be clean. If relevant data is not available, no actionable intelligence can be discovered. If the amount of data is not sufficient, the results of big data analysis may have no statistical significance. Even if there is a huge amount of data, when much of it is dirty, the data is not usable.
2. Big data involves a painstaking process. If this process is not properly followed, efforts to extract actionable intelligence from big data are not likely to succeed. The starting point of the process is to identify an important objective for big data and to explore the feasibility of successfully meeting that objective. The process ends after the actionable intelligence discovered is applied to the business needs. In between, the data must be analyzed for suitability for analysis and be cleansed, transformed, and encoded for analysis. The suitability analysis and preparation for analysis require substantial human effort.
3. Big data requires people who understand how to use the technologies and how to execute each step of the big data process. Data mining
technology is based on approximate computations that group data based on some measures of “mathematical similarity”, without
understanding the meanings of the data. There are many mining tasks, including grouping (clustering) similar objects, classifying new
objects into one of the existing groups, detecting anomalous data (outliers), etc. These tasks must be performed on various types of data,
including numeric data, words, text, Web pages, multimedia data, sequence data, etc. Further, these tasks must support the special
requirements and characteristics of numerous types of applications. To make matters worse, for any given task there are many algorithms with different tradeoffs. There has been much progress in the usability of the data mining software that embodies these algorithms. However, there is still a long way to go.
In other words, big data is difficult to do. In this paper, we provide a holistic view of big data, including technologies and non-technology
elements, so that the readers may have a more complete perspective of big data, rather than get sidetracked by the current hype.
The remainder of this article is organized as follows. In the Process section, we will discuss the big data process, along with the technologies
relevant to big data. In the Data Mining Technologies section, we will review data mining technologies. In the Database Platform section, we will
discuss the big data platform issues. In the Conclusion section, we will outline R&D directions and conclude the article.
PROCESS
Big data requires a process, which consists of about 10 steps. In this section, we outline a big data process, along with the technologies needed at each of the steps. The technologies are absolutely necessary for big data, but so is the process. The technologies are merely tools, and human experts need expertise both in using the technologies and in the application domains of the big data project. The discussion in this section will make it clear that big data is not just about data mining technologies and the Hadoop and NoSQL platforms. (We will say more about this later.) Further, we note that the data mining step (step 8 below) is only one of many steps in a big data process, and it also takes the least amount of time, perhaps only a few percent of the time it takes for the entire big data project.
We summarize the big data process below. Actually, before step 1 below, it is necessary for the organization to put in place the resources, skills, and management support necessary to carry out the big data project. We will assume that these have been taken care of. The organization may outsource the big data project. We note that if the big data project is similar to a past project, some of the early steps may be skipped.
2. Explore Feasibility Of Meeting The Objective: At this point, the availability of the data, technologies, and personnel needed for the
big data project has to be ascertained. Further, the expected duration of the project should be estimated. If the result of this feasibility
exploration is negative, either the big data project should be terminated, or the project plan must be altered.
4. Perform the First Round of Data Preparation: We separate “data preparation” for a big data project into two rounds. The first
round is needed for data analysis using a relational database system or an OLAP system. The second round is needed for data mining. In the
first round of data preparation, dirty data must be cleansed and data from heterogeneous data sources must be homogenized (Kim & Seo, 1992).
The data may use different formats for date and time; units of money, weight, length, area, volume; variants of people’s names (e.g., Obama,
Barack Obama, President Obama; Paul’s Diner, Pauls Diner; Joe, Joseph); acronyms (e.g., UIUC, University of Illinois at Urbana-
Champaign); etc. Homogenization of data means conforming the formats and representations of data to chosen standards. Cleansing dirty
data and homogenizing data can be automated only to a limited extent, and thus require considerable human efforts.
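To make this first round of data preparation more concrete, the following minimal sketch homogenizes name variants, date formats, and units in a toy dataset. It assumes the pandas library, which the chapter does not prescribe; the column names, values, and mapping table are purely hypothetical.

```python
import pandas as pd

# Toy records with name variants, mixed date formats, and mixed units (all hypothetical).
raw = pd.DataFrame({
    "customer": ["Barack Obama", "Obama", "Paul's Diner", "Pauls Diner"],
    "visit_date": ["2013-01-30", "30/01/2013", "Jan 30, 2013", "01/30/2013"],
    "weight_kg": ["2 kg", "2000 g", "1.5 kg", "500 g"],
})

# 1. Homogenize name variants to a chosen canonical form via an explicit mapping.
name_map = {"Obama": "Barack Obama", "Pauls Diner": "Paul's Diner"}
raw["customer"] = raw["customer"].replace(name_map)

# 2. Homogenize dates: parse each value, then keep a single ISO representation.
raw["visit_date"] = raw["visit_date"].apply(pd.to_datetime).dt.strftime("%Y-%m-%d")

# 3. Homogenize units: convert every weight to kilograms.
def to_kg(text):
    value, unit = text.split()
    return float(value) / 1000 if unit == "g" else float(value)

raw["weight_kg"] = raw["weight_kg"].apply(to_kg)
print(raw)
```

Even in this tiny example the mapping table has to be supplied by a person who knows that "Obama" and "Barack Obama" refer to the same entity, which is exactly why this step cannot be fully automated.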
6. Analyze the Data Using Query Tools or OLAP Tools: Much useful data analysis can be done, and unexpected insights may be gained, by issuing sequences of SQL queries to a relational database or an OLAP database. Decision support systems, query tools, report
generators, and numerous other applications have been created to support this type of data analysis. Data mining can yield certain types of
insights that queries against a relational database or OLAP database cannot. However, the converse is also true. So, in general, combining
the results of data analysis using query tools and data mining tools should be the best.
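As a small illustration of this kind of query-based analysis, the sketch below loads a hypothetical sales table into an in-memory SQLite database and issues the sort of aggregation query a decision support tool might generate. The schema and values are invented for the example.

```python
import sqlite3

# In-memory database with a hypothetical sales table; column names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "suit", 300.0), ("East", "shoes", 80.0),
     ("West", "suit", 250.0), ("West", "shirt", 40.0)],
)

# A typical exploratory query: total and average sales per region, largest first.
query = """
    SELECT region, SUM(amount) AS total, AVG(amount) AS average
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""
for region, total, average in con.execute(query):
    print(region, total, round(average, 2))
```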
9. Analyze the Results of Data Mining: The results of data mining are often not clear or definitive. For example, the results of
clustering or outlier detection need to be examined carefully to make sure they are satisfactory. In general, it takes repeated runs of data
mining tools to arrive at satisfactory results. This means human experts need to analyze the results after every data-mining run, not just
some magical “final” results. Further, the results of data mining usually need to be analyzed using visualization tools. This is because in
general the input data has many attributes (called independent variables), and the effects of each attribute or a combination of attributes is
difficult to understand otherwise. Further, data mining, such as clustering, yields many groups, and it is difficult to understand the
appropriateness of the grouping and the presence of outliers otherwise.
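The following sketch illustrates the kind of visual inspection described in step 9: cluster labels produced by a mining run are plotted so a human expert can judge whether the grouping, and any isolated points, look reasonable. It assumes scikit-learn and matplotlib, which are only one possible tool choice, and uses synthetic two-attribute data.

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Synthetic two-attribute data; real inputs usually have many more attributes.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# Color each point by its cluster so an analyst can judge whether the grouping
# looks reasonable and whether isolated points may be outliers.
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.xlabel("attribute 1")
plt.ylabel("attribute 2")
plt.title("Inspecting a clustering result")
plt.show()
```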
DATA MINING TECHNOLOGIES
Data mining discovers patterns among data in a large dataset. Data mining technologies have been founded on several fields, including statistics,
machine learning, information retrieval, information theory, databases, and information visualization. As such, there is extensive literature on
data mining technologies dating back at least three decades. There are numerous data mining products, both commercial and open source; and
there are numerous types of applications where data mining technologies have been successfully used
(http://en.wikipedia.org/wiki/Data_mining). Below we note the inherent difficulties of the data mining technologies and, as a consequence,
using the technologies. Despite these difficulties, the technologies, used properly by human experts, can yield remarkably accurate and usable
results.
First, even after a sufficient amount of data of sufficient quality has been properly prepared, in general it takes trial-and-error iterations with data
mining algorithms to obtain reasonable results. In other words, a data mining algorithm does not just take input data, perform computations,
and produce actionable intelligence magically in one try. During the trial-and-error iterations, the user often needs to make adjustments to the
attributes of the dataset and the amount of the input dataset (used for training and testing a model). The user of course also has to analyze the
intermediate results. Usability of data mining technologies has improved significantly over the years; however, they still require significant expert
human efforts.
Second, data mining technologies work primarily with numeric data and categorical data. They use measures of mathematical similarity among
data in order to separate data into groups of similar data, classify new data into one of existing groups of data, etc. The similarity measures may
be distance, statistical distribution, area density, etc. Because of the approximate nature of the similarity measures, it takes trial-and-error
iterations in running data mining algorithms to arrive at a satisfactory level of performance. The technologies for mining free-form text data, Web
pages, and multimedia data require special considerations beyond those for mining numeric data and word (short string) data.
Data mining technologies include several tasks, including clustering, classification, association rules, regression, and outlier detection
(http://en.wikipedia.org/wiki/Data_mining). There are many algorithms to support each of the tasks. Below we summarize principal data
mining tasks, and outline the inherent need for human experts in making use of the algorithms that support each of the tasks.
Clustering
A cluster is a set of data that is “closer” to one another than data that is not in the cluster. Clustering refers to separating a dataset into a number
of clusters using some measure of similarity (http://en.wikipedia.org/wiki/Cluster_analysis). There are several cluster models, and algorithms
for each model. The cluster models include distance-based model (single linkage algorithm, nearest neighbor algorithm), centroid model (k-
means algorithm), statistical-distribution-based model (Expectation-Maximization algorithm), density-based model (DBSCAN, OPTICS), and
graph-based model (clique).
As different objective functions lead to different optimal goals, they lead to different clustering results even for the same dataset. Moreover, for
the same dataset, it is very difficult to determine which model is better than others. Clustering requires trial-and-error iterations. For example,
for the k-means algorithm, the user needs to specify the number of clusters, k. But the user does not know how many clusters are in his dataset.
Thus, the algorithm needs to be repeated with various values for k until the results look reasonable.
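A minimal sketch of this trial-and-error loop, assuming scikit-learn (not prescribed by the text) and synthetic data, is shown below: k-means is re-run for several candidate values of k and a quality score is compared across runs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data standing in for a prepared dataset.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.5, (40, 2)) for loc in (0, 3, 6)])

# Re-run k-means with several candidate values of k and compare a quality score;
# in practice the analyst also inspects the resulting clusters themselves.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    print(k, round(silhouette_score(data, labels), 3))
```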
Some clustering algorithms do not require the user to know the number of clusters; however, they require the user to provide other parameters that are just as difficult to provide. For example, to use DBSCAN (Ester, et al., 1996), the user needs to provide the radius of a unit region, ε, and
the minimum number of points in the unit region. It is difficult to select proper values for these parameters, although these values are critical to
determine the boundaries of clustering.
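The sketch below, again assuming scikit-learn and synthetic data, shows how different choices of ε (eps) and the minimum number of points change what DBSCAN reports as clusters and as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(4, 0.3, (40, 2))])

# eps is the radius of the unit region and min_samples the minimum number of
# points inside it; both usually require experimentation by the analyst.
for eps in (0.2, 0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(data)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    print(f"eps={eps}: {n_clusters} clusters, {(labels == -1).sum()} noise points")
```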
High-dimensional data is data with up to many thousands of dimensions. High-dimensional data occurs, for example, in the analysis of medical
measurements data and text documents. If a word-frequency vector is used in clustering text documents, the number of dimensions equals the
size of the dictionary (http://en.wikipedia.org/wiki/Clustering_high-dimensional_data). With high-dimensional data, the dissimilarity between
any two data points is very large, and so the clustering algorithms cannot generate meaningful clusters. Therefore, before clustering, the user
needs a process to select an appropriate number of more relevant dimensions, as shown in the CLIQUE algorithm (Agrawal, et al., 2005).
Classification
Classification means classifying new data into one of the existing clusters (groups; http://en.wikipedia.org/wiki/Statistical_classification).
Classification requires two datasets to build and test a classification model (classifier): training set and test set. First, the user must build the
classification model by running an induction algorithm on the training dataset. There is more than one algorithm to choose from. Second, using
the test dataset, the user needs to test the model. The user is likely to repeat these steps to arrive at a model that delivers a desired level of
performance. Once the model has been built, the user can input new data to the model, which determines the existing groups where the new data
belong. There are several well-known classification algorithms, including k-nearest neighbors, decision trees, neural networks, Gaussian mixture
model, Naïve Bayesian, support vector machine, RBF (radial basis function) classifiers, etc.
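The following sketch illustrates the build-and-test cycle described above, using scikit-learn and its bundled iris dataset purely as stand-ins for a real training set and test set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split the available labeled data into a training set and a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the classifier on the training set, then measure it on the held-out test set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Once the model is judged satisfactory, classify new data.
print("predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```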
The Naïve Bayesian algorithm is effective despite its simplicity. However, if attributes are highly correlated, they receive too much weight in the
final classification decision, resulting in a lower accuracy (Ratanamahatana & Gunopulos 2003). To improve accuracy, different weights need to
be assigned to different attributes (Langley & Sage 1993). One method combines feature selection with the Naïve Bayesian learning. The feature
selection part requires the user to eliminate from a large number of features (attributes) determined by the Naïve Bayesian algorithm those that
are correlated or irrelevant. As the number of features can be very large, some techniques have been developed to limit the search efforts (Cardie
& Howe 1997; Ratanamahatana & Gunopulos 2003). Another method for improving the accuracy of the Naïve Bayesian algorithm is for the user
to assign different weights to the features. Some techniques have been developed to calculate the weight of the features (Wettschereck, D. et al.,
1997). Such techniques are based on the k-nearest neighbors algorithm.
The k-nearest neighbors algorithm classifies a new object by a majority vote of its neighbors, with the object being assigned to the group most
common among its k nearest neighbors (http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). Just as with the k-means clustering
algorithm, different values of k may lead to different results. The user needs to experiment with different values of k until he obtains satisfactory
results.
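A short sketch of this experimentation with k, assuming scikit-learn and the same illustrative iris data, follows.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of k and keep the one with the best cross-validated accuracy.
for k in (1, 3, 5, 9, 15):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k}: accuracy={score:.3f}")
```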
Association Rules
An association rule has two parts: the "if" (antecedent) part and the "then" (consequent) part. An example is "if a consumer buys a suit and a shirt, he is likely to buy a pair of shoes". The most important rules (i.e., the "if-then" patterns) can be discovered by applying some constraints, such as support and confidence, on the dataset (Agrawal, et al., 1993; http://en.wikipedia.org/wiki/Association_rule_learning). A transaction dataset is represented as a 2-dimensional matrix, where the rows represent transactions and the columns represent items. An itemset (e.g., suit, shirt, shoes) is the set of items included in the "if-then" patterns. Support indicates the proportion of transactions in the entire dataset that contain the itemset. Confidence indicates, among the transactions where the "if" part is true (e.g., suit and shirt), the proportion in which the "then" part (e.g., shoes) is also true. In an association rule algorithm, association rules are considered only if their support and confidence are not less than the minimum thresholds. Association rules have been applied to shopping basket data analysis, product clustering, catalog design and store layout, etc.
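The support and confidence measures can be computed directly, as the following sketch over a small hypothetical transaction dataset shows; real algorithms such as APRIORI avoid enumerating all itemsets, but the definitions are the same.

```python
# Hypothetical transaction dataset: each row is the set of items in one basket.
transactions = [
    {"suit", "shirt", "shoes"},
    {"suit", "shirt"},
    {"shirt", "shoes"},
    {"suit", "shirt", "shoes", "tie"},
    {"milk", "water"},
]

def support(itemset):
    """Proportion of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_if, rule_then = {"suit", "shirt"}, {"shoes"}
print("support:", support(rule_if | rule_then))       # 2/5 = 0.4
print("confidence:", confidence(rule_if, rule_then))  # 2/3 ≈ 0.67
```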
Unlike other data mining problems, association rules found by any association rule algorithms, such as APRIORI (Agrawal, et al., 1993) and FP-
GROWTH (Han, et al., 2000), are always deterministic, as long as the same support and confidence values are applied. The support value is
critical to the performance as well as the findings of an algorithm. If the minimum support value is too small, too many itemsets will be examined to generate association rules. This means the algorithm must run for a very long time to enumerate the many itemsets, and the resulting association rules become unreliable because most rules are inferred from very few cases. On the contrary, if the minimum support value is too large, only a small number of association rules will be found and many useful rules may be missed. It is very difficult to select the proper minimum support
threshold. The threshold is dependent on the distribution of the dataset. When a minimum threshold is initially given to an association rule
algorithm, the user cannot know if the threshold is small or large. The user can settle on a proper threshold value after repeating the process of
finding association rules with several different threshold values.
Perhaps more troubling is that the significance (or interestingness) of association rules is an issue (Chen, et al., 1996; Brin, et al., 1997). For
example, the association rule that “if a customer buys milk, then she will also buy water” is not significant. The fact that the customer buys two
daily necessity items together is not significant. This means that the user of association rule algorithms needs to identify significant models of
association rules.
Outlier Detection
An outlier is data that is very different from most others. Outliers represent either exceptional data or erroneous data. If they are exceptional
data, they may be of special interest; for example, an onset of an epidemic, fraudulent insurance claims, failure of mechanical parts, etc. Outlier-
detection algorithms are based on classification, clustering, statistics, nearest neighbor, information theory, spectral decomposition,
visualization, etc. (Chandola, et al., 2009). In general, it is not easy to define outliers. There is no rigid mathematical definition of outliers, and
identifying outliers among the results of an outlier detection algorithm is left to the human user. If the user does not find the results acceptable,
he needs to change the outlier model or parameter values, and repeat the algorithm to arrive at acceptable results.
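As an illustration of a simple statistical outlier model, the sketch below flags values that lie far from the mean of a small hypothetical sample; the threshold is a parameter the analyst must tune, and the flagged points must still be reviewed by a person.

```python
import numpy as np

# Hypothetical univariate measurements with two obviously anomalous values.
values = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7, -3.0, 10.1])

# A simple statistical outlier model: flag points whose z-score exceeds a chosen
# threshold. With so few points the threshold 2.0 is used here; the right value
# depends on the data and must be chosen (and revised) by the analyst.
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2.0]
print("flagged as outliers:", outliers)
```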
Regression
In a sense, regression is similar to classification. It first determines an algebraic equation, and then uses the equation to predict the position of
new data. There are two types of regression: linear regression (http://en.wikipedia.org/wiki/Linear_regression) and non-linear regression
(http://en.wikipedia.org/wiki/Nonlinear_regression). Linear regression in turn has different types: linear regression with a single dependent
variable, and linear regression with multiple correlated dependent variables. Linear regression with a single dependent variable may be simple
linear regression or multiple linear regression. In simple linear regression, the algebraic equation to be determined, given a dataset, is of the form
Y = aX + b. Y is the dependent variable and X is the independent variable. The Y value of new data is computed from the equation by substituting
the X value of the new data. Multiple linear regression uses more complex equations, such as a quadratic equation, to allow for multiple
independent variables.
Unfortunately, there is no general agreement on relating the number of observations to the number of independent variables in the regression
model.
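A minimal simple-linear-regression sketch, using NumPy least squares on synthetic observations, shows the Y = aX + b fitting and prediction described above.

```python
import numpy as np

# Hypothetical (X, Y) observations roughly following Y = 2X + 1 plus noise.
rng = np.random.default_rng(3)
X = np.arange(0, 10, 0.5)
Y = 2 * X + 1 + rng.normal(0, 0.5, X.size)

# Fit Y = aX + b by least squares (a degree-1 polynomial fit).
a, b = np.polyfit(X, Y, deg=1)
print(f"fitted equation: Y = {a:.2f} X + {b:.2f}")

# Predict the Y value of new data by substituting its X value into the equation.
x_new = 12.0
print("predicted Y:", a * x_new + b)
```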
In the following, we briefly summarize some of the data mining technologies for dealing with free-form text, Web pages, and multimedia data.
Text Analytics
Text analysis methods may be broadly grouped into text analytics and text summarization. Text analytics includes techniques for analyzing word
frequency distribution, feature extraction (e.g., names of people, organizations, places, events, etc.), special information recognition (e.g.,
telephone numbers, email addresses, quantities in certain measurement units, etc.), entity relations extraction, sentiment analysis, etc.
(http://en.wikipedia.org/wiki/Text_mining). Such techniques as the vector space model, content relevance analysis (with respect to the search
query), concept hierarchy analysis, etc. are used. The vector space model creates a matrix of term frequencies (TF) for every term that appears in
a document. As many of the frequently used terms may not be meaningful, their importance is reduced by introducing inverse document
frequency (IDF). Text summarization automatically produces tags for a document by analyzing the words, phrases, and sentences in the
document. It may also produce a summary of a document (e.g., snippets of the Internet search results) by selecting some of the sentences in the
document (http://en.wikipedia.org/wiki/Automatic_summarization).
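The sketch below builds a TF-IDF term matrix for three hypothetical documents, assuming scikit-learn's vectorizer as one possible implementation of the vector space model with inverse document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical documents.
docs = [
    "big data requires a painstaking preparation process",
    "data mining discovers patterns in big data",
    "the quarterly report summarizes sales by region",
]

# Build the term-frequency matrix and down-weight common terms with IDF.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Each document becomes a vector over the vocabulary; similar documents end up
# with similar vectors, which later mining steps (clustering, classification) use.
print(matrix.shape)                              # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:5])    # a few vocabulary terms
```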
Web Mining
Web data includes text, HTML tags, hyperlinks, and even multimedia data. Web mining consists of Web usage mining, Web page structure
mining, and Web page content mining (Srivastava, et al., 2009; http://en.wikipedia.org/wiki/Web_mining). Web usage mining analyzes the
server logs to determine Web usage, and Web browsing history to determine user’s browsing patterns. Web page structure mining determines the
(hierarchical) structure of the elements that form a single Web page, and the graph structure of the Web pages linked through hyperlinks. There
are several algorithms for analyzing the hyperlink-based structure of the Web pages (https://en.wikipedia.org/wiki/Link_analysis; Desikan, et
al., 2002). These include Google page rank, hubs and authorities, HITS (hyperlink induced topic search), Web communities, max flow and min
cut, etc. Web page content mining is done to mine the content of Web pages. Clustering is used for document grouping, classification for
classifying new documents to existing document groups, and association rules for analyzing relationships among documents. Techniques for text
analysis are used to analyze the text content.
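As an illustration of hyperlink-based structure mining, the following sketch runs a simplified PageRank power iteration over a tiny hypothetical link graph; production implementations must also handle pages without out-links and graphs that are vastly larger.

```python
# A tiny hyperlink graph: page -> pages it links to (hypothetical structure).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
damping, rank = 0.85, {p: 1 / len(pages) for p in pages}

# Power iteration: repeatedly redistribute each page's rank along its hyperlinks.
for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})  # C accumulates the most rank
```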
Multimedia Mining
Multimedia mining analyzes multimedia data, such as images, photos, drawings, video, and audio (Murty). Mining multimedia data is
computationally much more expensive than mining numeric data. It clusters and classifies multimedia data. It searches for similar data, and
detects outliers. It also determines the boundary between data, for example, the boundary between persons in a conversation or broadcast.
Multimedia mining also analyzes sequences of multimedia data, such as traffic monitor data, EKG monitor data, etc.
DATABASE PLATFORMS
There are many steps in the big data process where software tools are used to examine data, prepare the data for analysis, perform data analysis, and analyze the results of the analysis. The operating systems and database systems on which these software tools run, along with various utilities, comprise the platform for data analysis. The database platform includes relational database systems and NoSQL/Hadoop. The selection of the
most suitable database platform depends on several factors, such as cost, the size of the datasets to analyze, the types of analysis to perform (i.e.,
SQL queries, OLAP queries, data mining), the number and cost of the servers needed for data storage, availability and cost of data analysis tools,
technical support needs, etc. For enterprises that have already made substantial investment in relational database systems, but need the
performance and scalability that Hadoop can deliver, a good way may be to use both platforms, and import and export data between them. (We
will discuss this further later in this section.)
NoSQL “database systems” include such proprietary systems as Google’s BigTable and Megastore, Amazon’s Dynamo and SimpleDB; and such
open-source systems as HBase, Cassandra, CouchDB, MongoDB, etc. (https://en.wikipedia.org/wiki/NoSQL) They focus on meeting the high performance, scalability, and availability requirements of the huge interactive Web services of major Web companies. They support unnormalized tables where the rows of a table may have a variable number of columns, and a column may store a list of values. They do not support joins of
tables. Most of them support eventual consistency, a weaker form of data consistency guarantee than the ACID (atomicity, consistency, isolation
and durability) consistency guarantee that relational database systems provide. They also shift much of the data management burden to
application developers. It is not clear if the open-source NoSQL systems are ready for mass adoption.
HBase runs on HDFS (Hadoop distributed file system) and stores data in HDFS. HBase and HDFS, along with the Hadoop kernel, the Map-
Reduce parallel-processing engine, and a number of tools, such as Pig, Hive, Oozie, etc. constitute the Apache Hadoop open-source framework
(http://en.wikipedia.org/wiki/Apache_Hadoop). Hive is a data warehouse tool that provides data summarization, query and analysis. Oozie is a
workflow scheduler that manages Hadoop jobs. Map-Reduce makes it possible for the user to divide an application into many small fragments,
each of which may be executed on a server in a cluster of servers. Map-Reduce and HDFS are designed to handle server failures and
communications among the servers. Hadoop thus enables applications to be processed on a large cluster of servers and a large amount of data.
The Map-Reduce application requires low-level programming. Pig is a tool that allows map-reduce application developers to specify their logic at
a higher level (http://en.wikipedia.org/wiki/Pig_(programming_tool)).
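The following single-machine sketch imitates the map, shuffle, and reduce phases on a word-count example; it is only a conceptual illustration of the programming model, not the Hadoop API itself, and the input documents are hypothetical.

```python
from collections import defaultdict
from itertools import chain

documents = ["big data needs hadoop", "hadoop stores big data", "pig eases map reduce"]

# Map phase: each input record is turned into (key, value) pairs independently,
# so different records could be processed on different servers.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = list(chain.from_iterable(map_phase(d) for d in documents))

# Shuffle phase: group all values that share a key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each group's values into the final result per key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
```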
HBase in particular, despite its shortcomings and technical support issues, can gain significant advantage when coupled with the runtime capabilities of Hadoop. One good way to have the best of the two platforms would be to extract part of a relational database and load it onto
HBase to take advantage of the parallel processing capabilities of Hadoop. In fact, Apache Sqoop allows the transfer of data between relational
database and Hadoop (http://en.wikipedia.org/wiki/Sqoop). It supports incremental loading of a single table, and import of data from a
relational database to tables in Hive or HBase. It also supports exports of data from Hadoop to a relational database. We note that there are
serious data model mismatches between Hadoop and relational database systems, and so transfer of data between them must be done carefully.
As we noted earlier, NoSQL databases use unnormalized tables, where a table does not have a fixed uniform schema. Relational databases use
normalized tables, where a table has a fixed uniform schema.
CONCLUSION
We provided a brief holistic view of big data, including both the technologies and the process required for successful extraction of actionable
intelligence from a huge amount of raw data. Big data requires many conventional software technologies, such as relational database systems and
OLAP systems, query tools, decision support systems, data mining tools, search engines, etc., not just the NoSQL and Hadoop platforms, which are the subject of much recent discussion. Further, big data includes many steps for preliminary analyses, cleansing, transformation and encoding
of the datasets to be analyzed. The steps in the big data process require substantial expert human efforts, since the software tools used do not
understand the meanings of the data that they process.
Although there are already numerous applications of big data, and numerous software tools, much still needs to be done. The data mining tools need to be improved so that they will require much less expert human effort. Further improvements in performance and scalability will be
important. Technology for analyzing multimedia data still has a lot of room for improvement.
This work was previously published in the International Journal of Data Warehousing and Mining (IJDWM), 10(3); edited by David Taniar,
pages 5969, copyright year 2014 by IGI Publishing (an imprint of IGI Global).
ACKNOWLEDGMENT
This research was supported by Gachon University Research Fund (GCU-2012-R115), and by Basic Science Research Program through the
National Research Foundation of Korea(NRF) funded by the Ministry of Education, Science and Technology(2012R1A1A1015241,
2013R1A1A3A04008339).
REFERENCES
Agrawal R. (1993) Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International
Conference on Management of Data (pp. 207-216).
Agrawal, R. (2005). Automatic subspace clustering of high dimensional data . Data Mining and Knowledge Discovery , 11(1), 94–105.
Brin S. (1997) Beyond market basket: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International
Conference on Management of Data (pp. 265-276).
Cardie, C., & Howe, N . (1997). Improving minority class prediction using case-specific feature weights. In Proceedings of the Fourteenth
International Conference on Machine Learning (pp. 57-65).
Chandola, V. (2009). Outlier detection: A survey . ACM Computing Surveys , 41(3). Retrieved from http://www.bradblock.com.s3-website-us-
west1.amazonaws.com/Outlier_Detection_A_Survey.pdf
Chen, M. S. (1996). Data mining: An overview from a database perspective . IEEE Transactions on Knowledge and Data Engineering , 8(6), 866–
883. doi:10.1109/69.553155
Desikan, P., et al. (2002). Hyperlink analysis: Techniques and applications. Army High Performance Computing Center Technical Report.
Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.6190&rep=rep1&type=pdf
Ester M. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second
International Conference on Knowledge Discovery and Data Mining.
Han J. (2000). Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on
Management of Data (pp. 1-12).
Kim, W., & Seo, J. (1992). On classifying schematic and data heterogeneity in multidatabase systems. IEEE Computer.
Langley, P., & Sage, S . (1993). Induction of selective Bayesian classifier. In Proceedings of the Tenth Conference on Uncertainty in Artificial
Intelligence (pp. 399-406).
Ratanamahatana, C., & Gunopulos, D. (2003). Scaling up the naive Bayesian classifier: Using decision trees for feature selection .Applied
Artificial Intelligence , 17(5-6), 475–487. doi:10.1080/713827175
Wettschereck, D., Aha, D. W., & Mohri, T. (1997). A review and empirical evaluation of feature weighting methods for a class of lazy learning
algorithms . Artificial Intelligence Review , 11(1/5), 273–314. doi:10.1023/A:1006593614256
CHAPTER 5
A Review of RDF Storage in NoSQL Databases
Zongmin Ma
Nanjing University of Aeronautics and Astronautics, China
Li Yan
Nanjing University of Aeronautics and Astronautics, China
ABSTRACT
The Resource Description Framework (RDF) is a model for representing information resources on the Web. With the widespread acceptance of
RDF as the de-facto standard recommended by W3C (World Wide Web Consortium) for the representation and exchange of information on the
Web, a huge amount of RDF data is being proliferated and becoming available. So RDF data management is of increasing importance, and has
attracted attention in the database community as well as the Semantic Web community. Currently much work has been devoted to proposing different solutions for storing large-scale RDF data efficiently. In order to manage massive RDF data, NoSQL ("not only SQL") databases have been used for scalable RDF data storage. This chapter focuses on using various NoSQL databases to store massive RDF data. An up-to-date overview of the current state of the art in RDF data storage in NoSQL databases is provided. The chapter also offers suggestions for future research.
INTRODUCTION
The Resource Description Framework (RDF) is a framework for representing information resources on the Web, which is proposed by W3C
(World Wide Web Consortium) as a recommendation (Manola and Miller, 2004). RDF can represent structured and unstructured data (Duan,
Kementsietsidis, Srinivas and Udrea, 2011), and, more importantly, metadata of Web resources represented in RDF can be shared and exchanged among applications without loss of semantics. Here metadata means data that specifies semantic information about data. Currently RDF has been widely accepted and has rapidly gained popularity, and many organizations, companies and enterprises have started using RDF for representing and processing their data. Application examples include the United States government (Data.gov), the United Kingdom, the New York Times, the BBC, and Best Buy (Chief Martec, 2009). RDF is finding
increasing use in a wide range of Web data-management scenarios.
With the widespread usage of RDF in diverse application domains, a huge amount of RDF data is being proliferated and becoming available. As a
result, efficient and scalable management of large-scale RDF data is of increasing importance, and has attracted attention in the database community as well as the Semantic Web community. Currently, much work is being done in RDF data management. Some RDF data-management systems have started to emerge, such as Sesame (Broekstra, Kampman and van Harmelen, 2002), Jena TDB (Wilkinson, Sayers, Kuno and Reynolds, 2003), Virtuoso (Erling and Mikhailov, 2007 & 2009), 4Store (Harris, Lamb and Shadbolt, 2009), BigOWLIM (Bishop et al., 2011) and Oracle Spatial and Graph with Oracle Database 12c (Oracle). BigOWLIM has since been renamed to OWLIM-SE and further to GraphDB. Some research prototypes have also been developed (e.g., RDF-3X (Neumann and Weikum, 2008 & 2010), SW-Store (Abadi, Marcus, Madden and Hollenbach, 2007 & 2009) and RDFox (CS Ox)).
RDF data management mainly involves scalable storage and efficient querying of RDF data: RDF data storage provides the infrastructure for RDF data management, and efficient querying of RDF data is enabled on top of that storage. In addition, to serve a given query more effectively, it is necessary to index RDF data; indexing is likewise built on top of RDF storage. Currently many efforts have been made
to propose different solutions to store large-scale RDF data efficiently. Traditionally relational databases are applied to store RDF data and
various storage structures based on relational databases have been developed. Based on the relational perspective, Sakr and Al-Naymat (2009)
present an overview of relational techniques for storing and querying RDF data. It should be noted that the relational RDF stores are a kind of
centralized RDF stores, which are a single-machine solution with limited scalability. The scalability of RDF data stores is essential for massive
RDF data management. NoSQL (for “not only SQL”) databases have recently emerged as a commonly used infrastructure for handling Big Data
because of their high scalability and efficiency. Identifying that massive RDF data management merits the use of NoSQL databases, currently
NoSQL databases are increasingly used in massive RDF data management (Cudre-Mauroux et al., 2013).
This chapter provides an up-to-date overview of the current state of the art in storing massive RDF data in NoSQL databases. We present the survey from three main perspectives: key-value stores of RDF data in NoSQL databases, document stores of RDF data in NoSQL databases, and RDF data stores in graph databases. Note that, due to the large number of RDF data-management solutions, this chapter does not include all of them. In addition to providing a generic overview of the approaches that have been proposed to store RDF data in NoSQL databases, this chapter presents some suggestions for future research in the area of massive RDF data management with NoSQL databases.
The rest of this chapter is organized as follows. The second section presents preliminaries of RDF data model. It also introduces the main
approaches for storing RDF data. The third section introduces NoSQL databases and their database models. The fourth section provides the
details of the different techniques in several NoSQL-based RDF data stores. The final section concludes the chapter and provides some
suggestions for possible research directions on the subject.
RDF MODEL AND RDF DATA STORAGE
Being a W3C recommendation, RDF (Resource Description Framework) provides a means to represent and exchange semantic metadata. With
RDF, metadata about information sources are represented and processed. Furthermore, RDF defines a model for describing relationships among
resources in terms of uniquely identified attributes and values.
The RDF data model is applied to model resources. A resource is anything which has a universal resource identifier (URI) and is described using
a set of RDF statements in the form of (subject,predicate, object) triples. Here subject is the resource being described, predicate is the property
being described with respect to the resource, and object is the value for the property.
Formally, an RDF triple is defined as a 3-tuple ⟨s, p, o⟩, in which s is called the subject, p the predicate (or property), and o the object. A triple ⟨s, p, o⟩ means that the subject s has the property p whose value is the object o. The abstract syntax of the RDF data model is a set of triples. Among these triples, it is possible that an object in one triple (e.g., o_i in ⟨s_i, p_i, o_i⟩) can be a subject in another triple (e.g., ⟨o_i, p_j, o_j⟩). The RDF data model is thus a directed, labelled graph model for representing Web resources (Huang, Abadi and Ren, 2011). A key concept for RDF is that of URIs (Uniform Resource Identifiers), which can appear in any of the s, p and o positions to uniquely refer to an entity, relationship or concept (Kaoudi and Manolescu, 2015). In addition, literals are allowed in the o position. Let I, B, and L denote infinite sets of IRIs, blank nodes, and literals, respectively. Then an RDF triple ⟨s, p, o⟩ satisfies (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L).
Concerning the syntaxes for RDF, we have RDF/XML, N-Triple and Turtle. Among them, N-Triple is the most basic one and it contains one triple
per line.
In (Bornea et al., 2013), a sample of DBpedia RDF data is presented and it contains 21 tuples and 13 predicates. Let us look at a fragment of the
sample, which contains 6 triples {(Google, industry, Software), (Google, industry, Internet), (Google, employees, 54,604), (Android, developer,
Google), (Android, version, 4.1), (Android, kernel, Linux)} and 5 predicates {industry, employees, developer, version, kernel}. For a triple, say
(Google, industry, Internet), its resource is Google, this resource has the property industry, and the value for the property is Internet. With the
set of triples, we have an RDF graph shown in Figure 1.
Figure 1. An RDF graph
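The six triples of this fragment can also be written down programmatically. The sketch below uses the rdflib Python library (an assumption, not something the chapter prescribes) with a hypothetical namespace, and serializes the graph in N-Triples, one triple per line.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")   # hypothetical namespace for the example
g = Graph()

# The six triples of the DBpedia fragment discussed above.
g.add((EX.Google, EX.industry, EX.Software))
g.add((EX.Google, EX.industry, EX.Internet))
g.add((EX.Google, EX.employees, Literal("54,604")))
g.add((EX.Android, EX.developer, EX.Google))
g.add((EX.Android, EX.version, Literal("4.1")))
g.add((EX.Android, EX.kernel, EX.Linux))

# N-Triples syntax: one triple per line.
print(g.serialize(format="nt"))
```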
As we know, in the context of common data management, data were stored first in file systems and later in databases (such as relational databases and object-oriented databases). Similarly, RDF data management relies on RDF data storage. This is especially true for managing large-scale
RDF data in the real-world applications. Nowadays many approaches of RDF data storages have been developed. Given the large number of RDF
data-management solutions, there is a richness of perspectives and approaches to RDF data storages. But few classifications of RDF data-storage
approaches have been reported in the literature. Basically proposals for RDF data storage are classified into two major categories in (Sakr and Al-
Naymat, 2009; Bornea et al., 2013), which are native stores and relational stores, respectively. Here the native stores use customized binary RDF
data representation and the relational stores distribute RDF data to appropriate relational databases. Viewed from three basic perspectives of the
relational, entity and graph-based perspectives, proposals for RDF data storage are classified into three major categories in (Luo et al., 2012),
which are relational stores, entity stores and graph-based stores, respectively.
RDF data stores can be classified relying on the underlying infrastructure. First, for the native stores, we can identify the main-memory-based RDF native store and the disk-based RDF native store. The major difference between these two native stores is that the former works on RDF data stored completely in main memory and the latter works on RDF data stored on disk. The disk-based RDF native store is built directly on the file system.
It is not difficult to see that the native stores can only deal with small-scale RDF data. At this point, traditional relational databases are hereby
applied to store RDF data. In the relational stores of RDF data, different relational schemas can be designed, depending on how to distribute RDF
triples to an appropriate relational schema. This has resulted in three major categories of RDF relational stores (Sakr and Al-Naymat, 2009; Bornea et al., 2013; Luo et al., 2012), which are triple stores, vertical stores and property (type) stores. In the triple stores, all RDF triples are
directly stored in a single relational table over relational schema (subject, predicate, object), and each RDF triple becomes a tuple of the
relational database. In the vertical stores, a subject-object relation is directly represented for each predicate of RDF triples and a relational table
contains only one predicate as a column name. As a result, we have a set of binary relational tables over relational schema (subject, object), each
relational table corresponds to a different predicate, and this predicate can be the name of the corresponding relational table. In the type stores,
one relational table is created for each RDF data type, in which an RDF data type generally corresponds to several predicates, and a relational
table contains the properties as n-ary table columns for the same subject.
The reason why there are three major approaches for storing RDF data in relational databases is that each approach has its advantages and
disadvantages simultaneously. First, the triple stores use a fixed relational schema and can hereby handle dynamic schema of RDF data, but the
triple stores involve a number of self-join operations for querying. Second, the vertical stores using a set of relational tables generally involve
many table join operations for querying and cannot handle dynamic schema of RDF data because new predicates of new inserted triples result in
new relational tables. Finally the type stores involve fewer table join operations than the vertical stores using multiple relational tables and,
compared with the triple stores using a single relational table, no self-join operations, but the type stores generally contain null values and multi-
valued attributes, and cannot handle dynamic schema of RDF data because new predicates of new inserted triples result in changes to relational
schema.
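To make the triple-store layout and its self-joins concrete, the sketch below stores the earlier six-triple fragment in a single (subject, predicate, object) table in SQLite and answers a query that must join the table with itself. The table and query are illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
con.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("Google", "industry", "Software"), ("Google", "industry", "Internet"),
    ("Google", "employees", "54,604"), ("Android", "developer", "Google"),
    ("Android", "version", "4.1"),     ("Android", "kernel", "Linux"),
])

# "Which products developed by Google have which versions?" needs a self-join:
# one copy of the table supplies the developer triples, the other the version triples.
query = """
    SELECT t1.subject, t2.object AS version
    FROM triples AS t1
    JOIN triples AS t2 ON t1.subject = t2.subject
    WHERE t1.predicate = 'developer' AND t1.object = 'Google'
      AND t2.predicate = 'version'
"""
print(con.execute(query).fetchall())   # [('Android', '4.1')]
```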
Several representative relational RDF data stores and their major features are summarized in Table 1.
Table 1. Representative relational RDF data stores and their major features
It can be seen from Table 1 that none of the approaches for relational RDF data stores presented above handles RDF data storage well on its own. So some efforts have been made to store RDF data by using two or more of the three major store approaches together, or by revising the three major store approaches (Kim, 2006; Sperka and Smrz, 2012; Bornea et al., 2013). Such an approach is called a hybrid store. It should be noted that hybrid stores still cannot satisfy the need of managing large-scale RDF data.
The native RDF stores and the relational RDF stores (including the triple stores, vertical stores, type stores and hybrid stores) are actually
categorized as centralized RDF stores. Centralized RDF stores are single-machine solutions with limited scalability. To process large-scale RDF
data, recent research has devoted considerable effort to the study of managing massive RDF data in distributed environments. Distributed RDF
stores can hash partition triples across multiple machines and parallelize query processing. NoSQL databases have recently emerged as a
commonly used infrastructure for handling Big Data (Pokorny, 2011). Massive RDF data management merits the use of NoSQL databases and
currently NoSQL databases are increasingly used in RDF data management (Cudre-Mauroux et al., 2013). Typically, NoSQL database stores of
RDF data are distributed RDF stores. Depending on the concrete data models adopted, the NoSQL database stores of RDF data are categorized into key-value stores, column-family stores, document stores and graph database stores.
In summary, we can classify current RDF data stores into centralized RDF stores and NoSQL database stores of RDF data (Papailiou et al., 2013). The centralized RDF stores can be further classified into native RDF stores and relational RDF stores, in which the native RDF stores contain the main-memory-based native RDF store and the disk-based native RDF store, and the relational RDF stores include triple stores, vertical stores and type stores. The NoSQL database stores of RDF data are further classified into key-value stores, column-family stores, document stores and graph database stores. Figure 2 illustrates this classification of RDF data stores.
The native RDF stores and the relational RDF stores have been reviewed in (Sakr and Al-Naymat, 2009). The focus of this chapter is to
investigate NoSQL database stores of RDF data. Before that, we first sketch NoSQL databases in the following section.
NoSQL Databases for Big Data
Big Data is a term used to refer to massive and complex datasets made up of a variety of data structures. Big Data can be found in various
application domains such as web clicks, social media, scientific experiments, and datacenter monitoring. Actually there is not a common
definition of Big Data so far (Stuart and Barker, 2013). But Big Data are generally characterized by three basic Vs: Volume, Variety and Velocity (Laney, 2001).
• Volume: Volume means that Big Data have big data scale in the range of TB to PB and even more.
• Variety: Variety means Big Data have rich data types with many formats such as structured data, unstructured data, semi-structured data,
text, multimedia, and so on.
• Velocity: Velocity means that Big Data must be processed speedily. Also Velocity means that Big Data are being produced speedily.
Then Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making. In addition to the three Vs above, a V associated with Veracity has been introduced to Big
Data in (Snow, 2012) and a V associated with Value has been introduced to Big Data in (Gamble and Goble, 2011).
• Veracity: Veracity means that inherent imprecision and uncertainty in data should be explicitly considered so that the reliability and
predictability of imprecise and uncertain Big Data can be managed.
• Value: Value means that Big Data must be worthwhile and has value for business.
The Veracity of data is a basis of Big Data processing because the data with volume and variety may contain errors, noises or imperfection.
Actually the Veracity of data is a precondition and guarantee of Big Data management, which can increase the robustness and accuracy of Big
Data processing. Regarding the Value of data, Value sets a basic criterion in the choice and processing of Big Data, which is especially true in
the context of the data with volume, variety and velocity.
There are also several other Vs which are applied to describe the properties of Big Data in the literature (e.g., Visualization and Volatility). Among them, Volatility means that the Big Data we are interested in are temporally valid. It is possible that, at some point, specific data are no longer relevant to the current processing and analysis of Big Data. It should be noted that the several Vs mentioned above characterize the properties of Big Data only partially, so some other characteristics beyond these Vs have been assigned to Big Data.
Being similar with common data (not Big Data) management, Big Data management needs database systems. As we know, the relational
databases are very powerful and have been widely applied for structured, semi-structured and even unstructured data. However the relational
databases are unable to manage Big Data. The reason is that relational databases must meet the ACID guarantees (atomicity, consistency, isolation and durability) of classical transaction processing done by a relational database management system (RDBMS), whereas Big Data management has to work within the CAP theorem, which concerns the following properties of a distributed data store:
• (C)onsistency: whenever data is written, everyone who reads the DB will see the latest version.
• (A)vailability: the DB can always be read from and written to.
• (P)artition tolerance: the database can still be read from/written to when parts of it are offline; afterwards, when offline nodes are back online, they are updated accordingly.
It is clear that the relational databases are not solutions in Big Data management. A new type of databases called NoSQL is hereby proposed,
which means “not only SQL” or “no SQL at all”.
NoSQL database systems have emerged as a commonly used infrastructure for handling Big Data. Compared to traditional relational databases, NoSQL solutions provide simpler scalability and improved performance (Hecht and Jablonski, 2011; Pokorny, 2011; Moniruzzaman and Hossain, 2013; Gudivada, Rao and Raghavan, 2014) and generally have the following characteristics (Tauro, Aravindh and Shreeharsha, 2012).
• Distributed processing
• High availability
• Replication support
• Improvements in performance
It should be noted that NoSQL databases are very diverse: more than one hundred NoSQL systems exist. According to their data models, NoSQL
databases are classified into four major categories as follows (Hecht and Jablonski, 2011; Grolinger et al., 2013; Bach and Werner, 2014).
• Key-value stores
• Column-family stores
• Document stores
• Graph databases
Key-value stores have a simple data model based on a set of (key, value) pairs, in which a value is addressed by a single key. A value may be a string,
a pointer to where the value is actually stored, or even a collection of (name, value) pairs, as in Redis. Note that values are isolated and independent
of one another; any relationships between them must be handled by the application logic.
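As a minimal sketch (not tied to any particular product; the keys and values below are hypothetical), a key-value store behaves much like a persistent dictionary:

# Sketch of the key-value data model: a value is addressed only by its
# key, and any relationships between values are left to the application.
store = {}

# A value may be a plain string ...
store["user:1001:name"] = "Alice"

# ... or a collection of (name, value) pairs, as in a Redis hash.
store["user:1001:profile"] = {"city": "Oslo", "signup": "2015-07-01"}

# Retrieval is always by the single key; there are no joins.
print(store["user:1001:profile"]["city"])   # -> Oslo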
Most column-family stores are derived from Google BigTable (Chang et al., 2008), in which the data are stored in a column-oriented way. In
BigTable, the dataset consists of several rows. Each row is addressed by a primary key and is composed of a set of column families. Note that
different rows can have different column families. Representative column-family stores include Apache HBase, which directly implements the
Google BigTable concepts. In addition, Cassandra (Lakshman and Malik, 2010) provides the additional functionality of super-columns, in which
a column contains nested (sub)columns and super-columns are formed by grouping various columns together. According to Grolinger et al. (2013),
some systems, such as Amazon SimpleDB and DynamoDB (DeCandia et al., 2007), store in each row only a set of column name-value pairs without
column families; SimpleDB and DynamoDB are therefore generally categorized as key-value stores as well.
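To make the row and column-family structure concrete, the following sketch emulates a BigTable-style table with nested Python dictionaries; the table, family and column names are hypothetical.

# Sketch of a column-family data model: each row is addressed by a row
# key, columns are grouped into column families, and different rows may
# have different column families.
htable = {
    "row-001": {                                                  # row key
        "info": {"title": "Big Data Overview", "year": "2016"},   # family "info"
        "stats": {"views": "1024"},                               # family "stats"
    },
    "row-002": {
        "info": {"title": "RDF Stores"},       # this row has no "stats" family
    },
}

print(htable["row-001"]["info"]["title"])      # -> Big Data Overview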
Document stores provide another derivative of the key-value data model that uses keys to locate documents inside the data store. Most document
stores represent documents using JSON (JavaScript Object Notation) or a format derived from it, such as BSON (Binary JSON). JSON is a
lightweight, text-based and typed data format which supports lists, maps, dates, Booleans and numbers of different precisions. Typically, CouchDB
and the Couchbase server use the JSON format for data storage, whereas MongoDB stores data in BSON.
Graph databases, a special category of NoSQL databases, use graphs as their data model. A graph is used to represent a set of objects, known as
vertices or nodes, and the links (or edges) that interconnect these vertices. Representative graph databases include GraphDB (Güting, 1994) and
Neo4j. Neo4j is an open-source, highly scalable and robust native graph database. Note that graph databases were originally developed for
managing data with complex structures and relationships, such as recursively or network-structured data, rather than for Big Data management.
Graph databases generally run on a single server (e.g., GraphDB and Neo4j), and only recently have a few efforts been made to develop distributed
graph databases (Nicoara, Kamali, Daudjee and Chen, 2015). Titan (thinkaurelius), for example, is a scalable graph database optimized for storing
and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
Illustrative representations of the four kinds of NoSQL data models, as presented by Grolinger et al. (2013), are shown in Figure 3.
Figure 3. Different types of NoSQL data model
Following the four major categories of NoSQL models, some NoSQL databases have been developed and applied. Table 2 lists several
representative NoSQL databases.
Table 2. Representative NoSQL databases
(Table columns: Representative NoSQL Databases; Remarks)
Among the NoSQL data models presented above, column-family stores and document stores can both be regarded as extended key-value stores: in a
document store the values are the documents themselves, while in a column-family store a key is effectively the combination of row ID, column
name and timestamp. For this reason, column-family stores such as HBase and Cassandra are sometimes also referred to as key-value stores. The
pure key-value store model, however, is too simple for many application domains, such as RDF data management.
NoSQL databases are designed for storing and processing Big Data datasets. In the following, we investigate how massive RDF data management
merits the use of NoSQL databases.
RDF Data Storage in NoSQL Databases
NoSQL databases have emerged as a commonly used infrastructure for handling Big Data. Although they were not designed specifically for RDF
data management, massive RDF data management can benefit from their scalability and high performance. Depending on the underlying data
model, we can distinguish three kinds of RDF stores built on NoSQL databases: RDF data stored in column-family stores, RDF data stored in
document stores, and RDF data stored in graph databases.
Storing RDF Data in Column-Family Stores of NoSQL Databases
Among the available NoSQL systems, HBase, a column-family store, has been the most widely applied. Apache HBase uses HDFS (Hadoop
Distributed File System) as its storage back-end (Hua, Wu, Li and Ren, 2014) and supports the MapReduce computing framework (Dean and
Ghemawat, 2008). Similar to relational databases, HBase organizes data into tables, called HTables, in which each row is uniquely identified by a
row key, and HBase builds an index on the row keys. As with RDF triple storage in a relational database, RDF triples can be stored in HBase
HTables, and several storage structures for HTable-based RDF stores can be identified.
First, based on the Hexastore schema developed by Weiss, Karras and Bernstein (2008), a scalable HBase-based RDF store is proposed in (Sun and
Jin, 2010). RDF triples are stored in six HBase tables (S_PO, P_SO, O_SP, PS_O, SO_P and PO_S), which cover all combinations of RDF triple
patterns, and the triples are indexed using HBase's built-in index on the row key. Also based on HBase, two distributed triple stores, H2RDF and
H2RDF+, are developed in (Papailiou, Konstantinou, Tsoumakos and Koziris, 2012) and (Papailiou et al., 2013), respectively; the main difference
between them is the number of maintained indices (three versus six).
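The following Python fragment is a rough sketch of this six-table layout (it illustrates the indexing idea only, not the actual implementation of Sun and Jin); each triple is written under all six key orderings so that any triple pattern can be answered by a direct row-key lookup.

# Sketch: index one RDF triple under the six orderings S_PO, P_SO,
# O_SP, PS_O, SO_P and PO_S. Plain dictionaries stand in for HTables;
# the row-key encoding is hypothetical.
tables = {name: {} for name in ("S_PO", "P_SO", "O_SP", "PS_O", "SO_P", "PO_S")}

def insert_triple(s, p, o):
    tables["S_PO"].setdefault(s, []).append((p, o))
    tables["P_SO"].setdefault(p, []).append((s, o))
    tables["O_SP"].setdefault(o, []).append((s, p))
    tables["PS_O"].setdefault((p, s), []).append(o)
    tables["SO_P"].setdefault((s, o), []).append(p)
    tables["PO_S"].setdefault((p, o), []).append(s)

insert_triple("ex:JohnLennon", "ex:bornIn", "ex:Liverpool")

# The pattern (?s, ex:bornIn, ex:Liverpool) becomes a row-key lookup in PO_S.
print(tables["PO_S"][("ex:bornIn", "ex:Liverpool")])   # -> ['ex:JohnLennon']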
In addition, HBase is sometimes combined with other systems to manage distributed RDF data. HBase and MySQL Cluster, for example, are used
together in (Franke et al., 2011). Combining the Jena framework with the storage provided by HBase, Khadilkar et al. (2012) developed several
versions of a triple store. Figure 4 presents an overview of the architecture typically used by Jena-HBase (Khadilkar et al., 2012).
Figure 4. Architectural overview of Jena-HBase
Storing RDF Data in Document Stores of NoSQL Databases
Document stores of NoSQL databases resemble relational databases in which each row is associated with a document instead of a tuple; such a
structure is clearly suitable for storing semi-structured data. As mentioned above, documents in the document stores of NoSQL databases are
represented using JSON (JavaScript Object Notation) or BSON (Binary JSON). As a text-based data exchange format, JSON can be easily edited by
humans and processed by machines, which makes it convenient to share and exchange among different systems.
In order to store massive RDF data in document stores of NoSQL databases, the RDF data must be represented in JSON format. RDF triples contain
three components (subject, predicate and object), whereas a JSON structure contains only two (keys and values), so the two models do not align
directly and a mapping from RDF to JSON must be established. Basically, we can identify the following basic mappings from RDF to JSON.
1. Triple-centered mapping
2. Subject-centered mapping
3. JSON-LD approach
In the triple-centered mapping, all triples of an RDF graph are stored in a single rooted document object whose value is an array, with each array
element corresponding to one RDF triple. In the subject-centered mapping, each subject of the RDF data is treated as the "key" of a JSON object,
and the corresponding "value" is a set of embedded objects whose keys are the predicates of that subject (Alexander, 2008). JSON-LD (Github),
developed by the W3C as a recommendation, is a data serialization and messaging format for Linked Data. It is primarily intended as a way to use
Linked Data in Web-based programming environments, to build interoperable Web services, and to store Linked Data in JSON-based storage
engines. With JSON-LD, Linked Data represented in RDF can easily be stored in JSON-based document stores of NoSQL databases such as
CouchDB. A JSON-LD document is an instance of the RDF data model; with a modest extension of the RDF data model, JSON-LD can serialize a
generic RDF data set, and the reverse serialization is also possible. The JSON-LD API specifications support serialization in both directions
between RDF and JSON-LD.
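Purely for illustration (the property names and layout are hypothetical and not prescribed by any standard), the single triple (ex:JohnLennon, ex:spouse, ex:CynthiaLennon) could be serialized under the first two mappings roughly as follows.

Triple-centered mapping:
{ "triples": [
    { "subject": "ex:JohnLennon", "predicate": "ex:spouse", "object": "ex:CynthiaLennon" }
] }

Subject-centered mapping:
{ "ex:JohnLennon": { "ex:spouse": "ex:CynthiaLennon" } }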
A simple example of JSON-LD from http://json-ld.org/ is shown in Figure 5. This fragment contains information about a person: his name is John
Lennon, he was born on 1940-10-09, and his spouse is described at http://dbpedia.org/resource/Cynthia_Lennon. The semantics of terms like
"name", "born" and "spouse" are defined with IRIs (Internationalized Resource Identifiers). In addition, @context is used to define the short-hand
names used throughout a JSON-LD document, and @id is used to uniquely identify the things being described in the document, using IRIs or blank
node identifiers.
Figure 5. A simple example of JSON-LD
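The figure itself is not reproduced here; based on the description above, the fragment from json-ld.org is essentially of the following form (the exact @context URL is an assumption).

{
  "@context": "http://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}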
Storing RDF Data in Graph Databases
Compared to the relational model, which uses flat relational tables, graph databases adopt a graph model with vertices, edges and properties, and
are therefore well suited to handling data with a network structure (Angles and Gutierrez, 2008). Many techniques and algorithms from Graph
Theory (GT) can be applied directly in graph databases. Neo4j is a popular graph database whose data model is a graph: it stores nodes and the
relationships that connect them, and supports user-defined properties on both constructs. Neo4j can traverse vertices and edges at essentially
constant speed, unaffected by the total amount of data constituting the graph, and can therefore support scalable storage of large graphs.
The RDF data model can be regarded as a special kind of graph model, and storing RDF data in graph databases is advocated in (Angles and
Gutiérrez, 2005); storing massive RDF data in Neo4j is therefore a natural choice. The standard graph-database model, however, differs from the
triple-based RDF model, and two basic approaches can be adopted to manage massive RDF data with Neo4j. First, Neo4j can be extended with an
interface for processing RDF data while the RDF data themselves are stored in native or relational stores. Second, massive RDF data can be stored
directly in Neo4j. Following the second approach, the Dbpedia4neo project (W3) stores DBpedia data in Neo4j and then investigates SPARQL
(Simple Protocol and RDF Query Language) querying and several graph algorithms.
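A minimal sketch of the second (direct) approach, independent of any particular Neo4j API, is to map each RDF resource to a node and each triple to a labeled edge, as below; the prefixes and identifiers are hypothetical.

# Sketch: map RDF triples directly onto a property graph
# (nodes = resources or literals, labeled edges = predicates).
triples = [
    ("ex:JohnLennon", "ex:bornIn", "ex:Liverpool"),
    ("ex:JohnLennon", "ex:spouse", "ex:CynthiaLennon"),
]

nodes = set()
edges = []                       # (source node, edge label, target node)
for s, p, o in triples:
    nodes.update([s, o])
    edges.append((s, p, o))

# A graph traversal then plays the role of a SPARQL triple-pattern lookup.
print([t for (s, p, t) in edges if s == "ex:JohnLennon" and p == "ex:spouse"])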
As we know, SPARQL is a vendor-independent standard query language for the RDF triple data model, developed by the W3C (World Wide Web
Consortium) (Prud'hommeaux and Seaborne, 2008). It should be noted that, although graph databases can model RDF data very naturally, some
primitive and useful graph-database querying operations are not supported by the current SPARQL query language. In this context, SPARQL should
be extended to incorporate graph database query language primitives in order to manipulate RDF data fully (Angles and Gutiérrez, 2005).
Summary and Future Work
With the increasing amount of RDF data becoming available, efficient and scalable management of massive RDF data is of increasing importance.
NoSQL databases are designed for storing and processing Big Data, and massive RDF data management merits the use of Big Data infrastructure
because of the scalability and high performance of cloud data management. In this chapter, we provide an up-to-date overview of the current state
of the art in storing massive RDF data in NoSQL databases. RDF data management is a very active area of research with a large number of ongoing
efforts. The chapter presents the survey from three main perspectives: RDF data in key-value (column-family) stores, RDF data in document stores,
and RDF data in graph databases. Note that this chapter concentrates only on massive RDF data stores in NoSQL databases and does not discuss
issues of indexing and querying massive RDF data in NoSQL databases.
RDF data management typically involves the scalable storage and efficient querying of RDF data. In addition, to better serve a given query, it is
necessary to index RDF data, which is especially true for massive RDF data. RDF data stores in NoSQL databases provide an infrastructure for
massive RDF data management, and some current efforts concentrate on massive RDF data querying based on cloud computing (e.g., Garcia and
Wang, 2013; Husain et al., 2010, 2011; Kim, Ravindra and Anyanwu, 2013; Li et al., 2013). It should be noted that RDF data management based on
NoSQL databases has only recently been gaining momentum, and research in this direction is in its infancy. There are many research challenges
and many interesting research opportunities for both the data management community and the Semantic Web community. Here we emphasize
several major directions for future research.
• First, following the success of NoSQL approaches for Big Data outside the RDF space in cloud environment, a major direction for research
is the study and development of richer structural indexing techniques and related query processing strategies for RDF data in NoSQL
databases.
• Second, in addition to the SELECT type of query in SPARQL (Simple Protocol and RDF Query Language), a standard RDF query language, it
should be investigated how well NoSQL databases can support the other types of SPARQL queries, such as CONSTRUCT, ASK and DESCRIBE.
• Finally, for massive RDF data management in the context of the diversity of RDF application domains (e.g., computational biology (Anguita et
al., 2013) and geological information systems (Garbis et al., 2013)), novel RDF data store structures and querying strategies need to be
developed on the basis of NoSQL databases.
This work was previously published in Managing Big Data in Cloud Computing Environments edited by Zongmin Ma, pages 210-229,
copyright year 2016 by Information Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
REFERENCES
Abadi, D. J., Marcus, A., Madden, S., & Hollenbach, K. (2007). Scalable semantic Web data management using vertical partitioning. Proceedings of
the 33rd International Conference on Very Large Data Bases (pp. 411-422).
Abadi, D. J., Marcus, A., Madden, S., & Hollenbach, K. (2009). SW-Store: A vertically partitioned DBMS for Semantic Web data management
. The VLDB Journal , 18(2), 385–406. doi:10.1007/s00778-008-0125-y
Alexander, K. (2008). RDF in JSON: A specification for serialising RDF in JSON. Retrieved from http://www.semanticscripting.org/SFSW2008
Angles, R., & Gutiérrez, C. (2005). Querying RDF data from a graph database perspective. Proceedings of the 2005 European Semantic Web
Conference (pp. 346-360). 10.1007/11431053_24
Angles, R., & Gutierrez, C. (2008). Survey of graph database models. ACM Computing Surveys, 40: 1:1-1:39.
Anguita, A., Martin, L., Garcia-Remesal, M., & Maojo, V. (2013). RDFBuilder: A tool to automatically build RDF-based interfaces for MAGE-OM
microarray data sources. Computer Methods and Programs in Biomedicine, 111(1), 220-227.
Bach, M., & Werner, A. (2014). Standardization of NoSQL database languages: Beyond databases, architectures, and structure. Proceedings 10th
International Conference (BDAS 2014). Ustron, Poland: Springer. 10.1007/978-3-319-06932-6_6
Bishop, B., Kiryakov, A., Ognyanoff, D., Peikov, I., Tashev, Z., & Velkov, R. (2011). Owlim: A family of scalable semantic repositories . Semantic
Web , 2(1), 1–10.
Bornea M. A. Dolby J. Kementsietsidis A. Srinivas K. Dantressangle P. Udrea O. Bhattacharjee B. (2013). Building an efficient RDF store over a
relational database. Proceedings of the 2013 ACM International Conference on Management of Data (pp. 121-132).10.1145/2463676.2463718
Broekstra, J., Kampman, A., & van Harmelen, F. (2002). Sesame: A generic architecture for storing and querying RDF and RDF schema. Proceedings
of the 2002 International Semantic Web Conference (pp. 54-68). 10.1007/3-540-48005-6_7
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., & Gruber, R. E. (2008). BigTable: A
distributed storage system for structured data, ACM Transactions on Computer Systems, 26(2), 4:1-4:26.
Cudre-Mauroux, P., Enchev, I., Fundatureanu, S., Groth, P., Haque, A., Harth, A., & Wylot, M. (2013). NoSQL databases for RDF: An empirical
evaluation. Proceedings of the 12th International Semantic Web Conference (pp. 310-325). 10.1007/978-3-642-41338-4_20
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters . Communications of the ACM , 51(1), 107–113.
doi:10.1145/1327452.1327492
DeCandia G. Hastorun D. Jampani M. Kakulapati G. Lakshman A. Pilchin A. Vogels W. (2007). Dynamo: Amazon’s highly available key-value
store. Proceedings of the 21st ACM Symposium on Operating Systems Principles (pp. 205-220). 10.1145/1294261.1294281
Duan S. Kementsietsidis A. Srinivas K. Udrea O. (2011). Apples and oranges: a comparison of RDF benchmarks and real RDF datasets.
Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (pp. 145-156). 10.1145/1989323.1989340
Erling O. Mikhailov I. (2007). RDF support in the virtuoso DBMS. Proceedings of the 1st Conference on Social Semantic Web (pp. 59-68).
Erling, O., & Mikhailov, I. (2009). Virtuoso: RDF support in a native RDBMS. In De Virgilio, R. (Ed.), Semantic Web Information
Management (Ch. 21, pp. 501–519). Springer-Verlag.
Franke C. Morin S. Chebotko A. Abraham J. Brazier P. (2011). Distributed semantic web data management in HBase and MySQL Cluster.
Proceedings of the 2011 IEEE International Conference on Cloud Computing (pp. 105-112). 10.1109/CLOUD.2011.19
Gamble M. Goble C. (2011). Quality, Trust and Utility of Scientific Data on the Web: Toward a Joint model. Proceedings of the 2011 International
Conference on Web Science, Koblenz, Germany (pp. 15:1-15:8). 10.1145/2527031.2527048
Garbis G. Kyzirakos K. Koubarakis M. (2013). Geographica: A benchmark for geospatial RDF stores. Proceedings of the 12th International
Semantic Web Conference (pp. 343-359).
Garcia T. Wang T. (2013, September 16-18). Analysis of Big Data technologies and method - Query large Web public RDF datasets on Amazon
cloud using Hadoop and Open Source Parsers. Proceedings of the 2013 IEEE International Conference on Semantic Computing, Irvine, USA (pp.
244-251). 10.1109/ICSC.2013.49
Grolinger, K., Higashino, W. A., Tiwari, A., & Capretz, M. A. M. (2013). Data management in cloud environments: NoSQL and NewSQL data
stores. Journal of Cloud Computing: Advances, Systems and Applications, 2(22).
Gudivada V. N. Rao D. Raghavan V. V. (2014). NoSQL Systems for Big Data Management, 2014 IEEE World Congress on Services (pp. 190-197).
10.1109/SERVICES.2014.42
Güting, R. H. (1994, September 12-15). GraphDB: Modeling and querying graphs in databases. Proceedings of 20th International Conference on
Very Large Data Bases, Santiago de Chile, Chile (pp. 297-308).
Harris S. Lamb N. Shadbolt N. (2009). 4store: The design and implementation of a clustered RDF store. Proceedings of the 5th International
Workshop on Scalable Semantic Web Knowledge Base Systems (pp. 94-109).
Hecht, R., & Jablonski, S. (2011). NoSQL evaluation: A use case oriented survey. Proceedings of the 2011 International Conference on Cloud and
Service Computing, Hong Kong, China. IEEE. 10.1109/CSC.2011.6138544
Hua, X. Y., Wu, H., Li, Z., & Ren, S. P. (2014). Enhancing throughput of the Hadoop Distributed File System for interaction-intensive tasks
. Journal of Parallel and Distributed Computing ,74(8), 2770–2779. doi:10.1016/j.jpdc.2014.03.010
Huang, J., Abadi, D. J., & Ren, K. (2011). Scalable SPARQL querying of large RDF graphs . Proceedings of the VLDB Endowment , 4(11), 1123–
1134.
Husain M. F. Khan L. Kantarcioglu M. Thuraisingham B. M. (2010, July 5-10). Data intensive query processing for large RDF graphs using cloud
computing tools, Proceedings of the 2010 IEEE International Conference on Cloud Computing, Miami, USA (pp. 1-10). 10.1109/CLOUD.2010.36
Husain, M. F., McGlothlin, J. P., Masud, M. M., Khan, L. R., & Thuraisingham, B. M. (2011). Heuristics-based query processing for large RDF
graphs using cloud computing . IEEE Transactions on Knowledge and Data Engineering , 23(9), 1312–1327. doi:10.1109/TKDE.2011.103
Kaoudi, Z., & Manolescu, I. (2015). RDF in the clouds: A survey .The VLDB Journal , 24(1), 67–91. doi:10.1007/s00778-014-0364-z
Khadilkar V. Kantarcioglu M. Thuraisingham B. M. Castagna P. (2012). Jena-HBase: A distributed, scalable and efficient RDF triple store.
Proceedings of the 2012 International Semantic Web Conference.
Kim H. S. Ravindra P. Anyanwu K. (2013, May 13-17). Optimizing RDF(S) queries on cloud platforms, Proceedings of the 2013 International
World Wide Web Conference, Rio de Janeiro, Brazil (pp. 261-264).
Kim, S. W. (2006). Hybrid storage scheme for RDF data management in Semantic Web . Journal of Digital Information Management , 4(1), 32–
36.
Lakshman, A., & Malik, P. (2010). Cassandra: A decentralized structured storage system . ACM SIGOPS Operating System Review , 44(2), 35–40.
doi:10.1145/1773912.1773922
Laney, D. (2001). 3D data management: Controlling data volume, velocity and variety. Meta Group, Gartner. Retrieved from
http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Li, R., Yang, D., Hu, H. B., Xie, J., & Fu, L. (2013). Scalable RDF graph querying using cloud computing . Journal of Web Engineering , 12(1 & 2),
159–180.
Luo, Y., Picalausa, F., Fletcher, G. H. L., Hidders, J., & Vansummeren, S. (2012). Storing and indexing massive RDF datasets. In De Virgilio, R. (Ed.),
Semantic Search over the Web (pp. 31–60). Springer-Verlag Berlin Heidelberg. doi:10.1007/978-3-642-25008-8_2
Manola, F., & Miller, E. (2004). RDF Primer. W3C Recommendation. Retrieved from http://www.w3.org/TR/2004/REC-rdf-primer-20040210/
McBride, B. (2002). Jena: A Semantic Web toolkit . IEEE Internet Computing , 6(6), 55–59. doi:10.1109/MIC.2002.1067737
Moniruzzaman, A. B. M., & Hossain, S. A. (2013). NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics
and Comparison . International Journal of Database Theory and Application , 6(4), 1–14.
Neumann, T., & Weikum, G. (2008). RDF-3X: A RISC-style engine for RDF . Proceedings of the VLDB Endowment , 1(1), 647–659.
doi:10.14778/1453856.1453927
Neumann, T., & Weikum, G. (2010). The RDF-3X engine for scalable management of RDF data . The VLDB Journal , 19(1), 91–113.
doi:10.1007/s00778-009-0165-y
Nicoara D. Kamali S. Daudjee K. Chen L. (2015). Hermes: Dynamic partitioning for distributed social network graph databases. Proceedings of
the 18th International Conference on Extending Database Technology (pp. 25-36).
Papailiou N. Konstantinou I. Tsoumakos D. Karras P. Koziris N. (2013). H2RDF+: High-performance distributed joins over large-scale RDF
graphs. Proceedings of the 2013 IEEE International Conference on Big Data (pp. 255-263). 10.1109/BigData.2013.6691582
Papailiou N. Konstantinou I. Tsoumakos D. Koziris N. (2012). H2RDF: Adaptive query processing on RDF data in the cloud. Proceedings of the
21st World Wide Web Conference, 397-400. 10.1145/2187980.2188058
Pokorny, J. (2011). NoSQL databases: A step to database scalability in web environment. Proceedings of the 2011 International Conference on
Information Integration and Web-based Applications and Services (pp. 278-283). 10.1145/2095536.2095583
Pokorny, J. (2011). NoSQL Databases: A step to database scalability in Web environment . International Journal of Web Information
Systems , 9(1), 69–82. doi:10.1108/17440081311316398
Prud'hommeaux, E., & Seaborne, A. (2008). SPARQL Query Language for RDF, W3C Recommendation. Retrieved from
http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
Sakr, S., & Al-Naymat, G. (2009). Relational processing of RDF queries: A survey . SIGMOD Record , 38(4), 23–28.
doi:10.1145/1815948.1815953
Sintek M. Kiesel M. (2006). RDFBroker: A signature-based high-performance RDF store, Proceedings of the 3rd European Semantic Web
Conference (pp. 363-377).
Snow, D. (2012). Dwaine Snow's Thoughts on Databases and Data Management. Retrieved from
http://dsnowondb2.blogspot.cz/2012/07/adding-4th-v-to-big-data-veracity.html
Sperka, S., & Smrz, P. (2012). Towards adaptive and semantic database model for RDF data stores. Proceedings of the Sixth International Conference
on Complex, Intelligent, and Software Intensive Systems (pp. 810-815).
Stuart, J., & Barker, A. (2013). Undefined By Data: A Survey of Big Data Definitions.
Sun J. L. Jin Q. (2010). Scalable RDF store based on HBase and MapReduce. Proceedings of the 3rd International Conference Advanced
Computer Theory and Engineering (pp. V1-633-V1-636).
Tauro, C., Aravindh, S., & Shreeharsha, A. B. (2012). Comparative Study of the New Generation, Agile, Scalable, High Performance NOSQL
Databases. International Journal of Computer Applications, 48(20).
Wang, Y., Du, X. Y., Lu, J. H., & Wang, X. F. (2010). FlexTable: Using a dynamic relation model to store RDF data. Proceedings of the 15th International
Conference on Database Systems for Advanced Applications (pp. 580-594). 10.1007/978-3-642-12026-8_44
Weiss, C., Karras, P., & Bernstein, A. (2008). Hexastore: Sextuple indexing for semantic web data management . Proceedings of the VLDB
Endowment , 1(1), 1008–1019. doi:10.14778/1453856.1453965
Wilkinson, K., Sayers, C., Kuno, H. A., & Reynolds, D. (2003). Efficient RDF storage and retrieval in Jena2. Proceedings of the Semantic Web and
Databases Workshop (pp. 131-150).
ADDITIONAL READING
Aluc, G., Özsu, M. T., & Daudjee, K. (2014). Workload matters: Why RDF databases need a new design . Proceedings of the VLDB
Endowment , 7(10), 837–840. doi:10.14778/2732951.2732957
Cardoso, J. (2007). The Semantic Web vision: Where are we? IEEE Intelligent Systems, 22(5), 84–88. doi:10.1109/MIS.2007.4338499
Cattell, R. (2011). Scalable SQL and NoSQL data stores . SIGMOD Record , 39(4), 12–27. doi:10.1145/1978915.1978919
Cogswell, J. (n. d.). SQL vs. NoSQL: Which is better? Retrieved from http://slashdot.org/topic/bi/sql-vs-nosql-which-is-better/
Das S. Srinivasan J. Perry M. Chong E. I. Banerjee J. (2014, March 24-28). A tale of two graphs: Property graphs as RDF in Oracle. Proceedings
of the 2014 International Conference on Extending Database Technology, Athens, Greece (pp. 762-773).
Decker, S., Melnik, S., van Harmelen, F., Fensel, D., Klein, M. C. A., Broekstra, J., & Horrocks, I. (2000). The Semantic Web: The roles of XML
and RDF . IEEE Internet Computing , 4(5), 63–74. doi:10.1109/4236.877487
Decker, S., Mitra, P., & Melnik, S. (2000). Framework for the Semantic Web: An RDF tutorial. IEEE Internet Computing, 4(6), 68–73.
doi:10.1109/4236.895018
Horrocks, I., Parsia, B., Patel-Schneider, P., & Hendler, J. (2005, September 11-16). Semantic Web architecture: Stack or two towers? Proceedings of
the 2005 International Workshop on Principles and Practice of Semantic Web Reasoning, Dagstuhl Castle, Germany (pp. 37-41).
10.1007/11552222_4
Hunter J. Lagoze C. (2001, May 1-5). Combining RDF and XML schemas to enhance interoperability between metadata application profiles,
Proceedings of the 2001 International World Wide Web Conference, Hong Kong, China (pp. 457-466). 10.1145/371920.372100
Jing H. Haihong E. Guan L. Jian D. (2011). Survey on NoSQL database. Proceedings of the 2011 International Conference on Pervasive
Computing and Applications (pp. 363-366). 10.1109/ICPCA.2011.6106531
Kaoudi Z. Manolescu I. (2014, June 22-27). Cloud-based RDF data management. Proceedings of the 2014 ACM SIGMOD International
Conference on Management of Data, Snowbird, USA (pp. 725-729). 10.1145/2588555.2588891
Kelly, J. (2012). Accumulo: Why the world needs another NoSQL database. Big Data, Aug.
Kim H. S. Ravindra P. Anyanwu K. (2014). A semantics-oriented storage model for big heterogeneous RDF data. Proceedings of the ISWC 2014
Posters & Demonstrations Track a track within the 13th International Semantic Web Conference, Riva del Garda, Italy, October 21, 2014, 437-
440.
Lane, A. (n. d.). A response to NoSQL security concerns. Retrieved from http://www.darkreading.com/blog/232600288/a-response-to-nosql-
security-concerns.html
Moniruzzaman, A. B. M., & Hossain, S. A. (2013). NoSQL database: New era of databases for big data analytics - classification, characteristics and
comparison.
Punnoose, R., Crainiceanu, A., & Rapp, D. (2012, August 31). Rya: A scalable RDF triple store for the clouds. Proceedings of the 2012
International Workshop on Cloud Intelligence, Istanbul, Turkey.
Schindler, J. (2012). I/O characteristics of NoSQL databases. Proceedings of the VLDB Endowment, 5(12), 2020–2021.
doi:10.14778/2367502.2367565
Shadbolt, N., Berners-Lee, T., & Hall, W. (2006). The Semantic Web revisited . IEEE Intelligent Systems , 21(3), 96–101.
doi:10.1109/MIS.2006.62
Shimel, A. (n. d.). Is security an afterthought for NoSQL? Retrieved from http://www.networkworld.com/community/blog/security‑
afterthought‑nosql
Stonebraker, M. (2010). SQL databases v. NoSQL databases .Communications of the ACM , 53(4), 10–11. doi:10.1145/1721654.1721659
Vidal V. M. P. Casanova M. A. Monteiro J. M. Arruda N. M. Jr Cardoso D. S. Pequeno V. M. (2014, October 21). A framework for incremental
maintenance of RDF views of relational data. Proceedings of the ISWC 2014 Posters & Demonstrations Track a track within the 13th
International Semantic Web Conference, Riva del Garda, Italy (pp. 321-324).
Wu B. W. Zhou Y. L. Yuan P. P. Jin H. Liu L. (2014, November 3-7). SemStore: A semantic-preserving distributed RDF triple store, Proceedings
of the 2014 ACM International Conference on Information and Knowledge Management, Shanghai, China (pp. 509-518).
10.1145/2661829.2661876
KEY TERMS AND DEFINITIONS
ACID: ACID denotes four properties, (A)tomicity, (C)onsistency, (I)solation and (D)urability, which characterize the type of transaction processing
done by a relational database management system (RDBMS).
Big Data: Big Data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. There is no
common definition of Big Data, which is generally characterized by properties such as volume, velocity, variety and so on.
CAP: The CAP Theorem concerns three properties of a distributed data store: Consistency, Availability and Partition Tolerance.
JSON: JSON (JavaScript Object Notation) is a lightweight, text-based and typed data format used to represent data such as lists, maps, dates and
Booleans, as well as numbers of different precisions.
NoSQL Databases: NoSQL means “not only SQL” or “no SQL at all”. Being a new type of non-relational databases, NoSQL databases are
developed for efficient and scalable management of Big Data.
RDF: Resource Description Framework (RDF) is a W3C (World Wide Web Consortium) recommendation which provides a generic mechanism
for representing information about resources on the Web.
SPARQL: SPARQL (Simple Protocol and RDF Query Language) is an RDF query language which is a W3C recommendation. SPARQL contains
capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions.
ENDNOTES
1 http://hypertable.org
CHAPTER 6
On Efficient Acquisition and Recovery Methods for Certain Types of Big Data
George Avirappattu
Kean University, USA
ABSTRACT
Big data is characterized in many circles in terms of the three V’s – volume, velocity and variety. Although most of us can sense palpable opportunities presented by big data there are
overwhelming challenges, at many levels, turning such data into actionable information or building entities that efficiently work together based on it. This chapter discusses ways to
potentially reduce the volume and velocity aspects of certain kinds of data (data with sparsity and structure) during acquisition itself. Such reduction can alleviate the challenges to some
extent at all levels, especially during the storage, retrieval, communication, and analysis phases. In this chapter we will conduct a non-technical survey, bringing together ideas from
some recent and current developments. We focus primarily on Compressive Sensing and sparse Fast Fourier Transform or Sparse Fourier Transform. Almost all natural signals or
data streams are known to have some level of sparsity and structure that are key for these efficiencies to take place.
1. INTRODUCTION
The scientific community as well as the intelligence agencies have traditionally led the field in collection and compilation of vast amounts of electronic data. Search engines (such as
Google, Yahoo!, and Microsoft) and e-commerce sites started amassing exponentially increasing amounts of data in the early 2000s. After social networks like Facebook and Twitter arrived,
with hundreds of millions of users, electronic data collection increased to a level beyond imagination.
Deriving actionable information from the data collected has challenged the best minds in many disciplines. Efficient storage and retrieval of data on demand needed new thinking.
From this need were born many new technologies, including the Hadoop/MapReduce ecosystem with its ever-increasing number of components. Several scientific communities and
commercial and public entities are hard at work to exploit this newest opportunity in spite of the unforeseen challenges in doing so. The traditional analysis of digital data was limited to
one's own computing domain, often represented by an academic or corporate structure. However, with the advancement of the computing and networking technologies that led to big
data, there seems to be a paradigm shift in what we even consider to fit the definition of "data".
The word "data" is readily conceptualized by most of us, yet these conceptions vary widely. Even current dictionaries give generic and varying definitions of the term. According to the
Oxford dictionary, data means "facts and statistics collected together for reference or analysis". Oxford goes on to specify its meaning in computing as "the quantities, characters, or
symbols on which operations are performed by a computer, being stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording
media", and its meaning in philosophy as "things known or assumed as facts, making the basis of reasoning or calculation." Merriam-Webster defines data as "facts or information used
usually to calculate, analyze, or plan something" or as "information that is produced or stored by a computer". However, with the introduction of the World Wide Web in the early
nineties and its success in providing connectivity to digital information everywhere (and the subsequent development of unforeseen levels of acquisition, storage, and analysis
capabilities for digital information), one may wonder whether these definitions suffice to capture what we now consider data.
Useful data can generally be considered as information of any kind that may evoke any of our senses about the past or present. Such information is often embedded with high levels of
sparsity and redundancy, especially in one or another of its alternate representations. Any event that has occurred or is occurring, and that could lead to some form of sensation or
thought in one or more of us, can be regarded as a source of useful data. Data sources that interest us can perhaps be divided into two broad categories: data that can be attributed to
humans, and data that can be attributed to non-humans.
Some examples of the first kind are e-mails, internet searches, tweets, articles (scientific or otherwise), creative works including audio and video, commercial transactions and the
census. In this case, since humans act as both the source and the recipient, we have complete control over how the related data is perceived or interpreted. The second kind can be
sourced mostly to observations of natural phenomena around us, as in oceanography, seismology, geology, meteorology, astronomy, high-energy physics, biology, and chemistry. This
type of data allows us, at best, our own impression or interpretation of what is actually taking place.
Analytics on both types of data holds promise, but strategies for analysis may differ. The former will always be discrete and finite in size and dimension, no matter the volume, velocity,
variety, or other characteristics; at least theoretically, it may not need as much processing in acquisition, storage, and retrieval. The latter, on the other hand, tends to be continuous and
infinite in size, and perhaps in dimension, but full of sparsity and redundancy.
Regardless, analytics aimed at divulging meaningful information from any data has better potential when the data are used collectively through aggregation, composition, or
integration. For example, individual transactions by themselves are unlikely targets for analytics (although, with information gathered from analytics, one may go back to subsets or
even to individual records).
Analytics, even after identifying patterns visually or otherwise, will increasingly have to be based on computational techniques, whether mathematical or statistical. Much of the
information content in any type of data depends on the amount and type of variation contained in it (Rudder, 2014). Identifying this requires that data in non-quantitative forms
(henceforth we consider only digital data, regardless of source), whether structured, unstructured, or qualitative, be transformed into quantitative form. After all,
analytics on big data is only promising if the analytical techniques used can reveal patterns or trends that the “naked eye” cannot discern on its own. That is why, at the core of many
scientific techniques, the Fourier transform and its related techniques loom large. Such technology, roughly speaking, allows us to glean the intrinsic character of the data or signal in
question by transforming it from a “time domain” to a “frequency domain” and gives us an opening for all kinds of analysis and synthesis on the content. It all requires efficient
acquisition of the data or signal of interest and its successful recovery.
This is exactly where developments such as compressive sensing (CS) come in. In a nutshell, CS promises a way to acquire data that has sparsity and structure in proportion to its
useful content only, and then to recover it. For decades, the gold standard has been the Shannon-Nyquist theorem, which roughly states that to guarantee exact recovery of information
one has to keep up with the fastest frequency (variation) component in the data, and that may have nothing to do with the amount of information content. One can then appreciate
the promise of CS: it eases the complexities of volume and velocity, acquiring only the necessary data by exploiting sparsity and structure.
Donoho (2006) in his seminal article on the topic asks "If much of the data that is acquired are thrown away later why acquire them in the first place?" He goes on to present
guarantees for exact recovery, or at least probabilistic estimates for it, from a significantly reduced set of signal samples provided the signal is sparse (see Figure 1, for example). At
about the same time, Candès, Romberg, and Tao, in their groundbreaking work on the topic, provided many results that spurred a large amount of work in the area now called
compressive sensing, which provides efficient sampling and robust recovery methods.
Figure 1. Sparsity: A one megapixel (1,000,000) picture on the
left next to one that is reconstructed from 25,000 wavelet
coefficients on the right. The difference is hardly noticeable. It
turns out that the original picture can perfectly be
reconstructed from just 96,000 measurements alone
according to Candès in “Undersampling and Sparse Signal
Recovery”
Another group (Indyk & Kapralov, 2014), apparently independent of the compressive sensing developments, was devising surprisingly efficient algorithms for the Discrete Fourier
Transform, called the sparse Fast Fourier Transform (sFFT), in order to achieve similar goals.
At the core of all these efforts is the desire to reduce conventional signal sampling frequency requirements so that sampling is proportional to content (in a natural or transformed
form), and not only to provide recovery guarantees but also to keep the computational time for sampling and recovery to a minimum. There are also several other recent important
developments, seemingly independent of the aforementioned, for efficient acquisition and recovery. Some examples are the techniques applied to streaming data, such as sketching
(Cormode, Garofalakis, Haas, & Jermaine, 2012) and compressed counting (Ping & Zhang, 2011), which are beyond the scope of this chapter.
In this chapter, we are concerned primarily with the discussion of basic developments that hold promise for dealing with big data as a whole. We do not attempt to discuss existing
technologies, systems, specific issues, and their resolutions, as is done in other chapters.
This chapter is organized into two main sections. First, Section II discusses big data and provides a general overview and context for the key section that follows. It is divided into
subsections on big data opportunities that are still being discovered, as well as the increasing challenges presented by it. Second, Section III deals with the efficient acquisition,
processing, and recovery of data with the two properties of sparsity and structure. This is quite different from the characterization of big data in terms of three or more V's. In this
section, also divided into subsections, we outline recent and somewhat parallel developments that may help alleviate many challenges associated with big data even before the data is
acquired by reducing the volume and perhaps the velocity aspects of it. This section is meant to be a brief and non-technical survey of some of the recent advancements that are
relevant in this regard.
2. BIG DATA
2.1. What is Big Data?
There seems to be no complete consensus on a definition for big data, although the literature and research on big data itself is exploding, for understandable reasons. Some notable
attempts at a definition:
• "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." - from a report by the McKinsey Global
Institute in June 2011 (McKinsey Global Institute, 2011).
• "Big data is a resource and a tool." - from the essay The Rise of Big Data: How It's Changing the Way We Think About the World by K. Cukier & V. Mayer-Schoenberger (2013).
• "Big data is being generated by everything around us at all times. Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it. Big
data is arriving from multiple sources at an alarming velocity, volume and variety. To extract meaningful value from big data, you need optimal processing power, analytics
capabilities and skills." - from the IBM website (October 2013).
• "Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and
society – as the Internet has become. Why? More data may lead to more accurate analyses." - from the SAS website (October 2014).
Perhaps the most recognized interpretation is the one that came from the title of Doug Laney's 2001 article "3D Data Management: Controlling Data Volume, Velocity and
Variety" (Laney, 2001). Ever since, big data has been characterized by the 3V's: Volume, Velocity and Variety. Recently other V's have been added to these three; however, there seems
to be no agreement on how many more, nor on which ones, should be added. We will keep in mind the more holistic way of looking at big data discussed in the introduction.
The proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (most recently held in August 2014 in New York City, NY) serve as a key
showcase for the various scientific and technological advancements made in making sense of big data.
2.2. Opportunities and Challenges
The variety of data being collected seems to have no end. It is hard to even classify the types of data along clear boundaries any more. Google, Facebook, Amazon, Twitter, and
security agencies are just a few global entities that are likely to have access to vast amounts of human generated data (the first type discussed in the introduction.) Other entities
include research labs, academic institutions, and public and government agencies. Growth in the data collection rate always seems to outpace increasing storage and processing
capabilities.
One common and basic way of categorizing data is into structured data and unstructured data. Many of the theories and techniques developed in the past apply exclusively to
structured data. However, the need for techniques to deal with unstructured data is incredibly pertinent as the variety of data collected becomes more and more complex. Just
consider tweets, for example, from different parts of the world with different socioeconomic conditions, customs, political structures, and different languages.
Chief among the resulting challenges is organizing and storing such data in scalable modern databases (such as in the cloud).
Examples of big data use are prolific. The following two examples (one public and the other private) demonstrate entities that have unlocked the potential of big data to overcome the
challenges mentioned above.
First, Dr. Lee, who laid the foundational architecture for the University of Pittsburgh Medical Center's (UPMC) new enterprise data warehouse, writes, "the integration of data,
which is the goal of the enterprise data warehouse, allows us to ask questions that we just simply couldn't ask before". Indeed, Pitt researchers recently were able, for the first time, to
electronically integrate clinical and genomic information on 140 patients previously treated for breast cancer (UPMC, 2013).
Second, Uber’s rapid development and popularity amongst the millennial generation stems partially from its ability to reliably integrate a plethora of data and extract from that data a
link between individual drivers and consumers of the service Uber provides (Uber, 2014). “Uber owns no cars and hires no drivers. In many ways, the whole company is a data play.
Its systems know where you've come from, your favorite haunts and how you pay. The company's 'Math department,' as Kalanick calls it, collects user behavior over time into a 'God
view' that allows them to know exactly which neighborhood will need more cars on a rainy day'" (Scola & Peterson, 2014).
3. EFFICIENT ACQUISITION AND RECOVERY
The celebrated Shannon-Nyquist theorem on sampling guarantees full recovery of a signal from its samples provided that the sampling rate is greater than or equal to twice the highest
frequency in the signal (the Nyquist rate) (Shannon, 1949). However, with the advent of big data, the demands that Nyquist-rate sampling places on hardware, storage, and on-the-fly
retrieval and analysis are becoming harder to meet. Therefore, it is natural to look for alternate possibilities. It is quite timely that Donoho poses the question "Why go to so much
effort to acquire all the data when most of what we get will be thrown away? Can’t we just directly measure the part that won’t end up being thrown away?” (Donoho, 2006)
The origin of the mathematical ideas behind compressive sensing is decades old; in fact, in the 1970s, seismologists were already utilizing precursors of the idea to spot oil
underground. Two significant developments have revived interest in compressive sensing within the last decade or so. First, consider the advances made in sublinear sampling and
recovery of certain signals in the work of Candès, Romberg, and Tao (2006a) between 2004 and 2006. Remarkably, they conclude that "the resolution of an acquired image is not
controlled by the number of pixels (conventional wisdom) but proportional to the information content." In fact, that is the way it ought to be, isn't it? "Information content" in this
context refers to the significant (non-sparse) part of the data in some convenient representation (such as in a wavelet basis), not necessarily in its natural or standard form.
These results were then followed by promising advances in processing sparse signals, called the sparse Fast Fourier Transform (sFFT), which, remarkably, runs in sublinear time
(Hassanieh, Indyk, Katabi, & Price, 2012). Sublinear time roughly means that the computation time may be less than the signal length!
3.1. Compressive Sensing
“Finally, in some important situations the full collection of n discrete-time samples of an analog signal may be difficult to obtain (and possibly difficult to subsequently compress).
Here, it could be helpful to design physical sampling devices that directly record discrete, low-rate incoherent measurements of the incident analog signal. (This) suggests that
mathematical and computational methods could have an enormous impact in areas where conventional hardware design has significant limitations.” (Candès & Wakin, 2008.)
At an abstract level, the compressive sensor is a matrix Φ that does not take N sample measurements of a signal x of length N, but instead uses compressing techniques to take only
M << N samples, y = Φx. From this reduced sample y, the original signal x must be either recovered completely or approximated sufficiently well. The trade-offs that make such
recovery possible are, roughly speaking, requirements of sparsity and structure in the signal, and it turns out that a great many natural signals have one or both of these characteristics
to a varying extent. Often the signal x itself may not manifest its sparsity, but it generally will upon transformation to an appropriate basis (such as wavelets or sinusoids). Suppose
that x = Ψs, where Ψ is a transform matrix (such as a wavelet transform) and s is the transformed signal; it is really s that is being sampled. So y = ΦΨs, where Φ takes the samples and
y is an M-dimensional vector from which we are to recover x. The hope is that even if M is significantly smaller than the signal length N, under certain requirements on x it can be
reconstructed exactly or sufficiently well. The classic Nyquist condition requires that M ≥ 2F, with F being the highest frequency in x. With the new developments initiated by Donoho
(2006), Candès et al. (2006) and Hassanieh et al. (2012), we may bypass such sampling frequency requirements; the key difference is that it is not the frequency but the information
content (sparsity and structure) that drives the sampling requirements:
y = ΦΨs,
where the product ΦΨ represents both transformation and sampling of x. Note that in most of the literature the difference between Φ and Ψ is not spelled out, since compressing a
signal using a basis Ψ such as wavelets or Fourier is somewhat well understood, and many authors simply assume x to be natively sparse.
It is crucial to understand, though, that the chances of recovery will not be good if Φ and Ψ are coherent. A variation of this incoherence requirement is that the product ΦΨ satisfy the
restricted isometry property (RIP), which roughly states that any small subset of its columns be close to orthogonal (notice that they cannot be exactly orthogonal, as there are more
columns than rows). In other words, the sampling matrix Φ must be as incoherent as possible with the compressing basis Ψ.
With M << N, recovering x from y is an ill-posed problem and will not, in general, have a unique solution. Under the sparsity (or near-sparsity) and incoherence or RIP requirements,
it can be shown that, among the possible solutions, the sparsest one, the one with the fewest non-zero entries, is the actual solution (or has a very good probability of being it). Finding
such a solution, however, is an ℓ0 minimization problem (Duarte & Eldar, 2011). The ℓ0 and ℓ1 norms of a vector x are ||x||_0, the number of non-zero entries of x (not really a norm in
the mathematical sense, but an accepted convention), and ||x||_1, the sum of the absolute values of the entries of x. Recovering the signal would amount to finding the minimizer of
||x||_0 subject to y = Φx, which happens to be a combinatorial problem with high computational cost. Surprisingly, though, the higher the dimension of the solution, the more likely it
is that the sparse solution of the ℓ0 problem coincides with that of the ℓ1 minimization: minimize ||x||_1 subject to y = Φx. The advantage is that this ℓ1 minimization amounts to a
convex linear programming problem for which well-developed, efficient algorithmic solution strategies exist. Such minimizations are also robust even in the presence of signal noise,
which is a reality in practical sampling implementations (Candès et al., 2006a).
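Written out explicitly (this is the standard formulation implied by the discussion above, with ε denoting the measurement noise level), the recovery problems are:

\[
\min_{x} \|x\|_{0} \ \text{subject to} \ y = \Phi x
\quad\longrightarrow\quad
\min_{x} \|x\|_{1} \ \text{subject to} \ y = \Phi x ,
\]
\[
\min_{x} \|x\|_{1} \ \text{subject to} \ \|y - \Phi x\|_{2} \le \varepsilon \quad \text{(noisy measurements).}
\]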
Further refinements of ℓ1 minimization in the form of iterative algorithms soon started appearing. One sequence of algorithms proposed to improve the computational complexity
combines ℓ1-type minimization with "greedy" strategies that refine the choice of non-zero entries step by step; these are called orthogonal matching pursuits, and they have achieved
significant improvements in efficient recovery.
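The following is a minimal numpy sketch of compressive sampling followed by greedy recovery with orthogonal matching pursuit; it is purely illustrative (random Gaussian measurement matrix, a signal assumed to be natively K-sparse, no noise handling) and is not the algorithm of any particular paper.

import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 64, 5                        # signal length, measurements, sparsity

# A K-sparse signal x and a random Gaussian measurement matrix Phi.
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y = Phi @ x                                 # only M << N compressive measurements

# Orthogonal matching pursuit: greedily pick the column of Phi most
# correlated with the residual, then re-fit the coefficients by least squares.
support, residual = [], y.copy()
for _ in range(K):
    support.append(int(np.argmax(np.abs(Phi.T @ residual))))
    coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
    residual = y - Phi[:, support] @ coef

x_hat = np.zeros(N)
x_hat[support] = coef
print("recovery error:", np.linalg.norm(x - x_hat))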
One of the few practical applications of compressive sensing to date, developed by Baraniuk and Kelly's (2007) team at Rice University, is the now well-known single pixel camera
(SPC). A schematic diagram is shown in Figure 2 (courtesy of Rice University and used with permission).
Figure 2. Schematic diagram of single pixel camera (SPC)
What actually happens here is that the signal (an image, in this case) of the scene is focused on an array of small mirrors, each of which can be randomly positioned to reflect the light
either toward the single pixel detector or elsewhere. This random positioning of the mirrors makes the signal received by the pixel an average of a randomly sampled image. The SPC
sequentially takes on the order of K log(N/K) such measurements of the scene (assumed to be K-sparse). The image reconstructed from just these K log(N/K) measurements is quite
comparable to one acquired from N samples by conventional means.
Candès (2008) sketches a potential "analog-to-information conversion" architecture, another example of bringing CS theory into practice. Healy and Brady (2008) discuss such
architectures further in their article.
3.2. sFFT (or SFT): Sparse Fourier Transform
“The sparse Fourier transform (SFT) addresses the big data setting by computing a compressed Fourier transform using only a subset of the input data, in time smaller than the data
set size.” (Gilbert, Indyk, Iwen, & Schmidt, 2014)
SFT methods attempt to compute (or estimate) the discrete Fourier transform (DFT) of K-sparse (compressible) signals of length N in sublinear time, on the order of K log(N). All SFT
techniques proceed essentially as follows: first, divide the problem of identifying the frequencies present in the sparse signal into sub-problems until each frequency is essentially
isolated into a narrow band, called a bin; second, identify (or at least estimate) each frequency's position and value within its bin, which can be done with just two samples. For
frequencies that are close to each other and thus hard to separate, a random permutation technique is used to move the frequency positions around, after which the isolation and
recovery steps are repeated. This process is repeated with different random permutations until all K frequencies have been isolated, yielding an overall running time roughly of order
K log(N). There are several implementations of this method, and the reader is referred to a survey such as (Gilbert et al., 2014) for more details.
In (Gilbert et al., 2014) the authors point out that SFTs can also be considered compressive sensing in a broader sense, as they too recover (or estimate) frequency-sparse signals
from a reduced set of samples in sublinear time.
4. CONCLUSION
Through compressive sensing and related technologies, great advances have been made towards extracting only the minimal information required to recover and reconstruct received data signals. In terms of timing, these advances could not have come at a better moment, as many are now experiencing the complexities of a deluge of data. It is reasonable to expect that compressive-sensing-type technologies will help alleviate the volume and velocity, and perhaps other aspects, of big data in the near future.
These advances are exciting because hardware based on conventional sampling will be limited by Nyquist-Shannon sampling requirements. That will impede our ability to sample and acquire digital data as fast as we desire. But even if we circumvent that issue, the enormity of the captured data poses major challenges. Compressive sensing makes a two-fold promise regarding its potential to revolutionize data analytics. First, hardware limitation issues will be turned into mathematical and algorithmic issues related to recovery and processing speed; these, in turn, eventually become software design issues. Second, CS promises a potentially significant reduction in the volume and velocity of data we have to deal with.
Despite the above, significant future research is needed to bring the promises of CS technology to fruition. First, we are still waiting for implementations of compressive sensing technology that can simultaneously transform and sample analog signals into discretized and compressed form at the acquisition stage itself. Second, our ability to utilize such sampled (and perhaps transformed) data across all phases of big data processing, for example access during cleaning and integration, has yet to be realized.
This work was previously published in Managing Big Data Integration in the Public Sector edited by Anil Aggarwal, pages 137-147, copyright year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Baraniuk, R. G. (2007). Compressive sensing . IEEE Signal Processing Magazine , 24(4), 118–121. doi:10.1109/MSP.2007.4286571
Candès, E. J., Romberg, J. K., & Tao, T. (2006a). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on
Information Theory , 52(2), 489–509. doi:10.1109/TIT.2005.862083
Candès, E. J., Romberg, J. K., & Tao, T. (2006b). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8), 1207–1223. doi:10.1002/cpa.20124
Candès, E. J., & Wakin, M. B. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine , 24(2), 21–30. doi:10.1109/MSP.2007.914731
Committee on the Analysis of Massive Data et al. (2013). Frontiers in Massive Data Analysis. The National Academies Press.
Cormode, G., Garofalakis, M., Haas, P., & Jermaine, C. (2012). Synopses for Massive Data: Samples, Histograms, Wavelets and Sketches. doi:10.1561/1900000004
Cormode, G., & Muthukrishnan, S. (2005). Improved data stream summaries: The count-min sketch and its applications. Journal of Algorithms , 55(1), 58–75.
doi:10.1016/j.jalgor.2003.12.001
Duarte, M., & Eldar, Y. (2011). Structured Compressed Sensing: From Theory to Applications. IEEE Transactions on Signal Processing , 59(9), 4053–4085.
doi:10.1109/TSP.2011.2161982
Gilbert, A., Indyk, P., Iwen, M., & Schmidt, L. (2014). A compressed Fourier transform for big data. IEEE Signal Processing Magazine, 31(5), 91–100.
Han, J., Wang, C., & El-Kishky, A. (2014). Bringing structure to text: Mining phrases, entities, topics, and hierarchies. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
Hassanieh, H., Indyk, P., Katabi, D., & Price, E. (2012). Nearly Optimal Sparse Fourier Transform. ACM-SIAM Symposium on Discrete Algorithms. ACM.
Indyk, P., & Kapralov, M., (2014, October). Sample-Optimal Sparse Fourier Transform in Any Constant Dimension. FOCS.
Ping, L., & Zhang, C. (2011). A new algorithm for compressed counting with applications in Shannon entropy estimation in dynamic data . COLT.
Shannon, C. E. (1949). Communication Theory of Secrecy Systems. The Bell System Technical Journal, 28(4), 656–715. doi:10.1002/j.1538-7305.1949.tb00928.x
Khumbulani Mpofu
Tshwane University of Technology, South Africa
Samson Mhlanga
National University of Science and Technology, Zimbabwe
ABSTRACT
The evolution of Information and Communication Technologies (ICTs) has not spared the manufacturing industry. Modern ICT-based solutions have shown a significant improvement in manufacturing industries' value streams. Paperless manufacturing evolved due to the complete automation of factories. The chapter articulates various Machine-to-Machine (M2M) technologies, big data and data modelling requirements for manufacturing
information systems. Manufacturing information systems have unique requirements which distinguish them from conventional Management
Information Systems. Various modelling technologies and standards exist for manufacturing information systems. The manufacturing field has
unique data that require capturing and processing at various phases of product, service and factory life cycle. Authors review developments in
modern ERP/CRM, PDM/PLM, SCM, and MOM/MES systems. Data modelling methods for manufacturing information systems that include
STEP/STEP-NC, XML and UML are also covered in the chapter. A case study for a computer aided process planning system for a sheet metal
forming company is also presented.
INTRODUCTION
The quest to improve the quality of manufactured products, achieve higher efficiency, improve communication, and completely integrate processes has resulted in large volumes of data being collected by manufacturers. The current era of large and complex data sets has been termed big data. Big data is difficult to process using traditional data processing applications. In order to achieve better control of their processes, manufacturers
need to capture, store, search, share, transfer, analyse and visualise data pertaining to their products, machinery, and various stages of raw
material conversion. Technologies used to meet data processing requirements in the manufacturing field are more complex than conventional database management tools. The complete life cycle of a product starts as a need from the customer, progresses into a CAD model of the interpreted need, then raw material conversion into the finished product, use by the customer with regular maintenance support, and finally disposal or recycling. CAD systems used to model the product store attribute data for the various parts of the product and also use unique file exchange
formats. The CAD models are analysed using computer aided engineering (CAE) systems so that optimal designs can be achieved. The
manufacturing process is aided by computer aided manufacturing (CAM) systems that interpret the CAD data and generate files for manipulating
the raw work-piece into the finished products using appropriate machine tools. As the world population continues to grow, production output and product varieties in many companies continue to grow, generating large volumes of data that must be processed frequently. The challenges for the manufacturer are not only centred upon the product and its life cycle but also on the health of the manufacturing plant (plant maintenance), supply chain integration, and in some instances tracking the product during its usage period. This has led to the advent of e-based maintenance systems that track the performance of plant machinery in real time.
Whilst talk about big data is popular in customer relationship management (CRM), supply chain management (SCM), and social media, there is also significant application in in-house manufacturing processes. Some manufacturers use big data to leverage their technical capabilities so that they can offer customisation to their customers. This has been driven by the growth of mass customisation and reconfigurable manufacturing systems (RMS). The capability of RMS to be reconfigured or changed to meet new customer demand enables manufacturers to offer customised products. However, reconfiguring a manufacturing facility requires highly integrated systems if short ramp-up times and optimised set-ups are to be achieved without costly investments. In this instance, big data can be viewed as measuring the finite details in manufacturing plants or factories.
Achieving tightly integrated systems in manufacturing poses challenges to manufacturing system designers and system integrators. The major aspects tackled in the chapter include data modelling, data exchange formats, and systems architecture, in order to address the gap between traditional and modern practice. Cooperation with suppliers and customers requires neutral data exchange formats in product modelling so that both ends can view the transmitted data across different software platforms.
BACKGROUND
Industry has evolved from traditional factories dominated by mechanical production facilities powered by steam and water, through mass
production based on division of labor, to introduction of electronics and IT, and currently cyber-physical systems (CPS). Koren (2010)
categorized the industrial revolution into four phases, namely Industry 1.0, Industry 2.0, Industry 3.0, and Industry 4.0. Industry 4.0 is the current stage, driven by cyber-physical production systems. Rajkumar et al. (2010) defined cyber-physical systems (CPS) as physical and engineered systems whose operations are monitored, coordinated, controlled and integrated by a computing and communication core. CPS can
be considered to be a confluence of embedded systems, real-time systems, distributed sensor systems and controls. CPS is leading towards smart
future factories with a network of intelligent objects linking products and assets with information from the internet, as well as capturing context
information.
There are many factors that drive the manufacturing trend through its progression to the current phase (Industry 4.0) and beyond. Some of them include shorter product life cycles, increasing product variation (mass customisation), volatile markets, cost reduction pressures, scarce resources, cleaner production, a lack of skilled workforce and an aging population. In order to cope with this pressure, modern-day factories must have big data repositories to analyse and make informed decisions. The global manufacturing village is also challenged by low-volume, high-mix factories. Competition among manufacturers is ever increasing, and every player must deliver high-quality goods, efficiently and at low
cost. Regulatory authorities are also mounting pressure on manufacturers, demanding cleaner and sustainable manufacturing practices. This
results in more data being captured so as to comply with regulations. It is difficult to quantify and control anything that is not measured, hence
the need for manufacturers to keep tight track of their process data, power consumption, gaseous emissions and resource consumption.
In a bid to address the issues mentioned above, manufacturing automation evolved from production management and control using material requirements planning (MRP I), manufacturing resource planning (MRP II) and ERP systems. In the quest to achieve tighter integration and collaboration with suppliers and clients, CRM, SCM, PDM and PLM systems became popular. These systems partly enabled manufacturers to gain competitive advantage over their counterparts who did not have similar technologies. Even in the current era, many companies still do not have the older technologies mentioned previously. These companies, particularly on the African continent, face stiff competition from Asian and European manufacturers that are shipping low-cost goods across the globe. Those companies at risk still think that they can industrialise and beat the competition, but the reality is that technology is advancing fast, which requires manufacturers to be very strategic and to adopt modern technologies quickly, since these allow them more flexibility, adaptability and scalability and enable them to meet customers' dynamic needs rapidly and at low cost. Various alternative technologies and case studies are articulated in the chapter to provide a clear roadmap for potential adopters and to direct researchers to further research efforts so that better systems can be availed to modern manufacturing industries.
BIG DATA AND DATA MODELLING FOR MANUFACTURING INFORMATION SYSTEMS
The chapter narrates the trends in manufacturing systems automation. Traditional systems are tracked right through to the state of the art in manufacturing automation systems. Case studies are provided for real applications in manufacturing companies. Future research is also outlined in order to further the goal of achieving more efficient and sustainable manufacturing. Big Data can be clearly distinguished into three categories, as follows (Oracle®, 2013):
• Traditional Enterprise Data: Includes customer information from CRM systems, transactional ERP data, and general ledger data.
• Machine-Generated/Sensor Data: Includes Call Detail Records (CDRs), smart meters, manufacturing sensors, and equipment logs (often referred to as digital exhaust).
• Social Data: Includes customer feedback streams and micro-blogging such as Twitter, as well as social media platforms like Facebook.
The world has adopted specific terminology and standards to suit industry-specific needs. Terminology and standards used in the manufacturing
sector are elaborated in the sections that follow.
Manufacturing Operations Management (MOM) and Manufacturing Execution Systems (MES)
MOM is a holistic solution to improve manufacturing operations performance. The systems are built to consolidate the management of several
production processes, such as quality management, sequencing, production capacity analysis, Work-in-Process (WIP), inventory turns, standard
lead times, non-conformance management, asset management, and many other processes within one system. MOM systems expand focus from a
single facility to the entire supply network, monitoring a variety of aspects of the manufacturing process. Managing different aspects of the whole
manufacturing organisation is associated with various challenges in trying to automate the tasks. Different software tools are needed to collect
and analyse real-time data and translate it into valuable knowledge that can be used to inform decision making. Traditionally, manufacturers had their own tailor-made software applications because of the lack of standard platforms to integrate systems at the shop floor level. In order to resolve this challenge, software providers started to package multiple execution management components into single, integrated solutions called manufacturing execution systems (MES) (Saenz et al., 2009). An MES keeps track of all manufacturing information in real time, receiving up-to-the-minute data from robots, machine monitors and employees. MES solutions now include integrated components such as Computer Aided Process Planning (CAPP), CAM, Product Data Acquisition (PDA), Machine Data Acquisition (MDA), and Personnel Time Recording as well as Time and Attendance (PTR/T&A).
Generally MES is a process-oriented manufacturing management system which acts as the comprehensive driving force for the organization and
execution of the production process. Its major tasks can be summarised as follows:
• Implementation of the closed loop of all actions related to the execution of the production processes (planning, initiation, managing,
controlling, documentation, evaluation, and review).
• Exchange of information with other levels such as corporate management (Enterprise Resource Planning, ERP) and the manufacturing /
process levels, as well as operational support systems, and Supply Chain Management (SCM).
In order to achieve integration and interoperability in MES, common standards are required in modeling manufacturing information systems
from the shop floor to the business logistics level. The Instrumentation, Systems, and Automation Society (ISA) developed a standard to address
the integration issues. ISA-95 is a multi-part standard that defines the interfaces between enterprise activities and control activities (Gifford,
2013). The ISA-95 standard aims to enhance the development of MES applications and integrate them to other information systems of
manufacturing companies, particularly ERP systems. The overall architecture for MES and its relationship to other enterprise systems is shown
in Figure 1. Another standard for MES is the IEC 62264 “Enterprise-Control System Integration” set of standards, which defines the functional
hierarchy levels of an enterprise, in which decisions with differing timescales and varying levels of detail must be made (IEC®, 2007). The
hierarchy of levels is as listed below, whilst Figure 2 gives the pictorial view.
Level 4: Business planning and logistics (plant production scheduling, material use, delivery and shipping)
Level 3: Manufacturing operations management (work flow and recipe control, dispatching production, detailed production scheduling)
Level 2: Monitoring as well as supervisory and automated control of the production process (batch control, continuous control and discrete control)
Level 1: Sensing and manipulating the production process
Level 0: The actual production process.
Figure 3 provides an overview of the divisions and the relationships among the relevant functions of manufacturing operations management
activities according to the ISA-95 standards.
Figure 3. Manufacturing operations management activities
according to the ISA-95 standards.
Enterprise Resource Planning (ERP) Systems
ERP is an organization’s software management system which incorporates all facets of the business, automates and facilitates the flow of data
between critical back-office functions, which may include financing, distribution, accounting, inventory management, sales, marketing, planning,
human resources, manufacturing, and other operating units. The use of ERP systems in organizations allows all departments to have one source of information, streamline their business processes and also access the required information almost instantly. Some degree of customisation may be necessary to enable system users to access analytics which can aid decision making. Plug-in business intelligence (BI) tools are also available specifically for data mining and analysis. According to Nah et al. (2001), the most important attributes of an ERP system are its abilities
to:
• Automate and integrate business processes across organizational functions and locations.
• Enable implementation of all variations of best business practices with a view towards enhancing productivity.
• Share common data and practices across the entire enterprise in order to reduce errors, produce and access information in a real-time
environment to facilitate rapid and better decisions and cost reductions.
Manufacturing software systems evolved from material requirements planning (MRP I) systems, which translated the master production schedule built for the end items into time-phased net requirements for the planning and procurement of sub-assemblies, components and raw materials. Since the 1970s, the MRP I system has been extended from a simple MRP tool to become the standard manufacturing resource planning (MRP II) system.
The basic architecture of an ERP system builds upon one database, one application, and a unified interface across the entire enterprise. A
generalized architecture is represented in Figure 4.
During the 2000s, ERP vendors added more modules and functions as add-ons to the core modules, giving birth to the extended ERPs. These ERP extensions include Advanced Planning and Scheduling (APS) and e-business solutions such as Customer Relationship Management (CRM) and Supply Chain Management (SCM). To date (2014), much attention is being given to off-site ERP solutions, referred to as cloud computing and Software as a Service (SaaS), where the ERP vendors host the software on their own infrastructure and the companies only purchase the services from that package. The evolution of ERP is represented in Figure 5.
Figure 5. ERP Evolution
Machine-to-Machine (M2M) Technologies
M2M describes technology that enables networked devices to exchange information and perform actions without dependence on human beings (Lu et al., 2011). The abbreviation is also used for Machine-to-Machine, Man-to-Machine and Machine-to-Mobile. The technology has wide applications in remote monitoring. It informs manufacturing decisions by providing valuable production information computed from real-time machine data. Typical data acquired include machine performance, engineering and quality data. The data are conveyed to responsible parties across the manufacturing organization, both in-house users and those residing in remote geographical locations. Key performance indicators (KPIs) for the operations being monitored are communicated via dashboards and customised reports. Many manufacturers are now adopting the ‘software as a service’ (SaaS) model in order to improve their efficiencies and also save themselves from the tedious task of
running locally hosted systems. With SaaS, businesses leave the tasks of managing the cloud infrastructure and platform running their software
applications to the cloud hosting provider. SaaS delivery mode makes it easy to enjoy the benefits of the remote monitoring cloud without having
to manage complex computer network systems.
M2M is at the core of any manufacturing facility, and the infrastructure builds out from there. Prior to M2M in manufacturing, industrial automation used direct wire connections between the sensors, actuators and the controlling PCs. However, due to the high costs of acquiring smart production machinery, many manufacturers have not adopted complete M2M technology; rather, they use manual methods of gathering and entering data into their ERP systems. Today, examples of M2M in manufacturing can be seen through the use of analog sensors to measure real-
world conditions and where process control systems perform analysis and control of manufacturing processes. Another instance can be
illustrated through control commands converted to analog signals to control actuators.
There are basically four components in an M2M system: 1) the intelligent device (machine or appliance) where the data originates, 2) the gateway
that extracts and translates data, 3) the network which serves the data and 4) the remote client which ultimately receives the data. M2M software
applications are optional but can facilitate communications, enable Web access and provide the user interface. Examples of technology used in
M2M systems include sensors, RFID, Wi-Fi or cellular communications link and autonomic computing software programmed to help a
networked device interpret data and make decisions.
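A minimal sketch of the four-component data path (purely illustrative; the class and field names are hypothetical and not taken from any particular M2M product): a device produces a raw reading, a gateway translates it into a neutral record, the “network” is abstracted as a direct delivery, and a remote client finally consumes the data.

from dataclasses import dataclass
import json, time

@dataclass
class DeviceReading:            # 1) intelligent device: where the data originates
    machine_id: str
    sensor: str
    raw_value: float

def gateway_translate(reading: DeviceReading) -> str:
    # 2) gateway: extracts and translates device data into a neutral format (JSON here)
    return json.dumps({
        "machine": reading.machine_id,
        "sensor": reading.sensor,
        "value": reading.raw_value,
        "timestamp": time.time(),
    })

def remote_client(message: str) -> None:
    # 4) remote client: ultimately receives the data (e.g., a dashboard or MES/ERP connector)
    record = json.loads(message)
    print(f"KPI update from {record['machine']}: {record['sensor']} = {record['value']}")

# 3) the network serving the data is abstracted here as a direct function call.
remote_client(gateway_translate(DeviceReading("press-01", "stroke_count", 1520.0)))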
OPC, maintained by the OPC Foundation and linking industrial automation with enterprise systems, has become a widely accepted M2M communication standard in manufacturing (Paine, 2011). It is based on open standards and specifications to ensure that interoperability can be achieved for M2M communications in manufacturing. Newer standards created by the OPC Foundation include OPC-UA, which is designed to be platform and operating system independent, supporting Windows, Linux, and a variety of embedded operating systems that M2M technology vendors will be able to leverage. Figure 6 shows the typical architecture for an M2M system (Brandon, 2013).
Figure 6. M2M Architecture (Source: Brandon, 2013)
M2M communication is an important aspect of warehouse management, remote control, robotics, traffic control, logistic services, supply chain management, fleet management and telemedicine. It forms the basis for a concept known as the Internet of Things (IoT). Typical M2M application areas include:
• Fleet Tracking: Monitoring fleet arrivals/departures and flagging exceptions can improve end-to-end visibility and improve planning.
• Event-Based Monitoring of Driver Behaviour: Documenting speed, idle time, and hard braking of delivery vehicles can reduce fuel and insurance costs, while increasing driver safety.
• Field Force Management: Overseeing field-force activities from a centralised location can make it possible to practice real-time routing based on traffic information.
• Inventory-Level Monitoring: Viewing and communicating inventory levels can help companies build automated replenishment programs and share information with suppliers.
• Tagging High-Value Assets and Inventory: M2M systems can help companies keep track of particularly valuable assets, such as computers, data-storage devices, consumer electronics, and ATMs.
• Smart Warehouses/Supply Chain Facilities: Through remote metering and control, companies can optimise energy use in warehouses, production facilities, and other locations, thus reducing operating costs.
Product Data Management (PDM) and Product Life-Cycle Management (PLM) Systems
Product Data Management (PDM) systems provide the tools to control access to and manage all product definition data. They do this by maintaining information (metadata) about product information. In 3D CAD systems, product files rely heavily on each other: they have relationships and contain dependencies that drive details like feature size and placement in other models. The PDM vault allows engineers to
better manage the complex interrelationships between the part, assembly and drawing files. They can share files with other team members and
keep each other up to date on design modifications through a file check-in/check-out process. The product data management (PDM) tools are
often integrated directly with the CAD program being used by the team for design modeling. Product Lifecycle Management can be taken as a
strategic business approach that applies a consistent set of business solutions that support the collaborative creation, management,
dissemination, and use of product definition information. Product Lifecycle Management (PLM) systems support the management of a portfolio
of products, processes and services from initial concept, through design, launch, production and use, to final disposal. The product life cycle covers the lifespan of a product from the idea through development up to disposal or recycling. The product life-cycle period usually consists of five major steps or phases: product development, product introduction, product growth, product maturity and finally product decline. These phases exist and are applicable to all products or services. The phases can be split into smaller ones depending on the product and must be considered when a new product is to be introduced into a market, since they dictate the product’s sales performance. All PLM systems use some form of PDM as the underlying data foundation on which they operate.
Many companies have migrated from 2D to 3D CAD systems for their primary product development. This has resulted in PDM becoming a
virtual necessity for manufacturers. The benefits of 3D are reduced cycle time, cost savings, quality improvements, and greater innovation.
However 3D CAD systems bring in challenges of data management where engineers are generating greater volumes of data. 3D files contain a
variety of references, associations, and interrelationships that link them to other files, such as parts, drawings, bills of materials (BOMs), multiple
configurations, assemblies, NC programming, and documentation. Product developers use 3D models to carry out various analyses and simulations so as to validate their designs. The models can also be used to demonstrate product functionality concepts to customers before committing to full production. The 3D geometric component modeling software, or “kernel” modelers, must reliably manage data accuracy and
consistency while providing the openness and interoperability needed to facilitate the seamless exchange of 3D product data. Interoperability is
crucial in product development and manufacturing where different applications for design, validation and manufacturing engineering are used at
the same time requiring these systems to interoperate in upstream processes. Engineers therefore require a reliable system for managing,
preserving, and safeguarding these links. Numerous product revisions are usually the norm in manufacturing firms, requiring different engineers
to work within assemblies, or to collaborate on a design.
Product life is an issue for complex products with long life cycles such as plant machinery, trucks and airplanes. Plant machinery currently lasts longer than the products it produces. It is important to take care of life-cycle data about such resources so as to maintain a healthy plant that will produce quality goods. Wear, captured in status information, can cause the machines to deviate from their original settings, resulting in defective products. This will also shorten the machinery life, causing investors to lose their capital investment. Examples of life-cycle data for plant machinery include
replaced parts, maintenance and wear indications such as capability and accuracy information. Monitoring machinery life cycle data makes it
possible to carry out preventive maintenance in order to avoid unplanned maintenance and machine failures which are more costly.
PLM can be split into various disciplines, as shown in Figure 7; only the top four disciplines are shown. Engineering Change Management (ECM) is a Business Process Management (BPM) discipline that spans all the phases of the product life cycle (IBM®, 2008).
Figure 7. PLM disciplines and the product life cycle
Data Modeling for Manufacturing Systems
The advent of CAD and CAM software brought about the integration of design and manufacturing processes: CAD enables the automation of design, CAM enables the automation of manufacturing processes, and their integration provides a direct link between the two. The database created by the integration of CAD/CAM is also known as the manufacturing database. It includes all the data about the product generated during design, such as shape and dimensions, bills of materials and part lists, material specifications, etc. It also includes additional data required for manufacturing purposes. There is no time gap between the two processes and no duplication of effort is required on the part of the designer and the production personnel. As the integration of computer aided design and manufacturing (CAD/CAM) systems progresses, the need for management of the resulting data becomes critical. Database management systems (DBMS) have been developed to assist with this task, but currently do not satisfy all of the needs of CAD/CAM data. The use of CAM systems requires a number of catalogues, especially in process planning. Machine tools, cutting tools, inserts and tooling catalogues are sources of data which are necessary in cutting process planning.
Development of such systems requires database modelling languages (DML) for the analysis stage. A DML is used for specifying, visualizing, constructing, and documenting the artefacts of software systems. Common modelling languages for manufacturing systems are the Unified Modelling Language (UML), EXPRESS/EXPRESS-G and XML.
One of the purposes of UML is to provide the development community with a stable and common design language that can be used to develop
and build computer applications. The UML notation set is a language and not a methodology. This is important, because a language, as opposed
to a methodology, can easily fit into any company's way of conducting business without requiring change. Since UML is not a methodology, it
does not require any formal work products, yet it does provide several types of diagrams that, when used within a given methodology, increase
the ease of understanding an application under development (Rumbaugh et al., 2005). The UML class diagram can explicitly represent the relationships between objects. Hence even a complex data model can be represented well in UML.
EXPRESS-G Modelling Language
EXPRESS is a standard data modelling language for product data. It is formalized in the ISO standard for the exchange of product model data, STEP (ISO 10303), and is standardized as ISO 10303-11. EXPRESS-G is a standard graphical notation for information models. It is a useful
companion to the EXPRESS language for displaying entity and type definitions, relationships and cardinality. This graphical notation supports a
subset of the EXPRESS language. One of the advantages of using EXPRESS-G over EXPRESS is that the structure of a data model can be
presented in a more understandable manner. A disadvantage of EXPRESS-G is that complex constraints cannot be formally specified.
Case Study for a Computer Aided Process Planning (CAPP) System
This section illustrates a system and addresses all the aspects involved in building up the process planning system for a sheet metal products
manufacturing company. Methods used and the guidelines that are followed in system development are discussed. Figure 8 illustrates the
architecture of the process sequencing system. The sequencing system has different interoperating modules each providing one or two essential
functions to the system.
Figure 8. System architecture
• FR Module: The feature extraction module defines the part features and geometry required from the product model given in 3D format.
• Capability Taxonomy: The capability taxonomy is a method for storing the available tools in a hierarchical assortment of classes.
• Database: The system database stores the information extracted by the feature extraction module.
• Process-Sequence Module: This module operates as a central operation planner, a governing module that keeps track of the potential alternatives and optimizes the operation sequence.
Figure 9 illustrates a generic design procedure for a sheet metal part (a hurricane clip, BPH). The figure outlines the specific parameters, machine, tools and geometry involved in making a decision.
The database module for the process planning system is made up of repositories of data that are generated, updated, and retrieved by the FR,
Taxonomy and Process Sequencing modules. A unit in the database stores new sheet product geometry information extracted by the FR module, which is used as the feature input for the other modules, such as tool selection and operation queuing. Another unit in the database stores data on the cutting or punching machine configuration, available punching tools, and fixtures; it is updated by an operator. The database stores process plans (including cutting and punching plans) generated by the process planner; a process is described by the part features and stamping operations, including task strategy, cutting conditions and tools. The database also stores past processes and long-term design, planning and stamping knowledge that is used by product designers, process planners and machine operators. The UML diagram shown in Figure 10 illustrates the
system model.
Each UML model consists of a number of UML class diagrams connected to each other to show the relationship between these classes. The
capabilities of machinery used in the manufacturing processes, particularly the machine type, supplier, capacity (general specifications e.g.
tonnage, operating speed, etc.) and the nature of jobs suitable for the machine are modelled in Figure 11. As a demonstration, Figure 12 shows the UML model of a feature. The two classes, namely peripheral features and inner-face features, are sub-classes of the class feature. The class feature is a sub-class of the class part and defines two types of part feature regions.
Figure 11. Capability UML model
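A minimal code rendering of the class relationships described for Figure 12 (a sketch only; the attribute names are hypothetical and not taken from the case study): Feature is modelled as a sub-class of Part, with PeripheralFeature and InnerFaceFeature as its two sub-classes.

from dataclasses import dataclass

@dataclass
class Part:
    part_id: str
    material: str = "sheet metal"              # hypothetical attribute

@dataclass
class Feature(Part):
    # The class Feature is a sub-class of Part, as in the UML model of Figure 12.
    geometry: str = ""                          # hypothetical attribute (e.g., extracted contour)

@dataclass
class PeripheralFeature(Feature):
    edge_length_mm: float = 0.0                 # hypothetical attribute

@dataclass
class InnerFaceFeature(Feature):
    hole_diameter_mm: float = 0.0               # hypothetical attribute

clip = PeripheralFeature(part_id="BPH-clip", geometry="outer contour", edge_length_mm=85.0)
print(type(clip).__mro__[1:4])                  # shows the Feature -> Part inheritance chain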
The typical manufacturing-based system illustrated provides information that is intermediate between the design (CAD) and the manufacturing (CAM) phases of production. The output display is in the form of CAD models for the parts being manufactured (Figure 13) as well as process sequence reports (Figure 14).
Using the Extensible Mark-up Language (XML) file format for the product/ part models is recommended since the files can be exchanged across
different user platforms. The design goals of XML emphasize simplicity, generality, and usability over the internet. It is a textual data format with
strong support via Unicode for the languages of the world. When other CAD applications are used as the working design environment, the
interface commands may need to be changed, since different CAD developers have their own programming interface functions. In order to avoid
this problem, instead of using CAD interface functions, a stand-alone feature modeller is required so that the feature-based systems become CAD
software independent.
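As a small illustration of neutral, platform-independent exchange (the element and attribute names below are hypothetical, not a published schema), a part model can be serialized to XML with Python's standard library and read back by any receiving application, independent of the CAD platform.

import xml.etree.ElementTree as ET

# Build a hypothetical XML description of a sheet metal part and its features.
part = ET.Element("part", id="BPH-clip", material="galvanised steel")
features = ET.SubElement(part, "features")
ET.SubElement(features, "feature", type="peripheral", operation="cut", length_mm="85")
ET.SubElement(features, "feature", type="inner-face", operation="punch", diameter_mm="6")

xml_text = ET.tostring(part, encoding="unicode")     # neutral text format for exchange
print(xml_text)

# The receiving side parses the same text back regardless of its own software platform.
for feat in ET.fromstring(xml_text).iter("feature"):
    print(feat.get("type"), feat.get("operation"))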
Solutions and Recommendations
There are several advantages brought to manufacturing industries by Big Data. In order to stay competitive, industries must continue to invest in
modern ICT infrastructure. The smarter manufacturers become, the more sense they can make of Big Data. Standardisation is also at the heart of a successful ICT strategy, since it brings interoperability and seamless integration across business logistics as well as manufacturing operations on the shop floor. Manufacturers must also adopt neutral data and file exchange formats such as XML. This enables manufacturing sites and clients in different geographical locations, and with different software applications, to share their data files without experiencing compatibility problems. The whole world is becoming increasingly digitalised, as witnessed by the growing number of people owning smart phones and other mobile gadgets. Whilst social media has been at the forefront of adopting the digital trend, the manufacturing industry should also make efforts towards occupying more space on smart-gadget platforms. Building smart factories, with smart machines and systems, provides a green solution to a world that now emphasises sustainability.
Whilst there is much to gain from highly networked smart systems, there are risks associated with them. Major security areas that companies need to be wary of are 1) authorisation and authentication, 2) role-based access control, 3) data validation and session management, 4) data integrity and confidentiality, 5) auditing and monitoring and 6) the trusted environment. Technology providers have developed cyber security standards to
address vulnerabilities and assessment programs to identify known vulnerabilities in their systems. Prevention, detection, response and recovery
are key issues of any security strategy. Another important aspect is reliability. Unreliable sensing, processing, and transmission can cause false
data reports, long delays, and data loss which could be catastrophic.
FUTURE RESEARCH DIRECTIONS
The demand for standard products with no variation or with limited variation is declining, causing stocks to pile up in retail centers, since customers are opting for products with their own quality specifications. Mass customization (MC), for a variety of reasons, is considered a promising approach for domestic manufacturers to maintain and grow their operations, as pointed out by Buehlmann (2005). MC is always achieved through make-to-order (MTO). According to Koren (2010), mass-customised production is in practice realised with flexible production systems which are able to deal with a variety of manufactured parts and adjustable assembly operations. In order to embrace MC, companies must determine the depth of customisation suitable for their level of technology.
One of the main challenges faced by the manufacturing industries is the rapid reconfiguration of manufacturing systems to handle rapid change in the business environment without human intervention. An important criterion for the manufacturing factory is the flexibility to produce multiple variations of customised products. The factory must be flexible enough to support different sequences of production as well as to allow changes in the production system for new product offerings (Leitão, 2004). Reconfiguring existing manufacturing systems requires detailed data about the capability and current state of each machine so that an optimal solution can be achieved (Alsafi & Vyatkin, 2014). In order to minimise ramp-up time during the reconfiguration process, there should be less manual involvement. Automating the whole reconfiguration process minimises the reconfiguration overheads and also enables the manufacturer to meet demand within a short time frame. Gwangwava et al. (2014) proposed fully automatic reconfiguration for reconfigurable systems (FARR). The process requires an agent-based approach using ontology knowledge of the manufacturing environment so as to make decisions.
In typical research undertaken at the authors' institution, customers enter their needs (customer requirements) through a web-based interface.
The data is fed into the company’s in-house system and analysed through a quality function deployment (QFD) based system. The system can be
used to generate completely new designs or to create a process plan for the customised order which will be used to reconfigure the existing
system or machine tool.
CONCLUSION
Big Data is a huge fortune to manufacturing industries. The use of business analytics and intelligent tools can mine huge data repositories
generated by manufacturing information systems and present reports which can be very handy in decision making. Customers across the globe are becoming ever more cost conscious, and only those companies that strive to provide cost-effective solutions will remain competitive. In order to make informed decisions, companies must be able to measure every detail of their processes. Cyber-Physical Systems (CPS) enable companies to build smart networks that enable them to deliver green solutions which are sustainable. The smart networks enable detailed process measurement. Although this generates huge quantities of data (Big Data), the analysis of the data brings more benefits. The current advancement in technology, particularly CPS, shows no sign of slowing down; hence manufacturers should adopt ICT strategies that will give them a leveraged advantage over their competitors. In today’s global village, collaboration across the whole supply chain is the way to go. Interoperability and seamless integration are the only way to gain clear visibility of all the downstream processes.
Manufacturers cannot ignore security in this era of Big Data. There should always be tight security measures so as to avoid catastrophic
occurrences. Investing in technology alone is not enough; manufacturers should also strive to develop manpower with balanced expertise in business intelligence and software development.
This work was previously published in the Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence edited by Noor Zaman, Mohamed Elhassan Seliaman, Mohd Fadzil Hassan, and Fausto Pedro Garcia Marquez, pages 266-288, copyright year 2015 by
Information Science Reference (an imprint of IGI Global).
REFERENCES
Buehlmann, U., & Bumgardnar, M. (2005). Evaluation of furniture retailer ordering decisions in the United States. Forintek Canada Corp.
Gifford, C. (2013). The MOM Chronicles: ISA-95 Best Practices Book 3.0 . International Society of Automation.
Gwangwava, N., Mpofu, K., Tlale, N., & Yu, Y. (2014). A methodology for design and reconfiguration of reconfigurable bending press machines
(RBPMs) . International Journal of Production Research , 52(20), 6019–6032. doi:10.1080/00207543.2014.904969
Koren, Y. (2010). The Global Manufacturing Revolution: Product-Process-Business Integration and Reconfigurable Systems . John Wiley & Sons
Inc.doi:10.1002/9780470618813
Lu, R., Li, X., Liang, X., Shen, X., & Lin, X. (2011). GRS: The Green, Reliability, and Security of Emerging Machine to Machine
Communications. IEEE Communications Magazine , 49(4), 28–35. doi:10.1109/MCOM.2011.5741143
Nah, F., Lau, J., & Kuang, J. (2001). Critical factors for successful implementation of enterprise systems. Business Process Management
Journal , 7(3), 285–297. doi:10.1108/14637150110392782
Rajkumar, R., Lee, I., Sha, L., & Stankovic, J. (2010). Cyber-Physical Systems: The Next Computing Revolution. InProceedings of the 47th Design
Automation Conference (pp. 731-736). New York: ACM Digital Library. Retrieved June 20, 2014, from http://dl.acm.org/citation.cfm?
id=1837461
Rumbaugh, J., Jacobson, I., & Booch, G. (2005). The Unified Modeling Language Reference Manual (2nd Ed.). Pearson Education Inc. Retrieved
August 10, 2013, from https://www.utdallas.edu/~chung/Fujitsu/UML_2.0/Rumbaugh--UML_2.0_Reference_CD.pdf
Saenz, B., Artiba, A., & Pellerin, R. (2009). Manufacturing Execution System- A Literature Review. Production Planning and Control , 20(6),
525–539. doi:10.1080/09537280902938613
ADDITIONAL READING
Ake, K., Clemons, J., Cubine, M., & Lilly, B. (2003). Information Technology for Manufacturing: Reducing Costs and Expanding Capabilities .
UK: CRC Press. doi:10.1201/9780203488713
Allen, R. D., Harding, J. A., & Newman, S. T. (2005). The application of STEP-NC using agent-based process planning.International Journal of
Production Research , 43(4), 655–670. doi:10.1080/00207540412331314406
STEP Application Handbook, ISO 10303, Version 3. (2006). SCRA. Retrieved August 15, 2014, from
http://www.engen.org.au/index_htm_files/STEP_application_hdbk_63006_BF.pdf
Cattrysse, D., Beullens, P., Collin, P., Duflou, J., & Oudheusden, D. V. (2006). Automatic Production Planning of Press Brakes for Sheet Metal
Bending. International Journal of Production Research , 44(20), 4311–4327. doi:10.1080/00207540600558031
Chang, K. H. (2014). Product Design Modeling using CAD/CAE: The Computer Aided Engineering Design Series (1st ed.). USA: Academic Press.
Chen, M., Wan, J., & Li, F. (2012). Machine-to-Machine Communications: Architectures, Standards, and Applications.Transactions on Internet
and Information Systems (Seoul) , 6(2), 480–497.
Ciurana, J., Ferrer, I., & Gao, J. X. (2006). Activity model and computer aided system for defining sheet metal process planning.Journal of
Materials Processing Technology , 173(2), 213–222. doi:10.1016/j.jmatprotec.2005.11.031
Ebner, G., & Bechtold, J. (2012). Are Manufacturing Companies Ready to Go Digital. Retrieved June 20, 2014, from
http://www.capgemini.com/resource-file-access/resource/pdf/Are_Manufacturing_Companies_Ready_to_Go_Digital_.pdf
Foundations for Innovation. (2013). Strategic R&D Opportunities for 21st Century Cyber-Physical Systems. Retrieved June 15, 2014, from
http://www.nist.gov/el/upload/12-Cyber-Physical-Systems020113_final.pdf
Garcia, F., Lanz, M., Järvenpää, E., & Tuokko, R. (2011). Process planning based on feature recognition method, Proceedings of IEEE
International Symposium on Assembly and Manufacturing (ISAM2011), ISBN 978-1-61284-343-8, 25-27 May, 2011, Tampere, Finland.
10.1109/ISAM.2011.5942296
Gifford, C. (2007). The Hitchhiker's Guide to Operations Management: ISA-95 Best Practices Book 1.0 . USA: ISA- Instrumentation, Systems,
and Automation Society.
Gonzalez, F., & Rosdop, P. (2004). General information model for representing machining features in CAPP systems. International Journal of
Production Research , 9(42), 1815–1842. doi:10.1080/00207540310001647587
Heng, S. (2014). Industry 4.0: Upgrading of Germany’s industrial capabilities on the horizon. “Deutsche Bank Research”. Retrieved June 18,
2014, from http://www.dbresearch.com/PROD/DBR_INTERNET_EN-
PROD/PROD0000000000333571/Industry+4_0%3A+Upgrading+of+Germany%E2%80%99s+industrial+capabilities+on+the+horizon.PDF
Lemaignan, S., Siadat, A., Dantan, J.-Y., & Semenenko, A. (2006). MASON: A Proposal For An Ontology Of Manufacturing Domain Distributed
Intelligent Systems. Proceedings of International IEEE Workshop on Distributed Intelligent Systems (DIS 2006), Collective Intelligence and Its
Applications. 195 –200.
Lohtander, M., Lanz, M., Varis, J., & Ollikainen, M. (2007). Breaking down the manufacturing process of sheet metal products into features.
ISSN 1392 - 1207. MECHANIKA , 2(64), 40–48.
Meyer, H., Fuchs, F., & Thiel, K. (2009). Manufacturing Execution Systems (MES): Optimal Design, Planning, and Deployment . USA: McGraw-
Hill Professional.
Muruganandam, S. (2011). Harnessing the Power of Big Data in Global Manufacturing. Retrieved June 24, 2014, from
https://www.eiseverywhere.com/file_uploads/83cea5eaa2393225cbd34d98a249cd74_BIAP_2011_Soma_M.pdf
Nedelcu, B. (2013). About Big Data and its Challenges and Benefits in Manufacturing. Database Systems Journal. , 4(3), 10–19.
Pedro, N. (2013). Off-line Programming and Simulation from CAD Drawings: Robot-Assisted Sheet Metal Bending. In proceeding of: Industrial
Electronics Society, IECON 2013 - 39th Annual Conference of the IEEE. Retrieved April 5, 2014, from
http://export.arxiv.org/ftp/arxiv/papers/1311/1311.4573.pdf
Russom, P. (2013). Operational Intelligence: Real-Time Business Analytics from Big data. Retrieved June 24, 2014, from
http://www.splunk.com/web_assets/pdfs/secure/Real-time_Business_Analytics_from_Big_Data.pdf
SAS Institute Inc. (2013). 2013 Big Data Survey Research Brief. Retrieved June 10, 2014, from
http://www.sas.com/resources/whitepaper/wp_58466.pdf
Dassault Systèmes. (2012). A Practical Guide to Big Data: Opportunities, Challenges & Tools. Retrieved June 24, 2014, from
http://www.3ds.com/fileadmin/PRODUCTS/EXALEAD/Documents/whitepapers/Practical-Guide-to-Big-Data-EN.pdf
Wan, J., Chen, M., Xia, F., Li, D., & Zhou, K. (2013). From Machine-to-Machine Communications towards Cyber-Physical Systems. Computer
Science and Information Systems, 10(3), 1105-1128. Retrieved June 5, 2014, from http://www.doiserbia.nb.rs/Article.aspx?id=1820-
02141300018W&AspxAutoDetectCookieSupport=1
Wiendahl, H. P., ElMaraghy, H. A., Nyhuis, P., Zäh, M. F., Wiendahl, H. H., Duffie, N., & Brieke, M. (2007). Changeable Manufacturing -
Classification, Design and Operation. Annals of the CIRP , 56(2), 783–809. doi:10.1016/j.cirp.2007.10.003
Xie, S. Q., & Xu, X. (2006). A STEP-compliant process planning system for sheet metal parts. International Journal of Computer Integrated
Manufacturing , 19(6), 627–638. doi:10.1080/09511920600623708
Zhao, Y., Kramer, T., Brown, R., & Xu, X. (2011). Information Modeling for Interoperable Dimensional Metrology. London: Springer.
KEY TERMS AND DEFINITIONS
Enterprise Resource Planning (ERP): Automation and integration of a company's core business to help them focus on effectiveness &
simplified success. ERP software applications can be used to manage product planning, parts purchasing, inventories, interacting with suppliers,
providing customer service, and tracking orders. ERP can also include application modules for the finance and human resources aspects of a
business.
Machine-to-Machine (M2M): A term used to describe any technology that enables networked devices to exchange information and perform
actions without the manual assistance of humans. M2M is considered an integral part of the Internet of Things (IoT) and brings several benefits
to industry and business in general as it has a wide range of applications such as industrial automation, logistics, Smart Grid, Smart Cities,
health, defence etc. mostly for monitoring but also for control purposes.
Manufacturing Execution Systems (MES): A control system for managing and monitoring work-in-process on a factory floor. An MES
keeps track of all manufacturing information in real time, receiving up-to-the-minute data from robots, machine monitors and employees.
Manufacturing Information Systems: A management information system designed specifically for use in a manufacturing environment.
The role of manufacturing information systems is to support manufacturing operations by providing relevant and timely information for decision
making at different levels of the company hierarchy. It also automates and secures the sequencing of manufacturing and business processes.
STEP/STEP-NC: STandard for the Exchange of Product model data, a comprehensive ISO standard (ISO 10303) that describes how to
represent and exchange digital product information. STEP is a means by which graphical information is shared among unlike computer systems
around the world. It is designed so that virtually all essential information about a product, not just CAD files, can be passed back and forth among
users.
Supply Chain Management (SCM): An integrated approach to planning, implementing and controlling the flow of information, materials
and services from raw material and component suppliers through the manufacturing of the finished product for ultimate distribution to the end
customer. It includes the systematic integration of processes for demand planning, customer relationship collaboration, order
fulfillment/delivery, product/service launch, manufacturing/operations planning and control, supplier relationship collaboration, life cycle
support, and reverse logistics and their associated risks.
Unified Modelling Language (UML): UML is a standard language for specifying, visualizing, constructing, and documenting the artefacts of
software systems. It is also used to model non software systems as well like process flow in a manufacturing facility.
XML: Stands for “Extensible Mark-up Language.” XML is used to define documents with a standard format that can be read by any XML-
compatible application. It is a file format-independent language, designed primarily to enable different types of computers to exchange text, data,
and graphics by allowing files to be shared, stored and accessed under different application programs and operating systems.
CHAPTER 8
Information Visualization and Policy Modeling
Kawa Nazemi
Fraunhofer Institute for Computer Graphics Research (IGD), Germany
Martin Steiger
Fraunhofer Institute for Computer Graphics Research (IGD), Germany
Dirk Burkhardt
Fraunhofer Institute for Computer Graphics Research (IGD), Germany
Jörn Kohlhammer
Fraunhofer Institute for Computer Graphics Research (IGD), Germany
ABSTRACT
Policy design requires the investigation of various data in several design steps for making the right decisions, validating, or monitoring the
political environment. The increasing amount of data is challenging for the stakeholders in this domain. One promising way to access the “big
data” is by abstracted visual patterns and pictures, as proposed by information visualization. This chapter introduces the main idea of
information visualization in policy modeling. First, abstracted steps of policy design are introduced that enable the identification of information visualization opportunities in the entire policy life-cycle. Thereafter, the foundations of information visualization are introduced based on an established reference model. The authors aim to amplify the incorporation of information visualization in the entire policy design process; therefore, the aspects of data and human interaction are introduced too. These foundations lead to the description of a conceptual design for social data visualization, in which the aspect of semantics plays an important role.
INTRODUCTION
The policy modeling process, and the policy lifecycle respectively, is characterized by decision making. The decision-making process involves various stakeholders that may have diverse roles in the policy-making process. The heterogeneity of the stakeholders and their “way of work” is a main challenge for providing technologies that support decision making, as well as technologies that involve the various stakeholders in the process. Stakeholders in this context may also be citizens, in which case the term “eParticipation” is often used. Information visualization
techniques provide helpful instruments for the various stages of decision making. To elaborate the different stages of policy making and the role of visualization in each stage, we have developed a three-stepped design process for the roles of visualizations in the policy modeling lifecycle (Kohlhammer et al. 2012). The model proposes the steps of information foraging, policy design and impact analysis, to which various visualization techniques can be applied. These steps are investigated in particular for the FUPOL project, where the information foraging stage covers the visual representation of various data and data formats to obtain a comprehensible and understandable view of the given masses of information without losing the context and the targeted task. The impact analysis step will use and cover both the outcomes of the simulation activities of FUPOL and the outcomes of the statistical data mining methods, to support both the active and passive involvement of citizens and to provide a kind of “public mood” about a certain topic.
For decision making in the policy life cycle, data, information, and knowledge are crucial resources. Besides storing, managing, and retrieving
data, one important factor is the access to the increasing amount of data. A promising discipline facing the information-access challenge by
investigating the areas of human perception, human-computer interaction, data mining, computer vision, etc. is information visualization. One
main goal of information visualization is the transformation of data into visual representations that provide insights (Keim et al. 2010) to users
and enable the acquisition of knowledge. The access to data is provided by interactive "pictures" of knowledge domains and enables solving
various knowledge- and information-related policy tasks. These "pictures" are generated through transformation and mapping of data (Card et
al. 1999) to visual variables (Bertin 1983) that are perceived by humans to solve tasks (Shneiderman 1996). Different approaches to creating this
"picture" of data provide various ways of perceiving visual representations of data and interacting with them. The most popular way is to first get
an overview of the entire domain knowledge in an abstracted way, followed by zooming and getting more detailed information about the
knowledge of interest (Shneiderman 1996). This top-down approach (the Information Seeking Mantra), proposed by Shneiderman (Shneiderman
1996), makes use of our natural interaction with the real world. Getting into a new situation forces us to build associations with known or similar
situations and to create an overview of the context. Further interactions in this situation are more goal-directed and detailed. The complementary
bottom-up approach presumes that we are able to verbalize a problem or direction. The visual representation is then generated from the results
of a search query. Based on the amount and complexity of the results, various visualizations may provide abstracted views or detailed visual
knowledge representations.
The process of information search can be further optimized by the technologies and methods of formalized semantics and ontologies, in
particular in context of the Semantic Web.
The Semantic Web targets a machine-readable annotation of data to provide "meaning" through defined and formalized relationships between
resources on the web (Kohlhammer 2005). While the Semantic Web focuses on machine-readability, information visualization focuses on the
maximization of our perceptual and cognitive abilities (Chen 2004).
In the context of information visualization, the aspects of data, users, and tasks are of great importance. When designing information
visualization tools, the question of which data should be presented to what kind of users for solving which tasks may guide an adequate design
process. In this context, recent research investigates in particular the feedback loop to the data in Visual Analytics, the model-based visual
knowledge representation in Semantics Visualization, and the reduction of users' cognitive complexity in Adaptive Information Visualizations
(AIV).
This chapter introduces information visualization as a solution for enabling human information access to the heterogeneous data that are
necessary during the policy modeling process. Therefore, we first identify the steps of policy design where information visualizations are
required, based on an established policy life-cycle model. Thereafter, a foundational overview of information visualization will be given,
investigating, besides visualization techniques, the entire spectrum from data to visualization. In this context, data and interaction methods will
be introduced. We will conclude this chapter with a conceptual example of visualizing social data in the domain of policy modeling.
ABSTRACT POLICY MODELING STEPS
Policies are usually defined as principles, rules, and statements that assist in decision-making and that guide the definition and adaptation of
procedures and processes. Typically, government entities or their representatives create public policies, which help to guide governmental
decision-making, legislative acts, and judicial decisions.
Some policy-modeling research emphasizes theoretical or formal modeling techniques for decision-making, whereas applied research
focuses on process-driven approaches. These approaches determine effective workflows through clearly defined processes whose performance is
then monitored (for example, as in business process modeling). This applied-research approach is widely seen as one way to effectively create,
monitor, and optimize policies. One aspect of process-driven policy making is the clear definition of the sequence of steps in the process. This
ensures the consideration of the most relevant issues that might affect a policy’s quality, which is directly linked to its effectiveness.
Ann Macintosh published a widely used policy-making life cycle; it comprises these steps (Macintosh 2004):
1. Agenda setting defines the need for a policy or a change to an existing policy and clarifies the problem that triggered the policy need or
change.
2. Analysis clarifies the challenges and opportunities in relation to the agenda. This step's goals are examining the evidence, gathering
knowledge, and drafting a policy document.
3. Policy creation aims to create a good workable policy document, taking into consideration a variety of mechanisms such as risk analysis or
pilot studies.
4. Policy implementation can involve the development of legislation, regulation, and so on.
5. Policy monitoring might involve evaluation and review of the policy in action.
The general process model of Macintosh was applied to identify the need for and advances of information visualization in the entire process
(Kohlhammer et al. 2012). Therefore, the model was abstracted to the highest level to identify general and abstract information visualization
steps: the need for a policy, the policy design, and the impacts of the designed policy, as shown in Figure 1.
For adopting visualization in policy making, we simplified the general model and introduced three iterative stages (Kohlhammer et al 2012):
1. Information Foraging: Supports policy definition. This stage requires visualization techniques that reveal relations between aspects
and circumstances, statistical information, and policy-related issues. Such visualized information enables an optimal analysis of the need for a
policy.
2. Policy Design: Visualizes the correlating topics and policy requirements to ensure a new or a revised functional interoperability of a
policy.
3. Impact Analysis: Evaluates the potential or actual impact and performance of a designed policy, which must be adequately visualized to
support the further policy improvement.
Figure 2. Mapping of the five policy steps to the simplified model of information visualization in the policy making process (adapted from Macintosh 2004 and Kohlhammer et al. 2012). (Own drawing).
All phases involve heterogeneous data sources to allow the analysis of various viewpoints, opinions, and possibilities. Without visualization and
interactive interfaces, handling of and access to such data is usually complex and overwhelming. The key is to provide information in a topic-
related, problem-specific way that lets policy makers better understand the problem and alternative solutions.
Today, many data sources support policy modeling. For example, linked open government data explicitly connects various policy-related data
sources. Linked data provides type-specific linking of information, which facilitates information exploration and guided search to get an
overview and a deeper understanding of a specific topic. Further data sources may be the massive and growing statistical data provided by
various institutions, including the EC.
Current policy modeling approaches do not make intensive use of visualizations, either for the general process or for the identified stages.
The gap between information need and information access can be efficiently closed via information visualization techniques. The next sections
will introduce some main aspects of information visualization independently of policy making and design. This should encourage actors in policy
design to consider information visualization as an instrument for the information provision process.
FOUNDATIONS OF INFORMATION VISUALIZATION
Model of Information Visualization
One of the most influential models in information visualization is the model of Card, Mackinlay, and Shneiderman. It is a data flow diagram that
models the data processing from its raw form into a visual representation. The visualization is described as a series of partly independent
transformations. Its main contribution is that the complexity of the visualization process is split into smaller sub-processes. This is why it still
serves as a basis for many visualization system architectures today. Usually, scientific contributions in the information visualization domain can
be mapped precisely onto particular parts of the pipeline. Another important aspect of their work is the idea of user interaction in the pipeline. A
visualization technique is not a static process. Every component along the data processing pipeline serves as a basis for process control
mechanisms.
The pipeline starts off with the transformation of the raw input data into data formats that are suitable for the visualization. This standardization
is necessary if more than one data source should be attached to the process or if a single data source is used for different visualization techniques.
This transformation aims at a data representation that is normalized in terms of content and structure so that the visualization can be decoupled
from the input data. This is an important strategy that permits adapting techniques to different scenarios and data sets. It might involve trivial
operations like converting one data format into another, but in many cases it is also necessary to identify and deal with incomplete, imprecise, or
erroneous data. Depending on the application, the outcome of this step is well-defined data for the visualization.
The second step in Card's visualization pipeline is the mapping of standardized, but still raw, data into the visual space. This mapping can be
considered the core transformation that forms the actual visualization. That is why the different visualization techniques can be differentiated
in this part of the pipeline. The visual space is described by a series of visual attributes which inherently represent the basic tools of the
visualization techniques. Ware identified several groups of these attributes: form, color, animation, and space (Ware 2013). While the second part
of the pipeline describes the transformation into the visual space, the third block is about transformations within the visual space, the view
transformation. In almost any case, the transformation also takes place within the value set of a single visual attribute. This includes, for example,
rotation, zoom, and other camera settings as well as modifications of the color map for an attribute.
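To make the three stages of the reference model concrete, the following minimal sketch (in Python, with purely hypothetical record and field names) chains a data transformation, a visual mapping, and a view transformation for a small table of records; it illustrates the pipeline idea and is not an implementation from the cited literature:

    # Minimal sketch of the three pipeline stages; all names and values are hypothetical.
    RAW = [{"country": "A", "year": "2010", "value": "3.2"},
           {"country": "B", "year": "2010", "value": None}]

    def data_transformation(raw):
        """Normalize raw records: drop incomplete rows and coerce types."""
        return [{"country": r["country"], "year": int(r["year"]), "value": float(r["value"])}
                for r in raw if r["value"] is not None]

    def visual_mapping(table):
        """Map data attributes to visual variables (here: position and color)."""
        return [{"x": row["year"], "y": row["value"], "color": "steelblue"} for row in table]

    def view_transformation(marks, zoom=1.0, pan=(0.0, 0.0)):
        """Transform within the visual space: pan and zoom the generated marks."""
        return [{**m, "x": (m["x"] + pan[0]) * zoom, "y": (m["y"] + pan[1]) * zoom} for m in marks]

    view = view_transformation(visual_mapping(data_transformation(RAW)), zoom=2.0)
    print(view)

User interaction would feed back into any of these three functions, which is exactly the control loop the model describes.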
Card's model of the visualization pipeline is also a model for a technical realization of visualization techniques and processes. Together with
Mackinlay and Shneiderman, he also developed a model for what they call the "knowledge crystallization process". Instead of describing the data
flow through the technical components, they model the path from input data to application-dependent, domain-specific knowledge. This
crystallization resembles the classification of analytic artifacts as done by Thomas and Cook (Thomas and Cook 2005). It models a cyclic process
that repeats steps such as:
• Instantiate schema.
• Problem-solve.
The proposed sequential cycle can be altered by several feedback and feed-forward loops, which are the main characteristic of this model.
Whether or not these loops are executed depends strongly on the application scenario. In most cases, human interaction is required whenever a
decision has to be made. In order to do that, the human must be able to judge the available results. This task can be performed through automatic
analysis if the judging process can be explicitly formalized.
The models of Card et al. and Thomas et al. complement one another in the sense that the model of the knowledge crystallization process is
independent of the technical realization. The single steps solely describe the way knowledge is gained and the tasks that are performed in each
step. The model of Thomas et al. remains valid even if the interactive visualization techniques are replaced by automatic analysis methods, as
done, for example, in data mining.
Thomas and Cook define the principle of knowledge crystallization as analytic deduction but focus on different aspects. Analytic artifacts appear
in knowledge crystallization only implicitly, whereas the transformation processes and their application are put in the foreground. In many cases,
the approaches to the theory and the models in information visualization can be assigned to one of two groups: "data-centered" and "decision- or
user-centered" tasks. They differ mainly by the information that is available in the design phase. Amar and Stasko (Amar and Stasko 2005) put
those two principles in juxtaposition in the context of information visualization. Visualization in data-centered approaches aims at a realistic
representation of data and its structure. In its most consequent form, this idea is completely independent of the human user and the tasks that
should be solved using that visualization. Its main goal is to create an identical replication of the input data in the mental model of the user.
Viewing the data is an elementary low-level process. It is supported through visualization, but it does not support the user in solving a high-level
task. According to Amar and Stasko, the static connection between analytic activities is based on the assumption that the aims of the user are also
formulated in a static and explicit manner. They find it necessary to link the user tasks on different abstraction layers, i.e., low-level and
high-level tasks, through information visualization.
In the following sections we will present two parts of the Card pipeline: the visual mappings and the interaction techniques. Mappings can be
partitioned into five different groups that map fundamentally different structures into the visual space. Interaction techniques can be roughly
classified by the part of the visualization pipeline they control. In this manner, the differentiation is performed through technical criteria.
However, it would also be possible to separate the visualizations by the task they support. Although many techniques are advertised through the
tasks they claim to solve, comprehensive studies that compare many different techniques are not yet available in the literature. Wherever
possible, we will present reviews as found in the literature and express our own opinion where appropriate.
DATA FOUNDATIONS
The information visualization model described in the previous part always starts with the transformation of data from its raw form.
Heterogeneous data types need to be investigated for the transformation process. Shneiderman (Shneiderman 1996) introduced a taxonomy of
data types, which distinguishes between one-, two-, and three-dimensional data, temporal and multidimensional data, and tree and network
data. We will shed light on these categories in this section of the chapter. Together with an independent taxonomy of analysis tasks,
Shneiderman also presented a matrix of visualization techniques, which provides solutions for specific tasks and data. It has to be stated,
however, that it is quite common that a given dataset falls into more than one of these categories of the taxonomy. The term "dimensionality"
may either refer to the dimension of the actual data or to the dimension of the display. In some cases, if the data set has a "native" dimensionality
(as is the case with most geospatial datasets), the preferred visualization techniques map this data onto its native space. Also note that most of
the visualization systems presented here employ one or more of the navigation and interaction concepts described elsewhere in this chapter,
without being mentioned explicitly. We make a clear distinction between publications introducing basic technology and visualization techniques
of the "second generation", in which most of these technologies are implemented as a quasi-standard and in nearly all cases used in combination.
The work of Keim (Keim 2002) gives a contemporary survey on the basis of Shneiderman's taxonomy.
One-Dimensional/Temporal Data
Tables with two columns are a typical example of one-dimensional datasets. If they contain at least one temporal component in their structure,
they are referred to as temporal datasets and form a special subclass of one-dimensional data. Shneiderman also includes textual documents,
program source code, lists, and all other kinds of sequentially arranged data in the category of one-dimensional data. Whether text documents
actually belong to this category depends on the perspective and task. If the central focus lies on the individual items in the sequence (as for
searching words in a document), the corresponding space is one-dimensional. If the focus lies on the sequence as a whole (as in document
analysis and classification), the data space actually is multidimensional. Given the usual complexity of input data sets, they do not fall into the
category of one-dimensional data alone. In this paragraph we present a number of visualization approaches which emphasize the temporal or
one-dimensional components of the datasets.
Havre et al. present a visualization technique called ThemeRiver as part of a document analysis of news reports (Havre et al. 2000). It maps the
change of headline stories in the news onto a time scale. The basis of this technique is the frequency with which a specific keyword appears in a
number of articles; it shows how specific themes may appear at the same time (though not at the granularity level of a single article). Card et al.
describe a type of visualization (Card et al. 2006) that also maps temporal data onto a single axis, a time-line. This visualization couples
temporal and hierarchical data. For the problem of mapping temporal data to a visual aspect that is neither a time-line nor an animation, no
convenient solution exists. In most cases, one of these variants is chosen, because they can be intuitively understood.
The work of Hochheiser and Shneiderman (Hochheiser and Shneiderman 2004) lies in the tradition of a number of tools which refine the
dynamic queries technique. As in the other visualization techniques, the temporal information is mapped onto the timeline. So-called TimeBoxes,
each covering a value interval and a temporal interval, are used to intuitively define a number of data filters that identify time series which share
a common behavior. TimeBox queries are combined to form conjunctive queries of arbitrary complexity. These techniques are conceptually not
restricted to temporal data. Every temporal dataset that is used in these techniques can be replaced with one-dimensional data of any other
(ordinal) type. Lin et al. give a survey on the different techniques for the analysis of the same kind of data, including TimeBox queries and
calendar-based visualization techniques. The authors also contribute VizTree, which interactively visualizes a similarity analysis in a number of
data graphs, producing similarity trees (Lin et al. 2005).
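As a purely illustrative sketch of the TimeBox idea (synthetic data, Python/NumPy assumed, not the original implementation), a box can be modeled as a pair of intervals in time and value, and a conjunctive query keeps only the time series that stay inside every box:

    import numpy as np

    rng = np.random.default_rng(0)
    series = rng.normal(size=(50, 100)).cumsum(axis=1)   # 50 synthetic time series of length 100

    def matches_timebox(ts, t_lo, t_hi, v_lo, v_hi):
        """True if the series stays within [v_lo, v_hi] during the interval [t_lo, t_hi]."""
        window = ts[t_lo:t_hi + 1]
        return bool(np.all((window >= v_lo) & (window <= v_hi)))

    def timebox_query(all_series, boxes):
        """Conjunctive query: a series is kept only if it satisfies every TimeBox."""
        return [i for i, ts in enumerate(all_series)
                if all(matches_timebox(ts, *box) for box in boxes)]

    # Two boxes: values in [-2, 2] during t = 0..10 and values in [0, 8] during t = 40..60.
    hits = timebox_query(series, [(0, 10, -2.0, 2.0), (40, 60, 0.0, 8.0)])
    print(f"{len(hits)} of {len(series)} series share the queried behavior")

Replacing the time axis by any other ordinal axis yields the generalization to non-temporal one-dimensional data mentioned above.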
Hao et al. (Hao et al. 2005) propose another combination of clustered/hierarchical data together with a large time-series data set. In their
application scenario, the time-series entities show intrinsic hierarchical relationships. This technique combines tree-map properties with the
ability to show the temporal development of stock-market prices. The hierarchical properties of the underlying data are used to match the level
of interest and importance in the layout.
The approach proposed by Voinea et al. (Voinea et al. 2005) in the field of collaborative document creation management deals with a completely
different kind of data. The authors focus on software development source code files, which require significantly different processing than plain
text documents. The creation process of the software is clearly separated into a one-dimensional aspect (the position of lines added to the source
code) and the temporal aspect (the development of the source over time), both of which are combined in a two-dimensional overview. Different
parts of the code can be identified by their author(s), as well as by stability and other aspects.
Two- and Three-Dimensional Data
The mapping of abstract two- and three-dimensional data has by far the longest tradition. All kinds of geospatial information visualization can be
identified as a mapping from data in a two-dimensional space (geographical maps) or three-dimensional space (a virtual model of our physical
world). Every atlas can be considered a collection of physical data and geographic metadata, which accounts for most of the earliest efforts in
actual information visualization. Embedding abstract data into a representation of our physical world is one of the most powerful metaphors,
because humans are attuned to organizing and arranging mental mappings by copying our physical world. Hence, many visualization techniques
for this embedding have been developed. Over the years, this concept evolved from plain satellite image visualization to a collaborative platform
for which the (virtual) world serves as a common frame of reference to contribute, search, and analyze large amounts of additional geographic
metadata. Not surprisingly, many visualization techniques have been developed that use this platform as a basis for their data (Chen and Zhu
2007). With a special focus on the spread of avian flu, Proulx et al. combine the embedding of spatial, temporal, and other metadata to actually
formulate and test hypotheses on the basis of "events" (Proulx et al. 2006). Events serve as metadata containers which are used to bind the
information to a place, time, etc.
One of the most prominent mappings of abstract data into two-dimensional space is the scatterplot technique, which appears in a large number
of variants (North 2000). Despite the fact that the native display space is only two-dimensional (although three-dimensional scatterplots exist),
scatterplots are often used in combination as scatterplot matrices or with other techniques for multidimensional data analysis. Scatterplots
work best for numerical data (which can be mapped onto the x and y coordinates, respectively) and are of limited use for conveying purely
semantic information. Because of their simple metaphor (points in n-dimensional space become points in 2-dimensional space), they are most
conveniently used to visualize projections between the n-dimensional data space and the display space, which usually are supported by
numerical methods such as factor analysis, matrix decomposition, and similar methods.
One field of two- and three-dimensional mappings has been left out on purpose: scientific visualization, which is separated from information
visualization by the data that is displayed. By definition, it deals with physical data which inherently lies in physical space rather than abstract
information and metadata. Consequently, the techniques of scientific visualization are out of scope.
Multidimensional Data
Most of the techniques presented here involve data which covers more than three independent dimensions. Visualization techniques for
multidimensional (or multivariate) data explicitly address the problem of visualizing and identifying inherent dependencies in the datasets which
cannot be expressed by simple correlations. Hidden relations may incorporate ten or more dimensions of data, and one of the major goals in all
of these techniques is to display a sufficiently large number of dimensions in (2-dimensional) screen space to make these correlations visible.
Defining "visual data mining" as a concept, the work of Keim gives a survey on a number of visualization techniques for multi-dimensional
databases (Keim 1996). Aside from graph-based visualizations for networks and hierarchies, two classes of techniques evolved over the years to
become prominent representatives for the visualization of multi-dimensional data: the first one is the so-called parallel coordinates technique,
the other one falls into the category of pixel-oriented layouts. It has to be noted that all of these techniques virtually never appear in their "pure"
(i.e., conceptual) form. Most of the recent frameworks and techniques derive their improvements from an adequate combination of different
basic techniques – in some cases in the same display. This holds true especially for glyphs, which also constitute a group of multidimensional
visualization techniques but which refer not to the layout (i.e., the positioning of visual objects in screen space) but to the appearance of objects.
Basically, every single visual object that conveys more information than its position can be considered a glyph.
The parallel coordinates technique, as the name suggests, has all axes in the display arranged in a row of parallel lines. Basically, this technique
can be used for nominal, ordinal, or numerical axes, but it works best for ordinal and numerical data. A "point" in the n-dimensional space is
drawn as a poly-line connecting the (coordinate) values on every axis. While the basic idea is relatively old, contemporary studies on parallel
coordinates emphasize their use for the analysis of datasets (Siirtola 2000). In many cases, this technique is tightly coupled with the generation of
dynamic queries. Such combinations illustrate the identification of data clusters by visual/manual methods (Siirtola 2000) and a method to
display the data at different structural levels (Fua et al. 1999).
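A minimal parallel-coordinates sketch, assuming synthetic data and hypothetical attribute names, can be produced with the pandas plotting helper; every data element becomes one polyline across all axes:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates

    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.normal(size=(60, 4)),
                      columns=["indicator_a", "indicator_b", "indicator_c", "indicator_d"])
    df["region"] = np.where(df["indicator_a"] > 0, "north", "south")
    # Shift one group so that a cluster becomes visible as a bundle of similar polylines.
    df.loc[df["region"] == "north", ["indicator_b", "indicator_c"]] += 2.0

    ax = parallel_coordinates(df, class_column="region", colormap="coolwarm", alpha=0.6)
    ax.set_title("Parallel coordinates: one polyline per data element")
    plt.show()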
Complementary to that, the general idea of pixel-based methods is to use the screen space in the most efficient way possible: every pixel in the
display area is used to convey different information, and the use of "non-data-ink" is reduced to a minimum. Pixel-based techniques must cope
with the layout problem of an adequate mapping of the (multidimensional) data space onto the screen space. In many cases there is no strict
correspondence between the similarity of the data items and their distance. Keim (Keim 1995, Keim 1996, and Keim 2000) provides a good
overview of the general idea of these techniques.
Tree and Network Data
Graph visualization has become an important topic in the information visualization area over the past years. The display of networks helps to
analyze relationships between entities rather than the entities themselves. Graph visualization is used in many different application areas. For
example, the site maps of web sites as well as the browsing history of a web browser can be displayed as a directed graph. In biology and
chemistry, graphs are applied to evolutionary trees, molecule structures, chemical reactions, or biochemical pathways. In computing, data flow
diagrams, subroutine-call graphs, entity relationship diagrams (e.g., UML and database structures), semantic networks, and knowledge-
representation diagrams are the main application fields. Furthermore, document management systems profit from document structure and
relationship visualization. Social network visualization has also become a popular application of graph visualization methods.
The key issues in graph visualization are the graph structure (directed vs. undirected graphs, trees vs. cyclic graphs) and the graph size. A survey
of graph visualization techniques for different graph types can be found, for example, in the work of Herman (Herman et al. 2000). The graph
display is driven by its layout, and there are different graph layout techniques suited for different graph types.
For trees (graphs in which any two vertices are connected by exactly one path), the classic layouts position child nodes "below" their
common ancestor (Reingold and Tilford 1981); in 3D, a cone layout is used (Robertson et al. 1991). For large graphs, the high degree of node and
link overplotting requires new visualization and clustering techniques, for example, 3D hyperbolic space layouts (Munzner 1997) or treemaps
(van Wijk et al. 1999).
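The classic layered layout for trees can be sketched in a few lines; the following illustration (hypothetical tree, Python/Matplotlib assumed) places leaves on successive x-slots and centers every parent above its children, which is the basic idea that the Reingold-Tilford algorithm refines with additional compactness rules:

    # Minimal layered tree layout (hypothetical tree); leaves get successive x-slots,
    # every parent is centered above its children one level further up.
    from itertools import count
    import matplotlib.pyplot as plt

    tree = {"root": ["a", "b"], "a": ["a1", "a2", "a3"], "b": ["b1", "b2"],
            "a1": [], "a2": [], "a3": [], "b1": [], "b2": []}

    def layered_layout(tree, root):
        pos, slot = {}, count()
        def place(node, depth):
            children = tree[node]
            if not children:                       # leaf: next free horizontal slot
                pos[node] = (next(slot), -depth)
            else:                                  # inner node: mean x of its children
                for child in children:
                    place(child, depth + 1)
                xs = [pos[c][0] for c in children]
                pos[node] = (sum(xs) / len(xs), -depth)
        place(root, 0)
        return pos

    pos = layered_layout(tree, "root")
    fig, ax = plt.subplots()
    for parent, children in tree.items():          # draw the links first
        for child in children:
            ax.plot(*zip(pos[parent], pos[child]), color="gray")
    for node, (x, y) in pos.items():               # then the labeled nodes
        ax.text(x, y, node, ha="center", va="center",
                bbox=dict(boxstyle="round", fc="lightyellow"))
    ax.axis("off")
    plt.show()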
VISUALIZATION TECHNIQUES
As described in the previous section, a visual mapping is a transformation of the data flow performed by visual techniques and will be used for
their classification. It is important to note here that visualization techniques almost never contain a visual mapping in pure form. Especially
newer techniques are often combinations of older approaches. Some of them are explicitly mentioned in a separate sub-chapter at the end of this
chapter.
Each of the following sub-chapters presents a category of visualization and interaction techniques with a focus on newer approaches. The
classification we performed is similar to the multi-dimensional visualization technique classification done by Keim (Keim 2000), which we
extend with a class that deals with projection methods. Whenever possible, the techniques are presented independently of their application
domain. Where ratings of a technique are provided, these are typically related to the technique's ability to solve a particular task rather than the
type of data it displays.
Instead of describing iconic data like Keim does, we focus on projection methods, because they are tightly coupled with methods from data
mining. Moreover, the class of pure iconic techniques has lost importance during the past couple of years. Today, the results of this domain are
reused particularly in glyph-based designs. Glyphs are singular symbols for data objects that represent one or more attributes.
Keim provides a classification survey of visualization techniques combined with a comparison regarding different characteristics of the data, the
tasks, and inherent properties of the visualization itself.
The survey can be separated into three independent groups of characteristics: task-related, data-related, and visualization-related. The
associated questions are: Which tasks can be solved? What kind of data is suitable? What are the inherent properties of the technique?
Keim starts his evaluation by testing task-related capabilities of the techniques. The first task is the support for cluster identification in a dataset
("clustering") and describing the distribution/cumulation of points in high-dimensional space ("multivariate hotspots"). Data-related capabilities
comprise the number of attributes, the number of data objects, and the possibility to faithfully map nominal scales ("categorical data"). Among
others, the inherent properties of the technique comprise the effective use of the available space, measured through the overlapping area of the
visual items ("visual overlap"). The last criterion is the experienced difficulty of learning a technique ("learning curve").
Geometric Methods
Every visualization technique that maps a data element directly onto a visual attribute that is more complex than a single pixel (e.g., lines,
glyphs, etc.) belongs to the group of geometric methods. This group is highly heterogeneous and contains many hybrids that also belong to the
class of projection methods. It comprises most of the classical diagrams such as star plots, pie charts, bar charts, line charts, and histograms, as
well as geographic maps, parallel coordinates, scatterplots, and scatterplot matrices. As an example, scatterplots can also be considered a
projection method.
One of the most important visualization techniques is the line chart. Line charts display one-dimensional functions like time series in many
application areas. Hochheiser and Shneiderman (Hochheiser and Shneiderman 2004) present a TimeBox widget that allows for interactive
selection and dynamic filtering of the displayed data sets. It is based on the older "Dynamic Queries" technique combined with a new
visualization approach. The user defines a box selection implicitly through one or more intervals of attribute values that are mapped to the x- or
y-axis in the display. These intervals define the data sets that lie completely within them.
Equally important are geo-related data mappings. Every atlas can be seen as a collection of geo-data and geographic metadata. Embedding
abstract information in a geographic representation is one of the most powerful metaphors possible, because the reference to a location is one of
the most important relations people use to organize information. Proulx et al. (Proulx 2007) display geo-data together with a time-based
mapping in order to combine the two natural reference frames (space and time).
An interesting combination of techniques is presented by Bendix et al. (Bendix et al. 2005). It has been chosen because it combines two of the
most popular techniques – parallel coordinates and dynamic queries. Compared to most other techniques, the parallel coordinates approach
excels in that the number of attributes is only limited by the amount of available screen space. Every attribute is mapped onto its own axis, which
is parallel to every other axis. One element of a data set is thus represented as a polyline that intersects each axis at the point that represents the
value of the respective attribute. Data clusters and correlations can be easily identified if the attributes are adjacent. Bendix et al. put their focus
on the search for describing expressions rather than the data set itself. This search for expressions is, apart from the search for patterns, a major
aspect in visual data analysis. Technically speaking, they deal with the mapping of nominal data types. As these do not have a natural ordering,
they display the relations between different classes instead of the data set itself (Bendix et al. 2005).
Pixel-Based Techniques
A visualization technique belongs to the group of pixel-based methods if the visual attributes used comprise only the position and color of a
single pixel. Consequently, every pixel represents a data element, which permits displaying a maximum number of data elements at the same
time. Pixel-based methods impose two design problems. First, the value set of an attribute must be mapped to the range of available colors, but
this is a problem that persists in most visualization techniques (Wijffel 2008). The second problem is about arranging the pixels related to the
data set. The visualization can be seen as a function that maps values from the high-dimensional space onto the 2D screen.
A definition of pixel-based methods and a more formal description can be found in the work of Keim (Keim 2000). The function that maps
data elements into the visual space can be seen as the result of an optimization process. Assuming that the data set is ordered, this optimization
must ensure that the one-dimensional ordering is also kept in the two-dimensional display. Equally important is the selection of a display area
that ensures that the average distance between pixels belonging to the same dataset is minimal. The purpose of that is to aid the user in finding
relations between different attributes in a data set.
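A deliberately simple pixel-oriented sketch (synthetic data, Python/Matplotlib assumed) wraps a long, ordered one-dimensional attribute row by row onto a pixel grid and encodes each value only by color; real pixel-based techniques use more elaborate space-filling arrangements that better preserve the ordering:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    values = np.sin(np.linspace(0, 20 * np.pi, 10_000)) + 0.3 * rng.normal(size=10_000)

    width = 100                                       # pixels per display row
    grid = values[: (len(values) // width) * width].reshape(-1, width)

    plt.imshow(grid, aspect="auto", cmap="viridis", interpolation="nearest")
    plt.colorbar(label="attribute value")             # color carries all the information
    plt.title("One pixel per data element, ordering kept row by row")
    plt.show()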
May et al. present a visualization technique that maps multiple attributes onto the same display. Every single pixel stands for a range of values
that covers several data objects at the same time. The aggregation of the data values defines the final pixel color (May et al. 2008). In contrast to
many other techniques, the interesting information is hereby contained in frequencies. Pixels that relate to similar value sets can be, but do not
need to be, contiguous. Repetitions at well-defined horizontal or vertical distances also indicate correlations. Human perception is able to detect
patterns in complex structures even if the data is distorted by noise. While pattern detection is easy, interpreting their meaning is often
challenging.
Figure 3. Pixel-based visualization from May et al. 2008 (with permission).
Pixel-based techniques are often suitable for explorative analysis of patterns and distinctive features. Displaying previously found relations is a
different task that is usually performed by different visualizations. More formally, the data model that describes the input data structure is linked
but not equal to the analytic model that describes relations in the data set. Accordingly, different tasks often require different perspectives.
Hierarchies and Trees
Trees describe binary relations between distinguishable elements of a finite set. Most visualization approaches expose the hierarchy as the
dominant structure even though several other attributes of the elements are present in the visualization. As the hierarchy does not impose a
particular spatial structure, visualization techniques can be separated into two distinct groups. The first group deals with the design of visual
mappings, i.e., the selection of attributes and metaphors for the display of elements and their connections. The element position in the 2D space
does not play a major role for them. The second group is dedicated to different layout algorithms that map the elements according to one or
more properties into the visual space.
Keim et al. present two space-filling methods that display hierarchies in different manners (Hao et al. 2005; Mansmann 2007). The first one
displays child nodes in their own separate space, whereas the latter uses – similar to a treemap – the space of the parent node. Among others, the
importance of leaves compared to inner nodes has an influence on which of the two methods makes more sense. The treemap puts the focus on
the leaves of the tree. In contrast, the hierarchical layout highlights nodes that are close to the root node and less dominant in the treemap.
In both cases the nodes are displayed as simple rectangles, which leaves room to show additional information. They can be used as a basis for a
visualization of their own. The only restriction is that the amount of available screen space is defined by the tree layout. However, practically all
visualizations for trees and graphs have in common that their ability to query and to display details is rather limited and often insufficient. This
is why they are often combined with other methods, e.g., graph visualizations. Holten (Holten 2006) gives an example of such a combination. A
node-link diagram is shown on top of a hierarchy with different aspects of the data. The edges between nodes are gathered in bundles in order to
reduce the overdrawing and thus increase the readability of the graph.
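The space-filling idea behind treemaps can be illustrated with a small slice-and-dice sketch (hypothetical two-level hierarchy with weighted leaves, Python/Matplotlib assumed); the leaf rectangles fill the space of their parent, which is why the focus lies on the leaves:

    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle

    hierarchy = {"budget A": {"a1": 4, "a2": 2, "a3": 1}, "budget B": {"b1": 3, "b2": 3}}

    fig, ax = plt.subplots()
    x, total = 0.0, sum(sum(leaves.values()) for leaves in hierarchy.values())
    for parent, leaves in hierarchy.items():
        parent_w = sum(leaves.values()) / total        # horizontal slice per parent node
        y = 0.0
        for leaf, weight in leaves.items():
            leaf_h = weight / sum(leaves.values())     # vertical dice inside the parent
            ax.add_patch(Rectangle((x, y), parent_w, leaf_h, ec="white", fc="steelblue"))
            ax.text(x + parent_w / 2, y + leaf_h / 2, leaf, ha="center", va="center", color="white")
            y += leaf_h
        ax.text(x + parent_w / 2, 1.02, parent, ha="center")
        x += parent_w
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1.05)
    ax.axis("off")
    plt.show()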
A simple variation of node-link diagrams is the traditional dendrogram. It is characterized by the fact that all nodes of a hierarchy level lie on the
same line. This significantly improves the visual arrangement of the tree. The simplicity of the structure and the display allows more complex
information presentation. Up to a certain point, it is possible to create abstractions of the components and use more or less independent
techniques to display nodes, edges, and the structure itself. The arising number of combinations is thus a source of new designs even without
fundamental novelties.
Facing aesthetic, scientific, and task-related aspects, designs tend to become overly complex, which conflicts with the user's need for easy-to-
understand interfaces. A good visualization provides the relevant information at first sight without the need for the user to actively search for it.
This conflict has been actively discussed in the scientific community in the past years (Lorensen 2004; van Wijk 2005). The task defines which
data should be displayed, but it inherently also defines which data should be hidden from the user. The data types impose a natural limitation on
the repertoire of visual mappings. Today, scientists debate the basic properties of visual mappings that are required to support specific tasks with
specific data sets in an adequate manner.
Graphs and Networks
Even though trees are only a specific subgroup of graphs, they are typically depicted by very different techniques. Visualizations for trees exploit
their simple structure, especially the fact that they typically describe orderings. Compared to that, the placement of nodes in an arbitrary graph
layout that fulfills certain optimality constraints is more complex, or, mathematically speaking, NP-hard (Brandes et al. 2003).
Most graph visualizations are variations of node-link diagrams. Some examples have already been given in the previous sub-chapter. As with
trees and hierarchies, the publications can be split into two categories: the graph layout on one side and the visualization of nodes and edges on
the other side. The quality of a layout is measured by different criteria which often impose conflicting constraints. It is, for example, desirable to
be able to see the most significant structures and clusters. But it is also desirable to minimize the spatial distance of related partitions. This makes
it per se difficult to find a layout that is optimal for all demands.
Technically, the layout is often computed by mass-spring simulations, so-called "spring embedders". They model the optimality criteria as an
energy function. The simulation then tries to find a global minimum of that function. In a mathematical sense, layout algorithms are related to
non-linear or locally linear projection methods.
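A small spring-embedder sketch, assuming the NetworkX library and one of its bundled example graphs; spring_layout runs a force-directed simulation that iteratively reduces such an energy function before the node-link diagram is drawn:

    import networkx as nx
    import matplotlib.pyplot as plt

    graph = nx.les_miserables_graph()               # bundled co-occurrence example network
    pos = nx.spring_layout(graph, seed=42, k=0.3)   # force-directed layout; k tunes edge length

    nx.draw_networkx_edges(graph, pos, alpha=0.2)
    nx.draw_networkx_nodes(graph, pos, node_size=30)
    plt.axis("off")
    plt.show()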
One fundamental problem in graph visualization is the sheer number of nodes many datasets contain. The number of nodes that can be displayed
on the screen is rather limited. Considering that the focus of the user is either on the global structure or on a particular group of nodes, it often
makes sense to hide a large part of the data set. Balzer and Deussen (Balzer and Deussen 2007) create a visual abstraction on the basis of an
existing node hierarchy. Such a hierarchy can be generated, for example, by hierarchical clustering algorithms. The nodes and the edges of a
cluster are then combined into one single graphical element. A variation of this has been presented by Henry et al. (Henry 2007), who model this
graphical element as an adjacency matrix. Their main contribution, however, is to provide interaction tools for the user.
A system that is dedicated to navigation in large graphs has been developed by Abello et al. (Abello et al. 2006). The basis for that is again a given
node hierarchy. It is used to display an overview of the graph that is used for navigation. At the same time, it acts as a filter for the nodes that are
displayed in a detailed view. Depending on the level of detail, sub-trees are expanded or collapsed.
Van Ham (van Ham 2009) faced the same problem from the opposite side. Based on an initial node pick, only a small region around a focus node
is displayed. This idea has been picked up by May et al. (May et al. 2012), whose system allows for more than just one focus node. It also adds
landmarks as graphical cues to give information on the context of the visible sub-graph. The arrows point along the shortest path to regions in
the graph that might be worth exploring.
Figure 4. Signposts for navigation in large graphs (from May et al. 2012, with permission).
Many combinations of techniques for the graph structure and the detail view are possible. Displaying details in the graph makes sense only if the
information can be classified and processed at first sight, for example through a mapping onto a color scale.
The number of currently available visualizations already indicates that there is no single best visualization, neither for the graph layout nor for
displaying nodes and edges. The complexity of network graphs is often distributed over many structural levels. Many techniques assume that the
graph has an inherent hierarchy. They exploit that by computing and using hierarchical structures for the display. Even if a visualization
technique is able to switch between different levels in the hierarchy, it is probably not able to display all levels of the structure at the same time.
This does not work, because the user's visual ability to focus is limited to one or two levels. The essential task of graph visualization is thus to
display one structural level as well as possible and to support user-controlled switches between different levels if necessary.
Projection Methods
This part of the chapter deals with projection methods. They project the data space onto the 2D visual space. This transformation is performed
prior to the visual mapping. In principle, projection methods can be compared to methods from the data-mining domain, even if the projections
are of higher degree. The data space describes the set of all possible combinations of the different data set attributes. Every element is
represented by one point in this space. The projection tries to map the information that is inherent in this high-dimensional space into 2D. As
with graphs, the focus is on the distribution rather than on an accurate representation of single data elements.
Scatterplots are projection methods that are rather easy to understand. Basically, two attributes, typically numeric scales, are mapped onto the
vertical and horizontal axes of a diagram. The main advantage compared to other techniques is their simplicity and the fact that most users know
the concept already from math courses in school. The drawback is that only two attributes can be compared at the same time, as the projection is
linear along the axes of the coordinate system. Elmquist et al. overcome this limitation with a scatterplot matrix, which displays all possible
scatterplots for a given number of attributes of a dataset in a matrix (Elmquist et al. 2008). Every entry in that matrix is a miniaturized
scatterplot. These small scatterplots give a first idea if and how two attributes are linked. The matrix display provides an overview and helps the
user to find interesting attribute combinations, but it also addresses the coherence problem of the scatterplot: modifying parameters (in this case
the selected axes) modifies the user's perspective in a way that the user cannot comprehend. The display before and after the modification differ
so much that the user is not able to recognize the influence of the modified parameter. Animated transitions between those settings are an often
used strategy to fight that problem.
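A scatterplot-matrix sketch on synthetic data with hypothetical attribute names (pandas/Matplotlib assumed); every cell is a miniaturized scatterplot of one attribute pair and gives a first idea of whether and how the two attributes are linked:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    rng = np.random.default_rng(3)
    base = rng.normal(size=200)
    df = pd.DataFrame({"gdp": base,
                       "employment": 0.8 * base + 0.3 * rng.normal(size=200),
                       "age": rng.normal(size=200),
                       "turnout": rng.uniform(size=200)})

    scatter_matrix(df, diagonal="hist", alpha=0.5, figsize=(6, 6))
    plt.suptitle("Scatterplot matrix: all pairwise attribute combinations")
    plt.show()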
A linear projection can be described as an optimization process that tries to find an optimal direction. Like most optimization methods, Principal
Component Analysis (PCA) (Müller 2006) optimizes an objective function; for PCA, this objective describes the variance of the points along an
arbitrary axis in space. Linear projections hide all information along the projection axis, but highlight structures that are orthogonal to that axis.
If a dataset contains structures that become manifest along several (in the worst case perpendicular) axes, linear projections fail to display the
dataset properly.
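A minimal PCA sketch with plain NumPy on synthetic data: the directions of largest variance are obtained from the SVD of the centered data matrix, and all points are projected onto the first two components for a 2D display (an illustration of the principle, not the cited implementation):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    data = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))   # correlated 8-D data

    centered = data - data.mean(axis=0)                # PCA works on mean-centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projection = centered @ vt[:2].T                   # coordinates on the first two components

    plt.scatter(projection[:, 0], projection[:, 1], s=10)
    plt.xlabel("1st principal component")
    plt.ylabel("2nd principal component")
    plt.show()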
Schreck et al. present a projection method that is based on self-organizing maps (sometimes also referred to as Kohonen maps, named after
Teuvo Kohonen) (Schreck et al. 2008). As the name already suggests, these maps are self-organizing neural networks that map the high-
dimensional attribute space into the two-dimensional display space. In contrast to other methods, the display space is discrete rather than
continuous. Every discrete element corresponds to a class, and every data element is assigned to exactly one of the classes. Every class contains
one element that represents the class as a whole. The classes can then be put in relation with each other in terms of similarity, or simply spoken,
similar classes lie close to each other in the map.
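The following sketch trains a very small self-organizing map with plain NumPy on synthetic data (all sizes, rates, and schedules are hypothetical); every grid cell holds one prototype vector that represents its class as a whole, and after training every data element is assigned to exactly one cell:

    import numpy as np

    rng = np.random.default_rng(5)
    data = rng.normal(size=(500, 6))                       # hypothetical 6-D data
    rows, cols, dim = 8, 8, data.shape[1]
    weights = rng.normal(size=(rows, cols, dim))           # one prototype vector per map cell
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    for step, x in enumerate(data[rng.integers(0, len(data), 2000)]):
        lr = 0.5 * np.exp(-step / 1000)                    # decaying learning rate
        radius = 3.0 * np.exp(-step / 1000)                # decaying neighborhood radius
        # Best matching unit: the cell whose prototype is closest to the sample.
        bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=2)), (rows, cols))
        dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)
        influence = np.exp(-dist2 / (2 * radius ** 2))[:, :, None]
        weights += lr * influence * (x - weights)          # pull the neighborhood toward the sample

    # Each data element now belongs to exactly one discrete cell of the map.
    cells = [np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=2)), (rows, cols)) for x in data]
    print("first five assignments:", cells[:5])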
With the exception of scatterplots, all linear projection methods work with numerical data only. Non-linear projection methods are able to work
with other data types if the spatial distance between two data elements is metrically defined.
Above all, projections describe the data distribution in a multi-dimensional space. As a result, the points are mapped so that elements that are
close in the data space are also close in the 2D space. Thus, these methods are particularly useful for clustering, similarity detection and outlier
detection.
VISUAL INTERACTION
Many different information visualization techniques for interaction and navigation within the abstract data space exist. Hearst considers the
following as the most important ones: brushing and linking, panning and zooming, focus and context, magic lenses, animation, and, as an
additional combination, overview plus detail (Hearst 1999). These techniques can be seen as the fundamentals (together with the visualization
metaphors) for the design and implementation of visualization techniques.
Brushing and Linking
The interaction technique "brushing and linking" describes a connection between two or more views of the same data, based on a user-defined
selection. Selecting a certain representation in one view affects the representation in the other views as well. This requires that the raw data is
mapped not only to one view at a time, but to several views. More specifically, brushing refers to the idea that the user picks a subset of the
original data, whereas linking refers to the visual highlighting in different complementary views. This highlighting can occur in a number of
forms. They all have in common that the selected item(s) can be distinguished in an intuitive way from the unselected items. This naturally limits
the number of scalar dimensions which can be used in the same display. The work of Ware gives an overview of visualization features and
presents how different visualizations can be used to judge whether groups of objects belong together or not (Ware 2013). The basic feature
classes presented are form, color, motion, and spatial position. His work on preattentive perception gives important information about which
types of features can be used with each other, and which types of features should not be used for different information. Examples include using a
different color, font, background, or symbol, and adding additional labels for highlighted items (Eick and Karr 2000; Wills 1995). Depending on
the source, the brushing and linking technique is either considered a change of the visual mapping (Hearst 1999) or a technique which modifies
the data transformation (Card et al. 1999). Most importantly, every visual mapping is required to provide an inverse mapping, by which visual
structures can be remapped to a common data reference.
An example of a system implementing brushing and linking for the visualization of search results is the INQUERY-based 3D visualization by
Allan (Allan 1997).
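A brushing-and-linking sketch with Matplotlib on hypothetical data: a rectangular brush in the left view selects a subset of the data, and the same items are highlighted in a complementary right view that shows two other attributes of the same elements:

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.widgets import RectangleSelector

    rng = np.random.default_rng(6)
    data = rng.normal(size=(200, 4))                    # hypothetical attributes a, b, c, d

    fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(9, 4))
    left = ax_left.scatter(data[:, 0], data[:, 1], color="lightgray")
    right = ax_right.scatter(data[:, 2], data[:, 3], color="lightgray")

    def on_brush(eclick, erelease):
        """Brush: pick the subset inside the rectangle; link: recolor both views."""
        x0, x1 = sorted((eclick.xdata, erelease.xdata))
        y0, y1 = sorted((eclick.ydata, erelease.ydata))
        selected = ((data[:, 0] >= x0) & (data[:, 0] <= x1) &
                    (data[:, 1] >= y0) & (data[:, 1] <= y1))
        colors = np.where(selected, "crimson", "lightgray").tolist()
        left.set_color(colors)
        right.set_color(colors)
        fig.canvas.draw_idle()

    brush = RectangleSelector(ax_left, on_brush, useblit=True)
    plt.show()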
Panning and Zooming
The view transformation from visual structures to views is often controlled by panning and zooming operations. Changing the viewpoint of the
user alters the displayed portion of the visual structures. Hearst uses the metaphor of a movie camera (Hearst 1999). Card et al. use the term
"panning and zooming" in their listing of interaction techniques (Card et al. 1999); their equivalent is camera movement and zoom. In contrast
to simple panning, camera movement includes the third dimension when dealing with three-dimensional visualizations. In both cases, zooming
includes possible changes of the level of detail displayed when changing the zoom factor – the virtual distance to an object of interest. An
interesting contribution on zooming is the "single-axis-at-a-time zooming" discussed by Jog and Shneiderman (1995). While normal zooming can
be explained using a camera metaphor, this fails to work when only the scale of one of the axes is changed.
The camera metaphor for movement in virtual (3D) space is better known from virtual worlds and games. However, a classical example of a
system implementing panning and zooming for the visualization of browsing and searching is Pad++ (Bederson et al. 1996). One of the central
characteristics of this system is the fact that scale is added as a first-class parameter to all items displayed. In addition to implementing simple
panning and zooming, Pad++ goes far beyond this interface technique. It also offers focus-plus-context views as well as overview plus detail,
which are described later. In general, at least simple forms of panning and zooming are today among the general techniques implemented in
many of the available visualization systems.
Focus-Plus-Context
An inherent problem of zooming is that the higher the zooming factor is, the more details can be shown about particular items, or the better the
separation between close-up items, but the less can be perceived of the surroundings or the overall structure of the information. Focus-plus-
context techniques mitigate this problem by presenting more details about the items in focus and less about the context, while trying to avoid
that the context of the information in focus is completely hidden. Card et al. list three points as premises for focus plus context (Card et al.
1999):
• Information needed in the overview may be different from that needed in detail.
Overview-plus-detail (Furnas 1981; Furnas 1986) methods can be used to cope with the mentioned problem of zooming and at least the first two
of the premises, but overview plus detail does not combine both types of information in a single display. Hearst describes a fisheye camera lens
as a metaphor for focus-plus-context (Hearst 1999). The trailblazers for fisheye views were two publications by Furnas (Furnas 1981; Furnas
1986) on "Degree of Interest" (DOI) functions and the work of Sarkar (Sarkar 1992) with the extension to graphical fisheye views. Card et al. list
the following techniques for the selective reduction of information in the contextual area: filtering, selective aggregation, micro-macro readings,
highlighting, and last but not least distortion (Card et al. 1999). They interpret focus-plus-context as a data transformation, whereas for
zooming, where a sort of filtering can also occur, they categorize the complete technique as working on the view transformation.
Examples of systems using focus-plus-context for the visualization of search results or browsing are the document lens, the table lens, and the
Pad++ system. The document lens (Robertson and Mackinlay 1993) is a component of the Information Visualizer system. It is a 3D tool for large
rectangular presentations of documents or web page collections, like the web-book. The pages of a document collection are exploded out, so that
all pages are available simultaneously and can be viewed using a rectangular lens magnifying the page in focus, and thereby distorting all the
other pages. Another component, also using a lens metaphor, is the table lens (Rao et al. 1994). The table lens can be used for viewing result lists
or other lists in tabular form, and includes functions for magnifying lines or groups of lines whilst keeping the rest of the table viewable in
compressed form. An entirely different method for focus-plus-context, which uses a semantic information technique, is presented in (Kosara
2001). Blurring is used for highlighting relevant information without compromising the ability to show an overview of the situation.
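The distortion variant of focus-plus-context can be sketched with the fisheye transform used for graphical fisheye views, g(x) = (d + 1)x / (dx + 1), applied per coordinate around a focus point (synthetic grid, hypothetical distortion factor d, Python/Matplotlib assumed):

    import numpy as np
    import matplotlib.pyplot as plt

    def fisheye(coords, focus, d=3.0, extent=0.5):
        """Cartesian fisheye g(x) = (d + 1) * x / (d * x + 1), applied per coordinate."""
        offset = (coords - focus) / extent          # normalized offsets, roughly in [-1, 1]
        warped = np.sign(offset) * (d + 1) * np.abs(offset) / (d * np.abs(offset) + 1)
        return focus + extent * warped

    grid = np.stack(np.meshgrid(np.linspace(0, 1, 15), np.linspace(0, 1, 15)), axis=-1).reshape(-1, 2)
    warped = fisheye(grid, focus=np.array([0.5, 0.5]))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
    ax1.scatter(grid[:, 0], grid[:, 1], s=8)
    ax1.set_title("undistorted layout")
    ax2.scatter(warped[:, 0], warped[:, 1], s=8)
    ax2.set_title("fisheye around the focus")
    plt.show()

Positions near the focus are magnified, positions near the border are compressed, so focus and context remain visible in a single display.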
Semantic Zooming
In contrast to ordinary zooming techniques, semantic zoom does not only change the parameters of a graphical representation, but also modifies
the selection and structure of the data that is displayed. Graphical zooming usually affects the displayed size of an object and – if applicable –
also affects the graphical level of detail of a given object representation (i.e., the number and complexity of graphical primitives shown), based
upon some distance measure. Semantic zooming, on the other hand, changes or enhances the actual type of information conveyed by the
graphical object(s). Usually, additional graphical objects, such as annotations, flags, or similar metaphors, appear in the display while zooming.
For every type of entity and every level of detail, the structural information has to be defined. Semantic zooming is a technique for details-on-
demand to avoid display cluttering in the panoramic view, while retaining all information for a more local field of interest.
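A minimal semantic-zooming sketch with Matplotlib (hypothetical data and threshold): when the visible range becomes small enough, additional graphical objects – here value annotations – appear, and they disappear again when zooming out, so the panoramic view stays uncluttered:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    x = np.arange(200)
    y = rng.normal(size=200).cumsum()

    fig, ax = plt.subplots()
    ax.plot(x, y)
    labels = [ax.annotate(f"{v:.1f}", (i, v), fontsize=7) for i, v in zip(x, y)]
    for label in labels:                        # the panoramic view starts without details
        label.set_visible(False)

    def on_xlim_changed(axes):
        lo, hi = axes.get_xlim()
        detailed = (hi - lo) < 30               # hypothetical threshold for the detail level
        for label in labels:
            label.set_visible(detailed)
        axes.figure.canvas.draw_idle()

    ax.callbacks.connect("xlim_changed", on_xlim_changed)
    plt.show()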
Boulos presents a survey about the use of graphical maps for browsing metadata resources (Boulos 2003). Map-based visualization techniques
provide a natural frame of reference by which an intuitive search strategy can be imposed for the user: the mapping defines the spatial topology –
especially the "similarity in the abstract space" between points, mapped onto their mutual distance. Modjeska gives an extensive survey about
navigation in virtual information worlds (Modjeska 1997). Semantic zooming can be developed for hypermedia and spatial worlds with a variety
of information structures. It uses semantic information to change the physical representation of objects according to the viewing scale. In their
early work, Ahlberg et al. present a coupling of the semantic zooming technique and the dynamic query technique in a starfield display (Ahlberg
et al. 1994).
The Magic Lens is a special form of semantic zoom which connects the interaction method with a lens metaphor. Magic lenses allow the user to select an area of the viewport (of either fixed or arbitrary size) and to manipulate this area with specific operators. They can be overlapped on items, combining the operations applied to the underlying data (Hearst 1999).
Animation
While the other techniques described so far affect data transformations, visual mappings, and/or view transformations, animation does not
influence these conversions, but is affected by them. For a discussion about animation in the larger context of motion and the general usage of
motion see the work of Bartram (Bartram 1997). Animation is used more and more in information visualization systems to help users keep their orientation when transformations or changes of mappings occur. In the transition between images of the same data objects, animation is used to keep the path of an individual object perceptually coherent. The cognitive load on the user is reduced by providing object constancy and exploiting the human perceptual system (Robertson et al. 1993). Animation is used in a number of information-seeking systems such as the Information Visualizer, the Navigational View Builder, Pad++, and SPIRE. In the Information Visualizer, animation is used in several ways, for example animated rotations of Cone Trees that let users track substructure relationships without thinking about it (Robertson et al. 1991). In addition to animating changes, Bryan and Gershman (2000) used movement in their “aquarium” interface for a large online store to reinforce the absence of structure in the displayed items.
Especially in the context of semantic information, it should be noted that animation is also used in the Prefuse toolkit (Heer et al. 2005) for the animation of graphs and networks. Depending on the field of interest, a different part of the structure must be centered in the viewport. This usually requires moving the nodes of the network into a new arrangement. In most cases, this motion is animated to keep the mental image of the network structure consistent (Abello 2006). Animation can also be used to display actual, usually time-dependent, data (Tekusova and Kohlhammer 2007), adding a new data dimension to the display. This can be exploited to spot significant transition patterns over time.
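A typical use of animation in this sense is the smooth interpolation of object positions between an old and a new layout, so that each object's path stays perceptually coherent. The sketch below is a hedged illustration using matplotlib's FuncAnimation with invented random positions; easing, frame count and timing are arbitrary choices, not taken from the systems cited above.

```python
# Sketch: animating a layout change by interpolating node positions,
# which preserves object constancy during the transition.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

rng = np.random.default_rng(0)
start = rng.random((20, 2))   # old layout (invented)
end = rng.random((20, 2))     # new layout (invented)

fig, ax = plt.subplots()
scat = ax.scatter(start[:, 0], start[:, 1])
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

def update(frame, frames=30):
    t = frame / frames
    scat.set_offsets((1 - t) * start + t * end)  # linear interpolation per node
    return scat,

anim = FuncAnimation(fig, update, frames=31, interval=33, blit=True)
plt.show()
```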
Overview Plus Detail
For overview plus detail, two or more levels of linked visualizations with different zoom factors are used. In contrast to semantic zooming, where different zoom levels are used in the same display, two or more separate displays are used. The technique helps users to keep an overview of the whole structure while looking at a portion of the data at a detailed level. Card et al. differentiate between time-multiplexed overview-plus-detail displays and space-multiplexed ones (Card et al. 1999). Time multiplexing means that overview and detail are shown one at a time. Spatial multiplexing means that overview and detail are shown at the same time at different locations on the screen. Time-multiplexed overview-plus-detail views are conceptually not far away from simple zooming. Overview plus detail is sometimes also called the map view concept (Beard and Walker 1990). Card et al. report that typical zoom factors (that is, the ratio between the sizes of the areas shown in the two displays) range from 5 to 15, and that effective zoom factors are limited to about 3 to 30.
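A space-multiplexed overview-plus-detail display can be sketched with two linked matplotlib axes: the detail axes shows a zoomed window whose extent (here a zoom factor of roughly 10, i.e. inside the effective range quoted above) is marked as a rectangle in the overview. The signal and the window bounds are invented for illustration.

```python
# Sketch of a space-multiplexed overview-plus-detail display.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

x = np.linspace(0, 100, 2000)
y = np.sin(x) + 0.1 * np.random.randn(x.size)      # invented signal

fig, (overview, detail) = plt.subplots(2, 1, figsize=(6, 4))
overview.plot(x, y, linewidth=0.5)
detail.plot(x, y)

lo, hi = 40, 50                                     # detail window, zoom factor ~10
detail.set_xlim(lo, hi)

# mark the detail region inside the overview display
overview.add_patch(Rectangle((lo, y.min()), hi - lo, y.max() - y.min(),
                             fill=False, edgecolor="red"))
plt.tight_layout()
plt.show()
```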
Examples of systems using overview plus detail for the visualization of search results or for browsing are the Harmony VRWeb 3D scene viewer and the pre-VIR prototype of Bekavac (Bekavac 1999). The Harmony VRWeb 3D scene viewer (Andrews 1995) uses a 2D map for navigation in an information landscape. pre-VIR uses overview plus detail in a horizontal tree view of the graph of the search results to ease navigation through the graph.
Dynamic Queries
The dynamic query technique has been presented in some foundational work on information visualization (Shneiderman 1994; Ahlberg and Shneiderman 1994). Accessing information in databases is a major activity of knowledge workers. Unfortunately, traditional database query languages trade off ease of use against power and flexibility. The dynamic query technique is a convenient visualization of local database queries, with a simple, intuitive, interactive query refinement method. The basic idea of this technique is to generate queries of moderate to high complexity on a database by purely visual means and to ensure that there is instant feedback in the display showing the search results. One or more selectors control the value range of one or more attributes. Viewing a graphical database representation, users manipulate the selectors to explore data subsets rapidly and easily. In a navigational environment, dynamic queries may offer a useful way to reveal attributive information, which can facilitate wayfinding.
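The essence of the technique can be sketched in a few lines: each selector defines a range over one attribute, and every change of a selector immediately recomputes the displayed subset. The column names and values below are illustrative; a real system would attach the function to slider widgets and redraw the visualization on each call.

```python
# Sketch of a dynamic query: range selectors over attributes, with the
# filtered subset recomputed on every change (instant feedback).
import pandas as pd

data = pd.DataFrame({
    "price": [120, 80, 300, 45, 210],
    "year":  [2001, 2010, 1998, 2015, 2005],
})

def dynamic_query(df, **ranges):
    """Keep only rows whose attributes fall inside every selector's range."""
    mask = pd.Series(True, index=df.index)
    for attr, (lo, hi) in ranges.items():
        mask &= df[attr].between(lo, hi)
    return df[mask]

# Moving a slider corresponds to calling the query again with new bounds.
print(dynamic_query(data, price=(50, 250), year=(2000, 2012)))
```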
Direct Manipulation
Direct manipulation basically manifests itself in two slightly different ways, depending on the relation of the manipulated object to the data shown in the display. The graphical user interface provides elements and metaphors (buttons, sliders, etc.) which can be manipulated. In many techniques, including the dynamic query technique presented above, the manipulation of the GUI elements controls the actual visualization. This is a form of direct manipulation with regard to the GUI elements, but it is indirect with regard to the actual visualization. The means of manipulation do not necessarily correspond to the effect they cause. Shneiderman presents techniques by which this mental gap can be bridged to design intuitive interfaces (Shneiderman 2004).
SEMANTICS VISUALIZATION
Knowledge as Semantics Data
Since the announcement of the idea of the Semantic Web (Berners-Lee 2010), interest in semantic technologies and semantic data management has increased. Berners-Lee et al. describe this idea as a new form of web content that provides meaning for computer systems and unleashes a revolution of new possibilities in the “web of data” (Berners-Lee et al. 2001). In this description two scientific developments joined and formed the understanding of semantic data: the development of the World Wide Web and semantics formalisms. These formalisms were predominantly a subject of the field of artificial intelligence (Berners-Lee et al. 2001).
In artificial intelligence, formalisms for formal semantics were elaborated as knowledge bases. Typically such a knowledge base was designed for a specific application scenario; hence the possibilities for reuse were limited. To overcome this limitation, web-based semantic markup languages emerged in the Semantic Web. In a first step these markup languages were extensions inside HTML code that assign metadata, as semantics, to data fragments, e.g. a telephone number. This enables machines to interpret the data fragments, e.g. as a basis for calculating the relevance of a data fragment for an information need of the user. However, the interpretation logic is then still nested within the machines. Therefore, shortly after the first semantic extensions of websites, the trend moved towards formalizing the interpretation logic within the data representation as well. Thus the web-based semantic markup languages provide the representation of semantic metadata, formal implications, restrictions, etc.
The semiotic triangle describes an interpretation of semantic markup languages. In the semiotic triangle a sign invokes a concept. The concept in
turn identifies an abstract or concrete thing in the world (Guarino et al. 2009). The formalized semantics is designed to be used for representing a
data fragment's potential usage. The metadata captures part of the meaning of data (Antoniou et al. 2008). This formalization enables data
reusability, machine-readability, inference mechanisms and semantic interoperability (Gómez-Pérez 2010).
Formalisms for Representing Semantics
Semantics formalisms describe the metadata as machine-readable formal semantics (knowledge representation paradigm). Semantic networks, frame-based logics, and description logics can be mentioned as the most common existing formalisms (Hitzler et al. 2008).
Semantic networks describe data entities as nodes, which are connected to each other if a semantic relation exists (Fensel et al. 2003). Each of these connections is labeled to express the pragmatic idea behind the link. In semantic networks, however, the labeled link has to be interpreted whenever the underlying semantics are important. A well-known example of a semantic network formalism is the Resource Description Framework (RDF) (Hitzler et al. 2008).
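A semantic network in RDF is simply a set of labeled links (triples). The following sketch uses the rdflib library to build a tiny example graph; the namespace and resources are invented for illustration.

```python
# Sketch: a small RDF graph, i.e. a semantic network of labeled links.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")   # illustrative namespace
g = Graph()
g.add((EX.Alice, RDF.type, EX.Person))
g.add((EX.Alice, EX.knows, EX.Bob))       # the labeled link "knows"
g.add((EX.Bob, EX.worksFor, EX.ACME))
g.add((EX.ACME, EX.locatedIn, Literal("Darmstadt")))

print(g.serialize(format="turtle"))
```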
In addition, frame-based logics may be used, which represent each named object as a frame. Frames have data slots in which a property or attribute of the object is represented. Slots can have one or more values, and these values may in turn be pointers to other frames (Fensel et al. 2003). The extension of RDF, the RDF Schema (RDFS), is a frame-based layer extending the expressiveness of RDF.
Another semantic formalism is the so-called description logics. These allow the construction of more expressive semantics in terms of quantitative (numeric) and qualitative (structural) limitations, formal implications and restrictions. Essentially, description logics constitute fragments of first-order logic, restricted to a certain complexity class in order to allow the construction of a highly expressive language (Hitzler et al. 2008). Using description logics, the semantics is represented as a terminological box (TBox) and an assertional box (ABox). In the TBox, abstract information about concepts is specified. Information assigned to a concept holds for all individuals of this concept; thus this knowledge describes general properties of concepts. In the ABox, the described real-world objects are represented as individuals (Gómez-Pérez et al. 2010). Semantics formalisms based on description logics are, for example, the Web Ontology Language (OWL) and OWL 2.
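The TBox/ABox distinction can be made concrete with a few triples: terminological statements describe concepts in general, assertional statements describe concrete individuals. The sketch below expresses this with RDFS/OWL-style vocabulary via rdflib; the classes, individuals and the property name are invented.

```python
# Sketch: separating terminological (TBox) from assertional (ABox) knowledge.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# TBox: general knowledge about concepts, valid for all their individuals.
g.add((EX.Politician, RDFS.subClassOf, EX.Person))

# ABox: concrete real-world objects represented as individuals.
g.add((EX.JaneDoe, RDF.type, EX.Politician))
g.add((EX.JaneDoe, EX.name, Literal("Jane Doe")))

print(g.serialize(format="turtle"))
```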
Semantics data representations consist of concepts, concept taxonomies, relationships or roles between concepts, and properties describing the concepts. On the concept level, mainly concept taxonomies are described. Semantics data representations consisting of these components are therefore called lightweight formal semantics.
On the other hand, heavyweight formal semantics allow more formal implications to be represented. This makes it possible to model restrictions on domain semantics by adding formal axioms, functions, rules, procedures and constraints to lightweight formal semantics (Gómez-Pérez et al. 2010). There are important relations and implications between the knowledge components (concepts, roles, etc.) used to build the formal semantics, the formal semantics formalism used to represent the components, and the language used to implement the semantics data (Gómez-Pérez et al. 2010).
Semantics Visualizations
Semantics visualization plays a key role in revealing the various relationships between data entities. These relationships in turn enable users to gather information and acquire knowledge. Semantically annotated data can be visualized with semantics visualizations, commonly known as “ontology visualizations”. The following section gives an overview of existing visualization techniques for representing semantically enriched data.
TGVizTab (TouchGraph Visualization Tab) is the TouchGraph (Alani 2003) visualization technique in the Protégé (Noy et al. 2000) ontology management tool. It provides different levels of detail by letting the user choose a variable radius of visibility. The user can navigate through the graph by gradually visualizing parts of it, rotate the graph to see it from different perspectives, and also switch the graph to a hyperbolic tree. It also offers personalization features, which allow the user to choose the focal point, the color of the nodes, fonts and the visibility of nodes. The ontology is additionally presented as a tree structure on the left (Class Browser). It is a desktop solution integrated in Protégé, the favorite ontology management tool of experts. It does not allow aspects like browsing and editing in “one single view”, role-based editing and collaboration. The GUI and UE design is suitable for experts and does not meet the needs of the average user.
OntoTrack (Liebig and Noppens 2004) is a browsing and editing “in-one-view” ontology authoring tool for OWL Lite ontologies. It offers a user-friendly graphical user interface (GUI), which allows users to navigate and manipulate large ontologies. It also offers intuitive user experience design concepts, e.g. miniature branches or selective detail views, to handle and manipulate ontologies in one view. The system is based on SpaceTree (Plaisant et al. 2002). It is a desktop application and is not available as a web-based solution. It addresses scalability issues but does not provide features like personalization, role-based views and collaboration.
TM-Viewer (Topicmap Viewer) (Godehardt and Bhatti 2008) is a topic-map-based ontology visualization tool. TM-Viewer offers fields or sectors, which can be extracted from the ontology. The concepts in each field are represented with specific icons, lines between the knowledge concepts represent the associations, and the levels represent the abstraction level of the concepts (inner levels show generic concepts). Furthermore, the graphical metaphor with special icons for each sector helps the user to recognize the concepts and navigate through the map easily. TM-Viewer allows the user to personalize the GUI completely with the help of a configuration file. The user can choose not only the colors for sectors and associations, but can also change the icons. It is a web-based solution, but it does not allow role-based and collaborative ontology visualization.
The visualization of a huge number of knowledge items, e.g. more than 100 topics, can overstress the user. That is why TM-Viewer uses a cluster concept to keep the visualization manageable for the user. According to this cluster concept, all topics that have the same sibling are clustered. The History component helps the user to keep track of their navigation through the topic map (Godehardt and Bhatti 2008).
COE (Hayes et al. 2003) is an RDF/OWL ontology viewing, composing and editing tool built on top of the IHMC CmapTools concept mapping software suite. Concept maps provide a human-centered interface to display the structure, content, and scope of an ontology. Concept mapping software solutions are used in educational settings, training, and knowledge capturing.
COE uses concept maps to display, edit and compose OWL in an integrated GUI combining the Cmap display with concept search and cluster analysis. COE imports OWL/RDFS/RDF ontologies from XML files (or URIs via HTTP) and displays them as a new concept map. Layout is automatic. Stored ontology Cmaps can be modified and archived using CmapTools.
CropCircles (Parsia et al. 2005; Wang and Parsia 2006) is an ontology visualization which represents the class hierarchy tree as a set of concentric circles. CropCircles aims to give users an intuition of the complexity of a given class hierarchy at a glance. Nodes are given the appropriate space in order to guarantee enclosure of all the subtrees. If there is only one child, it is placed as a circle concentric to its parent; otherwise the child circles are placed inside the parent node from the largest to the smallest.
In order to navigate the ontology structure, the user may click on a circle to highlight it and see a list of its immediate children in a selection pane. The selection pane lets the user drill down the class hierarchy level by level and also supports a browsing history. The user may also select which top-level nodes to show in the visualization.
Jambalaya (Storey et al. 2001) is a visualization plug-in for the Protégé ontology tool (Noy et al. 2000) that uses the SHriMP (Simple Hierarchical Multi-Perspective) 2D visualization technique to visualize regular Protégé and OWL knowledge bases. SHriMP is a domain-independent visualization technique designed to enhance how people browse and explore complex information spaces.
SHriMP uses a nested graph view and the concept of nested interchangeable views. It provides a set of tools including several node presentation
styles, configuration of display properties and different overview styles.
OntoRama (Eklund 2002; Eklund et al. 2002) is an RDF browser used for browsing the structure of an ontology with a hyperbolic-type visualization.
The hyperbolic visualization is motivated mainly by two arguments: firstly, an order of magnitude more nodes of a tree can be rendered in the same display space; secondly, the focus of attention is maintained on the central vertex and its neighborhood. This means that the hyperbolic view is particularly useful for hierarchical diagrams with large numbers of leaves and branches and where neighborhood relationships are meaningful.
Unfortunately, OntoRama currently does not support “forest structures”, i.e. sub-hierarchies that are neither directly nor indirectly connected to the root. It clones nodes that are related to more than one node in order to avoid cases where the links become cluttered. It can support different relation types. Apart from the hyperbolic view, it also offers a Windows-Explorer-like tree view.
OntoSphere3D (Bosca et al. 2005) is a Protégé plugin for ontology navigation and inspection using a 3-dimensional hyper-space in which information is presented in a 3D viewport enriched by several visual cues (such as the color or size of the visualized entities).
OntoSphere proposes a node-link, tree-type visualization that uses three different ontology views in order to provide overview and details according to the user's needs. The OntoSphere3D user interface is quite simple and mouse-centered, and supports scene manipulation through rotation, panning and zooming. It is strongly bound to the “one hand” interaction paradigm, allowing the user to browse the ontology as well as to update it or add new concepts and relations. Ontology elements are represented as follows: concepts are shown as spheres, instances are depicted as cubes, literals are rendered as cylinders, and the relationships between entities are symbolized by arrowed lines where the arrow itself is constituted by a cone.
The user interface features direct manipulation operations such as zooming, rotating, and translating objects in order to provide efficient and intuitive interaction with the ontology model being designed. Since the tool aims at tackling space allocation issues, the visualization strategy exploits dynamic collapsing mechanisms and different views, at different granularities, to guarantee constant navigability of the rendered model.
Concepts and instances within scenes are clickable, with the following outcomes: (1) Left clicks perform a focusing operation, shifting the currently visualized scene to a more detailed view; i.e. clicking on a concept in the tree view leads to a detailed view of that concept. (2) Central clicks are used to expand collapsed elements. The actual behavior of the central click differs slightly from scene to scene: in the Main Scene it simply expands a concept, replacing it with its children; in the Tree View it expands a collapsed subtree and collapses the others; in the Concept Focus Scene, clicking the central concept shows or hides its children. (3) Right clicks, instead, open a contextual menu offering a set of alternatives depending on the current scene and the element's properties.
When more than a single relation occurs between two concepts in a scene and a single line represents them all, no relation label is explicitly reported; the arrow-head cone is then depicted in white and is clickable, and left-clicking it lists those relations.
A certain degree of scene personalization in terms of the sizes of graphical components, the distances between them and colors is supported through an option panel that can be invoked by a button located in the left-hand panel of the plugin.
Furthermore logical views can be defined on this hyper-space in order to easily manage interface complexity when the represented data gets
huge, and thousands of concepts and/or relations must be effectively visualized.
The 3D Hyperbolic Tree visualization was created for web site visualization but has been used as a file browser as well (Munzner 1997 and
Munzner 1998).
It presents a tree in 3D hyperbolic space in order to achieve greater information density. The nodes of the tree are placed on a hemisphere of a sphere. The graph structure in 3D hyperbolic space shows a large neighborhood around a node of interest. This also allows quick, fluid changes of the focus point. Additionally, it offers animated transitions when changing the node in focus.
IsaViz (Pietriga 2001) is a visual environment for browsing and authoring RDF ontologies represented as directed graphs.
It presents a 2D user interface allowing smooth zooming and navigation in the graph. Graphs are visualized using ellipses, boxes and arcs
between them. The nodes are class and instance nodes and property values (ellipses and rectangles respectively), with properties represented as
the edges linking these nodes.
IsaViz enables the user to import ontologies in RDF/XML, Notation 3 and N-Triples formats and to export them not only to RDF/XML, Notation 3 and N-Triples, but also to SVG and PNG formats.
OntoViz (Sintek 2003) is a Protégé (Noy 2000) visualization plug-in using the GraphViz library to create a very simple 2D graph visualization
method.
The ontology structure is presented as a 2D graph with the capability for each class to show, apart from its name, its properties and its inheritance and role relations. The user can pick a set of classes or instances to visualize part of an ontology. The instances are displayed in a different color.
It is possible for the user to choose which ontology features will be displayed (for example slots and slot edges), as well as to prune parts of the ontology from the “config” panel on the left. Right-clicking on the graph allows the user to zoom in or out.
Grokker is a system to display knowledge maps. It offers a graphical representation of search results or of a file search. It uses a graphical metaphor for documents, clusters and category circles. The size of a cluster or category circle shows the number of contained documents, i.e. larger category circles contain more documents or results. The right panel offers further details about the search results and allows users to create their own working lists or to tag results to del.icio.us. The left panel offers filtering mechanisms by date or domain and search within the shown map. It is a web-based solution and offers a user-friendly and easy-to-use GUI and UE design concept. It does not support role-based, aspect-oriented and collaborative aspects.
Kartoo is a search engine which displays the search results in a topographical interface. It displays results closer to each other if they are closely related. Keywords show the relationships between the search results, and the user can also click the keywords to navigate through the map. Kartoo uses different icons as graphical metaphors for different types of results, e.g. documents, websites, etc. On the left side, all topics are listed and serve as an additional view of all displayed results. Furthermore, the description and a thumbnail of a result are shown on the left side on rollover. It offers a user-friendly and easy-to-use GUI and UE design concept and is a web-based solution. It does not support role-based, aspect-oriented and collaborative aspects.
Webbrain allows the visualization of search results and the organization of information. The organization in Webbrain is associative instead of hierarchical. Users can organize information by defining associations between information items. The information items in Webbrain are thoughts, and they can be all types of documents such as websites, Word or PDF files. When the user chooses one thought, it moves to the center and the thoughts related to the selected one branch out around it. The company “The Brain” offers different versions: “Personal Brain (Desktop)”, “Webbrain (Web)” and “Enterprise Knowledge Platform”. The enterprise solution allows collaboration as well. It offers a very easy and intuitive GUI and UE concept. It can also be used as a mind map.
A more recent approach for visualizing complex semantics and ontologies is the SemaVis visualization technology (Nazemi et al. 2011). SemaVis provides a more comprehensive view of heterogeneous semantics structures and uses several of the visualization techniques described in previous chapters for graphically presenting semantic data. The main goal of SemaVis is to provide a core technology for heterogeneous semantic data, different users and user groups, and to support heterogeneous tasks. Therefore a three-layered model was developed, based on the model of Card et al. (Card et al. 1999), to provide fine-granular adaptation at different levels of abstraction. SemaVis subdivides the visualization layer into the layers Semantics, Layout and Presentation. With its modular characteristics, several visualization techniques can be chosen while working with the visualization to present different views on the same data.
With the integrated Visualization Cockpit (Nazemi et al. 2010) the views can be combined to solve different visualization tasks, e.g. exploring knowledge, comparing data structures, etc. (Nazemi et al. 2011)
VISUALIZATION OF SOCIAL DATA AS SEMANTIC INFORMATION
The involvement of citizens, their opinions, discussions, etc. plays an increasing role in the policy modeling domain. The web provides masses of social data, which can be used to identify problems and involve citizens' opinions in the policy creation process. These masses of information are very difficult to handle: every day new opinions, discussions, etc., and therewith new data, become available. In FUPOL various technologies face this challenge from different points of view. The crawling of data, the extraction of topics-of-relevance (hot topics) and their causal relationships are investigated in FUPOL. From the users' point of view, the visualization of that data provides an efficient way to acquire knowledge, e.g. for identifying problems and impacts of policies.
FUPOL therefore provides a visualization model that applies a top-down explorative metaphor for gathering knowledge in problem identification, impact analysis on the social (subjective) level, etc. The top-down approach integrates various overview visualizations, which first give an overview of topics in categorical, temporal and geographical respects and further provide faceting to reduce the amount of information to the relevant aspects. On the visualization level, “details-on-demand” and graph-based visualizations provide a comprehensible view of the information relationships. With various visualization techniques, the level of detail may provide fine-granular or textual information. A model of data analysis for (data-based) adaptation provides an adaptable and adaptive multi-visualization view on the data. This approach makes the detection of policy-related issues easier (more time-efficient).
The benefit is to apply quantitative data analysis and visual mapping mechanisms in the domain of policy modeling to bridge the gap between masses of information entities (instances) and the users' tasks. Therefore the quantity and attributes of the underlying data are analyzed and visualized in combined multi-visualization user interfaces. The analysis further provides data-adaptive visual representations, which may be integrated with further adaptation rules. The top-down (overview-to-detail) visualization cockpit proposed here provides additional scientific value that has not been investigated so far in the context of social data for policy modeling.
This section describes an exemplary conceptual design for social data visualization based on semantics. In order to obtain the most suitable solution when analyzing social network data, the visualization needs to fulfill some informational requirements. This means the visualization needs the capability to communicate the available information, i.e. the result of the analysis process, to the user interacting with the system. Depending on the available context information of the interaction (e.g. the actual task of the user, the political question to be solved), the visualization may also support the visual highlighting of the relevant information artifacts. In this step of the conceptualization of the social web visualization, predominantly the fulfillment of the identified informational requirements is addressed. Further, an overview-to-detail approach with combined semantics and quantitative visualizations is introduced by investigating the FUPOL social data ontology. We propose in this chapter a solution for presenting the semantic information at different levels of visual abstraction and thereby provide a new integrated approach to the user experience with social data. In particular, the domain of policy modeling and the requirements of the policy process act as the foundation.
Informational Requirements
The first and most obvious requirement is the illustration of the structural information and of social groups, which is predominantly given by the relations between the nodes. Structures of the social network can reveal interesting direct or indirect relations leading to or promising a specific result. Furthermore, the nature of the connection structure provides an informative basis indicating where groups of people or topic-related documents occur and illustrates their impact on the neighboring nodes. Thus this structural requirement also represents social groups and social relations. To name some, these structures may be cliques, clusters of cliques according to a clustering index, paths, and various other graph patterns.
In contrast to the structural information, which concerns multiple nodes and their intermediate relations, the second informational requirement corresponds to the position of a single node within the network. The social position describes, for example, the influence of a single user on other community members.
The third informational requirement describes the nodes themselves. A user interacting with the social web visualization always has to be aware of the type of node that is connected with other entities. The most common node type may be a single person or a group of persons interacting in a social community, but the type of posting or the way of participating may also be the information of interest.
To ensure that the user can interpret the information type, glyphs are used to represent a node's content type. In the social web, a node's content may be, e.g., a single person, a group of persons, text, files, configuration files, audio, video or statements/messages. This can be indicated by icons or visual variables, e.g. color or shape.
Due to the fact that the informational requirement “information referent” may vary in general (it is variable), but for a concrete situation or view of interest is a defined, fixed subset, the visualization concept needs to be designed appropriately. There are two types of information referent: the node references either a concrete thing that can be pointed to (e.g. a person, a group of persons, a street, a building) or a non-physical object (e.g. an opinion, a topic) that can only be described textually but holds for multiple nodes (in terms of persons/actors).
The first type of information referent is depicted with an information panel metaphor which displays the associated information such as name, textual information and location. Users are able to interact with and interpret this type of presentation, and the detailed information does not disturb the general interaction and navigation process within the visualization. Thus this visual metaphor corresponds to Shneiderman's visual information seeking mantra: (1) overview first, then (2) zoom and (3) filter, and present (4) details on demand. To follow this mantra, the details on demand are displayed in the information panel only if a user hovers over the visual element with the mouse or performs an active interaction such as a click on it.
The second type of information referent is visualized as a statistical distribution using a well-known visualization type, the pie chart. This pie chart is positioned directly behind the person, group of persons or document discussing the topics, as depicted in the subsequent figure. Due to this closeness, the statistical information is directly associated with the node (law of closure, Gestalt psychology).
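A minimal sketch of this metaphor with matplotlib: a pie-chart glyph is drawn as wedges directly behind the node marker, so that the topic distribution is perceived as belonging to the node. Positions, shares and colors are invented.

```python
# Sketch: a pie-chart glyph placed directly behind a node.
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Wedge

def pie_glyph(ax, center, radius, shares, colors):
    """Draw wedges proportional to `shares` around `center`."""
    total, start = float(sum(shares)), 0.0
    for share, color in zip(shares, colors):
        extent = 360.0 * share / total
        ax.add_patch(Wedge(center, radius, start, start + extent, color=color))
        start += extent

fig, ax = plt.subplots()
pie_glyph(ax, (0.5, 0.5), 0.15, [3, 2, 1],
          ["tab:blue", "tab:orange", "tab:green"])     # topic shares of the node
ax.add_patch(Circle((0.5, 0.5), 0.05, color="black"))  # the node itself, on top
ax.set_aspect("equal")
ax.axis("off")
plt.show()
```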
The social position represents the influence of a person or a group of persons on others. In addition, the social position is here also interpreted as the influence of a document, a statement, etc. with regard to the readers or the opinion formation process. This measure is therefore represented visually so that users are able to interpret the relations correctly.
Social positions or social influences can be measured with two different kinds of measures. The first is a measurement concerning only one element, which means this node has a power of x which is higher or lower than the power y of another node. The second kind is the resulting effect of the social position on other nodes, for example the statements of a person influence others' behavior or opinion with an intensity of value z. Because the effect of social positions or social influence always relates to two or more nodes, usually in a directed way and with a specific intensity, this informational requirement is visually represented using the edges between the entities. In contrast, social position or influence measures concerning only one node are represented by the size of the visual element itself.
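As a hedged illustration of the two kinds of measures, the sketch below uses networkx: degree centrality stands in for the one-node 'power' measure (mapped to node size), and a simple pairwise value stands in for the influence between two connected nodes (mapped to edge width). Both measures are placeholders for whatever influence model the actual analysis provides.

```python
# Sketch: one-node influence mapped to node size, pairwise influence to edge width.
import networkx as nx

G = nx.karate_club_graph()                 # illustrative social graph
power = nx.degree_centrality(G)            # stand-in for a one-node "power" measure

# one-node measure -> size of the visual element
node_size = {n: 300 + 3000 * power[n] for n in G.nodes()}
# pairwise effect -> width (intensity) of the connecting edge
edge_width = {(u, v): 0.5 + 2.0 * min(power[u], power[v]) for u, v in G.edges()}

top = sorted(power, key=power.get, reverse=True)[:3]
print("largest nodes:", [(n, round(node_size[n])) for n in top])
print("example edge width:", next(iter(edge_width.items())))
```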
Information attributes are important for the meaningful interpretation of the social data. The requirement “information attributes” summarizes time stamps and trends (evolution/progress) and the resulting recency of the information. In addition, geographic data, e.g. the country in which a person lives, is addressed by this requirement. In order to do justice to the power of these information attributes, different visual layouts are presented in this section.
The standard view on a social network is a graph view or linked network between the nodes. The nodes differ in their information type, so
documents, movies, and persons build up the whole network structure.
To investigate the time stamps and time trends of the social data, a timeline-based visualization is used. Here a stacked graph is appropriate for trend visualizations (e.g. topic evolutions), or a timeline-based bar chart if the data entries have start and end time attributes. An example of a stacked graph visualization presenting topic evolutions is given in the subsequent figure. The same concept can be applied to topics discussed by groups or single persons, as depicted in the subsequent figure.
Visualization of Social Data as Semantic Information
The task of the visualizations in the context of social data is to provide a sufficient tool for identifying:
Formal Semantic Description of Social Data
The formalization of the crawled social web data is provided in FUPOL as a light-weight ontological representation. The technologies provide feature extraction based on statistical models. The extracted features are then formalized in a semantic relationship model based on SIOC and FOAF, while FUPOL-specific classes enhance the ontology (see FUPOL Ontology).
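A hedged sketch of such a light-weight representation with rdflib is given below, combining FOAF and SIOC terms; the resources and the project-specific class are purely illustrative and do not reproduce the actual FUPOL ontology.

```python
# Sketch: a crawled posting formalized with SIOC/FOAF plus an illustrative
# project-specific class (not the actual FUPOL ontology).
from rdflib import Graph, Literal, Namespace, RDF

SIOC  = Namespace("http://rdfs.org/sioc/ns#")
FOAF  = Namespace("http://xmlns.com/foaf/0.1/")
FUPOL = Namespace("http://example.org/fupol#")   # hypothetical namespace

g = Graph()
g.add((FUPOL.user1, RDF.type, FOAF.Person))
g.add((FUPOL.post1, RDF.type, SIOC.Post))
g.add((FUPOL.post1, SIOC.has_creator, FUPOL.user1))
g.add((FUPOL.post1, SIOC.content, Literal("The new bus line is too expensive.")))
g.add((FUPOL.post1, SIOC.topic, FUPOL.transport))   # extracted feature as a topic
g.add((FUPOL.transport, RDF.type, FUPOL.Topic))     # hypothetical FUPOL class

print(g.serialize(format="turtle"))
```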
Although the ontology provides a formalized and accessible representation of the masses of social data, the problem remains that only a shallow hierarchy is provided, with masses of instances in each concept; e.g. the class “Topic” may contain a very large number of topics. This is in particular a challenge for web-based visualizations and for a transparent and comprehensible view of the data. To face this problem, the following approaches will be developed:
• Use of the described visualization primitives in the various levels of visual representation.
Overview Visualization
The main challenge of visualizing the social data is the mass of instances in the described semantic representation. We have elaborated two ideas with partner technologies to face this problem on the data level, but besides a solution reducing the number of instances per class/concept, the challenge of visualizing a mass amount of data still remains. An adequate way of facing this challenge on the visualization level is the application of Shneiderman's information seeking mantra (Shneiderman 1996). Shneiderman proposed a three-level seeking mantra containing the following steps: overview first, zoom and filter, then details-on-demand. In the context of visualizing the social information, the overview aspect plays a key role. In particular, we identify three main views on this information level in the context of social data visualization:
• Overview on categorical level,
The levels of overview visualization are not distinct and can be combined to view different information aspects.
Overview on Categorical Level
The thematic arrangement enables a visual overview of defined “categories-of-interest”, whereas all or some parts of the information are visualized interactively. We apply in this context two main visualization types to visualize the computed relevance and the result of a quantitative analysis of the user's request. The different informational requirements are then visualized on the presentation level by using the visual variables: the size of a graphical entity provides quantitative information, whereas the relevance is visualized by its color.
As the first categorical visualization we provide a hierarchical treemap that uses the thematic hierarchy of the ontology as one visual indicator, the relevance of the topics as another visual indicator and the size as a third indicator for providing an overview of a topic on the categorical level. The following figure illustrates a very simple example of the described view. The parameters are abstracted to the highest level. The hierarchy is visualized in a simplified way as overlapping (superimposed) and integrated spatial areas. The size illustrates the quantity and the color the relevance, as shown in Figure 8.
Figure 8. Simplified abstract illustration of the hierarchical
treemap (own development)
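To make the mapping concrete, the following sketch lays out a single hierarchy level as a one-dimensional slice treemap in matplotlib: rectangle area encodes the quantity of a category and color encodes the computed relevance. Topic names and values are invented, and a real treemap would recurse into each rectangle for the sub-hierarchy.

```python
# Sketch: one level of a treemap where area ~ quantity and color ~ relevance.
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib import cm

topics = [("Transport", 40, 0.9), ("Housing", 25, 0.4),   # (label, quantity,
          ("Health", 20, 0.7), ("Education", 15, 0.2)]    #  relevance) - invented

fig, ax = plt.subplots()
total = sum(q for _, q, _ in topics)
x = 0.0
for label, quantity, relevance in topics:
    width = quantity / total                                # area encodes quantity
    ax.add_patch(Rectangle((x, 0), width, 1,
                           facecolor=cm.viridis(relevance),  # color encodes relevance
                           edgecolor="white"))
    ax.text(x + width / 2, 0.5, label, ha="center", va="center", rotation=90)
    x += width

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis("off")
plt.show()
```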
In contrast to that very simple visual view, a graph-based layout will be integrated that targets the same information values. Here the size of a circle is used as the indicator for the quantity of information in one category, the hierarchy is displayed as smaller integrated circles, and the color is used for the computed relevance. We omit any semantic relationships in this view, so as not to confuse the user with too much information.
Overview on Temporal Level
Another way of visualizing an overview of the whole spectrum of information is to consider the temporal attributes. By visualizing the temporal overview and providing faceting in time, another dimension of the data is investigated. We propose that the temporal view is the most beneficial way to:
• Interacting with and filtering semantic data for topic-of-relevance based on time.
Here we propose the use of a stacked graph with the following informational requirements mapped to the information dimensions (a minimal sketch follows the list):
• Size: Quantity of topics, terms or extracted features.
• X-Axis: Temporal spread.
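A minimal stacked-graph sketch with matplotlib follows; the weekly topic counts are invented and merely stand in for the extracted features.

```python
# Sketch of the temporal overview: the x-axis carries the temporal spread,
# the height of each stacked band the quantity of a topic over time.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
weeks = np.arange(12)
topic_counts = {                      # invented counts per topic and week
    "Transport": rng.poisson(20, weeks.size),
    "Housing":   rng.poisson(12, weeks.size),
    "Health":    rng.poisson(8,  weeks.size),
}

fig, ax = plt.subplots()
ax.stackplot(weeks, *topic_counts.values(), labels=topic_counts.keys())
ax.set_xlabel("week")
ax.set_ylabel("number of postings")
ax.legend(loc="upper left")
plt.show()
```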
Overview on Geographical Level
The overview aspect can also be investigated from the geographical point of view. This visual representation investigates the geographical spread. This visualization is beneficial when the data can be assigned geographical attributes and the temporal space is set to a specific value, e.g. today's topics-of-relevance in Barnsley. The quantitative value cannot be considered in this view. It visualizes the geographical spread of topics on a map. The color indicates the topic related to a hot area, and this area can be labeled with the identified topics.
Details-on-Demand Visualization on Graph-Based Structures
The next step after the overview is a more detailed view with relational information. Therefore the existing graph-based visualizations will be extended to visualize the dependencies between actors and topics, between actors themselves and between topics themselves. This step can be performed after a refinement in the overview visualization or based on a specific search that returns a comprehensible number of entities.
We propose to use a force-directed graph visualization with quantitative analysis for this purpose. In this case the size of a circle indicates the number of entities, the color the relevance, the size of an entity the number and/or relevance of a topic or of the actor himself, and the relations follow the semantic relationships defined in the FUPOL social data ontology.
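A hedged sketch of such a detail view with networkx: a force-directed (spring) layout places actors and topics, node size encodes the node degree as a stand-in for quantity, and node color encodes an invented relevance score. The graph content is illustrative only and does not follow the actual FUPOL ontology.

```python
# Sketch: force-directed detail view with size ~ quantity and color ~ relevance.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([("actor:A", "topic:transport"), ("actor:B", "topic:transport"),
                  ("actor:A", "topic:housing"),   ("actor:C", "topic:housing"),
                  ("actor:A", "actor:B")])

relevance = {n: 0.9 if n.startswith("topic") else 0.4 for n in G.nodes()}  # invented
size = [600 * (1 + G.degree(n)) for n in G.nodes()]

pos = nx.spring_layout(G, seed=1)          # force-directed placement
nx.draw_networkx(G, pos, node_size=size,
                 node_color=[relevance[n] for n in G.nodes()],
                 cmap=plt.cm.viridis, font_size=8)
plt.axis("off")
plt.show()
```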
Figure 9. Graph-based detail visualization (own development)
The detailed visualizations can provide further information by requesting more details on demand. For example, in the figure we see one actor with a greater size than the others. From this we can assume that this actor is an opinion maker, because he either has many postings or his postings are read by many people (depending on the underlying data and goal). By clicking on this actor, the visual representation will first give more information about him and then provide detailed information (as far as available) about the person.
In all of these steps we have defined different visualization types that are appropriate to meet the informational requirements from the social data point of view. One of the main contributions in this task is that the visual change between the steps, from overview to details and vice versa, is recognized and appropriate visualizations are provided in combined user interfaces.
The categorical, temporal, and geographical views can be combined in various ways to provide a sufficient view of the social data. One promising way of visualizing the different informational requirements of social data and statistical data, respectively, is the juxtaposed orchestration of visualizations (Nazemi et al. 2010), as illustrated exemplarily in Figure 10.
FUTURE RESEARCH DIRECTIONS
Social media, linked data and data on the web provide masses of information that may help to find out the intentions of citizens and to ease decision making throughout the entire policy creation process. Access to information on the web is becoming more and more difficult due to the growing amount of data. One promising way to illustrate the data and interact with them is information visualization tools, although commonly information visualization is either too simple (pie and bar charts) or too complex (analyst tools from visual analytics). This aspect is a great challenge, in particular for the policy and eGovernment community. Thus, although the visualization techniques provide promising ways to interact with knowledge, they are not yet really accepted in that domain. Future research topics should cover the human factor and the human-centered design of visualization systems more thoroughly. In particular, adaptive and intelligent visualizations, incorporating machine learning algorithms for recognizing and modeling users' behavior, the tasks to be solved, and the underlying data, will play an essential role for the acceptance of complex visual systems.
CONCLUSION
The policy creation and modeling cycle is characterized by the need for information, in particular in order to have a valid foundation for making decisions. In this context various kinds of information play a key role: social data enable mining opinions and identifying opinion leaders, while ground-truth statistical data help to identify policy indicators and therewith enable monitoring, validating or identifying policy needs and changes. The main problem in this context is the mass amount of data, especially on the web. One promising way to face the challenge of “big data” is the use of interactive visual representations. Information visualization provides various techniques for this. But not only the visualizations themselves play an important role; the way humans interact with the visualizations and the way data are modeled, transformed and enriched gain more and more attention.
In this chapter we introduced information visualization as a solution for enabling human access to the heterogeneous information that is necessary during the policy modeling process. Based on an established policy life-cycle model, we first identified the steps of policy design where information visualizations are required. Thereafter a foundational overview of information visualization was given, investigating, besides visualization techniques, the entire spectrum from data to visualization. In this context, data and interaction methods were introduced as well. We concluded the chapter with a conceptual example of visualizing social data in the domain of policy modeling.
This work was previously published in the Handbook of Research on Advanced ICT Integration for Governance and Policy Modeling edited by
Peter Sonntagbauer, Kawa Nazemi, Susanne Sonntagbauer, Giorgio Prister, and Dirk Burkhardt, pages 175-215, copyright year 2014 by
Information Science Reference (an imprint of IGI Global).
REFERENCES
Abello, J., van Ham, F., & Krishnan, N. (2006). ASK-Graphview: A Large Scale Graph Visualization System. IEEE Transactions on Visualizations
and Computer Graphics, 12(5).
Ahlberg, C., & Shneiderman, B. (1994). Visual Information Seeking: Tight Coupling of Dynamic Query Filters with Starfield Displays. Boston, MA: ACM Transactions on Human Factors in Computing Systems.
Amar, R. A., & Stasko, J. T. (2005). Knowledge Precepts For Design and Evaluation of Information Visualizations . IEEE Transactions on
Visualization and Computer Graphics , 11(4). doi:10.1109/TVCG.2005.63
Andrews, K. (1995). Information Visualization in the Harmony Internet Browser. In Proceedings of IEEE Information Visualization. Atlanta,
GA: IEEE.
Antoniou, G., & Van Harmelen, F. (2008). A Semantic Web Primer. MIT Press.
Balzer, M., & Deussen, O. (2007). Level-of-Detail Visualization of Clustered Graph Layouts. In Proceedings of the Asia-Pacific Symposium on Visualization (APVIS). APVIS.
Bartram, L. (1997). Can motion increase user interface bandwidth in complex systems?. In Proceedings of the IEEE Conference on Systems, Man
and Cybernetics. IEEE.
Beard, D. V., & Walker, J. Q. (1990). Navigational techniques to improve the display of large two-dimensional space. Behaviour and Information
Technology, 9(6).
Bederson, B. B., Hollan, J. D., & Perlin, K. (1996). Pad++: A zoomable graphical sketchpad for exploring alternate interface physics . Journal of
Visual Languages and Computing , 7.
Bekavac, B. (1999). Hypertextgerechte Suche und Orientierung im WWW: Ein Ansatz unter Berücksichtigung hypertextspezifischer Struktur
und Kontextinformation. (Dissertation). Universität Konstanz, Fachgruppe Informatik und Informationswissenschaft.
Bendix, F., Kosara, R., & Hauser, H. (2005). Visual Analysis Tool for Categorical Data – Parallel Sets. In Proceedings of IEEE Symposium on
Information Visualization (INFOVIS). IEEE.
Berners-Lee, T. (2010). Weaving the Web: The original Design and Ultimate Destiny of the World Wide Web by its Inventor . HarperBusiness.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American , 284(5), 34. doi:10.1038/scientificamerican0501-34
Bertin, J. (1983). Semiology of Graphics: Diagrams, Networks, Maps (Berg, W. J., Trans.). Madison, WI: University of Wisconsin Press.
Bosca, A., Bomino, D., & Pellegrino, P. (2005). OntoSphere: More than a 3D ontology visualization tool. In Proceedings of SWAP, the 2nd Italian
Semantic Web Workshop. CEUR. Retrieved from http://ceur-ws.org/Vol-166/70.pdf
Boulos, K. (2003). The use of interactive graphical maps for browsing medical/health internet information resources. International Journal of Health Geographics, 2(1).
Brandes, U., Gaertler, M., & Wagner, D. (2003). Lecture Notes in Computer Science: Vol. 2832. Experiments on Graph Clustering Algorithms. Springer Verlag.
Bryan, D., & Gershman, A. (2000). The Aquarium: A Novel User Interface Metaphor For Large Online Stores. In Proceedings 11th International
Workshop in Database and Expert Systems Applications. Academic Press.
Card, S. K., Mackinlay, J. D., & Shneiderman, B. (Eds.). (1999). Readings in Information Visualization: Using Vision to Think. San Francisco, CA: Morgan Kaufmann Publishers, Inc.
Card, S. K., Robertson, G. G., & Mackinlay, J. D. (1991). The Information Visualizer, an information workspace. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 181-186). ACM Press.
Chen, C., Zhu, W., Tomaszewski, B., & MacEachren, A. (2007). Tracing Conceptual and Geospatial Diffusion of Knowledge. In D. Schuler
(Ed.), Online Communities And Social Computing, (HCII 2007), (pp. 265-274). Springer-Verlag.
Eick, S. G., & Karr, A. F. (2000). Visual Scalability (Technical Report Number 106). National Institute of Statistical Sciences.
Eklund, P. W., Roberts, N., & Green, S. P. (2002). OntoRama: Browsing an RDF Ontology using a Hyperbolic-like Browser. In Proceedings of the First International Symposium on CyberWorlds (CW2002). IEEE Press.
Elmquist, N., Dragicevic, P., & Fekete, J.-D. (2003). Rolling the Dice: Multidimensional Visual Exploration using Scatterplot Matrix
Navigation. IEEE Transactions on Visualization and Computer Graphics, 14(6).
Fua, Y., Ward, M. O., & Rundensteiner, E. A. (1999). Hierarchical Parallel Coordinates For The Exploration Of Large Datasets. In Proceedings of the 10th IEEE Visualization Conference (Vis '99). IEEE.
Furnas, G. W. (1981). The FISHEYE View: A New Look at Structured Files . Murray Hill, NJ: AT&T Bell Laboratories.
Gershon, N., & Eick, S. G. (1995). Visualisation’s new tack: Making sense of information. IEEE Spectrum . doi:10.1109/6.469330
Godehardt, E., & Bhatti, N. (2008). Using Topic Maps for Visually Exploring Various Data Sources in a Web-based Environment. In Proceedings of Scaling Topic Maps: Third International Conference on Topic Map Research and Applications (TMRA 2007). Berlin: Springer.
Guarino, N., Oberle, D., & Staab, S. (2009). What Is an Ontology?. In Handbook on Ontologies, International Handbooks on Information
Systems, (pp. 1-17). Springer.
Hao, M. C., Dayal, U., Keim, D. A., & Schreck, T. (2005). Importance-Driven Visualization Layouts For Large Time-Series Data. In Proceedings of
IEEE Symposium On Information Visualization. Minneapolis, MN: IEEE.
Havre, S., Hetzler, B., & Nowell, L. (2000). ThemeRiver: Visualizing Theme Changes Over Time. In Proceedings of IEEE Symposium on
Information Visualization. IEEE.
Hayes, P., Saavedra, R., & Reichherzer, T. (2003). A Collaborative Development Environment for Ontologies . CODE.
Hearst, M. A. (1999). User Interfaces and Visualization . In Ribeiro-Neto, B. (Ed.), Modern Information Retrieval. Addison Wesley Longman.
Heer, J., Card, S., & Landay, J. (2005). Prefuse – A toolkit for interactive information visualization . Portland, Oregon: Conference On Human
Factors In Computing. doi:10.1145/1054972.1055031
Henry, N., Fekete, J.-D., & McGuffin, M. J. (2007). NodeTrix: A Hybrid Visualization of Social Networks . IEEE Transactions on Visualization
and Computer Graphics , 13(6). doi:10.1109/TVCG.2007.70582
Herman, I., Melancon, G., & Marshall, M. S. (2000). Graph Visualization and Navigation in Information Visualization: A Survey. IEEE Transactions on Visualization and Computer Graphics, 6, 24-43.
Hitzler, P., Krötzsch, M., Rudolph, S., & Sure, Y. (2008). Semantic Web: Grundlagen . Springer.
Hochheiser, H., & Shneiderman, B. (2004). Dynamic Query Tools for Time Series Data sets: Timebox Widgets for Interactive exploration (Vol. 3).
Information Visualization, Palgrave MacMillan.
Holten, D. (2006). Hierarchical Edge Bundles . IEEE Transactions on Visualization and Computer Graphics , 12(5). doi:10.1109/TVCG.2006.147
Jog, N. K., & Shneiderman, B. (1995). Starfield Information Visualization with Interactive Smooth Zooming. In Proceedings of the 3rd IFIP 2.6
Working Conference on Visual Database Systems Conference. Lausanne, Switzerland: IFIP.
Keim, D., Kohlhammer, J., May, T., & James, J. T. (2006). Event Summary of the Workshop on Visual Analytics. Computers & Graphics, 30(2).
Keim, D., Kohlhammer, J., Ellis, G., & Mansmann, F. (Eds.). (2010). Mastering the Information Age: Solving Problems with Visual Analytics .
Eurographics Association.
Keim, D. A. (2000). Designing Pixel-oriented Visualization Techniques: Theory and Applications . IEEE Transactions on Visualization and
Computer Graphics , 6(1), 59–78. doi:10.1109/2945.841121
Keim, D. A. (2002). Information Visualization and Visual Data Mining . IEEE Transactions on Visualization and Computer Graphics , 8(1).
doi:10.1109/2945.981847
Keim, D. A., & Kriegel, H. P. (1996). Visualization Techniques For Mining Large Databases – A Comparison . IEEE Transactions on Knowledge
and Data Engineering , 8(6). doi:10.1109/69.553159
Keim, D. A., Kriegel, H. P., & Ankerst, M. (n.d.). Recursive Pattern: A Technique for Visualizing Very Large Amounts of Data. In Proceedings of Visualization '95. Academic Press.
Kohlhammer, J., Nazemi, K., Ruppert, T., & Burkhardt, D. (2012). Towards Visualization in Policy Modeling. IEEE Computer Graphics and Applications.
Liebig, T., & Noppens, O. (2004). OntoTrack: Combining Browsing and Editing with Reasoning and Explaining for OWL Lite Ontologies.
In Proceedings of the 3rd International Semantic Web Conference ISWC 2004. Hiroshima, Japan: Academic Press.
Lin, J., Keogh, E., & Lonardi, S. (2005). Visualizing and Discovering Non-Trivial Patterns In Large Time Series Databases. Palgrave Macmillan Journal On Information Visualization, 4(2), 61–82. doi:10.1057/palgrave.ivs.9500089
Mansmann, F., Keim, D. A., North, S. C., Rexroad, B., & Sheleheda, D. (2007). Visual Analysis of Network Traffic for Resource Planning, Interactive Monitoring and Interpretation of Security Threats. IEEE Transactions on Visualization and Computer Graphics, 13(6).
May, T., & Kohlhammer, J. (2008). Towards closing the analysis gap: Visual Generation of Decision Supporting Schemes from Raw Data. Eurographics/IEEE-VGTC Symposium on Visualization (EuroVis), 27(3).
May, T., Steiger, M., Davey, J., & Kohlhammer, J. (2012). Using Signposts for Navigation in Large Graphs. Computer Graphics Forum , 31(2-3),
985–994.
Modjeska, D. (n.d.). Navigation in electronic worlds: A Research Review (Technical Report). Computer Systems Research Group, University of
Toronto.
Müller, W., Nocke, T., & Schumann, H. (2006). Enhancing the Visualization Process with Principal Component Analysis to Support the
Exploration of Trends. In Proceedings of the APVIS 2006. APVIS.
Munzner, T. (1997). H3: Laying Out Large Directed Graphs in 3D Hyperbolic Space. In Proceedings of the 1997 IEEE Symposium on
Information Visualization. Phoenix, AZ: IEEE.
Munzner, T. (1998). Exploring Large Graphs in 3D Hyperbolic Space . IEEE Computer Graphics and Applications , 18(4), 18–23.
doi:10.1109/38.689657
Nazemi, K., Breyer, M., Burkhardt, D., & Fellner, D. W. (2010). Visualization Cockpit: Orchestration of Multiple Visualizations for Knowledge-
Exploration. International Journal of Advanced Corporate Learning , 3(4).
Nazemi, K., Breyer, M., Forster, J., Burkhardt, D., & Kuijper, A. (2011). Interacting with Semantics: A User-Centered Visualization Adaptation
Based on Semantics Data. In Human Interface and the Management of Information: Part I: Interacting with Information. Berlin: Springer.
Nazemi, K., Stab, C., & Kuijper, A. (2011). A Reference Model for Adaptive Visualization Systems . In Jacko, J. A. (Ed.), Human-Computer
Interaction: Part I: Design and Development Approaches (pp. 480–489). Berlin: Springer. doi:10.1007/978-3-642-21602-2_52
Nightingale, F. (1857). Mortality of the British Army . London, UK: Harrison and Sons.
Noy, N. F., Fergerson, R. W., & Musen, M. A. (2000). The knowledge model of Protege-2000: Combining interoperability and flexibility.
In Proceedings of 2nd International Conference on Knowledge Engineering and Knowledge Management (EKAW 2000). Juanles- Pins,
France: EKAW.
Parsia, B., Wang, T., & Goldbeck, J. (2005). Visualizing Web Ontologies with CropCircles. In Proceedings of the 4th International Semantic Web
Conference. Academic Press.
Plaisant, C., Grosjean, J., & Bederson, B. B. (2002). SpaceTree: Supporting Exploration in Large Node Link Tree, Design Evolution and Empirical
Evaluation. In Proceedings of IEEE Symposium on Information Visualization. Boston: IEEE.
Proulx, et al. (2006). Avian Flu Case Study with nSpace and GeoTime. In Proceedings of IEEE Symposium on Visual Analytics Science and
Technology. Baltimore, MD: IEEE.
Rao, R., & Card, S. K. (1994). The Table Lens, Merging graphical and symbolic Representations in an interactive Focus+Context visualization for
Tabular Information. In Proceedings Human Factors in Computing Systems. Boston, MA: ACM.
Reingold, E.M., & Tilford, , T. (1981). Drawing of Trees . IEEE Transactions on Software Engineering , 7(2), 223–228.
doi:10.1109/TSE.1981.234519
Robertson, G. C., Card, S. K., & Mackinlay, J. D. (1993). Information Visualization Using 3-D Interactive Animation .Communications of the
ACM , 36(4). doi:10.1145/255950.153577
Robertson, G. G., & Mackinlay, J. D. (1993). The Document Lens. In Proceedings of 6th ACM Symposium on User Interface Software and
Technology. ACM Press.
Robertson, G. G., Mackinlay, J. D., & Card, S. K. (1991). Cone Trees: animated 3D visualizations of hierarchical information. InProceedings of the
SIGCHI Conference on Human Factors in Computing Systems: Reaching through Technology, (pp. 189 – 194). ACM.
Sarkar, M., & Brown, M. H. (1992). Graphical Fisheye Views of Graphs, Digital SRC Research Report 84a. In Proceedings of ACM CHI’92. ACM.
Schreck, T., Bernard, J., Tekušová, T., & Kohlhammer, J. (2008). Visual Cluster Analysis of Trajectory Data with Interactive Kohonen Maps.
In Proceedings of IEEE Symposium on Visual Analytics Science and Technology (VAST). IEEE.
Shneiderman, B. (1994). Dynamic Queries for Visual Information Seeking . IEEE Software , 11(6), 70–77. doi:10.1109/52.329404
Shneiderman, B. (1996). The Eyes Have It: A Task By Data Type Taxonomy For Information Visualization. IEEE Visual Languages, 336-343.
Shneiderman, B. (1999). Dynamic Queries, Starfield Displays and the path to Spotfire, Report Human-Computer Interaction Lab . University of
Maryland.
Shneiderman, B. (2007). Designing the user interface: Strategies for effective human-computer-interaction (4th ed.). Addison-Wesley.
Siirtola, H. (2007). Interactive Visualization of Multidimensional Data. Dissertations in Interactive Technology, Number 7 . University of
Tampere.
Storey, M.-A., Mussen, M., Silva, J., Best, C., Ernst, N., Fergerson, R., & Noy, N. (2001). Jambalaya: Interactive visualization to enhance ontology
authoring and knowledge acquisition in Protégé. In Proceedings of Workshop on Interactive Tools for Knowledge Capture (K-CAP-2001).
Victoria, Canada: K-CAP.
Tekusova, T., & Kohlhammer, J. (2007). Applying Animation to the Visual Analysis of Financial Time-Dependent Data. InProceedings of the
IEEE Conference on Information Visualization(IV07). Zurich, Switzerland: IEEE.
Thomas, J., & Cook, K. (2005). Illuminating the Path: Research and Development Agenda for Visual Analytics . IEEE Press.
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
van Ham, F., & Perer, A. (2009). Search, Show Context, Expand on Demand: Supporting Large Graph Exploration with Degree-of-Interest. IEEE
Transactions on Visualization and Computer Graphics , 15(6), 953–960. doi:10.1109/TVCG.2009.108
van Wijk, J. J., & van de Wetering, H. (1999). Cushion Treemaps: Visualization of Hierarchical Information. In Proceedings of IEEE Symposium
on Information Visualization (INFOVIS’99). San Francisco, CA: IEEE.
Voinea, S. L., Telea, A., & Chaudron, M. (2005). Version-Centric Visualization Of Code Evolution. In Proceedings of the IEEE Eurographics
Symposium on Visualization (EuroVis’05). IEEE Computer Society Press.
Wang, T., & Parsia, B. (2006). Cropcircles: Topology sensitive visualization of owl class hierarchies. In Proceedings of International Semantic
Web Conference. Academic Press.
Wijffel, A. M., Vliegen, van Wijk, & van der Linden. (2008). Generating Color Palettes using Intuitive Parameters. InProceedings of IEEEVGTC
Symposium on Visualization(EUROVIS). IEEE.
ADDITIONAL READING
Brusilovsky, P. wook Ahn, J.; Dumitriu, T. & Yudelson, M. (2006) Adaptive Knowledge-Based Visualization for Accessing Educational Examples
Information Visualization, 2006. IV 2006. Tenth International Conference on, 2006, 142-150
Card, S. K., Mackinlay, J. D., & Shneiderman, B. (1999) Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann, 1999.
Card, S. K., Robertson, G. G., & Mackinlay, J. D. (1991). The Information Visualizer, an information workspace. In CHI’91: Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems (pp. 181—186). ACM Press.
Kohlhammer, J., Nazemi, K., Ruppert, T., & Burkhardt, D. (2012): Towards Visualization in Policy Modeling. In Journal of Computer Graphics
and Applications, IEEE, Sept.-Oct.2012.
Kohlhammer, J. Proff, D.U. Wiener, A. (2013) Visual Business Analytics: Effektiver Zugang zu Daten und Informationen. dPunkt Verlag.
Macintosh, A. Characterizing E-Participation in Policy-Making. In Proceedings of the Proceedings of the 37th Annual Hawaii International
Conference on System Sciences (HICSS'04), Vol. 5. IEEE Computer Society, Washington, DC, USA, 2004.
Nazemi, K., Stab, C., & Kuijper, A. (2011). A Reference Model for Adaptive Visualization Systems . In Jacko, J. A. (Ed.), Human-Computer
Interaction: Part I: Design and Development Approaches (pp. 480–489). Berlin, Heidelberg, New York: Springer. doi:10.1007/978-3-642-21602-
2_52
Shneiderman, B. (1996) The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations, VL, 1996, 336-343
Ward, M., Grinstein, G., & Keim, D. (2010). Interactive Data Visualizations Foundations . Natick, Massachusetts: Techniques, and Applications A
K Peters, Ltd.
KEY TERMS AND DEFINITIONS
Adaptation: In human-computer interfaces, adaptation refers to automatic, system-driven changes to the content, structure, and presentation of system behavior that involve some form of learning, inference, or decision making based on one or many influencing factors, in order to support users.
Adaptive Visualizations: Adaptive visualizations are interactive systems that autonomously adapt the visual variables, visual structure, visualization method, or their composition by involving some form of learning, inference, or decision making based on one or many influencing factors, such as users’ behavior or data characteristics, to amplify cognition and enable more efficient information acquisition.
Information Visualization: The interactive visual representation of data to amplify cognition and support information and knowledge acquisition.
Semantics Visualization: Semantics visualizations are computer-aided interactive visualizations for effective exploratory search, knowledge domain understanding, and decision making based on semantics.
Semantics: Semantics can be defined as data enriched with meaningful relations between at least two information or data entities, ideally providing a disambiguated meaning.
SemaVis: SemaVis is an adaptive semantics visualization technology developed by Fraunhofer Institute for Computer Graphics Research.
Visual Analytics: Visual Analytics is the interactive coupling of data analysis and information visualization to provide insights and knowledge.
ENDNOTES
1 http://data.gov.uk/linked-data
2 http://epp.eurostat.ec.europa.eu
CHAPTER 9
The Performance Mining Method:
Extracting Performance Knowledge from Software Operation Data
Stella Pachidi
Utrecht University, The Netherlands
Marco Spruit
Utrecht University, The Netherlands
ABSTRACT
Software Performance is a critical aspect for all software products. In terms of Software Operation Knowledge, it concerns knowledge about the
software product’s performance when it is used by the end-users. In this paper the authors suggest data mining techniques that can be used to
analyze software operation data in order to extract knowledge about the performance of a software product when it operates in the field. Focusing
on Software-as-a-Service applications, the authors present the Performance Mining Method to guide the process of performance monitoring (in
terms of device demands and responsiveness) and analysis (finding the causes of the identified performance anomalies). The method has been
evaluated through a prototype which was implemented for an online financial management application in the Netherlands.
INTRODUCTION
As the exponential increase of information is forcing us into the era of Big Data (Demirkan & Delen, 2013) and the organizations of the 21st
century function in a global marketplace defined by intense and growing turbulence (Heinrichs & Lim, 2003), the need for tools that enable quick
and effective extraction of business insights is more evident than ever. Business Analytics has evolved to include techniques for accessing, reporting on, and analyzing data in order to understand and forecast business performance (Delen & Demirkan, 2013). Such techniques include data mining,
which has been used to support knowledge workers in generating relevant insights (Heinrichs & Lim, 2003) to support decision making in
various processes at different strategic levels (Bolloju, Khalifa, & Turban, 2002; Courtney, 2001).
Knowledge Management has been recognized as an important process in software organizations for supporting core software engineering
activities in order to decrease costs, increase quality and lead to better decisions (Bjørnson & Dingsøyr, 2008; Lindvall & Rus, 2002). Although a
lot of research has focused on managing knowledge in software development (Moreno García et al., 2004), little attention has been paid to the management of knowledge about customers' experience with deployed software products (van der Schuur, Jansen, & Brinkkemper, 2010) for software maintenance purposes (Midha & Bhattacherjee, 2012). However, the rise of cloud computing (Marston et al., 2011) highlights the need for Software-as-a-Service (SaaS) organizations to extract and analyze knowledge on how their software operates (van der Schuur et al., 2010), in order to cope efficiently with changing requirements (Srikanth & Jarke, 1989), unexpected performance issues (Zo, Nazareth & Jain, 2010) and increasing scalability demands (Delen & Demirkan, 2013), and thus maintain the quality of their services (Sun, He & Leu, 2007). With this paper we aim to contribute to the management of knowledge related to software operation (van der Schuur et al., 2010), with the goal of supporting software maintenance processes (Midha & Bhattacherjee, 2012). Specifically, we answer the research question: How can we detect performance
anomalies and their causes in software operation data? We examine several data mining techniques applicable for the extraction of
performance knowledge from software operation data, and construct a method that incorporates different techniques and helps structure the
complicated process of performance mining. We evaluate the method and the selected data mining techniques by means of a prototype that was
run with operation data of an established online financial management application.
SOFTWARE PERFORMANCE EVALUATION
Performance describes the ratio of the effective work accomplished by a software system to the time and resources used to carry out that work (Arinze, Igbaria, & Young, 1992). In practice, software performance may be characterized by
different aspects, which are evaluated in measurable terms using different performance metrics, such as response time, throughput, availability,
latency, or utilization of resources (e.g. percentage of CPU or memory usage). Software Performance Evaluation is a critical process for all types
of software products, indispensable at every stage of the product’s life (Arinze et al., 1992). It is performed to determine whether the system meets the user-defined performance goals and to detect possible improvement points; to compare a number of alternative designs and select the best one;
or to compare a number of different solutions and select the system that is most appropriate for a given set of applications (Jain, 1991).
A common technique for evaluating software performance is monitoring the system while it is subjected to a specific workload (Jain, 1991). Monitoring consists of observing the performance of systems, collecting performance statistics, analyzing the data, and displaying the results. The most common functions of a monitor include finding the most frequently accessed software segments, measuring the utilization of resources, and finding performance bottlenecks. A performance monitor can provide valuable information about the application and system run-time
behavior, in order to carry out a dynamic analysis of the system’s performance.
Extracting Performance Knowledge from Software Operation Data
As the amount of data that software vendors have to process is rising exponentially, the efficient extraction of knowledge from these data is
indispensable. Knowledge on software performance (in terms of device demands or responsiveness) may be elicited from data gathered while the
software product is used by its end-users (software operation data) (van der Schuur et al., 2010). So far research has focused on the recording of
software operation data, and on the integration, presentation and utilization of the knowledge acquired through general statistics and
visualization techniques (van der Schuur et al., 2010). However, sufficient support of data analysis functionalities is missing.
Data mining techniques could be used to extract software performance knowledge from operation data: we could inspect the duration of an
operation until it is fulfilled and get an alert when everything starts to slow down; identify when the capacity of resources is limited; find
deadlocks and query time-outs in the database; calculate the throughput of processing requests on a web server, etc. By monitoring the
performance on a regular basis (daily, weekly, monthly), we are also able to compare the values of the metrics and find the causes of performance problems: there might be a hardware problem on a server, overuse by a specific customer, a hardware or bandwidth problem on the end-user’s side, or a software bug in a release update that incurs extra computation time, among other things.
Performance monitoring and analysis through software operation data mining (e.g. Pachidi, Spruit & Weerd, 2014) could contribute significantly
to the performance testing and quality control processes, as it can provide an automated alert system that identifies problems in the performance of a software product and even analyzes their causes. The direct advantage would be an increase in the quality of the product, which would also help increase customer satisfaction (Anderson & Chen, 1997). A further advantage would be a decrease in the costs of performance testing, since many problems would be identified, or even prevented, through the automated analysis of software operation data.
Although research has been performed to some extent on the use of data mining techniques for performance monitoring (Jain, 1991; Vilalta et al.,
2002), most of these approaches are used to test performance models or for simulations, rather than to analyze how the software actually
performs in the field. The approaches that do monitor in-the-field performance (Imberman, 2001) focus on one layer or tier only (e.g. network behavior; Borzemski, 2009) and do not cover all possible tiers, or they are used for producing alerts about performance bottlenecks rather than for pinpointing the problem causes. We still need a uniform way of analyzing software operation data, in order to monitor and analyze the performance of software products in the field. This paper aims to cover this gap by exploring how data
mining techniques can be performed on software operation data, in order to 1) detect anomalies in the performance of the software on the end-
user’s side, and 2) reason about the cause of each identified anomaly.
Related Work
Performance monitors have been used for several purposes, such as capacity planning (Thakkar et al., 2008), performance tuning (Loukides,
1996), workload characterization (Avritzer et al., 2002), and program profiling (Ammons, Ball, & Larus, 1997). An example methodology that
uses monitors is NetLogger (Gunter & Tierney, 2003), which implements real-time diagnosis of performance problems in complex high-
performance distributed systems through the generation of timestamped event logs.
Some preliminary research on testing performance with log file analysis was published by Andrews (1998), who suggested a framework for
finding software bugs and testing the system performance, by tracking for example internal implementation details (e.g. memory allocation).
Johnson et al. (2007) also incorporated the generation of performance log files in the performance testing process, to enable developers and
performance architects to examine the logs and identify performance problems. The Microsoft Application Consulting and Engineering (ACE) Team (2003) built the System Monitor: a process that provides real-time monitoring of performance counters (metrics for hardware and software resources) on a machine. The System Monitor logs the counter information for later analysis, triggers alerts when performance thresholds are exceeded, and generates trace logs which record data (processes, threads, file details, page faults, disk I/O, network I/O, etc.) when certain activities occur.
As far as data mining is involved in the performance analysis task, some related research has been conducted. Imberman (2001) identifies the
parts of the Knowledge Discovery in Databases process (Fayyad, Piatetsky-Shapiro & Smyth, 1996), which are currently being followed by
performance analysts: regression, graphical analysis, visualizations and cluster analysis. She also elaborates on the parts that are less often used so far but could be useful for analyzing performance data (association/dependency rule algorithms, decision tree algorithms and neural networks). Jain (1991) suggests the regression technique for modeling performance-related variables, and time series analysis for managing and
analyzing performance data gathered over a time period. Vilalta et al. (2002) study the use of predictive algorithms in computer systems
management for the prediction and prevention of potential failures. Pray and Ruiz (2005) present an extension of the Apriori algorithm for
mining expressive temporal relationships from complex data, and test it with a computer performance data set. Finally, Warren et al. (2010)
suggest an approach for finding the “hot spots” of a program, i.e. the operations that have the highest impact on performance.
RESEARCH METHODOLOGY
In this research we follow the Design Science Research (DSR) approach (March & Smith, 1995): questions are answered through the creation and
evaluation of innovative artifacts that render scientific knowledge and are fundamental and useful in understanding and solving the related
organizational problem (Hevner & Chatterjee, 2010). In our case, the research artifacts correspond to the method and the prototype (which constitutes an instantiation of the method) for software performance mining. We construct the method for performance mining by using
the Method Engineering approach provided by Weerd and Brinkkemper (2008), as previously introduced in this journal by Vleugel, Spruit &
Daal (2010). To evaluate the effectiveness and applicability of the method, we perform case study research in a software company.
This research is delimited by some specific constraints. First of all, we limit our research to the analysis of performance for software-as-a-service
applications. However, the extensibility to other types of product settings will be considered in the construction of our method. Furthermore,
concerning the acquisition of software operation data, we assume that they are gathered through logging tools and libraries, which are embedded
in the software product’s code. Thus, all actions performed by the end-users are recorded and stored in a central database. Finally, we have
selected to work with the R environment and programming language (Venables et al., 2014). This decision influences our research with respect to the selection of techniques that will be implemented in the method and the prototype.
In order to structure our data mining research, but also to assemble our Performance Mining method, we follow the CRISP-DM Reference Model
(Chapman et al., 2000). The CRoss Industry Standard Process for Data Mining consists of six phases, which provide a high-level classification of data mining activities and of the information flows between them. In this paper, the Business Understanding phase may be
mapped to the goals of performance analysis presented in the Introduction as well as to the hypotheses for performance analysis that are set in
section 4. The Data Understanding phase includes the identification of performance variables in section 5. The Modeling phase corresponds to
the data mining techniques presented in section 6. The Evaluation phase may be mapped to section 8, where the evaluation by means of a case
study is presented. Part of the Deployment phase may be mapped to the discussion in section 9. Even though the Data Preparation phase was a
significant part of this research, it is not thoroughly discussed in the current paper.
HYPOTHESES FOR PERFORMANCE ANALYSIS
As the aspect of performance is a very complicated and wide domain, it is necessary to set some hypotheses about what types of performance
problems we aim to detect and analyze with our suggested techniques and method. As we also mentioned in the introduction, this paper presents
research on performance monitoring and analysis of software-as-a-service applications. Logging techniques are used to collect the performance
data during the utilization of the software product in the field.
The performance anomalies that we aim to inspect in software operation data can be categorized into two types: problems related to device
demands on the servers, which include unusually high utilization of resources (CPU, memory, disk, etc), in the computers on which the servers
are running; and problems related to responsiveness, which refer to situations when the response time (i.e. the time interval between the end of a
request submission by the user and the end of the corresponding response from the system) is much higher than usual.
Concerning responsiveness problems, we consider the following possible causes that could be responsible for increasing the response time:
• A problem on the server (e.g. a memory leak, a processor problem, etc.) delays all operations running on that server.
• The responsiveness of a server will be lower when the load is too high during particular, repetitive time intervals of the day/week/month
(seasonal effect).
• There is a software problem (e.g. a software bug) with a specific application, function, action, etc. which delays the respective response
time.
• There is a processing problem (e.g. a parsing problem) that delays the processing time (Processing duration = Total response time − Query duration).
• A problem on the database (e.g. a deadlock or a query time-out) delays the duration of a query execution.
• The size of data from specific customers might be substantially high, compared to the usual size, thus the response time to view, edit, store,
etc. the related records can be much longer.
• A problem on the user’s side (e.g. too heavy customization of a menu, or a bandwidth problem) might delay the duration of the
applications, functions, etc. that he/she is using.
The aforementioned hypotheses create a specific context, in which we search for the appropriate data mining techniques and we construct the
performance mining method and the prototype.
PERFORMANCE VARIABLES
In order to gather the software operation data that are necessary to monitor and analyze performance, we need to decide which variables will
need to be inspected. We distinguish three categories of variables: a) performance metrics related to the resources usage on the machines where
the servers are running; b) performance metrics related to responsiveness; c) variables related to the software operation details.
Resource usage variables include performance metrics related to the resources used on the computers on which the servers that host the
software-as-a-service product are running. These variables are useful to detect bottlenecks located on a specific server, which decrease the
performance of the product. The metrics are classified according to the resource they refer to (Team A. C. E., 2003): processor, system object,
physical disk, memory, network interface, web tier and SQL tier.
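To make this classification more tangible, the sketch below groups a handful of counters by resource category; the counter names are typical Windows System Monitor counters chosen here purely as illustrative assumptions, not the exact set prescribed by the method.

# Illustrative mapping of resource categories to example performance counters
# (hypothetical selection; the actual counters depend on the server setup).
resource_counters <- list(
  processor         = c("% Processor Time"),
  system_object     = c("Processor Queue Length", "Context Switches/sec"),
  physical_disk     = c("% Disk Time", "Avg. Disk Queue Length"),
  memory            = c("Available MBytes", "Pages/sec"),
  network_interface = c("Bytes Total/sec"),
  web_tier          = c("Requests/Sec", "Request Wait Time"),
  sql_tier          = c("Batch Requests/sec", "Lock Waits/sec")
)
str(resource_counters)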
Responsiveness variables concern how fast a request, action etc. is processed by a server. We suggest three variables that can be used to measure
the responsiveness of the software product: duration, which refers to the response time, i.e. the total time elapsed (in milliseconds) from the
moment the end-user finishes a request (e.g. request to load a .aspx page) until the server completes the response to that request; query duration,
which is part of the total duration and describes the time elapsed to process and respond to a query request; and throughput, which refers to the
rate of requests that are processed by the server in a sample time interval (usually one second).
Operation details variables provide information about each action that is performed on the software product by an end-user: who is using the
software product (customer, user id, ip address), where the application is being hosted (web server, database), which functionalities are accessed
(specific application, page, function, method, etc.), how (which specific action e.g. “new”, “edit”, “save”, etc.), when (date and time, session, etc.),
which background tasks are running at the same time, which errors appear, and other details (e.g. how many records are loaded or stored, and so
forth).
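As an illustration of how these categories of variables might look side by side, the following R sketch builds a small, entirely hypothetical set of operation records and derives a processing duration and a per-second throughput from them; the column names and values are assumptions for demonstration only, not the schema of the application studied later.

# Illustrative software operation log records (hypothetical schema and values).
ops <- data.frame(
  timestamp   = as.POSIXct(c("2010-02-01 09:00:03", "2010-02-01 09:00:04",
                             "2010-02-01 09:00:04"), tz = "UTC"),
  customer    = c("C042", "C017", "C042"),
  user_id     = c("u1", "u2", "u3"),
  server      = c("web01", "web01", "web02"),
  application = c("BankStatus", "InvoiceGrid", "BankStatus"),
  action      = c("open", "save", "open"),
  duration_ms = c(410, 1250, 380),    # total response time
  query_ms    = c(120, 900, 110),     # time spent executing database queries
  stringsAsFactors = FALSE
)

# Processing duration = total response time minus query duration.
ops$processing_ms <- ops$duration_ms - ops$query_ms

# Throughput: number of requests processed per server per one-second interval.
ops$second <- format(ops$timestamp, "%Y-%m-%d %H:%M:%S")
throughput <- aggregate(duration_ms ~ server + second, data = ops, FUN = length)
names(throughput)[names(throughput) == "duration_ms"] <- "requests"
print(throughput)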
DATA MINING FOR PERFORMANCE ANALYSIS
In this section we present data mining techniques that may be applied on software operation data, in order to analyze the performance of
software-as-a-service products. We suggest three main performance analysis tasks: Monitoring performance on the servers is related to monitoring the resource utilization on the machines where the servers are running. We are interested in detecting performance bottlenecks related to resource usage, and we aim to provide an estimate of the future capacity of the resources. Monitoring performance of applications consists of monitoring the responsiveness of the system with regard to individual applications (e.g. .aspx pages, methods, etc.). We are interested in detecting delays in the response times of applications, or even predicting a possible decrease in responsiveness given the current configuration. Analyzing performance of applications aims to find the causes (as presented in section 4) of events where the response time is significantly higher than usual.
Monitoring Performance on the Servers
In order to monitor the utilization of resources on the server machines, values of the metrics presented in section 5 should be collected periodically (at time intervals of e.g. 1-3 minutes). Each metric value will be stamped with the specific date and time of its measurement; hence we obtain a sequence of values, which can be considered as time-series data (Han, 2005).
The treatment of all performance metrics variables as time-series data will be the same: each performance counter is a time series variable, which
can be viewed as a function of time t: Y=Y(t) and can be illustrated as a time-series graph (Han, 2005). We will apply time series analysis and
burst detection to model the time series, and subsequently trend analysis, to forecast future values of the time-series variable.
Time series analysis consists of decomposing a time series into the following set of features, which characterize the time series data (Cowpertwait & Metcalfe, 2009; Han, 2005): Trend or long-term movements are systematic changes in a time series, which do not appear to be periodic, and indicate a general direction (e.g. a linear increase or decrease) in which a graph is moving over a long period of time. We are interested in identifying trends in order to detect when there is a long-term increase or decrease of a performance counter, which could give insight into a performance problem. Cyclic movements or cyclic variations are long-term oscillations of a trend line or curve, and may appear periodically or
aperiodically. Seasonal movements or seasonal variations correspond to systematically repeating patterns that a time series appears to follow
within any fixed period (e.g. increased usage every Monday morning). Irregular or random movements indicate the occasional motion of a time
series because of random or chance events. An example of a time-series decomposition may be viewed in Figure 1.
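The following R sketch illustrates this decomposition step on a simulated performance counter; the simulated cpu series, the two-minute sampling interval, and the daily period are assumptions chosen only to make the example self-contained.

set.seed(42)

# Simulated "% Processor Time" counter: one sample every 2 minutes for 14 days
# (720 samples per day), with a slow upward trend, a daily pattern, and noise.
days    <- 14
per_day <- 720
n       <- days * per_day
base    <- seq(20, 35, length.out = n)
daily   <- 10 * sin(2 * pi * seq_len(n) / per_day)
cpu     <- pmax(0, base + daily + rnorm(n, sd = 3))

# Treat the counter as a time series with a daily period and decompose it
# into trend, seasonal and random movements.
cpu_ts <- ts(cpu, frequency = per_day)
dec    <- decompose(cpu_ts)   # alternatively: stl(cpu_ts, s.window = "periodic")
plot(dec)                     # observed / trend / seasonal / random panels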
Apart from reflecting interesting patterns and features of the data, a time series plot may also expose “burstiness”, which implies that more significant events are happening within the same time frame (Vlachos et al., 2008) and may therefore signal an impending change in the monitored quantity, so that the system analyst can proactively deal with it. Burst detection is also necessary in order to remove the bias introduced by the burst intervals in the trend analysis. The burst intervals will be replaced with new values through linear interpolation and the new values of
the data points will correspond to the respective mean value of the moving average. The time series will then be decomposed again, in order to get
the updated trend.
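A minimal sketch of this burst-handling step, continuing from the decomposition example above and assuming the same cpu_ts series: points that deviate strongly from a moving average are flagged as bursts, replaced, and the cleaned series is decomposed again. The two-hour window and the three-standard-deviation threshold are illustrative choices, not values prescribed by the method.

# Moving average over a two-hour window (61 samples at 2-minute spacing).
window <- 61
ma     <- stats::filter(cpu_ts, rep(1 / window, window), sides = 2)

# Flag burst points: observations that deviate strongly from the local average.
dev   <- as.numeric(cpu_ts) - as.numeric(ma)
burst <- !is.na(dev) & abs(dev) > 3 * sd(dev, na.rm = TRUE)

# Replace the burst intervals with the moving-average value, interpolate any
# remaining gaps, and decompose the cleaned series again for an updated trend.
clean        <- as.numeric(cpu_ts)
clean[burst] <- as.numeric(ma)[burst]
clean        <- approx(seq_along(clean), clean, xout = seq_along(clean), rule = 2)$y
clean_ts     <- ts(clean, frequency = frequency(cpu_ts))
dec_clean    <- decompose(clean_ts)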
When monitoring performance counters we want to receive an alert when significant increases or decreases of the performance counters are
expected, in order to prevent potential performance problems. In the context of forecasting the time series of performance counters, we suggest applying trend analysis, which includes the extraction of the general direction (trend) of a time series, the analysis of the underlying causes
which are responsible for these patterns of behavior, and the prediction of future events in the time series based on the past data. We use the
technique of trend estimation (Han, 2005), which fits a linear model to the trend curve of the time series through linear regression, and subsequently performs extrapolation, i.e. constructs new data points based on the known data points. The fitted line helps us quickly and automatically draw conclusions about a possible significant increase or decrease of the counter being monitored. An estimate of the percentage of
the increase/decrease can also be calculated using the equation of the predictor variable. This technique provides the stakeholders with an
automated way to identify and study significant increases or decreases of the performance counters, as well as predict and thus prevent future
performance problems.
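Continuing the same example, trend estimation could be sketched as follows: the burst-free trend component is fitted with a linear model, extrapolated one day ahead, and the relative change over the observation window is reported. The forecasting horizon and the percentage calculation are illustrative assumptions.

# Fit a straight line to the burst-free trend component.
trend_vals <- as.numeric(dec_clean$trend)
idx        <- seq_along(trend_vals)
fit        <- lm(trend_vals ~ idx, na.action = na.exclude)

# Extrapolate one extra day (720 samples) beyond the observed data.
future   <- data.frame(idx = max(idx) + seq_len(720))
forecast <- predict(fit, newdata = future)

# Estimated relative increase/decrease over the observation window,
# derived from the fitted line (the equation of the predictor variable).
start_fit  <- predict(fit, newdata = data.frame(idx = min(idx)))
end_fit    <- predict(fit, newdata = data.frame(idx = max(idx)))
change_pct <- 100 * (end_fit - start_fit) / start_fit
cat(sprintf("Estimated trend change over the period: %+.1f%%\n", change_pct))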
Monitoring the Performance of Applications
In order to monitor the responsiveness of the system, we need to collect information every time the end-user performs an action on the software
product. Every action that happens on the end-user’s side is represented by a record that includes a date and time, details of the operation and
the response times. Therefore, the responsiveness data are sequence data, since they constitute sequences of ordered events, and can be treated
as time-series data.
However, the data related to the performance of individual applications differ from the time-series data that we studied in the previous task: here, we have sequences of values which are measured at unequal time intervals. Furthermore, the same application may be accessed by more than one user at the same time (e.g. running on different servers). Therefore, although we propose again time series analysis, burst detection
and trend analysis, their implementation will be different. More specifically, we need to select a sufficient time period (e.g. one hour) within
which most software applications are expected to have been used by the end-users. Also, we choose to omit hours (e.g. between 00:00 and 06:00)
or days (e.g. weekends), if the software product has too low usage during these time intervals.
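A minimal sketch of this aggregation step, reusing the hypothetical ops data frame from the earlier example: response times of one application are averaged per hour, night hours and weekends are dropped, and the result becomes a regular series that can be decomposed and trended exactly as before. The application name, the 06:00-23:00 window, and the resulting frequency of 18 retained hours per weekday are assumptions.

# Hourly aggregation of irregularly spaced response-time records
# (with real operation data this would span several weeks of usage).
app_log      <- subset(ops, application == "BankStatus")
app_log$hour <- format(app_log$timestamp, "%Y-%m-%d %H:00")
hourly       <- aggregate(duration_ms ~ hour, data = app_log, FUN = mean)

# Keep weekday business hours only; low-usage periods are omitted.
hr     <- as.integer(substr(hourly$hour, 12, 13))
dow    <- format(as.POSIXct(hourly$hour, format = "%Y-%m-%d %H:%M"), "%u")
hourly <- hourly[hr >= 6 & hr <= 23 & dow %in% as.character(1:5), ]

# 18 retained hours per weekday define the period of the resulting series,
# which can then be decomposed and trended exactly like the server counters.
resp_ts <- ts(hourly$duration_ms, frequency = 18)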
The trend analysis will help detect an important increase in the response times of an application, which needs further inspection and analysis to identify the cause of that increase. The seasonal variations curve indicates when an application is over-used by the end-users, when the system responsiveness is lower, and so on. The burst detection will identify time intervals with response times significantly higher than usual, which should be further inspected and removed from the time series.
Analyzing the Performance of Applications
When the response time of the system to process a request from an end-user is significantly higher than usual, we consider that we have a
performance problem. Several reasons may lie behind the performance degradation (e.g. a problem on the server). In this research, we consider
the causes described in section 4. Since the reasons of performance problems vary, we will need to use as much information as possible, thus all
variables presented in section 5 are relevant to this analysis task.
Association rules mining is a popular data mining technique, which is used to discover interesting relationships that are hidden in large data sets.
The relationships constitute relations between attributes on the value-level, which are represented by association rules, i.e. expressions in the
form of A→ C, where A and C are sets of items in a transactional database, and the right arrow implies that transactions which include the
itemset A tend to include also the itemset C (Agrawal, Imielinski, & Swami, 1993). The problem of association rules mining on performance data
consists of searching for interesting and significant rules that describe relationships between attributes which expose performance problems (i.e.
high response times).
In order to select interesting rules, there are several measures of interestingness, among which the most common are support and confidence (Agrawal & Srikant, 1994): The support supp(X → Y) of a rule X → Y is defined as the proportion of transactions in the database that contain both itemsets X and Y, over the total number of transactions in the database. The confidence of a rule X → Y is defined as the
conditional probability of having the itemset Y given the itemset X; in other words, the proportion of transactions in the database that contain
both itemsets X and Y, over the number of the transactions which contain the itemset X. Consequently, the problem of association rules mining
could be defined as the discovery of interesting rules, i.e. rules that outweigh the thresholds we set for the minimum support and the minimum
confidence (Tan, Steinbach & Kumar, 2005).
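Stated compactly, with $N$ the total number of transactions in the database and $\sigma(Z)$ the number of transactions that contain itemset $Z$, these two measures and the resulting mining criterion read:

\[
\mathrm{supp}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N},
\qquad
\mathrm{conf}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} = \frac{\mathrm{supp}(X \rightarrow Y)}{\mathrm{supp}(X)}
\]
\[
\text{report } X \rightarrow Y \quad \text{iff} \quad \mathrm{supp}(X \rightarrow Y) \ge \mathit{minsup} \ \text{ and } \ \mathrm{conf}(X \rightarrow Y) \ge \mathit{minconf}.
\]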
Frequent pattern mining constitutes the discovery of sets of items in a database that appear frequently in the same transactions, i.e. with support
measure higher than the minimum support threshold (Srikant, Vu & Agrawal, 1997). The computational complexity of frequent itemset
generation could be reduced by the A Priori principle, according to which “a set of items can only be frequent if all its (non-empty) subsets are
frequent” (Agrawal & Srikant, 1994). Also, the number of comparisons can be reduced by compressing the dataset or using advanced data
structures (Tan et al., 2005). After mining all frequent itemsets, we can generate all possible rules that have confidence higher than the minimum confidence threshold. The resulting rules may be post-processed, i.e. ordered by their measures of interestingness, such as the lift measure, which indicates whether transactions that contain itemset X have an increased probability of also containing itemset Y, compared to the probability of itemset Y across all transactions in the database.
In this section we reviewed the data mining techniques which may be performed on performance data of software-as-a-service applications, for
detecting performance anomalies, as well as reasoning about their causes. An overview is provided in Table 1.
Table 1. Summary of data mining techniques for performance monitoring and analysis
PERFORMANCE MINING METHOD
In order to provide guidance to the software vendors in the analysis and monitoring of their software products’ performance, we propose the
Performance Mining Method. The method has been constructed using the Method Engineering approach as described by Weerd and
Brinkkemper (2008). The method is intended for mining performance and usage data of software-as-a-service applications, which are collected at a central point on the software vendor’s side. However, it could easily be adjusted or expanded for other types of products. In the method we included only a subset of the aforementioned data mining techniques, namely the techniques employed in the Performance Mining Prototype. It would be an interesting direction for future research to experiment with the remaining suggested techniques, and then to include in the method the possibility to select which data mining technique will be used.
The method includes three activities, which correspond to the three suggested tasks: monitoring the performance on the servers, monitoring
performance of applications, and analyzing performance of applications. Each activity consists of several steps (sub-activities), which follow the
CRISP-DM cycle (Chapman et al, 2000), starting from the business understanding phase, and following the phases of data understanding, data
preparation, modeling, until the evaluation phase. The whole Process Deliverable Diagram is not provided here, because of its size and
complexity. Instead, the process diagram is presented in Figure 2.
Figure 2. Process diagram of the performance mining method
EVALUATION OF THE PERFORMANCE MINING METHOD
The Performance Mining Method is instantiated in a prototype, which has been created in order to implement and validate the method and the
suggested data mining techniques. In order to validate our method, we performed a case study in a Dutch software company, where we
implemented the Performance Mining Method and ran the prototype in the context of a real software product.
Performance Mining Prototype
The Performance Mining Prototype is implemented in the R programming language and environment (Team R. C., 2012), and is intended to analyze usage and user data which are gathered through logging tools and stored as logs in a central database. The prototype consists of three scripts which correspond to the three main analysis tasks: monitoring performance on the servers, monitoring performance of applications, and analyzing performance of applications. Furthermore, one function has been implemented in Java, whose task is to prepare the log file in a format suitable for importing into R.
For the task of monitoring performance on the servers, we created a function that is used to estimate the trend of each performance counter, after
having pre-processed the data (following the steps that were presented in the respective activity of the Performance Mining Method). The
function creates the time series object, performs some extra preprocessing (e.g. handling the missing values), performs time series analysis,
calculates the moving average and identifies the burst intervals, and then removes the bursts, and finally estimates the trend through linear
regression. All outputs from the analysis are printed in image files, which are stored in the previously defined output folder.
The script for the task of monitoring performance of applications includes time series decomposition, burst detection and trend estimation. Thus, it is very similar to the script that was created for monitoring performance on the servers. However, here we do not have measurements that are collected periodically 24×7. The solution to this complication is to select a sufficient time period (e.g. one hour) within which most software applications are expected to have been used by the end-users. Also, it might make sense to omit hours (e.g. night time) or days (e.g. weekends) if the software product has too low usage during these time intervals. A function similar to the one described above is implemented for this task, to estimate the trend of the response time of each application.
In the analysis of performance of applications, we try to detect the causes for performance problems (i.e. too long duration of an application) that
occur during software operation. We use Associations Rules Mining through Frequent Pattern Mining and search for interesting and significant
rules that describe relationships between operation attributes which expose performance problems. We use the Apriori algorithm (Hahsler et al.,
2009). We set the minimum support threshold to minsup = 0.005 and the minimum confidence to minconf = 0.6. Although the support threshold in particular may seem very low, we should note that the performance problems we try to analyze constitute very rare cases, especially if we consider the variety of users who may use the product and the variety of applications that are inspected. However, these parameters can and should be adjusted to the specifications of the product under study each time.
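A minimal sketch of this mining step with the arules package, under the assumption that the pre-processed operation records have already been discretized into factor columns; the column names, the toy values, and the "Duration=high" item label are illustrative, not the actual schema used in the prototype. The coercion to transactions produces the binary incidence matrix described above, with one item per attribute-value pair.

library(arules)

# Discretized operation records: every column is a categorical attribute,
# including the binned response time (Duration = low/high). Toy values only.
ops_disc <- data.frame(
  Application = factor(c("BankStatus", "InvoiceGrid", "BankStatus", "BankStatus")),
  Server      = factor(c("web01", "web01", "web02", "web02")),
  Customer    = factor(c("C042", "C017", "C042", "C042")),
  Duration    = factor(c("high", "low", "high", "low"))
)

# Binary incidence matrix (one item per attribute=value pair), then Apriori
# with the thresholds used in the prototype: minsup = 0.005, minconf = 0.6.
trans <- as(ops_disc, "transactions")
rules <- apriori(trans, parameter = list(support = 0.005, confidence = 0.6))

# Keep rules that expose performance problems (high duration on the right-hand
# side), order them by lift, and export them with their quality measures.
high <- subset(rules, subset = rhs %in% "Duration=high")
high <- sort(high, by = "lift")
inspect(high)
write.csv(as(high, "data.frame"), "high_duration_rules.csv", row.names = FALSE)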
CASE STUDY
In order to validate the Performance Mining Method and Prototype, we performed a case study in a Dutch software organization that delivers
business software solutions to more than 100,000 small to medium enterprises. Our case study was performed in the context of an internet-
based accounting solution which constitutes one of the main software products of the company. This Software-as-a-Service application serves
over 10,000 customers and is used by more than 3,500 users per day.
The application logs performance and usage data, through a log-generating code that is incorporated in the system layer. Every time the
application is used by an end-user, several variables are stored in log tables in the administration database, on the server’s side. The information
that is logged contains: usage data (user id, application used, date/time, action taken, etc.), performance data (how long it took to process a
query, how long it took to load an application, etc.), quality data (errors that appeared during usage, etc.) and other useful information such as
which background tasks were running during usage. All logs are stored in the database for the period of 90 days. Furthermore, the System
Monitor is also used to log counters related to usage of resources in the machines where the servers (Microsoft Windows Servers) are running.
However, no sophisticated analysis is performed on any of the logs; the analysts only perform basic statistical operations on the data and produce common statistical plots for their monitoring.
In the first months of 2010, the application faced several performance issues, as a lot of new customers started working with the product. Similar to a traffic jam, the load on the web servers, network, SQL server, etc. increased significantly, clogging the system and eventually bringing it to a standstill. In numbers, the performance analysts observed an increase in the average time for processing web requests from about 350 ms to about 500 ms.
We examined how our suggested Performance Mining method could be applied in this case, in order to monitor the application's performance and
analyze the performance problems that appear. We also ran our prototype with sample log data collected during the product’s operation, in order
to validate its functionality. In the following paragraphs we present the result of each performance mining task.
Monitoring Performance of Servers
We selected metrics related to categories of resources that are used on the machines where the web servers are running: processor, system object,
the physical disks, the memory, network interface, the web tier, and the SQL tier. We used the related script from our prototype to monitor
performance counters that were measured on the machines of two web servers in the period of two months, from 1st February 2010 until 1st April
2010. We used the prototype to pre-process the dataset (transforming, cleaning, selecting data, etc.), create the time series object, and perform
time series decomposition, burst detection and trend estimation. In the end, the knowledge was represented in the format of images, which could
be inspected by the performance analysts.
An example of the output can be viewed in Figure 1, which illustrates the time series decomposition for the performance counter %Processor
Time on server x1. In the observed data plot we can clearly identify a burst interval between days 13 and 15. Such a burst would influence the result of our trend analysis and therefore needs to be handled before estimating the trend of the counter. Furthermore, the burst raises considerable concern for the performance analyst, who will need to conduct a thorough analysis of the logs recorded in that time interval in order to look for the causes of these anomalous counter values. An increasing trend in the performance counter might lead the performance architect
to consider the possibility of expanding the hardware, or updating the server architecture in the future.
Monitoring Performance of Applications
As far as monitoring the performance of applications is concerned, we selected which applications were interesting to monitor. That may depend
on the type of applications the performance analyst is interested in. For our case, it was interesting to monitor grid- and matrix-type applications, which may suffer delays when the number of records that have to be loaded or stored is high. We used the related script from
our prototype to monitor performance of applications that were used in the test environment of the application from 20th February 2010 until
20th May 2010. We used the prototype to preprocess the dataset (transforming, cleaning, selecting data, etc.) and to prepare the time series
object, which goes through decomposition, burst detection and trend estimation. In the end, the knowledge was represented in the format of
images, which could be inspected by the performance analysts. An example of the output is provided in Figure 3, which illustrates the trend
analysis for the duration of the BankStatus application (after having removed any identified bursts). As we can roughly see from the fitted line on the graph, the duration of the application has increased by more than 180 ms. This is a significant increase that the performance analyst should look
further into.
Figure 3. Trend estimation for the BankStatus application after removing the burst intervals
Analyzing Performance of Applications
In order to analyze the performance of applications, we first had to make a list of the possible performance problems that might arise, similar to
the one presented in section 4. Also, since there were many details that influenced the performance of a specific (category of) application(s), we
made a list of performance constraints which may influence the duration of applications. We suggested solutions on how to handle these
constraints in the analysis, in order to decrease any “noise” in the dataset, which could bias the resulting rules. Considering the hypotheses and
the constraints, we made a plan for the association analysis, which included which variables would be included in the logs, what type of response
time we would use to find performance problems, etc.
We used the related script from our prototype to analyze the performance of applications from the logs that were collected in three consecutive
days, from 6th April 2010 until 8th April 2010. We selected these specific days because several delays in the response time for processing web requests were observed in that period, and we wanted to find the reasons behind these problems. After preprocessing the data (which contained 441,662 rows), we created the binary incidence matrix, which was mined with the Apriori function to find the most frequent itemsets and the rules with high duration on the right-hand side. All the rules that resulted from the association analysis, together with their interestingness measures, were also exported in CSV file format. An excerpt of the most interesting rules with “Duration = high” on the right-hand side is presented in Table 2.
Table 2. Example of association rules for high duration
In this section, we presented how we applied our Performance Mining method and prototype in the case study. We customized our prototype
code in order to process the usage logs and provided the performance engineers with all the necessary code and documentation to perform the
same analysis in the future.
DISCUSSION AND CONCLUSION
In this paper, we tried to answer the research question of how we can detect performance anomalies and their cause in software operation data.
We searched for data mining techniques that are particularly applicable for identifying performance problems that appear on the end-user’s side;
and techniques that could be used to pinpoint the causes of these problems. We suggested a method to apply performance mining in a uniform
way. We also validated this method and tested a subset of the proposed techniques through a prototype, which was deployed in the case of an
online financial management application.
In the case study that we performed, our method could easily be implemented and was well understood by the stakeholders. Through the activities that were followed, the organization gained insights not only into the performance of the product, but also into the way performance had been analyzed before and how the process could be improved in order to obtain more analytical results in an automated way.
The two monitoring tasks gave very satisfactory results: Through the time series decomposition, we managed to extract the trend, seasonal and
random movements of the performance counters and the response times of applications in the long term; the burst detection helped identify
burst intervals, which deserve further inspection and analysis from the performance analysts, but also helped remove the noise created in the
trend estimation by singular events. The trend analysis helped estimate the trend of the counter/response time, and thus helped prevent future performance problems. However, the analysis task, although it seemed very useful in theory, did not yield satisfactory results. Running the
mining for association rules encountered memory allocation problems, due to the static memory allocation in R. Some possible solutions include
processing the data in chunks, adjusting the memory usage settings of the R environment, or upgrading the hardware of the machine on which the
analysis takes place. Further research should be conducted in the performance analysis task. An implementation of association rules mining in
another language, an update of the system hardware, or the experimentation with other data mining techniques (e.g. exceptional model mining)
should be considered. Nevertheless, even in its current state, the prototype provides an automated technique to associate performance problems with possible reasons, and can at least find the evident ones.
In the future, it would be interesting to extend the method and prototype to analyze software operation data of more products, but also to try out
other data mining techniques that were suggested for the related tasks. From a technical perspective, this research could also be extended to the
performance analysis of very large software products, such as Twitter or Facebook; in such large settings we may need to explore the combination
of data mining techniques with techniques on distributed computing on large data sets, on clusters of computers (e.g. MapReduce). From a
business perspective, it would be interesting to study how the performance mining method could help automate the performance evaluation of
software products, and whether it would succeed in reducing costs and increasing customer satisfaction (Anderson & Chen, 1997).
In conclusion, this research contributes to the domain of knowledge management in software organizations (Bjørnson & Dingsøyr, 2008; Lindvall & Rus, 2002), and specifically to the extraction of knowledge on how software operates in the field (van der Schuur et al., 2010). In addition, this research contributes to the Software Performance Evaluation domain by suggesting specific data mining techniques, and a method to employ them, in order to detect performance anomalies and their causes in software operation data. We expect that the performance knowledge that can be
derived from software operation data may be used to support the processes of software product management, development and maintenance
(Midha & Bhattacherjee, 2012).
This work was previously published in the International Journal of Business Intelligence Research (IJBIR), 6(1); edited by Virginia M. Miori,
pages 1129, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
REFERENCES
Agrawal, R., Imieliński, T., & Swami, A. (1993, June). Mining association rules between sets of items in large databases. SIGMOD Record, 22(2), 207–216. doi:10.1145/170036.170072
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases. San Francisco, CA, USA.
Ammons, G., Ball, T., & Larus, J. R. (1997). Exploiting hardware performance counters with flow and context sensitive profiling. ACM SIGPLAN Notices, 32(5), 85–96. doi:10.1145/258916.258924
Anderson, E. E., & Chen, Y. M. (1997). Microcomputer software evaluation: An econometric model. Decision Support Systems ,19(2), 75–92.
doi:10.1016/S0167-9236(96)00042-5
Andrews, J. H. (1998, October). Testing using log file analysis: Tools, methods, and issues. In Proceedings of the 13th IEEE International Conference on Automated Software Engineering (pp. 157-166). IEEE. doi:10.1109/ASE.1998.732614
Arinze, B., Igbaria, M., & Young, L. F. (1992). A knowledge based decision support system for computer performance management.Decision
Support Systems , 8(6), 501–515. doi:10.1016/0167-9236(92)90043-O
Avritzer, A., Kondek, J., Liu, D., & Weyuker, E. J. (2002, July). Software performance testing based on workload characterization. In Proceedings of the 3rd International Workshop on Software and Performance (pp. 17-24). ACM. doi:10.1145/584369.584373
Bjørnson, F. O., & Dingsøyr, T. (2008). Knowledge management in software engineering: A systematic review of studied concepts, findings and
research methods used. Information and Software Technology , 50(11), 1055–1068. doi:10.1016/j.infsof.2008.03.006
Bolloju, N., Khalifa, M., & Turban, E. (2002). Integrating knowledge management into enterprise environments for the next generation decision
support. Decision Support Systems , 33(2), 163–176. doi:10.1016/S0167-9236(01)00142-7
Borzemski, L. (2009). Towards Web performance mining . In Web Mining Applications in E-commerce and E-services (pp. 81–102). Springer
Berlin Heidelberg. doi:10.1007/978-3-540-88081-3_5
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0 Step-by-step data mining guide.
Courtney, J. F. (2001). Decision making and knowledge management in inquiring organizations: Toward a new decision-making paradigm for
DSS. Decision Support Systems , 31(1), 17–38. doi:10.1016/S0167-9236(00)00117-2
Cowpertwait, P. S., & Metcalfe, A. V. (2009). Introductory time series with R (pp. 51–55). New York: Springer.
Delen, D., & Demirkan, H. (2013). Data, information and analytics as services. Decision Support Systems , 55(1), 359–363.
doi:10.1016/j.dss.2012.05.044
Demirkan, H., & Delen, D. (2013). Leveraging the capabilities of service-oriented decision support systems: Putting analytics and big data in
cloud. Decision Support Systems , 55(1), 412–421. doi:10.1016/j.dss.2012.05.048
Fayyad, U. M., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery: An overview. In Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.), Advances in knowledge discovery and data mining (pp. 1–34). Menlo Park, CA, USA: American Association for Artificial Intelligence.
Gunter, D., & Tierney, B. (2003, March). NetLogger: A toolkit for distributed system performance tuning and debugging. In Integrated Network Management, 2003. IFIP/IEEE Eighth International Symposium on (pp. 97-100). IEEE. 10.1109/INM.2003.1194164
Hahsler, M., Grün, B., & Hornik, K. (2005). arules – A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14, 1–25.
Hahsler, M., Grün, B., Hornik, K., & Buchta, C. (2009). Introduction to arules – A computational environment for mining association rules and frequent item sets. The Comprehensive R Archive Network.
Han, J. (2005). Data Mining: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Heinrichs, J. H., & Lim, J. S. (2003). Integrating web-based data mining tools with business models for knowledge management. Decision Support Systems, 35(1), 103–112. doi:10.1016/S0167-9236(02)00098-2
Hevner, A., & Chatterjee, S. (2010). Design science research in information systems (pp. 9–22). Springer, US. doi:10.1007/978-1-4419-5653-8_2
Imberman, S. P. (2001, December). Effective use of the KDD process and data mining for computer performance professionals. In Int. CMG Conference (pp. 611-620).
Jain, R. (1991). The art of computer system performance analysis: Techniques for experimental design, measurement, simulation and modeling. New York: John Wiley.
Johnson, M. J., Maximilien, E. M., Ho, C. W., & Williams, L. (2007). Incorporating performance testing in test-driven development. IEEE Software, 24(3), 67–73. doi:10.1109/MS.2007.77
Lindvall, M., & Rus, I. (2002). Knowledge management in software engineering. IEEE Software, 19(3), 26–38.
Loukides, M. K. (1996). System Performance Tuning (1st ed.). Sebastopol, CA, USA: O’Reilly & Associates, Inc.
March, S. T., & Smith, G. F. (1995). Design and natural science research on information technology. Decision Support Systems ,15(4), 251–266.
doi:10.1016/0167-9236(94)00041-2
Marston, S., Li, Z., Bandyopadhyay, S., Zhang, J., & Ghalsasi, A. (2011). Cloud computing—The business perspective. Decision Support
Systems , 51(1), 176–189. doi:10.1016/j.dss.2010.12.006
Midha, V., & Bhattacherjee, A. (2012). Governance practices and software maintenance: A study of open source projects. Decision Support
Systems , 54(1), 23–32. doi:10.1016/j.dss.2012.03.002
Moreno García, M. N., Quintales, L. A. M., García Peñalvo, F. J., & Polo Martín, M. J. (2004). Building knowledge discovery-driven models for
decision support in project management. Decision Support Systems , 38(2), 305–317. doi:10.1016/S0167-9236(03)00100-3
Pachidi, S., Spruit, M., & van der Weerd, I. (2014). Understanding Users' Behavior with Software Operation Data Mining. Computers in Human
Behavior , 30, 583–594. doi:10.1016/j.chb.2013.07.049
Pray, K. A., & Ruiz, C. (2005). Mining expressive temporal associations from complex data . In Machine Learning and Data Mining in Pattern
Recognition (pp. 384–394). Springer Berlin Heidelberg. doi:10.1007/11510888_38
Srikant, R., Vu, Q., & Agrawal, R. (1997, August). Mining Association Rules with Item Constraints (Vol. 97, pp. 67–73). KDD.
Srikanth, R., & Jarke, M. (1989). The design of knowledge-based systems for managing ill-structured software projects. Decision Support
Systems , 5(4), 425–447. doi:10.1016/0167-9236(89)90020-1
Sun, Y., He, S., & Leu, J. Y. (2007). Syndicating Web Services: A QoS and user-driven approach. Decision Support Systems , 43(1), 243–255.
doi:10.1016/j.dss.2006.09.011
Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining (1st ed.). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
Team, A. C. E. (2003). Performance Testing Microsoft .NET Web Applications. Redmond, Washington: Microsoft Press.
Team, R. C. (2012). R: A language and environment for statistical computing.
Thakkar, D., Hassan, A. E., Hamann, G., & Flora, P. (2008, June). A framework for measurement based performance modeling. In Proceedings of the 7th international workshop on Software and performance (pp. 55-66). ACM. 10.1145/1383559.1383567
van de Weerd, I., & Brinkkemper, S. (2008). Meta-modeling for situational analysis and design methods. Handbook of research on modern
systems analysis and design technologies and applications, 35.
van der Schuur, H., Jansen, S., & Brinkkemper, S. (2010, September). A reference framework for utilization of software operation knowledge.
In Software Engineering and Advanced Applications (SEAA), 2010 36th EUROMICRO Conference on (pp. 245-254). IEEE.
10.1109/SEAA.2010.20
Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. (2002). Predictive algorithms in the management of computer systems. IBM
Systems Journal , 41(3), 461–474. doi:10.1147/sj.413.0461
Vlachos, M., Wu, K. L., Chen, S. K., & Philip, S. Y. (2008). Correlating burst events on streaming stock market data. Data Mining and Knowledge
Discovery , 16(1), 109–133. doi:10.1007/s10618-007-0066-x
Vleugel, A., Spruit, M., & van Daal, A. (2010). Historical data analysis through data mining from an outsourcing perspective: The three-phases
method. International Journal of Business Intelligence Research , 1(3), 42–65.
Warren, C. E., Payne, D. V., Adler, D., Stachowiak, M., Serlet, B. P., & Wolf, C. A. (2010). U.S. Patent No. 7,644,397. Washington, DC: U.S. Patent
and Trademark Office.
Zo, H., Nazareth, D. L., & Jain, H. K. (2010). Security and performance in service-oriented applications: Trading off competing
objectives. Decision Support Systems , 50(1), 336–346. doi:10.1016/j.dss.2010.09.002
CHAPTER 10
Aspects Regarding Detection of Sentiment in Web Content
Cristian Bucur
Bucharest Academy of Economic Studies, Romania & Petroleum and Gas University of Ploiesti, Romania
ABSTRACT
Developments in telecommunications over the last decade have changed the way information is created and presented. New technologies have enabled the transition from a static presentation of information to a dynamic one that directly involves users. The Web is now a platform that allows users to interact with each other and facilitates the exchange of information. Users have gone from being mere consumers of online information to active participants who add to its content.
OVERVIEW
Recent advances have also changed the way information is retrieved. Opinion mining and sentiment analysis techniques offer the possibility of automatically analyzing user-generated content, and current research in this area allows the automatic identification and extraction of opinions and emotions. The information generated by Internet users has grown exponentially in recent years and has become an important source for extracting knowledge from the virtual environment. The sheer volume of data makes manual processing impossible, while automatic analysis faces additional difficulties due to users' informal language; depending on the type of text, well-formed sentence structure may even be absent (Cvijikj, 2011).
Sentiment analysis, also referred to in the scientific literature as opinion mining, is the computational determination and classification of the opinions or feelings expressed in a text. An opinion presumes the existence of an opinion holder, a target entity about which the opinion is expressed, a particular aspect of that entity, and a sentiment orientation of the opinion (Palmer, 2009).
APPLICATIONS OF SENTIMENT ANALYSIS
Opinions matter to people because they influence behavior and serve as a basis for decision making. An important part of gaining knowledge about a market is finding out what people think about it. People search online for comments and reviews posted by others when they plan a purchase or want to inform themselves on a specific topic.
Identifying consumer opinions about a company's products is as important as knowing sales volumes, but it is often more difficult to obtain. Companies can no longer rely only on internal data in their business analysis. Current research in sentiment analysis addresses the need to expand the amount of information companies collect by analyzing the huge volume of content generated on online social networks (Facebook, Twitter, Google+), in comments on e-commerce sites (Amazon.com), and in reviews of products or services on specialized platforms (Yassine, 2010).
By revealing users' opinions on specific entities or topics, sentiment analysis can be used for (Collomb & Costea, 2014):
• Determining the relevant information for a statistical analysis by removing the subjective part of the content.
Given the applications presented above, there are multiple areas where such analysis can be used:
• Marketing, for determining customers' expectations and needs, as well as brand reputation, market perception, and client retention (Gautami Tripathi, 2014)
• Government and administration, for analyzing public opinion on policies and adopted reforms
• The stock market, for predicting the evolution of stocks in correlation with market perception of a listed company
DATA SOURCES
Social networks have played an important role in the growth of user-generated information. They have changed the way information reaches potential customers, replacing the traditional one-to-many mode of communication with one-to-one communication (Harrison, 2013). Opinion mining techniques applied to social networks help to understand how certain products or services are perceived in the market. Marketers, having understood the potential of marketing in social networks, have significantly changed the way they communicate with potential customers (Liu, 2011). Studying the social environment also provides information about the products consumers need, through the feedback given in comments and reviews.
The large number of online reviews on various specialized platforms meets the information needs of consumers. Reviews can be used to compare offers on the market in order to make an informed purchasing decision. For the average user, however, making an informed choice based on online information is difficult because of the huge amount of content and the impossibility of browsing enough of it to form an accurate picture. A user's knowledge of the metrics on which to compare products in a given area may also be limited.
Depending on the domain in which the sentiment analysis is performed, there are several possible sources of web content:
• E-commerce sites – all major e-commerce sites now have huge databases of client reviews of products (www.amazon.com, www.flipkart.com)
• Specialized review sites – web platforms where users can express their opinions about products or services; there are sites specialized in reviews of products, movies (www.rottentomatoes.com), hotels, or restaurants (www.yelp.com, www.burrp.com) (Govindarajan, 2013)
• Blogs and micro-blogging sites – full of opinions expressed by people recording their daily activities, feelings, and emotions in an online journal
• News articles – besides the articles themselves, all news sites have a section that allows readers to comment on the subjects debated
CLASSIFICATION OF ANALYSIS
An efficient sentiment analysis process should classify the reviews in a particular area, summarize the existing opinions into results that are easy for users to interpret, and facilitate the selection of the aspects of entities on which opinions are expressed, classifying them as positive or negative.
Sentiment analysis can be performed with a knowledge-based system or a machine learning system. Knowledge-based systems require the creation of rules and manual adjustment, whereas machine learning systems require a significant set of training data and automate the process.
The most important aspects by which a sentiment analysis can be classified are the following:
• Expected rating
• Using machine learning methods – learning algorithms are trained on datasets with a known classification
• Lexicon-based – in this approach a value called sentiment polarity is obtained by calculating the subjectivity of the sentences or words in the text
• Rule-based – the classification is done by extracting the opinion words in the text and computing their relative proportions; there are rules for negation words, stop words, and dictionary-based polarity
• Statistical approach – aspects and ratings are represented as multinomial distributions, and sentiments are clustered into ratings and terms into aspects (Collomb & Costea, 2014)
• On sentence level
• On document level
The opinions resulting from the analysis can be summarized as scalar or polarity values. This case study presents two ways of detecting the polarity of opinions extracted from online platforms. One way is to use the semantic polarity of words, together with an algorithm that determines the semantic orientation of sentences and documents. Another way is to use supervised learning algorithms; in this case the algorithm is first trained on a data set representative of the analysis domain.
Using semantic orientation has the advantage of being independent of the domain in which the analysis is performed, but achieving better accuracy in the detection of opinions requires training on a domain-oriented dataset.
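To make the lexicon-based route concrete, the short Python sketch below scores a sentence by summing word polarities from a small lexicon and flipping the sign after a negation word. The lexicon, the negation list, and the scoring rule are illustrative assumptions for this sketch, not the resources or the exact algorithm used in the study.

# Minimal lexicon-based polarity sketch (illustrative assumptions only).
POLARITY = {"good": 1, "great": 2, "enjoyable": 1, "boring": -1, "awful": -2}
NEGATIONS = {"not", "never", "no"}

def sentence_polarity(sentence: str) -> int:
    """Sum word polarities; a preceding negation flips the sign of the next opinion word."""
    score, negate = 0, False
    for word in sentence.lower().split():
        word = word.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        value = POLARITY.get(word, 0)
        score += -value if negate else value
        if value:
            negate = False   # negation consumed by this opinion word
    return score

print(sentence_polarity("The movie was not boring"))  # +1 (positive)
print(sentence_polarity("An awful, boring film"))     # -3 (negative)

A positive score is read as a positive orientation, a negative score as a negative one, and zero as neutral.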
METHODOLOGY
The opinion mining process requires the completion of certain phases, similar to a knowledge extraction process:
1. In the first phase the analyzed domain is determined, the appropriate data sources are identified (the web sites from which the data to be analyzed will be collected), and the method of data collection is established.
2. The second stage involves the actual data acquisition and its processing so that opinions can be identified in the next step. At this stage the dataset (user reviews from web pages) is constructed using a web crawler. The comments are then pre-processed to eliminate text that is irrelevant to the analysis: only the sentences in which opinions are expressed are stored, and the rest are removed. Not all the words in a sentence are relevant for determining an opinion either, so the irrelevant parts of the sentence must also be removed from the dataset.
Figure 1. Stages of an opinion mining analysis (source: adapted from Smeureanu & Bucur, 2012: Applying Supervised Opinion Mining Techniques on Online User Reviews)
3. The central stage is the classification process. The reviews obtained in the previous stages are classified and their polarity is determined. The classification algorithm identifies a review as belonging to one of the classes: positive, negative, or neutral.
4. Next, the general classification of a review is determined by aggregating the opinions at the sentence level according to a chosen algorithm, as sketched below; the opinion at the document level is likewise calculated as a summary of the sentence-level results.
5. The final stage is the interpretation of the results obtained and the presentation of the classification results at the required level of analysis (Smeureanu & Bucur, 2012).
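The aggregation in stage 4 can be as simple as a vote over the sentence-level labels. The sketch below uses a majority-vote rule, which is only one possible aggregation choice and not necessarily the one applied in the study.

from collections import Counter

def aggregate_review(sentence_labels):
    """Aggregate sentence-level labels ('positive'/'negative'/'neutral') into
    a review-level label by majority vote (one possible aggregation rule)."""
    if not sentence_labels:
        return "neutral"
    return Counter(sentence_labels).most_common(1)[0][0]

print(aggregate_review(["positive", "negative", "positive"]))  # positive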
The results can be delivered as a score identifying the polarity of the opinion, usually positive or negative according to the two classes, or as textual or visual categories using graphs.
Following the classification, the validity of the obtained results must be checked. The study assesses the classification using measures specific to the text mining domain: precision, recall, and accuracy. In this study, to obtain an accurate assessment of the two proposed classification methods, their effectiveness is evaluated on the same data set, with sentiment analysis applied at the sentence level. For classifying opinions into classes, with sentences treated as instances of a class, the following measures are used:
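The equations themselves are not reproduced in this rendering of the text; the standard definitions, in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), are:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)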
Since precision and recall are not always relevant individually, for greater relevance we also calculate the harmonic mean of the two, called the F measure:
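The formula is likewise missing from this rendering; in its standard form the harmonic mean of precision and recall is:
F = 2 × Precision × Recall / (Precision + Recall)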
CASE STUDY
The study aims to determine the opinions in user reviews of movies, taken from specialized websites.
For the case study we used a collection of reviews already extracted and classified into positive and negative classes. The collection, taken from (Lee, 2012), is divided into a class of positive sentences and one containing negative sentences. Reviews submitted on websites usually consist of several sentences, so to facilitate the process the analysis was performed at the sentence level. Using an aggregation algorithm, the general opinion of a review can then be determined from the opinions extracted from each of its sentences. Figure 2 presents user reviews of one movie, as they appear on a prestigious movie review site (www.rottentomatoes.com).
Figure 2. Example of user reviews on a movie (source:
http://www.rottentomatoes.com)
The sentiment analysis is conducted with a supervised algorithm based on a naive Bayesian classifier; the way this type of analysis is performed is described in (Smeureanu & Bucur, 2012). The advantage of the Bayesian classifier is that it provides good results while being quick and easy to implement.
The classifier is based on Bayes' theorem: it uses the probabilities of complementary events and the independent probabilities of individual events to determine a conditional probability.
The algorithm analyzes a set of examples labeled positive and negative and, based on the frequency of occurrence of each word in each class, estimates the probability that the word has a positive or negative significance. From the probabilities of the individual word occurrences, the probability of a document or sentence is computed as the product of those probabilities. The process requires a data set pre-classified into positive and negative classes, as is specific to supervised learning, from which the occurrences of words in each class are calculated.
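A minimal sketch of such a word-frequency naive Bayes classifier is given below, assuming a pre-classified set of (sentence, label) pairs. Laplace smoothing and log-probabilities are added here for numerical robustness; they are implementation choices of this sketch, not details given in the chapter.

import math
from collections import Counter

class NaiveBayesSentiment:
    """Word-frequency naive Bayes over 'positive'/'negative' sentences."""

    def fit(self, sentences, labels):
        self.word_counts = {"positive": Counter(), "negative": Counter()}
        self.class_counts = Counter(labels)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(sentence.lower().split())
        self.vocab = set(self.word_counts["positive"]) | set(self.word_counts["negative"])
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        scores = {}
        for label in ("positive", "negative"):
            # log prior + sum of log word likelihoods with Laplace smoothing
            score = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in sentence.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Toy usage with a hypothetical two-sentence training set
clf = NaiveBayesSentiment().fit(
    ["a great and enjoyable movie", "a boring and awful movie"],
    ["positive", "negative"])
print(clf.predict("great movie"))  # positive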
This method treats each word of a sentence as independent, with no connection between words. In reality, however, some words appear more frequently together or depend on each other in certain contexts expressing an opinion, while certain other words are rarely used to express opinions.
Classification efficiency is calculated using the measures discussed above. A training set consisting of 5,000 sentences extracted from reviews is used for classification. Based on this training set, 300 sentences are analyzed and the following results are obtained:
Table 1. Efficiency of initial supervised algorithm
(source: author)
Thus, by applying the naive Bayesian classification algorithm we obtained an accuracy of 0.814332247557. We then studied ways of improving the accuracy. The first step is to remove from the reviews those words that cannot express opinions: articles, prepositions, pronouns, and conjunctions. We obtain the following values of the measured indicators:
Table 2. Efficiency of supervised algorithm after removing stop words
(source: author)
A slightly higher precision value is obtained. We continue by trying to include correlations between words. The original algorithm treats words as uncorrelated with each other, but in expressing opinions certain words frequently occur together. Introducing these correlations means assessing the probability of occurrence of groups of several words instead of single words. The tables below present the results for groups of n = 2 words:
Table 3. Efficiency of supervised algorithm for groups of n=2 words
(source: author)
For groups of n = 3 words, the results are presented in Table 4:
Table 4. Efficiency of supervised algorithm for groups of n=3 words
(source: author)
Maximum efficiency is obtained for groups of up to n = 2 words; for more complex groupings the accuracy decreases.
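The word-group variants amount to replacing single-word features with n-grams before the frequencies are counted; a generic helper is sketched below (the exact feature extraction behind Tables 3 and 4 is not described in the chapter).

def ngrams(sentence, n=2):
    """Return the list of n-word groups (n-grams) of a sentence."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the movie was surprisingly good", 2))
# ['the movie', 'movie was', 'was surprisingly', 'surprisingly good']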
CONSIDERATIONS ON PERFORMED ANALYSIS
By reviewing the sentiment analysis studies published by researchers and synthesizing the results obtained in the opinion mining analysis of movie reviews above, we can make the following observations:
• A study of opinion analysis requires lexical resources that incorporate the syntactic properties and negations of a particular language.
• Particular importance should be given to adapting resources (dictionaries/lexicons) to the various fields of analysis and to the possibility of reusing existing resources.
• Research shows that the opinion mining process is centered on the studied field, so a solution for a particular area (e.g., analysis of movie reviews) will not be suited to another (e.g., user opinions on a tablet PC).
• Because the expression of feelings varies from one domain to another, the models developed for analysis require adaptation. Currently, most research studies address the analysis of English and, to a lesser extent, German, Chinese, and Spanish.
To increase the accuracy of such analyses, future research should address the following:
• Improving accuracy through the concurrent use of several analysis algorithms, testing with multiple lexical databases, and implementing self-learning capabilities, thus developing a more extensive training set.
• Strict specialization on a certain area of analysis, which will improve efficiency.
• In terms of processing performance, using a programming language with native multi-threading capabilities, optimized for a specific hardware platform, which will significantly reduce execution times.
• Improving processing performance by using a separate, specially designed server for each storage operation or analysis algorithm.
CONCLUSION
The article presents a review of the sentiment analysis domain. In the last few years research in opinion detection has progressed considerably and many methods have been proposed, but it is not yet a mature domain that can easily be transferred into production applications. The most widely used methodology is machine learning, which needs large datasets for training. Future research must aim at improving the precision of the algorithms and at dealing with problems related to different languages and to the syntax and semantics of the analyzed text.
This work was previously published in the International Journal of Sustainable Economies Management (IJSEM), 3(4); edited by Dorel
Dusmanescu, Andrei JeanVasile, and Gheorghe H. Popescu, pages 2432, copyright year 2014 by IGI Publishing (an imprint of IGI Global).
ACKNOWLEDGMENT
This work was co-financed from the European Social Fund through Sectoral Operational Programme Human Resources Development 2007-
2013, project number POSDRU/159/1.5/S/134197 “Performance and excellence in doctoral and postdoctoral research in Romanian economics
science domain”.
REFERENCES
Bucur, C. (2011). Implications and Directions of Development of Web Business Intelligence Systems for Business Community .Economic
Insights-Trends and Challenges , 64(2), 96–108.
Bucur, C., & Tudorica, B. G. (2012). A research on retrieving and parsing of multiple web pages for storing them in large databases. Revista Economica, Supplement No. 5/2012, 119-127. ISSN: 1582-6260.
Collomb, A., Costea, C., Joyeux, D., Hasan, O., & Brunie, L. (2014). A study and comparison of sentiment analysis methods for reputation evaluation. Rapport de recherche RR-LIRIS-2014-002.
Cvijikj, I. P., & Michahelles, F. (2011). Understanding Social Media Marketing: A Case Study on Topics, Categories and Sentiment on a Facebook Brand Page. Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future Media Environments, pp. 175-182. 10.1145/2181037.2181066
Esuli, A., & Sebastiani, F. (2006). SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. Proceedings of the 5th Conference on Language Resources and Evaluation (LREC'06).
Gautami Tripathi, N. S. (2014). Opinion mining: A review. International Journal of Information & Computation Technology, 4(16), 1625–1635.
Govindarajan, M., & R. M. (2013). A Survey of Classification Methods and Applications for Sentiment Analysis. International Journal of Engineering Science (IJES), 2(12), 11–15.
Liu, B. (2011). Web Data Mining - Exploring Hyperlinks, Contents, and Usage Data, Second Edition. Springer.
Palmer, A. L., & Koenig-Lewis, N. (2009). An experiential, social network-based approach to direct marketing. Direct Marketing: Int. J. , 3(3),
162–176. doi:10.1108/17505930910985116
Smeureanu, I., & Bucur, C. (2012). Applying Supervised Opinion Mining Techniques on Online User Reviews. Revista Informatică Economică,
Vol. 16 No. 2/2012, 81-91.
Tudorica, B. G., & Bucur, C. (2011). A comparison between several NoSQL databases with comments and notes. RoEduNet International Conference Networking in Education and Research, 23-25 June 2011, pp. 1-5.
Yassine, M. (2010). A Framework for Emotion Mining from Text in Online Social Networks. IEEE International Conference on Data Mining Workshops, Sydney, Australia, December 13, 2010. 10.1109/ICDMW.2010.75
CHAPTER 11
The Interoperability of US Federal Government Information:
Interoperability
Anne L. Washington
George Mason University, USA
ABSTRACT
Interoperability sets standards for consistency when integrating information from multiple sources. Trends in e-government have encouraged the
production of digital information yet it is not clear if the data produced are interoperable. The objective of the project was to evaluate
interoperability by building a retrieval tool that could track United States public policy from the legislative to the executive branch using only
machine-readable government information. A case study of policy created during the 2008 financial crisis serves as an illustration to investigate
the organizational, technical, syntactic, and operational interoperability of digital sources. The methods of citing law varied widely enough
between legislation and regulation to impede consistent automated tracking. The flow of federal policy authorization exemplifies remaining
socio-technical challenges in achieving the interoperability of machine-readable government data.
INTRODUCTION
Legislatures pass laws that authorize executive agencies to enact public policy. Is it possible, using only digital government documents, to track
the flow of policy authorization? The answer is an investigation into interoperability and the potential of building “big data” to describe
government activity. Big data connects machine-readable sources in order to identify patterns through computational analysis. Interoperability
sets consistency standards for the integration of both technology and information structures. Aside from promoting transparency, digital public
sector information (PSI) contains valuable descriptions of how governments do business. Breaking barriers to institutional data silos could have a
broader impact on promoting transparency objectives. Computational methods could enhance internal and external understandings of
governance patterns.
Management scholars have embraced the study of big data as a continuation of existing information systems research (Agarwal & Dhar, 2014;
Dhar, 2013; Sundararajan, Provost, Oestreicher-Singer, & Aral 2013). Big data brings the potential of using methods such as business analytics
(Chen, Chiang & Storey, 2012), predictive analytics (Shmueli & Koppius, 2011), machine learning (Domingos, 2012), and data mining (Hand,
Mannila & Smyth 2001). Business analytics is a driving force behind many private sector innovations (Goes, 2014; Quaadgras, Ross, & Beath
2013). The public sector has equal potential (Kim, Trimi, & Chung, 2014; Washington, 2014). In response to eGovernment directives,
transparency initiatives, and open data advocacy, public sector managers have published increasing amounts of digital material. The proliferation
of government information has opened up the potential for linking multiple sources into large-scale big data collections.
This chapter investigates interoperability as part of a larger investigation into the research potential of open government data. In a multi-year
funded effort our research group, PI-Net, is using “big data” computational methods to ask questions about policy, politics, and governance
(Wilkerson & Washington, 2012). New data collections give researchers new ways to ask questions, and large government collections present
many opportunities. The unique features of government information may give rise to computational research challenges not seen in other data.
Interoperability was a starting place to identify the challenges and potential of current data sources. We approached the problem by building a
collection of data and documents from multiple government organizations.
The connection between the legislative and executive branches exemplifies a critical aspect of governance as well as the coordination of multiple
organizations. Legislation instructs other government organizations to implement policy described in laws. Laws must be widely distributed to
those who are impacted and to those who must implement public policy. Regulations, also known as secondary legislation or administrative law,
are the specific rules set by agencies for implementing policy. Law is already conceptually interoperable with standardized systems of references
and citations. Legal citations reference specific blocks of text, although each reference might be to a portion or to the entirety of the text. Previous
research has recognized the potential of digital legal documents and investigated law as a digital library (Arnold-Moore, Anderson & Sacks-Davis
1997), as hypertext (Wilson, 1988), and as artificial intelligence (Liebwald, 2013; Matthijssen, 1998). This chapter builds on previous research on
digital legislation.
The research design was to build and evaluate a system that tracks the policy authorization process using only United States federal open
government data. The United States has a long history of federal information policy that promotes the release of material. The commitment to
print publications has been expanded to a commitment to publish digital information. The project tracked policy released in electronic formats
from Congress to federal agencies. Although conceptual systems for tracking legislative proposals to administrative activity have existed, it is only
recently that the requisite standards, mandates, and requirements are in place to attempt machine-readable tracking.
The objective of the project was to evaluate interoperability by building a retrieval tool. First, we designed a logical model of information shared
across organizations. The logical model is refined to specific documents and activities in a model we call Authority-Tracker. Next, we built a
proof-of-concept retrieval tool based on that model called AUTHORITY-TRIEVE. The retrieval tool consolidated digital sources from the
executive and legislative branches. Finally, we used the tool to trace United States public policy and evaluate the interoperability of the data. This
paper presents a case study of tracking a single legislative provision into administrative law. The Troubled Asset Relief Program (TARP) was the
first Congressional policy response to the 2008 global financial crisis. TARP is used as a snapshot to evaluate the interoperability of US open
government data and the potential of building government “big data” collections. While the technical standards and data were widely available,
we found that different levels of operational compliance hampered complete automated tracking.
INTEROPERABILITY
Public sector information, a rich description of networks, invites connections and combinations. The ability to share information has been a
consistent aspect of eGovernment scholarship (Dawes, 1996; DeNardis, 2010; Pardo, Nam & Burke, 2012). Interoperability supports
compatibility and interdependence in eGovernment projects (Janssen, Chun & Gil-Garcia, 2009). Charalabidis, Lampathaki and Psarras (2009)
highlight the range of worldwide interoperable efforts from Germany, Denmark, Hong Kong, and the European Union.
Interoperability is the ability to exchange between two or more separate ways of doing things. It can be the exchange between independent
technology systems, information sources, organization infrastructures, data models, syntax standards, semantic definitions, or administrative
processes. Interoperability standards make it possible to share machine-readable data between applications. Transparency disclosure mandates
and eGovernment infrastructure projects have encouraged public sector organizations to steadily release data and documents in interoperable
formats.
Technical interoperability is the compatibility of hardware components, software functions, file formats, or data models. Wegner (1996)
identified two approaches to technical interoperability, which he defined as the ability to cooperate despite differences. First, interoperability can
be designed directly through the use of a common standard. For instance, battery sizes are standardized so they can fit in many types of devices.
Second, interoperability can be dynamic, where the exchange is negotiated in real-time based on an exchange protocol. For instance, any Internet
browser on any device can display a web file using the exchange protocols established for the World Wide Web. Kubicek, Cimander, and Scholl
(2011) offer specific types of technical interoperability: semantic and syntactic. Syntactic interoperability is an exchange that uses similar syntax,
citations, or formal structure of a shared language (Van der Aalst & Kumar, 2003). Semantic interoperability is an exchange of shared meaning
based on established definitions or interpretations (Charalabidis & Askounis, 2008). A syntactic interoperability example might clarify that dates
are in month-day order instead of day-month order. Semantic interoperability would be able to match the name “Upper Chamber” to the
“Senate” in one jurisdiction and “Lords” in another.
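The two examples can be made concrete with a short sketch: a syntactic rule that fixes the date order, and a semantic lookup that maps different chamber names onto one shared concept. The function and the mapping values below are illustrative assumptions, not part of any standard discussed in the chapter.

from datetime import date

def parse_month_day_date(text: str) -> date:
    # Syntactic rule: interpret 'MM-DD-YYYY' (month-day order, not day-month).
    month, day, year = (int(part) for part in text.split("-"))
    return date(year, month, day)

# Semantic mapping: different local names resolve to one shared concept.
UPPER_CHAMBER = {"Senate": "upper chamber", "Lords": "upper chamber",
                 "Upper Chamber": "upper chamber"}

print(parse_month_day_date("07-19-1999"))  # 1999-07-19
print(UPPER_CHAMBER["Lords"])              # upper chamber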
Although interoperability is often considered to be only technical by information systems managers, there are multiple other aspects. Dawes
(1996) identified three barriers to information sharing: technical, organizational, and political. The European Interoperability Framework for
pan-European e-Government Services defined three types of interoperability: organizational, semantic, and technical (European Commission,
2009). Pardo, Nam, and Burke (2012) argue that information sharing in eGovernment is based on policy, management, and technology
interoperability. Organizational or management interoperability is defined by shared business practices and administrative processes. Policy and
political definitions of interoperability include the concerns of leadership and stakeholders. DeNardis (2010) considers the interoperability life
cycle. The life cycle begins with an open standards process for all stakeholders. The next stage in the life cycle is implementation, when standards
are designed into project requirements. Finally, the standards must meet operational compliance when they are in active use in the production of
data. Interoperability, especially for the public sector, extends beyond technical compatibility.
The forms of interoperability evaluated in this study are organizational, technical, syntactic, and operational, as shown in Table 1. The exchange
of information will be evaluated to see if it matches administrative processes, file formats, citation standards, and production compliance.
Table 1. Interoperability types
INFORMATION POLICY
The interoperability of digital material emerges from earlier information policy. In the public sector and particularly in legislatures, information
policy consists of choices made to manage documents. Information policy is considered to include the laws, regulations, policies, technologies,
and practices that impact the interoperability of public sector sources. Information policy has been defined as guidelines for creating, collecting,
exchanging, presenting or deleting objects that contain recorded knowledge (Braman, 2011; Relyea 1989). Often associated with governments
only, information policy can apply to any institution that manages knowledge.
Tracking the flow of government business between organizations has historically been a challenge because of competing policy, conceptual and
technical systems. Before the advent of reliable and regular production of digital information, government agencies had little incentive to share
(Relyea, 1989). Data often languished in individual repositories or were isolated in incompatible information systems. Moreover, individual
agencies rarely are mandated to create systems that extend beyond their internal needs. Buckley Owen, Cook and Matthews (2012) considered
how a single policy did not uniformly apply across multiple United Kingdom departments. To avoid policy fragmentation, the UNESCO National
Information Systems (NATIS) suggests that nations have a single coordinating body (UNESCO, 1975). The worldwide push for modernization
created the technology infrastructure for producing open government data. Legislatures from the late 1990s through the 2000s were very active
in developing systems that could be shared with constituents on the Internet (Washington & Griffith, 2007).
An important development in the United States was the digital government research program sponsored by the National Science Foundation
(NSF) from 1998-2004. Partnerships between agencies and universities improved the technology available to government and opened up many
research opportunities for understanding the public sector (Marchionini, Haas, Plaisant & Shneiderman, 2006). The United States government
continued to emphasize the interoperability of information infrastructure through funding opportunities and mandates. If US digital government
information is widely available, what is the level of interoperability between information from the legislative and executive branches?
AUTHORITY-TRACKER MODEL
The goal of the Authority-Tracker model was to create an abstract representation of the movement of documents and information between
different branches of government. The model had to make logical sense as well as represent actual documents that were available as electronic
files. Because we were interested in the movement of documents, we chose to map the full path from end to end. We start with the creation of a
policy in the legislature and end with its refinement and enforcement. This is not a technical model. The model captures the flow of information
exchanged between agencies. The model is also intended to identify points of interoperability between documents that contain shared
information.
The simplest description of coordinated activity is similar to what is taught in most civics classes. A legislative proposal with a successful vote is
codified into law and becomes a regulation implemented by an agency. Each stage of activity produces documents. All documents are
systematically numbered and regularly printed. A series of related Congressional documents is called a legislative history. If a bill is proposed,
considered in committee hearings, and debated on the floor, the legislative history includes hearing transcripts, a committee version of the bill,
committee reports, and debate transcripts. Executive agencies assign numbers to proposed and final regulations and file them with associated
documents, such as public comments. Conceptually, we can divide this into three logical sections: legislation, law, and regulation. While these
steps are generally true, there are many exceptions to this basic description.
The first step in the research design was to build a logical model describing how US federal organizations coordinate activity by sharing
information. The model, called Authority-Tracker, established the organizational interoperability between the US legislative and executive
branches. The Authority-Tracker model began with a description of the documents and naming models that make up Congressional legislative
history. The open government data produced from the legislative history is mapped in the first stage of the model, as shown in Figure 1.
Figure 1. Model for legislation
The first stage of the conceptual model for organizational
interoperability considered the legislative branch.
The US Congress is really two organizations: the House of Representatives and the Senate. Each chamber of Congress operates differently in
accordance with its constitutional role (Krehbiel, 1991; Oleszek, 2001; Polsby, 1968). The House and the Senate must agree to a legislative
proposal, or bill, for it to become law. As Congress proposes, amends, and votes on legislation, each step is recorded in a transactional database,
and often a document is created. Legislative documents are numbered sequentially, with additional acronyms to indicate the document type and
originating chamber. For instance, S.1 is the first bill introduced in the Senate for a Congressional term.
Legislation may be closely related to other legislation. Most policy ideas are represented by multiple proposals that compete for legislative
attention. Consider the simplest example of opposing perspectives on an issue.
A very complicated example of legislation is how two bills with different numbers may be the continuation of the same idea. In other words, the
same bill number throughout the process is not the equivalent of following the same policy idea. S.1 may contain environmental policy when
introduced, but by the final vote it may contain education policy. This complication was a strong factor in understanding the financial crisis policy
discussed below. For organizational convenience and to meet procedural rules, the contents of one proposal are often replaced with the text of
another unrelated proposal. This detail was difficult for our model to consider but is an important component of tracing a new legislative
proposal through to its regulatory implementation. Conversely, this creates problems for tracing a final bill to early legislative proposals. The
practice of substituting the entire text means that bill numbers are stable but bill topics might radically shift between stages in the process.
If legislation passes, it becomes law and generates several document types that are widely promulgated, as shown in Figure 2. Enacted legislation,
or law, first becomes a statute. While Statute numbers are often unique, it is possible to have multiple laws with the same citation if they are less
than one page long. In the current era, where the legislative documents are long, the first page of the Statute number is usually a unique
identifier.
Figure 2. Model for laws
Since 1926, when the United States Code (USC) started, general and permanent laws receive two numbers: the Statute-At-Large number and the
USC number. All laws are published in date order in the Statutes-At-Large, while the United States Code is organized in subject order. The United
States Code is an essential part of the interoperability between the legislative and executive branches.
The key to interoperability between the two branches is a table that converts citations. The first connection point is Table III of the United States
Code. When a bill becomes a law, it is entered into this table. Table III shows how each law is distributed by topic across subject titles in the
United States Code. For instance, a law might have provisions relating to taxes in Title 26 and others relating to commerce in Title 15. Table III
serves as an important organizing point for moving from legislation to regulation.
Once a law is placed in the United States Code, it is used by the executive branch to set administrative law, which establishes regulations through
a process known as rule-making. Proposed regulations are described and then codified in a way similar to legislation, as shown in Figure 3.
Figure 3. Model for regulation
The third stage is the executive branch authorization of regulation, also referred to as secondary legislation.
The Federal Register, published daily, contains all administrative notices that have the force of law. A citation such as 74 FR 28405 indicates
volume 74 of the Federal Register, page 28,405, and it is usually followed by a specific date such as (July 19, 1999). The Federal Register is similar
to the Statutes-At-Large because it is a date-ordered list. The Code of Federal Regulations (CFR) is similar to the United States Code in that it is a
subject-ordered list. The Code of Federal Regulations citations follow a structure similar to the United States Code: 12 CFR 112, which may
include a part number.
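The citation forms discussed so far (Federal Register, CFR, USC, Statutes-At-Large) are regular enough to be matched with simple patterns, as the sketch below illustrates. The expressions are assumptions made for illustration, not the patterns actually used by the project's parser.

import re

# Illustrative patterns for the citation forms discussed in the text.
CITATION_PATTERNS = {
    "federal_register":  re.compile(r"\b(\d+)\s+FR\s+(\d+)\b"),           # 74 FR 28405
    "cfr":               re.compile(r"\b(\d+)\s+CFR\s+(\d+)\b"),          # 12 CFR 112
    "usc":               re.compile(r"\b(\d+)\s+U\.?S\.?C\.?\s+(\d+)\b"), # 26 USC 1
    "statutes_at_large": re.compile(r"\b(\d+)\s+Stat\.\s+(\d+)\b"),       # 122 Stat. 3765
}

def find_citations(text):
    """Return (type, volume or title, page or section) tuples found in text."""
    hits = []
    for kind, pattern in CITATION_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(1), match.group(2)))
    return hits

print(find_citations("See 74 FR 28405 and 12 CFR 112, under 122 Stat. 3765."))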
The second connection point between the two branches is a table that converts administrative law back to authorizing law. The Parallel Table of
Authorities (PTOA) converts Code of Federal Regulations citations into United States Code citations. Like Table III of the United States Code, it is
an appendix to the main codification document. The PTOA appears as an index to the Code of Federal Regulations.
References to the law are key to both the executive and legislative branches. The indexes attached to the two codification publications are vital for
moving dynamically between legislation and regulation.
Authority-Tracker is a conceptual model of how documents are shared between the two branches of the US government. The model captures the
organizational interoperability between Congress and the federal executive agencies.
The complete Authority-Tracker model, as shown in Figure 4, indicates the documents and citations that span between legislative and executive
organizations. The consistency of the numbering systems masks a complex system of usage. All documents referenced throughout the legislative
process use the bill number, but as indicated the bill number might not reference the same policy ideas. Laws do not have a consistent naming
structure. A law may be referenced by public law number, statute number, or United States Code number. In addition, a law may be referenced in
executive documents by its act name, such as the Social Security Act.
Figure 4. Authority-Tracker model
The Authority-Tracker model captures the basic flow of information from legislation to regulation. There are several limitations to the Authority-
Tracker model. Administrative law is restrained to two documents, the Code of Federal Regulations and the Federal Register. The model does not
include other documents that are part of the rule-making process, such as the Unified Agenda, which tracks agency actions. Interoperability with
the third branch of government, the judiciary, was not considered. These omissions could be useful in future research. The interoperability of
case law with the other branches would be an avenue for an expanded model.
DATA SOURCES
The second step in the research design was to identify digital sources produced by organizations in the Authority-Tracker model. A digital source
is a machine-readable file with an equivalent paper publication. Any digital source described here was openly available through the Internet from
a government website. All documents were available for download from the Government Publishing Office, and many were available directly
from agencies.
While all the materials were available in some digital form, there were significant variations in the formats available. All file formats require some
preprocessing, but PDF files, or Portable Document Format, are very difficult to parse into meaningful structures. We used the text files, which
are flat files that contain only simple text encoding characters and no structure. Indexes that coordinate between the two branches were
unfortunately not in easy-to-parse file formats. These indexes, which are regularly produced tables, serve as crosswalks between legislation and
regulation. Although available in digital formats, the files were not in any structured form and had to be parsed as simple text. The tables, Table
III and the Parallel Table of Authorities, explicitly connect documents from legislative and executive organizations. The text file had to be
carefully parsed to turn text tables into rows and columns. After parsing both tables for standard citation formats, it was possible to build a
structured table of connections. Table III connected public law numbers to their United States Code citations. The Parallel Table of Authorities
connected CFR citations to their equivalent USC citations.
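The parsing step can be illustrated with a sketch that turns dotted text rows into a CFR-to-USC lookup table. The sample rows and the regular expression are invented for the example, since the exact layout of the printed tables is not reproduced in the chapter.

import re

# Hypothetical flattened rows from a text rendering of the
# Parallel Table of Authorities: "<CFR citation> .... <USC citation>"
RAW_ROWS = [
    "12 CFR 112 ............ 12 U.S.C. 1464",
    "31 CFR 30 ............. 12 U.S.C. 5221",
]

ROW_PATTERN = re.compile(
    r"(?P<cfr>\d+\s+CFR\s+\d+)\s*\.*\s*(?P<usc>\d+\s+U\.S\.C\.\s+\d+)")

def parse_crosswalk(rows):
    """Turn dotted text rows into a CFR -> USC lookup dictionary."""
    crosswalk = {}
    for row in rows:
        match = ROW_PATTERN.search(row)
        if match:
            crosswalk[match.group("cfr")] = match.group("usc")
    return crosswalk

print(parse_crosswalk(RAW_ROWS))
# {'12 CFR 112': '12 U.S.C. 1464', '31 CFR 30': '12 U.S.C. 5221'}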
Documents from both branches reference laws using act names and unique identifiers. Act names, such as the Social Security Act, refer to specific
laws by the names of provisions. The granularity of a citation may fluctuate. Granularity is defined here as the level of specificity of a document.
Is it referenced by a name, a citation to the whole document, or a citation to a part of the document? A citation could refer to a specific provision
or to the beginning of the entire act name.
There were sufficient digital sources, as shown in Figure 4, to build a data collection based on the model. The data sources for the Authority-
Tracker model included the Parallel Table of Authorities, the Code of Federal Regulations, the Federal Register, the United States Code, and the
Statutes-At-Large, as shown in Table 2. Some specific digital sources were not available in file formats that were easy to parse. In particular, key
index tables were not available in structured formats.
Table 2. Sources used to track legislation to regulation
Digital files that are available based on the legislation to regulation model.
METHODS
The intent of the retrieval tool was to track the flow of policy authorization through related agencies and organizations. After determining that all
files were digitally available, the next stage was to build a collection and a retrieval tool. The tool, called AUTHORITY-TRIEVE, retrieved the
chain of authority from legislation through regulation using the Authority-Tracker model. AUTHORITY-TRIEVE was a proof-of-concept tool to
evaluate the feasibility of connecting digital sources. It retrieved the text of a law and the text of its related regulations simultaneously.
Specifically, given any United States Code citation AUTHORITY-TRIEVE would provide the related public law, Statutes-At-Large, Code of
Federal Regulations, and Federal Register citation and associated texts.
All available historic information was downloaded and parsed. Although semi-structured formats such as XML (Extensible Markup Language) are
widely available for some document types, historic data are often only available in text. To increase the breadth of the tool's function, all sources
were reduced to plain text file formats.
This tool is designed so that a United States Code citation retrieves administrative law. In other words, it moves from legislation to regulation but
not in the opposite direction. Future improvements to the tool will enable a searcher to start with any citation and find its related documents.
This would involve parsing the entire Code of Federal Regulations to include notices not related to United States Code citations.
This experience emphasized the need to combine both technical skills and policy knowledge. The tool itself, a MySQL database, required basic
technical skills and was built by a law student as a summer project. This was seen as a promising step because the goal is to make these data
available to social scientists. The team behind the tool design had experience in computer science, political science, and law. All sources were
combined into a series of relational tables in a MySQL database. The tables contained citations and complete texts. In addition, two tables were
created for the codification of citation formats. One key was for the United States Code and the other was for the Code of Federal Regulations.
The key tables clarified citations by parsing them into all possible subdivisions. The ranges of citations were approximated since they occurred irregularly.
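A small relational sketch of the idea follows. It uses SQLite rather than MySQL, and the table names, columns, and sample citations are invented for illustration; the real AUTHORITY-TRIEVE schema is not documented in the chapter. The point is the join: one United States Code citation pulls back its related public law, statute, and CFR citations.

import sqlite3

# Illustrative schema only; the actual tool used MySQL and its own tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_iii (public_law TEXT, statute TEXT, usc_cite TEXT);
CREATE TABLE ptoa      (cfr_cite TEXT, usc_cite TEXT);
INSERT INTO table_iii VALUES ('Public Law 110-343', '122 Stat. 3765', '12 U.S.C. 5221');
INSERT INTO ptoa      VALUES ('31 CFR 30', '12 U.S.C. 5221');
""")

def retrieve(usc_cite):
    """Given a USC citation, return related public law, statute, and CFR citations."""
    query = """
        SELECT t.public_law, t.statute, p.cfr_cite
        FROM table_iii AS t
        LEFT JOIN ptoa AS p ON p.usc_cite = t.usc_cite
        WHERE t.usc_cite = ?"""
    return conn.execute(query, (usc_cite,)).fetchall()

print(retrieve("12 U.S.C. 5221"))
# [('Public Law 110-343', '122 Stat. 3765', '31 CFR 30')]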
The retrieval tool served as an illustration of the interoperability of public laws, the United States Code, the Federal Register, and the Code of
Federal Regulations. It proved that open government data had enough conceptual interoperability to surpass institutional data silos. However,
reducing rich data sources to text to gather longitudinal data highlighted a weakness in existing sources. While it was possible to track legislation
to regulation using only digital files, it was based on the simplest file formats.
AN EXAMPLE: FINANCIAL CRISIS POLICY
The collapse of prominent financial services firms in 2008 initiated the Congressional response to the worldwide financial crisis. On September
29, 2008, the House of Representatives failed to pass a legislative proposal and the stock market dropped, in its worst single-day descent, within 30
minutes of the vote. The policy eventually passed a few days later and was implemented by several federal agencies. What aspects of this story are
visible in the available open data?
The simplest tracking would indicate that The Troubled Assets Relief Program (TARP) was voted into law as H.R. 1424 of the 110th Congress on
October 3, 2008. The program was established through the authority of the “Emergency Economic Stabilization Act of 2008,” Public Law 110-
343, also known as 122 Stat. 3765. Authority to implement the policy was given, in part, to the Securities and Exchange Commission and to the Department of the Treasury. It was established in many Federal Register listings, including at least two parts of the Code of Federal Regulations: 17 CFR 240, Part 240 – General Rules and Regulations, Securities Exchange Act of 1934, and 31 CFR 30, Part 30 – TARP Standards for
Compensation and Corporate Governance. The Troubled Assets Relief Program was in effect from October 3, 2008 through October 3, 2010;
however, some regulation and enforcement provisions, such as executive compensation rules, were still in effect in 2012. The retrieval tool was
able to depict a broad understanding of events and connect legislative documents to executive ones, but it did not fully capture the story
described above.
The legislative history was relatively short but included complex Congressional procedure. First, the most important historic event was not visible
through our model and required extensive knowledge of Congressional procedure to uncover. The vote that sent shock waves through financial
markets on September 29, 2008 was a procedural resolution to H.R. 3997, a different bill than the one that passed. Resolutions are often used for
procedural votes, which determine how more substantive issues are handled. The text of this important legislation was buried as a conference
report to a resolution connected to another bill number. When this three-page proposal became several hundred pages in the final bill, there were
few who could trace its origins.
Second, in order to bring the proposal to a vote quickly in response to the crisis, bills that already existed were used. The contents of existing bills
were substituted in their entirety with the text of the TARP policy. As discussed earlier, replacing the contents of a bill in its entirety is common
practice. H.R.1424, the bill that eventually became the law, was introduced in March 2007 with the title “Paul Wellstone Mental Health and
Addiction Equity Act of 2007.” Because legislators have to vote to change titles, titles often do not change when the contents of bills are
completely substituted with other text. While the bill is referred to as the “Emergency Economic Stabilization Act of 2008,” that name is the title
of Division A of H.R. 1424. The title of H.R. 1424 was never officially changed.
Third, the practice of using act names can vary. The act name can refer to a subsection, a supersection, or the entire law. The granularity of a
name can change depending on which title is used. In this example, the authority could be given under the act name, “Emergency Economic
Stabilization Act of 2008,” or under the program name, “The Troubled Assets Relief Program.” Without structured information, automatically
determining name references requires complex processing although popular name tables do exist.
Finally, granularity is an issue both for act names and for unique identifiers. The most puzzling finding was that unique identifiers could be used
in different ways. The public law for TARP contains three different legislative acts. TARP is one program in one of the acts. Because a Statutes-At-
Large citation is a page number, it can refer to a specific provision or to the beginning of a set of provisions. 122 Stat. 3765 is the beginning of
Public Law 110-343. 122 Stat. 3767 is the beginning of the “Emergency Economic Stabilization Act of 2008.” 122 Stat. 3776 is the policy on
executive compensation. A fully automated system would need to determine whether a reference is to the beginning of a public law or a specific
act or program.
In summary, this case study showed that the retrieval tool can be used to identify general movements of documents between the executive and
legislative branches. The current data do not interconnect failed legislative proposals, which may be important in interpreting current law.
Because of the varieties of use in titles and referencing, it was difficult to confirm that the right items were found without further inspection of the
actual text or knowledge of procedure. Given distinctions in use, tracking policy authorization through sets of organizational documents can be
very difficult to interpret without knowledge of Congressional procedure.
EVALUATION
Our research found that legislation and regulation documents share similar conceptual and technical standards. The building of the retrieval tool
made it possible to evaluate the interoperability of digital material that traces legislation to regulation. Four types of interoperability are
considered in this evaluation: organizational, technical, syntactic, and operational, as shown in Table 3.
Table 3. Evaluation of US executive and legislative branch interoperability
Stable organizational and technical interoperability made it possible to logically track legislation to regulation, yet inconsistent syntactic and operational interoperability remain challenges for computational tracking.
Organizational interoperability was confirmed in the Authority-Tracker model. Both the legislative and executive branches interact by
exchanging documents. A law, which is a final Congressional document agreed upon by the House and Senate, is shared with the executive
branch. The United States Code and Code of Federal Regulations are key transition points between the two branches.
Technical interoperability was confirmed across all organizations. The House and Senate share a data model that is used in the legislative XML
exchange standard. The Government Publishing Office (GPO) publishes important documents shared across organizations in XML file formats,
easy-to-read PDFs that perfectly represent print documents, and text files that can be processed by any computer.
Syntactic interoperability was not consistent. Although there are many structured languages and syntax structures for law, there is no consistency
in use. Sometimes a law is referred to by its name or by the many documents in which it appears. For instance, the Emergency Economic
Stabilization Act of 2008 is also Public Law 110-343 or 122 Stat. 3765. However, the more worrisome inconsistency is the difference in
granularity of the reference. A syntactic reference can be to an overall bill or a specific provision. There is no automated way to identify when
these shifts in granularity occur. Current citation use is adequate for a person with some knowledge of the system but not precise enough for
automatic parsing. Granularity (Blair & Kimbrough, 2002) is a concern for both the names of laws and for parsing citations.
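To make the parsing problem concrete, the following Python sketch (an illustration of the idea, not part of the retrieval tool described in this chapter) pulls three of the citation forms mentioned above out of free text with regular expressions; the dictionary keys and the sample sentence are our own.

```python
import re

# Illustrative sketch only: three citation forms discussed above, matched with
# regular expressions. Real citation practice has far more variants than this.
PATTERNS = {
    "public_law": re.compile(r"Public Law (\d+)-(\d+)"),
    "statutes_at_large": re.compile(r"(\d+)\s+Stat\.\s+(\d+)"),
    "cfr": re.compile(r"(\d+)\s+CFR\s+(?:Part\s+)?(\d+)"),
}

def extract_citations(text):
    """Return (form, captured groups) pairs for every citation-like string found."""
    return [(form, m.groups())
            for form, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]

sample = ("The Emergency Economic Stabilization Act of 2008 is also Public Law 110-343 "
          "or 122 Stat. 3765, implemented in part at 31 CFR Part 30.")
print(extract_citations(sample))
# Nothing in the matched strings says whether 122 Stat. 3765 points to a whole
# public law or to a single provision -- the granularity problem described above.
```

Even when such patterns match reliably, the granularity of each reference still has to be resolved by a human reader or by additional structured data.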
The final type of interoperability was operational compliance. The Code of Federal Regulations uses a mix of act names, United States Code
citations, and public law numbers, making it difficult to consistently track back to the legislative branch. It is not clear that conventions for citing
laws have been mandated across organizations as a standard. One exception is the legislative XML exchange standard (xml.house.gov), which
describes a canonical set of references. However, legislative proposals can reference existing law as either an act name or a law citation. Multiple
conventions for referencing law make sense to knowledgeable human readers but impede automated computational analysis.
In summary, organizational processes and technical standards are well coordinated. However, things begin to fall apart with the syntax used to
reference laws. Even if a specific syntax is used, the level of granularity may cause misinterpretations in automatic processing. Overall the case
study found stable organizational and technical interoperability but inconsistent syntactic and operational interoperability.
RESULTS
The evidence suggests that open government data does not yet contain enough information in and of itself to completely track governance
patterns. The translation from legislation to regulation includes two crosswalk tables that are available in digital form. However, they require
additional work to process into machine-readable structure. What was available for the case study is only accessible to sophisticated data
customers proficient in computational skills and government procedure. However, that leaves open many possible future directions for
computational research using open government data.
Policy collections may prove interesting for combining research in natural language processing, human-computer interaction, and information
retrieval. The unique features of policy documents challenge assumptions about the uniformity of texts. For instance, titles used in legislation
may not be associated with the meaning of the text, and titles of policy documents may contain completely unrelated information due to
procedural complexities. What innovative interface designs could support this complexity? How can topic maps (Blei, 2012) help to identify and
differentiate policy ideas across time?
Government information is an abundant resource of spoken, written, and visual material that can advance computational science. The ability to
connect unstructured information using “big data” computational methods enables longitudinal studies on extensive natural language sources.
Machine learning scholars also may be interested in these data because they are tied to defined outcomes that may be used for predictive
analytics. Those interested in natural language can compare the differences between video and audio of fully transcribed meetings without
licensing restrictions.
These opportunities have in many ways been met by commercial legal publishers. With the expectations of profits, commercial vendors have
added value by building tools that interpret government data. Open government data efforts have primarily emphasized releasing increasing
amounts of information. Perhaps future transparency efforts could focus on how information is combined and not just what information is
produced.
This paper reports on the first stage of Poli-Informatics, an interdisciplinary funded effort to investigate the research potential of open
government data. These collections require some expert knowledge to interpret. Only through interdisciplinary collaborations can we find the
best balance for learning from these documents.
Connecting between US legislation and regulation has the appearance of interoperability, but the operational differences are still a hindrance.
Future researchers could continue to explore the challenge of building models for interoperability between two branches of government. Large-
scale technical infrastructure is essential especially within organizations like the European Union (Charalabidis & Askounis, 2008; Charalabidis,
Lampathaki & Psarras, 2009; European Commission, 2009), however, work on the logical interconnections is still needed. Legislative open data
efforts currently take place at a distance from executive efforts (Janssen, Chun & Gil-Garcia 2009; Lewis, 1995). Ideally, we need approaches that
consider both how laws are made and how they are used by lawyers, incorporating scholarship on artificial intelligence, law and legal information
systems. We need to understand how these two distinct ecosystems fit together to gain a holistic perspective on how government, overall, is
doing.
The transition from eGovernment interoperability to big data will have to confront the legacy of existing documents. Government documents are
carefully constructed artifacts representing detailed public sector work. The historic format and use of these documents is not built to withstand
the level of granularity expected for fully reliable machine-readable processing.
The documents that were critical to build connections between legislation and regulation are remnants of the print era. Both Table III of the
United States Code and the Parallel Table of Authorities of the Code of Federal Regulations are indexes. An index was the primary form of
interoperability between print documents. Often considered as an alphabetical list, an index is a table that matches a location with a concept
(Anderson, 1985). Conceptual analysis is the result of the often overlooked intellectual work of information infrastructure (Bowker, 1996; Star,
1999). An index can organize the content within a collection of any size. Some collections are so complex that they require multiple indexes to
provide sufficient access to different conceptual arrangements. For instance, the United States Code is a subject-ordered index, while the
Statutes-At-Large is a date-ordered index. An index can be a critical crosswalk between different systems of thinking (Star & Griesemer, 1989).
Different systems of thinking are more pronounced at scale and across competing modes of organizing (Baker & Bowker, 2007). This is
particularly true for government indexes that connect the output of multiple organizations.
What can a simple index do in comparison to advanced computational tools such as algorithmic analytics, personalized retrieval, and precision
Internet search engines? The current indexes were developed when page space was limited. They are more like a compass than exact directions.
The 2013 Parallel Table of Authorities connects 1 USC 112 to 1 CFR Part 2. The United States Code citation is about 200 words long, while the
Code of Federal Regulations citation is closer to 1000 words. Perhaps a specific Code of Federal Regulations paragraph could be connected to the
smallest unit in the United States Code. While there are limits to human-generated indexes, the connections between the print indexes have been
highly stable over decades. They will be a crucial step towards either generating or checking the accuracy of advanced computational techniques
(Bookstein & Swanson, 1976; Salminen, Tague-Sutcliffe & McClellan, 1995; Willis & Losee, 2013). In an advanced indexing system, granularity
(Blair & Kimbrough, 2002; Hearst, 2009) will be an additional point of access.
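As a sketch of how such an index already behaves like machine-readable interoperability, the fragment below treats a slice of the Parallel Table of Authorities as a simple lookup table in Python; only the 1 USC 112 to 1 CFR Part 2 pair is taken from the paragraph above, and the rest of the fragment is assumed for illustration.

```python
# A toy crosswalk in the spirit of the Parallel Table of Authorities: each
# United States Code citation maps to the Code of Federal Regulations parts
# that cite it as authority. Only the 1 USC 112 -> 1 CFR Part 2 pair comes
# from the text above; everything else here is illustrative.
parallel_table = {
    "1 USC 112": ["1 CFR Part 2"],
    # further rows would be loaded from the published table ...
}

def regulations_for(usc_citation):
    """Follow the index from a statute citation to the regulations built on it."""
    return parallel_table.get(usc_citation, [])

print(regulations_for("1 USC 112"))   # -> ['1 CFR Part 2']
print(regulations_for("5 USC 552"))   # -> [] (no entry in this toy table)
```

A compass of this kind only points from one collection to another; the finer-grained connection between a specific regulatory paragraph and the smallest unit of the statute still has to be supplied by either indexers or computational methods.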
DISCUSSION
Future research may consider how to use advanced search interface designs to retrieve specific targets and related adjacencies within a large
document.
Some big data analysts will abandon all retrieval tools in favor of text mining tools. The debate about searching or browsing is a classic concern
(Bates, 1989; Speier & Morris, 2003; Van Noortwijk, Visser & Mulder, 2006). Search tools locate a specific known item. Browse tools locate
related groups of items. Brown and Duguid (2002) consider that direct search can sometimes lead to tunnel vision, where possibilities are overly
restricted. Consider the conundrum of searching within the document called the “Federal Register” for the organization that creates it. Because
organizations are listed without designations, the “Office of the Federal Register” is listed as the “Federal Register.” Only through a dedicated
index, not search, is it possible to locate recent rule-making activity from the Office of the Federal Register in the Federal Register. In this
example, a browse produced more successful results than a search. Assuming that an algorithm always brings clarity and simplicity can mask hidden
nuance and subtlety (Diakopoulos, 2014). An index holds layers of intellectual work that are still useful as machine-readable retrieval systems are
perfected.
Government information is driven by processes. The identifiers, documents, and other structures are unique to internal processes within the
organizations that produce them. What tools are available to support operational compliance? Internal mechanisms could identify and
standardize consistent use within an organization. As e-science initiatives establish elaborate communication schemes (Borgman, Bowker,
Finholt & Wallis, 2009), eGovernment could develop mechanisms for cross-organization use. Without internal consistency or explicit translation
tools, the interpretative complexity increases with machine-generated connections between organizations.
The intent of any index is to build stable connections. Big data collections rely on stable information for interpretation. Government indexes
provide potential sources for understanding the design, procedures, and intent behind document collections. While adding convenience for
human readers, an index also builds connections for machine-readable interoperability. Despite the limits to both human-generated and
algorithmic methods, index publications will serve as pivotal guides to the next generation of tools for interoperable government data.
CONCLUSION
The promise of eGovernment interoperability has been partially reached with the necessary first step of releasing digital files to the public.
Meanwhile, there are many opportunities for scholars who are interested in using “big data” computational methods on unstructured data. This
case study of tracking policy authority from legislation to regulation contributed an understanding of the current state of interoperability for US
federal open government data. Data can cross boundaries, but operational compliance between organizations needs to be resolved through new
standards or dynamic translation tools. Complete interoperability will be possible when files in presentation formats are available in machine-
readable formats. The interoperability of information between two US federal branches exemplifies remaining socio-technical challenges in
achieving the integration of machine-readable government data.
This work was previously published in Managing Big Data Integration in the Public Sector edited by Anil Aggarwal, pages 119, copyright
year 2016 by Information Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
This project was supported by U.S. National Science Foundation Grant No. 1243917.
REFERENCES
Agarwal, R., & Dhar, V. (2014). Editorial—Big Data, Data Science, and Analytics: The Opportunity and Challenge for IS Research.Information
Systems Research , 25(3), 443–448. doi:10.1287/isre.2014.0546
Anderson, J. D. (1985). Indexing systems: Extensions of the mind's organizing power . In Ruben, B. D. (Ed.), Information and behavior (pp. 287–
323). New Brunswick, NJ: Transaction Books.
Arnold-Moore, T. J., Anderson, P., & Sacks-Davis, R. (1997). Managing a digital library of legislation. In ACM/IEEECS Joint Conference on
Digital Libraries (pp. 175–183). ACM/IEEE.
Baker, K. S., & Bowker, G. C. (2007). Information ecology: Open system environment for data, memories, and knowing. Journal of
Intelligent Information Systems, 29(1), 127–144. doi:10.1007/s10844-006-0035-7
Bates, M. J. (1989). The design of browsing and berrypicking techniques for the online search interface. Online Review , 13(5), 407–424.
doi:10.1108/eb024320
Blair, D. C., & Kimbrough, S. O. (2002). Exemplary documents: A foundation for information retrieval design. Information Processing &
Management , 38(3), 363–379. doi:10.1016/S0306-4573(01)00027-9
Bookstein, A., & Swanson, D. R. (1976). Probabilistic models for automatic indexing. Journal of the American Society for Information
Science, 25(5), 312–318. doi:10.1002/asi.4630250505
Borgman C. L. Bowker G. C. Finholt T. A. Wallis J. C. (2009). Towards a virtual organization for data cyberinfrastructure. In Proceedings of the
9th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 353–356). New York: ACM Press. 10.1145/1555400.1555459
Bowker, G. C. (1996). The history of information infrastructures: The case of the international classification of diseases. Information Processing &
Management , 32(1), 49–61. doi:10.1016/0306-4573(95)00049-M
Brown, J. S., & Duguid, P. (2002). The social life of information . Boston: Harvard Business School Press.
Buckley Owen, B., Cooke, L., & Matthews, G. (2012). Information Policymaking in the United Kingdom: The Role of the Information
Professional. Journal of Information Policy , 2(0).
Charalabidis, Y., & Askounis, D. (2008). Interoperability registries in eGovernment. In Hawaii International Conference on System
Sciences, Proceedings of the 41st Annual (pp. 195–195). IEEE.
Charalabidis, Y., Lampathaki, F., & Psarras, J. (2009). Combination of interoperability registries with process and data management tools for
governmental services transformation. In System Sciences, 2009, HICSS’09, 42nd Hawaii International Conference on (pp. 1–10). IEEE.
Chen, H., Chiang, R., & Storey, V. (2012). Business Intelligence and Analytics: From Big Data to Big Impact. Management Information Systems
Quarterly , 36(4), 1165–1188.
Dawes, S. S. (1996). Interagency Information Sharing: Expected Benefits, Manageable Risks. Journal of Policy Analysis and Management , 15(3),
377–394.
DeNardis, L. (2010). E-Governance Policies for Interoperability and Open Standards. Policy & Internet , 2(3), 129–164. doi:10.2202/1944-
2866.1060
Dhar, V. (2013). Data Science and Prediction. Communications of the ACM , 56(12), 64–73. doi:10.1145/2500499
Domingos, P. (2012). A Few Useful Things to Know About Machine Learning. Communications of the ACM , 55(10), 78–87.
doi:10.1145/2347736.2347755
Goes, P. (2014). Editor’s Comments: Big Data and IS Research. Management Information Systems Quarterly, 38(3), iii–viii.
Halevy, A., Norvig, P., & Pereira, F. (2009). The Unreasonable Effectiveness of Data. IEEE Intelligent Systems , 24(2), 8–12.
doi:10.1109/MIS.2009.36
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining . Cambridge, MA: MIT Press.
Hearst, M. A. (2009). Search user interfaces . New York: Cambridge University Press. doi:10.1017/CBO9781139644082
Janssen, M., Chun, S. A., & Gil-Garcia, J. R. (2009). Building the next generation of digital government infrastructures. Government Information
Quarterly , 26(2), 233–237. doi:10.1016/j.giq.2008.12.006
Kim, G.-H., Trimi, S., & Chung, J.-H. (2014). Big-data Applications in the Government Sector. Communications of the ACM , 57(3), 78–85.
doi:10.1145/2500873
Krehbiel, K. (1991). Information and legislative organization . Ann Arbor, MI: University of Michigan Press.
Kubicek, H., Cimander, R., & Scholl, H. J. (2011). Organizational Interoperability in E-Government: Lessons from 77 European Good-Practice
Cases . Springer. doi:10.1007/978-3-642-22502-4
Lewis, J. R. T. (1995). Reinventing (open) government: State and federal trends. Government Information Quarterly , 12(4), 427–455.
doi:10.1016/0740-624X(95)90078-0
Liebwald D. (2013). Vagueness in law: A stimulus for ‘artificial intelligence & law’. In Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Law (pp. 207–211). 10.1145/2514601.2514628
Marchionini, G., Haas, S., Plaisant, C., & Shneiderman, B. (2006). Integrating data and interfaces to enhance understanding of government
statistics: Toward the national statistical knowledge network project briefing. In International Conference on Digital Government Research
Dg.o (pp. 334–335). Academic Press.
Matthijssen, L. (1998). A task-based interface to legal databases. Artificial Intelligence and Law, 6(1), 81–103. doi:10.1023/A:1008291611892
Pardo, T. A., Nam, T., & Burke, G. B. (2012). E-Government Interoperability Interaction of Policy, Management, and Technology
Dimensions. Social Science Computer Review , 30(1), 7–23. doi:10.1177/0894439310392184
Polsby, N. W. (1968). The Institutionalization of the U.S. House of Representatives. The American Political Science Review , 62(1), 144–168.
doi:10.2307/1953331
Quaadgras, A., Ross, J. W., & Beath, C. M. (2013). You May Not Need Big Data After All. Harvard Business Review .
Relyea, H. C. (1989). Historical Development of Federal Information Policy . In Mcclure, C. R., & Hernon, P. (Eds.), United States Government
Information Policies (pp. 25–48). ABLEX publishing.
Salminen, A., Tague-Sutcliffe, J., & McClellan, C. (1995). From text to hypertext by indexing. ACM Transactions on Information Systems , 13(1),
69–99. doi:10.1145/195705.195717
Shmueli, G., & Koppius, O. R. (2011). Predictive Analytics in Information Systems Research. Management Information Systems Quarterly , 35(3),
553–572.
Speier, C., & Morris, M. G. (2003). The influence of query interface design on decision-making performance. Management Information Systems
Quarterly , 27(3), 397.
Star, S. L. (1999). The Ethnography of Infrastructure. The American Behavioral Scientist , 43(3), 377–391. doi:10.1177/00027649921955326
Star, S. L., & Griesemer, J. R. (1989). Institutional Ecology, “Translations” and Boundary Objects. Social Studies of Science, 19(3), 387–420.
doi:10.1177/030631289019003001
Sundararajan, A., Provost, F., Oestreicher-Singer, G., & Aral, S. (2013). Information in Digital, Economic, and Social Networks. Information
Systems Research , 24(4), 883–905. doi:10.1287/isre.1120.0472
UNESCO. (1975). Intergovernmental Conference on the Planning of National Documentation, NATIS, national information systems: COM
74/NATIS/3 Rev. Paris: UNESCO.
Van der Aalst, W. M. P., & Kumar, A. (2003). XML-Based Schema Definition for Support of Interorganizational Workflow. Information Systems
Research , 14(1), 23–46. doi:10.1287/isre.14.1.23.14768
Van Noortwijk, K., Visser, J., & Mulder, R. V. D. (2006). Ranking and Classifying Legal Documents using Conceptual Information. The Journal of
Information Law and Technology , 2006(1).
Washington, A. L. (2014). Government Information Policy in the Era of Big Data. Review of Policy Research , 31(4), 319–325.
doi:10.1111/ropr.12081
Washington, A. L., & Griffith, J. C. (2007). Legislative Information Websites: Designing Beyond Transparency. In A. R. Lodder & L. Mommers
(Eds.), Legal Knowledge and Information Systems JURIX 2007 The Twentieth Annual Conference (p. 192). JURIX.
Willis, C., & Losee, R. M. (2013). A random walk on an ontology: Using thesaurus structure for automatic subject indexing. Journal of the
American Society for Information Science and Technology ,64(7), 1330–1344. doi:10.1002/asi.22853
Wilson E. (1988). Integrated information retrieval for law in a hypertext environment. In Annual International ACM SIGIR Conference On
Research And Development In Information Retrieval (pp. 663–677). 10.1145/62437.62505
CHAPTER 12
Big Data and Web Intelligence:
Improving the Efficiency on Decision Making Process via BDD
Alberto Pliego
Escuela Técnica Superior de Ingenieros Industriales, Spain
Fausto Pedro García Márquez
Escuela Técnica Superior de Ingenieros Industriales, Spain
ABSTRACT
The growing amount of available data generates complex problems when the data need to be treated. These data usually come from different sources and describe different issues; however, on many occasions they can be interrelated in order to gather strategic information that is useful for Decision Making processes in a multitude of businesses. For a qualitative and quantitative analysis of a complex Decision Making process, it is critical to employ a correct method because of the large number of operations required. With this purpose, this chapter presents an approach that applies the Binary Decision Diagram to the Logical Decision Tree. It allows a Main Problem to be addressed by establishing its different causes, called Basic Causes, and their interrelations. Cases with a large number of Basic Causes generate important computational costs because this is an NP-hard problem. Moreover, this chapter presents a new approach for analysing large Logical Decision Trees. The size of the Logical Decision Tree is not the only factor that affects the computational cost; the resolution procedure (ordering of Basic Causes, number of AND/OR gates, etc.) can also vary this cost widely. A new approach to reduce the complexity of the problem is hereby presented. It makes use of data derived from simpler problems that require lower computational costs to obtain a good solution. An exact solution is not provided by this method, but the approximations achieved have a low deviation from the exact one.
INTRODUCTION
Information and communication technologies (ICT) have grown at an unprecedented pace, and all aspects of human life have been transformed under this new scenario. All industrial sectors have rapidly incorporated the new technologies, and some of them have become de facto standards, like supervisory control and data acquisition (SCADA) systems. Huge amounts of data started to be created, processed and saved, allowing
an automatic control of complex industrial systems. In spite of this progress, there are some challenges not well addressed yet. Some of them are:
the analysis of tons of data, as well as continuous data streams; the integration of data in different formats coming from different sources; making
sense of data to support decision making; and getting results in short periods of time. These all are characteristics of a problem that should be
addressed through a big data approach.
Even though Big Data has become one of the most popular buzzwords, the industry has converged on a definition of the term based on three dimensions: volume, variety and velocity (Zikopoulos and Eaton, 2011).
Data volume is normally measured by the quantity of raw transactions, events, or the amount of history that creates it. Typically, data analysis algorithms have used smaller data sets, called training sets, to create predictive models. Most of the time, businesses use predictive insights that are quite coarse, since the data volume has purposely been reduced according to storage and computational processing constraints. By removing the data volume constraint and using larger data sets, it is possible to discover subtle patterns that can lead to targeted actionable decisions, or that enable further analysis to increase the accuracy of the predictive models.
Data variety came into existence over the past couple of decades, when data has increasingly become unstructured as the sources of data have
proliferated beyond operational applications. In industrial applications, such variety emerged from the proliferation of multiple types of sensors,
which enable the tracking of multiple variables in almost every domain in the world. The most relevant technical factors include the sampling rate of the data and their relative range of values.
Data velocity is about the speed at which data is created, accumulated, ingested, and processed. An increasing number of applications are
required to process information in real time or with near real-time responses. This may imply that data is processed on the fly, as it is ingested, to make real-time decisions or to schedule the appropriate tasks.
However, as other authors point out, Big Data can also be classified according to other dimensions, such as veracity, validity and volatility.
Data veracity is about the certainty of data meaning. This feature expresses whether the data properly reflect reality or not. It depends on the way in which data are collected and is strongly linked to the credibility of the sources. For example, the veracity of data collected from sensors depends on the calibration of those sensors. Data collected from surveys can be truthful if the survey samples are large enough to provide a sufficient basis for analysis. In summary, the massive amounts of data collected for Big Data purposes can lead to statistical errors and misinterpretation of the collected information. Purity of the information is critical for value (Ohlhorst, 2013).
Data validity is about the accuracy of the data. Big Data sources must be accurate if the results are to be used for decision making or any other reasonable purpose (Hurwitz et al., 2013).
Data volatility is about how long the data need to be stored. Some difficulties can appear due to limited storage capacity: if storage is limited, it must be decided which data need to be kept and for how long. With some Big Data sources, it might be necessary to gather the data only for a quick analysis (Hurwitz et al., 2013).
These data are often used for decision making (DM). DM processes are carried out continuously by any firm in order to maximize profits, reliability, etc., or to minimize costs, risks, etc. There is software to facilitate this task, but the main problem is the capability of providing a quantitative solution when the case study has a large number of Basic Causes (BCs). The DM problem is considered a cyclic process in which the decision maker can evaluate the
consequences of a previous decision. Figure 1 shows the normal process to solve a DM problem.
Decision Trees (DTs) are one of the most commonly used tools for depicting a business issue and encompass all the alternatives involved.
Starting from the background given in (Lopez, 1977), the Logical Decision Tree (LDT) is introduced. It provides an alternative method for depicting a DM issue, including the interrelation between every single BC.
These LDTs graphically describe what the roots of a certain problem are, as well as their interrelations. With this purpose, and in accordance with (Mallo, 1995), the logical operators ‘AND’ and ‘OR’ are introduced in order to encompass a wider spectrum when analysing DM. Such logical operators allow a better comprehension of the problem itself and fully establish the necessary background for the subsequent conversion from LDT to Binary Decision Diagram (BDD). These lines seek to demonstrate that LDTs are perfectly suitable for depicting a certain condition in a DM context with a large amount of data and interrelations.
There are several ways to depict a particular situation or issue, but the LDT has been chosen in this chapter. The advantage of representing a business issue as an LDT is that it can be grasped at a glance. Thereby, a reduction in the resources needed by the business is achieved, with the consequent saving of time and money.
Indeed, managers must be very careful when building the LDT because the BCs must be the minimal and necessary causes that lead to a problem. Thus, for instance, BCs such as “Limited tools” or “Low tools reliability” are the minimal and necessary causes that lead to a “Wrong tools’ stock”.
It is important to note that the causes of the Main Problem (MP) must be mutually independent. The possibility of analysing the main issues of a business with a high number of BCs involved is introduced here, since companies have to deal with quite tough difficulties when too many BCs are present. A DM problem, regardless of its magnitude, is evaluated through an LDT and its subsequent conversion to a BDD.
This chapter considers a new approach based on breaking down the problem into the different causes that could lead to non-desired situations. This disaggregation determines the number of BCs and also identifies the manner in which all these BCs are logically interrelated. With this purpose, the LDT is introduced as an alternative method to draw a DM problem, considering the interrelation between the BCs (Lopez, 1977) and taking into account the logical operators ‘AND’ and ‘OR’ (Mallo, 1995). The Appendix shows an LDT case study composed by:
It is shown only with OR gates because its topology will be changed in the experiments in order to analyse different scenarios that will provide different solutions.
LDT TO BDD CONVERSION
The conversion of an LDT to a BDD provides some advantages in terms of efficiency and accuracy; see (Lee, 1959), (Akers, 1978), (Moret, 1982) and (Bryant, 1986). The BDD helps to express the occurrence of a serious issue in the business in a disjoint form, which provides an advantage from the computational point of view.
A BDD is a directed acyclic graph representation of a Boolean function in which equivalent Boolean sub-expressions are uniquely represented. A directed acyclic graph is a directed graph with no cycles, i.e. for each vertex v there is no directed path that starts and finishes in v. The diagram is composed of interconnected nodes, each with two outgoing branches, and each node is either a terminal or a non-terminal vertex. The BDD is a graph-based data structure from which the occurrence probability of a certain problem in a DM context can be obtained. Each single variable has two branches: the 0-branch corresponds to the cases where the variable is 0, and the 1-branch corresponds to the cases where the event occurs, i.e. where the variable is 1.
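A minimal Python sketch of the structure just described (our own illustration, not the software used in this chapter), with one non-terminal vertex per tested BC and the constants 0 and 1 as terminal vertices:

```python
from dataclasses import dataclass

# Minimal sketch: a non-terminal vertex tests one Basic Cause and has two
# branches; the terminal vertices are simply the constants 0 and 1.
@dataclass
class Node:
    var: int     # index of the Basic Cause tested at this vertex
    low: object  # 0-branch: sub-diagram followed when the BC does not occur
    high: object # 1-branch: sub-diagram followed when the BC occurs

def evaluate(node, assignment):
    """Walk the diagram under a dict {BC index: 0 or 1}; return the terminal 0 or 1."""
    while isinstance(node, Node):
        node = node.high if assignment[node.var] else node.low
    return node

# Example: the function "BC1 AND BC2" written directly as a two-node diagram.
f = Node(var=1, low=0, high=Node(var=2, low=0, high=1))
print(evaluate(f, {1: 1, 2: 1}))  # -> 1
print(evaluate(f, {1: 1, 2: 0}))  # -> 0
```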
The transformation from the LDT to the BDD is achieved by applying some mathematical algorithms. The ite (If-Then-Else) conditional expression is one of the BDD’s cornerstones (Artigao, 2009); see Figure 2.
Figure 2 is read as: “If the variable BCi occurs, Then f1, Else f2”. The solid line always corresponds to the ones (the 1-branch) and the dashed lines to the zeroes (the 0-branch). Taking Shannon’s theorem into account, the following expression is obtained:
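The expression itself is not reproduced in this copy; assuming the standard Shannon decomposition that the ite operator encodes (our reconstruction, not a formula quoted from the chapter), it takes the form:

```latex
f = \mathrm{ite}(BC_i, f_1, f_2) = BC_i \cdot f_1 + \overline{BC_i} \cdot f_2
```

where the first term corresponds to the 1-branch and the second to the 0-branch of Figure 2.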
Importance of a Right Variable Order
In the transformation from LDT to BDD it is necessary to establish a correct ranking of the BCs. A strong dependence on the variable ordering is one of the most tangible drawbacks of BDDs. In fact, both the BDD size and the CPU runtime depend heavily on the variable ordering, and a poor variable ordering results in poor computational efficiency.
BDD variable ordering has been widely treated over the last decades, and there is plenty of related literature. Nevertheless, there is currently no method capable of providing the best ordering for every case. Heuristic methods have therefore been used with the aim of achieving the best possible variable ordering. The software used in this chapter applies five different methods to obtain a suitable variable ordering. These methods are well known in the BDD environment and may be described as follows:
• Topological Heuristic Methods: The difference between the topological methods and the Level and AND methods is that the topological ones are the simplest way to read the tree (Artigao, 2009); only a few simple rules must be followed to rank the BCs. A small sketch of these traversals is given after this list.
o TopDownLeftRight (TDLR): The tree is read as the method’s name says, i.e. from the top to the bottom and from the left to the right. A list of BCs is thereby created in which the most important BCs are those encountered first, starting at the top of the tree and following the reading order just described.
o DepthFirstSearch (DFS): The tree is read from top to bottom, starting at its left. The sub-trees are read in the same way.
o BreadthFirstSearch (BFS): The tree is read level by level from left to right, and the variable ordering depends on the place in which the BCs appear while following these steps. It must be noted that if a repeated BC is found while reading the tree, it is ignored, regardless of the method used.
• Level Method: This type of method is not as simple and direct. It distinguishes between the BCs depending on their position in the tree, that is, the number of gates above each BC. There are some important points to keep in mind: repeated events are given greater importance when several BCs are at the same level, and, in cases where several BCs lie below the same logic gates and are located at the same level, those that appear first are given greater importance.
• AND Method: This method takes into account the number of AND logic gates in the path from a given BC to the MP. The fewer the AND logic gates, the greater the importance assigned to the BC. It is fundamentally based on the idea that BCs under an AND logic gate are less important than those below OR logic gates (Artigao, 2009).
• Weight Ordering Method: Several considerations must be taken into account when applying this method:
o The path followed by each BC to the MP, i.e. the path defined by the different logic gates that must be crossed to reach the MP, defines the importance of each BC by multiplying the gate weights along it.
o The weighting assigned to each logic gate relies on two main causes:
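As a toy illustration of the breadth-first reading described in the list above (the tree layout and BC labels are invented for this sketch and are not the chapter's case study), note how repeated BCs are ignored, exactly as the heuristics require:

```python
from collections import deque

# Toy LDT: each gate is (gate_type, children); leaves are BC labels (strings).
# The tree and its labels are invented for illustration only.
ldt = ("OR", [
    ("AND", ["BC1", "BC2"]),
    ("OR",  ["BC2", "BC3", ("AND", ["BC4", "BC1"])]),
])

def bfs_order(tree):
    """Breadth-first (level-by-level, left-to-right) ranking of the Basic Causes,
    skipping any BC that has already been seen."""
    order, seen = [], set()
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):           # a Basic Cause
            if node not in seen:
                seen.add(node)
                order.append(node)
        else:                                # a gate: enqueue its children in order
            _, children = node
            queue.extend(children)
    return order

print(bfs_order(ldt))  # -> ['BC1', 'BC2', 'BC3', 'BC4']
```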
Furthermore, a recently published method has been shown to give very good results when large trees must be solved (García Márquez, 2012). The method associates a weight with each logic gate. It considers that only one of the 2^n possible states is propagated whenever the gate found is an AND gate. Thus,
On the contrary, if the gate found is an OR gate, it considers that only one of the 2^n possible states is not propagated. Therefore,
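The two weight expressions do not survive in this reproduction. Assuming they follow directly from the description above (one of the 2^n input states propagates through an n-input AND gate, and all but one propagate through an OR gate), they would read:

```latex
w_{\mathrm{AND}} = \frac{1}{2^{n}}, \qquad w_{\mathrm{OR}} = \frac{2^{n}-1}{2^{n}}
```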
Further detailed information about the conversion and variable ordering methods can be found in (García Márquez, 2012). In this chapter only
AND and OR gates are used in the LDTs presented to express the interrelation between the BCs. Figure 3 shows the conversion from LDT to its
corresponding BDD using the following order for BCs:
Once the conversion from DT to BDD is done, it is possible to obtain an accurate expression of the probability of occurrence of the MP by
assigning a probability value to each BC.
Cut-Sets and the Analytical Expression from the BDD
Cut-sets (CSs) are an important concept when dealing with BDDs. They are the paths “from the top to the ones”, i.e. from the root node to the terminal nodes with value 1, and they provide significant information because the probability of occurrence of the MP can be obtained from them. The following CSs have been obtained from the BDD in Figure 3.
The MP probability can be obtained because the different paths (CSs) are mutually exclusive, so it may be expressed as the sum of the probabilities of all the BDD paths, i.e. an analytic expression consisting of the sum of the analytic expressions of the individual CSs. This expression represents the utility function in the DM process.
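A minimal Python sketch of this sum of disjoint paths (the cut-sets and probability values below are invented, not those of the chapter's example):

```python
from math import prod

# Each cut-set is one BDD path given as (BC index, branch) pairs, and the MP
# probability is the sum of the products along the disjoint paths.
p = {1: 0.10, 2: 0.20, 3: 0.05}          # assumed occurrence probability of each BC

cut_sets = [
    [(1, 1)],                             # path where BC1 occurs
    [(1, 0), (2, 1), (3, 1)],             # path where BC1 is absent, BC2 and BC3 occur
]

def path_probability(path):
    """Multiply p_i along the 1-branches and (1 - p_i) along the 0-branches."""
    return prod(p[i] if branch else 1.0 - p[i] for i, branch in path)

mp_probability = sum(path_probability(path) for path in cut_sets)
print(round(mp_probability, 4))           # -> 0.109
```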
Example
A reduced example is proposed here in order to give a better notion of the Importance Measures (IMs). It provides the background for the results given in the real case study. The LDT in Figure 4 has been chosen for this purpose.
The following mathematical expression completely defines the logical function of the LDT:
The variable ordering generated with the BFS method produces the lowest number of CSs, with the following ranking:
Thus, the CSs produced are:
With all this information, once the CSs are obtained, the MP occurrence probability can be computed by using a software algorithm. In this brief example the MP probability can even be hand-calculated. Nonetheless, in a real case, where many BCs are involved, the software is strongly needed in order to obtain both accurate and fast calculations.
Let us denote by W a set of n probabilities associated with the n BCs. Thus, the probabilities of the BCs are gathered together into the vector W:
These equations show that the greatest weights have been given to BCs six, four and seven, respectively. Nevertheless, it will be shown that this fact does not mean that they are the most significant BCs of the LDT. By using the vector W it is now possible to obtain the probability of occurrence of the MP, which is as follows:
Benefits and Drawbacks
The LDT has been depicted and its conversion to a BDD has been carried out. The previous sections have stated the numerous advantages provided by this conversion. Besides the advantages already mentioned, the equivalent BDD obtained from the LDT provides the necessary basis for obtaining a ranking of the most important BCs over the whole LDT.
When a business seeks to improve a department, or an inner issue within a certain department, numerous causes related to the departments involved are found. Under the most favourable conditions there will be several tens of BCs leading to the MP. Nonetheless, and specifically in large companies, the regular scenario is to face hundreds or even thousands of BCs. In other words, this introduces a challenging scenario in computational terms because handling such data is not straightforward.
The main reason for converting the LDT to a BDD in this chapter, and for the subsequent BDD data handling, is the ability to deal with hundreds or thousands of BCs with remarkable computational time and resource management. Moreover, the CSs obtained from the BDD seem to be the feasible way to compute the Importance Measures in a reasonable computational time. Once the conversion from LDT to BDD is done and the CSs are obtained, it is possible to compute the Importance Measures. In fact, the CSs, as well as the probability associated with each BC, provide the data needed for the Importance Measures described afterwards. Exact calculations of the MP probability are carried out, and approximations are not needed when using BDDs, which turns out to be one of the biggest advantages of BDDs.
Table 1 shows some other reasons that have influenced the final decision to use BDDs as a formal solution to the needs of this project.
Table 1. Benefits and drawbacks
LDT BDD
The ability of BDDs to handle all kinds of trees, regardless of whether they are small, medium or large, is one of their biggest advantages and has made them particularly appropriate. It has been shown that when large trees must be faced there can be some issues, and different techniques exist to address them. For instance, a technique successfully used is to split the tree into smaller ones, so that the software can process each small part and then combine all the obtained results (Artigao, 2009). Fairly good results have been achieved.
BDDs make it possible to calculate the occurrence probability of an MP in a business given a certain LDT. BDD-based algorithms do not use approximation techniques, such as truncation, to calculate the occurrence probability of an MP. Nonetheless, BDDs can have extremely high time and memory consumption, chiefly when many BCs are involved. At this point, and as previously stated, particular emphasis must be placed on the variable ordering. The BDD arises when a low computational time and reliable results are sought for solving LDT problems, and thanks to BDDs it is possible to achieve a good solution in an efficient and effective manner, provided the variable ordering is handled with special care.
NEW APPROACH TO REDUCE THE COMPUTATIONAL COST
The DM problem described is an NP-hard problem and, therefore, for a large number of BCs or a complex topology, it may not be feasible to find an exact solution in a reasonable time. This chapter presents a novel approach for finding a good solution while minimizing the computational cost. This approach is based on the logical gates, especially the AND gates, their number and their position in the tree (level), and their effects on the solution and on the computational cost of the system. The reference, or experimental, solutions are obtained from simple systems and are then extrapolated to complex systems via polynomial regression functions. These functions are set according to the reference solutions and become more precise as more reference solutions are available.
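A brief Python sketch of this estimation idea (the reference values below are invented, and numpy.polyfit is just one convenient way to fit the polynomial, not necessarily the tool used by the authors):

```python
import numpy as np

# Fit a low-degree polynomial to the MP probabilities computed exactly for
# small configurations, then predict the configurations that are too costly
# to solve exactly. All numbers here are invented.
and_gates_ref = np.array([0, 1, 2, 3])               # configurations solved exactly
prob_ref = np.array([0.37, 0.34, 0.31, 0.27])        # reference MP probabilities

degree = min(2, len(and_gates_ref) - 1)              # degree limited by the data available
estimate = np.poly1d(np.polyfit(and_gates_ref, prob_ref, degree))

for n_and in (4, 5):                                 # configurations only estimated
    print(n_and, round(float(estimate(n_and)), 4))
```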
Table 2 shows the probabilities and CSs for different LDT case studies. The probability of occurrence of the MP and the CSs are obtained for different numbers of AND gates at each level.
The LDT has been calculated exactly for the cases marked in black in Table 2; the values in red are estimated results. The estimations have been obtained through polynomial expressions, where the polynomial degree depends on the number of experimental points available. The experimental solutions have been obtained using the algorithms developed by Artigao (2009).
Table 2. Experimental results and estimations
0.2458 0.3727
0.2154 0.3604
0.184 0.3479
0.1516 0.3352
0.1182 0.3223
Figure 5 shows the probabilities found exactly (E) by the BDD and the results predicted (P) by the new approach. It is observed that the probability is inversely proportional to the number of AND gates and proportional to the level, which is expected. Moreover, the effect of adding a new AND gate is inversely proportional to the level. Figure 5 also plots (black curve) the absolute deviation, expressed as abs((E-P)/P). The deviation is proportional to the number of gates, with values always below 0.45%. This demonstrates that the accuracy of the solutions found by the new approach is very good in every case.
The deviation has been estimated for different levels and numbers of AND gates and is presented in Figure 6. It has been estimated through a quadratic polynomial expression and is useful for knowing approximately the accuracy of the probabilities estimated in Table 2.
Figure 6. Deviation vs. Number of AND
A study similar to the one presented in Figure 5 has been carried out for the number of CSs and is shown in Figure 7. At each level, the number of CSs grows as the number of AND gates increases, and, for the same number of AND gates, the number of CSs is smaller when the level is deeper. The error is less relevant for the CSs than for the probabilities because it is the same regardless of the number of CSs. This is relevant for estimating the computational cost of solving the problem. Exponential expressions have been used to evaluate the number of CSs.
Figure 7. CS Analysis
CONCLUSION
DM via the LDT and its conversion to a BDD is presented in this chapter. This approach often requires the decision maker to obtain a complex analytic expression of the occurrence probability of the MP. The complexity of this expression depends on the number of BCs and the topology of the LDT. When the LDT is formed by a high number of components, the ranking of the different events is an essential factor for an efficient conversion of the logical decision tree. With this purpose, several ranking methods are presented in this chapter, taking into account that there is no single method that provides the best outcomes for all cases. Moreover, a simple example is presented to help the reader understand the procedure followed.
An analysis of different scenarios regarding the AND gates and levels is carried out in this chapter. It has been demonstrated that the number of CSs, and therefore the computational cost, can increase so significantly that it becomes unviable to find a solution in a reasonable time. Some significant conclusions can be gathered from the proposed analysis:
• A higher number of AND gates produces a reduction of the final probability. This is logical because AND gates require the occurrence of several BCs to propagate a problem to the top of the LDT. Another conclusion related to this fact is that the reduction of the final probability depends on the level at which the AND gate is placed: the deeper the AND gate is placed, the greater the reduction it produces.
• A higher number of AND gates produces an exponential increase in the number of CSs. The deeper the AND gate is placed, the smaller the increase produced. This is relevant information because the number of CSs is strongly related to the computational cost.
This chapter also presents a novel approach that, for topologies that are not overly complex, allows the solution of different scenarios of an LDT problem to be estimated by employing simple regression techniques. Polynomial and exponential expressions have been used for this purpose. The approach yields solutions with very good accuracy for scenarios associated with a large number of CSs. It therefore reduces the computational cost of solving the problem and becomes a useful procedure for dealing with big data.
This work was previously published in the Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence edited by
Noor Zaman, Mohamed Elhassan Seliaman, Mohd Fadzil Hassan, and Fausto Pedro Garcia Marquez, pages 190–207, copyright year 2015 by
Information Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
The work reported herewith has been financially supported by the Spanish Ministerio de Economía y Competitividad, under Research Grant DPI2012-31579, and the EU project OPTIMUS (Ref.:
REFERENCES
Artigao, E. (2009). Análisis de árboles de fallos mediante diagramas de decisión binarios. Proyecto fin de Carrera . Ciudad Real: Universidad de
Castilla La Mancha.
Brace K. S. Rudell R. L. Bryant R. E. (1990). Efficient implementation of a BDD package.27th ACM/IEEE Design Automation Conference.
10.1145/123186.123222
Bryant, R. E. (1986). Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, 35(8), 677–691.
García Márquez, F. P., & Moreno, H. (2012). Introducción al Análisis de Árboles de Fallos: Empleo de BDDs . Editorial Académica Española.
Hurwitz, J., Nugent, A., Halper, F., & Kaufman, M. (2013). Big Data for Dummies. Wiley & Sons, Inc.
Lee, C. Y. (1959). Representation of switching circuits by binary decision diagrams. Bell System Technical Journal, 38(4), 985–999. doi:10.1002/j.1538-
7305.1959.tb01585.x
Lopez, D., & Van Slyke, W. J. (1977). Logic Tree Analysis for Decision Making. Omega, The Int. Journal of Management Science , 5, 5.
Ohlhorst, F. (2013). Turning Big Data into Big Money . Wiley & Sons, Inc.
Sathi, A. (2012). Big Data Analytics. Disruptive Technologies for Changing the Game. MC Press Online .
Zikopoulos, P., & Eaton, C. (2011). Understanding big data: Analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media.
KEY TERMS AND DEFINITIONS
Basic Events: The basic events are logical variables that adopt two possible states: 1 if the basic event occurs and 0 if it does not occur. They can be associated with a component of a system, an event of interest, the cause of a problem, etc.
Binary Decision Diagram (BDD): A BDD is a directed acyclic graph (DAG) that simulates a logical function. The main advantage of the
BDDs is the possibility of evaluating the top event using implicit formulas.
Cut-Sets (CSs): The CSs of a BDD are the paths from the root node to the terminal nodes with value 1. They represent the series of events that
have to occur so that the top event occurs. The size of a BDD can be represented by the number of CSs forming it. The probability of the top event
is the sum of the probabilities of the CSs.
Logical Decision Tree (LDT): It is a graphical representation of a structure function. An LDT consists of a root node (top event) that is broken down into various nodes located below it, where the nodes can be events, logical gates and branches. It represents the interrelations between the basic events that form a more complex event.
Non-Deterministic Polynomial-Time Hard Problem (NP-Hard): The NP-hard problems are a class of problems that are at least as hard as the hardest problems in NP. If there were a polynomial-time algorithm to solve any NP-hard problem, it could be used to solve every NP problem.
Ranking Methods: The efficiency of the conversion from LDT to BDD depends strongly on the ordering of the basic events of the LDT. For this purpose there are heuristic algorithms that try to order these events. There is no single method that provides the best ordering for all cases, so different methods need to be considered when a conversion is required.
Top Event: This is the event placed at the highest level of the LDT. It represents the main cause, or the event that is intended to be studied.
APPENDIX
Youqin Pan
Salem State University, USA
ABSTRACT
Big data is a buzzword today, and security of big data is a big concern. Traditional security standards and technologies cannot scale up to deliver
reliable and effective security solutions in the big data environment. This chapter covers big data security management from concepts to real-
world issues. By identifying and laying out the major challenges, industry trends, legal and regulatory environments, security principles, security
management frameworks, security maturity model, big data analytics in solving security problems, current research results, and future research
issues, this chapter provides researchers and practitioners with a timely reference and guidance in securing big data processing, management,
and applications.
INTRODUCTION
Big Data has become an entrenched part of discussions of new development in information technology, businesses, governments, markets, and
societies in recent years. It has inspired noteworthy excitement about the potential opportunities that may come from the study, research,
analysis, and application of big data. However, accompanying those enticing opportunities and prospective rewards, there are significant
challenges and substantial risks associated with big data. One of the biggest challenges for big data is increased security risk.
Security for big data is magnified by the volume, variety, and velocity of big data. With the proliferation of the Internet and the Web, pervasive
computing, mobile commerce, and large scale cloud infrastructures, today’s data are coming from diverse sources and formats, at a dynamic
speed, and in high volume. Traditional security measures are developed for clean, structured, static, and relatively low volume of data.
Undeniably, big data presents huge challenges in maintaining the integrity, confidentiality, and availability of essential data and information.
At the same time as big data is gaining momentum in the networked economy, security attacks are on the rise. Security perpetrators are
becoming more sophisticated, wide-spread, and organized. As businesses and other organizations are moving towards a more open, user-centric,
agile, and hyper connected environment that fosters intelligence, communication, innovation, and collaboration, attackers are eager to exploit
the intrinsic vulnerabilities that come with open and dynamic systems. Since big data are generated by those systems, and newly developed big
data infrastructures such as NoSQL databases, cloud storage, and cloud computing have not been thoroughly scrutinized for their capability of
safeguarding data resources, information security presents a formidable challenge to big data adaptation and management.
By identifying and laying out the major challenges, issues, industry trends, framework, maturity model, and fundamental principles of big data
security, this chapter will provide researchers and practitioners with a timely reference and guideline in securing big data processing,
management, and applications.
BACKGROUND
Big Data is generally considered to have three defining characteristics: volume, variety, and velocity (Zikopoulos, et al. 2012). When at least one of the dimensions is significantly high, the data is labeled big. Traditional techniques and technologies are not sufficient to handle big data. With the enormous size, speed, and/or multiplicity, big data processing requires a set of new forms of technologies and approaches to achieve effective decision support, insight discovery, and process optimization (Laney, 2001). Although the three V's definition of big data has wide acceptance, there have recently been attempts to expand the dimensions of big data to include value and veracity (Demchenko, et al., 2013). The value dimension of big data deals with drawing inferences and testing hypotheses, and the veracity dimension is about authenticity, accountability, availability, and trustworthiness. Some researchers (e.g., Biehn, 2013) have suggested adding value and viability to the three V's.
Although big data has been discussed for over a decade since 2000, interest in big data has only experienced significant growth in the last few
years. Figure 1 shows the Google search interest for the search term “Big Data” from January 2004 to June 2014. The figure does not show actual
search volume. The y-axis represents search interest relative to the highest point, with the highest point scaled to 100. For more historical
information about big data, the reader is referred to Press (2013), which documents the history of big data that dates back to the 1940s.
Figure 1. Big Data Web Search Interest, January 2004 – June
2014
Source: Google Trends
Soon after interest in big data gained significance, search interest in big data security and big data privacy also experienced
remarkable growth. Figure 2 shows the Google search interest for the search term “Big Data Security” (top line) and “Big Data Privacy” (bottom
line) from January 2011 to June 2014.
Figure 2. Big Data Security/Privacy Web Search Interest,
January 2011 – June 2014
Source: Google Trends
In order to clearly articulate, define, and build the big data ecosystem, there has been a concerted effort to develop big data architecture by both
standard organizations such as National Institute of Standards and Technology (NIST) and major information technology companies such as
IBM, Microsoft, and Oracle. Demchenko et al. (2013) consolidated previous models of big data architectures and proposed a Big Data
Architecture Framework. This framework consists of five key components of the big data ecosystem: Data Models, Big Data Management, Big Data Analytics and Tools, Big Data Infrastructure, and Big Data Security. The data model defines the type and structure of data. Big data management deals with planning, monitoring, and managing the big data life cycle. Big data infrastructure consists of hardware, software, and technology services (often cloud based), including the security infrastructure that establishes access control, confidentiality, trust, privacy, and quality of service.
Big data security goes beyond the security infrastructure to provide security lifecycle management, security governance, policy and enforcement,
fraud detection and prevention, risk analysis and mitigation, and federated access and delivery infrastructure for cooperation and service
integration. A more specific security framework was proposed by Zhao et al. (2014) for big data computing across distributed cloud data centers.
BIG DATA SECURITY MANAGEMENT
The complexity resulting from the five dimensions (5 V’s) of big data makes the management of big data more challenging than traditional
database or knowledge base management. Major challenges in handling big data include storage challenge, which must deal with increased size,
cost, and scalability requirements; network challenge, which involves the accessibility, reliability, and security of obtaining and sharing data with
both internal constituents as well as external customers, suppliers, and other partners; data integrity challenge, which necessitates data
authentication, validation, consistency checking, and backup; and metadata challenges, which require the establishment of data ontologies and
data governance.
The NIST defines information security as “The protection of information and information systems from unauthorized access, use, disclosure,
disruption, modification, or destruction in order to provide confidentiality, integrity, and availability.” (NIST, 2013)
Although security and privacy principles for traditional data can be applied to big data, and many of the existing security technologies and best practices can be extended to the big data ecosystem, the different characteristics of big data require modified approaches to meet the new challenges of effective big data management. There are several areas in which big data faces different or higher risks than traditional data. Higher volume translates into higher risk of exposure when a security breach occurs. More variety means new types of data and more complex security measures. Increased data velocity implies added pressure for security measures to keep up with the dynamics and for faster response/recovery times. Most organizations are just starting to embrace big data, so big data governance and security policy are likely not at a high level of maturity, which is also a reflection of the immaturity of the big data industry and the big data ecosystem.
Security and Privacy Legal Requirements
Big data, as a relatively new paradigm, needs new laws and regulations to safeguard security and privacy. In general, the legal system is still immature and unbalanced in the global environment. The United States and the European Union have a number of laws regarding information privacy and security. Those in the US include the Privacy Act of 1974, the Privacy Protection Act (PPA) of 1978, the Health Insurance Portability and Accountability Act (HIPAA) of 1996, the Children's Online Privacy Protection Act (COPPA) of 1998, the Children's Internet Protection Act (CIPA) of 2001, the Electronic Communications Privacy Act (ECPA) of 1986, and the Federal Information Security Management Act (FISMA) of 2002.
In Europe, the United Kingdom passed the Data Protection Act (DPA) in 1998. The law regulates data processing on personal identifiable
information (PII). The European Union adopted The Data Protection Directive in 1995, which governs personal data processing within the
European Union. The European Union Internet Privacy Law was ratified in 2002. A new security and privacy law, The European General Data
Protection Regulation, was drafted in 2012, which supersedes the Data Protection Directive. The Federal Assembly of the Swiss Confederation
passed the Federal Act on Data Protection (FADP) in 1992.
The Privacy and Data Security Law Deskbook (Sotto, 2013) provides coverage of major International laws of information security and privacy,
including the United States, the European Union, Australia, and Singapore. The book offers practical guidance and explanations of different aspects of
security and privacy issues in different legal environments, and serves as a comprehensive guide on international security and privacy topics. For
new developments and general information about changes in privacy and security, the Privacy and Information Security Law Blog provides
current coverage of global privacy and cybersecurity law updates and analysis.
Big Data Security Challenges
Security has become an increasing concern as more and more people are connected to the global network economy. According to the Internet Live Stats website, the number of worldwide Internet users is approaching three billion (Internet Users). There has never been a lack of security attacks, from the early days of telephone phreakers to modern-day cyber criminals. As big data increases in volume, speed, and variety, the likelihood of data breaches is expected to increase significantly. It has recently been estimated that the average security breach results in a loss of $40 million for American companies (Rowe, 2012). Symantec, an Internet security firm, estimated that the worldwide cost of cybercrime is over $100 billion, more than the cost of the illegal drug market.
The 2014 first quarter IBM X-Force Threat Intelligence Quarterly report (IBM, 2014) states that attackers seeking valuable data will spare no effort in looking for security vulnerabilities. They will devise new tools or revitalize old techniques to break through security defenses. IBM X-Force Research actively conducts security monitoring and analysis of global security issues. It has built a database of more than 76,000 computer security vulnerabilities. The Cloud Security Alliance (2013a) publishes an annual list of the top security threats in cloud computing.
Figure 3 shows the most common attack types based on data collected in 2013. Undisclosed attacks involve either nascent tools/techniques or attacks that were not identified by the intrusion detection system. The most common attack among the known types is the distributed denial of service (DDoS) attack. SQL injection is a code injection technique targeting databases or data-driven applications. Watering hole is a computer attack method first described by the RSA security firm in 2012. This attack aims at a particular group of organizations and penetrates the group by infecting group members who frequent certain targeted websites. Cross-site scripting is used by attackers to inject client-side scripts into Web pages visited by website users. Malicious activities can be carried out through the injected client-side scripts. Phishing is commonly used by attackers masquerading as a trustworthy entity to gain access to personal information such as usernames, passwords, and/or credit card data. A targeted phishing attack directed at specific individuals or organizations is referred to as spear phishing.
Figure 3. Most Common Attack Types in 2013
Source: IBM X-Force Threat Intelligence Quarterly (2014)
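To make the SQL injection mechanism concrete, the sketch below contrasts a query built by string concatenation with a parameterized query, using Python's built-in sqlite3 module; the users table and attacker input are hypothetical, and the example illustrates the attack class named above rather than anything taken from the IBM report.

```python
import sqlite3

# Illustrative only: how a SQL injection works and how parameterized
# queries defend against it. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'analyst')")

user_input = "x' OR '1'='1"   # attacker-supplied value

# Vulnerable: the input is concatenated into the SQL text, so the
# injected OR clause makes the query return every row instead of none.
vulnerable = conn.execute(
    "SELECT name, role FROM users WHERE name = '" + user_input + "'"
).fetchall()

# Safer: a parameterized query treats the input as a literal value.
parameterized = conn.execute(
    "SELECT name, role FROM users WHERE name = ?", (user_input,)
).fetchall()

print("vulnerable query returned:", vulnerable)        # leaks all rows
print("parameterized query returned:", parameterized)  # returns nothing
```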
The volume of data continues to grow exponentially, and the majority of big data is unstructured. Those unstructured data must be formatted and processed using big data technologies such as Hadoop. Unfortunately, Hadoop was not developed with enterprise applications and security in the blueprint. It was designed to process large volumes of unstructured data such as those available on publicly accessible websites. The original purpose did not include security, privacy, encryption, and risk management. Organizations using Hadoop clusters must implement additional enterprise security measures such as access control and data encryption. As Hadoop has become a widely adopted technology in enterprise big data applications, its insufficiency in enterprise security must be recognized. A new unified security model for big data is needed.
Traditional data security technologies were developed around databases or data centers that typically have a single physical location, while big
data is typically in distributed, large scale clusters. For example, a Hadoop cluster may consist of hundreds or thousands of nodes. As a result,
traditional backup, recovery, and security technologies and systems are not effective in big data environments.
Big data is a new paradigm in data science and data applications. New security measures are needed to protect, access, and manage big data. Comprehensive security policies and procedures need to expand from structured data in traditional databases and data warehouses to the new big data environment. Sensitive data needs to be flagged, encrypted, and redacted in NoSQL databases, and access to it needs to be granularly controlled. All access to those data by either users or applications needs to be logged and reported.
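As a rough illustration of the flag-encrypt-redact-log pattern just described (a sketch under stated assumptions, not a prescription from this chapter), the following Python snippet protects sensitive fields of a document before it would be written to a NoSQL store. It relies on the third-party cryptography package; the field names, key handling, and authorization flag are hypothetical simplifications.

```python
# Minimal sketch: flag, encrypt, and redact sensitive fields, and log access.
# Requires: pip install cryptography. Field names and the store are hypothetical.
import json
import logging
from cryptography.fernet import Fernet

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("bigdata.access")

SENSITIVE_FIELDS = {"ssn", "credit_card"}   # flagged at ingestion time
key = Fernet.generate_key()                 # in practice: from a key-management service
cipher = Fernet(key)

def protect(record: dict) -> dict:
    """Encrypt flagged fields and keep only a redacted preview in clear text."""
    out = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            out[field + "_enc"] = cipher.encrypt(str(value).encode()).decode()
            out[field + "_redacted"] = "****" + str(value)[-4:]
        else:
            out[field] = value
    return out

def read_field(doc: dict, field: str, user: str, authorized: bool) -> str:
    """Granular access control plus audit logging for a single sensitive field."""
    log.info("user=%s field=%s granted=%s", user, field, authorized)
    if not authorized:
        return doc[field + "_redacted"]
    return cipher.decrypt(doc[field + "_enc"].encode()).decode()

doc = protect({"name": "alice", "ssn": "123-45-6789", "region": "us-east"})
print(json.dumps({k: v for k, v in doc.items() if not k.endswith("_enc")}, indent=2))
print(read_field(doc, "ssn", user="analyst", authorized=False))  # redacted view
print(read_field(doc, "ssn", user="dba", authorized=True))       # decrypted value
```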
The Big Data Working Group of the Cloud Security Alliance (2013) listed the top ten security and privacy challenges for big data. Those ten
challenges are organized into four categories of the big data ecosystem: 1) Infrastructure Security, 2) Data Privacy, 3) Data Management, and 4)
Integrity and Reactive Security. Note that the top ten challenges were developed based on a list of high priority security and privacy problems
identified by Cloud Security Alliance members and security trade journals. The big data working group then studied published solutions and
selected those problems without established or effective solutions.
Table 1. Top Ten Big Data Security and Privacy Challenges
There are several security frameworks that have been widely adopted, such as the NIST Cybersecurity Framework, Control Objectives for Information and Related Technology (COBIT), and the information security management system (ISMS) established by the International Organization for Standardization (ISO 27000). A security framework establishes principles, guidelines, standards, and a set of preferred practices for organizations to analyze, plan, design, and implement information security systems. ISO 27000 is the only framework under which organizations can be certified by a third party once standards such as security policy, asset management, cryptography, and access control are satisfied.
The ISO 27000 family consists of a series of standards. Disterer (2013) discussed ISO/IEC 27000, 27001, and 27002 for information security management, and concluded that the ISO 27000, 27001, and 27002 standards can serve as a framework to design and operate an information security management system. Those standards have been widely disseminated in Europe and Asia. It is expected that the adoption of standards-based security frameworks will increase significantly in the future due to the rising challenges in information security and privacy.
The ISO 27000 standard was developed in 2009 and continues to evolve to meet new challenges in the security industry (ISO 27000, 2009, 2012, 2014). Plan-Do-Check-Act, the classic quality management framework for control and continuous improvement, forms the basis for security management in ISO 27000. Although the current ISO 27000 standard does not provide specific provisions for big data security, we expect that future versions will add this important coverage.
Figure 4 shows a general big data security management (BDSM) framework we adapted from the Plan-Do-Check-Act Deming cycle. The design/planning phase defines, identifies, and evaluates security risk, and develops appropriate policies, procedures, controls, and measures according to the desired security level. This phase also includes security modeling, which builds a threat model and lays out strategies for most security/privacy breach scenarios. The design blueprint is turned into operational systems in the implementation phase. The operation phase delivers the desired functions and services with security and privacy protection and compliance with laws and regulations. The monitoring and auditing phase measures key performance indicators against design objectives and provides feedback and performance reports. The deficiencies identified at the monitoring stage are used as input for big data security system redesign, quality control, and continuous improvement.
Figure 4. Big Data Security Management Framework
Successful adoption of a security management framework is critical in helping organizations identify and assess information security risks, apply suitable controls, adequately protect information assets against security breaches, and achieve and maintain regulatory compliance. The ISO 27000 stipulates that an information security management framework must deliver the following desirable outcomes (ISO/IEC 27000, 3rd Ed., 2014):
Big Data Security Principles
Cavoukian (2013) has developed a widely recognized framework for information privacy—Privacy by Design. The central theme of Privacy by
Design is that in the age of big data, privacy cannot be assured merely by compliance with laws and regulations. Privacy must be ingrained in organizational design, thereby becoming the default mode of operation. The seven principles of privacy proposed by Cavoukian can be readily
modified and adopted for big data security. We propose nine foundational principles for big data security:
4. Defense in depth;
5. End-to-end security;
7. User-centric;
8. Real-time monitoring;
Protective measures or subsystems, consisting of deterrence, avoidance, prevention, mitigation, detection, response, recovery, and correction, should be part of the big data security management system. The big data security best practices suggested by the Association for Data-driven Marketing & Advertising (ADMA, 2013) include the following:
1. Implementation of end-to-end security measures;
Big Data Security Maturity Model
A useful approach for assessing an organization's preparedness for big data security is a big data security maturity model. The idea of a maturity model was popularized by the Software Engineering Institute (SEI) at Carnegie Mellon University after it proposed the Capability Maturity Model (Paulk, et al., 1995). The classic capability maturity model has five levels: initial, repeatable, defined, managed, and optimizing. According to SEI, software development predictability, effectiveness, and manageability improve as the organization moves up the maturity levels. The capability maturity model has been adopted in various fields other than software development, for example, the business process maturity model, risk maturity model, and privacy maturity model.
Figure 5 shows a big data security management maturity model we adapted from the capability maturity model. Along the maturity spectrum, at the very beginning, Nonexistent, there is no awareness of big data security risks and challenges. In the Initial stage, the organization has recognized the importance of big data security management in the face of big data security challenges. It is the starting point of deploying new or undocumented processes. This stage is characterized by ad hoc, or even chaotic, approaches and unpredictable results. In the Developing stage, a broad assessment of big data security risk is carried out, certain key areas have been implemented, and big data security processes are documented. At the Defined level, the big data security processes are standardized and embedded in routine business operations. At the Managed level, performance metrics are established, big data security processes are measured against those metrics, and continuous improvement is embedded into the operations. Finally, at the Optimizing level, big data security processes are fine-tuned to achieve optimal results, delivering the highest level of security within the resource constraints.
Figure 5. Big Data Security Management Maturity Model
Big Data for Security Industry
Although there are many challenges in big data management and applications, there are also many opportunities presented by big data that lead to new discoveries, insights, and innovative applications. In the security industry, big data analytics has helped organizations proactively identify and predict security attacks (Hipgrave, 2013). With the advancement of big data analytics, organizations will be able to collect and analyze large volumes of data from various unstructured data sources such as email, social media, e-commerce transactions, bulletin boards, surveillance video feeds, Internet traffic, etc. Forward-looking organizations are moving beyond traditional security measures to safeguard their coveted information assets. Big data security monitoring and analytical tools can be a valuable aid to security professionals and law enforcement in their fight against cyber criminals. An example of such an application is analyzing data in confluence and in motion to detect cyber fraud/attacks before serious damage occurs.
Big data security should not be an afterthought, even though the security industry is often playing catch-up with the perpetrators. Those organizations that adopt a big data security framework and follow big data security principles will be able to take advantage of big data analytics and be able to predict, detect, act, respond, and recover quickly from security breaches. Once an organization reaches a sufficient big data security maturity level, it will be in a solid position to defend against all known security attacks. It may even be possible for the organization to anticipate and mitigate new and unforeseen threats. Common big data analytics applications in security and public safety include crime analysis, computational criminology, terrorism informatics, open-source intelligence, and cyber security (Chen, Chiang, and Storey, 2012).
RESEARCH ISSUES AND DIRECTIONS
Big data security is of great interest to security practitioners as well as academic researchers. Many believe that we have entered the era of big
data that offers enormous opportunities for transforming data science and data driven research and applications. With the advancement of big
data technologies, infrastructure, data mining, business intelligence, and big data analytics, researchers and business analysts are able to gain
more insights about big data security challenges, risk assessment and mitigation, and be able to develop more effective business solutions
(Chang, Kauffman, and Kwon, 2014).
However, we are still at the ground level of the big data paradigm shift. There are still many unsolved issues in big data security management. Most
traditional security tools and technologies do not scale and perform well in big data environments. Substantial effort is needed to develop a
holistic and robust big data security ecosystem. Some pioneering work has been done (Demchenko, Ngo, and Membrey, 2013; Disterer, 2013).
However, there are still lingering questions such as: Is cloud computing the right infrastructure to support big data storage and processing? Is the
architectural flexibility of NoSQL databases amenable to trustworthy big data security? How can we trust big data that come from a variety of
untrusted input sources? Will big data “usher in a new wave of privacy incursions and invasive marketing”? (Boyd and Crawford, 2012) How do
we balance the need for big data security and usability?
While we try to address those questions, it is important for security professionals, regulatory bodies, law enforcement, and big data service providers as well as users to keep in mind that big data security continues to evolve as the five V's of the big data ecosystem are constantly changing and increasing. More and more people will be involved with big data research and development, services, applications, and consumption. No doubt the number of people interested in exploiting big data with malicious intent is also increasing.
As discussed in the previous section, there are many challenges in big data security and privacy. Here we list some key research issues facing big
data researchers and practitioners:
1. Robust security standards designed and developed for big data from the ground up. The lack of established security standards has resulted in some vendors of big data systems, such as NoSQL solutions, developing ad hoc security measures.
2. Built-in security at the granular level. This has the potential to allow security control and management with surgical precision.
3. A big data security architecture that provides a security framework to integrate diverse security technologies that are available today, but
do not work coherently together. Rather than piecemeal and separate solutions, we need a holistic approach to deal with current and
emergent security threats.
4. Continue the development of the big data ecosystem, of which big data security is a key component. Security should be deep-rooted in all aspects of the big data lifecycle, which includes sourcing, storage, processing, analysis, and delivery of results to targeted applications.
5. Reshape the cloud infrastructure for better big data security and trustworthiness, in order to alleviate the concern that big data security
and privacy are cloudy in the cloud.
6. Continue the promising research in addressing information technology/system security problems using big data and big data analytics.
7. Real-time big data security analytics that integrates natural and artificial intelligence for advanced security threat analysis, prediction,
detection, tracking, and response.
CONCLUSION
It is evident that big data is with us today and it is going to have a significant impact on governments, businesses, societies, and individuals in the
future. As we embrace the opportunities brought by big data, we must also be keenly aware of the security challenges associated with big data
infrastructure and applications. In this chapter, we explored the concepts of big data security, presented big data security management
frameworks, summarized big data security challenges, identified big data security principles, and suggested a big data security maturity model.
Furthermore, we discussed the application of big data analytics in solving security related problems.
Although big data security is immature today, we believe, with concerted effort from industries, governments, academicians, and practitioners,
big data security will improve over time to meet the challenges discussed in this chapter. This is similar to the case of Internet security, which was an afterthought until the need for security became critical and evident. The initial Internet architecture had little consideration for security and privacy. However, as the Internet and the World Wide Web grew exponentially in the 1990s, multi-layer Internet security protocols were developed, and those security standards helped facilitate the growth of the Internet and Internet applications. We are optimistic that big data security will
follow a similar path. This chapter provides basic concepts, principles, challenges, and current issues of big data security. We hope it serves as a
launching pad for advancing big data security research in the future.
This work was previously published in the Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence edited by Noor Zaman, Mohamed Elhassan Seliaman, Mohd Fadzil Hassan, and Fausto Pedro Garcia Marquez, pages 53-66, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
ADMA. (2013). Best Practice Guideline: Big Data. A guide to maximising customer engagement opportunities through the development of responsible Big Data strategies. Retrieved from http://www.adma.com.au
Boyd, D., & Crawford, K. (2012). Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon. Information, Communication and Society, 15(5), 662-675. doi:10.1080/1369118X.2012.678878
Chang, R. M., Kauffman, R. J., & Kwon, Y. (2014). Understanding the paradigm shift to computational social science in the presence of big data. Decision Support Systems, 63, 67-80. doi:10.1016/j.dss.2013.08.008
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data to Big Impact. Management Information Systems Quarterly, 36(4), 1165-1188.
Cloud Security Alliance. (2013b). Expanded Top Ten Big Data Security and Privacy Challenges. Big Data Working Group. Retrieved from https://cloudsecurityalliance.org/research/big-data/
Demchenko, Y., Grosso, P., de Laat, C., & Membrey, P. (2013). Addressing big data issues in Scientific Data Infrastructure. In Proceedings of the International Conference on Collaboration Technologies and Systems (CTS). IEEE. doi:10.1109/CTS.2013.6567203
Disterer, G. (2013). ISO/IEC 27000, 27001 and 27002 for Information Security Management. Journal of Information Security, 4(2), 92-100. doi:10.4236/jis.2013.42011
Hipgrave, S. (2013). Smarter fraud investigations with big data analytics. Network Security, 2013(12), 7-9. doi:10.1016/S1353-4858(13)70135-1
Paulk, M. C., Weber, C. V., Curtis, B., & Chrissis, M. B. (1995). The Capability Maturity Model: Guidelines for Improving the Software Process. SEI series in software engineering. Reading, MA: Addison-Wesley.
Press, G. (2013). A Very Short History of Big Data. Forbes Magazine. Retrieved from http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/
Shmueli, G., & Koppius, O. R. (2011). Predictive Analytics in Information Systems Research. Management Information Systems Quarterly, 35(3), 553-572.
Sotto, L. J. (2013). Privacy and Data Security Law Deskbook. Frederick, MD: Aspen Publishers.
Zhao, J., Wang, L., Tao, J., Chen, J., Sun, W., Ranjan, R., & Georgakopoulos, D. (2014). A security framework in G-Hadoop for big data computing across distributed Cloud data centres. Journal of Computer and System Sciences, 80(5), 994-1007. doi:10.1016/j.jcss.2014.02.006
Zikopoulos, P. C., Eaton, C., deRoos, D., Deutsch, T., & Lapis, C. (2012). Understanding big data: Analytics for enterprise class Hadoop and streaming data. New York, NY: McGraw-Hill.
KEY TERMS AND DEFINITIONS
Big Data: Unstructured data with five characteristics: volume, velocity, variety, veracity, and value.
Big Data Security: The protection of big data from unauthorized access and the assurance of big data confidentiality, integrity, and availability.
Big Data Security Framework: A framework designed to help organizations to identify, assess, control, and manage big data security and
maintain regulatory compliance.
Big Data Security Maturity Model: A multi-level model used to help organizations articulate where they stand in the spectrum of big data security, from nonexistent to optimizing.
Privacy: An individual’s right to safeguard personal information in accordance with law and regulations.
Security: A state of preparedness against threats to the integrity of the organization and its information resources.
Security Attack: An attempt to gain unauthorized access to information resource or services, or to cause harm or damage to information
systems.
CHAPTER 14
Feature-Based Uncertainty Visualization
Keqin Wu
University of Maryland Baltimore County, USA
Song Zhang
Mississippi State University, USA
ABSTRACT
While uncertainty in scientific data attracts an increasing research interest in the visualization community, two critical issues remain
insufficiently studied: (1) visualizing the impact of the uncertainty of a data set on its features and (2) interactively exploring 3D or large 2D data
sets with uncertainties. In this chapter, a suite of feature-based techniques is developed to address these issues. First, an interactive visualization
tool for exploring scalar data with data-level, contour-level, and topology-level uncertainties is developed. Second, a framework of visualizing
feature-level uncertainty is proposed to study the uncertain feature deviations in both scalar and vector data sets. With quantified representation
and interactive capability, the proposed feature-based visualizations provide new insights into the uncertainties of both data and their features
which otherwise would remain unknown with the visualization of only data uncertainties.
1. INTRODUCTION
Uncertainty is a common and crucial issue in scientific data. The goal of uncertainty visualization is to provide users with visualizations that incorporate uncertainty information to aid data analysis and decision making. However, it is challenging to quantify uncertainties appropriately and to visualize them effectively without degrading the visualization of the underlying data.
Uncertainty in scientific data can be broadly defined as statistical variations, spread, errors, differences, minimum maximum range values, etc.
(Pang, Wittenbrink, & Lodha, 1997) This broad definition covers most, if not all, of the possible types and sources of uncertainty related to
numerical values of the data. In this chapter, however, we investigate the uncertain positional deviations of the features such as extrema, sinks,
sources, contours, and contour trees in the data. These feature-related uncertainties are referred to as feature-level uncertainty while the
uncertainties related to the uncertain numerical values of the data are referred to as data-level uncertainty. Visualizing feature-level uncertainty
reveals the potentially significant impacts of the data-level uncertainty, which in turn helps people gain new insights into data-level uncertainty
itself. Therefore, investigating the uncertainty information on both data-level and feature-level provides users a more comprehensive view of the
uncertainties in their data.
Many uncertainty visualizations encode data-level uncertainty information into different graphics primitives such as color, glyph, and texture,
which are attached to surfaces or embedded in volumes (Brodlie, Osorio, & Lopes, 2012; Pang et al., 1997). Those methods, in essence, give global insight into the data by differentiating areas of high uncertainty from areas of low uncertainty; however, the impact of the uncertainty on the
important features of the data is hard to assess in such visualizations. Meanwhile, uncertainty visualizations may be subject to cluttered display,
occlusion, or information overload due to the large amount of information and interference between the data and its uncertainty. We believe that
one promising direction to cope with this challenge is to allow users to explore data interactively and to provide informative clues about where to
look.
In this chapter, while our uncertainty representation can be adapted to different uncertainty models, we measure the uncertainty according to the
differences between the data values, critical points, contours, or contour trees of different ensemble members. Our objectives are (1) to bring
awareness to the existence of feature-level uncertainties, (2) to suggest metrics for measuring feature-level uncertainty, and (3) to design an
interactive tool for exploring 3D and large 2D data sets with uncertainty.
In what follows, we first discuss related work, issues, and challenges of uncertainty visualization in section 2, and then explain our methods in detail in sections 3 and 4: (1) an interactive contour tree-based visualization for exploratory visualization of 2D and 3D scalar data with uncertainty
information and (2) a framework for visualizing feature-level uncertainty based on feature tracking in both scalar and vector fields. Lastly, we
conclude this chapter and discuss future directions.
2. BACKGROUND
We discuss issues, challenges, and the related work of uncertainty visualization in this section.
2.1. The Gap between Data-Level Uncertainty and Feature-Level Uncertainty
Knowing the uncertainty concerning features is important for decision making. Many uncertainty visualizations based on statistical metrics
merely measure uncertainty on the data-level—the uncertainty concerning the numerical values of the data and introduced in data acquisition
and processing. While these techniques achieved decent visualization results, they do not provide users enough insight into how much
uncertainty exists for the features in the data. For example, the uncertainty of the ocean temperature data may result in the uncertain deviation of
the center of an important warm eddy. The uncertainty of the hurricane wind data may cause the uncertain location of a hurricane eye. This kind
of uncertainty is neglected by many current methods but needs to be quantified and visualized so viewers are aware of it.
In scientific data, the difference between a known correct datum and an estimate is among the uncertainties most frequently investigated. To
compare data-level uncertainty and feature-level uncertainty, we investigate two data sets. The first data set is a slice of a simulated hurricane Lili
wind field (Figure 1a). The second data set is created by adding random noise to the first dataset. For a wind vector, its data-level uncertainty is
represented as both angular difference and magnitude difference between vectors of the two fields. Figure 1c shows the arrow glyphs
(Wittenbrink, Pang, & Lodha, 1996) for visualizing the uncertainty of vector fields with the angular uncertainty presented as the span of each
arrow glyph and magnitude uncertainty as the two winglets around an arrow head. For details about designing arrow glyphs for vector field
uncertainty, please refer to (Wittenbrink et al., 1996).
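The per-vector uncertainty measures described above are straightforward to compute. The following NumPy sketch derives the angular and magnitude differences between an original vector field and a noise-perturbed copy; the synthetic fields and noise level are placeholders, since the hurricane Lili data are not reproduced here.

```python
# Sketch: angular and magnitude differences between two 2D vector fields.
# The fields and the perturbation are synthetic stand-ins for illustration.
import numpy as np

rng = np.random.default_rng(0)
u1 = rng.normal(size=(64, 64))                     # original field components (u, v)
v1 = rng.normal(size=(64, 64))
u2 = u1 + rng.normal(scale=0.1, size=u1.shape)     # "second field": added random noise
v2 = v1 + rng.normal(scale=0.1, size=v1.shape)

mag1 = np.hypot(u1, v1)
mag2 = np.hypot(u2, v2)
magnitude_diff = np.abs(mag2 - mag1)               # shown as winglets around the arrow head

# Angular difference per vector, wrapped into [0, pi]; shown as the glyph span.
angle_diff = np.abs(np.arctan2(v2, u2) - np.arctan2(v1, u1))
angle_diff = np.minimum(angle_diff, 2 * np.pi - angle_diff)

print("max magnitude difference:", magnitude_diff.max())
print("max angular difference (degrees):", np.degrees(angle_diff.max()))
```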
Figure 1. An uncertain vector field and its uncertain feature
locations
As shown in Figure 1c, with the uncertainty glyphs, users may notice that the area around the hurricane eye (the major vortex in the middle)
exhibits high data-level uncertainty which raises a question: does it affect the location of hurricane eye? Only an explicit comparison between the
hurricane eyes in the two fields will answer this question. We therefore extract topologies of the two fields as shown in Figure 1d. The sink point
(in black) inside the hurricane eye noticeably shifts northwest (indicated by a red arrow) from the original vector field to the new one (in gray).
Another question is: is there anything hidden in the relatively low uncertainty area indicated by the small arrow glyphs? A quick look at Figure 1d reveals that the upper corner vortex significantly shifts its position (indicated by a red arrow) even though it is located in a region with
relatively low data-level uncertainty.
This example illustrates that merely investigating the data-level uncertainty may not tell the whole story about the data and that the feature-level
uncertainty is an indispensable part of uncertainty that cannot be neglected. Moreover, visualizing uncertainty of features, instead of that of the
data, may provide a succinct and meaningful representation of the uncertainty and thus give a better interpretation of the data.
2.2. Challenges of Visualizing Uncertainty
Representing uncertainty in 3D or large 2D data sets could encounter severe issues such as cluttered displays, information overload, and
occlusions. Many uncertainty visualizations place glyphs that encode uncertainty within the visualization of the data. For instance, Sanyal et al.
(2010) visualized data-level uncertainties via circular or ribbon-like glyphs over a color-mapped image of the data. Despite their effectiveness in
revealing uncertainty information accurately at glyph locations, due to the overlaps between the data and uncertainty glyphs, the number of the
glyphs has to be limited and information loss for both the data and uncertainty is unavoidable. Other techniques which overlay or embed
uncertainty representation in the data visualization face similar issues.
Meanwhile, interaction with the visualization of 3D or large 2D data sets could encounter issues such as geometry bandwidth bottleneck, depth
perception, occlusion, and inefficiency in 3D object selection. These issues are inherent in interactive visualizations, and may be intensified in the
integrated visualization of a 3D dataset or large 2D data set and its uncertainty because there is simply more information to show in such
visualizations.
2.3. Feature-Based Uncertainty Visualization
In feature-based visualization, features are abstracted from the original data and can be visualized efficiently and independently of the data. We
believe that the feature-based uncertainty visualization becomes desirable when the size and complexity of the uncertainty information increase.
Several features interest us the most. Critical points such as sinks, sources, maxima, and minima are representative of topological features that
carry significant physical meaning of a scalar or vector data set (Helman & Hesselink, 1989; Keqin, Zhanping, Song, & Moorhead, 2010).
Contours, including 2D iso-lines and 3D iso-surfaces, are features frequently investigated for exploring data with uncertainty (Grigoryan &
Rheingans, 2004; Sanyal et al., 2010). A contour tree stores the nesting relationships of the contours of a scalar field. It is a popular visualization
tool for revealing the topology of contours (Hamish Carr, Snoeyink, & Axen, 2003), generating seed set for accelerated contour extraction
(Hamish Carr & Snoeyink, 2003), and providing users an interface to select individual contours (Bajaj, Pascucci, & Schikore, 1997).
Several methods have been developed to address the uncertainty of the size, position, and shape of contours (Grigoryan & Rheingans, 2004;
Pfaffelmoser, Reitinger, & Westermann, 2011; Rhodes, Laramee, Bergeron, & Sparr, 2003). Rendering contours from all ensembles in a single
image, known as spaghetti plots (Diggle, Heagerty, Liang, & Zeger, 2002), is a conventional technique used by meteorologists for observing
uncertainty in their simulations. Pang et al. (Pang et al., 1997) presented fat surfaces that use two surfaces to enclose the volume in which the true
but unknown surface lies. Pauly et al. (Pauly, Mitra, & Guibas, 2004) quantified and visualized the uncertainty introduced in the reconstructions
of surfaces from point cloud data. Pfaffelmoser et al. (Pfaffelmoser et al., 2011) presented a method for visualizing the positional variability
around a mean iso-surface using direct volume rendering. A method to compute and visualize the positional uncertainty of contours in uncertain
input data has been suggested by Pöthkow and Hege (Pöthkow & Hege, 2011). Assuming certain probability density functions, they modeled a
discretely sampled uncertain scalar field by a discrete random field.
Among uncertainty visualization methods, only a few have been proposed to address topological features. Otto et al. studied the uncertain topological segmentation of a vector field by introducing the probability density that a particle starting from a given location will converge to a considered source or sink (Otto, Germer, Hege, & Theisel, 2010). The uncertainty related to the topology structure of a scalar field is barely studied.
2.4. Visual Encoding of Uncertainty
Several efforts have been made to identify potential visual attributes for uncertainty visualization. MacEachren suggested the use of hue,
saturation, and intensity for representing uncertainty on maps (1992). Hengl and Toomanian (2006) showed how color mixing and pixel mixing
can be used to visualize uncertainty in soil science applications. Davis and Keller (1997) suggested value, color, and texture for representing
uncertainty on static maps. Djurcilov, Kima, Lermusiaxb, and Pang (2002) used opacity deviations and noise effects to provide qualitative
measures for the uncertainty in volume rendering. Sanyal, Zhang, Bhattacharya, Amburn, and Moorhead (2009) conducted a user study to
compare the effectiveness of four uncertainty representations: traditional error bars, scaled size of glyphs, color-mapping on glyphs, and color-
mapping of uncertainty on the data surface. In their experiments, scaled spheres and color-mapped spheres performed better than traditional error
bars and color-mapped surfaces. Later, they proposed graduated glyphs and ribbons to encode uncertainty information of weather simulations
(Sanyal et al., 2010).
While the uncertainty visualization is application-dependent in many cases, two visualization schemes are widely used: using intuitive
metaphors, such as blurry and fuzzy effects (Cedilnik & Rheingans, 2000; Djurcilova et al., 2002; Grigoryan & Rheingans, 2004), which naturally
implies the existence of uncertainty, and using quantitative glyphs (Sanyal et al., 2010; Schmidt et al., 2004), which shows quantified uncertainty
information explicitly. Both schemes have their own tradeoffs. In quantitative glyphs the uncertainty information has to be shown in a discrete
way. By using uncertainty metaphors, people get less quantified information about uncertainty since they cannot tell levels of blur or fuzziness
apart accurately (Kosara et al., 2002). To reveal uncertainty accurately, uncertainty glyphs are preferred to uncertainty metaphors.
3. AN INTERACTIVE FEATURE-BASED VISUALIZATION OF SCALAR FIELD UNCERTAINTY
In this section, we analyze uncertainty-related information on three levels: at the data level, we study the uncertainty of the data; at the contour level, we quantify the positional variation of the contours; and at the topology level, we reveal the variability of the contour trees.
The core of this method is the use of contour trees as a tool to represent uncertainty and to select contours accordingly. First, a planar contour
tree layout which suppresses the branch crossing and integrates with tree view interaction is developed for a flexible navigation between levels of
detail for contours of 3D or large 2D data sets. Second, we attach uncertainty information to the planar layout of a simplified contour tree to
avoid the visual clutter and occlusion of viewing uncertainty within volume data or complicated 2D data. We call the scalar field obtained by
averaging the values from all the ensemble members at each data point the ensemble mean and the contour tree of the ensemble mean the mean
contour tree. We show the data-level uncertainty by displaying the difference between each ensemble member and the ensemble mean at each
data point or along a contour. For contour-level uncertainty, given a contour in the ensemble mean, we compute the mean and variance of the
differences between this contour in the mean field and its corresponding contours in all the ensemble members. For topology-level uncertainty,
we map the contour trees of all the ensemble members to the mean contour tree and show their discrepancy to indicate uncertainty.
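As a small illustration of the quantities defined in this paragraph, the NumPy sketch below computes the ensemble mean field and the per-point deviation of each member from that mean (the data-level uncertainty); the ensemble size and grid resolution are arbitrary stand-ins, not the chapter's data.

```python
# Sketch: ensemble mean and per-point data-level uncertainty for an ensemble
# of scalar fields. Synthetic data; 8 members on a 128x128 grid are assumed.
import numpy as np

rng = np.random.default_rng(1)
ensemble = rng.normal(size=(8, 128, 128))

ensemble_mean = ensemble.mean(axis=0)             # the "ensemble mean" scalar field
deviations = ensemble - ensemble_mean             # per-member, per-point differences
data_level_uncertainty = np.abs(deviations)       # what the graduated glyphs encode
point_spread = data_level_uncertainty.mean(axis=0)

print("grid point with the largest average deviation:",
      np.unravel_index(point_spread.argmax(), point_spread.shape))
```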
This section is structured as follows. Section 3.1 provides the definition, simplification, layout, and tree view graph design of contour trees. The
visualization of three levels of uncertainties is discussed in section 3.2. Section 3.3 introduces the interface design. Section 3.4 demonstrates and
discusses application results.
3.1. Contour Tree Layout and Tree View Graph Design
We visualize uncertainty and variability information through contour trees. The layout of a contour tree becomes an issue when the size and
complexity of the contour tree increase (Heine, Schneider, Carr, & Scheuermann, 2011). To explore data with uncertainty efficiently, a preferred
contour tree layout is the one that is two-dimensional, shows hierarchy and height information intuitively, suppresses branch self-intersections,
allows for a fast navigation through different levels of simplification, and allows displaying uncertainty information. To meet these requirements,
we propose a planar contour tree layout which is integrated with the tree view graph interaction.
3.1.1. Contour Tree
The contour tree is a loop-free Reeb graph (Tierny, Gyulassy, Simon, & Pascucci, 2009) that tracks the evolution of contours. Each leaf node is a minimum or maximum, each interior node is a saddle, and each edge represents a set of adjacent contours with iso-values between the values of its two ends. There is a one-to-one mapping from a point in the contour tree (at a node or in an edge) to a contour of the scalar field. A contour tree example is shown in Figure 2. More detailed information about contours and contour trees can be found in Carr (2004).
Figure 2. The contour tree and contours of a 3D scalar field. Each horizontal line cuts exactly one edge of the tree for every contour at the corresponding iso-value. Color is used to indicate the correspondence between a line and its corresponding contours.
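The structure just described can be captured in a few lines of code. The following Python sketch is an illustrative contour tree data structure, not the authors' implementation: leaf nodes are extrema, interior nodes are saddles, each edge carries the iso-value interval of the contours it represents, and a horizontal cut at an iso-value yields one contour per crossed edge.

```python
# Sketch of a contour tree data structure; names and values are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    value: float          # scalar value of the critical point
    kind: str             # "minimum", "maximum", or "saddle"

@dataclass
class Edge:
    lower: Node           # end with the smaller scalar value
    upper: Node           # end with the larger scalar value

    def contains(self, iso: float) -> bool:
        """A point on this edge exists for every iso-value between its ends."""
        return self.lower.value < iso < self.upper.value

@dataclass
class ContourTree:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def edges_at(self, iso: float) -> List[Edge]:
        """One contour per edge crossed by the horizontal line at `iso`."""
        return [e for e in self.edges if e.contains(iso)]

# Tiny example: a maximum above a saddle above a minimum.
lo, sad, hi = Node(0.0, "minimum"), Node(0.5, "saddle"), Node(1.0, "maximum")
tree = ContourTree([lo, sad, hi], [Edge(lo, sad), Edge(sad, hi)])
print(len(tree.edges_at(0.75)))   # -> 1 contour at iso-value 0.75
```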
3.1.2. Contour Tree Simplification
Simplification is introduced to deal with the contour trees that are too large or complicated to be studied or displayed directly (Hamish Carr,
Snoeyink, & Panne, 2004; Pascucci, Cole-McLaughlin, & Scorzelli, 2004). Contour tree simplification facilitates a high-level overview of a scalar field. In particular, a simplified contour tree attached with uncertainty glyphs reduces the workload in viewing 3D data or complicated 2D data
with uncertainty.
Usually, a contour tree simplification is conducted by successively removing branches that have a leaf node (extremum) and an inner node
(saddle) (Hamish Carr et al., 2004; Takahashi, Takeshima, & Fujishiro, 2004). Figure 3 shows a top-down simplification (Wu & Zhang, 2013) by
repeatedly assembling the longest branches of the current contour tree. The black numbers 0, 1, ..., 4 indicate the order in which the branches are assembled. As shown in this example, the order of assembling the branches suggests a balanced hierarchy. The shorter branches are found at lower hierarchies and the longer branches are located at higher hierarchies, so that the more simplified contour tree always captures the more significant features. Each branch rooted at an interior node of another branch, other than that branch's two ends, is a child branch. A branch which has child branches is called the parent branch of its child branches. The sub-tree which consists of all the edges and nodes along a branch and its descendants is called the sub-tree of the branch.
Figure 3. Contour tree hierarchy built based on a top-down
contour tree simplification
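A minimal sketch of persistence-driven simplification in the spirit of the description above (not the authors' exact top-down assembly procedure): each branch pairs an extremum with a saddle, its persistence is the vertical length of the branch, and a simplified tree keeps only the most persistent branches.

```python
# Sketch: keep the most persistent branches of a branch decomposition.
# Branch values are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Branch:
    extremum: float      # scalar value at the leaf (minimum or maximum)
    saddle: float        # scalar value at the paired interior node

    @property
    def persistence(self) -> float:
        """Vertical length of the branch in the layout."""
        return abs(self.extremum - self.saddle)

def simplify(branches: List[Branch], keep: int) -> List[Branch]:
    """Keep the `keep` most persistent branches; the rest are pruned."""
    return sorted(branches, key=lambda b: b.persistence, reverse=True)[:keep]

branches = [Branch(10.0, 2.0), Branch(6.0, 5.0), Branch(9.0, 8.5), Branch(4.0, 1.0)]
for b in simplify(branches, keep=2):
    print(b.extremum, b.saddle, b.persistence)
```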
3.1.3. A Rectangular Contour Tree Layout
The key to preventing unnecessary self-intersections is to recursively assign a vertical slot to a branch so that its child branches are contained entirely within the slot.
To be more specific, a branch B is assigned a vertical slot R, and each child branch of B is assigned a disjoint portion of R, that is, a smaller vertical slot within R. Nodes are positioned on the y-axis according to their function values. To emphasize the hierarchy, a branch is rendered in the middle of its slot, and its child branches are spread out to its left and right sides. The longest branch is drawn as a vertical line segment in the middle of the display. All the other branches have L shapes that connect extrema to their paired saddles to prevent them from crossing the slots of their siblings. An example is shown in Figure 4d and e. The rectangular display of the same tree reduces the number of crossings from three to one, since the layout design rules out the case where a child branch intersects with its parent or siblings. Some self-intersections are unavoidable due to the strangulation cases where a downward branch appears as a child of an upward branch, or vice versa. A space-saving solution is given in Figure 4c, which takes less horizontal space than the tree layout in Figure 4b. For a given parent branch, we separate it into three parts vertically, from high to low: the upward branch zone, where all the child branches are upward; the mixed branch zone, where child branches are either upward or downward; and the downward branch zone, where all the child branches are downward. In the mixed branch zone, the upward branches take one side while the downward branches take the other. In the upward or downward branch zone, the child branches stretch outward from the parent branch on both the left and right sides without overlapping each other.
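The slot-assignment idea can be sketched as a short recursion. The code below, under simplified assumptions about how a slot is divided among children (it ignores the upward/mixed/downward zoning), assigns each branch a horizontal interval and draws it in the middle of that interval so sibling branches can never cross; it illustrates the layout rule rather than the chapter's implementation.

```python
# Sketch: recursive vertical-slot assignment for a contour tree layout.
# Branch names are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Branch:
    name: str
    children: List["Branch"] = field(default_factory=list)

def assign_slots(branch, left, right, slots=None):
    """Return {branch name: (slot_left, slot_right, x_position)}."""
    if slots is None:
        slots = {}
    x = (left + right) / 2.0                   # branch drawn in the middle of its slot
    slots[branch.name] = (left, right, x)
    left_kids = branch.children[0::2]          # alternate children to each side
    right_kids = branch.children[1::2]
    for kids, lo, hi in ((left_kids, left, x), (right_kids, x, right)):
        if kids:
            width = (hi - lo) / len(kids)
            for i, child in enumerate(kids):   # disjoint sub-slots inside the parent slot
                assign_slots(child, lo + i * width, lo + (i + 1) * width, slots)
    return slots

root = Branch("root", [Branch("a"), Branch("b", [Branch("c")])])
for name, (lo, hi, x) in assign_slots(root, 0.0, 1.0).items():
    print(f"{name}: slot=({lo:.2f}, {hi:.2f}), x={x:.2f}")
```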
3.1.4. Tree View Interaction Design
A typical tree view graph displays a hierarchy of items by indenting child items beneath their parents. In tree view representations, the
interactions are directly embedded: the user can collapse (or expand) a particular sub-tree to hide (or show) the items in it.
An example of data exploration through tree view interaction is given in Figure 5. The data is a vorticity magnitude field of a simulated flow with
vortices. A click on an inner node of the original contour tree (Figure 5a right) results in hiding or showing the sub-tree rooted at the node. The
persistence, indicated by the vertical length of a branch, serves as an importance indicator for users to select contours. An interactively simplified contour tree is shown in Figure 5b. The roots of the collapsed sub-trees are marked with plus-mark icons. For each branch of a contour tree, a single contour of the branch is extracted. A contour in the scalar field is represented by a point in the tree (indicated by a horizontal line segment in the same color as the contour). The selected contours are representative contours that give an overview of the whole scalar field. The more a contour tree is simplified, the higher the level of the overview obtained.
3.2. Contour Tree-Based Uncertainty Visualization
This section discusses uncertainty metrics for the data-level uncertainty and contour-level uncertainty and their visualization representations.
The uncertainty or variability information is attached to the new planar contour tree display to give a high-level overview of uncertainty and to
allow a quick and accurate selection of contours with different levels of uncertainty.
3.2.1. Data-Level Uncertainty
The data-level uncertainty measures how uncertain the numerical values of the data are. Uncertainty measures, such as standard deviation, inter-
quartile range, and the confidence intervals, fall into this category. As discussed in section 2.2, techniques which overlay or embed uncertainty
representation in the underlying data visualization face perception and interaction issues. We therefore propose an alternative visualization that
attaches uncertainty glyphs to a contour tree rather than integrating them with data visualization directly.
We adapted graduated glyphs (Sanyal et al., 2010) to visualize data-level uncertainty. A circular glyph encodes the deviation of all ensemble
members from the ensemble mean at a data point. A graduated ribbon is constructed by interpolating between circular glyphs placed along an
iso-line in the data image. A glyph that has a dense core with a faint periphery indicates that ensemble members have a few outliers and mostly
agree. A mostly dark glyph indicates that large differences exist among individual members. The size of a glyph indicates the variability of a
location with respect to other locations on the grid. For more details on graduated glyphs, we refer the reader to (Sanyal et al., 2010).
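To show the numbers behind a single graduated circular glyph, the sketch below sorts the absolute deviations of the ensemble members from the ensemble mean at one grid point and maps them to ring radii with decreasing opacity. The member values, glyph radius, and opacity ramp are hypothetical; for the exact construction see Sanyal et al. (2010).

```python
# Rough sketch of one graduated circular glyph: sorted member deviations
# from the ensemble mean become radii of progressively fainter rings.
import numpy as np

values_at_point = np.array([3.1, 3.0, 3.3, 2.9, 3.2, 4.5, 3.0, 3.1])  # 8 members (hypothetical)
mean = values_at_point.mean()
deviations = np.sort(np.abs(values_at_point - mean))   # inner (dense) to outer (faint)

max_radius = 10.0                                       # pixels allotted to the glyph
radii = max_radius * deviations / deviations.max()
opacities = np.linspace(1.0, 0.2, len(radii))           # outer rings drawn fainter

for r, a in zip(radii, opacities):
    print(f"ring radius {r:5.2f} px, opacity {a:.2f}")
```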
We use graduated glyphs to show uncertainty at each data point. Figure 6 shows a 2D 9×9 uncertain scalar dataset which is a down-sampled sub-
region of the data in Figure 5. In Figure 6a, the mean field of eight members is color-mapped and overlaid with uncertainty glyphs at each data
point. Likewise, as shown in Figure 6b, the glyph of a data point is attached to its corresponding location in the contour tree. Figure 6c illustrates
a segment of a graduated ribbon along a branch. As shown in Figure 6b and d, while a graduated circle illustrates the data-level uncertainty at a
data point, the graduated ribbon provides continuous uncertainty representation along individual contours with less clutter. Therefore, we prefer
ribbon-like graduated glyphs over circular graduated glyphs for representing data-level uncertainty along contours.
Figure 6. Data-level uncertainty representation based on the contour tree. (a) Graduated circular uncertainty glyphs on an uncertain scalar field. (b) A contour tree with circular graduated uncertainty glyphs. (c) A segment of a graduated ribbon on a branch (black) constructed by overlaying thinner ribbons successively. (d) A contour tree attached with the average data-level uncertainty along each contour; four sets of corresponding contours are shown after clicking on four locations (indicated by colored line segments) in the contour tree.
3.2.2. Contour-Level Uncertainty
In this chapter, contour-level uncertainty measures the variation in the position of a contour. In a spaghetti plot, the most unstable contours and
the places where the contours are extremely diverse among individual ensemble members are interesting to users (Sanyal et al., 2010). However,
the users’ estimates tend to be inaccurate due to the randomness of the contour size, shape and length. Additionally, it is hard to look into such
uncertainty in a large data set. Precise and automatic contour variability measurement is needed to assist the exploration of large uncertain data. To address this need, we introduce the concept of contour variability, which quantifies how diverse a contour is across multiple ensemble members.
To measure the variability of a contour, we first identify corresponding contours in different ensemble members. Then, we calculate the
differences between them and use the mean and variance of the differences to represent the positional variation among contours.
3.2.2.1. Contour Correspondence and Difference
For a contour in one ensemble member, there may be more than one contour in another ensemble member with the same iso-value as it, or it may be missing in some ensemble members. Spatial overlap is frequently used as a similarity measure to match features in different datasets under the assumption that these features have only small spatial deviations (Schneider, Wiebel, Carr, Hlawitschka, & Scheuermann, 2008; Sohn
& Chandrajit, 2006). For instance, the correspondence between two contours can be measured by the degree of contour overlap as discussed by
Sohn and Bajaj (Sohn & Chandrajit, 2006) when they computed correspondence information of contours in time-varying scalar fields.
The non-overlapping area between the two corresponding contours C and C_i, where C_i denotes the contour matched with C in ensemble member i, determines the difference between the two contours. Given a contour C in the ensemble mean, we search in ensemble member i for the contour that shares the same iso-value with C and has the best correspondence with C. The best matched contours, if found, are considered to be the same contour appearing in different ensemble members. Figure 7 gives an example. In Figure 7a, there are three contours (in blue, gray, and brown) in ensemble member i with different correspondence degrees with contour C (in red). Having the largest overlap degree with C, the blue contour is identified as the corresponding contour of C in ensemble member i. Figures 7b, c, and d illustrate the non-overlapping area in different cases. To reduce bias towards long or short contours (larger or smaller iso-surfaces), we normalize the non-overlapping area with the contour length (or iso-surface area in the 3D case) of C:
d_i(C) = nonoverlap(C, C_i) / size(C),
where nonoverlap(C, C_i) is the non-overlapping area between C and C_i, and size(C) is the contour length (or iso-surface area in the 3D case) of C.
We calculate the mean and variance of the differences to the mean to help a user select contours according to the quantified contour variability information. Given a contour C in the ensemble mean, let its corresponding contours in the k individual ensemble members be C_1, ..., C_k, where C_i is the contour with the same iso-value that is matched with C. The average difference among the corresponding contours is
mu(C) = (1/k) * sum_{i=1..k} d_i(C).
The variance of the differences among contours is
sigma^2(C) = (1/k) * sum_{i=1..k} (d_i(C) - mu(C))^2.
If an ensemble member does not have a matched contour for C, its contribution d_i(C) is set to a large value: the maximum normalized non-overlapping area found between C and all the matched contours.
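A minimal Python sketch of the correspondence and variability computation is given below, assuming each contour is represented by the boolean mask of the region it encloses; the helper names (contour_difference, best_match, contour_variability) and the region-mask representation are illustrative assumptions of the sketch, not the chapter's actual code.

import numpy as np

def contour_difference(region_mean, region_member):
    """Normalized non-overlapping area between two corresponding contours.

    Both arguments are boolean masks of the regions enclosed by the contours.
    The symmetric-difference area is divided by the area of the mean contour's
    region, which stands in here for size(C) (an assumption of this sketch).
    """
    non_overlap = np.logical_xor(region_mean, region_member).sum()
    return non_overlap / max(region_mean.sum(), 1)

def best_match(region_mean, candidate_regions):
    """Index of the candidate contour with the largest overlap, or None."""
    overlaps = [np.logical_and(region_mean, r).sum() for r in candidate_regions]
    if not overlaps or max(overlaps) == 0:
        return None
    return int(np.argmax(overlaps))

def contour_variability(region_mean, members):
    """Mean and variance of differences between a mean-field contour and its
    matched contours; members is a list (one per ensemble member) of candidate
    contour-region masks sharing the same iso-value."""
    diffs, unmatched = [], 0
    for candidates in members:
        j = best_match(region_mean, candidates)
        if j is None:
            unmatched += 1                       # penalized after matching
        else:
            diffs.append(contour_difference(region_mean, candidates[j]))
    penalty = max(diffs) if diffs else 1.0       # largest difference found (fallback if none)
    diffs.extend([penalty] * unmatched)
    d = np.array(diffs)
    return d.mean(), d.var()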
The contour variability along a contour can be shown at the corresponding location in the contour tree. Figure 8 illustrates the glyph design used to encode contour variability. Before visualization, both variability statistics are normalized to a range between 0 and a unit width. As discussed in Section 3.2.1, ribbon-like glyphs are preferred over circular glyphs to prevent visual clutter, so we resample the varying contour variability along each branch and use linear interpolation to produce a ribbon-like glyph along each branch. Two ribbons are attached to each branch: the blue one encodes the mean difference, while the green one encodes the variance. The varying width of each ribbon indicates the varying magnitude of each variability measurement for the contours along the branch.
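The ribbon profile along a branch can be sketched as follows, assuming the variability values sampled at the contours along the branch are given; the resampling density and the normalization against a tree-wide maximum are assumptions of this sketch consistent with the unit-width normalization described above.

import numpy as np

def ribbon_widths(values, n_samples=32, unit_width=1.0, global_max=None):
    """Resample a variability measure along a branch into a smooth ribbon profile.

    values:     variability measured at the contours sampled along the branch,
                ordered from the branch's lower node to its upper node.
    global_max: maximum of the statistic over the whole contour tree, so that
                ribbon widths are comparable across branches and map to
                [0, unit_width].
    """
    values = np.asarray(values, dtype=float)
    vmax = global_max if global_max is not None else max(values.max(), 1e-12)
    t_old = np.linspace(0.0, 1.0, len(values))
    t_new = np.linspace(0.0, 1.0, n_samples)
    return unit_width * np.interp(t_new, t_old, values) / vmax

# Two ribbons per branch: blue for the mean difference, green for the variance
mean_ribbon = ribbon_widths([0.10, 0.40, 0.35, 0.20], global_max=0.5)
var_ribbon = ribbon_widths([0.02, 0.10, 0.08, 0.03], global_max=0.5)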
Figure 8. Visualization of contour variability based on contour
tree (a) A circular glyph for contour variability is produced by
combining two circular glyphs for mean and variance of
differences between corresponding contours in different
ensemble members and the ensemble mean (b) The ribbons
attached to the contour tree (left) indicate the variability of
corresponding contours in the data (right) Three sets of
corresponding contours are shown after clicking on three
locations (indicated by arrows) in the contour tree
3.2.3. Topology-Level Uncertainty
The uncertainty within the data impacts not only the values and the contour positions locally, but also the global pattern of the data, which is described by the topology of the data. Visualizing the uncertainty of the topology among the ensemble members provides a new perspective on the global impact of uncertainty. In this chapter, topology-level uncertainty is defined as the variation in the height and number of branches in the contour tree of an uncertain scalar field.
The idea is to map the branches between the contour tree of each ensemble member and the contour tree of the ensemble mean and to overlay the mapped branches. A set of matched branches is assigned the same x-axis value but keeps the original y-axis values, so that overlap of the branches on the x-axis indicates their correspondence, while disagreement between the branches on the y-axis indicates their discrepancy in iso-value. A branch of the mean contour tree may have no matched branch in some ensemble members. The number of matched branches is encoded as the width of the branch: a thicker branch is more certain than a thinner one.
3.2.3.1. Branch Correspondence
We measure the correspondence degree between two branches as the spatial overlap between their contour regions. A contour region of a branch
is defined as the region covered by all the contours within the sub-tree of the branch.
Given a branch B in the mean contour tree, we search in ensemble member i for its best matched branch. The best matched branches, if found, are considered to be the same branch appearing in different ensemble members. We do not need to search all the branches of the contour tree in ensemble member i for the branch with the largest overlap with B; the contour tree hierarchy and branch orientation help us limit the number of branches to compare. (1) The matched branch must be found among the child branches of the matched branch of B's parent, due to the nesting relationship between child and parent branches. (2) A downward branch does not match an upward branch, and vice versa. A sketch of this restricted search is given below.
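The following Python sketch expresses the restricted branch search compactly; it assumes each branch object carries a boolean contour-region mask (.region), an orientation flag (.upward), and a list of child branches (.children). These attribute names are illustrative assumptions, not part of the chapter's data structures.

import numpy as np

def match_branch(branch_mean, parent_match):
    """Find the best matched branch in one ensemble member for a mean-tree branch.

    branch_mean:  branch of the mean contour tree.
    parent_match: the member-tree branch already matched to branch_mean's parent.
    Returns the matched member-tree branch, or None if no candidate overlaps.
    """
    # (1) Search only among the children of the parent's match (nesting property).
    # (2) Never match an upward branch to a downward branch, or vice versa.
    candidates = [b for b in parent_match.children if b.upward == branch_mean.upward]
    best, best_overlap = None, 0
    for b in candidates:
        overlap = np.logical_and(branch_mean.region, b.region).sum()
        if overlap > best_overlap:
            best, best_overlap = b, overlap
    return best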
Figure 9 illustrates the branch correspondences detected according to the spatial overlaps of contour regions. The contours and contour trees of the two scalar fields in Figures 9a and b share the same spatial domain. In Figure 9a, the sub-tree of the middle branch (1, 10) is the whole contour tree. The sub-tree of branch (0.5, 10.5) is the whole contour tree in Figure 9b. Both branches cover the whole data domain and hence are the best matched branches of each other. For the remaining branches, the correspondence is indicated by the overlaps between their contour regions: branch (2, 3) → branch (1.8, 3.3), branch (6, 9) → branch (6.5, 9.6), branch (4, 8) → branch (4.2, 8.3), and branch (5, 7) → branch (5.4, 7.1). For example, branch (4, 8) matches branch (4.2, 8.3) since they have the largest overlap. Branch (5.4, 7.1) matches branch (5, 7) even though branch (4, 8) has a larger overlap with it, because branch (4, 8) is already mapped to its parent branch (4.2, 8.3). Figure 9c shows the matched branches at the same x-axis locations.
Figure 9. Branch correspondence indicated by contour region
overlaps The nodes and contours are numbered by their iso-
values. (a) and (b) show contour trees of two data fields and
the corresponding contour regions of the tree branches (c) The
matched branches are indicated by the shared x-axis locations
3.2.3.2. Topology-Level Uncertainty Visualization via Contour Tree Mapping
We map the contour tree of each ensemble member to the mean contour tree hierarchically. In the resulting visualization, the contour trees of
different ensemble members are rendered beneath the mean contour tree in different colors. The x-axis locations of the matched branches are
aligned. The number of matched branches for a branch in the mean contour tree is encoded as the width of the branch.
As shown in Figure 10, the mapped contour trees indicate how uncertain the iso-value ranges of individual branches are and how uncertain the
number of branches is. The places where branches of the contour trees disagree with each other indicate the uncertain iso-value ranges of
different contour regions segmented by branches. The overall blurring or clearness of the display indicates the overall high or low uncertainty.
The thickness of a branch indicates the number of matched branches found in the ensemble members. For example, a thin branch (indicated by the green arrow) is found in the lower right corner of the tree. It represents an uncertain contour region (or minimum) which appears in only a few ensemble members. As shown in Figure 10, only two contours (in blue and green), from ensemble members 4 and 7, are shown after clicking on the branch. In contrast, the two sets of contours selected from the middle locations of two thick branches (indicated by the red and blue arrows) include corresponding contours from all the ensemble members. Accordingly, the topology-level uncertainty provides users with a quick, high-level overview of how much uncertainty there is in the topology of contours.
3.3. User Interface Design
The contour tree provides an intuitive interface for exploring uncertainty. Figure 11 shows the interface designed to enable efficient browsing,
manipulation, and quantitative analysis of uncertain scalar data fields. The top-left area shows a 2D or 3D visualization of a data set, including
iso-lines and color-mapped image for 2D data, or iso-surfaces and volume rendering for 3D data. It provides interactions such as rotation and
zooming. The top-right area shows a 2D display of the contour tree of the mean data field. It allows interactive contour tree simplification and
contour selection. The three bottom areas show the data-level uncertainty, contour variability, and topology variability of the data based on
contour trees. They allow interactions similar to those in the top-right contour tree display area. Once a data set is imported with the pre-computed
uncertainty and variability information, the tool shows the color-mapped data (volume rendered for 3D data) in the data display area. The
contours (iso-lines or iso-surfaces) are also rendered in the data display area and updated accordingly as a user clicks on one contour tree or
changes the iso-value by sliding on the vertical bar in the top-right contour tree display area.
Figure 11. The user interface Top-left: data display area shows
iso-surfaces of the mean field of five brains chosen from the
simplified mean contour tree on the top right Top-right:
contour tree display area shows a simplified contour tree
Bottom: display areas for data-level uncertainty, contour-level
uncertainty, and topology-level uncertainty shown on
simplified contour trees
Users may simplify the contour tree in two ways. (1) Use a horizontal bar under the contour tree display to control the current number of
branches in the contour tree. As a user drags the slider from right to left, branches are removed from the contour tree according to the order
stored in the pre-computed simplification sequence. (2) Directly right-click on the inner nodes of the contour tree to prune or extend sub-trees.
Users may select contours to display in two ways. (1) Display multiple contours of the mean field with selected iso-values by clicking on the
contour tree in the top-right area. A selected contour is shown in the same distinct color as a point placed at the clicked location. (2) Display a
set of corresponding contours (spaghetti plots) with the same iso-value in different ensemble members in the data display area by double-clicking
on a location in one of the three contour trees in the bottom. Contours from different ensemble members are shown in different colors.
3.4. Application and Discussion
In this section, we apply the new uncertainty visualization to a simulated weather data set (135×135×30) and a medical volumetric data set
(128×128×71).
The first experiment demonstrates an effective application of the new uncertainty visualization on the simulated data from Weather Research
and Forecasting (WRF) model runs. The members of numerical weather prediction ensembles are individual simulations with either slightly
perturbed initial conditions or different model parameterizations. Scientists use the average ensemble output as a forecast and utilize spaghetti
plots to analyze the spread of the ensemble members. In our application, water-vapor mixing ratio data from eight simulation runs of the 1993
Superstorm are used. Figure 12a shows the original contour tree (1734 critical points and 867 branches) and all the contours with the iso-value indicated by the red horizontal line connecting the slider of the vertical bar. Uncertainty information is shown in the simplified contour tree in Figure 12c. A few thin branches appear in its bottom region, indicating uncertain contour regions (or minima) in the data. Figure 12b shows the
volume rendering of the average data with circular uncertainty glyphs whose sizes vary according to the level of data-level uncertainty. Figure 12d
shows a set of corresponding contours with high data-level uncertainty while Figure 12e shows a set of corresponding contours with high contour
variability. Circular graduated glyphs indicating the magnitude of the data-level uncertainty and contour variability are shown at the bottom. The
color bar on the right indicates the correspondence between the iso-surfaces and simulation runs.
Figure 12. Exploration of weather ensemble simulation with
contour tree based visualization (a) Original contour tree (b)
Volume rendering with uncertainty glyphs (c) Left to right:
data-level uncertainty, contour variability, and topology
variability shown in the simplified contour tree (d) A set of
contours (corresponding to the red points in the contour trees)
with high data-level uncertainty indicated by the large
graduated circle (e) A set of contours (corresponding to the
yellow points in the contour trees) with high contour
variability indicated by the large glyph for contour variability
The second application is to visualize non-weighted diffusion images. We apply our method to study the variation between the brain images from
five subjects after affine registration to one particular data set chosen at random using FLIRT (Jenkinson & Smith, 2001). Contour trees have previously been used to explore brain data (Carr & Snoeyink, 2003). The between-subject variation in brain anatomy is
closely related to the uncertainty study of brain imaging (Eickhoff et al., 2009). The mean field of the brains contains 2143 critical points and
1071 branches in the original contour tree. Figure 13a shows the integrated visualization of both data and uncertainty glyphs. Variability
information is shown in the simplified contour tree in Figure 13b. Brain outer surfaces and ventricle surfaces are selected from two points
(indicated in red and yellow) in the contour trees. As shown in Figure 13a, the inner part of the brain exhibits higher data values along with
higher data-level uncertainty. This is reflected in the generally thicker graduated ribbons observed in the upper region of the left tree in Figure
13b. Meanwhile, contours corresponding to the upper region of the trees have higher contour variability and higher topology variability,
indicating the impact of the data-level uncertainty on the contours and contour trees. The inner features such as brain ventricles exhibit higher
variability than the brain outer surfaces based on the quantified uncertainty and variability information shown on the corresponding location in
the contour trees. The interactive visualization supports the exploration of variability information that is hidden in the conventional view for
selected anatomical structures.
4. A FEATURE-BASED FRAMEWORK FOR VISUALIZING VECTOR FIELD UNCERTAINTY
In this section, we present a framework to visualize feature-level uncertainties in vector fields. In many cases, locations of features matter more
than the data-level uncertainty. For instance, the locations of warm eddies are important in ocean fishery, and the locations of hurricane eyes or
the peaks in a pressure field are important in weather analysis. Two features in different data sets are considered the same feature if they share the same tracking path (Tricoche, 2002) or have the greatest similarity (Sohn & Bajaj, 2006). We evaluate the uncertainty related to features by measuring the deviation of those feature pairs across the different data sets. The features we currently study are vector field critical points. They are
intuitive features closely related to physical features (Garth & Tricoche, 2005). Scalar fields can be analyzed through their gradient fields.
This section is organized as follows. Section 4.1 gives the method framework. Sections 4.2, 4.3, and 4.4 discuss feature extraction, uncertainty
measurement, and uncertainty glyph design respectively. Section 4.5 demonstrates and discusses application results.
4.1. Method Pipeline
The impact of uncertainty on the features is quantified as feature-level uncertainty, which is measured by feature deviation (Figure 14). The feature deviation is obtained through a three-step procedure: feature identification, feature mapping, and uncertainty representation. Given a set of data members (e.g., multiple simulation runs), the method first identifies the features within all the data members and within the mean field given by averaging all the members. Second, feature tracking is applied to map the features of each data member to those of the mean field. The mapped features are then assumed to be the same feature, appearing with slight positional deviations in the individual data members. Finally, the feature-level uncertainties are expressed as the deviations of the features.
Figure 14. The pipeline for feature-level uncertainty
visualization
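A schematic of this pipeline in Python is shown below; extract_features and map_feature stand in for the critical point extractor and the FFF-based feature tracker discussed in Sections 4.2 and 4.3, and the handling of unmatched features follows the penalty rule described later in Section 4.3. All function names are assumptions of the sketch, not part of the original implementation.

import numpy as np

def feature_uncertainty_pipeline(members, extract_features, map_feature):
    """Three-step pipeline sketch: feature identification, feature mapping,
    and uncertainty representation.

    members:          list of k vector fields, each an array of shape (ny, nx, 2).
    extract_features: callable returning the critical points of a field.
    map_feature:      callable mapping a mean-field critical point to its
                      counterpart position in one member (e.g. FFF tracing),
                      or None if no counterpart is reached.
    Returns, per mean-field feature, the deviations to its counterparts.
    """
    mean_field = np.mean(np.stack(members), axis=0)       # ensemble mean
    mean_feats = extract_features(mean_field)             # step 1: identification
    deviations = {}
    for f in mean_feats:
        dists = []
        for member in members:                            # step 2: mapping
            counterpart = map_feature(f, mean_field, member)
            if counterpart is not None:
                dists.append(np.linalg.norm(np.subtract(counterpart, f)))
            else:
                dists.append(None)
        # step 3: unmatched members are penalized with the maximum distance found
        found = [d for d in dists if d is not None]
        penalty = max(found) if found else 0.0
        deviations[tuple(f)] = [d if d is not None else penalty for d in dists]
    return deviations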
4.2. Feature Extraction
Vector field topology consists of critical points, periodic orbits, and separatrices. It characterizes a flow in the sense that the relatively uniform flow behaviour in each topological region can be deduced from its boundary. The computation of critical points in a vector field can be found in Helman and Hesselink (1989). For a scalar field, its gradient field can be used to extract critical points, so that the features of scalar fields and vector fields are analyzed in the same way: a vector field can be constructed from a scalar field s using the gradient operator, v = ∇s. The maxima of s appear as sinks and its minima appear as sources in the gradient field ∇s. Figure 15 illustrates an example of the critical points extracted from the gradient field of a temperature field.
Figure 15. Topology extracted from the gradient field of a
temperature field (a) Color-mapped temperature (b) Gradient
field represented with arrows (c) Gradient field with extracted
vector topology
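As a rough illustration, the following Python sketch classifies grid points of a 2D scalar field by the behaviour of its gradient field: local maxima of s become sinks, local minima become sources, and saddle candidates are detected from the sign of the Hessian determinant. A production extractor would instead interpolate within grid cells, as in Helman and Hesselink (1989); the grid-based tests and the threshold used here are arbitrary assumptions of the sketch.

import numpy as np

def gradient_field_critical_points(s, grad_tol=1e-3):
    """Classify interior grid points of a 2D scalar field s via its gradient field.

    Returns lists of (row, col) indices: sources (minima of s),
    sinks (maxima of s), and saddle candidates.
    """
    sy, sx = np.gradient(s)                    # gradient field of s
    sources, sinks, saddles = [], [], []
    for i in range(1, s.shape[0] - 1):
        for j in range(1, s.shape[1] - 1):
            window = s[i - 1:i + 2, j - 1:j + 2]
            centre = s[i, j]
            if centre >= window.max():         # local maximum -> sink of grad(s)
                sinks.append((i, j))
            elif centre <= window.min():       # local minimum -> source of grad(s)
                sources.append((i, j))
            else:
                # saddle test: small gradient and negative Hessian determinant
                sxx = s[i, j + 1] - 2 * centre + s[i, j - 1]
                syy = s[i + 1, j] - 2 * centre + s[i - 1, j]
                sxy = (s[i + 1, j + 1] - s[i + 1, j - 1]
                       - s[i - 1, j + 1] + s[i - 1, j - 1]) / 4.0
                if np.hypot(sx[i, j], sy[i, j]) < grad_tol and sxx * syy - sxy ** 2 < 0:
                    saddles.append((i, j))
    return sources, sinks, saddles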
4.3. Feature-Level Uncertainty Metric Based on Feature Mapping
Feature Flow Field (FFF) (Theisel & Seidel, 2003) is adopted to couple critical points of different fields by tracing streamlines within it. The
concept of FFF has been successfully applied to tracking critical points (Theisel, Weinkauf, Hege, & Seidel, 2005), extracting Galilean invariant
vortex core lines (Sahner, Weinkauf, & Hege, 2005), simplification (Theisel, Rössl, & Seidel, 2003a), and comparison (Theisel, Rössl, & Seidel,
2003b). In this chapter, it is used to identify the same feature that appears in different data members at different positions. The uncertainty of
this feature is then expressed as the deviation between all of its counterparts in different data members.
Given k data members V_1, ..., V_k, a mean field V_mean is first computed as their average. V_mean is then paired with each data member V_i, and feature mapping is implemented for each data pair (V_mean, V_i). With the FFF method, features of different vector fields can be correlated.
Figure 16 demonstrates how to measure the feature-level uncertainty related to a feature. For a data member V_i and the mean field V_mean, we trace a streamline in the FFF from a critical point c in V_mean until it reaches a critical point c_i in V_i. After tracing critical points between all the pairs, the feature-level uncertainty of c is measured by the distances between c and its counterparts c_1, ..., c_k. Figure 16a illustrates the feature mapping between one pair of data fields. Figure 16b shows a straightforward representation of the uncertainties related to individual features by arrows. Given a data member V_i, it is possible that the streamline starting from c reaches the boundary of the FFF or terminates without reaching a critical point in V_i. In these cases, we assume that the mapped critical point for c in this data member lies outside the domain, and we therefore set its distance from c to a large value: the maximum distance found between c and all the mapped critical points.
Figure 16. Feature-level uncertainty measurement (a) Feature
deviations detected by tracing critical points within FFF
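For a pair consisting of one data member and the mean field, the FFF can be evaluated on a grid as sketched below, following our reading of Theisel and Seidel's (2003) construction for the linear interpolation between the two fields; the discretization (sampling in t, finite-difference spatial derivatives) is an assumption of the sketch rather than part of the original method.

import numpy as np

def det2(a, b):
    """2x2 determinant of two stacked 2D vector fields (last axis = components)."""
    return a[..., 0] * b[..., 1] - a[..., 1] * b[..., 0]

def feature_flow_field(v_member, v_mean, nt=16):
    """Evaluate the FFF for v(x, y, t) = (1 - t) * v_mean + t * v_member.

    v_member, v_mean: arrays of shape (ny, nx, 2).
    Returns an array of shape (nt, ny, nx, 3) whose streamlines connect
    critical points of v_mean (t = 0) with those of v_member (t = 1):
    f = (det(v_y, v_t), det(v_t, v_x), det(v_x, v_y)).
    """
    ny, nx, _ = v_mean.shape
    fff = np.zeros((nt, ny, nx, 3))
    v_t = v_member - v_mean                       # dv/dt of the linear blend
    for k, t in enumerate(np.linspace(0.0, 1.0, nt)):
        v = (1.0 - t) * v_mean + t * v_member
        v_y, v_x = np.gradient(v, axis=(0, 1))    # finite-difference spatial derivatives
        fff[k, ..., 0] = det2(v_y, v_t)           # x component
        fff[k, ..., 1] = det2(v_t, v_x)           # y component
        fff[k, ..., 2] = det2(v_x, v_y)           # t component
    return fff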
4.4. Uncertainty Representation
Glyph design addresses the central problem of how uncertainty information is processed into knowledge. For the detected deviations of a critical
point, a quantitative glyph is designed to indicate uncertainty level related to the critical point. The new uncertainty glyph is inspired by both
graduated circular glyph (Sanyal et al., 2010) and elliptical glyph (Walsum, Post, Silver, & Post, 1996).
4.4.1. Elliptical Glyph
A glyph that can be used at different levels is the elliptical glyph (Walsum et al., 1996). It depicts the covariance between multiple real-valued random variables X_1, ..., X_n. In probability theory and statistics, covariance measures how much two variables change together. The covariance matrix Σ generalizes the notion of variance to multiple dimensions:
Σ_ij = E[(X_i - E[X_i]) (X_j - E[X_j])],
where E[X_i] is the expected value of X_i. The ellipse axis lengths are given by the square roots of the eigenvalues of Σ, and the ellipse axis directions are given by the corresponding eigenvectors. An elliptical glyph can be applied to visualize tensors, but it can also show a simplified representation of the spatial distribution of a set of 2D or 3D data points (Sadarjoen & Post, 2000).
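A minimal computation of the ellipse parameters from a set of 2D feature positions might look as follows; the function name and the n_std scaling factor are assumptions of this sketch.

import numpy as np

def ellipse_from_points(points, n_std=1.0):
    """Compute the axes of an elliptical glyph summarizing a 2D point set.

    points: array of shape (k, 2), e.g. the positions of a feature in k
    ensemble members relative to the mean-field feature.
    Returns (axis_lengths, axis_directions): half-axis lengths are n_std times
    the square roots of the covariance eigenvalues; directions are the
    corresponding unit eigenvectors (major axis first).
    """
    pts = np.asarray(points, dtype=float)
    cov = np.cov(pts, rowvar=False)              # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]
    axis_lengths = n_std * np.sqrt(np.clip(eigvals[order], 0.0, None))
    axis_directions = eigvecs[:, order].T
    return axis_lengths, axis_directions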
4.4.2. Graduated Elliptical Glyph
Before settling on the graduated ellipse, we considered using arrows to indicate the uncertainty by showing the directions and distances of individual deviations. However, arrows, though showing the deviations faithfully, can cause severe information overload as the number of ensemble runs increases. In contrast, the graduated ellipse combines the elliptical glyph's ability to depict the overall deviation of a feature with the graduated glyph's intuitive depiction of the individual deviations of ensemble members. When placed across an image, the overall sizes and orientations of the glyphs indicate the variability of features, while the shape of, and color distribution within, an individual glyph give a quick statistical summary of the uncertain deviations of a feature.
Let c be a critical point in the mean data field, and let its counterparts in the k data members be c_1, ..., c_k. A graduated elliptical glyph consists of k nested ellipses and is placed at the location of c. The nested ellipses share the same orientation and axis ratio. The rendering of each nested ellipse is as follows. First, assign an ellipse E with axes A and B computed according to the relative locations of c_1, ..., c_k with respect to c: let variable X be the x-offsets of the counterparts from c and Y be their y-offsets; the lengths and directions of A and B are derived from the eigenvalues and eigenvectors of the covariance matrix of X and Y, respectively. Second, nested ellipses are produced by fitting them into ellipse E: sort c_1, ..., c_k according to their distance from c in descending order, let the biggest distance among them be D, and give each nested ellipse axis lengths scaled by its distance divided by D (relative to A and B). Finally, in a way similar to producing a graduated circular glyph (Sanyal et al., 2010), assign an increasing saturation level to each successive (smaller) nested ellipse and overlay the ellipses so that a smaller one is drawn over a larger one.
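Putting these steps together, a hedged matplotlib sketch of the graduated elliptical glyph is given below; the exact scaling of the nested ellipses and the linear opacity ramp are assumptions consistent with, but not necessarily identical to, the description above.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def graduated_ellipse(ax, center, counterparts, color=(0.8, 0.2, 0.2)):
    """Draw a graduated elliptical glyph at a mean-field critical point.

    center:       position of the critical point in the mean field.
    counterparts: positions of the matched critical points in the k members.
    The outer ellipse E comes from the covariance of the offsets; nested
    ellipses are scaled by the sorted offset distances and drawn with
    increasing opacity so that inner ellipses appear denser.
    """
    offsets = np.asarray(counterparts, dtype=float) - np.asarray(center, dtype=float)
    cov = np.cov(offsets, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    width, height = 2.0 * np.sqrt(np.clip(eigvals[order], 0.0, None))
    angle = np.degrees(np.arctan2(eigvecs[1, order[0]], eigvecs[0, order[0]]))

    dists = np.sort(np.linalg.norm(offsets, axis=1))[::-1]   # descending distances
    d_max = dists[0] if dists[0] > 0 else 1.0
    k = len(dists)
    for i, d in enumerate(dists):
        scale = d / d_max                                    # fit into the outer ellipse E
        ax.add_patch(Ellipse(center, width * scale, height * scale,
                             angle=angle, color=color,
                             alpha=(i + 1) / k, linewidth=0))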
Figure 17 gives a comparison between using a simple ellipse, arrows, graduated circular glyph, and graduated elliptical glyph for feature-level
uncertainty. The individual features (which are not displayed in a final visualization) are shown as well to better illustrate the design
concept of each glyph. The simple ellipse (Figure 17a) only characterizes the overall deviation of a feature. Arrows (Figure 17b) indicate the exact
locations of all the deviated features while inviting visual clutter as k increases. The graduated circular glyph (Figure 17c) shows the individual deviations of a feature but reveals no direction information. In contrast, the graduated elliptical glyph (Figure 17d) summarizes the overall and individual distribution of a feature in a succinct way. Figure 18 shows a set of graduated ellipses with varying color distributions, sizes, axis length ratios, and orientations.
Figure 17. Feature-level uncertainty glyph design (a) Ellipse
(b) Arrows (c) Graduated circular glyph (d) Graduated ellipse
4.5. Application and Discussion
The method is applied to a 2D vector dataset consisting of five simulated hurricane wind fields (Figure 19). The uncertainty glyphs are placed at the critical point locations. The results demonstrate how much the features are affected by the uncertainty within the data. Through mapping
and comparing the critical points between different ensemble members, the shifts between critical points become perceivable. The graduated
elliptical glyphs effectively indicate the magnitude and overall orientation of the uncertain deviations of vortices. The uncertain position of the
hurricane eye reflects the impact of the uncertainty directly. A side-by-side display of different component data or the visualization of the data-
level uncertainty may not give viewers such insight.
Figure 19. (a) Feature tracking within the FFF of two vector fields (b) Uncertain location of the hurricane eye and vortices in a hurricane wind field (5 ensemble members)
Although the result of this feature-level uncertainty visualization is positive, there are a few limitations and areas that need further study. Most
notably, more features could be considered in the future. Second, other feature-mapping methods may be included depending on the feature type
since the current feature-mapping method, FFF, mainly tracks topological features.
5. CONCLUSION
This chapter has conducted an in-depth investigation of feature-level uncertainties and suggested a promising direction of future uncertainty
studies in exploiting topology tools. The presented feature-based techniques alleviate the inherent perception issues such as clutter and occlusion
in uncertainty visualizations in 3D or large 2D scenes. The incorporation of the feature-level uncertainties into visualization provides insights
into the reliability of the extracted features which otherwise would remain unknown with the visualization of only data-level uncertainty. In
addition, the novel use of contour trees provides an effective solution for interacting with 3D or large 2D data sets with uncertainty.
There are many possible directions in which one could extend this work. (1) Extend the developed feature-based uncertainty visualization
framework to study uncertainty in various fields. (2) Improve the interactive uncertainty visualization based on user feedback. (3) Investigate
more metrics to measure uncertainty or variability and apply our methods to address different types of uncertainty models. (4) The possibilities
inherent in topology tools are not exhausted yet. For example, the possibility of using Morse-Smale complex (Smale, 1961) for uncertainty study
has not been investigated. Therefore, we will continue the current work in utilizing topology tools to visualize uncertainty.
This work was previously published in Innovative Approaches of Data Visualization and Visual Analytics edited by Mao Lin Huang and
Weidong Huang, pages 68-93, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Bajaj, C. L., Pascucci, V., & Schikore, D. R. (1997). The contour spectrum . In Proceedings of Visualization '97 . New Brunswick, NJ: IEEE Press.
Brodlie, K., Osorio, R. A., & Lopes, A. (2012). A review of uncertainty in data visualization. In Dill, J., Earnshaw, R., Kasik, D., Vince, J., & Wong, P. C. (Eds.), Expanding the Frontiers of Visual Analytics and Visualization (pp. 81–109). Berlin: Springer. doi:10.1007/978-1-4471-2804-5_6
Carr, H., Snoeyink, J., & Axen, U. (2003). Computing contour trees in all dimensions. Computational Geometry: Theory & Applications, 24(2), 75–94.
Carr, H., Snoeyink, J., & Panne, M. V. D. (2004). Simplifying flexible isosurfaces using local geometric measures. In Proceedings of Visualization '04. New Brunswick, NJ: IEEE Press. doi:10.1109/VISUAL.2004.96
Cedilnik, A., & Rheingans, P. (2000). Procedural annotation of uncertain information. In Proceedings of Visualization 2000. New Brunswick, NJ: IEEE.
Davis, T. J., & Keller, C. P. (1997). Modeling and visualizing multiple spatial uncertainties. Computers & Geosciences , 23(4), 397–408.
doi:10.1016/S0098-3004(97)00012-5
Diggle, P., Heagerty, P., Liang, K.-Y., & Zeger, S. (2002). Analysis of longitudinal data . New York: Oxford University Press.
Djurcilova, S., Kima, K., Lermusiauxb, P., & Pang, A. (2002). Visualizing scalar volumetric data with uncertainty. Computers & Graphics , 26(2),
239–248. doi:10.1016/S0097-8493(02)00055-9
Eickhoff, S. B., Laird, A. R., Grefkes, C., Wang, L. E., Zilles, K., & Fox, P. T. (2009). Coordinate-based activation likelihood estimation meta-
analysis of neuroimaging data: A random-effects approach based on empirical estimates of spatial uncertainty. Human Brain Mapping, 30(9),
2907–2926. doi:10.1002/hbm.20718
Garth, C., & Tricoche, X. (2005). Topology-and feature-based flow visualization: Methods and applications. In Proceedings of the SIAM
Conference on Geometric Design and Computing. New York: ACM Press.
Grigoryan, G., & Rheingans, P. (2004). Point-based probabilistic surfaces to show surface uncertainty. IEEE Transactions on Visualization and
Computer Graphics , 10(5), 564–573. doi:10.1109/TVCG.2004.30
Heine, C., Schneider, D., Carr, H., & Scheuermann, G. (2011). Drawing Contour trees in the plane. IEEE Transactions on Visualization and
Computer Graphics , 17(11), 1599–1611. doi:10.1109/TVCG.2010.270
Helman, J., & Hesselink, L. (1989). Representation and display of vector field topology in fluid flow data sets. Computer, 22(8), 27–36.
Hengl, T., & Toomanian, N. (2006). Maps are not what they seem: Representing uncertainty in soil-property maps. In Proceedings of 7th
International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edgbaston, UK: World Academic
Press.
Jenkinson, M., & Smith, S. (2001). A global optimisation method for robust affine registration of brain images. Medical Image Analysis , 5(2),
143–156. doi:10.1016/S1361-8415(01)00036-6
Wu, K., Liu, Z., Zhang, S., & Moorhead, R. J. (2010). Topology-aware evenly spaced streamline placement. IEEE Transactions on Visualization and Computer Graphics, 16(5), 791–801. doi:10.1109/TVCG.2009.206
Kosara, R., Miksch, S., Hauser, H., Schrammel, J., Giller, V., & Tscheligi, M. (2002). Useful Properties of Semantic Depth of Field for Better F+C
Visualization. In Proceedings of the Symposium on Data Visualisation. New Brunswick, NJ: IEEE Press.
Otto, M., Germer, T., Hege, H.-C., & Theisel, H. (2010). Uncertain 2D vector field topology. Computer Graphics Forum, 29(2), 347–356. doi:10.1111/j.1467-8659.2009.01604.x
Pang, A., Wittenbrink, C., & Lodha, S. (1997). Approaches to uncertainty visualization. The Visual Computer , 13(8), 370–390.
doi:10.1007/s003710050111
Pascucci, V., Cole-McLaughlin, K., & Scorzelli, G. (2004). Multi-resolution computation and presentation of contour trees. In Proceedings of the IASTED Conference on Visualization, Imaging, and Image Processing. Calgary, AB: ACTA Press.
Pauly, M., Mitra, N. J., & Guibas, L. (2004). Uncertainty and variability in point cloud surface data. In Proceedings of the Eurographics Symposium on Point-Based Graphics. New Brunswick, NJ: IEEE Press.
Pfaffelmoser, T., Reitinger, M., & Westermann, R. (2011). Visualizing the positional and geometrical variability of isosurfaces in uncertain scalar
fields. In Proceedings of Eurographics/IEEE Symposium on Visualization. New Brunswick, NJ: IEEE Press.
Pöthkow, K., & Hege, H.-C. (2011). Positional uncertainty of isocontours: Condition analysis and probabilistic measures. IEEE Transactions on
Visualization and Computer Graphics , 17(10), 1393–1406. doi:10.1109/TVCG.2010.247
Rhodes, P. J., Laramee, R. S., Bergeron, R. D., & Sparr, T. M. (2003). Uncertainty visualization methods in isosurface rendering . In Proceedings
of Eurographics . New Brunswick, NJ: IEEE Press.
Sadarjoen, I. A., & Post, F. H. (2000). Detection, quantification, and tracking of vortices using streamline geometry. Computers &
Graphics , 24(3), 333–341. doi:10.1016/S0097-8493(00)00029-7
Sahner, J., Weinkauf, T., & Hege, H.-C. (2005). Galilean invariant extraction and iconic representation of vortex core lines. In Proceedings of EuroVis. New Brunswick, NJ: IEEE Press.
Sanyal, J., Zhang, S., Bhattacharya, G., Amburn, P., & Moorhead, R. (2009). A user study to compare four uncertainty visualization methods for
1D and 2D datasets. IEEE Transactions on Visualization and Computer Graphics , 15(6), 1209–1218. doi:10.1109/TVCG.2009.114
Sanyal, J., Zhang, S., Dyer, J., Mercer, A., Amburn, P., & Moorhead, R. J. (2010). Noodles: A tool for visualization of numerical weather model
ensemble uncertainty. IEEE Transactions on Visualization and Computer Graphics , 16(6), 1421–1430. doi:10.1109/TVCG.2010.181
Schmidt, G. S., Chen, S.-L., Bryden, A. N., Livingston, M. A., Osborn, B. R., & Rosenblum, L. J. (2004). Multidimensional visual representations
for underwater environmental uncertainty. IEEE Computer Graphics and Applications , 24(5), 56–65. doi:10.1109/MCG.2004.35
Schneider, D., Wiebel, A., Carr, H., Hlawitschka, M., & Scheuermann, G. (2008). Interactive comparison of scalar fields based on largest contours with applications to flow visualization. IEEE Transactions on Visualization and Computer Graphics, 14(6), 1475–1482. doi:10.1109/TVCG.2008.143
Sohn, B.-S., & Bajaj, C. (2006). Time-varying contour topology. IEEE Transactions on Visualization and Computer Graphics, 12(1), 14–25. doi:10.1109/TVCG.2006.16
Takahashi, S., Takeshima, Y., & Fujishiro, I. (2004). Topological volume skeletonization and its application to transfer function design. Graphical
Models , 66(1), 24–49. doi:10.1016/j.gmod.2003.08.002
Theisel, H., Rössl, C., & Seidel, H.-P. (2003a). Combining topological simplification and topology preserving compression for 2D vector fields .
In Proceedings of Pacific Graphics . New Brunswick, NJ: IEEE Press. doi:10.1109/PCCGA.2003.1238287
Theisel, H., Rössl, C., & Seidel, H.-P. (2003b). Using feature flow fields for topological comparison of vector fields . In Proceedings of Vision,
Modeling, and Visualization . New Brunswick, NJ: IEEE Press.
Theisel, H., & Seidel, H.-P. (2003). Feature flow fields. In Proceedings of Data Visualization. New Brunswick, NJ: IEEE Press.
Theisel, H., Weinkauf, T., Hege, H.-C., & Seidel, H.-P. (2005). Topological methods for 2D time-dependent vector fields based on stream lines
and path lines. IEEE Transactions on Visualization and Computer Graphics , 11(4), 383–394. doi:10.1109/TVCG.2005.68
Tierny, J., Gyulassy, A., Simon, E., & Pascucci, V. (2009). Loop surgery for volumetric meshes: Reeb graphs reduced to contour trees. IEEE
Transactions on Visualization and Computer Graphics, 15(6), 1177–1184. doi:10.1109/TVCG.2009.163
Tricoche, X. (2002). Vector and Tensor Field Topology Simplification, Tracking and Visualization (Doctoral dissertation). University of Kaiserslautern, Kaiserslautern, Germany.
Walsum, T. V., Post, F. H., Silver, D., & Post, F. J. (1996). Feature extraction and iconic visualization. IEEE Transactions on Visualization and
Computer Graphics , 2(2), 111–119. doi:10.1109/2945.506223
Wittenbrink, C., Pang, A., & Lodha, S. (1996). Glyphs for visualizing uncertainty in vector fields. IEEE Transactions on Visualization and
Computer Graphics , 2(3), 266–279. doi:10.1109/2945.537309
Wu, K., & Zhang, S. (2013). A contour tree based visualization for exploring data with uncertainty. International Journal for Uncertainty
Quantification , 3(3), 203–223. doi:10.1615/Int.J.UncertaintyQuantification.2012003956
KEY TERMS AND DEFINITIONS
Ensembles: Multiple predictions from an ensemble of model runs with slightly different initial conditions and/or slightly different versions of
models. Forecasters use ensembles to improve the accuracy of the forecast by averaging the various forecasts.
Feature: Phenomena, structures, or objects of interest in a dataset. The definition of a feature depends on the specific application and user. Examples of topological features of a scalar field are the Morse-Smale complex, the Reeb graph, and the contour tree. Generally, vector field topology consists of critical points, periodic orbits, and separatrices.
Feature-Based Visualization: A type of visualization which visualizes features extracted from the original, usually large, data set.
Uncertainty: The opposite of certainty; having limited knowledge to exactly describe an existing state or a future outcome. Uncertainty in scientific data can be broadly defined as statistical variations, spread, errors, differences, minimum-maximum range values, and so on.
CHAPTER 15
Big Data – Small World:
Materializing Digital Information for Discourse and Cognition
Ian Gwilt
Sheffield Hallam University, UK
ABSTRACT
This chapter furthers discourse between digital data content and the creation of physical artifacts based on an interpretation of the data. Building
on original research by the author, the chapter asks the questions: why should we consider translating digital data into a physical form? And what
happens to how we understand, read, and relate to digital information when it is presented in this way? The author discusses whether or not the
concept of the data-driven object is simply a novel visualization technique or a useful tool to add insight and accessibly to the complex language
of digital data sets, for audiences unfamiliar with reading data in more conventional forms. And the author explores the issues connected to the
designing of data into the material world, including fabrication techniques such as 3D printing and craft-based making techniques, together with
the use of metaphor and visual language to help communicate and contextualize data.
INTRODUCTION
In the 1990s, the arrival of domestic computing and mainstream digital technologies signaled the start of a concerted effort to digitize all forms of creative, cultural and scientific content from the past, present and future. Two decades on, the digital is a fully integrated meta-form that drives many of our communication tools, social activities and work practices (Gwilt, 2010). After this initial wave of digital integration, more careful
consideration as to the relationship between our physicality, environmental surroundings and material artifacts, and how this links with a range
of digital technologies is beginning to take place. A reconsideration of biological necessities and the recognition of the human hardwiring into
Euclidian space has begun to raise questions about the singularity of digital culture. As yet the promise of a transcendent digital virtual reality has
failed to live up to expectations and a new way of interacting with the digital is beginning to unfold. In the computer games world game-play has
been combined with gestural interfaces where players can see the unmediated expressions of their competitors. Biomorphic forms in architecture
and product design signal a new zeitgeist in urban design as digital technologies have developed the processing power to model and visualize the
complex curvilinear shapes and patterns found in nature. In the built environment everyday objects such as chairs and automobiles are
increasingly enabled with sensors and user-feedback technologies that can respond to and even pre-empt our individual needs and relationship
with the physical world. The Internet of things is becoming a reality, realized through digital connectivity and the concept of ‘everything, all the
time’. Sensing technologies are capable of remotely collecting our every interaction and this capacity plays an important role in the big data
revolution. Wireless mobile technologies have moved the experiences of the digital computer into the street and the public arena where their use
is becoming increasingly commonplace, connecting the digital with real-world events and locating our engagement with computing technologies
into real-time social, cultural and political contexts. These distributed, pervasive digital technologies are beginning to have a major impact on the
fabric of society, from how we access healthcare, to how we do our shopping, work, travel, and communicate.
The technological and perceptual dispersal of the digital computer from something that sits on the office desk, into increasingly embedded,
distributed and multiform constructs, also disarms the often talked about binary opposition between the digital and the physical. As computing
technologies become increasingly located and related to place and social contexts of use, the potentials for the digital to augment and interact
with material culture become more opportune. In terms of information visualization this closer relationship does two things: it provides new
opportunities for content forms; and drives the desire for data visualizations that speak to both our real-world and digital interactions. The
cultural theorist Pierre Levy (1998) refers to this diverse range of digital integration as a type of accelerated techno-cultural heterogenesis. The
shift in emphasis back toward the physical does not however, mean that we are about to give up the connectivity, convenience and enabling
potential of our digital technologies. Although the experiences promised by immersive virtual reality have yet to find a place in our mainstream
engagement with computers, the types of informed digital/material constructs described in this chapter are beginning to gain widespread
recognition. Terms like augmented, and mixed reality are increasingly being used to describe a set of relationships, technologies and expectations
for a variety of combined, digital/material constructs. These neologisms are becoming part of the public vocabulary in an increasingly
technologized society.
Predictably then, the tendency toward the materialization of the digital is also occurring in the area of computer-based information visualization,
which is in itself a relatively new phenomenon. The materialization of digital data is facilitated by the development of a number of new
manufacturing technologies such as 3D printing and Rapid Prototyping techniques that allow for the translation of digital data into physical
forms. I will discuss the potential of these techniques later in the chapter through a number of applied examples. In the next two sections we will
continue with a short overview of contemporary data collection and information visualization.
DATA COLLECTION, BIG DATA
Digital technologies continue to provide numerous embedded data collecting points that glean information from our everyday interactions with
the world, both analogue and digital. The data trails of people, computers, economies, healthcare services, communication and leisure activities
are leading to an exponential growth in the gathering of data, which is now measured in exabytes. This proliferation of information has seen the
rise of big data, an open-ended term that alludes to the scale, variety of forms, and rapidity of contemporary data collection. There are a number
of data management and analysis software tools that can be used to collect and present data sets, and different parametric and data
configurations allow for a variety of sorting, ordering and comparative activities to take place. However, the scale of big data means that common
statistical and clustering techniques typically used for two- and three-dimensional data now need to accommodate highly multidimensional data
from a variety of different, ‘messy’ sources to explore and reveal trends (Illinsky & Steele, 2011). The potential to cross reference large-scale data-
sets is a key feature of big data that has caused much debate about how we draw meaning by combining data from a variety of sources and in
different forms to indicate relationships (Mayer-Schönberger & Cukier, 2013). The argument is that if we embrace the sheer density of available data, and its continuing supply, it is possible to computationally predict patterns, trends, and causes; conversely, it is argued that we
still need to examine the socio-cultural, economic, political and environmental contexts to any relationships that might be revealed in
multidimensional data-fields (Graham, 2012; Mayer-Schönberger & Cukier, 2013).
The visualization of data brings another level of subjectivity to our interpretation and understanding of data, and the problem of how to visualize complex multidimensional data has only recently begun to be addressed (Yua, 2013). The visual languages, syntax and techniques of visualizing big data are still very much in development, and the challenge of data visualization, both digital and physical, is to make meaningful interpretations that can be understood.
Data Visualization
The disciplines of digital information visualization and information design have witnessed a phenomenal growth in the last 15 to 20 years (Card,
Mackinlay, & Shneiderman, 1999; Ware, 2004; Klanten, Ehmann, Tissot, & Bourquin, 2010), and as digital data has become more accessible,
designers and other professionals have taken to the task of visually interpreting this data with gusto. This interest is evidenced in work and
publications such as Klanten’s, ‘Data flow: visualising information in graphic design’ (2008), ‘Information is beautiful’ (McCandless, 2012), and
‘Infographica: the world as you have never seen it before’, by Toseland and Toseland (2012). The bread and butter content of conventional data
visualizations such as scientific, political, environmental and economic statistics, now sit side-by-side with visualizations and measurements of
social networking patterns, Internet distribution arrays, transportation usage, and even more idiosyncratic content such as visits to fast food
outlets or personal coffee consumption. Nicholas Felton is an example of a designer who creatively visualizes his daily routines and other events,
which are published as ‘personal annual reports’ (Felton, 2014). A huge range of visual representation techniques are employed in contemporary
information visualization practices including the use of: bar charts; graphs; diagrams; illustrations; 3D models; maps; animations; generative
and interactive visualizations, designed for onscreen and print-based consumption, and within these techniques decisions around: the use of
colour; graphical qualities of line; the position, scale and layout of items; use of typography, and other media elements, all play a part in helping
to communicate data and information.
However, information design as a practice has a long history prior to the invention of the digital computer and we can trace the desire to visualize
information back to the earliest forms of mark making and symbolic languages. Tufte (1983) attests to the fact that abstract visualization of
quantitative, statistical data did not really come about until the mid 18th Century when tabular data was first replaced by charts and graphics.
According to Pauwels (2006) the visualization of scientific findings is a fundamental part of scientific discourse. Pauwels’ proposition underlines
the importance of data visualization as a tool for aiding cognition and following on from this, the importance of making the appropriate
visualization design decisions. The majority of data visualizations are primarily concerned with the presentation of quantitative data, but Tufte’s
suggestion that the minimal use of visual elements should be used to communicate the maximum amount of data, has at times been forgotten.
This could be due in part to the range of visualization capabilities of the modern day computer, and the desire of data visualizers to make use of
them all. Moreover, he suggests visual elements in information visualization should perform a dual purpose, carrying information, and
communicating something about that information (Tufte, 1983). Notwithstanding this advice, Manuel Lima in his ‘Information Visualization
Manifesto’ suggests that unlike design in the material world, where form is regarded to follow function, in the digital domain “form does not
follow data” (Lima, 2009), or in other words that data does not necessarily dictate form. This concept of ‘incongruent data’ suggests that any
number of visualizations can be made from the same data set, and that any visualization is able to make an equal claim on veracity or
appropriateness. This has of course been the case in analogue media too, but the scope and range of formats made available within digital
computing has extended these choices dramatically. Combining this sudden expansion of choice, with the ease of production inherent in the
digital and the potential dislocation between form and function, also goes some way to explain the plethora of new data visualizations that have
appeared over the last decade or so, some of which have attracted criticism for being unnecessary, meaningless or at worst unintelligible. Lima
and others however, recommend that information visualizations should at their core provide insight or clarity. This directive can be used as a way
of legitimizing design decision-making. Presaging Lima’s suggestions, Card et al. (1999), asserted that from a Human Computer Interaction
(HCI) perspective, information visualization is “…the use of computer-supported interactive, visual representations of abstract data to amplify
cognition”(p.7). Furthermore, the champion of information aesthetics, Andrew Vande Moere foregrounds the democratizing potential that
information visualization tools can have as an inclusive communication device (Klanten et al., 2010). Given the wealth of choice offered by
computing technologies the idea that every effort should be made to achieve a contextual or sympathetic connectivity between the use and design
of data visualizations and the underlying data is a logical one. In the following section I will examine in some detail how the property norms of
digital and material culture might be considered in an attempt to amplify cognition through a relationship of data and form.
Attributes and Properties of Digital and Natural Forms
The actualising of a digital data-set into a physical object may initially seem paradoxical, since by fixing digital data in time and physical space we
would appear to disable much of the dynamic potential inherent in computer technologies. However, by realizing a digital data set as a physical
object we can begin to consider and exploit the attributes and properties that are typically assigned to both digital and material cultures. A key
challenge for the data-driven object is to usefully combine any number of possible relationships between digital traits such as dynamism,
complexity, interconnectivity, mutability and so on, and the material properties inherent in the physical object such as tactility and notions of
uniqueness, preciousness/value (in both economic and socio-cultural terms), durability, and provenance. The concept of the data-object invites
us to apply a new information communication typology wherein an articulated object could be populated with information that would exploit
both digital and material attributes to the benefit of a range of social, educational, scientific and economic stakeholders.
The articulations of the data-object attempt to combine both echoes of the digital alongside the benefits of the material properties inherent in the
physical. By creating a physical data-object we can hopefully use a combination of physical and digital traits to encode, communicate and point to
relationships in the underlying data (Table 1).
Table 1. Ontological/epistemological qualities of the data-object (material attributes on the left, digital attributes on the right)
Tangible Networked
Unique Multiple
Precious Disposable
Fixed Dynamic
Present Remote
Historic Technological
Patinated Mutable
Understandably though, there are tensions formed when we create physical objects from digital data sets. This fusion of properties from both
paradigms can lead to a perceived compromise in the expectations or benefits we associate with the discrete digital data or material artifact.
However, as suggested the dialogic exchange between digital information and material form can establish new potentials for reading and
interacting with these articulated objects. Nevertheless we should not understate the fundamentally ontological and epistemological differences
that frame our understanding of digital and material cultures. Reconciling these differences at a cultural level is beyond the data artifact, but
what the data artifact does do is act as a point of confluence where those differences can be challenged or explored. Take for example, notions of
time and space, which have particular associations in both the digital and material realms. Within the digital, the concept of space or ‘taking up space’ is abstracted; we talk about large and small file sizes in computer memory, but this space is hard to visualize. Conversely, the physicality
of an object clearly shows size and the occupation of space. The ‘physical’ perception of scale, and size is something we are much more attuned to
in material culture and is a property that can be explored in the creation of a data artifact. Similarly, spatial relationships and how we order, sort
and arrange items within both digital and material cultures also have their own established conventions and are equally important considerations
for how we visualize data as form.
Notions of time are also particular to digital and material cultures. When we think of time in relation to the digital it is often in terms of the
immediate present and where synchronous feedback across space, data and community is the expected norm. In the digital, time can also be
replayed or visualized at different speeds and scales, it can even be archived for browsing at some other date. Conversely our perception of
physical objects is that they have a fixed and contiguous relationship with time as we measure it in the material world. Physical objects cannot be
scrolled backwards or forwards, out of synch with the time continuum, but are linked to cultural notions of age, newness, or antiquity.
Consequently there is often a tacit sense of time embedded into the way we read and respond to physical objects and by extension physical data.
This sense of time: how old an object is; how long it took to make; acquire; or physically reach, is an important factor in the way we attribute
value to material artifacts. For the data-object this physical notion of time and time-scale might be gained at the expense of digital immediacy or
temporal non-linearity.
Consider the work entitled ‘iForm’ (2010) by James Charlton (Figure 1). In this work GPS tracks of iPhone users were recorded over a set period
of time and location. This information was then used to create a physical sculpture which transcoded the digital representations of time and space
into a physical form. Charlton notes that the representation of time and movement as a physical object challenges the assumptions we make on
the meaning of an object’s form (Charlton, 2010). Objects that are conceived from the visualization of digital data do not need to conform to the
representational mores of objects that draw a reference to material culture. In this case the surfaces and angles in the work represented spatial
and temporal variations of a digital data set, which are used to inform the shape of the object. Charlton’s data-driven object concretizes the
abstract concepts of space, time and movement by physically representing our interaction with the digital.
Figure 1. iForm, rapid prototype sculpture and data projection
© 2010, Artist James Charlton (Used with permission).
Other notions of time, use and age are distinctively represented in digital and material spaces. Physical objects collect dust or become dirty, worn
or broken, they accrue ‘character’ via our interaction with them and being in the world. This patina is something that is hard to fabricate in the
digital and implicitly denotes age, frequency of use (popularity) and history. Furthermore, physical objects decay with time, where, in theory at
least, digital data can indefinitely be stored without loss of fidelity. However, the continuous process of upgrading computing hardware,
operating systems, and software, can create, if not quite a form of digital entropy, a distinctive timeline for digital content, marking its heritage
and capabilities. Moreover, the history of digital content is intrinsically iterative, multiple and variable; think of the ‘save’ and ‘save as’ functions
in computer software, and the different results which arise from using these options. One allows for multiple variations or histories and the other
replaces one history with another. The concept of temporal non-linearity is often ascribed to the digital as one of its key attributes, but equally the
history and narrative of a real-world object is a complex assemblage of relationships, which take into account use, ownership, geographical
location and material composition, and the changes in these aspects over time (Riggins, 1994). This flexibility with which digital data can be
manipulated, reproduced and rearranged can engender a perceived lack of authenticity, although real-world data collection processes are equally
open to interpretation. In the digital authenticity is often assigned by popularity as much as by provenance.
It is generally accepted that physical artifacts become imbued with a social and cultural narrative by way of our interactions with them (Miller, 2010), and that interaction with a specific object can bring cultural and individual value to even a mass-produced item. This notion of interaction in material terms is quite different from the functional, goal-oriented interaction with data on the computer screen, which has the advantage of easily facilitating a variety of shared experiences across a broad spectrum of communities. This distributed experience is harder to create in the physical world, and is something that the social networked space of the digital is very adept at. Although the first-hand tangible experience of the physical object might be lost through the mediated channels of the digital, this can be made up for through the potential of multiple sharing, commentary and reinterpretation. On one hand, the democratic possibilities of the digital give widespread access to information and experience, which is often difficult to achieve with a physical object. On the other, there are strong inherent cultural notions of authenticity, which we ascribe to the empirical interaction with a singular physical object (Riggins, 1994). In the best case, the manifestation of digital information into a material form should play to the strengths of both of these paradigms; in the worst case, it would fail to capitalise on either.
When we are thinking about the creation of physical data-driven objects we should also consider the material properties of the object and how they might influence the way we read the underlying data. The material properties of wood, metal, plastic and so on all have a set of cultural resonances and readings that prefigure how we think of a particular object. For example, the clean smooth lines of melamine plastic furniture point to the concept of a ‘modernist’ aesthetic (Shove, Watson, Hand, & Ingram, 2007), whereas the high-tech finish of carbon fiber brings to mind the composition of future worlds and artifacts. Each material form has its own set of cultural readings, which can be complex, and subtly interwoven with the use of the object and the context or setting in which the object is found. For instance, wood has the association of being a natural material but can also be perceived as a sophisticated urban form. Just as the general categorization of physical materials can prefigure the reading of an object, the ‘finish’ of a particular material can also be important. Returning to the example of the wooden artifact, a table top roughly cut from a tree might retain pieces of bark and the inconsistencies of natural growth. This finish has a very different set of associated cultural values from that of a wooden table top that is machine cut, symmetrically shaped, sanded and lacquered. Do these finishes have their parallel in the digital domain? A low-resolution digital image might be perceived to have a ‘raw’ quality about it, whereas an image with a high pixel resolution might intrinsically suggest refinement and quality. Importantly, the aesthetics of an artifact, digital or material, have an important influence on how the artifact is read and regarded. Just as the use of colour, composition, line-weight and typography in digital information visualization can set up a way of reading content, the material choices and finishes of a material-based data artifact can equally influence how the object and its attendant data are interpreted. The creator of a data-object assemblage must therefore be cognizant of all these considerations.
MAKING TECHNOLOGIES
The capacity of the digital to easily create multiple versions of the same artifact is to some extent reflected in material culture through contemporary manufacturing and production processes. As we move into an era of new forms of digitally enabled manufacturing, the idea of mass customisation and the individually specified object is increasingly taking precedence. New adaptive and programmable production processes make the fabrication of bespoke objects easier and cheaper. Many of the first generation of data-objects have been produced by utilizing fabrication technologies that sit under the broad category of Rapid Prototyping, in addition to computer-controlled milling and laser-cutting processes. Rapid Prototyping machines can fabricate objects in a variety of materials, ranging from paper-based products to different types of plastic and waxy materials and even precious metals, and they use a number of different processes to make objects. Typically, though, computer-based 3D modeling software is used to create a digital model, which is then used to print the material object (Noorani, 2006). Rapid Prototyping techniques were initially developed as a (relatively) inexpensive method of testing manufactured components and products before they were put into production (Chua, Leong, & Lim, 2003).
One of the early forms of Rapid Prototyping is the Laminated Object Manufacturing (LOM) method. The LOM fabrication process is of interest
here as it articulates attributes from both digital and material making processes. In this process sheets of paper are glued together one layer at a
time. As the layers are built up a computer-controlled laser or knife is used to trace out the required form and to ‘cube’ unwanted material ready
for removal. This process requires much hand finishing, principally the removal of unwanted material and the sanding down and treating of the
object to the desired finish. The LOM process effectively turns paper back into a wood-like material and the burnt, laser cut edges of the
laminated paper give the object an unusual low-tech feel, distinctively at odds with the look and feel of more advanced Rapid Prototyping
processes. The slightly rough, handmade look of the object, as well as its physical weight, gives it a set of qualities that echo pre-industrial
revolution manufacturing/making processes, with the visible grid-lines echoing the visual language of both graph paper and the 3D modeling
environment of the computer. The artist Brit Bunkley (2009) has effectively used the LOM fabrication process in a number of his sculptural
works (Figure 2).
Figure 2. Trophy, Laminated Object Manufacturing (LOM)
detail
© 2006, Artist Brit Bunkley (Used with permission).
Computer-controlled laser-cutting technologies are another popular technique for producing data artifacts, which can be used to achieve similar
affordances to the LOM method. In Abigail Reynolds’ work ‘MOUNT FEAR Statistics for Crimes with Offensive Weapon South London 2001-
2002’ (2002), sheets of corrugated cardboard are cut out and stuck together to create a room sized three-dimensional bar chart of crime statistics
in London, England. The physical size of the data-object, roughly finished in cardboard, creates an unsettling and impactful representation of the
underlying information. The work clearly shows how choice of materials, scale and manufacturing process can influence how data presented in
physical form is received.
The aesthetic of the LOM manufacturing process and the laser-cutting techniques described above differs from the idea of the precise, computer-fabricated making and finishing which we typically expect from Rapid Prototyping technologies, where hard surfaces and bright white plastics are the norm. Yet what these examples begin to reveal is the complexity of reading material forms, where our understanding of even the simplest of objects can be prefigured by the making technologies employed and the choice of materials used. Daniel Miller (2010) comments on the importance of objects in society not just as things that affect our behavior and sense of self, but also as scene-setting agents which move in and out of our focus. In this respect, our relationship to material objects might be seen to parallel our relationship to the computer desktop interface. According to Bolter and Grusin (1999), the computer desktop interface operates as a window and a mirror, something you can look through and look at; at times content, at times form, and at times both. As a combination of both digital and material elements, data-objects read on a number of interlinked levels, as form and content.
Data and Metaphor
As discussed earlier, in data visualization form does not necessarily follow function. Lakoff and Johnson (1980) suggest that we use spatial metaphors to help us understand complex concepts. It follows that data visualizations will often use metaphor to create relationships and understanding based on objects, spaces and experiences in the material world. Within computer-based data visualization, metaphor is used to aid understanding by association with physical phenomena (Anders, 1999). These metaphors can be given material weight when data sets are materialized as physical objects. In Nadeem Haidary’s work ‘Caloric Consumption’ (2009), statistics on the average calorific intake per capita in different countries are visualized as a real/physical dining fork. The prongs on the fork are different in length and represent the varying calorific intake data from four countries (Haidary, 2009). This data-object uses the fork as a metaphor for personal and global consumption. Because of the different lengths of the prongs the fork is not really functional; it operates as a symbolic reading of both data and form, seen through the lens of the cultural artifact. As Latour and Weibel (2002) suggest, there are any number of possible readings and multiple encodings for any given symbol.
The visual language of data visualization is also well served with biological metaphors, from the use of particle diagrams, molecular and cell structures, to tree-shaped flow charts, growth rings and cloud patterns. These visual reference points are commonly used to aid understanding of statistical data through familiarity and to give some notion of ‘organic’ connotations to data. Other graphical types of imagery, including maps, pictograms, arrows and geometric shapes, are frequently used to add a relational context to data (Yau, 2013). Tufte (1997) comments on the relationship between visual and statistical thinking and how the application of visual language to data can aid, or detract from, our understanding of the data. The use of metaphor becomes more challenging when we combine digital data with the material language of physical form.
Making from the Digital
As mentioned, there are a number of enabling fabrication technologies that can facilitate the production of data-objects. As these technologies
have become cheaper and more readily available the creative community has begun to take advantage of these new production processes. In the
‘Inside Out’ (2010) Rapid Prototype exhibition over 40 miniature sculptures were produced using Rapid Prototype technologies. The show was
the result of collaboration between artists and designers in the UK and Australia. Each artist contributed one sculpture to the show. The
sculptures were initially designed on the computer and output in resin for exhibition in both countries. The ‘Inside Out’ exhibitions, in name and
theme, explored the making techniques and perceptions of creative practice in physical and digital environments, and allowed artists from
different disciplines including textile design, fine arts, architecture and animation, to consider how to engage with these new fabricating
processes (Smith, Rieser, & Saul, 2010).
Two of the works in particular demonstrate how data recorded from natural phenomena and biological processes can be used in the creation of a
data-object. The work by Mitchell Whitelaw entitled ‘Measuring Cup (Sydney 1859-2009)’ (2010), is a materialization of temperature statistics
for Sydney, Australia over 150 years. In Whitelaw’s work this data is used to form a resin beaker (Figure 3). Each layer of the beaker represents
one year’s statistics. Placed one on top of another, the rings of data build up the sides of the container, and like the growth rings of a tree, the
annual rings of temperature data give a tangible representation of change over the years. Interestingly the ergonomic affordance of a flared lip,
usually introduced to aid drinking, is reflected by the recent upward (outward) trend in overall temperatures (Whitelaw, 2010).
Figure 3. Measuring cup (Sydney 1859-2009), rapid prototype
sculpture
© 2010, Artist Mitchell Whitelaw (Used with permission).
While Whitelaw’s piece is a representation of data analysis from the natural environment, Michele Barker and Anna Munster’s piece in the
exhibition looked to a data set of a much more personal nature. In the work ‘Brainwaves Of Monks Meditating On Unconditional Loving-
Kindness And Compassion’ (2010), Barker and Munster materialize the act of thought. Using neuroscientific data of brain activity, recorded
differences in electromagnetic energy emitted during meditation were used to form the basis of the Rapid Prototype sculpture. In a novel form of
bioinformatics the data-object, a 3D sculptural representation of the data traced in the form of an ascending graphical line, is both informative
and fragile, and the qualities of the physical object lend an intimacy that is missing in a conventional 2D chart or graph.
In another intimate expression, the singularity of our individual physiology is the inspiration for a series of data-driven works by the jewelry maker/conceptual artist Christoph Zellweger. In Zellweger’s ‘Data Jewels’ (2006), Rapid Prototyping and other computer-controlled manufacturing processes were used to create wearable jewelry based on an individual’s genetic code (Zellweger, 2007). Zellweger’s jewelry reveals the hidden codes and data of our physical being, and exploits the trend in advanced manufacturing where individual items can be fabricated to exact personal taste or requirement (see Figure 4). This flexibility in physical making moves toward the type of adaptive, mutable attributes we normally ascribe to the digital, yet still allows us to benefit from the tangible properties we gain from material culture.
Figure 4. Data jewel, rapid prototype jewelry
© 2006, Artist Christoph Zellweger (Used with permission).
In their own way, the three examples mentioned above all explore the potential for exploiting the attributes and properties that we commonly associate with digital data and material objects. Each work uses the combined potentials of hybrid data-objects to create works that are conceptually and formally challenging. However, these data-objects all use discrete and relatively small data sets. The test for future data-objects will be how to accommodate the complexities of big data and how to make multidimensional comparisons between data in physical form.
CONCLUSION AND FUTURE RESEARCH DIRECTIONS
As discussed, the manifestation of digital data into material artifacts is increasingly taking place across the practice of information visualization. These data-driven artifacts are based on a diverse assortment of data sets, many of which are generated through our engagement with natural and man-made phenomena, personal biometrics and social communication technologies. It is an interesting loop that brings these statistics back into the material world having been collected, processed and analyzed in the digital realm. However, the notion of big data and the desire to correlate complex relationships in data pose an interesting challenge for the data-object. What is the purpose of information visualization, if not to bring clarity, insight and legibility, and to improve the communication of an idea? And if this is the key underlying purpose of data visualization, what methods should be used to assist in the sharing of knowledge in an engaging and responsible way? Certainly the data-driven object needs to be considered in light of the existing issues and conversations which are already taking place in the discipline of information visualization. Yet the creation of a physical object based on a digital data set is in a sense a new media form. These dialogic objects have the ability to capitalize on the inherent traits found in both digital and material culture to communicate their message. As Guy Julier (2014) proposes, the coming together of different elements by design, in this instance data and material form, can lead to new understanding. By combining the two cultures into an articulated form, new ways of looking at the digital/material relationship, and how we can communicate through it, can be explored. These hybrid constructs invite us to engage with the complexities of big data in socio-cultural contexts and to take advantage of the potentials offered by integrating digital and material attributes. Luc Pauwels (2006) begins his book on the visual cultures of science with the statement that “The issue of representation touches upon the very essence of all scientific activity. What is known and passed on as science is the result of a series of representational practices.” He goes on to say that “Visual representations are not to be considered mere add-ons… they are an essential part of scientific discourse” (p. vii). These assertions clearly underline the point that careful consideration of how we visualize science, and by extension the pervasive data gathered from 21st-century living, needs to be undertaken intelligently, with care and through the appropriate forms and visual languages.
This work was previously published in the Handbook of Research on Digital Media and Creative Technologies edited by Dew Harrison, pages
3346, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Anders, P. (1999). Envisioning cyberspace . New York, NY: McGraw-Hill.
Bolter, J. D., & Grusin, R. (1999). Remediation: Understanding new media . Cambridge, MA: MIT Press.
Card, S., Mackinlay, J., & Shneiderman, B. (1999). Readings in information visualization: Using vision to think . San Francisco, CA: Morgan
Kaufmann.
Chua, C. K., Leong, K. F., & Lim, C. S. (2003). Rapid prototyping: Principles and applications . Singapore: World Scientific. doi:10.1142/5064
Iliinsky, N., & Steele, J. (2011). Designing data visualizations . Sebastopol, CA: O’Reilly Media Inc.
Inside Out Rapid Prototype Exhibition. (2010). Object gallery, Sydney, Australia. Retrieved August 29, 2014, from http://www.object.com.au/exhibitions-events/entry/inside_out_rapid_prototyping/
Klanten, R., Ehmann, S., Tissot, T., & Bourquin, N. (Eds.). (2010). Data flow 2: Visualizing information in graphic design. Berlin, Germany: Gestalten.
Lakoff, G., & Johnson, M. (1980). Metaphors we live by . Chicago, IL: University of Chicago Press.
Latour, B., & Weibel, P. (2002). Iconoclash: Beyond the image wars in science, religion and art. MIT Press.
Levy, P. (1998). Becoming virtual - Reality in the virtual age . New York, NY: Plenum.
Lima, M. (2009). Information visualization manifesto. In Visual complexity VC blog. Retrieved April 11, 2011, from
http://www.visualcomplexity.com/vc/blog/?p=644
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think . London, UK: John Murray.
Noorani, R. (2006). Rapid prototyping principles and applications. Hoboken, NJ: John Wiley and Sons.
Pauwels, L. (Ed.). (2006). Visual cultures of science. Lebanon, NH: Dartmouth College Press.
Riggins, S. (Ed.). (1994). The socialness of things: Essays on the socio-semiotics of objects . Berlin, Germany: Mouton de Gruyter.
doi:10.1515/9783110882469
Shove, E., Watson, M., Hand, M., & Ingram, J. (2007). The design of everyday life . Oxford, UK: Berg.
Smith, C., Rieser, M., & Saul, S. (Eds.). (2010). Inside out: Sculpture in the digital age . Leicester, UK: De Montfort University.
Toseland, M., & Toseland, S. (2012). Infographica: The world as you have never seen it before . London, UK: Quercus Publishing Plc.
Tufte, E. (1983). The visual display of quantitative information . Cheshire, CT: Graphics Press.
Tufte, E. (1997). Visual explanations: Images and quantities, evidence and narrative . Cheshire, CT: Graphics Press.
Ware, C. (2004). Information visualization: Perception for design . San Francisco, CA: Morgan Kaufmann.
Yau, N. (2013). Data points: Visualization that means something . Indianapolis, IN: John Wiley & Sons.
KEY TERMS AND DEFINITIONS
Data-Object: A physical object whose form is dictated by an underlying digital data set.
Information Design: The use of visual design practices to help communicate information either in print or digital form.
Information Visualization: Primarily screen-based visualization of data, designed to utilize the generative, time-based and interactive
capabilities of the computer.
Jackie Campbell
Leeds Beckett University, UK
Victor Chang
Leeds Beckett University, UK
Amin Hosseinian-Far
Leeds Beckett University, UK
ABSTRACT
This chapter aims to critically reflect on the processes, agendas and use of Big Data by presenting existing issues and problems and consolidating our points of view from different angles. The chapter also describes current practices of handling Big Data, including considerations of smaller scale data analysis and the use of data visualisation to improve business decisions and the prediction of market trends. The chapter concludes that alongside any data collection, analysis and visualisation, the ‘researcher’ should be fully aware of the limitations of the data, by considering the data from different perspectives, angles and lenses. Not only will this add to the validation and validity of the data, but it will also provide a ‘thinking tool’ by which to explore the data, arguably providing the ‘human skill’ required in a process apparently destined to be automated by machines and algorithms.
1. INTRODUCTION AND BACKGROUND
“Data is the new oil” (Thorpe, 2012), and Big Data can be used to know us better than we know ourselves (Mayer-Schönberger and Cukier, 2013). Based on our Google searches, flu outbreaks can be predicted (known as nowcasting) (Google, 2014; Dugas, 2012); based on our social lifestyle data, the authorities will know when we will commit a crime before it has even been committed (Berk, 2009). These kinds of claims and threats are the current strap lines promoting the use and potential of Big Data and Business Intelligence (BI). Alongside this, recent research in economics by Kahneman questions our basic skills as decision makers, claiming that human decisions are naturally biased and usually flawed (Kahneman, 2013). Kahneman argues (amongst other things) that if recent financial decisions had been based on data, the financial crises in the UK, the US and now apparently China would never have happened, a view supported by analyses undertaken by Chang (2014) in his studies.
The power of Big Data combined with the ability to ignore our own intuition could provide an entirely new paradigm for the processes by which we do business, research and make decisions. However, data still comes with its flaws: biased data, data with quality issues, interpreted data, assumed data, mismatched data, subjective data, dated data, ethically questionable data, data from ‘chosen’ samples, data designed to prove a point, lies and statistics. Data needs to be reprocessed, reorganised and restructured before any form of analysis can be conducted. If we contemplate these systems as black boxes, the number of inflows and outflows through them is so large that a data flaw would be very challenging to spot. This brings about a new skill and area of concern, one which academics and professionals alike should develop and consider: the ability to know the data and results for what they are, to understand the version of the ‘truth’ that this data represents, and to ensure that this is not lost in the subsequent use and promotion of the findings.
This chapter aims to critically reflect on the processes and use of Big Data by returning to the issues and considerations of smaller scale data analysis and research. The issues are unpacked with respect to the data findings and the search for the ‘truth’, or a perspective on the truth. The chapter concludes that alongside any data collection and analysis, the ‘researcher’ needs to be fully aware of the limitations of the data, by considering the data from different perspectives, angles, and lenses. Not only will this add to the validation and validity of the data, but it will also provide a ‘thinking tool’ by which to explore the data, arguably providing the ‘human skill’ required in a process apparently destined to be automated by machines and algorithms.
2. BIG DATA: THE NEW OIL OR FOOLS’ GOLD?
There have been some amazing data mining discoveries based on the leverage of Big Data: UPS, by using predictive analysis to monitor and replace specific parts, saved on repair costs (FieldLogix, 2014); deadly manhole explosions were predicted in New York (Ehrenberg, 2010); and Walmart discovered that just before a hurricane, people in America buy an unusually large number of Pop-Tarts (Hays, 2004). Data usually serves a purpose: to prove or disprove hypotheses and theories. An excellent example is an investigation into the relationship between mobile phone usage and brain tumours (Frei et al, 2011), in which the Danish cell phone operators and health care service worked together to provide the data for the study. The results showed no direct correlation. The benefits of adopting Big Data are as follows. First, the process of selecting data or population sampling is not required, as researchers can just take it ‘all’, and may be persuaded to ignore data validity issues based on the ‘general trend’ provided by such a vast dataset. Second, Big Data processing can highlight the part of the dataset which reflects the core of the problem. For example, medical data analysis can find direct correlations between lifestyle, genetic history and breast cancer.
It is interesting that 70% of companies say they are keeping data, but they do not know what for (Avanade, 2012). In terms of the amount of data being produced, current estimates are that 90% of the world’s data as of 2013 was created in 2012 and 2013 alone, and there will be 44 times as much by the year 2020 (ScienceDaily, 2013). Data is being called ‘the new oil’, yet clearly companies are aware they have the raw materials but do not know how to ‘spin’ the oil. Big Data, in terms of project lifecycle and requirements, misses out on the ‘data collection’ and ‘data design’ stages that would be present in a research project (Bazeley, 2013). This means that data analysts are more commonly brought in to ‘resuscitate’ (Bazeley, 2013) the data; not an impossible task, but more challenging and arguably more costly than a data collection and design process targeted at specific aims, objectives and requirements (Grbich, 2013).
3. BIGGER DATA = BIGGER PROBLEMS?
Volume, velocity and variety are the three elements of Big Data. Hence, the size, amount and growth of the data available can influence how the data can be used. This does not excuse the data quality problems and challenges. Data quality issues can cause serious complications: a girl died in South America after receiving a donor transplant of the wrong blood group due to a data error (NYTimes, 2004), and outdated land data showed the area hit by Hurricane Katrina in 2005 to be at low risk of flooding, causing many of the residents’ insurance policies to be invalid (Campbell et al, 2006). Data-related issues are reported frequently, and even on a personal level almost anyone can tell you of a data quality problem they have experienced. The consequence of these errors may or may not be significant. Data storage facilities aim to reduce data quality problems by enforcing constraints (such as NOT NULL, CHECK and data types), and the relational database model is well designed to protect the data (Date, 2004; Attaran & Hosseinian-Far, 2011). Big Data, however, does not lend itself to the relational database model. The preferred data stores are heavily denormalised, structured star schema models (Inmon, 2005; Kimball, 2003) or, for unstructured data, completely denormalised stores such as NoSQL databases or Hadoop (Smith, 2007). Big Data proponents claim that data quality issues are minimised due to the vast amount of data (Mayer-Schönberger and Cukier, 2013). However, they are still there; even if the nature of the algorithms ignores them, their very presence could be of consequence and should still be considered.
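To make the constraint point concrete, the following minimal sketch (Python with SQLite, using invented table and column names) shows how NOT NULL and CHECK constraints reject bad rows at load time, the kind of protection the relational model offers and which heavily denormalised Big Data stores typically forgo:

```python
import sqlite3

# Illustrative only: the table and column names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE donor_record (
        donor_id    INTEGER PRIMARY KEY,
        blood_group TEXT NOT NULL
                    CHECK (blood_group IN ('A+','A-','B+','B-','AB+','AB-','O+','O-')),
        age         INTEGER CHECK (age BETWEEN 0 AND 130)
    )
""")

conn.execute("INSERT INTO donor_record VALUES (1, 'O+', 34)")      # accepted

try:
    conn.execute("INSERT INTO donor_record VALUES (2, 'Q+', 34)")  # invalid blood group
except sqlite3.IntegrityError as exc:
    print("Rejected bad row:", exc)                                # CHECK constraint failed
```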
Knowing our requirements and expectations of the data is important for Big Data science. There is a difference between asking for a bank balance and asking for the likelihood of rain: one expects an entirely accurate answer, while the other is surrounded by a number of assumptions and potentials. This takes us back to the original objectives of the data analysis: what is the information we are looking for? A general trend or an absolute answer, qualitative or quantitative? These are considerations that researchers and data analysts spend some time deciding, exploring and identifying as part of the process. Does Big Data benefit from such consideration? Judging by the number of companies who have collected the data and do not know what to do with it, it would be challenging to provide a positive answer to this question.
The next few sections will consider the data lifecycle in more detail, in an attempt to unpack some of the considerations and issues with each phase of a data analysis project.
3.1. The Data ‘Project’
“Whenever we use data we are forced to think” (Bazeley, 2013), and this is a good start to any project: thinking about and considering what is sought. Ideally the aims and objectives of data analysis should be identified at the start of the project and should then lead on to an appropriate data collection phase. The characteristics of a ‘Big Data’ project can make this more complicated. The project is likely to have many stakeholders, each with their own requirements and hypotheses. Not all these stakeholders may fully appreciate the nature of the project, increasing the likelihood of the project requirements changing as the ‘users’ understand the capability and potential of the system. This is a recognised approach to data warehouse projects: to start small and ‘grow’ as the potential is realised (Smith, 2014). This lack of fixed project requirements is the basis of the argument between Kimball and Inmon as to whether to bring over ‘all’ the data (at the lowest granularity) into the data warehouse (Kimball) or to bring over just the required data (Inmon) (Kimball, 2003; Inmon, 2005). Kimball argues that by bringing all the data into the warehouse, it can be more flexible and supportive of the ever-changing requirements and questions of a company, because users learn to appreciate the potential of the data as their business changes (Kimball, 2003). Inmon prefers to acknowledge that the data marts of a data warehouse will be rebuilt many times over its course, leading to ever-evolving designs and data extractions to support a company in a more ‘as and when’ style (Inmon, 2005).
As projects, data analysis, research and statistics have been around for a while. They have revolved around reasonably small-scale data, often collected by a few people with specific intentions as to the purpose of the data; according to the requirements of the project, sample data sets are selected through appropriate data analysis techniques (Bazeley, 2013). It seems that Big Data projects are currently treated and understood by most companies to be the same as data analysis projects, just with much more data. It is suggested here that this view is not correct, and that we should go further and separate Big Data projects into two generic types:
1. Big ‘data analysis’ projects, i.e. the same kind of data analysis that has been around for years (looking at sales patterns, looking for clusters, time-based analysis), just with more data; and
2. ‘Big Data’ analysis projects, i.e. projects which search through large sets of randomly related data for correlations and/or something of interest, e.g. data mining.
“Research is messy” (Kincheloe, n.d.). Sometimes it is compared to a patchwork quilt, each ‘patch’ representing a small part of the ‘whole’ of the research, and each its own separate ‘research project’ designed with considered data collection and analysis methods to investigate the question (Kincheloe, n.d.). It is suggested that within either type of ‘Big Data’ project there will still be these smaller investigations. It could well be that a project would like to do both, as “whenever we use data, we are forced to think” (Bazeley, 2013), and in thinking we wonder how the sales relate to the weather, and how the weather improves our mood, and how our mood affects our spending... which is all of interest and worthy of investigation, but each of those variables (sales, weather, mood, spending) lends itself to different identifications, classifications and degrees of objectivity.
Academic research projects are often categorised as being qualitative or quantitative. Qualitative research is an approach that tends to focus on the quality of things rather than the quantity (Bazeley, 2013). It is often used to look for ‘causality’ (whether one event causes another event) and to ‘understand’ or ‘explore’. The findings are not absolute, but in understanding we can aim to improve. Quantitative research generally looks for more absolute truths; much of the analysis is based around correlations (Marsh and Elliott, 2008). Note the difference between the statements ‘with their lack of education Bill and Stephen struggled to find meaningful work’ and ‘low education predicts poor employment rates’. The first uses people and emotive wording to encourage us to relate and question; the second reads as if to present the facts. Potentially both analyses come from the same data set. The first seems more like qualitative data: it is known that Stephen ‘struggles’ and that he wanted ‘meaningful’ employment. The second seems more like quantitative data, based on employment and education statistics (Bazeley, 2013). The aim of the research here is important: if it was to see whether low education affects employability, then to discover that ‘low education predicts poor employment rates’ is useful. If we are looking to ‘understand the difficulties of low education on men’ (within a specific age bracket or area), then the first statement goes some way towards that. This example provides a good illustration of where initial research (into employability statistics and education standards) can provide a rationale for further research (understanding the difficulties).
3.2. Data Collection
Any kind of data analysis will involve data collection. An appropriate methodology may be selected based on the requirements of the data investigation. Academics spend a lot of time, effort and research in identifying a methodology in keeping with the objectives of the research (Bazeley, 2013). A case study approach could be used to investigate a specific ‘case’, or a ‘grounded theory’ approach used to collect data and see where it takes us (Cresswell, 2007). More radically, a ‘phenomenological’ approach would aim to explore a phenomenon by ‘bracketing off’ all prior opinions, perceptions and theories on the research (Cresswell, 2007). Industry projects generally do not consider such theoretical methodologies; however, it can be argued that often the approach is implicit in the project. Phenomenology can be seen as ‘looking for anything of interest’ (all data), while analysing data relating to employability and education amounts to forming a case study. Researchers ‘collect’ data via narrative extraction, interviews, focus groups, observations, surveys, etc. Industry, of course, uses these techniques as well, perhaps not with such a thorough evaluation of the methods and rigorous design of tools, although it is more agile and has a speedier life cycle.
Academics are forced to consider the limitations, validity and bias of their data. They consider their rationale for, and position in, the research, and they critique the effect this will have on the data. In business, a data analysis/mining specification would likely address the same points. Research is often driven by personal interest (Bazeley, 2013), which often puts the researcher as an ‘insider’ to the research; this can create ethical and validation issues (Cresswell, 2007). The personal interest could well be driving a personal opinion or hypothesis that the analysis is subconsciously ‘set up’ to discover (Crawford and Boyd, 2011). Discussing and looking for these concerns does not eliminate them, and can sometimes lead to well-critiqued arguments. It does, however, encourage us to think about the data in other ways, which may lead to greater understanding.
The argument with ‘Big Data’ projects is that the biases, filters and codifications in the data are less transparent. The data may be bought in, the design of the original data collection may be lost or unavailable, and the codifications may not be what was sought. So the data is compromised and, in turn, our requirements become compromised. Using Twitter data to evaluate modern language has its limitations, as offensive language has been removed (Mayer-Schönberger and Cukier, 2013). Google data, insurance data and National Health Service data belong to a certain demographic (is the demographic effectively a ‘case study’?). It is worth noting which organisations (the National Health Service, government) have non-selective data sets, offering some of the nearest N=all datasets available in a country.
The data always represents something, and this is where a smaller case study can identify more clearly what it is representing, and where the data collection methods of an academic researcher should be understood and defended rigorously. On a larger scale, the decisions as to which data was to be collected still occurred; the biases in that data will still exist, based on who designed the original systems, the intention of the original systems, and the original stakeholders and business drivers. To put it bluntly, politics, personality and culture will create biases in this data.
3.3. Data Cleansing and Transformation
Data analysis consists of data collection, data reduction and data visualisation (Bazeley, 2013). A key stage in data warehousing projects is the ‘cleansing and transformation’ stage of the ETL (extract, transform and load) process (Kimball, 1996; Inmon, 2005). ETL involves stripping out ‘bad data’ by dealing with missing values, outliers or other data quality issues (see Won et al.’s paper on ‘dirty data’ for a list of numerous possible data quality issues (Won et al, 2003)). Data may be ‘coded’, aggregated (so losing the detail) or ‘transformed’ to matching units (American/UK dates, or degrees Celsius/Fahrenheit). Data errors can even be introduced at this stage. For example, a crime is bracketed into a ‘crime_type’ such as ‘robbery’ or ‘violent crime’: is the same crime counted twice if it was both? Is it logged as two crimes, a violent crime and a robbery? These decisions are made by experts and stakeholders as a result of processes, procedures and legalities, and sometimes some strange reasoning, but they can become invisible in the data. The intention is to provide the data in the same format so it can be compared. Ideally the ‘metadata’ about the data will be recorded and stored, i.e. where the data has come from, the date the data was obtained, the filters the data has been through, etc. In many cases this metadata will form part of the data analysis. It is vital to ‘reduce’ the data so that it is useful and can provide input to queries and reports. Kimball and Inmon’s arguments become very real in this phase of the project, with the decisions made providing potential for less detailed and more biased results.
In becoming ‘Big Data’, the data and system experts can be lost, and new assumptions may be made about the data which compromise its validity.
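The kind of coding decision described above can be made explicit rather than invisible. The sketch below (plain Python, with invented field names and values) harmonises date formats and temperature units and records a crime that is both a robbery and violent under a single combined code, so the choice is visible in the output rather than buried in the pipeline:

```python
from datetime import datetime

# Illustrative ETL-style cleansing: field names and values are invented.
raw = [
    {"date": "03/25/2014", "fmt": "US", "temp_f": 68.0, "tags": ["robbery", "violent"]},
    {"date": "25/03/2014", "fmt": "UK", "temp_c": 20.0, "tags": ["burglary"]},
]

def clean(record):
    # Harmonise US and UK date formats to ISO dates.
    fmt = "%m/%d/%Y" if record["fmt"] == "US" else "%d/%m/%Y"
    date = datetime.strptime(record["date"], fmt).date().isoformat()
    # Harmonise temperature units to Celsius.
    temp_c = record.get("temp_c", (record.get("temp_f", 32.0) - 32) * 5 / 9)
    # Explicit coding decision: a multi-category crime keeps a combined code
    # rather than silently being counted once or twice.
    crime_type = "+".join(sorted(record["tags"]))
    return {"date": date, "temp_c": round(temp_c, 1), "crime_type": crime_type}

for row in raw:
    print(clean(row))
```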
3.4. Data Analysis
Data analysis techniques should suit the research questions (Bazeley, 2013). The data mining techniques commonly considered for Big Data are association, classification, sequencing and correlation (Marakas, 2003); still, any type of data analysis can be performed on a large amount of data as long as the technique is appropriate for the data set. This argument works both ways, in that any technique can be used on a smaller dataset. However, there are data issues that can affect all analysis independent of technique. Sometimes data does not work together: it may not have a ‘primary key’ field such as “employee_id” to link the data, or it may be stored at different granularities, e.g. weather data held by year is not useful to link with sales data held by month. This is a strength of the data warehouse design: it is recommended to always include time (year, financial year, season, month, week, day, hour, time, as desired) to provide a link to the data. Much Big Data analysis works by linking GIS data gained from cell phone data (Gobble, 2013). However, real problems can occur when data is collected from open source repositories in a genuine hope to link it with other data to investigate.
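The granularity problem can be illustrated with a few lines of Python (all figures invented): monthly sales can only be linked to yearly weather by rolling the sales up to the coarser, shared time key, losing the monthly detail in the process.

```python
from collections import defaultdict

# Invented figures: monthly sales vs. yearly weather at mismatched granularity.
monthly_sales = {("2013", "01"): 120, ("2013", "02"): 95, ("2014", "01"): 140}
yearly_weather = {"2013": "wet", "2014": "dry"}

# Aggregate sales up to the shared time key (the year) so the two can be linked.
sales_by_year = defaultdict(int)
for (year, _month), amount in monthly_sales.items():
    sales_by_year[year] += amount

for year, total in sorted(sales_by_year.items()):
    print(year, total, yearly_weather.get(year))
```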
Another problem with Big Data is testing and validating the results. The bigger the data set, the less ‘combined’ the data, the more complex the code, and the harder it is to spot whether the data (and results) retrieved are correct, especially for a novice. The algorithms used are less transparent and are another source of error. Returning to the Google ‘nowcasting’ prediction of flu, its validity has since been undermined, based (in the main) on the algorithms used (Hal, 2014).
3.5. Data Visualisation
Data visualisation is a technique to process datasets and present them in a way that allows people to understand the implications and interpretation of data analysis easily (Alsufyani et al., 2015). It provides benefits to businesses, since stakeholders and decision-makers can understand the complexity more easily and thus make better judgements in their decisions. Data visualisation has been used in industry for decades. For example, financial services use visualisation to present the daily data of stock market indexes, credit ratings and foreign exchange rates, including the quantity of transactions, the volume of trading and daily revenues and losses.
Adding to the issue of data validation (assuming the results are correct), providing a powerful visualisation can often skew the data further: the ‘artistic license’ that helps to produce an aesthetically pleasing visualisation often works better on a subset of the data, with outliers removed or with further categorisation of the data. Tables, bar charts, line graphs and pie charts may be ‘boring’, but they do provide a reasonably straightforward, recognised and meaningful way to represent data (Yau, 2011).
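As a deliberately plain counterpoint to over-designed visualisations, the sketch below (assuming matplotlib is available; the categories and counts are invented) draws an ordinary bar chart of the full data set, with no outliers removed and no decorative styling:

```python
import matplotlib.pyplot as plt

# Invented categories and counts; the point is the plainness of the chart.
crime_types = ["burglary", "robbery", "vehicle", "violent"]
counts = [420, 180, 310, 250]

plt.bar(crime_types, counts)
plt.xlabel("Crime type")
plt.ylabel("Reported incidents")
plt.title("Reported incidents by crime type (illustrative data)")
plt.show()
```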
3.5.1. Lies and Statistics
Lastly, the data concerns described above represent genuine issues that are difficult to control, spot and prevent. They largely arise from human error, whether unintentional or intentional.
Unintentionally, people are generally bad at statistics and decision making (Kahneman, 2011). When given an introduction to a young American man as “Steve is very shy and withdrawn, invariably helpful but with little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure and a passion for detail”, is Steve more likely to be a librarian or a farmer? (Kahneman, 2013, p. 7). Most people will say librarian; however, in America there are more than 20 male farmers for each male librarian, so statistically he is more likely to be a farmer (Kahneman, 2013). Another of Kahneman’s thought-provoking examples is “A bat and ball cost £1.10; the bat costs a pound more than the ball. How much does the ball cost?” (Kahneman, 2013, p. 44). The answer is 5 pence, but the answer we are drawn to giving is 10 pence. We have subconscious biases: we believe a suspect with ‘previous’ is more likely to commit a crime than a suspect with no criminal record (Mayer-Schönberger and Cukier, 2013). Of course computers will provide statistically correct answers to these questions, but will we trust them? Software will not do analysis for you, nor can it think for you (Bazeley, 2013). Theories may change but the reality is the same; consider the theory that the world was once flat.
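Writing the bat-and-ball constraint down shows why the intuitive answer fails. With b the price of the ball in pounds:

```latex
\[
b + (b + 1.00) = 1.10 \quad\Rightarrow\quad 2b = 0.10 \quad\Rightarrow\quad b = 0.05
\]
```

So the ball costs 5 pence and the bat £1.05; a 10-pence ball would make the total £1.20, not £1.10.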
Data findings are used to add value and validity to arguments. By intentionally employing some of the above ‘techniques’ or issues, the data and findings can be manipulated, massaged and presented in such a way as to give favourable results. Discussion of the dark side of data mining and of ethics is not included in this chapter; these concerns are slowly being realised somewhere behind this tidal wave of discovery and can readily be found elsewhere.
3.5.2. Is it an Accessible Commodity or a Commodity Which Creates a ‘Digital Divide’?
Using Big Data well requires resources, both physical and human. The fact that so many companies are storing data but not actively using it suggests they do not have the resources or understanding to analyse the data. Current literature implies that data warehousing/data mining projects are iterative and experimental, perhaps not a ‘one off’ project that can be costed and resourced (McAfee and Brynjolfsson, 2012). The functionality contained in data mining, whilst having huge potential for many departments, does not belong solely to one department (Smith, 2007), making the ‘ownership’ of the project difficult. The skills that have been traditionally linked with data analysis, such as mathematics, statistics and programming, whilst still required, are only associated with one stage of the BI project. A successful BI project will benefit as much from systems analysis and creative thinking skills (Smith, 2007). The cost of resourcing the projects, buying the data and training or recruiting staff in relevant skills is significant, and perhaps only realistic for the larger, profitable companies, creating a greater divide between companies who can and cannot utilise BI (Crawford, 2011).
If the power of Big Data is to reach the ‘man on the street’, or even small businesses, it seems likely it would be at an unquestioning, ‘black box’ level: data in, recommendation out. Whether we would want or trust it is another thing. Some people want to know the future and like predictions. Would it be useful to be given a ‘recommended job role’ based on any data? A holiday destination? A partner? We already have all of these systems, and in general we treat them with the respect they deserve. We ask: where did they get this information? We happily input the wrong data to get results we prefer. There are people quite happy to put faith in far worse, and even disproven, prediction techniques. Actual predictions are rarely given; we get a 10% chance of rain, a 50% chance of living until we are 90. These systems support our intuitions and still we remain in control. British society is one of freedom, learning by mistakes, innovation and creativity. Some people may trust the data as some trust astrology, but data management policy will have to prove itself before people will actively allow it to affect their lives.
3.5.3. Predicting the Likely Trends and the Future
Data visualisation can blend with business intelligence to understand market trends and predict likely movements based on analysis of historical data and user behaviours. It involves mathematical modelling that uses sophisticated technology based on Cloud Computing and Big Data methods to calculate complex derivatives and statistical analyses, with results presented in the form of a visualisation within seconds. The benefit is that it allows businesses to make more accurate and rapid decisions to stay competitive, or to maximise short windows of opportunity. As demonstrated by Chang (2014), Business Intelligence as a Service (BIaaS) can predict the likely trend of a certain investment under certain conditions. Predicting future trends can offer benefits for businesses in understanding more about their customers’ behaviours and preferred choices, and can be blended with decision-support systems to provide a more effective combination for businesses. However, the limitations are as follows. Firstly, it requires analysts with strong programming and system administration skills to achieve data visualisation. Secondly, most of the tools available in the market are either expensive or not easy to use. It will take years of substantial development for businesses to get affordable and easy-to-use tools for visualisation.
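As a toy illustration of the idea (not Chang's BIaaS method), the following sketch fits a simple linear trend to invented yearly figures with NumPy and extrapolates one year ahead; real trend prediction would use far richer models and data:

```python
import numpy as np

# Invented historical figures (revenue in £m per year).
years = np.array([2009, 2010, 2011, 2012, 2013])
revenue = np.array([1.2, 1.5, 1.4, 1.9, 2.1])

# Fit a straight-line trend and extrapolate one step ahead.
slope, intercept = np.polyfit(years, revenue, deg=1)
forecast_2014 = slope * 2014 + intercept
print(f"Projected 2014 revenue: {forecast_2014:.2f} £m")
```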
4. CONCLUSION
• Where there is data, there will be higher chances of error. Actively distrust the data: question it, look for bias and issues at each stage of the process, and look at the data through a different ‘lens’; your argument might be different from the original one. An expert team will recognise, document and make transparent the potential issues in validity.
• Data provides an appropriate thinking tool, which can help the user discover more, innovate and move a business forward. Exploit this by exploring theories and theorists.
• Lastly, there is so much data not being used. It is suggested that this indicates a general requirement to develop methodologies, standards, architecture, vertical alignment and software solutions for the field; a challenge which would benefit from bringing together skill sets from both industry and academia.
This work was previously published in the International Journal of Organizational and Collective Intelligence (IJOCI), 5(1); edited by Victor
Chang and Dickson K.W. Chiu, pages 115, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
REFERENCES
Alsufyani, R., Safdari, F., & Chang, V. (2015). Migration of Cloud Services and Deliveries to higher Education. In Emerging Software as a Service
and Analytics 2015 Workshop (ESaaSA 2015), in conjunction with CLOSER 2015, Lisbon, PT, 20 - 22 May 2015.
Attaran, H., & Hosseinian-Far, A. (2011) A novel technique for object oriented relational database design. IEEE 10th International Conference on
Cybernetic Intelligent Systems (CIS). London, UK. 10.1109/CIS.2011.6169147
Bazeley, P. (2013). Qualitative data analysis: Practical strategies. London: SAGE.
Berk, R. (2009). The role of race in forecasts of violent crime. Race and Social Problems , 1(1), 231–242. doi:10.1007/s12552-009-9017-z
Biesdorf, S., Court, D., & Willmott, P. (2013). Big Data: What's your plan? The McKinsey Quarterly , (2): 40–51.
British Computer Society. (2013) About BCS. BCS: The Chartered Institute for IT. [Online]Available from: http://www.bcs.org/category/5651>
[Accessed 7th February 2013].
Campbell, R., . . .. (2006) GIRO data quality working paper. [Online] Available from:
<www.actuaries.org.uk/system/files/documents/pdf/Francis.pdf> [Accessed 25th March 2014].
Chang, V. (2014). The business intelligence as a service in the cloud. Future Generation Computer Systems , 37, 512–534.
doi:10.1016/j.future.2013.12.028
Chui, M., Manyika, J., & Kuiken, S. V. (2014). What executives should know about 'open data'. The McKinsey Quarterly , (1): 102–105.
Crawford, K., & Boyd, D. (2011) A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society. Oxford Internet Institute.
September 21, 2011. [Online]. Available from:< http://softwarestudies.com/cultural_analytics/Six_Provocations_for_Big_Data.pdf> [Accessed
25th March 2014].
Cumbley, R., & Church, P. (2013). Is “Big Data” creepy? Computer Law & Security Report , 29(5), 601–609. doi:10.1016/j.clsr.2013.07.007
Dugas, A. F., et al. (2012). Google Flu Trends: Correlation with emergency department influenza rates and crowding metrics. CAD Advanced Access, January 8th 2012.
Ehrenberg, R. (2010). Predicting the next deadly manhole explosion. Wired July 7, 2010. [Online]. Available from:<
http://www.wired.com/wiredscience/2010/07/manhole-explosions/>/[Accessed 25th March 2014].
Frei, P., . . .. (2011). Use of mobile phones and brain tumours: Update of the Danish cohort study. BMJ, 343. [Online]. Available from: <http://www.bmj.com/content/343/bmj.d6387/> [Accessed 25th March 2014].
Gang-Hoon, K. I. M., Trimi, S., & Ji-Hyong, C. (2014). Big-Data Applications in the Government Sector. Communications of the ACM , 57(3), 78–
85. doi:10.1145/2500873
Gobble, M. M. (2013). Big Data: The Next Big Thing in Innovation.Research Technology Management , 56(1), 64–66.
doi:10.5437/08956308X5601005
Grbich, C. (2013). Qualitative data analysis: An introduction (2nd ed.). London: SAGE.
Hal, H. (n.d.). News: [dn25217] Google Flu Trends gets it wrong three years running. New Scientist, p. 24. doi:10.1016/S0262-4079(14)60577-7
Harding, J. (2013). Qualitative data analysis from start to finish. London: SAGE.
Hays, C. L. (2004). What Wal-Mart knows about customers' habits. New York Times, November 14th 2004. [Online]. Available from: <http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html?_r=0/> [Accessed 25th March 2014].
Hsinchun, C., Chiang, R. H. L., & Storey, V. C. (2012). Business Intelligence And Analytics: From Big Data To Big Impact.Management
Information Systems Quarterly , 36(4), 1165–1188.
Inmon, W. H. (2005). Building the data warehouse. Indianapolis, IN; Chichester: Wiley.
Kahneman, D. (2013). Thinking, fast and slow. New York: Farrar, Straus and Giroux.
Kim, W., Choi, B., Hong, E., Kim, S., & Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7(1), 81–99. doi:10.1023/A:1021564703268
Kimball, R., & Ross, M. (1996). The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, Solutions from
the Expert . New York: Wiley & Sons.
Kincheloe, J. (2005, June 1). On to the next level: Continuing the conceptualization of the bricolage. Qualitative Inquiry, 11(3), 323–350. doi:10.1177/1077800405275056
Marakas, G. M. (2003). Modern data warehousing, mining, and visualization: Core concepts. Upper Saddle River, NJ: Prentice Hall.
Marsh, C., & Elliott, J. (2008). Exploring data: An introduction to data analysis for social scientists (2nd ed.). Cambridge: Polity.
McAfee, A., & Brynjolfsson, E. (2012). Big Data: The Management Revolution. (cover story). Harvard Business Review , 90(10), 60–68.
McAfee, A. (2013). Data's biggest challenge: Convincing people not to trust their judgment. Harvard Business Review. [Online] Available from: <http://blogs.hbr.org/2013/12/big-datas-biggest-challenge-convincing-people-not-to-trust-their-judgment/> [Accessed 25th March 2014].
Ross, J. W., Beath, C. M., & Quaadgras, A. (2013). You May Not Need Big Data After All. Harvard Business Review , 91(12), 90–98.
Schultz, B. (2013, Fall). Big Data In Big Companies. Baylor Business Review , 32(1), 20–21.
ScienceDaily. (2013) Big Data, for better or worse: 90% of world's data generated over last two years. May 22nd 2013. [Online] Available
from<http://www.sciencedaily.com/releases/2013/05/130522085217.htm/>[Accessed 25th March 2014].
Smith, D. (2007). Data Model Overview. Teradata. [Online] Available from: <www.teradata.com/.../Data-Model-Overview-Modeling-for-the-
Enterprise-.>[Accessed 25th March 2014].
Thorpe, J. (2012). Data Humans and the New Oil. Harvard Business Review. [Online]. Available from:<http://blogs.hbr.org/2012/11/data-
humans-and-the-new-oil/?> [Accessed 25th March 2014].
Treiman, D. J. (2009). Quantitative data analysis: Doing social research to test ideas (1st ed.). San Francisco, CA: Jossey-Bass.
Vizard, M. (2013). Big Data Experience Is Much Needed--and Wanted (pp. 1–1). CIO Insight.
Yau, N. (2011). Visualize This: The Flowing Data Guide to Design, Visualization, and Statistics . Indianapolis: Wiley.
Section 2
Siddhartha Duggirala
IIT Indore, India
ABSTRACT
With the unprecedented increase in data sources, the question remains of how to collect data efficiently, effectively, and elegantly; store it securely and safely; and leverage that stocked, polished, and maintained data in a smarter manner, so that industry experts can plan ahead, take informed decisions, and execute them in a knowledgeable fashion. This chapter clarifies several pertinent questions and related issues arising from the unprecedented increase in data sources.
INTRODUCTION
We live in the age of data. Eric Schmidt famously claimed in 2010 that every two days we create as much data as was created in total from the beginning of written history up to 2003. With the proliferation of mobile devices, sensors, search logs, online searches, and digital social lives, we generate about 2,200 petabytes of data every day (Kirkpatrick, R., 2013).
Google, Amazon, Facebook, Twitter, Foursquare, McDonald's, and many other companies have built and enriched their empires using the data we generate (Kohavi, R., 2009).
● That being said, what is data? Data is a collection of facts, opinions, and responses.
● Is Big Data (or "extreme data", as some people like to call it) nothing but a hyped version of normal data? The major distinction comes from the 3 V's that characterize Big Data: Volume (petabytes per day), Variety (structured data such as RDBMS tables, and unstructured data such as search logs, tweets, images, videos, et cetera), and Velocity (real-time capture). While traditional data mainly sits in an RDBMS, Big Data (or extreme data) also encompasses domains of data storage other than normal structured data.
● A simple definition of Big Data would be "a massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional databases and software techniques" (Big Data: New frontiers of IT Management).
● Why do we have to store and analyze this data anyway? Simply put, there is a lot of potential in data which, when analyzed, can yield remarkable results. There has been a lot of research on this, and the following are a few reports you can go through (Bryant, R.E., 2008; Manyika, Brown, 2011).
Having answered the questions "Why Big Data?" and "What is Big Data?", let's answer the most relevant basic question for us: how do we store and leverage Big Data? Let's start by agreeing that Big Data isn't just data growth, nor is it a single technology; rather, it's a set of processes and technologies that can crunch through substantial data sets quickly to make complex, often real-time decisions. We will study the technological and technical advancements that fuelled the Big Data phenomenon in the next section.
In the third section we will move on to Hadoop, Sector/Sphere, and various other software frameworks that enable us to compute at Big Data scale.
TECHNICAL AND TECHNOLOGICAL ADVANCEMENTS
There are a lot of ways to store and analyze data. Let's move on to the technologies that have enabled us to analyze data, and also look at the various analyses that are predominantly used on Big Data.
A/B Testing
As the name suggests, in A/B testing we have to decide which of two versions, A or B, is better. To do this, both versions are tested simultaneously on comparable groups, and at the end we select the version that is more successful (Brain, 2012).
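As a minimal, illustrative sketch (not taken from the chapter), the code below compares the conversion rates of versions A and B with a simple two-proportion z-test; the visitor and conversion counts are made-up values, and a real experiment would also need proper sample sizing.

// Minimal two-proportion z-test sketch for comparing version A and version B.
public class ABTest {
    public static void main(String[] args) {
        long visitorsA = 10_000, conversionsA = 520;   // version A (illustrative counts)
        long visitorsB = 10_000, conversionsB = 590;   // version B (illustrative counts)

        double pA = (double) conversionsA / visitorsA;
        double pB = (double) conversionsB / visitorsB;

        // Pooled conversion rate under the null hypothesis "A and B perform equally".
        double pPool = (double) (conversionsA + conversionsB) / (visitorsA + visitorsB);
        double standardError =
                Math.sqrt(pPool * (1 - pPool) * (1.0 / visitorsA + 1.0 / visitorsB));
        double z = (pB - pA) / standardError;

        System.out.printf("rate A = %.4f, rate B = %.4f, z = %.2f%n", pA, pB, z);
        // |z| > 1.96 corresponds roughly to 95% confidence that the difference is real.
        System.out.println(Math.abs(z) > 1.96 ? "B differs significantly from A"
                                              : "No significant difference detected");
    }
}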
Association Rule Learning
A set of techniques for discovering interesting patterns and relationships among variables in large databases.
Beowulf Cluster
The Beowulf project started in the mid-1990s. It was initially a cluster of 16 DX4 processors connected by channel-bonded Ethernet links. The cluster has a hierarchical, parent-and-children structure: the client submits jobs to the parent node, which hands the jobs and data over to the children nodes for processing; the children send their output back to the parent node, which aggregates it, does some further processing, and returns the final output to the client. Writing the programs for the child and parent nodes can get a little tricky.
Classification
A set of techniques in which we assign new data points to different classes, based on training data points and their corresponding classes
(supervised learning).
Clustering/ Cluster Analysis
Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (with similar behaviour or features) into clusters.
Datacenter
A datacentre is typically built with a large number of servers unified through a huge interconnection network. Dennis Gannon claims: "The cloud is built on massive datacentres" (Gannon, 2010). Datacentres are evolving, but most are still built with commercially available components. The interconnection network among all servers in the datacentre cluster is a critical component of the datacentre. The network design must meet the following special requirements: low latency, high bandwidth, low cost, network expandability, message-passing-interface communication support, fault tolerance, and graceful degradation.
Datacentres are usually built at sites where leases and utilities for electricity are cheaper and cooling is more efficient. The modular datacentre in a container was motivated by demand for lower power consumption, higher computing density, and the mobility to relocate datacentres to better locations. A large-scale datacentre built from modular containers appears as a big shipping yard of container trucks. We also need to consider data integrity, server monitoring, and security management in datacentres, which can be handled more easily if the datacentre is centralized in a single large building (Josyula, Orr & Page, 2012).
The following papers discuss the construction of modular datacentres: Guo (2009) developed a server-centric BCube network for interconnecting modular datacentres, giving us a layered structure, and Wu (2009) proposed a network topology for inter-container connection, using the BCube network as a building block, named MDCube (for Modularized Datacentre Cube).
The software-defined datacentre ("Software defined datacenter," 2010) is an architectural approach to IT infrastructure that extends virtualization to all of the datacentre's resources and services in order to achieve IT as a service. In a software-defined datacentre, "compute, storage, networking, security, and availability services are pooled, aggregated, and delivered as software, and managed by intelligent, policy-driven software". Software-defined datacentres are often regarded as the necessary foundational infrastructure for scalable, efficient cloud computing. The software-defined datacentre, which is still in a nascent development phase, evolved primarily from virtualization. The term was coined in 2012 by then-VMware Chief Technology Officer Steve Herrod and became one of Computer Reseller News' 10 Biggest Data-center Stories of 2012. Though some critics see the software-defined datacentre as a marketing tool and "software-defined hype", proponents believe that datacentres of the future will be software-defined (Zhang, 2013).
If you want to delve deeper into datacentre architecture, you can refer to the following reports (Josyula, 2011).
Data Mining
Data mining is the application of specific algorithms for extracting patterns from data. (Fayyad, 1996)
Distributed Systems
Multiple computing devices connected through a network and used to solve a common computational problem. Cluster computing, cloud computing, and even the most popular client-server systems (the Internet) are distributed systems. These systems may share a common distributed file system, and they offer advantages such as scalability and reliability (Hwang, 2012). (see Figure 1)
Machine Learning
The ability of a computer to learn and adapt to a situation without being explicitly programmed (Mohri, 2012).
Massively Parallel Processing
These are very large-scale clusters used to achieve massive parallelism in computation. A cluster can consist of homogeneous CPUs, or it might contain CPUs plus GPUs or floating-point accelerators, introducing massive parallelism. CUDA is a parallel programming architecture developed by NVIDIA. A few applications of CUDA are SETI@Home, medical analysis and simulation based on CT and MRI scan images, accelerated 3D graphics, cryptography, inter-conversion of video file formats, and the single-chip cloud computer through virtualization in many-core architectures (Appuswamy, 2013).
Network Attached Storage
An FTP server stores files on a system connected to a network and serves them on demand. Along similar lines, network-attached storage (NAS) is file-level data storage connected to a computer network, providing data access to a diverse group of clients (workstations, servers, laptops, tablets, etc.). Unlike an FTP server, NAS is built specifically to store and serve files, and is specialized for this task by its software, hardware, or configuration (Rouse, 2013; HMW Magazine, 2003).
Software Defined Networking
In the SDN architecture, the control and data planes are decoupled, network intelligence and state are logically centralized, and the underlying network infrastructure is abstracted from the applications. As a result, enterprises and carriers gain unprecedented programmability, automation, and network control, enabling them to build highly scalable, flexible networks that readily adapt to changing business needs ("Software Defined Networking," 2012). In (Dürr, 2012) the author introduced a form of SDN that is assisted by the cloud. The idea is to pull computationally complex and memory-intensive network management operations, such as optimized route calculation, the management of network state information, multicast group management, and accounting, out of the network and implement them in one or, to increase reliability, multiple datacentres ("in the cloud"). The network itself is only responsible for forwarding packets. (see Figure 2)
Figure 2. Network access broker assisted SDN
The folks at Packet Design introduced a network access broker in addition to the controller in SDN. The role of the network access broker is to verify whether the network can handle the traffic demands of an application without adversely impacting other applications ((Sverdlik, 2013), (Alaettinoglu, 2013)).
Software Defined Storage
Software-defined storage (SDS) is an approach in which data storage is provisioned and managed programmatically: the storage-related tasks are decoupled from the physical storage hardware.
Software-defined storage puts the emphasis on storage services such as deduplication or replication, instead of on storage hardware. Without the constraints of a physical system, a storage resource can be used more efficiently and its administration can be simplified through automated policy-based management. Storage can, in effect, become a shared pool that runs on commodity hardware (Rouse, 2010).
Let’s go into some of the recent developments in storage space where different vendors unveiled their products:
● Storage giant EMC released hardware solutions, new VNX models with further improved flash memory, alongside software storage solutions with its MCx software. From a software storage perspective, ViPR will enable EMC customers to view objects as files.
● Pure Storage, a provider of an all-flash memory based storage array, is a good choice for deployment within the traditional IT
infrastructure or within the emerging software-defined infrastructure. The company is focused on replacing spinning disk with flash-based
arrays.
● Red Hat's advantage is the ability to deploy storage software on commodity server and storage combinations, which yields low-cost storage arrays for use as file and object servers along with support for Hadoop and OpenStack environments. Version 2.1 of Red Hat Storage Server includes improvements to its network-based asynchronous data replication product Geo-Replication, and further interoperability with Windows, including full support for SMB 2.0 and Active Directory. In addition, Storage Server has been integrated into Red Hat's Satellite management software, enabling easier installation and deployment of the storage software.
● IBM's new FlashCache Storage Accelerator uses software to boost flash performance.
● It's important to realize that no storage product today fits directly within an application server stack within a software-defined infrastructure. However, some companies are developing data replication within the stack to move the function closer to the data and relieve the storage platform of this task.
Virtualization
A conventional computer has a single OS image. This creates a rigid architecture that tightly couples application software to a specific hardware platform. An application that works well on one machine may not be executable on another platform with a different instruction set under a fixed OS. Virtual machines (VMs) offer novel solutions to the underutilization of resources, application flexibility, software manageability, and security concerns in physical machines (Hurwitz, 2010). (see Figure 3)
Figure 3. Different virtualization schemas
The VM approach makes the OS and applications independent of the hardware. Figure 3 shows the different virtualization schemas: Figure 3(a) shows a normal, traditional system; in Figure 3(b) a hypervisor (bare-metal VMM) runs directly on the hardware, controlling the I/O, CPU cycles, and so on; in Figure 3(c) a host OS runs the VMM, which in turn hosts the guest OS. A fourth type of virtualization runs part of the VMM in privileged mode and part in client mode; in this case the host OS may have to be modified.
Virtualization can be done at different levels, each with its own merits and demerits:
● Hardware abstraction Layer level (VMware, VirtualPC, Xen, User mode Linux)
Virtual clusters are built with VMs installed on distributed servers from one or more physical clusters. The VMs in a virtual cluster are interconnected logically by a virtual network across several physical networks, and the provisioning of VMs to a virtual cluster is done dynamically. VMs also allow functionality that is currently implemented within datacentre hardware to be pushed out and implemented on individual hosts. Datacentres must be virtualized to serve as cloud providers. Virtual infrastructure managers are used to create VMs and aggregate them into virtual clusters as elastic resources: Nimbus and Eucalyptus essentially support virtual networks, while OpenNebula has additional features to provision dynamic resources and make advance reservations. vSphere 4 uses the ESX and ESXi hypervisors from VMware and supports virtual storage in addition to virtual networking and data protection. For the relation between Big Data and virtualization, you can refer to http://www.dummies.com/how-to/content/the-importance-of-virtualization-to-big-data.html3/
Other techniques that have helped greatly in the analysis of Big Data include ensemble learning (using multiple predictive models to obtain better predictive performance; a type of supervised learning), natural language processing, neural networks, optimization, pattern recognition, predictive modelling, regression, spatial analysis, sentiment analysis, time-series analysis, unsupervised learning, and visualization.
PROGRAMMING ENVIRONMENT FOR BIG DATA
Together, the 3 V’s trends have led to the rise of a new breed of databases, generally referred to as NoSQL (Not Only SQL). Let’s explore these
characteristics and the technologies that have evolved to support them.
Because the new database technologies each addressed different issues, they ended up differing significantly in their feature sets, data models, query languages, and architectures. Four major patterns emerged from real-world usage: the key-value store, the column-family database, the document database, and the graph database. A fifth technology, Hadoop, oriented at large-scale batch analytics, also emerged. One thing that brought these technologies together, despite their differences, was that none of them were relational.
Hence the term NoSQL stuck, distinguishing them from traditional databases. Some took this as a negative overtone, implying an imperative to move away from SQL. Today the term NoSQL is used inclusively and is generally understood to mean "not only SQL", proclaiming a multi-lingual database landscape that extends beyond, but nonetheless also includes, relational databases (Sears, 2006).
For processing large amounts of data, it isn't possible to store and process them in a single system, as this hits the physical constraints of the system (Sears, 2006). One way is to do the computation in parallel on a distributed system or on a supercomputer (though the latter might be a little expensive). As we have seen, a distributed system is just a collection of systems connected by a network to achieve the common goal of running a job or application, while parallel processing is the simultaneous use of more than one computational unit to run an application.
Running a job in a distributed environment has its own merits and demerits. The merits are better resource utilization, good response time, and increased throughput; the demerit is that running a program can be complicated. So let's first discuss the issues involved in a typical execution of a parallel program:
• Computation Partitioning: We need to split the job/program into smaller tasks so that the job can be executed in parallel by the
workers.
• Data Partitioning: This is splitting the input or intermediate data into smaller pieces. Data pieces may be processed by different parts of
a program or a copy of the same program.
• Mapping: Assigning smaller parts of program or smaller pieces of data to resources is what this process is about. This is usually handled
by resource allocators in the system.
• Synchronization: Synchronization between different workers is necessary, as without coordination race conditions may occur and data dependencies might hamper the execution of the process.
• Communication: Communication between different worker nodes is necessary for coordination and even for intermediate result
transfer.
• Scheduling: This determines which subparts of the program, or even which different jobs, are to be processed and when.
Specialized programming knowledge is necessary for parallel processing, which might affect the productivity of programmers. So it is better for an ordinary programmer to have a layer of abstraction over the low-level primitives. MapReduce, Hadoop, and Dryad and DryadLINQ from Microsoft are a few recently introduced programming models. They were originally developed for data retrieval applications, but they have been shown to work well for other applications (Gunarathne, 2010) and have even been shown to outperform MPI (Ekanayake, 2010).
MapReduce
MapReduce is a linearly scalable programming model. The programmer writes just two functions: a map function and a reduce function, each of
which defines a mapping from one set of key-value pairs to another. These functions are ignorant about the size of the data or the cluster that
they are operating on, so can be used unchanged for a small dataset and for a mammoth one.
The input to the Map function is a (key, value) pair, and the output from the Map is also a set of (key, value) pairs. This intermediate output is sorted by key and grouped together. The Reduce function then receives each intermediate key together with the group of intermediate values associated with that key.
Let's consider the word-count problem, in which the input is a text file and the output is the set of words in the file together with the number of times each occurs. How would we solve it using the MapReduce framework? A sketch of the Map and Reduce functions for this job is given below; let's analyse it to get a bird's-eye view of what is happening:
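The chapter's original listing is not reproduced here; what follows is a minimal sketch written against Hadoop's Java MapReduce API. The class names, and the choice to lower-case each word, are assumptions of this sketch rather than part of the original text.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (line offset, line text) -> (word, 1) for every word in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total number of occurrences).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. quotes.txt in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}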
In the main (driver) function, when the job is given an input file, the framework automatically computes the blocks of that file in the underlying distributed filesystem (GFS in Google's implementation, HDFS in Hadoop). For each block a worker called a mapper is assigned. What a mapper receives is basically the line's position relative to the file and the line itself: the position is the key and the line is the value. The mapper simply emits each word in the line together with the number 1 for each occurrence. After the Map phase, the intermediate output is sorted and grouped together. This grouped intermediate output is then sent to individual reducers, which aggregate the counts, calculate the number of occurrences, and return the final output: the words and their corresponding counts.
For example, suppose we input a file named quotes.txt, which is already in the filesystem, and that quotes.txt has the following two lines: "Don't live in the past" and "People who live in past are generally haunted by ghosts of past".
Let's assume that each line is in a different block (which is not generally the case), so there are two blocks and we assign one mapper to each block.
The first mapper receives block 1, which holds the key-value pair (1, "Don't live in the past"), and the second mapper receives block 2, which holds the key-value pair (2, "People who live in past are generally haunted by ghosts of past").
Mapper 1 emits (lower-casing each word) {("don't", 1), ("live", 1), ("in", 1), ("the", 1), ("past", 1)}. Mapper 2 emits {("people", 1), ("who", 1), ("live", 1), ("in", 1), ("past", 1), ("are", 1), ("generally", 1), ("haunted", 1), ("by", 1), ("ghosts", 1), ("of", 1), ("past", 1)}.
The framework combines the output of the two mappers, sorts it, and groups it as follows: {("don't", 1), ("people", 1), ("who", 1), ("live", {1, 1}), ("in", {1, 1}), ("the", 1), ("past", {1, 1, 1}), ("are", 1), ("generally", 1), ("haunted", 1), ("by", 1), ("ghosts", 1), ("of", 1)}, which is then sent to the reducer. The reducer just aggregates (in this case, sums) and gives the final output: {("don't", 1), ("people", 1), ("who", 1), ("live", 2), ("in", 2), ("the", 1), ("past", 3), ("are", 1), ("generally", 1), ("haunted", 1), ("by", 1), ("ghosts", 1), ("of", 1)}.
The data flow we observed above is the logical data flow. First data partitioning is done, then computation partitioning, determining the master and workers, reading the input data, the Map function, the Combiner function, the Partitioning function, synchronization, communication, sorting and grouping, and finally the Reduce function.
MapReduce might sound like quite a restrictive programming model, and in a sense it is: you are limited to key and value types that are related in
specified ways, and mappers and reducers run with very limited coordination between one another (the mappers pass keys and values to
reducers). A natural question to ask is: can you do anything useful or nontrivial with it?
The answer is yes. MapReduce was invented by engineers at Google as a system for building production search indexes because they found
themselves solving the same problem over and over again (and MapReduce was inspired by older ideas from the functional programming,
distributed computing, and database communities), but it has since been used for many other applications in many other industries. It is
pleasantly surprising to see the range of algorithms that can be expressed in MapReduce, from image analysis, to graph-based problems, to
machine learning algorithms. It can't solve every problem, of course, but it is a general data-processing tool. The major point to note is that after data partitioning and computation partitioning, mappers are allocated mostly on the nodes where the data resides. This property, called data affinity, is a major performance enhancer: it is a model of taking the computation to where the data resides.
One major disadvantage of MapReduce is that its communication overhead can be quite high compared to MPI, for two reasons:
1. MapReduce reads and writes via files, whereas MPI transfers information directly between nodes over the network.
2. MPI doesn't transfer all data from node to node, while MapReduce uses full data flow.
Modifying classical MapReduce with two changes, streaming information between steps instead of writing to disk and using long-running threads or processes to communicate the partial flows, leads to performance increases at the cost of poorer fault tolerance and reduced support for dynamic changes such as in the number of available nodes (Fox, 2010). This has been studied in several projects (Malewicz, 2009; Bu, 2010). Twister (SALSA Group, 2010) is another parallel programming paradigm (MapReduce++), whose run-time implementation architecture is described in the original papers. Twister is much faster than traditional MapReduce ((Ekanayake, 2010), (Zhange, 2010)). It distinguishes the static data, which is never reloaded, from the dynamic partial flow that is communicated (Chen, 2012).
APACHE HADOOP
Hadoop is an open source implementation of Google’s GFS and MapReduce (Apache Software Foundation, 2013). It’s implemented in Java.
Although Hadoop is best known for MapReduce and its distributed file system (HDFS, renamed from NDFS), the term is also used for a family of
related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing (Apache Hadoop
components,).
● MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
● HDFS: A distributed file system that runs on large clusters of commodity machines.
● Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
● Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is
translated by the runtime engine to MapReduce jobs) for querying the data.
● HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style
computations using MapReduce and point queries (random reads).
● Zookeeper: A distributed, highly available coordination service. Zookeeper provides primitives such as distributed locks that can be used
for building distributed applications.
Let’s go into the Hadoop starting from the Hadoop Distributed File System. (see Figure 4)
Figure 4. Hadoop architecture
HADOOP DISTRIBUTED FILE SYSTEM
The Hadoop filesystem is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. The design is not suitable for low-latency data access, large numbers of small files, multiple writers, or arbitrary file modifications. All access to files on HDFS is based on blocks: a block is the minimum amount of data that can be read or written. In general the block size is large (the default is 64 MB) compared to disk blocks; the reason is to minimize the cost of seeks. A file in HDFS is divided into independent blocks, as in a normal filesystem.
An HDFS cluster has two types of nodes operating in a master-slave pattern: a namenode (the master) and several datanodes (the slaves). The namenode maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored on the namenode's local disk in two files: the namespace image and the edit log. The namenode also knows the datanodes on which the blocks of a file are located, but this information is reconstructed at system start rather than stored persistently.
A client accesses the filesystem by communicating with the namenode and datanodes. The client uses a POSIX-like interface, so user code does not need to know about the namenode and datanodes to function.
Datanodes are the workers of the filesystem. They store and retrieve blocks as directed by the namenode or clients, and they report back to the namenode periodically with the list of blocks they are storing.
If the namenode is lost, all the data would effectively be lost, as we could no longer relate data blocks to files. A better option, therefore, is to configure Hadoop so that the namenode writes its persistent state to multiple filesystems. We can also run a secondary namenode, which periodically merges the namespace image with the edit log.
When we have a very large cluster with many files, the memory of the namenode becomes the limiting factor for scaling (Hadoop Distributed File, 2013). HDFS Federation, introduced in the 2.x release, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. The namespace volumes managed by different namenodes are independent of each other, implying no communication between them; the failure of one namenode does not affect the availability of the namespaces managed by the other namenodes.
Even if we store the persistent state of the namenode or use a secondary namenode, this does not provide high availability of the namenode: the namenode is still a single point of failure. To remedy this problem, support for HDFS high availability has been added in the 2.x release. There is a standby namenode alongside the active namenode; when the active namenode fails, the standby takes charge and becomes the active namenode processing requests.
Hadoop can also use other generic filesystems. When a client requests a file read, it first contacts the namenode to get the locations of the file's blocks. After getting the addresses of the blocks, the client accesses the datanodes and reads the data directly from them; after the read, the stream is closed. (see Figure 5)
Figure 5. Hadoop read
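To make this read path concrete, here is a minimal sketch using Hadoop's Java FileSystem API; the file URI is illustrative, and the cluster configuration (core-site.xml, hdfs-site.xml) is assumed to be available on the classpath.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Reads a file from HDFS and copies it to standard output.
public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                       // e.g. hdfs://namenode:8020/user/alice/quotes.txt
        Configuration conf = new Configuration();   // picks up the cluster's site configuration
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // The client asks the namenode for block locations, then streams
            // the blocks directly from the datanodes that hold them.
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}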
Writing to a file in HDFS is very similar, except that instead of reading, the client sends write packets and the datanode returns acknowledgement packets.
To achieve failover and availability of data, blocks are replicated to multiple other datanodes, so even if a datanode fails, another datanode that holds the same block can serve the request. By default the replication factor in Hadoop is 3, so each block is replicated on two other datanodes.
There are tools like Flume (Flume, 2012) and Sqoop to move data into HDFS; they are worth considering rather than writing our own application.
MapReduce
We have seen the execution of a MapReduce task earlier in this chapter. Here, let's study how MapReduce is implemented in Hadoop. A new version of MapReduce has been implemented in newer releases, so let's start with classical MapReduce and then move to YARN (MapReduce 2).
In the classic version of MapReduce (Figure 6) there are four independent entities:
Figure 6. Job execution in classical MR
● The distributed filesystem (normally HDFS), which stores and serves files for the job and in job processing as well.
● The client, which submits a job by requesting a unique job ID from the jobtracker, validates the output specification of the job, and computes the input splits.
● After being assigned the job ID, the framework copies the resources needed to run the job (the JAR file, the configuration file, and the computed input splits) and informs the jobtracker that the job is ready for execution.
● After this, the job is placed on an internal queue and is scheduled by the scheduler (fair scheduler, capacity scheduler, or a simple FIFO queue). Initialization involves creating an object to represent the job being run.
● Having chosen a job, the jobtracker now chooses a task for the job. The jobtracker receives a periodic heartbeat from the tasktrackers, telling it that they are alive. As part of the heartbeat, a tasktracker indicates whether it is ready to run a new task; if it is, the jobtracker allocates it a task.
● Choosing a map task takes into account the tasktracker's network location: the jobtracker picks an input split that is as close to the tasktracker as possible (it tries to make the task data-local). For a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks (there are no data-locality constraints).
o If a task reports progress, it sets a flag to indicate that the status change should be sent to the jobtracker. The flag is checked in a separate thread every three seconds, and if it is set, the thread notifies the jobtracker of the current task status. Meanwhile, the tasktracker sends heartbeats, and the status of all the tasks being run by the tasktracker is sent in the call.
o The jobtracker combines these updates to produce a global view of the status of all the jobs being run and their constituent tasks.
o After the job execution is done, the jobtracker prints a message to tell the user. Job statistics and counters are printed to the console at this point (and an HTTP job notification is sent if configured). Finally, the jobtracker cleans up its working state.
YARN (MAP REDUCE 2)
YARN addresses the scalability deficiencies of "classic" MapReduce by splitting the responsibilities of the jobtracker into two separate entities: an application master and a resource manager.
The basic idea is that the application master negotiates with the resource manager for cluster resources and then runs the application. Node managers oversee the containers running on cluster nodes, ensuring that an application does not use more resources than it has been allocated.
● The YARN global resource manager (which coordinates the allocation of compute resources on the cluster.)
● The YARN node manager per slave node (which launch and monitor the compute containers on machines in the cluster).
● The MapReduce application master per application (which coordinates the tasks running the MapReduce job).
● The distributed filesystem (normally HDFS, which is used for sharing job files between the other entities).
● The first stage, job submission, is very similar to the classic implementation, except that the new job ID (in YARN called the application ID) is retrieved from the resource manager rather than the jobtracker. The job client checks the output specification of the job; computes the input splits (although there is an option to generate them on the cluster, which can be beneficial for jobs with many splits); and copies the job resources (including the job JAR, configuration, and split information) to HDFS. Finally, the job is submitted.
● After receiving the job, the resource manager hands the request over to the scheduler. The scheduler allocates a container, and the resource manager then launches the application master's process there. The application master initializes the job by creating a number of bookkeeping objects to keep track of the job's progress. Next, it retrieves the input splits computed by the client from the shared filesystem and creates a map task object for each split, as well as a number of reduce task objects. If the job is small, the application master may choose to run the tasks in the same JVM as itself (an "uberized" job, run as an uber task). Before any tasks can run, the job's output directory is created (whereas in MapReduce 1 this is done in a special task run by the tasktracker). If the job does not qualify for running as an uber task, the application master requests containers for all the map and reduce tasks of the job from the resource manager. Using the map tasks' data locality, the scheduler makes scheduling decisions (just as a jobtracker's scheduler does), preferring rack-local placement to non-local placement. Requests also specify memory requirements for tasks. Memory allocation works differently in YARN: applications may request a memory capability anywhere between a minimum allocation and a maximum allocation, and it must be a multiple of the minimum allocation (a configuration sketch appears after this list).
● Once a task has been assigned a container by the resource manager's scheduler, the application master starts the container by contacting the node manager directly. Before the task can run, the node manager localizes the resources that the task needs (including the job configuration and JAR file, and any files from the distributed cache; the same step is taken in MapReduce 1). Finally, it runs the map or reduce task. The task executes in a dedicated JVM (no JVM reuse is possible).
● When running under YARN, the task reports its progress and status (including counters) back to its application master (whereas in MapReduce 1 this is done via the jobtracker). The client polls the application master every few seconds to receive job progress updates.
● On job completion, the client checks whether the job has finished. Notification of job completion through an HTTP callback is also supported (as it is in MapReduce 1); in MapReduce 2 the application master initiates the callback (the jobtracker does this in MapReduce 1). On completion, the application master and the task containers clean up their working state, and job information is archived by the job history server to enable later interrogation by users if desired. The main enhancements of the YARN architecture are scalability, agility, support for MapReduce, improved cluster utilization, and support for workloads other than MapReduce (Hortonworks). Failures are one of the major issues that must be taken care of by the framework, so let's have a look at the different failure scenarios and how they are handled in MapReduce 1.
o Job Failure: The cause might be an error in the user's map/reduce code, in which case a runtime exception is thrown. It may also be a sudden exit of the child JVM, in which case the tasktracker notices that the process has exited and marks the attempt as failed. If the tasktracker receives no progress update for a while, the task is marked as failed and the child JVM is killed. When the jobtracker is notified of a task attempt that has failed (via the tasktracker's heartbeat call), it reschedules the task execution (trying to avoid rescheduling the task on a tasktracker where it has previously failed; furthermore, if a task fails a pre-configured number of times, the job will exit and return an error code to the client).
o Tasktracker Failure: If a tasktracker fails by crashing or running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently). The jobtracker notices that the tasktracker has stopped sending heartbeats and arranges for the map tasks that ran and completed successfully on that tasktracker to be rerun if they belong to incomplete jobs; any tasks in progress are also rescheduled.
o Jobtracker Failure: Hadoop has no mechanism to deal with jobtracker failure. It is a single point of failure, so in this case all
running jobs fail.
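Before moving on to failures in YARN, here is a minimal sketch of how a client-side job configuration selects YARN and states per-task memory requests, as referenced in the list above; the property names are standard Hadoop 2.x settings, while the host name and memory values are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Configures a MapReduce job to run on YARN and states its memory requests.
public class YarnJobConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");          // use YARN instead of the classic jobtracker
        conf.set("yarn.resourcemanager.hostname", "rm-host");  // illustrative resource manager host
        conf.setInt("mapreduce.map.memory.mb", 1536);          // container memory requested per map task
        conf.setInt("mapreduce.reduce.memory.mb", 3072);       // container memory requested per reduce task

        Job job = Job.getInstance(conf, "word count on YARN");
        // ... set the mapper, reducer, input and output paths as in the earlier word-count sketch ...
    }
}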
Failures in YARN
For MapReduce programs running on YARN, there is the possibility of failure in any of the entities: the task, the application master, the node manager, and the resource manager.
● Task Failure: Runtime exceptions and sudden exits of the JVM are propagated back to the application master, and the task attempt is
marked as failed.
● Application Master Failure: Applications in YARN are tried multiple times in the event of failure. In the case of the MapReduce application master, it can recover the state of the tasks that were already run by the (failed) application master, so they don't have to be rerun.
● Node Manager Failure: If a node manager fails, it will stop sending heartbeats to the resource manager, and the node manager will be
removed from the resource manager’s pool of available nodes.
● Resource Manager Failure: The resource manager was designed from the outset to be able to recover from crashes, by using a checkpointing mechanism to save its state to persistent storage.
So far we have seen how data is stored in HDFS and the features of HDFS, and how data can be processed using MapReduce in both the old and new implementations. But it would be nice if we could query the data with a simpler set of commands rather than writing MapReduce programs for each query. In this part we will look at some Apache projects that are very useful for querying data.
So, let's move on to data query language frameworks on top of Hadoop:
Apache PIG
Pig raises the level of abstraction for processing large datasets (Pig,). In MapReduce, the programmer specifies a Map function followed by a Reduce function; trying to fit every data processing task into this pattern often requires multiple MapReduce stages and can get challenging. With Pig, the transformations we can apply to data are much richer, thanks to its richer data structures, which are typically multivalued and nested, and its operators (joins, for example). Pig is made up of two pieces: Pig Latin, the language for expressing data flows, and the execution environment for running Pig Latin programs (both local and cluster execution are supported).
Pig is very supportive of the programmer writing a query, and it is designed to be a very extensible language. However, it is designed for batch processing, not for low-latency queries. Pig's support for complex, nested data structures differentiates it from SQL, which works on flatter data structures. Also, Pig's ability to use UDFs and streaming operators that are tightly integrated with the language, together with its nested data structures, makes Pig Latin more customizable than most SQL dialects.
Hive
Hive is a data warehouse system for Hadoop. It facilitates ad-hoc queries, the analysis of large datasets, and easy summarization of data stored in Hadoop-compatible file systems. Hive provides a SQL-like language to query the data, while still allowing traditional map/reduce programs to be plugged in. (see Figure 8)
Figure 8. Hive architecture
Among the Hive services shown in Figure 8, the CLI is the usual, default command-line interface. Hive Server runs Hive as a Thrift service, making it possible to connect through various clients such as Thrift, JDBC, and ODBC applications. The metastore database is where all the schemas are stored. The filesystem service connects to the shared distributed filesystem, here the Hadoop cluster. By default the metastore database is stored in Derby, an embedded database that allows only one session at a time; to use multiple sessions we can configure Hive to use a MySQL database, or even a remote database so that the metastore DB is completely firewalled.
HiveQL is highly inspired by MySQL. Hive doesn't support updates, but you can use commands to insert into an already existing table. Hive is a schema-on-read kind of database, as opposed to the schema-on-write approach of a traditional RDBMS, which gives shorter load times at the cost of longer query times. At the time of writing, Hive supports only compact and bitmap indexing techniques.
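As a small illustration of connecting to the Hive Server through JDBC, the sketch below runs a HiveQL query; the driver class and jdbc:hive2 URL are those of HiveServer2, while the host, port, credentials, and the word_counts table are illustrative assumptions. Each statement is compiled by Hive into one or more MapReduce jobs.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Submits a HiveQL query through the Hive Server's JDBC interface.
public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS freq FROM word_counts " +
                    "GROUP BY word ORDER BY freq DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}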
HBase
HBase is a distributed, column-oriented database built on top of Hadoop, designed to perform read and write operations on very large datasets. Scalability is the major distinguisher between an RDBMS and HBase. Its characteristics include:
● Automatic partitioning
● Scalability
● Commodity hardware
● Fault Tolerance
● Batch Processing
● No indices
Applications store data in named tables. Each table has rows and columns, which intersect to give table cells, and cells are versioned. All table access is done through the primary key (the row key); row keys are byte arrays, so any data type or even data structure can serve as a key. All rows are sorted in byte order on the row key. Row columns are grouped into column families, and data is stored on the filesystem per column family, so the column families have to be specified up front in the schema definition.
A table is initially stored on a single server; as the table grows beyond some configurable size, it is divided into regions and the new regions are stored on other servers. Row updates are atomic. Similar to Hadoop's master-slave structure, HBase has an HBase master that manages the various region servers (slaves), bootstraps a new install, and recovers from region-server failover. HBase keeps vital information, such as the root catalog table and the address of the current master, on a ZooKeeper cluster. By default there is an instance of ZooKeeper managed by HBase itself, but it can be configured to use an external ZooKeeper cluster.
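The sketch below shows a single put and get against a table using HBase's classic Java client (the HTable API of the 0.9x-era releases; newer versions use ConnectionFactory and Table instead). The table name, column family, row key, and values are illustrative assumptions, and the table is assumed to have been created beforehand.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Writes and reads one cell of an existing table, e.g. created with:  create 'users', 'info'
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml (ZooKeeper quorum etc.)
        HTable table = new HTable(conf, "users");
        try {
            // Row key, column family, qualifier and value are all byte arrays.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("user-42"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}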
There are tools like Sqoop which can be used to migrate data from traditional databases. Researchers have studied the Hadoop way of doing things and identified the advantages and disadvantages of this approach (Vinayak, 2012). The emphasis is on the architectural issues of the recently developed components and layers and their use (and misuse). A new approach has been taken in the ASTERIX project at UC Irvine. This project started in 2009 with the objective of creating a new parallel, semi-structured information management system.
The bottom-most layer of the ASTERIX stack is a data-intensive runtime called Hyracks (refer to Figure 9). Hyracks sits at roughly the same level of the architecture as Hadoop does when used under higher-level languages such as Pig and Hive. The topmost layer of the ASTERIX stack is a full parallel DBMS with a flexible data model (ADM) and query language (AQL) for describing, querying, and analyzing data. AQL is comparable to languages such as Pig, Hive, or Jaql, but ADM and AQL support both native storage and indexing of data as well as access to external data residing on a distributed file system. In between these two layers sits Algebricks, a model-agnostic, algebraic "virtual machine" for parallel query processing and optimization. Algebricks is the target for AQL query compilation, but it can also be the target for other declarative data languages. Nephele/PACT (Battre, 2010) is another data-intensive computing project, investigating "Information Management on the Cloud". The system itself has two layers: the PACT programming model and the Nephele execution system, roughly analogous to Hyracks, which we have just described, but a bit more low level. PACT is a generalization and extension of the MapReduce programming model with a richer operator set, including several binary operators such as Cross, CoGroup, and Match, giving it sufficient natural expressive power to cover relational algebra with aggregation.
Figure 9. ASTERIX architecture
So far we have studied mainly the data storage and processing architectures in the Hadoop realm. Now let's study some large-scale graph processing systems.
Graphs are extensively used in many application domains, including social networks, interactive games, online knowledge discovery, computer networks, and the world-wide web. The popularity and size of social networks pose new challenges: the number of active users on Facebook is more than 1 billion. These massive systems are a direct application of graph structure, with nodes representing entities, edges denoting relationships between entities, and search based on the relationships between entities. We cannot keep such a gigantic structure on a single server. In addition, current distributed systems are not suitable for interactive online query processing of large graphs. In particular, the relational model is ill-suited to graph query processing (Grzegorz, 2010), making reachability queries awkward to express. Distributed graph processing systems offer a viable solution to the problems mentioned above.
In the following sections, let's study a few of the many graph databases and some large-scale graph computation technologies: Neo4j from Neo Technology, GraphChi, and Trinity from Microsoft Research.
NEO4J
One NoSQL technology that has emerged to manage connected data is the graph database. Graph databases are more and more commonly found in applications where the data model is connected, including social, telecommunications, logistics, master data management, bioinformatics, and fraud detection applications (Hunger, 2010). Graph databases, like the other NoSQL databases, follow a pattern where the data is in effect the schema. In a graph database, individual data items are represented as nodes. Unlike other NoSQL systems, graph databases offer full ACID properties. The other major difference between graph databases and other databases is that the connections between nodes (i.e., the relationships) directly link nodes, in such a way that relating data becomes a simple matter of following relationships. The join problem is tackled by specifying relationships at insert time, rather than computing them at query time. (see Figure 10)
Figure 10. GRAPH database
What is Neo4j?
"Neo4j is a robust, highly scalable, high performance graph database". The following are a few highlights of this database:
● High availability
● ACID
● Scale to billions of nodes and relationships (here node means a vertex in the graph, not the computing node we have been talking about until now)
● High speed query through traversal (typically million traversals per second).
The fundamental units that form a graph are nodes and the relationships between those nodes. In Neo4j, both nodes and relationships can have properties; additionally, nodes can be labelled with zero or more labels.
A relationship connects two nodes and is guaranteed to have a valid start node and end node (note that the end node and the start node can be the same node). As relationships are directed, you can view them as incoming or outgoing relationships. (see Figure 11)
Figure 11. Using relationships and its types
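To give a feel for the programming model, the sketch below uses the embedded Java API of the Neo4j 2.x line to create two nodes, link them with a typed relationship, and traverse it; the store path, the relationship type, and the property names are illustrative assumptions, not part of the original text.

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Creates two nodes, links them with a typed relationship, and reads it back.
public class Neo4jExample {
    private enum RelTypes implements RelationshipType { KNOWS }

    public static void main(String[] args) {
        GraphDatabaseService graphDb =
                new GraphDatabaseFactory().newEmbeddedDatabase("target/neo4j-demo-db");

        // All operations on the graph must happen inside a transaction.
        try (Transaction tx = graphDb.beginTx()) {
            Node alice = graphDb.createNode();
            alice.setProperty("name", "Alice");
            Node bob = graphDb.createNode();
            bob.setProperty("name", "Bob");

            Relationship rel = alice.createRelationshipTo(bob, RelTypes.KNOWS);
            rel.setProperty("since", 2014);

            // Traversal: follow Alice's outgoing KNOWS relationships.
            for (Relationship r : alice.getRelationships(RelTypes.KNOWS, Direction.OUTGOING)) {
                System.out.println(alice.getProperty("name") + " knows "
                        + r.getEndNode().getProperty("name"));
            }
            tx.success();
        }
        graphDb.shutdown();
    }
}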
Suppose we store the above structure in our database and ask some queries; the process would be as shown in Table 1.
Table 1.
How What
Properties are key-value pairs where the key is a string and the value is either a primitive type or an array of a primitive type (null is not a valid property value). A label is a named graph construct that is used to group nodes into sets; all nodes labelled with the same label belong to the same set. Any non-empty string can be used as a label. Traversing a graph means visiting its nodes by following relationships. Neo4j is a schema-optional graph database. Finally, creating indexes can yield significant performance gains; indexes in Neo4j are eventually available. (see Figure 12)
Figure 12. Neo4j high availability cluster
● Data Security:
▪ Neo4j does not deal with data encryption explicitly, but built-in Java programming and the JVM can be used to encrypt data before storing it.
● Data Integrity:
▪ The whole data model is stored as a graph on disk and persisted for every transaction. In the storage layer, nodes, relationships, and properties have direct pointers to each other.
▪ Neo4j provides full two-phase-commit transaction management, with rollback support over all data sources.
▪ Event-based synchronization, periodic synchronization, and periodic full export/import of data are a few of the supported integration mechanisms.
● Availability and Reliability:
▪ Operational availability such that there is no single point of failure.
▪ Online backup (in case of failure, online backup files are mounted onto a new instance of Neo4j and reintegrated into the application).
▪ Online backup high availability (the backup instance listens for online transfers of changes from the master; in the event of failure, the backup instance takes over).
● High Availability Cluster: Neo4j HA uses clusters of database instances, with one read/write master and a number of read-only slaves. Failing slaves can be restarted and brought back online; should the master fail, a new master is elected from the slave nodes. The slave nodes stay in sync with the master in real time, so read requests can be sent to satellite slave clusters and served very quickly. Neo4j HA requires a quorum in order to serve write load: a strict majority of the servers in the cluster need to be online in order for the cluster to accept write operations.
▪ As the data set size increases, the time it takes to get the relationships of any given node stays constant, but eventually the system will hit resource constraints. To address this, Neo4j uses a concept known as cache-based sharding, which simply mandates consistent request routing, so that each server caches a separate part of the graph. There will always be some overlap, but this strategy performs very well.
● Capacity: Neo4j relies on Java's non-blocking I/O subsystem for all file handling. File sizes are limited only by the underlying operating system's capacity to handle large files. Neo4j tries to memory-map its store files, but if RAM is not sufficient it falls back to buffering, so that speed degrades gracefully (while ACID properties are preserved) as RAM becomes the limiting factor.
▪ Locks are acquired at the node and relationship level. Deadlock detection is built into the core transaction management (Montag, 2013).
GraphChi
GraphChi (Kyrola, 2010) is a novel disk-based system for computing efficiently on graphs with billions of edges. It introduces a new technique called Parallel Sliding Windows (PSW). PSW can process a graph with mutable edge values efficiently from disk, with only a small number of non-sequential disk accesses, while supporting the asynchronous model of computation. PSW processes graphs in three stages: it loads a subgraph from disk one interval of vertices at a time, runs the update functions on the vertices of that subgraph in parallel, and then writes the updated values back to disk.
TRINITY
Trinity is a general-purpose graph engine on a distributed memory cloud. It supports both online graph query processing and offline graph processing. Trinity is able to scale out, i.e., it can host arbitrarily large graphs in the memory of a cluster of commodity machines. It implements a globally addressable distributed memory storage and provides a random-access abstraction for large-graph computation (Shao, 2013).
A Trinity system has multiple components that communicate with each other over a network. We can categorize the components into three classes:
1. Trinity Slaves: These nodes store the data and perform computations on it. Each slave stores a portion of the data and processes messages received from other slaves, proxies, or clients.
2. Trinity Proxies: These nodes only handle messages and do not own any data. They usually serve as a middle tier between slaves and clients. They are optional.
3. Trinity Clients: These are applications linked against the Trinity library to communicate with proxies and slaves.
The memory cloud in Figure 13 is essentially a distributed key-value store. It is supported by a memory storage module that manages memory and provides concurrency-control mechanisms, and by a message-passing framework that provides efficient, one-sided network communication. Trinity lets the user define graph schemas, communication protocols, and computation paradigms through the Trinity Specification Language (TSL). (see Figure 14)
Figure 13. Trinity cluster structure
Figure 14. Trinity system layer
The keys in the key-value store are globally unique identifiers, and the values can be blobs. To store the data, the memory on the slave machines is divided into trunks; data is hashed first to a machine and then to a trunk. This trunk division helps to gain parallelism in computation and is helpful for concurrency control. The address table that maps hashed keys to machines is stored on a machine promoted to be the leader, and this information is backed up onto persistent storage. The data on the slaves is also backed up onto persistent storage in a distributed file system, so that data is not lost when a slave becomes unreachable and there is no single point of failure. Metadata, such as spin locks to support consistency, can be stored along with the key-value pairs.
Trinity is a general-purpose graph processing engine; it is not customized for a specific set of applications. Programmers can therefore use TSL to define the data schema and communication protocols necessary for their application.
These databases and technologies are not the only available options; there has been a lot of other work going on in the Big Data realm. There are many other NoSQL databases, such as Cassandra, MongoDB, Riak, CouchDB, GoldenOrb, and Pregel (Grzegorz, 2010), so don't confine yourself to the ones covered here; do look them up if you are interested. There are also online query processing engines like Horton (Sarwat, 2012), and even new cloud architectures like OpenNebula, Sector/Sphere, and Eucalyptus. These are some of the many things we couldn't cover. If you want to learn more about Hadoop and related technologies, the chapter "Driving Big Data with Hadoop Technologies" is where you should go.
CONCLUSION
First we established what Big Data is, how it is transforming computing, and what makes it different. Then we briefly introduced and discussed some of the techniques and technologies, such as data mining, machine learning, cloud computing, and virtualization, that fuelled the growth of Big Data.
In the next sections we introduced various data processing paradigms, such as MapReduce and iterative MapReduce, and then learnt about Hadoop: how it stores data, how it implements MapReduce, and how it processes vast amounts of data. We then studied some large-scale graph processing technologies and methods which, when leveraged, can add insight and meaning to the data and its relationships.
With this we close our discussion, having mentioned various other technologies that could not be discussed here and where you can find resources about them.
This work was previously published in the Handbook of Research on Cloud Infrastructures for Big Data Analytics edited by Pethuru Raj and Ganesh Chandra Deka, pages 129-156, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Appuswamy, R. (2013). Nobody ever got fired for buying a cluster . Cambridge, MA: Microsoft Research.
Baru, C. M. B. (2013, March). Benchmarking big data systems and the big data top 100 list. Big Data, 1(1).
Battre, D. (2010). Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In Proceedings of the SoCC (pp. 119-130). New York: ACM.
Bu, Y. (2010). HaLoop: Efficient iterative data processing on large clusters. In Proceedings of the 36th International Conference on Very Large Data Bases (pp. 13-17). Singapore: VLDB Endowment.
Chen, Y. (2012). Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. In Proceedings of the 38th International Conference on Very Large Data Bases. Istanbul, Turkey: VLDB Endowment.
Dürr, F. (2012). Towards cloud-assisted software defined networking. Stuttgart, Germany: Institute of Parallel and Distributed Systems (IPVS).
Ekanayake, J. (2010a). High performance parallel computing with clouds and cloud technologies . Boca Raton, FL: CRC Press. doi:10.1007/978-
3-642-12636-9_2
Ekanayake, J. (2010b). Twister: A runtime for iterative MapReduce. In Proceedings of ACM HPDC. Chicago: ACM.
Fox, G. (2010, September 8). MPI and MapReduce. In Proceedings of Clusters, Clouds, and Grids for Scientific Computing. Flat Rock, NC: Academic Press.
Garside, W. (2013). Big data storage for dummies, EMC Isilon special edition. Hoboken, NJ: John Wiley & Sons, Ltd.
Gunarathne, T. (2010). Cloud computing paradigms for pleasingly parallel biomedical applications. In Proceedings of ACM HPDC. Chicago:
ACM.
Guo, C. (2009). BCube: A high-performance server-centric network architecture for modular datacenters. ACM SIGCOMM, 39, 44.
Huang, T. M. (2006). Kernel-based algorithms for mining huge data sets: Supervised, semi-supervised, and unsupervised learning. Berlin: Springer-Verlag.
Hurwitz, J. (2010). The importance of virtualization to big data.Big Data for Dummies. Retrieved from dummies.com.
Hwang, K., Fox, G. C., & Dongarra, J. (2012). Distributed and cloud computing. Morgan Kaufmann Publishers.
Kyrola, A. (2008). GraphChi: Large-scale graph computation on just a PC. Pittsburgh, PA: Carnegie Mellon University.
Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2009). Pregel: A system for large-scale graph processing. In Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures. Calgary, Canada: ACM.
Manyika, J., & Brown, B. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
McAfee, A. (2012, October). Big data: The management revolution.Harvard Business Review .
Provost, F. (2013, March). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1).
Sarwat, M. (2012). Horton: Online query execution engine for large distributed graphs. In Proceedings of 28th International Conference on Data
Engineering. IEEE.
Scaramella, M. E. (2011). The evolution of the datacenter and the need for a converged infrastructure . IDC.
Sears, R. (2006). To blob or not to blob: Large object storage in a database or a file system. Redmond, WA: Microsoft Research, Microsoft Corporation.
Shao, B. (2013). Trinity: A distributed graph engine on a memory cloud. Beijing, China: Microsoft Research Asia. doi:10.1145/2463676.2467799
Software Defined Networking: New Form of Networks. (2012, April 13). Open networking foundation white papers. Retrieved from
https://www.opennetworking.org/sdn-resources/sdn-library/whitepapers
Sotomayor, B. (2009). Virtual infrastructure management in private and hybrid clouds. IEEE Internet Computing.
Sverdlik, Y. (2013, September 13). Intel's growing role in software defined networking. Data Center Dynamics. Retrieved from
http://www.datacenterdynamics.com/focus/archive/2013/09/intels-growing-role-software-defined-networking
Wu, H. (2009). MDCube: A high performance network structure for modular data center interconnection. In Proceedings of CoNEXT’09. Rome:
ACM.
ADDITIONAL READING
Ananthanarayanan, R., Basker, V., Das, S., Gupta, A., Jiang, H., Qiu, T., et al. (2013). Photon: Fault-tolerant and scalable joining of continuous data streams. In SIGMOD '13: Proceedings of the 2013 International Conference on Management of Data (pp. 577-588). New York, NY: ACM.
Chi, E. H. (2000). A taxonomy of visualization techniques using the data state reference model. In IEEE Symposium on Information Visualization (InfoVis 2000). IEEE.
Vesanto, J., Himberg, J., Siponen, M., & Simula, O. Enhancing SOM based data visualization.
Covell, M., & Baluja, S. (2013). Efficient and accurate label propagation on large graphs and label sets. In Proceedings of the International Conference on Advances in Multimedia. IARIA.
Panda, B., Herbach, J. S., Basu, S., & Bayardo, R. J. (2009). PLANET: Massively parallel learning of tree ensembles with MapReduce. In VLDB '09. Lyon, France: VLDB Endowment.
Singh, H. L., & Gracanin, D. (2012). An approach to distributed virtual environment performance modeling: Addressing system complexity and user behavior. In Virtual Reality Short Papers and Posters (VRW), 4-8 March 2012 (pp. 71-72). Costa Mesa, CA: IEEE.
KEY TERMS AND DEFINITIONS
Adaptability: An ability of a system to change or to be changed in order to fit or even work better in some situation or for some purpose.
Architecture: An approach to presenting structures at the macro level. The structure masks all the detailed operational issues from the common user of the services that the structure provides.
Database: Systematically organized or structured repository of indexed information that allows easy retrieval, updating, analysis, and output of
data.
Scalability: A characteristic of a system, model, or function that describes its capability to cope and perform under an increased or expanding
workload.
Storage: Storage is a method or action of retaining data for future use. The maintenance or retention of retrievable data on a computer or any
other form of electronic system, or memory.
CHAPTER 18
Big Data at Scale for Digital Humanities:
An Architecture for the HathiTrust Research Center
Stacy T. Kowalczyk
Dominican University, USA
Yiming Sun
Indiana University, USA
Zong Peng
Indiana University, USA
Beth Plale
Indiana University, USA
Aaron Todd
Indiana University, USA
Loretta Auvil
University of Illinois, USA
Craig Willis
University of Illinois, USA
Jiaan Zeng
Indiana University, USA
Milinda Pathirage
Indiana University, USA
Samitha Liyanage
Indiana University, USA
Guangchen Ruan
Indiana University, USA
J. Stephen Downie
University of Illinois, USA
ABSTRACT
Big Data in the humanities is a new phenomenon that is expected to revolutionize the process of humanities research. The HathiTrust Research
Center (HTRC) is a cyberinfrastructure to support humanities research on big humanities data. The HathiTrust Research Center has been
designed to make the technology serve the researcher to make the content easy to find, to make the research tools efficient and effective, to allow
researchers to customize their environment, to allow researchers to combine their own data with that of the HTRC, and to allow researchers to
contribute tools. The architecture has multiple layers of abstraction providing a secure, scalable, extendable, and generalizable interface for both
human and computational users.
INTRODUCTION
Big Data is big news. The data deluge of scientific research data, social media data, and financial/commercial data is now mainstream. It is
discussed in the public press, studied in academic situations, and exploited by entrepreneurs (Press, 2013). Until very recently, the humanities
have not been included in the digital data wave (Anderson, 2007). However, the state of humanities research is undergoing a major
transformation (Drucker, 2009; Dunn & Blanke, 2009). Digitally-based research methodologies are becoming more widely used and accepted in
the humanities (Borgman, 2009). Big data is just becoming available for humanists (Manovich, 2012).
The initial efforts at doing in silico humanities research were personal projects of forward-looking researchers (Dunn & Blanke, 2009). These
initial efforts entailed digital scholarly editions (Newman, 2013; Walsh, 2013), specialized applications for specific research projects (Walsh &
Hooper, 2012), and specialized applications for a specific research domain (Almas et al., 2011). Some of these initiatives evolved into community-
based efforts to develop a set of standards for text encoding that became the focus of much of the digital humanities work for the past 10 years
(Drucker, 2009). Institutional support for digital humanities includes the trend towards Digital Humanities Centers in colleges and universities.
Many of these centers have not been as successful as initially expected due to the resource intensive nature of digital humanities projects such as
text digitization and encoding or specialty applications development (Borgman, 2009). Cross-institutional activities, including funding agency
initiatives to develop shared technical infrastructures, have had mixed results (Friedlander, 2009). None of these efforts has provided the desired results: infrastructure for digital humanities remains unavailable to many researchers (Borgman, 2009).
This chapter describes the research efforts to create a cyberinfrastructure to support Big Data for digital humanities. In the sections that follow,
the need for this research is placed in the context of digital humanities and cyberinfrastructure research, the nature of data in the humanities is
discussed, the HathiTrust Research Center (HTRC) is introduced, the cyberinfrastructure research is described, and the architecture is
explicated.
BACKGROUND
Digital Libraries and Digital Humanities
Digital Humanities encompasses many types of research domains, methodologies, and data types (Manovich, 2012). However, much of the work
in digital humanities deals with textual data (Crane, 2006; Drucker, 2009; Unsworth, Rosenzweig, Courant, Frasier & Henry, 2006). Until very
recently, the biggest barrier to digital humanities was getting textual data in digital form; digitizing materials or negotiating for access to digital
materials was the first step in most digital humanities projects (Cunningham, 2011; Pitti, 2004). With the advent of massive digitizing efforts by
many libraries and by Google, digital data for researchers in the humanities is now becoming available (Cunningham, 2011; Svensson, 2010;
Williford & Henry, 2012). These digital libraries offer services to researchers that include bibliographic and full text searching, results
management, metadata, and text and image displays. In addition, some digital libraries can provide specialized or advanced functionality to
specific communities of users within the scope and mission of the library (Candela et al., 2011).
Digital libraries vary in size and scope. Small and narrowly scoped collections may have begun as digital humanities projects. These digital
research collections result in a resource that is indistinguishable from a digital library (Pitti, 2004). Two such projects are the Chymistry of Isaac
Newton and the Perseus Digital Library. The Chymistry of Isaac Newton is a tightly scoped collection of the Alchymical Notebooks of Isaac
Newton that also provides a research environment for developing new insights into Newton’s work and his impact on the history and philosophy
of science (Newman, 2013; Walsh & Hooper, 2012). It is also a scholarly edition of these manuscripts, a set of reference works on early chemistry
and alchymical work (Pastorino, Lopez & Walsh, 2008), and an interface for a Latent Semantic Analysis Tool1 to analyze the relationships
between the terms within the documents in the Newton Digital Library (Hooper, 2013). The Perseus Digital Library is another narrowly scoped
digital library of approximately 3,000 digitized historic manuscripts with search tools and some analytics such as word statistics (word occurrence in the documents and the corpus) and cross references to other documents2.
Before big data was a term in popular use (Press, 2013), Crane defined a set of dimensions to differentiate between typical digital libraries and
massive digital libraries that could incorporate all print books. These dimensions are scale, heterogeneity of content, granularity of content,
audience, and collections and distributors (2006). Crane proposes that massive digital libraries are very large (over 1,000,000 books) with a
broad scope (similar to brick and mortar libraries) that can manage the data of each book more granularly than as a volume (providing structure
at the section, page, or concept level) for a broad set of users and are available via a number of points of entry and interfaces (2006). Project
Gutenberg and the Internet Archive are examples of well-known digital libraries (Manovich, 2012). Project Gutenberg has 42,556 volumes3 in the
collection, available at the volume level only. Project Gutenberg does not have multiple distribution channels. The content is heterogeneous and
appeals to a broad audience (Project Gutenberg, 2013). The Internet Archive has 4,510,411 texts4. This collection is heterogeneous with broad
appeal and available through multiple distribution channels. The data is only available at the volume level without other levels of granularity of
content (Internet Archive, 2013). Using Crane’s criteria, the Internet Archive would be considered a massive digital library, but Project Gutenberg
would not.
The HathiTrust Digital Library, a shared and secured digital repository, is a partnership of major research libraries. It is one of the largest
collections of digital books and documents available to researchers (Christianson, 2011). As of May 2013, the HathiTrust Digital Library collection
included 10,728,728 total volumes, comprising 5,617,818 individual book titles and 278,481 serial titles for a total of 3,755,054,800 pages,
totaling 481 terabytes of storage5. By all measures, the HathiTrust Digital Library fits Crane’s criteria of a massive digital library. Nearly 11 million
volumes prove scale, and the granularity of this collection is multi-dimensional. The data is available at the volume or the page as full text or PDF
(when copyright allows). The HathiTrust Digital Library has a broad, comprehensive, and heterogeneous collection, a broad audience, and
multiple distribution channels. The most common subjects are Language and Literature followed by History, Sociology, and Business &
Economics (Piper, 2013).
From Digital Libraries to Big Data
Massive digital libraries with millions of digitized books can be considered Big Data for the humanities. Unlike Big Data in other domains such as
purchasing data, telecommunications, sensors, satellites, or microarrays, humanities data consists of thoughts, ideas, concepts, words, language,
events, time, voice, grammar, persons, identities, character, location, and place, all of which are in a totally unstructured format and most of
which originate in books, journals, and documents. A significant characteristic of Big Data is dimensionality (Jacobs, 2009). Humanities textual
data in general and the HathiTrust Digital Library data in particular are multidimensional. One dimension of this data is the logical structure of
the document; access and understanding of this data can be at the volume, section, chapter, page, or word level. Another dimension of the data is
conceptual; access and understanding of the concepts, ideas, and attitudes within the pages of the books. Another dimension is the language
itself; access and understanding of this data can be the meaning of words, the structure of language, and the organization of thought through
language. Yet another dimension is temporal; access and understanding of the data can be the time period of the author and the writing of the
book or the time period of the subject of the book. The data could be used as a longitudinal survey of ideas, attitudes, word meanings, language
structure, publishing trends, and so on. Williford and Henry assert that the underlying humanities data is just as heterogeneous, complex, and
massive as big data in other domains. In addition, the research methods used in the humanities may produce a very large data output. Many
scholars engaged in computationally intensive research see this output as more important than the original data (Williford & Henry, 2012).
By multiple measures, HathiTrust data is Big Data. If Big Data is to be defined as “data whose size forces us to look beyond the tried-and-true
methods that are prevalent at that time” (Jacobs, 2009) or as “data [that] is too big, moves too fast, or doesn’t fit the strictures of your (sic)
database architectures [so to] gain value from this data, you must choose an alternative way to process it” (Dumbill, 2012), then the HathiTrust
data is Big Data. If Big Data is defined by its network properties, its relationship to other data, and its value derived from patterns and
connections between pieces of data (Boyd & Crawford, 2011), then again, HathiTrust data is Big Data.
Cyberinfrastructure for the Humanities
The availability of this new, huge corpus of material has lowered one of the major barriers to the wide adoption of digital humanities methodologies. However, this big data has presented a new set of challenges and barriers for humanities scholars: access to computation platforms that can process the quantity of data available, knowledge of algorithms and the ability to build appropriate applications, and the technical skills required to obtain research results.
Digital library resources such as HathiTrust and Google Book Search that provide only traditional library functions such as searching and reading
interfaces are not sufficient as infrastructure for humanities research (Borgman, 2009; Williford & Henry, 2012). The current model of digital
humanities, a multi-disciplinary research group including humanities researchers and computer scientists supported by grant funding
developing a resource or a tool, is unsustainable (Manovich, 2012). Developing cyberinfrastructure, “computing systems, data storage systems,
advanced instruments and data repositories, visualization environments, and people, all linked together by software and high performance
networks to improve research productivity and enable breakthroughs not otherwise possible” (Stewart et al., 2010, p. 1) will free humanities
researchers to pursue their scholarly and intellectual interests using technology that they themselves did not need to develop (Unsworth et al.,
2006).
A suitable cyberinfrastructure for digital humanities would be accessible and sustainable, could facilitate interoperability and collaboration, and
would support experimentation (Unsworth et al., 2006). Multiple calls for developing cyberinfrastructure for the humanities have been issued
(Svensson, 2011; Manovich, 2012; Unsworth et al., 2006; Williford & Henry, 2012). However, the response has been sparse. Only two projects are
known to be developing broad, public cyberinfrastructure for the humanities – Project Bamboo and the HathiTrust Research Center
(Christenson, 2011; Sieber, Wellen & Jin, 2011). The Andrew W. Mellon Foundation awarded a grant to the University of Chicago and the
University of California, Berkeley to build a consortium of institutions to develop and sustain cyberinfrastructure for the arts and humanities
(Greenbaum & Kainz, 2009). This project was begun in 2008 and ended in 2013. The project developed a registry, the Digital Research Tools
website, to help humanities researchers locate tools, services and collections6 (Adams & Gunn, 2012).
The HathiTrust Digital Library founding partner institutions determined that the data in their digital library “would serve as an extraordinary
foundation for many forms of computing-intensive research”, and thus commissioned a call for proposals to establish up to two research centers
that would develop a cyberinfrastructure for the entire HathiTrust Digital Library corpus (Christenson, 2011). Indiana University and the
University of Illinois developed a proposal to develop the HathiTrust Research Center, which was accepted in July 2011.
HATHITRUST RESEARCH CENTER
It is important to delineate the structure of the HathiTrust Research Center (HTRC) with respect to the HathiTrust Digital Library itself. The
HathiTrust Digital Library offers long-term preservation and access services, including bibliographic and full-text search and reading capabilities
for public domain volumes and some copyrighted volumes. The HTRC on the other hand, provides computational research access to the
HathiTrust Digital Library collection. Limited reading of materials will be possible in the Research Center to accommodate needs for reviewing
results and so on, but the destination for reading-based research remains the HathiTrust Digital Library.
The HTRC is a distributed cyberinfrastructure that meets the specific needs of long-term secure research and analysis of the HathiTrust text
corpus. The core goal of the cyberinfrastructure is to deliver optimal access and use of the HathiTrust corpus. The sheer size of the corpus, nearly
11 million digitized books with nearly 4 billion individual pages, demands innovative thinking about the architecture and the optimization at all
levels of the software infrastructure from disk to tools. Research and development has focused on scalability by reducing reads, intelligent
caching, and delivering maximum cycles at minimal costs; interoperability of algorithms and data by providing APIs and flexible parameter
driven interfaces; accessibility by simplification of technologies for researchers who care about their theories but not the technology; and
experimentation support by providing user contributed data and algorithms.
The corpus is a dynamic collection. As mentioned previously, the collection has 10,728,728 total volumes, comprising 5,617,818 individual book titles and 278,481 serial titles for a total of 3,755,054,800 pages; slightly over 30% (3,287,832 volumes) are in the public domain, with the remaining volumes under copyright protection. The first version of the HTRC will provide research access to the public domain materials.
An Architecture for Digital Humanities Cyberinfrastructure
In this section, we will describe the architecture for the HathiTrust Research Center (HTRC), which is designed to support a broad range of
functions in a secured environment.
The HTRC supports six primary functions: data discovery, service discovery, data retrieval, job management, results management, and data
management. Data discovery provides an interface for searching the HTRC collection for specific subsets of data to be processed as well as to
create and save individualized subsets as sub-collections of the data for ongoing use. Service discovery allows researchers to find available
algorithms for analyzing, visualizing, and processing the data with individualization of these algorithms via parameters. Job management is the
process of allocating computational resources, as well as submitting and monitoring algorithm execution. Results management allows for the
persistent storage of the output of the research process. As noted above, the output of the algorithms (the data extracted, collated, annotated, and
analyzed) is the research product (Williford & Henry, 2012). Data management includes data storage, data access, and ongoing maintenance of
the data. The HTRC infrastructure includes a massive data store with secured programmatic retrieval via web service architecture. Access for
data, services, and computation are controlled via a secured set of APIs.
The HTRC architecture was designed to support the functions developed above. Both conceptually and technically, the functions form three
layers: Discovery and Access, Services Management, and Data Management (see figure 1). The goal of this three-layered architecture is to provide
multiple levels of abstraction, to protect the underlying data, to manage services and resources, and to provide a simple model for
interoperability.
Figure 1. HTRC Architecture
Sustainability is a major concern in cyberinfrastructure (Bietz, Ferro, & Lee, 2012; Mackie, 2007; Schopf, 2009; Stewart, Almes, & Wheeler,
2010). Sustainability includes funding, relationship management, and resource preservation (Bietz, Ferro, & Lee, 2012). Schopf contends that
while funding is important, it is more important that the cyberinfrastructure culture change to embrace reuse of existing code and software
packages to enhance reliability and increase maintainability and sustainability (2009). Therefore, a major goal of the HTRC cyberinfrastructure
architecture is long-term sustainability. The design is intended to be implemented with existing, well-known, and well-supported open source
software to facilitate the ongoing maintenance and support of this software long after the research and development are completed. The HTRC
was built using Apache, WSO2, and SEASR/Meandre infrastructure, along with other open source projects.
Access Management
With Big Data come big security challenges such as appropriate access and privacy (Bryant, Katz, & Lazowska, 2008; Cloud Security Alliance Big
Data Working Group, 2012; Manadhata, 2012; Rong, Nguyen, & Jaatun, 2013). Cyberinfrastructure security in a cloud environment has
additional considerations for privacy that include managing user services and data dispersed across multiple clouds possibly in other legislative
domains, protecting user data against unauthorized access from other users running processes on the same physical server, ensuring that data is
not modified without the owner’s consent, and protecting logs that may contain sensitive infrastructure information, query terms, and results
information (Rong et al., 2013). Security is applied throughout the HTRC infrastructure to secure data and to protect privacy.
The HTRC follows a layered approach to security, where the lowest level ensures that data is securely and reliably stored with necessary network
level security and replication for reliability (see Figure 2). The next layer, the data application programming interface (API), provides secured
access to the data while ensuring proper access control based on user roles. The Agent API, the layer atop the Data API and computational resources, implements proper security measures for the algorithms deployed on HTRC data. The Data API and Agent API are exposed to the user via the HTRC portal and discovery service. Authentication and authorization across these layers are implemented based on the OAuth2 protocol (OAuth, 2012). A
separate identity provider that supports OAuth2 is deployed to manage users including user profiles with the necessary access control metadata.
Access to the HTRC resources is secured using the OAuth2 token-based authentication protocol. Each request that comes to the Data API, the
Agent API, or the Registry API must be accompanied by an access token issued by the HTRC identity provider. Applications or users who invoke
HTRC APIs can acquire an access token by following standard OAuth2 protocols. The HTRC OAuth2 implementation uses the “Authorization
code” as the authorization grant (Hardt, 2012). The benefits of this grant include such features as authorizing third party clients on behalf of
users and managing user identities in a single location while eliminating the need to share credentials with any third party clients.
Upon receiving a request to the API endpoint, the security filter is invoked before passing the request to the actual business logic. This filter
extracts the access token from the request and validates it against the OAuth2 provider, which returns the validity of the token to the filter. If the
token is valid, the OAuth2 provider returns the authorization information that token represents such as user name and roles. This information is
used by the API layers to decide whether the user has the necessary rights to access the resources requested. In addition, this information will be
used to log the API access history for audit purposes. One of the limitations of the OAuth2 protocol is its lack of support for authorization based on user attributes. The HTRC addresses this by utilizing role-based access control mechanisms (Sandhu, Ferraiolo, & Kuhn, 2000) provided in the WSO2 Identity Server (WS02, 2013), which is used as the OAuth2 provider in the HTRC. This enables the handling of different types of users with varying
levels of permissions.
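As a rough illustration of this request-filtering flow, the sketch below checks a bearer token against an OAuth2 token-introspection style endpoint before any business logic runs. The endpoint URL, parameter names, and response handling are hypothetical stand-ins for the WSO2 Identity Server deployment described above, not the HTRC's actual interface.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of the security-filter step: extract a bearer token from an incoming request
// and validate it against the OAuth2 provider. The introspection URL and response
// format are assumptions for illustration only.
public class TokenFilterSketch {
    private static final String INTROSPECT_URL = "https://idp.example.org/oauth2/introspect"; // hypothetical

    static boolean isAuthorized(String authorizationHeader) throws Exception {
        if (authorizationHeader == null || !authorizationHeader.startsWith("Bearer ")) {
            return false; // no access token supplied
        }
        String token = authorizationHeader.substring("Bearer ".length());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(INTROSPECT_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("token=" + token))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // A real filter would parse the JSON body for the username and roles and attach
        // them to the request so role-based access control decisions can be made.
        return response.statusCode() == 200 && response.body().contains("\"active\":true");
    }

    public static void main(String[] args) throws Exception {
        // The provider host above is hypothetical, so this call will fail outside a real deployment.
        System.out.println(isAuthorized("Bearer not-a-real-token"));
    }
}
```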
DISCOVERY AND ACCESS
This layer is designed to provide access to the HTRC corpus. The goal is to simplify the interactions of technologies so that researchers who care
about their theories but not the technology can use digital humanities methodologies to explore the data, to experiment with new ideas, and to
develop new theories.
Discovery and Collection Building
Searching for data in the HathiTrust Research Center (HTRC) collection is an essential function and is the primary starting point for digital
humanities research. Among many other criteria, researchers require the ability to find items from a certain region or period, by one or more
known authors, that contain a specific set of concepts. Once identified, researchers also require the ability to store and modify collections of
selected items. These collections serve as the foundation for the large-scale analysis supported by HTRC services.
For end user searching, the HTRC selected Blacklight, a widely used, open-source library catalog and discovery system designed to work with
both bibliographic and full text data in SOLR indexes (Sadler, 2009). Using an established search interface helps with the HTRC’s sustainability
(Schopf, 2009). Blacklight is based on the “Ruby on Rails” web application framework and is easily customized or extended. As part of the HTRC
initial implementation, the Blacklight application was extended to support the creation of research collections for analysis. Once authenticated,
researchers are able to search the indexes (described in following sections) and create new collections based on items selected or to modify
previously saved collections. Users are also able to derive new collections from public sets created by others. In the initial implementation,
collections consist of a set of HathiTrust volume identifiers along with collection metadata including the collection name, description, availability
(public or private), and descriptive keywords. Once created, research collections are available for analysis.
The HTRC is currently exploring improvements to the discovery and collection building interfaces. Future enhancements will include greater
user control over bibliographic and full-text searching, improved faceting, and more sophisticated collection building capabilities. While many
questions remain open, the ability to define, gather, and preserve collections of items for analysis remains a central challenge for the HTRC
project.
HTRC Portal
The HTRC portal serves as an entry point for the entire HTRC infrastructure. It provides a web interface for users to interact with the backend
services. The simple web interface provides access to the digital humanities text analysis tools (algorithms), subsets of the public domain volumes
(collections), and computational resources. In the portal, a researcher will select an algorithm, choose a collection to process, and submit the
request. As an example, a researcher is interested in creating a tag cloud of the words from the novels of Charles Dickens. The researcher decides
that the tag cloud workflow that checks spelling and eliminates commonly used words called stop words (words such as “a”, “and”, “of”, “the”,
etc.) would provide a more meaningful result. When the researcher chooses the algorithm, the portal requests the parameter information about
this algorithm and dynamically builds the web form (see Figure 3). When the researcher has completed the request and clicks on submit, the
portal sends the data to the agent to be processed.
The portal interacts with the HTRC services layer to verify the input data is correct, to allocate computing resources, and to submit a job to be
processed. The portal provides an interface to monitor the progress of the process and provides status updates to the researcher. When the
process is completed, the results are available to the researcher. The web interface allows the researcher to terminate a running algorithm if
necessary. When ready to view the results, the researcher goes to the results tab, selects the job and can see and download the results (see Figure
4).
The HTRC web interface hides the backend services from the users as much as possible. The portal issues calls to the services layer to build all of
the lists of services available, such as the collections that the researcher is allowed to process, the algorithms, the status of jobs submitted, and the results sets from both the current and previous sessions. To make the application easy to maintain, the HTRC web interface is implemented based on the Apache Struts 2 framework and deployed in the Apache Tomcat container.
The HTRC anticipates that additional access methods to the HTRC data will be required. The HTRC is designed to allow other interfaces from
other digital humanities services, systems, and cyberinfrastructure to securely access the data, the algorithms, and computational resources.
SERVICES MANAGEMENT
Managing the available services is the second major layer of the HTRC cyberinfrastructure. In this layer, the HTRC manages access control via
authentication and authorization in a security module, provides access to data collections and digital humanities algorithms, manages the results,
and allocates computing resources and jobs.
Agent Architecture
The architecture is designed with an agent/actor framework (Hewitt, Bishop, & Steiger, 1973). In this framework, a central program (the agent)
manages individual requests for services by spawning and managing individual processes (actors) that fulfill the service requests. The agent/actor
model helps to mitigate the problems associated with shared memory concurrency. An actor is a lightweight cooperative thread that serially
receives messages, performs computation based on the message, and then suspends until the next message is available. These messages are the
only way actors communicate, as they do not share memory. Actor systems are automatically parallel since actors are executed on a parallel task
pool and are not allowed to perform blocking operations. Actors are also very lightweight, on the order of 400 bytes. These features make them
very scalable. The HTRC agent is written in Scala (École Polytechnique Fédérale de Lausanne, 2013) and uses the Akka library for the actors
(Akka, 2012).
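The HTRC agent itself is written in Scala on Akka; the plain-Java sketch below only illustrates the general pattern of an actor that owns a private mailbox, processes messages serially, and never shares memory. All names are invented for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal illustration of the actor model: each actor owns a mailbox and a single
// worker loop that handles one message at a time, so no shared-memory locking is needed.
// This is a teaching sketch, not the HTRC agent (which uses Scala and Akka).
public class MiniActor implements Runnable {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private final String name;

    public MiniActor(String name) {
        this.name = name;
    }

    // The only way to communicate with an actor is to send it a message.
    public void tell(String message) {
        mailbox.offer(message);
    }

    @Override
    public void run() {
        try {
            while (true) {
                String message = mailbox.take(); // suspend until the next message arrives
                if ("stop".equals(message)) {
                    return;
                }
                System.out.println(name + " handled: " + message);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        MiniActor userActor = new MiniActor("user-actor");
        new Thread(userActor).start();
        userActor.tell("list collections");
        userActor.tell("submit job");
        userActor.tell("stop");
    }
}
```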
In the HTRC cyberinfrastructure, the service requests are always created via a program – either from an HTRC process such as the public portal
interface or from an algorithmic program using HTRC resources. The service requests use the Representational State Transfer (REST)
architecture (Fielding & Taylor, 2002). The agent framework allows for an abstracted and secured access to the services of the HTRC by
controlling access to the services via a set of “transactions” exposed through a RESTful API. The RESTful API sends assignments to an internal actor, which then performs the tasks and returns the results to the requestor. The communication between the agent and the actor uses asynchronous messaging to ensure no API call will block a user's actor process (see Figure 5). The agent receives a request and then forwards it to
the actor representing that user. This actor then modifies the user's state and performs the specified action. In the case of a registry query, the
user actor sends a message to the “registry proxy” actor. This actor contacts the registry, executes the query, and sends the response to the user
actor. Once the user actor has the response it completes the original web request.
Figure 5. Agent Interaction Model
The user actor sends a message and then waits for a response, in what seems to be a violation of the actor model, as this would block the actor until the
response arrived. An asynchronous process can mimic this synchronicity by use of the “Future” feature. When the user actor sends the message to
the registry, rather than waiting for a response, it registers a callback. The user actor is then free to process new user requests. When the
information from the registry is available, the web request is completed by the callback. To match this model, a non-blocking approach was used
to present the REST API, as a blocking model would interact poorly with the asynchronous actors.
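The callback-based pattern described here can be sketched with Java's CompletableFuture: instead of blocking on the registry response, the caller registers a continuation and remains free to handle other messages. The registry lookup below is a stand-in stub, not the HTRC registry API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the non-blocking "Future" pattern: register a callback for the registry
// response instead of waiting for it. The registry call is a stub for illustration.
public class NonBlockingLookup {

    // Simulated asynchronous registry query (stand-in for the registry proxy actor).
    static CompletableFuture<String> queryRegistry(String algorithmName) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(200); // pretend network latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "properties-of-" + algorithmName;
        });
    }

    public static void main(String[] args) {
        CompletableFuture<String> pending = queryRegistry("tag_cloud");

        // Register a callback; the caller is immediately free to process other requests.
        pending.thenAccept(result ->
                System.out.println("web request completed with: " + result));

        System.out.println("user actor is free to handle the next message");
        pending.join(); // only here so the demo does not exit before the callback runs
    }
}
```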
Job submissions are handled by having the user actor create a child actor to manage the job. This “compute child” retrieves data from the registry
actor, combines it with the user input and launches a child actor representing the actual job. These job actors simply stage data via a secure copy
(scp) and launch jobs via a secure shell (ssh). The job actors send status updates to the compute child. When a user queries job status, the
corresponding actor queries its compute children, which are always available to respond with recent status information.
Governance Registry
Digital humanities services within the HTRC include such functions as creating sub-collections of the data, running analysis algorithms, and
managing analysis results. These services are managed via the HTRC governance registry. The services offered through the HTRC are generic
with specific instantiations; that is, the service itself is a general function with specific parameters, programs, and/or data that is unique to the
execution. As a simple example, a tag cloud algorithm has a specific set of input parameters that must be provided for a successful run. These
parameters include the identifier of the set of the collection to be processed, the number of words to be included in the tag cloud, a stop word list
to remove the most common, unimportant words, and a dictionary for a spell checking function. These parameters are encoded in an XML
schema (see Figure 6). The basic information such as name, version, description, and author is represented in the “info” element. The second
component of the info element is the parameters element. Each parameter specified is given a name, a type, a flag indicating if it is mandatory or
optional, and a default value if appropriate (such as a generic stop word list or a general dictionary). A label for the portal user interface is also
present, along with a description of purpose of the argument.
Figure 6. Registry Algorithm XML for Tag Cloud with Data Cleaning
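A rough Java object model of the parameter metadata just described might look like the sketch below. The class and field names simply mirror the prose (name, type, required flag, default value, label, description) and the sample values are invented; the actual registry XML schema is defined by the HTRC and may differ.

```java
import java.util.List;

// Rough object model of a registered algorithm's parameter metadata, mirroring the
// prose above. Illustrative only; the real registry stores this as XML.
public class AlgorithmEntry {

    public record Parameter(String name, String type, boolean required,
                            String defaultValue, String label, String description) { }

    public record Info(String name, String version, String description,
                       String author, List<Parameter> parameters) { }

    public static void main(String[] args) {
        Info tagCloud = new Info(
                "tag_cloud_with_cleaning", "1.0",
                "Builds a tag cloud after spell checking and stop-word removal",
                "HTRC",
                List.of(
                        new Parameter("collection_id", "string", true, null,
                                "Collection", "Identifier of the collection to process"),
                        new Parameter("word_count", "integer", true, "100",
                                "Number of words", "How many words appear in the tag cloud"),
                        new Parameter("stop_words", "file", false, "default_stopwords.txt",
                                "Stop word list", "Common words to remove before counting"),
                        new Parameter("dictionary", "file", false, "default_dictionary.txt",
                                "Dictionary", "Dictionary used by the spell checker")));
        System.out.println(tagCloud.name() + " takes " + tagCloud.parameters().size() + " parameters");
    }
}
```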
Defined registry services available via the agent include such general tasks as: listing all available collections, uploading a new collection,
modifying a collection, listing all available algorithms, fetching the full properties of an algorithm, submitting jobs, fetching the standard error (stderr) and standard output (stdout) console logs for a job, and querying for the status of a job. Specific digital humanities processes and
algorithms available for the initial implementation include word counts, tag clouds, named entity extraction (characters, locations, and dates),
concept modeling, bibliographic data download, and viewing results. Algorithm results are stored persistently and are accessible from the HTRC
portal.
The text analysis algorithms in the HTRC cyberinfrastructure were made available by integrating a well-known humanities workflow engine, Meandre, and the SEASR analytics toolkit (Ács et al., 2010). With these tools, the HTRC has been able to demonstrate basic text processing such as word
counts visualized as Tag Cloud (using d3.js) and data cleaning based on transformation rules that were created as part of the Mellon SEASR
Services project (Searsmith, 2011). These transformation rules expose thousands of OCR error corrections and normalization of spellings. To
demonstrate additional text analytics, the HTRC leveraged flows that extract entities such as dates and plot them on a Simile timeline or extract
locations and plot them on a Google map and flows that perform the comparison of two collections of documents using Dunning Loglikelihood.
Meandre flows were executed by running scripts, which encapsulate all the necessary dependencies. The integration of the Meandre flows into the HTRC tested the agent APIs, the actor model, the job submission process, and the registry data model. During a large-scale
demonstration and test, 354 HTRC sessions with agent actors were initiated, 43 sub-collections were created, and 153 algorithms were
successfully executed.
DATA MANAGEMENT
The final layer of the architecture is data management. At the heart of the HTRC is a corpus consisting of millions of volumes and billions of
pages that will give digital humanities researchers access to data at an unprecedented scale and allow them to make new discoveries that would
be unimaginable before. The types of data available at the HathiTrust Digital Library (HT) include the original scanned images, text from optical
character recognition software (OCR), OCR coordinates data, structural metadata stored in the Metadata Encoding and Transmission
Standard7 (METS), and bibliography metadata in the MAchine Readable Cataloging8(MARC) metadata standard. The HT deposits and replicates
the data at both the University of Michigan and Indiana University.
The HTRC acquires the data from the HT. Currently the HTRC only obtains a subset of the corpus, which consists of the raw OCR texts from 2.6
million volumes in public domain and their associated METS and MARC metadata documents. The total size of this subset amounts to only about
2.2TB on a file system. Initially, this does not appear to be impressive as the data can easily fit on a single disk. But the total file system size alone
does not indicate the challenges that must be overcome in the ingestion, storage, and management of the data. The challenges are primarily due
to the differences in data structures that each system needs to meet its individual goals: the HTRC, a digital humanities cyberinfrastructure
that uses text both as index and as raw data for algorithmic processing, needs fast access to the individual text files while the HT, a digital library
that uses the text as index and page images as display data, needs fast access to images and rarely uses the text files.
The HathiTrust Digital Library’s data structure is designed for fast access to large volumes of data and to facilitate page reading. Each volume at
the HT bears a unique identifier made up of a prefix and a string ID. The string ID is the original identifier assigned by the contributing
institution of that volume. The format of this ID ranges from simple alphanumeric strings to Handle URIs; the prefix is a short alphanumeric code
identifying the contributing institution that essentially serves as a namespace qualifier to avoid potential collisions among the institution-specific
string IDs. Each actual volume is an individual Zip file containing a number of text files where each text file corresponds to a single page of the
volume. Accompanying each volume Zip file is an XML document containing the METS metadata for that volume. The HT uses the Pairtree
hierarchical directory structure to store these files on their specialized hardware and file system (Kunze, Haye, Hetzner, Reyes, & Snavely, 2011).
Each prefix has its own Pairtree. The string ID portion of a volume ID is first cleaned according to the Pairtree rules to replace characters that are
not file-system safe. The cleaned string is then broken into a sequence of 2-character shorties, sometimes with a trailing 1-character morty (if
the cleaned string contains an odd number of characters). The sequence is used to create a ppath, a nested list of directories starting from the
Pairtree root where each directory name corresponds to a shorty or a morty. At the end of each ppath are the volume Zip file and the associated
METS XML file (see Figure 7). Advantages of the Pairtree structure include a well-balanced directory tree, and mutually referable ppaths and IDs
(Kunze et al, 2011). However, this structure also introduces a lot of overhead because on average each volume Zip file is less than 1MB, yet
accessing the file from the Pairtree requires reading many inodes and directory entries first resulting in a seek time that is much greater than the
read time of the data itself. Storing a large number of small files directly on disk is also not efficient as the files are too small to leverage the
advantages from modern high performance file systems which typically are optimized to use parallelization to achieve high throughput on large
files (Borthakur, 2007; Schmuck & Haskin, 2002; Schwan, 2003). It becomes apparent that the HTRC must store and manage the corpus data in
a more efficient way.
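The following sketch derives a ppath from a volume ID using only a simplified subset of the Pairtree cleaning rules; the sample identifier and the character substitutions shown are illustrative and do not cover the full Pairtree specification.

```java
// Sketch of Pairtree ppath construction: clean the string ID, split it into
// 2-character "shorties" (with a trailing 1-character "morty" if needed), and join
// them into a nested directory path. The sample identifier is invented.
public class PairtreePath {

    static String cleanId(String id) {
        // Simplified subset of the Pairtree character substitutions.
        return id.replace("/", "=").replace(":", "+").replace(".", ",");
    }

    static String toPpath(String stringId) {
        String cleaned = cleanId(stringId);
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < cleaned.length(); i += 2) {
            // Each piece is a shorty (2 chars) or a final morty (1 char).
            path.append(cleaned, i, Math.min(i + 2, cleaned.length())).append('/');
        }
        return path.toString();
    }

    public static void main(String[] args) {
        String volumeId = "uc1.b12345";  // hypothetical prefix + string ID
        String stringId = volumeId.substring(volumeId.indexOf('.') + 1);
        System.out.println("ppath under the 'uc1' Pairtree: " + toPpath(stringId));
        // Prints b1/23/45/ ; the volume Zip and METS XML live at the end of this path.
    }
}
```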
The most obvious data management solution is a relational database system (RDBMS). It is a very mature technology that has been a staple in
data-centric systems for decades. However, as Big Data is becoming more pervasive, relational database is known to have limitations on its ability
to scale well. A relational database also imposes a very rigid schema that is optimized for dense data but is not efficient for sparse data such as the
HTRC corpus (Cattell, 2011). For the HTRC, an intuitive schema design would define each volume as a row in a table, and each page a column;
however, since these volumes have different numbers of pages that could vary by several orders of magnitude, such a schema needs to have
enough columns to accommodate the book with the most pages while the majority of the rows will be only partially filled. Therefore a relational
database is quickly ruled out.
There is an abundance of alternative storage and data management solutions that all share the term “NoSQL” although they vary greatly in data
models and data storage mechanisms (Han, Haihong, Le, & Du, 2011). One commonality is that some of the strong assertions found in typical
relational database systems are relaxed in accordance with the CAP theorem (e.g. consistency) in order to provide better availability and scalability
(Stonebraker, 2010). Since supporting large-scale research and analysis is a goal of the HTRC, a NoSQL solution was evaluated. The HTRC
developers evaluated different systems and concluded that BigTable type of storage is better suited for the corpus data than key-value types (e.g.
MongoDB) because the tabular form allows data items within each volume to be associated together, whereas a key-value store would completely break such an association (Chang et al., 2008). The specific NoSQL solution that the HTRC has implemented is Apache Cassandra, as it is an open-source, fully distributed column store (whereas a similar system, Apache HBase, has special nodes that could become a single point of failure)
(Hecht & Jablonski, 2011). The data model in Cassandra consists of Keyspaces, Column Families, Rows, and Columns, which at a high level are
analogous to the database terms Databases, Tables, Rows, and Columns. Each row can have its own set of columns that can be entirely different from the columns in other rows. Cassandra is designed for scalability and is typically set up with multiple nodes forming a ring; each node is assigned a token that controls the range of data it is in charge of, and data is usually replicated across different nodes, with the number of replicas controlled by a replication factor. Cassandra stores data on disk as SSTables, and a compaction process can
consolidate smaller SSTables into larger ones.
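To make the wide-row idea concrete, the sketch below models, in plain Java collections rather than the Cassandra API, how each volume can be a row whose columns are its pages, so rows naturally hold different numbers of columns. Column-family details such as SSTables, tokens, and replication are deliberately not represented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Conceptual illustration (plain Java, not the Cassandra API) of the wide-row data
// model: each volume is a row keyed by its volume ID, and its columns are the pages.
// Rows with very different column counts are what make a column store a better fit
// than a rigid relational schema for this corpus.
public class WideRowSketch {
    // row key (volume ID) -> ordered columns (page name -> OCR text)
    private final Map<String, TreeMap<String, String>> volumes = new LinkedHashMap<>();

    void putPage(String volumeId, String pageName, String ocrText) {
        volumes.computeIfAbsent(volumeId, id -> new TreeMap<>()).put(pageName, ocrText);
    }

    int pageCount(String volumeId) {
        return volumes.getOrDefault(volumeId, new TreeMap<>()).size();
    }

    public static void main(String[] args) {
        WideRowSketch store = new WideRowSketch();
        store.putPage("uc1.b12345", "00000001.txt", "First page text ...");
        store.putPage("uc1.b12345", "00000002.txt", "Second page text ...");
        store.putPage("mdp.39015000000001", "00000001.txt", "A one-page pamphlet ...");
        System.out.println("uc1.b12345 has " + store.pageCount("uc1.b12345") + " pages");
    }
}
```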
Data API
There are two competing requirements on how the HTRC corpus data should be served to the users: security and ease of use. Data security is a
significant aspect of the HTRC cyberinfrastructure in order to keep the large repository of digitized books, both public domain and copyrighted,
safe and free from unauthorized access. Logging access to each volume is a key requirement. Of course, the HTRC wants to simplify access to
allow researchers to be able to find and process the data that they require without imposing a heavy programming burden and without a
significant performance penalty. A set of use cases showed that researchers in digital humanities work with a wide range of programming languages
and software tools to carry out their work. Based on the requirements and use cases, the HTRC examined several approaches to meet these
requirements including giving direct access to the backend data store, providing client libraries, and deploying a service.
Giving users direct access to the Cassandra data store seems to be the most straightforward approach. Cassandra servers use the Thrift interface
definition language for communication, which would allow each of the algorithms in the HTRC to directly access the data store (Slee, Aditya &
Kwiatkowski, 2007). However, giving direct access to the backend repository does not conform to either the security requirement or the ease of
use requirement. Cassandra has implemented few security provisions and has no auditing functions. Neither the data itself nor any
communication via the Thrift interface is encrypted (Okman, Gal-Oz, Gonen, Gudes, & Abramov, 2011). In addition, the Thrift communication
interface to Cassandra is rather complex and presents a learning curve to the users. The approach of providing client libraries brings a middle
layer between the backend store and user’s client code. This approach is capable of presenting a simplified API to the users while hiding much of
the backend detail. This middle layer also allows user activities to be audited and other necessary security measures to be enforced. However, in
order to support the various languages, the HTRC developers would face the difficult task of developing and maintaining a client library for each
different programming language, making ongoing maintenance and long-term sustainability more difficult.
Using a simple, lightweight protocol for the Data API was determined to be the optimal choice. A RESTful web service works over the standard
HTTP protocol, and most modern programming languages come with libraries and methods for establishing communications over HTTP, so the
HTRC only needs to maintain a single service (Pautasso, Zimmermann, & Leymann, 2008). Auditing can be done at the service level. Using
HTTP over SSL (i.e. HTTPS) also protects the communication channel between the client and the service and keeps the data safe from
eavesdroppers. The HTRC Data API needs to be able to deliver data at either the volume or the page level. As straightforward as it may seem, the
Data API faces a few challenges associated with delivering multiple volumes.
The first challenge is maintaining the structures of multiple volumes with their pages while delivering the set through a single stream to the client
as the HTTP protocol is designed to allow a single response stream. Using a Zip stream to encapsulate and deliver the data and metadata to the
client worked in all of our use cases. Zip, a technology for aggregating and compressing data, is integrated into many modern programming
languages via libraries. Other advantages of using Zip include its ability to reduce data size and its compatibility with both textual and binary data. Using Zip also provides a solution to the second challenge: conveying error messages to the client after the response stream is committed. By
design, the HTTP protocol dictates that when a response is sent back to the client, it provides a status code to indicate whether the request
succeeded or failed. In case of a failure, the status code provides some information about the error. While this design is sufficient for conventional
HTTP use where one request typically asks for just one document, streaming a set of volumes or pages is inherently different because an error
may occur after the initial success response code is sent and the Zip file has begun to stream. It would be impossible to change the status code or
add other HTTP response headers once the stream has started. Failing silently is an unacceptable approach; not only would users be confused or
frustrated by this lack of information but, in a worst-case scenario, if the algorithm did not realize the stream was interrupted, incorrect analysis
could be produced. Fortunately, with Zip stream, error messages can be injected as a Zip entry just as other volume data. The HTRC Data API
uses a specially named Zip entry “ERROR.err” to convey error messages to the client. This allows the client to check the presence of this entry to
detect if any error occurred and to request the data again or to abort the process.
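A minimal sketch of this pattern is shown below: page texts are written as Zip entries, and if retrieval fails partway through, an ERROR.err entry is appended so the client can detect the failure. The page contents and the simulated failure are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch of streaming multiple pages in one Zip response and reporting a mid-stream
// failure via a specially named "ERROR.err" entry, as the Data API does.
public class ZipStreamSketch {

    static byte[] streamPages(Map<String, String> pages) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, String> page : pages.entrySet()) {
                try {
                    if (page.getValue() == null) {
                        throw new IOException("backend read failed for " + page.getKey());
                    }
                    zip.putNextEntry(new ZipEntry(page.getKey()));
                    zip.write(page.getValue().getBytes(StandardCharsets.UTF_8));
                    zip.closeEntry();
                } catch (IOException e) {
                    // The HTTP status line is already committed, so report the error in-band.
                    zip.putNextEntry(new ZipEntry("ERROR.err"));
                    zip.write(e.getMessage().getBytes(StandardCharsets.UTF_8));
                    zip.closeEntry();
                    break;
                }
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("vol1/00000001.txt", "page one text");
        pages.put("vol1/00000002.txt", null); // simulate a failed retrieval
        System.out.println("zip bytes written: " + streamPages(pages).length);
    }
}
```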
The initial implementation of the Data API was logically straightforward. A client sends a list of volume identifiers or page identifiers in the
request. Upon receiving a request, the Data API service instantiates a handler for the request, parses the identifiers list to ensure their validity,
and iterates through the list, retrieving each one from Cassandra. The data is zipped and streamed back to the client. However, such a synchronous coupling between retrieval from the data store and data transmission was unnecessarily slow.
After the initial testing, the Data API was redesigned using an asynchronous mechanism to fetch data from the Cassandra store. In this new
implementation, requests from all clients are added to a common queue. A pool of threads work asynchronously by taking entries off of the
queue, fetching the data from Cassandra, and notifying the corresponding handlers when some data is ready. Such an asynchronous approach allows data fetching, zipping, and transmitting to overlap. The new Data API increased the average transfer rate from below 6MB/sec to about 20MB/sec. The asynchronous Data API employs a throttling mechanism to limit the number of entries each client may place in the queue at any given time, thus controlling the amount of data that can wait in memory. The data is stored as volumes (books) and pages. The API uses the page as the throttling unit since the size of volumes varies widely. If each throttling unit were defined to be 100 pages, an entry requesting 200
pages would occupy 2 units. The throttling mechanism solves the problem of a single client monopolizing the queue because at any given time,
the number of pages a single client can have in the queue is bounded, so that other requests will get resources as well. The Data API utilizes the
Java WeakReference class to wrap client request entries before they are added to the queue, so that if a client drops before the request is
complete, its queued entries will no longer be valid and therefore will not be processed. The throttling increased the transfer rate to 60MB/sec.
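A minimal sketch of this queue-and-workers pipeline, assuming invented request and class names, is shown below. The fetch itself is simulated, and the throttling bookkeeping is omitted; the point is the shared queue, the worker pool, and the use of WeakReference so that entries from dropped clients are skipped.

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the asynchronous fetch pipeline: client requests are wrapped in
// WeakReferences and placed on a shared queue; a pool of workers takes entries off
// the queue and "fetches" the data. Thread counts and names are assumptions.
public class AsyncFetchSketch {

    record PageRequest(String clientId, String volumeId, int pages) { }

    private final BlockingQueue<WeakReference<PageRequest>> queue = new LinkedBlockingQueue<>();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    void submit(PageRequest request) {
        // Wrapping in a WeakReference means a request from a dropped client can be
        // skipped once nothing else holds a strong reference to it.
        queue.offer(new WeakReference<>(request));
    }

    void start() {
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        PageRequest request = queue.take().get();
                        if (request == null) {
                            continue; // client went away; skip the stale entry
                        }
                        // Simulated fetch from the data store; a real worker would notify
                        // the request's handler so zipping can begin immediately.
                        System.out.println("fetched " + request.pages() + " pages of "
                                + request.volumeId() + " for " + request.clientId());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncFetchSketch pipeline = new AsyncFetchSketch();
        pipeline.start();
        PageRequest keepAlive = new PageRequest("client-1", "uc1.b12345", 200);
        pipeline.submit(keepAlive); // a strong reference is held, so this entry is processed
        Thread.sleep(500);
        pipeline.workers.shutdownNow();
    }
}
```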
HTRC Indexing
Indexing is a key component to the HTRC cyberinfrastructure. As researchers may search in both the HathiTrust Digital Library as well as the
HathiTrust Research Center, the HTRC decided to use the same indexing software and index schema as the HathiTrust Digital Library. Both use
Apache Solr, a Lucene-based open-source enterprise search platform, to index both bibliographic data (fielded data such as author, title, publisher, and publication dates) and the unstructured OCR full-text data. Solr, like most indexing tools, requires a large amount of memory. The HTRC Solr index files for the public domain data described above are 1.8 TB, and in the foreseeable future the size will increase significantly as new data is processed. The index is currently hosted on seven servers using the Solr sharding technique. Solr sharding is a way to scale Solr horizontally by breaking a single Solr instance into multiple Solr instances, each with an identical schema. The shards run independently and have no overlap with one another. Because each Solr shard runs on its own dedicated machine, the distribution enables HTRC Solr to utilize more resources, such as
CPU and memory, which improves search performance significantly (see figure 8).
Figure 8. Solr Shards
Solr sharding is not completely transparent. It requires specific knowledge of the access point of every single shard for each query. In addition,
Solr does not have a strong security layer at either the document level or the communication level, meaning that it is possible to modify a Solr
index freely when the access points of the shards are known. To solve these problems, we designed and implemented a proxy for the HTRC Solr
indices. This proxy runs in front of HTRC Solr shards as an extra lightweight layer to protect the index files from being modified, to hide the
details of distributed deployment of Solr shards, and to audit access. The proxy allows only the Solr search functionality and preserves all of the
Solr query syntax. Queries are sent to the Solr proxy as if it were a single Solr instance. The Solr proxy logs all requests, including any prohibited
commands. The head shard receives the queries and then distributes them to all shards including the head shard itself. The head shard then
aggregates the query results and returns the aggregated results to the Solr proxy. Finally, the Solr proxy returns the result back to the requesting
application. The proxy also provides extra RESTful calls for enhanced functionality such as downloading the MARC records for a set of volume
IDs.
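The proxy behavior described above can be sketched as a small pass-through service. The Python sketch below assumes a hypothetical head-shard URL and the standard Solr /select handler; it is an illustration of the idea, not the HTRC implementation.

import logging
import requests
from flask import Flask, request, jsonify

HEAD_SHARD = "http://solr-head:8983/solr/htrc"   # hypothetical head-shard URL
app = Flask(__name__)
logging.basicConfig(filename="solr-proxy.log", level=logging.INFO)

@app.route("/solr/<path:handler>", methods=["GET"])
def proxy(handler):
    # Log every request, including prohibited commands, for auditing.
    logging.info("%s %s %s", request.remote_addr, handler,
                 request.query_string.decode())
    if handler != "select":
        # Only search is allowed; update, delete and admin calls are rejected.
        return jsonify(error="only /select queries are permitted"), 403
    # Forward the full Solr query syntax to the head shard, which fans the
    # query out to all shards and aggregates the results.
    upstream = requests.get(f"{HEAD_SHARD}/select",
                            params=request.args.to_dict(flat=False))
    return upstream.content, upstream.status_code, {"Content-Type": "application/json"}

if __name__ == "__main__":
    app.run(port=8080)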
The HTRC currently holds approximately 2.6 million volumes. To build an index efficiently against such a large corpus, the index building process needs to execute in parallel. We are now using seven machines, each with four cores and 16GB of memory, to build the full-text index. Of the seven machines, six run Solr instances and are responsible for building the index. The seventh machine starts N threads, each responsible for 1/N of all volumes to index (see Figure 9). For each volume, the responsible thread gets the bibliographic information from the corresponding MARC record, extracts the text content from the Pairtree, prepares a SolrInputDocument by adding all bibliographic fields and OCR files to it, and finally commits it to one of the six Solr servers. We track the index shard for each volume in this distributed environment to make debugging and data management easier. Although building the index is generally considered a one-time effort, a complete reindex can be required periodically for reasons such as schema changes and performance enhancements; hence our efforts to create an efficient process.
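The fan-out indexing step can be sketched as follows. The server URLs, thread count, and the read_marc/read_ocr helpers are placeholders standing in for the HTRC's MARC and Pairtree processing; the sketch only shows how N threads split the volume list and route documents to the Solr servers.

import json
import threading
import requests

# Hypothetical Solr update endpoints for the six index-building servers.
SOLR_SERVERS = [f"http://solr{i}:8983/solr/htrc/update" for i in range(1, 7)]
N_THREADS = 12
shard_of = {}          # volume id -> Solr server, kept for debugging

def read_marc(vol_id):
    """Placeholder: parse the volume's MARC record into bibliographic fields."""
    return {"id": vol_id, "title": "", "author": ""}

def read_ocr(vol_id):
    """Placeholder: concatenate the OCR page files stored under the Pairtree."""
    return ""

def index_slice(thread_id, volume_ids):
    """Each thread indexes the 1/N slice of volumes assigned to it."""
    for k, vol_id in enumerate(volume_ids):
        if k % N_THREADS != thread_id:
            continue
        doc = read_marc(vol_id)
        doc["ocr_text"] = read_ocr(vol_id)
        target = SOLR_SERVERS[hash(vol_id) % len(SOLR_SERVERS)]
        requests.post(target, data=json.dumps([doc]),
                      headers={"Content-Type": "application/json"})
        shard_of[vol_id] = target        # track the shard for each volume

def build_index(volume_ids):
    threads = [threading.Thread(target=index_slice, args=(i, volume_ids))
               for i in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()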
The ingest of the HT data into the HTRC is done using rsync, which is a tool available on many Linux distributions that allows data to be
synchronized between a source and a destination by transmitting only the differences. The HT hosts an rsync server that points to a directory
containing Pairtrees from all prefixes (see Figure 7). An rsync client at the HTRC can retrieve and store a mirror copy of the entire directory
structure on the local disk. The data is parsed and loaded into the Cassandra NoSQL server. After rsync finishes updating the Pairtrees on disk,
the ingest process uses the output from the rsync process to determine what updates and deletions there are and modifies the data in the
Cassandra ring accordingly.
Cassandra is a fully distributed data store; however the initial tests of the HTRC ingest service showed a bottleneck in the process caused by the
single rsync point. The early version of the ingest service ran on one machine, which launched a single rsync command to transfer the Pairtrees.
This process proved to be very slow. Testing showed that while the rsync server exposes a single entry point at the root directory, it does not
prevent rsync clients from specifying a path pointing directly to a deeper level under the rsync root. Taking advantage of this capability, a method
allowing multiple rsync processes to transfer the data was devised. At the source end, the HT runs a Linux command such as “tree” to list all the paths with a
fixed depth into these Pairtrees and store these paths in a file (referred to as the rsync points file) at the root level on the rsync server. At the
destination site, the HTRC’s ingest service receives the rsync points file. The ingest service then opens the file to retrieve the list of paths and
creates these directories as if they were new. Meanwhile, the paths in the rsync points file are also added to a common queue. After all directories
are available on the disk, multiple threads issue rsync commands with a specific path as the entry point. This effectively changes from
transferring one single tree with a single rsync to transferring multiple subtrees with multiple rsync commands.
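A minimal sketch of this multi-rsync transfer follows, assuming a hypothetical rsync module name and local root; the rsync flags shown are generic and not necessarily those used by the HTRC.

import queue
import subprocess
import threading
from pathlib import Path

RSYNC_SERVER = "rsync://ht.example.org/pairtree"   # hypothetical rsync module
LOCAL_ROOT = Path("/htrc/pairtree")                # hypothetical local mirror
N_WORKERS = 8

def load_rsync_points(points_file):
    """Read the subtree paths and pre-create the local directories."""
    paths = [line.strip() for line in open(points_file) if line.strip()]
    for p in paths:
        (LOCAL_ROOT / p).mkdir(parents=True, exist_ok=True)
    return paths

def worker(path_queue):
    """Issue one rsync per subtree, so many transfers run concurrently."""
    while True:
        try:
            path = path_queue.get_nowait()
        except queue.Empty:
            return
        subprocess.run(["rsync", "-a", "--delete",
                        f"{RSYNC_SERVER}/{path}/", str(LOCAL_ROOT / path)],
                       check=True)

def sync_all(points_file):
    path_queue = queue.Queue()
    for path in load_rsync_points(points_file):
        path_queue.put(path)
    threads = [threading.Thread(target=worker, args=(path_queue,))
               for _ in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()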
To further improve performance, the rsync processes are distributed across all Cassandra nodes. Each ingest service instance is assigned a unique
number between 0 and (n − 1), where n is the total number of nodes running the ingest service. Each ingest service instance first transfers its own copy
of the rsync points file from the HT, but rather than processing and creating directories for all paths in the file, the ingest service computes the
modulo hash value of each path. If the hash value matches the unique number assigned to this ingest instance, it processes the subtree under this
path. This allows all nodes to participate in the rsync process of the source Pairtree. This is a more scalable and efficient model distributing the
workload and data across all current and future nodes.
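The modulo assignment can be sketched as a simple filter over the rsync points file. The hash function below is an assumption; any stable hash shared by all nodes would work.

import hashlib

def paths_for_node(rsync_points, node_id, n_nodes):
    """Return the subset of rsync point paths owned by this ingest node."""
    mine = []
    for path in rsync_points:
        digest = int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16)
        if digest % n_nodes == node_id:       # node ids run from 0 to n-1
            mine.append(path)
    return mine

# e.g. node 2 of 4 ingest nodes: paths_for_node(points, 2, 4)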
Research Contributions
The HathiTrust Research Center is a research project to develop a cyberinfrastructure research environment using the HathiTrust Digital Library
corpus for humanities researchers. This cyberinfrastructure offers a simple way to subset a huge collection of millions of books with billions of
pages, allows researchers to choose and execute algorithms without having to manipulate the data to fit the algorithm, and lets them view and save results, with the ultimate goal of helping researchers create new knowledge.
This research effort has contributed to the general understanding of developing cyberinfrastructure with open source components without
compromising a sound logical architecture. Using open source software to build robust cyberinfrastructure has a number of benefits: it allows for rapid development, provides a solid code base for basic features, and allows for efficient use of resources. We were able to build a distributed
data synchronization process using simple tools that processed a billion files. This innovative approach can be a model for big data systems with
massive amounts of data to be transferred.
The HTRC cyberinfrastructure allows, indeed encourages, users to contribute their algorithms for analyzing the millions of books in the corpus.
These user-contributed algorithms need access to the HTRC data and indexes to perform their work. Balancing security for the data with
ease of use for the users and their algorithms, we developed a layer of security over two popular open source tools – Cassandra and Solr. For
Cassandra, we developed a RESTful Data API that allows algorithms to access data without exposing the underlying Cassandra schema. The API
provides a simple mechanism to stream huge sets of data to the algorithms. In addition to providing a more secure interface, the Data API allows
us to log access to understand usage of the materials and to provide copyright protections. Algorithms need access to the Solr indexes. With a
proxy service, we are able to simplify access to the Solr shards, provide logging for access analysis, and add a layer of security to the indexes.
This work contributes towards a better understanding of the issues of data security when users need to execute their own programs in a
cyberinfrastructure.
FUTURE RESEARCH DIRECTIONS
The HTRC has ongoing initiatives for security and increased access. In the next phase of the project, as the HTRC prepares to receive the works still under copyright from the HathiTrust Digital Library, significant research on non-consumptive research is in progress; non-consumptive
research is a legal term meaning that no action or set of actions on the part of users, either acting alone or in cooperation with other users over
the duration of one or multiple sessions, can result in sufficient information gathered from the HathiTrust collection to reassemble pages from
the collection. Non-consumptive as a research challenge is approached through deeper study of the constraint and recommendations for tooling
adaptations to satisfy it.
The HTRC production roll out includes scaling up to full operational capacity in both hardware and support mechanisms. This includes providing
a Sandbox for algorithm developers to test their code against the Data and Indexing APIs and to scale their code to the data. In the future, the HTRC will allow end users to contribute algorithms, data, notations, and metadata improvements. Research is underway to
determine the data structures, interfaces and relationships that must be created to support this new and important functionality.
Big data in the humanities, and the tools that provide access to them, have the potential to revolutionize the process of humanities research
methodologies as well as the outcomes of that research (Boyd & Crawford, 2011). As the work of humanists changes and evolves, the HTRC will
evolve and change as well. Future work will focus on the new requirements of humanities researchers.
CONCLUSION
The HathiTrust Research Center has been designed to make the technology serve the researcher – to make the content easy to find, to make the
research tools efficient and effective, to allow researchers to customize their environment, to allow researchers to combine their own data with
that of the HTRC, and to allow researchers to contribute tools. The architecture has multiple layers of abstraction providing a secure, scalable,
extendable, and generalizable interface for both human and computational users.
This work was previously published in Big Data Management, Technologies, and Applications edited by WenChen Hu and Naima Kaabouch,
pages 270-294, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Ács, B., Llorà, X., Auvil, L., Capitanu, B., Tcheng, D., Haberman, M., & Welge, M. (2010). A general approach to data-intensive computing using
the Meandre component-based framework. In Proceedings of the 1st International Workshop on Workflow Approaches to New Data-Centric
Science. IEEE.
Adams, J. L., & Gunn, K. B. (2012). Digital humanities: Where to start. College & Research Libraries News , 73(9), 536–569.
Almas, B., Babeu, A., Bamman, D., Boschetti, F., Cerrato, L., Crane, G., … Smith, D. (2011). What did we do with a million books: Rediscovering the Greco-ancient world and reinventing the humanities (White Paper). Washington, DC: National Endowment for the
Humanities.
Bietz, M. J., Ferro, T., & Lee, C. P. (2012). Sustaining the development of cyberinfrastructure: An organization adapting to change.
In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (pp. 901-910). ACM. Retrieved from
http://dx.doi.org.ezproxy.lib.indiana.edu/10.1145/2145204.2145339
Borgman, C.L. (2009). The digital future is now: A call to action for the humanities. Digital Humanities Quarterly, 4(1).
Bryant, R. E., Katz, R. H., & Lazowska, E. D. (2008). Big-data computing: Creating revolutionary breakthroughs in commerce, science, and
society. In Computing Research Initiatives for the 21st Century. Computing Research Association. Retrieved from
www.cra.org/ccc/docs/init/Big_Data.pdf
Candela, L., Athanasopoulos, G., Castelli, D., El Raheb, K., Innocenti, P., & Ioannidis, Y. … Ross, S. (2011). The digital library reference model.
Retrieved from http://bscw.research-infrastructures.eu/pub/bscw.cgi/d222816/D3.2b Digital Library Reference Model.pdf
Cattell, R. (2011). Scalable SQL and NoSQL data stores. SIGMOD Record , 39(4), 12–27. doi:10.1145/1978915.1978919
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., & Gruber, R. E. (2008). Bigtable: A distributed storage system for
structured data. ACM Transactions on Computer Systems , 26(2), 4. doi:10.1145/1365815.1365816
Cloud Security Alliance Big Data Working Group. (2012). Top 10 big data security and privacy challenges. Cloud Security Alliance. Retrieved
from https://cloudsecurityalliance.org/research/big-data/#_downloads
Crane, G. (2006). What do you do with a million books? D-Lib Magazine , 12(3). Retrieved from
http://www.dlib.org/dlib/march06/crane/03crane.html doi:10.1045/march2006-crane
Cunningham, L. (2011). The librarian as digital humanist: The collaborative role of the research library in digital humanities projects. Faculty of
Information Quarterly, 2(1). Retrieved from http://fiq.ischool.utoronto.ca/index.php/fiq/article/view/15409/12438
Drucker, J. (2009, April 3). Blind spots: Humanists must plan their digital future. The Chronicle of Higher Education .
Fielding, R. T., & Taylor, R. N. (2002). Principled design of the modern web architecture. ACM Transactions on Internet Technology , 2(2), 115–
150. doi:10.1145/514183.514185
Friedlander, A. (2009). Asking questions and building a research agenda for digital scholarship . In Working Together or Apart: Promoting the
Next Generation of Digital Scholarship . Academic Press.
Greenbaum, D. A., & Kainz, C. (2009). Report from the bamboo planning project. Coalition for Networked Information. Retrieved from
http://www.cni.org/topics/ci/report-from-the-bamboo-planning-project/
Han, J., Haihong, E., Le, G., & Du, J. (2011). Survey on NOSQL database. In Proceedings of the 6th International Conference on Pervasive
Computing and Applications (ICPCA), (pp. 363-366). ICPCA.
Hecht, R., & Jablonski, S. (2011). NoSQL evaluation: A use case oriented survey. In Cloud and Service Computing (CSC), (pp. 336-341). IEEE.
Hewitt, C., Bishop, P., & Steiger, R. (1973). A universal modular ACTOR formalism for artificial intelligence. In Proceedings of the 3rd
International Joint Conference on Artificial Intelligence (IJCAI'73), (pp. 235-245). IJCAI.
Jacobs, A. (2009). The pathologies of big data. Communications of the ACM , 52(8), 36–44. doi:10.1145/1536616.1536632
Kunze, J., Haye, M., Hetzner, E., Reyes, M., & Snavely, C. (2011). Pairtrees for collection storage (V0.1). Retrieved from
https://wiki.ucop.edu/download/attachments/14254128/PairtreeSpec.pdf
Manadhata, P. K. (2012). Big data for security: Challenges, opportunities, and examples. In Proceedings of the 2012 ACM Workshop on Building
Analysis Datasets and Gathering Experience Returns for Security (pp. 3-4). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=2382420
Manovich, L. (2012). Trending: The promises and the challenges of big social data. In M. K. Gold (Ed.), Debates in the Digital Humanities.
Minneapolis, MN: The University of Minnesota Press. Retrieved from http://lab.softwarestudies.com/2011/04/new-article-by-lev-manovich-
trending.html
Okman, L., Gal-Oz, N., Gonen, Y., Gudes, E., & Abramov, J. (2011). Security issues in NOSQL databases. In Proceedings of Trust, Security and
Privacy in Computing and Communications (TrustCom), (pp. 541-547). IEEE.
Pastorino, C., Lopez, T., & Walsh, J. A. (2008). The digital index chemicus: Toward a digital tool for studying Isaac Newton's index
chemicus. Body, Space & Technology Journal, 7(20).
Pautasso, C., Zimmermann, O., & Leymann, F. (2008). Restful web services vs. big web services: Making the right architectural decision.
In Proceedings of the 17th International Conference on World Wide Web (pp. 805-814). ACM.
Pitti, D. V. (2004). Designing sustainable projects and publications . In Schreibman, S., Siemens, R., & Unsworth, J. (Eds.), A Companion to
Digital Humanities . Oxford, UK: Blackwell. doi:10.1002/9780470999875.ch31
Press, G. (2013). A very short history of big data. Forbes. Retrieved from http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-
history-of-big-data/
Sadler, E. (2009). Project blacklight: A next generation library catalog at a first generation university. Library Hi Tech , 27(1), 57–67.
doi:10.1108/07378830910942919
Sandhu, R., Ferraiolo, D., & Kuhn, R. (2000). The NIST model for role-based access control: Towards a unified standard. In Proceedings of the Fifth ACM Workshop on Role-Based Access Control, (pp. 47-63). ACM.
Schmuck, F., & Haskin, R. (2002). GPFS: A shared-disk file system for large computing clusters. In Proceedings of the First USENIX Conference
on File and Storage Technologies, (pp. 231-244). USENIX.
Schopf, J. M. (2009). Sustainability and the office of cyberinfrastructure . In Proceedings of Network Computing and Applications . IEEE.
Schwan, P. (2003). Lustre: Building a file system for 1000-node clusters. In Proceedings of the 2003 Linux Symposium. Linux.
Stewart, C. A., Simms, S., Plale, B., Link, M., Hancock, D., & Fox, G. C. (2010). What is cyberinfrastructure? Norfolk, VA: Association for
Computing Machinery (ACM). doi:10.1145/1878335.1878347
Stonebraker, M. (2010). Errors in database systems, eventual consistency, and the CAP theorem. Communications of the ACM , 5.
Unsworth, J., Rosenzweig, R., Courant, P., Frasier, S. E., & Henry, C. (2006). Our cultural commonwealth: The report of the American council of
learned societies commission on cyberinfrastructure for the humanities and social sciences. New York: American Council of Learned Societies.
Retrieved from http://www.acls.org/programs/Default.aspx?id=644
Walsh, J. A., & Hooper, E. W. (2012). The liberty of invention: Alchemical discourse and information technology standardization. Literary and
Linguistic Computing , 27, 55–79. doi:10.1093/llc/fqr038
Williford, C., & Henry, C. (2012). One culture. Council on Library and Information Resources. Retrieved from
http://www.clir.org/pubs/reports/pub151
ENDNOTES
1 http://webapp1.dlib.indiana.edu/newton/lsa/index.php
2 http://www.perseus.tufts.edu/hopper/collections
6 http://dirt.projectbamboo.org/
7 http://www.loc.gov/standards/mets/
8 http://www.loc.gov/marc/
CHAPTER 19
Construct a Bipartite Signed Network in YouTube
Tianyuan Yu
National University of Defense Technology, China
Liang Bai
National University of Defense Technology, China
Jinlin Guo
National University of Defense Technology, China
Zheng Yang
National University of Defense Technology, China
ABSTRACT
Nowadays, video-sharing websites are becoming more and more popular, which leads to latent social networks among videos and users. In this work, results are built from data collected from YouTube, one of the largest user-driven online video repositories, and are supported by Chinese sentiment analysis that surpasses the state of the art. Based on this, the authors construct two types of bipartite signed networks, video
network (VN) and topic participant network (TPN), where nodes denote videos or users while weights of edges represent the correlation between
the nodes. Several indices are defined to quantitatively evaluate the importance of the nodes in the networks. Experiments are conducted by
using YouTube videos and corresponding metadata related to two specific events. Experimental results show that both the analysis of social
networks and indices correspond very closely with the events' evolution and the roles that topic participants play in spreading Internet videos.
Finally, the authors extend the networks to summarization of a video set related to an event.
INTRODUCTION
The amount of information shared on online social media has been growing at an extraordinary speed in recent years. In particular, the rise of easy-to-use video-sharing websites such as YouTube makes it easy for users to upload, manage and share videos. Recent statistics show
that, 100 hours of videos are uploaded to YouTube every minute, more than 1 billion unique users visit YouTube each month and over 6 billion
hours of videos are watched each month1. Among the videos, some are captured by uploaders, others are remixed and reposted. Users rate and
comment on the videos according to their tastes. Inevitably, this has led to latent, rich and complicated social networks in YouTube, which are an invaluable resource for understanding online social phenomena.
In this work, we use visual information to propose a weighted network, the video network (VN). Combined with the corresponding metadata, a
bipartite signed network, topic participant network (TPN) is constructed. Figure 1 illustrates the approach of constructing and analyzing the
social networks. Firstly, we use event-related text queries to crawl large quantities of videos and the corresponding metadata from YouTube.
Then, the VN is constructed with the near-duplicate key frames (NDK) and video metadata. In addition, we construct the signed network, the TPN, with user metadata. In the TPN, negative-weighted edges are constructed to represent users’ opposing opinions according to the sentiment analysis method. Then, we analyze the statistical features of the networks at the global level and propose several measures to evaluate the users in the social network at the level of individual nodes. In the end, we select several representative video clips to summarize video sets according to the important videos and participants, which would be useful for video search engines. The main contributions of this work are as follows:
• We propose to construct a bipartite signed social network based on Internet videos and the corresponding metadata related to specific
events.
• We conduct Chinese sentiment analysis of YouTube comments to mine users’ opinions about a specific event, and the method surpasses the state of the art.
• We investigate utilizing the important videos and topic participants to summarize specific events with only a few videos.
RELATED WORK
YouTube is the most famous video-sharing website in the world. It is not only a website for entertainment, but also supplies large quantities of
videos and relevant information for scientific research. The first large-scale YouTube measurement study was conducted by Cha et al. (2007). Since then, research on YouTube datasets has been a hot topic, mainly covering latent information about users (Brodersen et al., 2012; Szabo & Huberman, 2012), analysis of social phenomena (Hosseinmardi et al., 2014), and characteristics of YouTube videos (Feroz et al., 2014). Most previous works used only video metadata, including links (Lai & Wang, 2010), geographic location (Brodersen et al., 2012), view count
(Szabo & Huberman, 2012) and so on. In this work, besides video-accompanied metadata, visual content is also used for constructing networks.
There have been very few works on constructing networks using visual content. To our knowledge, only Xie et al. (2011) proposed constructing a social network based on the “visual meme”, which represents a segment clip of a video. However, they did not consider commenters in constructing
social networks.
Key frame extraction methods aim to obtain a set of frames that can efficiently represent and summarize video contents and can be reused in
many applications. An effective set of key frames, viewed as a high-quality summary of the video, should include the major objects and events of
the video, with little redundancy and overlapping content. Chen et al. (2001) presented a method using an unsupervised segmentation algorithm and
the technique of object tracking. After that, another method which adopted the joint evaluation of the audio and visual features was proposed
(Chen et al., 2002). In the paper, we use the existing technique (Pickering et al., 2002) to extract key frames.
Near-duplicate detection is a common technique for analyzing relations between Internet videos. Wu et al. (2007) proposed a novel approach for retrieving near-duplicate key frames (NDK) produced by transformation. A method for accelerating the matching of NDK was proposed (Tan et al., 2008), which exploited the temporal coherence property inherent in the key frame sequence of a video. Hehir (2011) proposed an efficient method for discovering NDK, processing a million key frames in a few hours. Based on it, Yu et al. (2015) improved the near-duplicate detection system by adding a post-processing step. In this work, we use the improved system.
Sentiment analysis is the computational study of people’s opinions, appraisals, and emotions toward entities, events and their attributes (Liu,
2010). It has been used with much success in fields where users have an obvious subjective agenda such as reviews (Zhang et al., 2014) or blogs
(Godbole et al., 2007). Compared to these conventional text sources, social media and streams have posed additional challenges for sentiment
analysis: most messages are very short, especially on social network services such as Twitter and YouTube; the language of social media is very informal, with numerous accidental and deliberate errors and grammatical inconsistencies; and the no-editing, no-filtering approach makes the data highly diverse. Despite these difficulties, researchers have still achieved good performance on Twitter data (Bravo-Marquez et al., 2013; Tai & Kao, 2013). However, “while lots of works have been published on Twitter, the YouTube data remains only partially covered” (Uryupina et al., 2014). Not only do YouTube comments exhibit all the typical properties of social media documents discussed above, they also provide an additional dimension: the targets of the comments are always concealed. Therefore, automatic opinion target identification in the comments becomes especially important in this case. To date, available datasets and resources for sentiment analysis have been restricted to text-based sentiment analysis only, let alone the relevant research on object identification. In this paper, we combine comments with visual
information to extract the potential target of comments. What is more, although the techniques for English are mature, the same techniques perform worse for Chinese because of the different language systems. Recent work (Tan et al., 2014) provided a baseline and listed the difficulties of extracting targets from Chinese, including word segmentation, representation of pronouns, identification of terms, and analysis of grammar. According to that report, the precision of polarity classification for Chinese microblogs has been over 90 percent. However, the precision of the whole system, including object identification and sentiment analysis of users’ comments, was much lower, and the best result was only 44.2 percent.
Most previous web-based social networks considered only positively weighted or unweighted edges (Scott, 2012). Yang et al. (2007) considered separating communities in social networks with negative edges. Kunegis et al. (2009) proposed a method to calculate the clustering coefficient on a negative social network constructed from the news website “Slashdot”. Recent work (Sudhahar et al., 2015) focused on community detection with negative edges, but the networks used are unweighted signed networks.
To our knowledge, no one has investigated the summarization of video sets before. This work will be useful for video search engines to show a full picture of an event with a minimal number of videos. The most similar areas are video summarization and video recommendation systems. In video summarization, most work focuses on extracting specific movements or interesting shots, usually applied in the sports field (Li & Sezan, 2001; Parry et al., 2011). That kind of work summarizes one video by compressing its duration, while in our work we use a few videos (fewer than 20) to represent the whole video set (including thousands of videos). The other similar field is video recommendation, which usually uses metadata and visual content. However, most recommendation systems focus on metadata with little visual content. In recent years, people have realized that visual information is also important for such systems. Based on that, Zhu et al. (2014) proposed a novel recommendation framework called VideoTopic, which focuses on user interest modeling and decomposes the recommendation process. However, the different aims make recommendation systems and summarization of video datasets incomparable.
HYPOTHESES AND TOPIC PARTICIPANTS
Pelle & Vonderau (2009) indicated that users of video-sharing websites prefer to upload favorite or important videos. As a result, the number of videos including NDK mainly depends on the preferences of the users. Specifically, for a hot news event, the relevant original videos on the Internet are posted by people who witness the event (citizens or journalists) or by experts who analyze the event. Among the videos, some are edited in various ways by different news programs, and a portion of them are also remixed and reposted by enthusiasts. These are the main reasons why NDK appear.
There are four important hypotheses in this work as basic preconditions. The first is that, although the authors of the videos are unknown, we consider uploaders as authors because the uploader is either the author or a friend of the author; in either case, the uploader shows the same interests as the author. The second is that all uploaders and commenters are interested in the topic to some degree, but uploaders show more preference for the topic than commenters, because it takes more effort for users to capture or remix a video. The third is that the uploader of an earlier-uploaded video cares more about the topic than that of a later-uploaded video. Since the difficulty of information retrieval is the same for each user, the earlier videos’ uploaders, who pay attention to the topic earlier, are more interested in the topic than the later videos’ uploaders. Another reason is that the later videos’ uploaders may be inspired by the earlier videos and therefore become interested in the topic. The last is that commenters who comment earlier show more interest in the topic than later commenters. The reason is similar to the third one.
We separate YouTube users into three types: uploaders, commenters, and viewers. These types may overlap; for example, uploaders and commenters are always viewers. According to the above four hypotheses, uploaders and commenters make more effort than viewers, which represents their preference for the topic. As a result, we define the uploaders and commenters of the related videos as the topic participants. Viewers who never upload videos or leave comments affect only one attribute of a video, the view count. However, the view count is not related to the importance of videos according to the later section “Experiments and Results”, so in this work we do not take this kind of user into account.
SENTIMENT ANALYSIS
In this work, a Chinese sentiment analysis method is used to extract the attitudes of topic participants. Whether the opinions of two users are the same or different determines the relationship between them. If the sentiments of two users toward the same target are the same, a positive edge is linked between them, whereas a negative edge appears between users who have different sentiments. Because of the character of YouTube comments, the targets are always omitted, which makes the identification of targets extremely important in this case. Therefore, in the following part, we first extract the target of the comments and then conduct sentiment analysis.
Extraction of Targets
In this section, we propose an algorithm for extracting the targets of comments. In the algorithm, the input, a set of topic-related keywords, is selected according to the frequency of terms in the metadata (including titles, descriptions and comments) of topic-related videos. Then, generic high-frequency terms and numbers are removed, and all the remaining possible targets are chosen as topic-related keywords. When users leave comments on a video, the targets are sometimes contained in the comments. For example, the comment “Ke is so smart!!! Lian is far behind him.” contains two keywords of the event “Taiwan Elections of 2014”, Ke and Lian, the two most competitive candidates for Taipei mayor. Sometimes, short comments may not contain keywords, for example, “Back you up”. In that situation, we use the title of the video, and in most cases we can get keywords. For some videos, the title contains very little useful information about the target; we then use the accompanying textual description of the video, since it is more specific than the title. If the targets are still not detected, we utilize the visual content of the video, such as images of potential targets, which are available from other videos. The potential target with the maximum number of images implies the target of the video, which is the most likely target of the comments. The sentiment analysis method for videos’ uploaders is similar; the only difference is that we skip the step of checking comments, because uploaders hardly comment on their own videos.
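The fallback chain can be sketched roughly as follows. The data structures (a video dict with title and description fields, and a per-keyword count of matched images) are assumptions made for illustration.

def extract_target(comment, video, keywords, visual_counts=None):
    """Return the most likely target keyword for a comment on a video.

    keywords      : topic-related keywords (candidate targets)
    visual_counts : optional dict keyword -> number of that candidate's
                    images detected among the video's key frames
    """
    # Try the comment text first, then the title, then the description.
    for text in (comment, video["title"], video["description"]):
        hits = [k for k in keywords if k in text]
        if hits:
            return hits[0]          # first keyword found in this source
    if visual_counts:
        # Fall back to visual content: the candidate appearing most often
        # in the video's key frames is taken as the likely target.
        return max(visual_counts, key=visual_counts.get)
    return None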
Sentiment Analysis of Targets
In this work, we use three classical and effective methods to analyze the sentiment toward the comments’ targets: the n-gram method, the lexical method, and the SVM method. We predefine the users' sentiments as positive, negative, and neutral. Specifically, we adopt an n-gram model for sentiment analysis, which has produced promising results in both English and Chinese (Bespalov et al., 2011; Tan & Zhang, 2008); since most comments on YouTube are short, we use a small n. For the lexical method, we first split the sentences using keywords. For each segment, we set weights for different words to represent the degree of polarity, and also consider the effect of part of speech, boost words such as “very”, negation words, whether the sentence is interrogative, and the influence of word position. We also implement a standard SVM model to analyze the sentiment, which has been proved useful for text classification (Li & Wu, 2010).
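A rough sketch of the lexical scoring idea follows. The tiny lexicons are illustrative stand-ins for the sentiment lexicon and boost/negation word lists described above, and the scoring rules are simplified relative to the full method.

POLARITY = {"smart": 1.0, "good": 1.0, "bad": -1.0, "corrupt": -1.5}
BOOST = {"very": 1.5, "so": 1.3}
NEGATION = {"not", "never"}

def lexical_sentiment(tokens):
    """Return 'positive', 'negative' or 'neutral' for a tokenized segment."""
    score, boost, negate = 0.0, 1.0, False
    for tok in tokens:
        if tok in BOOST:
            boost *= BOOST[tok]               # amplify the next polar word
        elif tok in NEGATION:
            negate = not negate               # flip the next polar word
        elif tok in POLARITY:
            value = POLARITY[tok] * boost
            score += -value if negate else value
            boost, negate = 1.0, False        # modifiers apply to one word
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# e.g. lexical_sentiment(["Ke", "is", "so", "smart"]) -> 'positive'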
CONSTRUCTION OF NETWORKS
Notations
Let $V = \{v_1, v_2, \dots, v_N\}$ be the crawled video set, where $N$ is the number of crawled videos. The uploader of video $v_i$ is $u_i$, its published time is $t_i$, its commenters are $c_{i,1}, \dots, c_{i,m_i}$, and the number of its commenters is $m_i$. Denote the uploader set as $U$, where $|U|$ is the number of uploaders, and in general $|U| \le N$. Let $C$ be the commenter set, where $|C|$ denotes the number of commenters. Let the set of topic participants be $P = U \cup C$, and let the video set that participant $p_k$ is involved in be $V_k$. Furthermore, each video $v_i$ contains a set of NDK denoted by $K_i$, and $V(\kappa)$ denotes the set of videos that contain the NDK $\kappa$. The social networks built in later sections are both directed and weighted. The edge starting from node $i$ to node $j$ is denoted by $e_{ij}$; if an edge exists between nodes $i$ and $j$, then $e_{ij} = 1$, otherwise $e_{ij} = 0$, and the weight of the edge is denoted by $w_{ij}$. Let $n_{ij}$ be the number of NDK shared between two videos $v_i$ and $v_j$, and let $n_{\max}$ denote the maximum number of shared NDK between any two videos. Finally, we define $s$ as the sentiment relationship between two participants.
Video Network
In this section, we construct a weighted video network, in which each node represents a video and each edge represents the relationship between videos. Specifically, if at least one NDK is shared between videos $v_i$ and $v_j$, and if $t_i < t_j$, there will be a directed edge from node $i$ to node $j$, indicating that information is delivered from $v_i$ to $v_j$, and the weight $w_{ij} = n_{ij}$ denotes the number of shared NDK.
We use classical and improved degree centrality to evaluate the importance of each node in the VN. The classical in-degree and out-degree of each video node $v_i$, denoted $d^{in}_i$ and $d^{out}_i$, are calculated as follows:

$d^{in}_i = \sum_{j \ne i} e_{ji}$ (1)

$d^{out}_i = \sum_{j \ne i} e_{ij}$ (2)
If a video node has a higher classical in-degree, it has copied from more videos, and the chance that it is a remixed video is higher. Conversely, if a video node has a higher classical out-degree, it has been copied by more videos, and the chance that it is an original one is higher. As a result, the video importance $I_i$ in the VN is calculated as:

$I_i = \dfrac{d^{out}_i}{d^{in}_i + 1}$ (3)
However, classical degree centrality considers only the number of videos that a video has an impact on and cannot capture the amount of influence a video has. To evaluate the information transferred by each video, we use the edge weights to define improved in-degree and out-degree, which describe how likely a video is to be remixed or original:

$D^{in}_i = \sum_{j \ne i} w_{ji}$ (4)

$D^{out}_i = \sum_{j \ne i} w_{ij}$ (5)
The larger the improved in-degree $D^{in}_i$ of a video is, the more key frames the video reposts, and the higher the chance that the video is a remixed one. Conversely, the larger the improved out-degree $D^{out}_i$ is, the more of its key frames are used by others, and the higher the chance that the video is an original one. So the improved video importance $I'_i$ in the VN is calculated as follows, as the ratio between output information and input information:

$I'_i = \dfrac{D^{out}_i}{D^{in}_i + 1}$ (6)

In Equations 3 and 6, the constant “1” in the denominator is used to avoid the in-degree equaling 0 for original videos. Videos with larger values of $I_i$ or $I'_i$ attract more attention from uploaders, and most of them are original and thereby more important.
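As an illustration, both importance scores can be computed from a table of NDK counts with a few lines of Python using networkx; the input format (ordered video pairs mapped to NDK counts) is an assumption of the sketch.

import networkx as nx

def build_vn(ndk_counts):
    """Edges run from the earlier video to the later one; weight = #NDK."""
    vn = nx.DiGraph()
    for (earlier, later), n_ndk in ndk_counts.items():
        vn.add_edge(earlier, later, weight=n_ndk)
    return vn

def video_importance(vn, video):
    # Classical importance: ratio of out-degree to (in-degree + 1).
    i_classic = vn.out_degree(video) / (vn.in_degree(video) + 1)
    # Improved importance: same ratio using edge weights (number of NDK).
    i_improved = (vn.out_degree(video, weight="weight")
                  / (vn.in_degree(video, weight="weight") + 1))
    return i_classic, i_improved

# Example: a remix v2 that reuses 3 key frames from an original v1.
vn = build_vn({("v1", "v2"): 3})
print(video_importance(vn, "v1"))   # original: high importance
print(video_importance(vn, "v2"))   # remix: low importance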
Topic Participant Network
In this section, we propose a bipartite signed network, the topic participant network, whose nodes represent topic participants. Whether an edge exists depends on the situation in the VN: if two videos are linked in the VN, then all the participants of these two videos are related in the TPN; namely, an edge exists between any two topic participants of the linked videos. An edge's weight describes one participant's influence on another. If two videos $v_i$ and $v_j$ uploaded by different uploaders contain shared NDK, the uploaders $u_i$ and $u_j$ of these two videos are more likely to be interested in the same aspect of the topic, which means that the uploaders have a connection with each other. Similarly, connections also exist between any two participants of the two videos, whether uploaders or commenters. The direction of an edge always runs from the earlier participant to the other, representing the direction of transferred information.
For an arbitrary video, links exist between its uploader and its commenters. Meanwhile, commenters are also related to each other. According to the second and fourth hypotheses above, we consider the earlier commenters to be more relevant to the uploader than the later ones, because most early commenters watch the video and push it onto the hot list, and the recommendation system of YouTube then places the video near the front of search results, which leads the later commenters to leave their comments. Thus, for a video $v_i$, the weight between its uploader $u_i$ and commenter $c_{i,j}$ is computed as follows:
(7)
where $m_i$ denotes the number of comments for video $v_i$ and $n_{\max}$ denotes the maximum number of NDK between any two videos; its function will be explained later. The variable $s$ describes whether the sentiments of different participants are accordant or not. For the same target, the value of $s$ is taken according to Table 1. If the sentiments of two participants are accordant, the original weight is doubled to strengthen the relationship; if the sentiments are opposite, the original weight is doubled and negated; if one is neutral and the other is negative, the weight is negated; and if one is neutral while the other is positive, the weight is unchanged.
Table 1. The value of $s$ between any two topic participants

            Positive   Neutral   Negative
Positive       2          1         -2
Neutral        1          2         -1
Negative      -2         -1          2
Similarly, there are also connections between commenters, which however should be weaker, because the target of most comments is the video rather than other comments. The weight between two commenters of the same video is calculated as follows:
(8)
When two arbitrary videos share NDK, the later uploader is a potential follower of the earlier one, so the edge starts from the earlier video's uploader to the later one's. In line with the third hypothesis, we consider that the earlier uploader has more preference for the topic than the other. Furthermore, if two videos share more NDK, more information has been transferred, which implies a closer relationship between the two uploaders. As a result, the weights in the TPN depend on the number of shared NDK and the interval between published times. Here we calculate the weight between the two uploaders:
(9)
where the parameter is chosen according to event popularity on YouTube (Xie et al., 2011).
Moreover, a link also exists between any two commenters of the linked videos, and Equation 10 shows the calculation of its weight. The value of the weight is set to 0 at first, and an iterative process then keeps updating it. According to the fourth hypothesis, the edge starts from the commenter of the earlier-posted video to the other. Equation 11 shows the calculation of the weight between an uploader and an arbitrary commenter of the other video. Similarly, the edge starts from the topic participant of the earlier-posted video to the other.
(10)
(11)
(12)
(13)
Here, we explain the function of $n_{\max}$. For participants in the TPN, an arbitrary commenter of a video $v_i$ has a closer relationship with the uploader of $v_i$ than with the uploader of another video $v_j$ that is linked with $v_i$ in the VN. A similar rule also holds for commenters. In a word, the relation between two participants of the same video is closer than that between two participants of different videos. In the calculation of the weights in the TPN, $n_{\max}$ guarantees the rule above.
In the rest of the section, several indices are defined for evaluating the importance of participants in the TPN. Because of the existence of negative edges, most methods for analyzing social networks cannot be used directly. Therefore, we propose four simple indices to evaluate users' importance.

First of all, for each user we define two types of connected people: supporters, who connect to the user with positive edges, and objectors, who link to the user with negative edges. The first index compares the number of supporters and objectors of a topic participant. Participants with more supporters are more important, while those with more objectors are also important to some degree.
The second index we use is eigenvector centrality, which conceptually performs random walks along the directed edges and propagates the importance of each node; thus, when evaluating the importance of a user, the importance of the user's neighbors is also considered. In the experiment, we use an improved version of eigenvector centrality, PageRank, whose matrix form is:

$\mathbf{r} = \left( d\, A^{\mathsf T} D^{-1} + \dfrac{1-d}{N}\, E \right) \mathbf{r}$ (14)

where $A$ is the adjacency matrix of the network, $D$ is a diagonal matrix of absolute row sums satisfying $D_{ii} = \sum_{j} |A_{ij}|$, the empirical value for the damping factor $d$ is 0.85, and $E$ is a full-one matrix.
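A rough power-iteration sketch of this matrix form follows, normalizing by absolute row sums as described above. With signed weights the iteration is applied only as a heuristic score, and convergence is not guaranteed; the tolerances are assumptions of the sketch.

import numpy as np

def pagerank(A, d=0.85, tol=1e-8, max_iter=200):
    """Power iteration for r = d * A^T D^{-1} r + (1-d)/N * E r."""
    n = A.shape[0]
    row_sums = np.abs(A).sum(axis=1)
    row_sums[row_sums == 0] = 1.0              # avoid division by zero for sinks
    P = (A / row_sums[:, None]).T              # A^T D^{-1}, D from absolute row sums
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = d * P @ r + (1 - d) / n * r.sum()   # second term is the E-matrix part
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r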
Furthermore, the last two indices are designed based on the deduction that the importance of a participant $p_k$ depends on the importance of the videos $V_k$ that the participant is involved in. The third index is the summation of the importance of the videos in $V_k$. In Equation 15, different weights are set for uploaders and commenters respectively, since they show different degrees of preference for the video content. However, this index is largely influenced by the size of $V_k$. Therefore, we propose the fourth index by calculating the average importance of $V_k$, shown in Equation 16, where $|V_k|$ denotes the number of videos in $V_k$.
(15)
(16)
In these two indices, we use only $I'$ rather than $I$. According to the experimental results, the classical and improved degree centralities are positively correlated. However, the improved degree centrality uses the amount of transferred information rather than the number of videos containing NDK, which is more reasonable for evaluating video importance.
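A minimal sketch of the third and fourth indices follows. Because the extracted text does not give the uploader and commenter weights, they are left here as parameters (alpha, beta) that are assumptions of the sketch rather than the chapter's values.

def participant_importance(videos_of, role_of, improved_importance,
                           alpha=1.0, beta=0.5):
    """videos_of[p]           : videos participant p uploaded or commented on
       role_of[(p, v)]        : 'uploader' or 'commenter'
       improved_importance[v] : I' score of video v
       Returns (total, average) importance per participant."""
    total, average = {}, {}
    for p, vids in videos_of.items():
        s = sum((alpha if role_of[(p, v)] == "uploader" else beta)
                * improved_importance[v] for v in vids)
        total[p] = s                                  # third index (sum over V_k)
        average[p] = s / len(vids) if vids else 0.0   # fourth index (average)
    return total, average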
EXPERIMENTS AND RESULTS
Dataset and Observation
We use two events, “Super Typhoon Haiyan” (dataset 1) and “Taiwan Local Elections of 2014” (dataset 2), to conduct the experiment. “Super
Typhoon Haiyan” is one of the top 10 international news stories of 2013, and it lasted only several days, so it is considered a short-term event. In total, we collected 1329 videos, ranging from November 4th, 2013 to November 14th, 2014, and 9872 unique participants. We also choose a long-term
event, “Taiwan Local Elections of 2014”. All in all, 1343 videos whose published time ranges from August 21st, 2014 to January 23rd, 2015 have
been crawled, and 1741 unique users took part in the topic.
Firstly, we use basic statistical methods to analyze the datasets. The distributions over video published time are shown in Figure 2, Figure 3, Figure 4, and Figure 5. Because few videos were uploaded on any given day in 2014, we select only the videos uploaded from November 4th, 2013 to December 11th, 2013, which account for over 80% of the videos in the whole dataset. The typhoon formed on November 2nd and was named “Haiyan” two days later. On November 8th, it hit the Philippines and killed at least 6300 people, which led the number of videos uploaded that day to reach its maximum. For the other dataset, the election was officially announced on August 21st, and the voting and vote counting started on November 29th. We select the days on which the number of videos is over 10 to plot Figure 3. The figure shows that on November 29th and 30th, the number of videos reaches a peak.
From the analysis above, we find that participants usually pay attention to a short-term news event immediately, and after the climax, the number of users following the event decreases gradually. In contrast, for a long-term news event, users keep following the event until the climax happens and then stop following it almost immediately.
Afterwards, we perform the near-duplicate detection method on the datasets and obtain the distribution of videos containing NDK over time. The videos in dataset 1 ranging from November 4th to December 9th are shown in Figure 4; they also account for over 80% of all the videos. Comparing Figure 4 with Figure 2, both numbers of videos reach their peak on November 8th, 2013, when the disaster begins. Both distributions over videos’ published time fit the typhoon’s timeline very well. Figure 5 shows the situation for dataset 2, and the trends of Figure 5 and Figure 3 are very similar and fit the election timeline very well. However, it is worth noticing that the peak of Figure 5 occurs on November 30th while that of Figure 3 appears on November 29th and November 30th. This demonstrates that the remixed videos are delayed compared with the original ones, and the delay is no more than one day.
In the dataset 1, 862 videos contain NDK in total, which account for over 64% of the whole dataset, while 8195 (over 83%) topic participants are
involved in the videos containing NDK. In the dataset 2, 903 videos (over 82%) include NDK while 1094 participants (over 62%) are involved in
the NDK videos. Videos of political events are rarely shot by ordinary citizens; most of them come from news organizations and journalists, while many users in the affected areas of the Philippines could capture and upload videos about “Typhoon Haiyan”. Therefore, a higher proportion of videos about the “Taiwan Election” contain NDK compared with those of “Typhoon Haiyan”.
Here we use dataset 1 to study a YouTube bipartite social network that contains only positive edges, while dataset 2 is used to build a signed social network.
Performance of Sentiment Analysis
We label 524 comments and use them to evaluate the performance of sentiment analysis. For the n-gram method, we try to use a labeled dataset to train the model; however, there is no labeled Chinese dataset related to politics. As a result, we use all the available labeled datasets of short texts, including hotel reviews with 7000 positive samples and 3000 negative samples and reviews from the online shopping website “Jing Dong” with 2000 positive and negative samples. We also use our own labeled data to train the model, listed as n-gram method 2 in Table 2.
Table 2. The methods of sentiment analysis and the corresponding precision
Methods Precision
N-gram 1 33.39%
N-gram 2 71.35%
Lexical 1 77.44%
Lexical 2 53.32%
SVM 38.15%
The lexical method is based on the lexicon labeled by DUTIR2, and the segmentation tool we use is FudanNLP, developed by Qiu et al. (2013). The boost words, negation words and question words come from HowNet3, an online common-sense knowledge base providing structured Chinese information. For the event “Taiwan Election”, we select keywords including the candidates, parties and corresponding nicknames, abbreviated forms of names and so on from the initial results as the possible targets of uploaders and commenters. We also implement the method without boost words, negation words and question words, listed as lexical method 2 in Table 2. The standard SVM model is also used to analyze the sentiment; the CHI method is used for dimensionality reduction and speedup, and TF-IDF features are used to represent each comment.
Table 2 shows the precision of the above five methods. The lexical method that considers the grammar of the sentence performs best at analyzing the comments' sentiment, reaching 77.44 percent, about 30 percentage points higher than the best result in the COAE (Tan et al., 2014). The n-gram method using a cross-domain corpus proves to be the worst method on this data.
Both n-gram and SVM are classical and effective methods for sentiment analysis. The main reason for their poor performance here is that they rely too much on the training set: the lack of a similar training set gives these methods the lowest precision among the five, and adding our labeled data to the training set makes the n-gram method perform much better. Another reason is that there are too many short comments. The average number of Chinese characters per comment is only 21.85 and the median is 15 in the experimental datasets, which undoubtedly affects the performance of n-gram and of SVM based on TF-IDF. We infer that the reasons our algorithm performs much better than the state of the art are that (1) the specific event narrows the scope of targets, so that we can list the potential targets of comments as keywords; and (2) the visual content is used as a supplement for finding potential targets.
Evaluating Importance of Videos and Topic Participants
In this part, we use the centrality and popularity measures to evaluate the importance of a video and a topic participant in the constructed VN
and TPN. Here, we report only the analysis results on dataset 2, firstly because the TPN constructed on dataset 2 is more complicated, and secondly because similar conclusions can be drawn for dataset 1.
The video importance is calculated according to Equations 3 and 6. In the following, the relations between video importance and published time, rating score and view count are analyzed. We use correlation coefficients to test the linear relation between the video importance score and the three variables. The results are shown in Table 3. In the table, “–” means that the correlation between two variables is not significant. Even where significant correlations exist, most of the correlation scores are so low that there is effectively no relation between these variables. However, the correlation between importance $I$ and importance $I'$ is over 0.3; $I$ is calculated according to classical degree centrality while $I'$ is based on improved degree centrality, so it is predictable that they are connected. A similar phenomenon occurs in dataset 1: the correlations between importance and rating score or view count are not significant, while the correlation between the video importance score and published time is less than 0.15, which means that the correlation is not obvious although it is significant.
Table 3. The relation between importance score and metadata of dataset 2. Symbol “–” means that the correlation between two variables is not
significant.
              $I$      $I'$     Betweenness
Rating        –        –        –
View count    –        –        –
$I$           –        0.33     –
$I'$          0.33     –        0.021
Betweenness   –        0.021    –
Figures 6 to 9 show the importance of topic participants in the TPN constructed on dataset 2. Figure 6 shows the distribution of the first index against each participant; the user named “mohawk548” has 600 more supporters than objectors. Figure 7 describes the eigenvector centrality of topic participants, where two public news channels get the highest scores because of their large number of neighbors. Figure 8 and Figure 9 illustrate the total importance and average importance of the participants. As expected, although the public news channels (“newsabc” and “udntv”) receive high total importance scores, their average importance scores are rather low.
Figure 6. Score of participants
The simplest method to summarize the videos is to use the near-duplicate information, because the video set generally contains much redundant information about a specific event. Therefore, we summarize videos by removing the near-duplicate ones. If two videos share more than a certain number of NDK, it is reasonable to suppose that they describe the same content. Furthermore, the more NDK a video contains, the more videos it can represent. Therefore, we can obtain a video set summary by choosing the videos that contain the largest numbers of NDK.
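This baseline can be sketched as a greedy selection. The ndk_graph input (video -> set of videos sharing at least one NDK with it) and the stopping criterion are assumptions of the sketch.

def summarize_by_ndk(all_videos, ndk_graph, target_fraction=0.5):
    """Greedily pick videos whose NDK neighbours cover the most new videos."""
    covered, summary = set(), []
    goal = target_fraction * len(all_videos)
    while len(covered) < goal:
        # Choose the video whose NDK neighbours add the most new coverage.
        best = max(all_videos,
                   key=lambda v: len((ndk_graph.get(v, set()) | {v}) - covered))
        gain = (ndk_graph.get(best, set()) | {best}) - covered
        if not gain:
            break                      # remaining videos share no NDK
        summary.append(best)
        covered |= gain
    return summary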
The results for dataset 1 are listed in Table 4, while the summary of “Taiwan Election” is shown in Table 5. As shown in Table 4, 7 videos represent over 20 percent of the dataset, 43 videos summarize 50 percent of all the videos, and 112 videos represent nearly 80% of the content of 1320 clips. However, the efficiency on dataset 2 is much lower. The possible reason is that for a long-term event, the number of videos covering different aspects is much larger than for breaking news.
Table 4. Results of video summarization for “Typhoon Haiyan”

Number of videos    Percentage of dataset represented
7                   20.85%
16                  30.89%
28                  40.73%
43                  50%
112                 78.76%
Table 5. Results of video summarization for “Taiwan Election”

Number of videos    Percentage of dataset represented
23                  20.35%
56                  30.17%
109                 40.16%
180                 50.01%
Our method uses the indices that evaluate the importance of videos and participants: the videos with the largest $I$ or $I'$ usually contain the triggers of the event; the users ranked highest by one of the participant indices provide videos which are more likely to be popular and to represent most users' attitudes; the users ranked highest by another index supply inspiring videos which attract more attention; and the users ranked highest by the remaining two indices provide videos which may record the turning points or driving forces of the event. Therefore, we can summarize the video set of the event based on the important videos and the videos uploaded by the important topic participants.
In this experiment, we empirically choose all the obvious outlier nodes whose values are much higher than those of other nodes for each index. For the important participants, we remove all the public channels and then select all the videos they upload or comment on as the candidate results. Finally, we remove all the redundant videos and the videos that share a large number of NDK. The results are as follows.
In dataset 1, the 2 videos with the largest $I$ show the situation before the typhoon hit the Philippines; of the 2 videos with the largest $I'$, one is a duplicate and the other is about the situation when the typhoon hit the Philippines. Three videos are selected according to the users ranked highest by one participant index: the first describes a ceremony to relieve the survivors; the second shows the aftermath the typhoon left in the Philippines; the third is about the reconstruction of the Philippines. Five more videos are selected based on the users ranked highest by the remaining indices: the first is a piece of news including a scientific analysis and the path of the typhoon; the second shows the whole course of the event; the third gives scientific data on the typhoon in real time; the fourth describes the storm surge after the typhoon left; and the last one shows devastating images.
Similarly, for dataset 2, we get 19 videos after removing redundant videos from the candidate result set. Among them, the 2 videos with the largest $I$ are about the Taipei race at the beginning of the election, while the 2 videos with the largest $I'$ describe the beginnings of two pieces of negative news about Ke, a mayoral candidate for Taipei. The user ranked highest by one participant index provides 4 videos, which include interviews with a Taipei mayoral candidate before and after voting and positive and negative news about a Taipei mayoral candidate. The user ranked highest by another index provides 4 videos that are all interviews with a Taipei mayoral candidate. The 3 videos commented on by the user ranked highest by a third index include an interview, a biography and speeches of Ke. The users ranked highest by the remaining index provide 4 videos, which include the election situation in New Taipei and Ji Long, the live situation of the vote counting, and the negative news about the Taipei mayoral candidates.
These videos from the “Taiwan Election” dataset cover the elections in the cities of Taipei, New Taipei and Ji Long, interviews with the Taipei candidates before and after voting, and the positive and negative news about the Taipei candidates that affected the voters' decisions. What is more, the 19 videos in the results mainly describe the election situation in Taipei. From these contents, we conclude that the most intense competition appears in the cities of Taipei, New Taipei and Ji Long, and that the competition among the Taipei candidates is the fiercest, which matches the real situation in Taiwan. From the results above, we find that the proposed method works better for the short-term event, because the summarization contains almost the whole course of the event. For the long-term event, although only part of the course could be tracked, we still find interesting information such as the degree of competition in each city. We suspect that the positive and negative news about the mayoral candidates we extracted are the most influential incidents; however, we cannot verify this suspicion. The summarization results for dataset 2, along with the timeline, are shown in Figure 10.
It is challenging to evaluate the results of video summarization methods, let alone the summary of an event-related video set. As far as we know, there are no objective and automatic evaluation methods for comparing the results. In this paper, we use the number of summarized videos and our empirical understanding of the events to evaluate the results of our method and of the simplest NDK method. Comparing the two methods, although we could not quantify the results of the second method, we find that its 11 videos cover almost the whole course of “Typhoon Haiyan”, whereas the 43 videos selected with the NDK method cover only about 50 percent of the content. For “Taiwan Election”, although the 19 videos selected with the second method cannot cover all the speeches and news about the election, they still include the cities where the competition among the candidates is intense, as well as several inside stories of the campaigns. Using the first method, 180 videos represent only about 50% of the content, which makes it impractical in reality.
CONCLUSION
In this work, we proposed to construct networks from Internet videos and related metadata based on the improved near-duplicate detection method and sentiment analysis. Specifically, two types of networks, the video network and the topic participant network (TPN), were constructed. The sentiment analysis, which surpasses the state of the art, was applied to make the TPN a signed network. Then, basic analysis and statistics were performed on the TPN. Furthermore, several indices were introduced for evaluating the importance of videos and topic participants in the constructed social networks. Finally, we conducted experiments using YouTube videos and the corresponding metadata related to the hot events “Super Typhoon Haiyan” and “Taiwan Local Election of 2014”. Experimental results have shown that the analysis based on the social networks and the importance indices fits the development of the events and the role topic participants play in spreading Internet videos very well. In the end, we used 11 videos to summarize “Typhoon Haiyan” and 19 videos to represent the event “Taiwan Election”. The summarization of event-related video sets will provide convenience for video search engines and their users.
This work was previously published in the International Journal of Multimedia Data Engineering and Management (IJMDEM), 6(4); edited
by Shu-Ching Chen, pages 56-77, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
ACKNOWLEDGMENT
This work is supported by grants from the Chinese National Natural Science Foundation under contract number 61201339.
ENDNOTES
1 http://www.youtube.com/yt/press/statistics.html
2 http://ir.dlut.edu.cn/EmotionOntologyDownload.aspx
3 http://www.keenage.com/
CHAPTER 20
Rapid Development of Service-Based Cloud Applications:
The Case of the Cloud Application Platforms
Fotis Gonidis
South-East European Research Centre, Greece & The University of Sheffield, UK
Iraklis Paraskakis
South-East European Research Centre, Greece
Anthony J. H. Simons
The University of Sheffield, UK
ABSTRACT
Cloud application platforms are gaining popularity and have the potential to alter the way service-based cloud applications are developed, involving the utilisation of platform basic services. A platform basic service provides certain functionality and is usually offered via a web API. However, the diversification of the services and of the available providers increases the challenge for application developers to integrate them and to deal with the heterogeneous providers’ web APIs. Therefore, a new approach to developing applications should be adopted, in which developers leverage multiple platform basic services independently of the target application platforms. To this end, the authors present a development framework assisting the design of service-based cloud applications. The objective of the framework is to enable the consistent integration of the services and to allow the seamless use of the concrete providers. The optimal service provider may vary each time depending on criteria such as pricing and quality of service, and can be determined based upon Big Data analysis approaches.
INTRODUCTION
In recent years two major technological trends have emerged that are able to drive the evolution of service-oriented computing, namely cloud computing and Big Data. The rise and proliferation of cloud computing (Armbrust et al., 2010), and of cloud platforms in particular (Cusumano, 2010), has the potential to change the way cloud-based software applications are developed, distributed and consumed. Cloud platforms’ popularity stems from their potential to speed up and simplify the development, deployment and maintenance of cloud-based software applications. Nevertheless, there is a large heterogeneity in the platforms’ offerings (Gonidis, Paraskakis, Simons, & Kourtesis, 2013), which can be classified into three clusters. In one cluster, application development time is drastically decreased with the use of bespoke visual tools and graphical environments, at the expense of a restricted application scope which is usually limited to customer relationship management (CRM) and office solutions. At the other end of the spectrum, platforms offer basic development and deployment capabilities such as application servers and databases. The intermediate cluster consists of cloud platforms which offer additional functionality via the provisioning of what the authors call platform basic services (e.g. mail service, billing service, messaging service etc.). A platform basic service can be considered as a piece of software which provides certain functionality and can be reused by multiple users. It is typically provisioned via a web API. The platforms offering such services are also referred to as cloud application platforms (Kourtesis, Bratanis, Bibikas, & Paraskakis, 2012). The rise of the cloud application platforms has the potential to lead to a paradigm shift in software development, where the platform basic services act as the building blocks for the creation of service-based cloud applications.
Almost in parallel with the emergence of cloud computing, data volumes started skyrocketing, leading to what is nowadays commonly referred to as Big Data (Jacobs, 2009). The term Big Data was initially attributed certain features (McAfee & Brynjolfsson, 2012), namely: a) Volume, b) Velocity and c) Variety. Volume refers to the large amount of data generated each day. Velocity dictates the need to analyse the rapidly collected data, as those become outdated quickly. Variety denotes the diverse forms and formats in which data are collected and stored. In addition to these features, Demchenko et al. (2013) propose a wider definition of Big Data by adding two extra characteristics: d) Veracity and e) Value. Veracity refers to the amount of uncertainty that the collected data contain and to the extent to which they can be considered trustworthy. Value denotes the added value that data can bring to predictive analysis. Therefore, what once used to be a technical problem, due to limited storage and processing capacity, is now being transformed into a business opportunity (The Data Warehouse Institute, 2011). The collection and analysis of Big Data can lead to the extraction of meaningful information and can subsequently drive important decisions (McAfee & Brynjolfsson, 2012).
In the field of service-oriented computing, the combination of cloud computing and Big Data analysis has the potential to lead to the design of flexible and adaptable service-based cloud applications, whose functionality and concrete service providers are determined based on stimuli from the environment. The criteria upon which the selection of the appropriate services is made can be based upon several factors such as pricing, quality of service, current availability of the provider, geospatial data etc. In certain cases the selection of the concrete service providers is a time-critical operation and is subject to the analysis of large volumes of data. Therefore, Big Data approaches can be employed to process and analyse these large volumes of data, thereby acting as decision support in the selection of the concrete service providers. As discussed later in this article, the world of e-commerce and electronic (mobile) payments can be used as a motivating scenario in this research work. Gartner, the leading information technology research company, reports a 42% average increase in mobile transaction volume and value in the period 2011-2016 (Gartner Inc., 2012). Therefore, cloud payment services, enabling an application to handle electronic payments, are becoming an indispensable part of service-based cloud applications. There is a plethora of available payment service providers offering a variety of features regarding price, quality of service, geographical region etc. Different payment providers may, at different times, better serve the needs of the application. In this case, Big Data analysis techniques may be applied in order to determine the optimal payment provider, while the application should be able to switch seamlessly to the given provider without disrupting its operations.
Consequently, the combination of Big Data approaches with methodologies for designing service-based cloud applications has the potential to pave the way to reactive applications, which are context aware and able to adapt themselves according to external stimuli. The adaptability lies primarily in the capability to change the concrete service providers seamlessly, without the intervention of a software engineer.
The design and development of a large-scale service-based cloud application involve the consideration of a significant number of aspects, such as: the cloud deployment model, interoperability, quality of service along with the definition of service level agreements (SLAs), the ability to manage and migrate data in a feasible manner, and the metering and billing of the cloud application (Rimal, Jukan, Katsaros, & Goeleven, 2010). The current research article focuses on a particular aspect of the design cycle, which involves enabling the application to leverage multiple platform basic services provisioned by heterogeneous service providers in a transparent way.
The heterogeneity mainly arises due to: (i) the differences in the workflow for the execution of the operations of the services, (ii) the differences in the exposed web APIs and (iii) the various required configuration settings and authentication tokens. The significant number of services that an application may consist of makes the integration and management of the services a strenuous process. The mission of this article is to propose a development framework assisting in the design and execution of service-based cloud applications by alleviating the three aforementioned variability issues. Specifically, the target of the proposed framework is twofold: (i) to provide the tools and the methodology to enable the consistent modelling and integration of different categories of platform basic services with the cloud application; (ii) to enable the applications to deploy concrete service providers in a manner that is seamless, transparent and agnostic to the software engineer.
The rest of the article is structured as follows. Next Section provides background information and related work on the field. Thereafter the
motivating example, which is used for the rest of the article, is stated. Subsequently, the main issues, that the article is trying to address, are listed
and the high-level architecture of the framework and the components involved in the development process are described. Then, the article
focuses on the technical design of the proposed framework and describes how the latter can be used to enable the integration of additional
services and providers. Finally, the article concludes with certain limitations of the framework as well as with proposals for future research
direction.
BACKGROUND AND RELATED WORK
The emergence of service-based cloud applications, as a synthesis of various platform basic services, promises to ease and speed up the process of cloud application creation. Applications do not need to be developed from scratch but can rather be constructed using, where appropriate, various platform services, thus rapidly increasing productivity. Consequently, the burden of studying the various platform basic services and selecting the one(s) best suited for the task at hand is removed. The software engineer has access in a transparent manner to all platform basic services, and the selected services are seamlessly incorporated in the service-based cloud application. However, in order for this scenario to be realised, two conditions should be met: (i) the application should be deployable seamlessly on various cloud application platforms and (ii) the variability among the platform basic services should be alleviated.
With respect to the second condition, which as mentioned in the Introduction is the focus of this article, a number of challenges arise. The first challenge arises from the fact that there exist multiple offerings of a particular service, e.g. the mail service, since the services are offered by many different providers. The second challenge arises from the need to provide a framework that spans a number of different kinds of services, i.e. mail services, payment services, message queue services and so on. At the same time, there is a lack of tools and Integrated Development Environments addressing the issue of proprietary technologies and APIs (Guillen, Miranda, Murillo, & Cana, 2013a).
The constant increase in the offering of platform basic services has resulted in a growing interest in leveraging services from multiple clouds. Significant work has been carried out in the field. Petcu and Vasilakos (2014) summarise several library-based solutions which expose a uniform API and can abstract cloud resources such as data storage, compute and message queue services from various providers. Furthermore, several standardisation approaches are listed which aim at the definition of standards enabling uniform access to cloud resources. Model-Driven Engineering (MDE) techniques are also proposed as an enabling methodology for the creation of cloud applications leveraging services from multiple cloud providers. Guillen et al. (2013) also put forward the idea of model-driven development of cloud applications which are platform agnostic. Moreover, middleware-based solutions are listed which aim at managing the deployment and execution phases of the cloud application. This article lists representative work from the fields mentioned above, namely: a) standardisation approaches, b) library-based solutions, c) model-driven engineering techniques, and d) middleware platforms.
The Open Virtualization Format (OVF) (DMTF, 2013) is a specification for packaging and distributing Virtual Machines (VMs), defined by the Distributed Management Task Force (DMTF) (http://www.dmtf.org/). The architecture of the format is not bound to a particular platform or operating system and thus enables virtual machines to be deployed on different cloud infrastructure providers. The Open Cloud Computing Interface (OCCI) (http://occi-wg.org/), created by the Open Grid Forum (OGF) (https://www.ogf.org/ogf/doku.php), and the Cloud Infrastructure Management Interface (CIMI) (DMTF, 2012), created by DMTF, both attempt to standardize the way users access and manage infrastructure resources and therefore to unify the various proprietary APIs that vendors are currently using. They particularly focus on compute, storage and network resources. The Cloud Data Management Interface (CDMI) (SNIA, 2014) is a cloud storage standard defined by the Storage Networking Industry Association (SNIA) (http://www.snia.org/). CDMI attempts to standardize the way users access and manage cloud storage services offered by storage providers such as Google Storage, Amazon Simple Storage Service (S3) and Windows Azure Storage. The Topology and Orchestration Specification for Cloud Applications (TOSCA) (Binz, Breiter, Leymann, & Spatzier, 2012) is a standardization effort from OASIS (Advancing Open Standards for the Information Society) (https://www.oasis-open.org/) aiming at defining a common representation of the cloud services that a cloud application uses, such as application servers and databases. In this way TOSCA attempts to automate the deployment process of a cloud application on multiple cloud platforms. While standardisation is an efficient approach to leveraging services from multiple cloud environments, as Petcu (2014) and Opara-Martins et al. (2014) highlight, standardisation efforts are still at an immature level and mainly focus on the IaaS level. This is mainly due to the reluctance of the cloud providers to agree on standardised interfaces and specifications, as this would lead to increased direct competition with other providers (Singhal et al., 2013).
Library-based solutions such as jClouds (http://www.jclouds.org), written in Java, and LibCloud (https://libcloud.apache.org/index.html), written in Python, provide an abstraction layer for accessing specific cloud resources such as compute, storage and message queues. While library-based approaches efficiently abstract those resources, they have a limited application scope, which makes it difficult to reuse them for accommodating additional services.
Middleware platforms constitute middle layers which decouple applications from being directly exposed to proprietary technologies and deployed on specific platforms. Rather, cloud applications are deployed and managed by the middleware platform, which has the capacity to exploit multiple cloud platform environments. mOSAIC (Petcu, 2014) is such a PaaS solution, which facilitates the design and execution of scalable component-based applications in a multi-cloud environment. mOSAIC offers an open-source API that enables applications to use common cloud resources offered by the target environment, such as virtual machines, key-value stores and message queues. OpenTOSCA (Binz et al., 2013) is a runtime environment enabling the execution of TOSCA-based cloud applications. TOSCA (Binz et al., 2012) is a specification which enables the description of the deployment topology of a cloud application in a platform-independent way. Thus, applications are agnostic with regard to the concrete platform provider resources they use. Both mOSAIC and OpenTOSCA require that applications be developed using their specific technologies and thus impose a restriction in case applications need to leverage platform providers which are not supported by those environments.
Initiatives that leverage MDE techniques present meta-models or Domain Specific Languages (DSLs) which can be used for the creation of cloud-platform-independent applications. The notion in this case is that cloud applications are designed in a platform-independent manner and specific technologies are only infused into the models at the final stage of development. MODAClouds (Ardagna et al., 2012) and PaaSage (Jeffery, Horn, & Schubert, 2013) are both FP7 initiatives aiming at cross-deployment of cloud applications. Additionally, they offer monitoring and quality assurance capabilities. They are based on CloudML (Ferry et al., 2013), a modelling language which provides the building blocks for creating applications deployable in multiple IaaS and PaaS environments. Hamdaqa et al. (2011) have proposed a reference model for developing applications which leverage the elasticity capability of the cloud infrastructure. Cloud applications are composed of CloudTasks, which provide compute, storage, communication and management capabilities. MULTICLAPP (Guillen et al., 2013b) is a framework leveraging MDE techniques during the software development process. Cloud artefacts are the main components that the application consists of. A transformation mechanism is used to generate the platform-specific project structure and map the cloud artefacts onto the target platform. Additional adapters are generated each time to map the application's API to the respective platform's resources. MobiCloud (Ranabahu, Maximilien, Sheth, & Thirunarayan, 2013) is a DSL enabling the design of cloud-mobile hybrid applications which are platform independent. The application is initially designed using the MobiCloud DSL and subsequently specific code generators are used to generate the executable program for each specific cloud platform.
The solutions listed in this Section focus mainly on the cross-deployment of applications by eliminating the technical restrictions that each platform imposes. However, they do not support the use of additional platform basic services offered via web APIs, such as payment, authentication and e-mail services. In addition, the client adapters used to address the variability in the providers’ APIs are hardcoded and thus not directly reconfigurable in case they need to be updated. In contrast, the vision of the authors is to facilitate the use of platform basic services from heterogeneous clouds in a seamless manner. To this end, the proposed solution attempts to alleviate the three variability points described in the Introduction, namely: the differences in the workflow modelling, in the providers’ web APIs and in the configuration settings.
MOTIVATING SCENARIO
Electronic payment transactions have gained widespread acceptance in many domains of business, and this trend is upward considering the diffusion of mobile and contactless payments. Forrester Research Inc., a major independent research company, predicts that US mobile payments growth will accelerate, reaching $90B by the year 2017, a 48% compound annual growth rate from the $12.8B spent in 2012 (Forrester Research Inc., 2012a; Forrester Research Inc., 2013). At the same time, mobile commerce in the EU is following an exponential growth, from €1.7B in 2011 up to €19.2B by 2017 (Forrester Research Inc., 2012b). It therefore becomes obvious that payment service providers are becoming an essential part of service-based cloud applications.
The payment service enables a website or an application to accept online payments via electronic cards such as credit or debit cards, by intermediating between the application and the bank. Figure 1 shows a simplified view of the payment process. The added value that such a service offers is that it relieves the developers from handling electronic payments and keeping track of the transactions. Moreover, applications which perform billing transactions need to be compliant with the Payment Card Industry Data Security Standard (PCI-DSS) in order to maximise their reliability. Acquiring the compliance may be a time-consuming and costly process, which may be skipped with the use of a cloud payment service.
There are a large number of available payment service providers which can be deployed by service-based cloud applications in order to process payments. Those providers offer different features and may vary in pricing, quality of service etc. Large enterprises such as Amazon and Google, operating globally, may require a variable number of payment providers, depending each time on certain criteria such as the geographical region and the rate of payment requests. Therefore, different service providers may at different times better serve the needs of the cloud applications. For example, in case of a sudden peak in the number of transactions in a specific geographical region, the currently deployed provider may become unresponsive. Thus the application should be able to predict the failure and deploy a different provider to continue its operation seamlessly.
Figure 2 shows such a scenario, where Big Data contribute to the selection of the payment provider. Data about payment requests are collected globally from a variety of sources. The collected data may exhibit heterogeneity in format, since they are generated by diverse sources such as mobile devices, social networks and payment terminals. Moreover, the high velocity and volume of the generated data mean that traditional data analysis techniques involving relational databases are not sufficient to cope with them. The collected data may indicate a rapid rise in the demand for payment requests in a particular geographical region. Therefore, Big Data analysis techniques may be deployed to provide a timely prediction of the failure event. Subsequently, the service-based cloud application should be capable of choosing a different provider seamlessly in order to continue its operations in an undisrupted way.
Figure 2. Big Data as enabler for choosing the optimal service provider
The payment service providers are only one example of providers, which can be deployed based on the analysis of Big Data. The same rationale
applies when deploying providers offering additional services such as e-mail and message queuing.
The feasibility of the proposed scenario depends on two aspects: (i) The analysis of the collected data through Big Data approaches and the
subsequent recommendation of the best alternative service provider, (ii) the ability of the service-based cloud application to seamlessly deploy an
alternative payment service provider so that it continues its operations without disruptions.
As mentioned in the Introduction, the scope of this article is to propose a development framework that copes with the second aspect, namely the design of service-based cloud applications which are not bound to specific service providers. In order to illustrate how the framework can be used in a real-case scenario, the cloud payment service is used. This platform service has been chosen because of its inherent relative complexity compared to other services such as the e-mail or message queue service. The complexity lies in the fact that the purchase transaction requires more than one step to be completed and there is significant heterogeneity among the available payment providers with respect to the involved steps. The next Sections state the concrete issues that the article aims to address and subsequently describe the solution approach.
Variability Issues
Preliminary work of the authors on several platform service providers (Gonidis, 2013) offered by Heroku (http://heroku.com), Google App Engine (https://developers.google.com/appengine) and the AWS marketplace (https://aws.amazon.com/marketplace) has shown that the following three variability points need to be addressed in order to decouple application development from vendor specific implementations:
1. Differences in the Workflow: Stateful services require more than one state in order to complete an operation (Pautasso,
Zimmermann, & Leymann, 2008). Such an example is the payment service that enables developers to accept payments through their
applications. The process involves two states: (i) waiting for client’s purchase request and (ii) submitting the request to the payment
provider. However, depending on the concrete payment provider there may be variations in the states involved. Therefore, a coordination
mechanism is required to handle the operation flow and additionally to alleviate the differences among the various concrete
implementations.
2. Differences in the Web API: There are several platform providers implementing a given platform service and its operations. However, they expose diverse APIs, resulting in conflicts when an application developer attempts to integrate with one or another. Consider, as an example, the e-mail service and two of its providers: the Amazon Simple E-mail Service (SES) (aws.amazon.com/ses/) and SendGrid (https://sendgrid.com/), an add-on mail service offered via the Heroku application platform. Upon a request to send an e-mail, the minimum set of parameters required by Amazon SES comprises the following four: (i) Source, (ii) Destination.ToAddresses, (iii) Message.Subject and (iv) Message.Body.Text. In the case of SendGrid the anticipated parameters are: (i) from, (ii) to, (iii) subject and (iv) text. A sketch of how such differences can be hidden behind a single reference call is given after this list.
3. Differences in the Configuration Settings: As noted in the Introduction, each concrete provider requires its own configuration settings and authentication tokens.
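To make the API heterogeneity of point 2 concrete, the following sketch shows how one reference call could be translated into the two providers' parameter names listed above. It is only an illustration in plain Java under stated assumptions: the interface and class names are hypothetical and are not part of the framework or of the providers' SDKs.
import java.util.LinkedHashMap;
import java.util.Map;
// Reference e-mail operation, independent of the concrete provider.
interface ReferenceMailApi {
    Map<String, String> buildSendRequest(String source, String to, String subject, String body);
}
// Translation to the parameter names listed above for Amazon SES.
class AmazonSesMapping implements ReferenceMailApi {
    public Map<String, String> buildSendRequest(String source, String to, String subject, String body) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("Source", source);
        p.put("Destination.ToAddresses", to);
        p.put("Message.Subject", subject);
        p.put("Message.Body.Text", body);
        return p;
    }
}
// Translation to the parameter names listed above for SendGrid.
class SendGridMapping implements ReferenceMailApi {
    public Map<String, String> buildSendRequest(String source, String to, String subject, String body) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("from", source);
        p.put("to", to);
        p.put("subject", subject);
        p.put("text", body);
        return p;
    }
}
An application written against ReferenceMailApi is unaffected by which of the two mappings is active, which is precisely the decoupling the framework aims to provide in a systematic way.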
As has become clear, a cloud application may interact with several platform basic services in various ways. In order to enable the consistent modelling and integration of the services, as well as the decoupling from vendor specific implementations, a development framework is proposed.
HIGH LEVEL OVERVIEW OF THE FRAMEWORK
This Section describes the high-level architecture of the framework. In particular, it focuses on the components of the framework and the process required to add a new platform basic service or provider. There are two distinct users (roles): the administrator and the developer. The former uses the components in order to enrich the framework with additional services and providers, while the latter makes use of the components in order to integrate platform services with the cloud application.
Figure 3 illustrates the components that constitute the development framework. The components are split into two categories, highlighted by the use of two colours. Those highlighted in orange are provided by the framework and are used by the administrator. Those highlighted in blue are the platform service components, produced by the administrator using the framework components.
As can be observed in Figure 3, the process of adding a new platform service and provider to the framework can be divided into the following two parts:
1. Platform Service Workflow Modelling: As explained in the previous Section, certain platform services, such as the authentication and the payment service, require more than one step to complete an operation. Thus, the states that are involved in the execution of an operation shall be defined and modelled in a way that enables the framework to handle the workflow automatically.
2. Platform Service API Description: One of the main objectives of the framework is to provide the developers with a single API for each platform service, independent of the concrete provider. Therefore, this part involves the definition of the reference API, the description of the web API of each concrete provider supported by the framework, and the subsequent mapping of the provider specific web API to the reference one.
Each of the two parts of the development process involves the three following phases: (a) Platform service modelling phase, (b) Vendor
implementation phase, (c) Execution phase.
In the next subsections, for each of the two parts, the three phases are introduced and the high-level components involved, as depicted in Figure
3, are described.
Platform Service Workflow Modelling
1. Platform Service Modelling Phase: During this phase the abstract states of each platform basic service are described. The following
components are involved in this phase:
a. Reference Meta-Model: The Reference Meta-Model contains the concepts required to model the states of the platform service.
b. Platform Service Connector: The Platform Service Connector (PSC) is the abstract representation of the platform service
functionality and hides the specific implementation of the concrete service providers. It contains the states that are involved in each
operation provided by the service. It is generated by the administrator of the framework using the concepts of the Reference Meta-
Model. The PSC is used by the consumer of the framework to obtain access to the functionality of the service.
2. Vendor Implementation Phase: Based on the abstract model defined in the previous phase, the vendor specific implementation is
infused. Specifically, the workflow required by each provider is mapped to the abstract one defined for the particular service.
a. Provider Connector: The Provider Connector (PC) is the module which contains the specific implementation of the concrete
service providers. It is constructed by the administrator of the framework based on the PSC which is built during the modelling phase.
3. Execution Phase: The execution phase takes place at run-time and coordinates the execution of the states involved in an operation
offered by a service.
Platform Service API Description
1. Platform Service Modelling Phase: As mentioned earlier, the framework shall be capable of addressing the variability in the provider specific web APIs by enabling the definition of a reference API. One reference API is defined, by the administrator of the framework, for each type of platform basic service supported by the framework. It contains the set of operations offered by the specific service.
a. Service API Description Editor: The Service API Description Editor is used to define the reference API. It is implemented as an Eclipse plug-in and includes a user interface which is used by the administrator of the framework.
b. Platform Service Reference API: The Platform Service Reference API is constructed, by the administrator, for each platform service and is accessible by the consumer of the framework. Its role is to relieve the consumer from studying the various service providers' APIs. Instead, the consumer accesses all the supported providers via the Reference API.
c. Template API Repository: The Template API Repository contains the collection of the platform service reference APIs which have been defined using the Service API Description Editor.
2. Vendor Implementation Phase: During this phase the specific web API of each of the platform service providers supported by the framework is described and mapped to the Reference API.
a. Provider Specific API: The Provider Specific API holds the description of the concrete service provider's API and the subsequent mapping to the Platform Service Reference API.
3. Execution Phase: During the Execution Phase the web clients required for the application to connect to the concrete service providers are generated. The web clients are source code which implements the HTTP requests and responses. Additionally, the services that are consumed by the applications are registered.
a. API Client Generator: The API Client Generator, as the name implies, is responsible for the generation of the web clients for each concrete service provider. It receives the Platform Service Reference API and the Provider Specific API and produces a Java library which can be used by the consumer in order to connect to the concrete service provider.
b. Platform Service Registry: This component is a registry of all the platform services that the application uses. Its role is to keep track of the consumed services and to provide an easy way for the software developer to deploy and release services.
TECHNICAL DESIGN
In this Section the technical design of the development framework is described. In particular, the Section analyses each of the components mentioned in the high-level architecture and explains its contribution to the framework. For that reason the narrative flow used in the previous Section is followed. The overall process of adding a new platform basic service and service provider using the framework is divided into two parts: (i) Platform Service Workflow Modelling, and (ii) Platform Service API Description. In each part the following phases are involved: (a) Platform service modelling phase, (b) Vendor implementation phase, (c) Execution phase. The activities which are required in each phase are shown in Figure 4.
Figure 4. Activity Diagram of the Development Framework
Action 1 involves the study of the concrete payment service providers and the extraction of the common states in which they may exist. For that reason, 9 major payment service providers have been studied (Gonidis, 2013), provisioned either via a major cloud platform such as Google App Engine and Amazon AWS or via platform service marketplaces such as Heroku add-ons and Engineyard add-ons. These providers can be grouped into three main categories. An exhaustive listing of the characteristics of each payment provider is out of the scope of this article. Rather, the article focuses on demonstrating how concrete providers can be mapped onto the abstract model. Therefore, the case of one category, the “transparent redirect”, is presented. Spreedly (https://spreedly.com/), a payment provider offered via the Heroku platform, is used as the concrete service provider.
Transparent redirect is a technique deployed by certain payment providers in which, during a purchase transaction, the client's card details are redirected to the provider, who subsequently notifies the cloud application about the outcome of the transaction.
Figure 5 describes the steps involved in completing a payment transaction, while Figure 6 shows the state diagram of the cloud application throughout the transaction. Two states are observed. While the cloud application remains in the first state, it waits for a payment request. Once the client requests a new payment, the cloud application should display the fill-out form where the user enters the payment details.
The next Section describes how the abstract states depicted in Figure 6 can be modelled utilising the components of the framework and how the concrete provider is mapped to the abstract model.
Platform Service Workflow Modelling
Platform Service Modelling Phase
This phase involves the modelling of the abstract functionality of the platform basic service. Specifically, the states and the workflow required to
complete an operation are captured. For that reason, the Reference Meta-Model depicted in Figure 7 is defined.
1. CloudAction: CloudActions are used to model the communication with platform basic services which require more than one step in order to complete an operation. The whole process required to complete the operation can be modelled as a state machine. Each step in the process can be modelled as a concrete state that the platform service can exist in. For each state a CloudAction is defined. When an event arrives, the appropriate CloudAction is triggered to handle the event and subsequently causes the transition to the next state. The events in this case are the incoming requests arriving either from the application user or from the service provider (a minimal code sketch of a CloudAction is given after this list).
2. CloudMessage: CloudMessages can be used to perform requests from the cloud application towards the service provider using the web API of the latter. The API usually conforms to the REST principles (Fielding, 2001). CloudMessages can be used either in stateless services, where the operation is completed in one step, or within CloudActions when the latter are required to submit a request to the service provider.
3. PlatformServiceStates: The PlatformServiceStates is an XML file which holds information about the states involved in an operation
and the corresponding CloudActions which are initialised to execute the behaviour required in each state.
4. ConfigurationData: Certain configuration settings are required by each platform service provider. Examples of settings which need to be defined are the client credentials required to perform web requests and the authentication tokens.
5. Platform Service: As shown by the types of the relationships in Figure 7, the platform service component is composed of all the previously mentioned concepts.
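As an illustration of how a CloudAction might look in code, the following minimal sketch assumes a simple ICloudAction interface built on the Servlet request and response objects mentioned later in the Execution Phase. The FillOutFormAction class name comes from the state description file shown further below; the method signature and the form markup are assumptions, not the framework's actual source.
import java.io.IOException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
// Assumed contract: every CloudAction handles the event of one state.
interface ICloudAction {
    void handle(HttpServletRequest request, HttpServletResponse response) throws IOException;
}
// CloudAction serving the "PaymentForm" state of the payment service: it answers
// the purchase request with the fill-out form where the client enters the card details.
class FillOutFormAction implements ICloudAction {
    @Override
    public void handle(HttpServletRequest request, HttpServletResponse response) throws IOException {
        response.setContentType("text/html");
        response.getWriter().println(
                "<form action=\"/payment/submit\" method=\"post\">"
              + "  <input name=\"card_number\"/> <input name=\"amount\"/>"
              + "  <button type=\"submit\">Pay</button>"
              + "</form>");
        // The transition to the next state ("SendTransaction") is driven by the
        // PlatformServiceStates file, not by the action itself.
    }
}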
The motivation for the definition of the separate concepts of the CloudActions and the CloudMessages stems from the basic software design principles of modularisation, separation of concerns and reusability (Hürsch & Lopes, 1995; Poulin, 1994). Separation of concerns ensures that a software application is composed of distinct units, each one addressing a specific issue. In turn, software modularisation is enabled, which further improves the maintainability of the software. Reusability allows certain pieces of source code to be reused within the software application, thereby improving productivity. In the framework design, CloudActions are responsible for defining a template for serving the incoming requests, while CloudMessages implement a specific web request to the service providers and can be reused by different CloudActions, as sketched below.
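The snippet below is a rough sketch of such a reusable CloudMessage using the JDK's HttpClient; the endpoint, payload format and authentication scheme are placeholders and do not reflect any provider's real API.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;
// Reusable CloudMessage performing a single REST call towards a payment provider.
class ChargeCardMessage {
    private final HttpClient client = HttpClient.newHttpClient();
    // Sends the charge request and returns the provider's raw response body, so
    // that different CloudActions can reuse the same message.
    public String send(String endpoint, String apiKey, String xmlPayload) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Authorization",
                        "Basic " + Base64.getEncoder().encodeToString(apiKey.getBytes()))
                .header("Content-Type", "application/xml")
                .POST(HttpRequest.BodyPublishers.ofString(xmlPayload))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}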
The reference meta-model is used to construct the Platform Service Connector (PSC). The PSC, as mentioned in the previous Section, is a model which defines the states and the workflow required to complete an operation of a platform service. It is constructed based on two rules, which follow from the definitions of the CloudActions and the CloudMessages given earlier: the former handle incoming requests, whereas the latter perform web requests to the service providers.
Rule 1: For each state where the application waits for an external request (either from the user of the application or the service provider),
a CloudAction is defined to handle the request.
Rule 2: For each request initiated by the cloud application towards the service provider, a CloudMessage is defined.
In the case of the Cloud Payment Service, Figure 8 shows the Cloud Payment Service Connector. It is constructed based on the state diagram defined in Figure 6 and using the reference meta-model. It consists of the following blocks:
1. FilloutForm: The FilloutForm is a CloudAction which receives the request for a new purchase transaction and responds to the client with the fill-out form in order for the latter to enter the card details.
5. PaymentServiceStates: In the PaymentServiceStates file the states and the corresponding actions involved in the transaction are defined. The file is used by the framework to guide the execution of the actions. A part of the description file is shown below. Two states are defined. For each state, the concrete path to the CloudAction responsible for handling the event, as well as the next state, is described.
<StateMachine>
  <State name="PaymentForm"
         action="org.paymentserviceframework.FillOutFormAction"
         nextState="SendTransaction"/>
  <State name="SendTransaction"
         action="org.paymentserviceframework.SendTransactionAction"
         nextState="Finish"/>
</StateMachine>
At this point the Cloud Payment Service Connector (PSC) does not contain any provider specific information. Therefore any payment service
provider which adheres to the specified model can be accommodated by the abstract model.
Vendor Implementation Phase
After having defined the PSC, the specific implementation and settings of each concrete provider needs to be infused (Action 3a of Figure 4). For
each CloudAction and CloudMessage defined in the PSC, the respective provider specific blocks should be defined forming the Provider
Connector (PC).
In the case of the payment service example, the Cloud Payment Provider Connector for the Spreedly provider is shown in the lower part of Figure 9. It contains the following blocks:
Figure 9. Cloud Payment Service Provider Connector
4. ConfigurationData: This file needs to be updated accordingly in order to match the specific provider.
Execution Phase
During the execution phase the PSC and the PC, constructed in the previous phases, are managed by the Platform Service Execution Controller (PSEC), as shown in Figure 10. The PSEC automates the execution of the workflow required to complete an operation. It consists of the following main components, shown in the upper part of Figure 10.
Figure 10. Platform Service Execution Controller (PSEC)
1. Front Controller: The Front Controller (Hunter & Crawford, 2001) serves as the entry point to the framework. It receives the incoming requests from the application user and the service provider.
2. Dispatcher: The Dispatcher (Alur, 2001) follows the well-known request-dispatcher design pattern. It is responsible for receiving the incoming requests from the Front Controller and forwarding them to the appropriate handler, through the ICloudAction interface which is explained below. Since the requests are handled by the CloudActions, the Dispatcher forwards each request to the appropriate CloudAction. In order to do so, it gains access to the platform service states description file and, based on the current state, triggers the corresponding action (a sketch of this mechanism is given after this list).
3. ICloudAction: ICloudAction is the interface which is present in the framework at design time and of which the Dispatcher has knowledge. Every CloudAction implements the ICloudAction interface. This facilitates the instantiation of new CloudActions at run-time through reflection.
4. Communication Patterns: Two types of communication pattern are supported by the framework. The first one is the Servlets, and particularly the Http Servlet Request and Response objects (Hunter & Crawford, 2001), which are used by the CloudActions in order to handle incoming requests and respond back to the caller. The second type of communication is via the use of REST, which enables the CloudMessages to perform external requests to the service providers.
5. Platform Service Registry: The Platform Service Registry, as the name implies, keeps track of the services that the cloud application
consumes. Every service which is used by the application is listed in the Platform Service Registry.
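To illustrate how the Dispatcher, the state description file and ICloudAction could fit together, the following sketch parses the PlatformServiceStates XML shown earlier and instantiates the configured CloudAction reflectively. It is a deliberately simplified assumption about the PSEC, not its actual implementation; the SimpleDispatcher class name is hypothetical and the ICloudAction signature matches the earlier sketch.
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
interface ICloudAction {   // assumed contract, as in the earlier sketch
    void handle(HttpServletRequest request, HttpServletResponse response) throws IOException;
}
public class SimpleDispatcher {
    // state name -> { fully qualified CloudAction class, next state }
    private final Map<String, String[]> states = new HashMap<>();
    private String currentState;
    // Loads the <State name=".." action=".." nextState=".."/> entries from the file.
    public SimpleDispatcher(File platformServiceStates, String initialState) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(platformServiceStates).getDocumentElement();
        NodeList nodes = root.getElementsByTagName("State");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            states.put(e.getAttribute("name"),
                    new String[] { e.getAttribute("action"), e.getAttribute("nextState") });
        }
        this.currentState = initialState;
    }
    // Forwards an incoming request to the CloudAction registered for the current
    // state (instantiated through reflection via ICloudAction), then advances.
    public void dispatch(HttpServletRequest request, HttpServletResponse response) throws Exception {
        String[] entry = states.get(currentState);
        ICloudAction action = (ICloudAction) Class.forName(entry[0])
                .getDeclaredConstructor().newInstance();
        action.handle(request, response);
        currentState = entry[1];
    }
}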
Platform Service API Description
The second part of the process of adding a platform service and providers to the framework constitutes the description of the web API. The second variability point among platform services is the different web APIs that the concrete providers expose. Therefore, the heterogeneity of the web APIs shall be captured by the framework and abstracted behind a common reference API exposed to the application developers.
In order to enable the uniform description of the platform services' APIs, the benefits of ontologies are exploited. According to Gruber (1993), ontologies are formal knowledge over a shared domain that is standardised or commonly accepted by a certain group of people. The advantages here are two-fold. First, ontologies allow us to define clearly the domain model of interest; in our case the domain model is the platform service providers' web API. The fact that an ontology can be a shared and commonly accepted description of a platform service contributes towards the homogenisation of the latter. The platform vendors can adhere to and publish the description of their service based on the common and shared ontology.
Moreover, ontologies can be reused and expanded if necessary. Thus, an ontology describing a platform service may not be constructed from the
ground up but may be based on an existing one. The intention of the authors is to reuse and expand the Linked USDL (Pedrinaci, Cardoso, &
Leidig, 2014) ontology and particularly the extended Minimal Service Model (MSM) as described in the work of Ning et al. (2011). To the best of
our knowledge and according to Ning et al. (2011) the MSM is the richest description model capable of capturing the web API and enabling
automatic invocation.
The platform service API description is based on a hierarchy of three levels of ontologies, as shown in Figure 11. Inspiration has been drawn from the Meta-Object Facility (MOF) standard (Gardner, Griffin, Koehler, & Hauser, 2003) defined for the Model Driven Engineering domain. Specifically, the hierarchy of the ontologies resembles the bottom three levels of the MOF structure, namely the meta-models, the models and the instances of the models.
The level 2 ontology (O2) includes the concepts required to describe a web API. Such concepts are the operations offered by the service providers, the parameters and the endpoint of each operation, etc. The level 1 ontologies (O1) include the concrete description of each of the platform services which are supported by the framework. A dedicated ontology corresponds to each of the platform services and captures information about the functionality that each of the services exposes. The ontologies at the O1 level are also referred to as Template ontologies. The level 0 ontologies (O0) include the descriptions of the specific platform service providers. A dedicated ontology corresponds to each of the providers and describes its native web API. The ontologies at the O0 level are also referred to as Instance ontologies.
Across the three phases, this Section describes how the ontological service descriptions are formed and used to automatically generate the web clients.
Platform Service Modelling Phase
During this phase, the platform service reference API is defined. The reference API is exposed to the application developers and describes the operations offered by the particular service. The reference API is captured in the Template ontology.
Figure 12 shows a snapshot of the Template ontology for the payment service, which describes the operation for charging a card. For the sake of simplicity, only the necessary information has been included. The name of the operation is “ChargeCard”. It is a subclass of the class “Operation”. “Operation” is defined in the Abstract platform service ontology (O2 level) and includes all the operations offered by the service. Figure 12 also includes the following three elements: “CardIdentifier”, which denotes the card to be charged; “ChargedAmount”, which refers to the amount of money to be charged during the specific transaction; and “CurrencyCode”, which refers to the currency to be used for the specific transaction. All three elements are subclasses of the class “Attribute”. The class “Attribute” is defined in the Abstract platform service ontology and includes all the attributes which are used for the execution of the operations. An attribute is linked to a specific operation with a property. Specifically, the three aforementioned attributes are linked to the “ChargeCard” operation with the following properties respectively: “hasCardIdentifier”, “hasChargedAmount”, “hasCurrencyCode”.
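As a rough indication of how such a Template ontology could be built programmatically, the sketch below uses Apache Jena; the article does not prescribe any particular RDF toolkit, and the namespace URI is a placeholder.
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDFS;
public class PaymentTemplateOntology {
    public static Model build() {
        String ns = "http://example.org/paymentservice#";   // placeholder namespace
        Model m = ModelFactory.createDefaultModel();
        // Abstract concepts defined at the O2 level.
        Resource operation = m.createResource(ns + "Operation");
        Resource attribute = m.createResource(ns + "Attribute");
        // "ChargeCard" operation and its attributes (O1 / Template level).
        Resource chargeCard = m.createResource(ns + "ChargeCard")
                .addProperty(RDFS.subClassOf, operation);
        Resource cardIdentifier = m.createResource(ns + "CardIdentifier")
                .addProperty(RDFS.subClassOf, attribute);
        Resource chargedAmount = m.createResource(ns + "ChargedAmount")
                .addProperty(RDFS.subClassOf, attribute);
        Resource currencyCode = m.createResource(ns + "CurrencyCode")
                .addProperty(RDFS.subClassOf, attribute);
        // Properties linking the attributes to the operation.
        chargeCard.addProperty(m.createProperty(ns, "hasCardIdentifier"), cardIdentifier);
        chargeCard.addProperty(m.createProperty(ns, "hasChargedAmount"), chargedAmount);
        chargeCard.addProperty(m.createProperty(ns, "hasCurrencyCode"), currencyCode);
        return m;
    }
}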
Vendor Implementation Phase
In this phase the provider specific web API is described and mapped to the reference API. The Service API Description Editor is used to perform the mapping. The outcome is an Instance ontology (O0 level) for each concrete provider.
Figure 13 depicts two Instance ontologies which correspond to two payment service providers offered by Heroku and Amazon respectively.
Figure 13. Example of Instance ontology for the Stripe (left) and Spreedly (right) service provider
In particular, Figure 13 shows the description of the charge operation as defined in the API of the Spreedly service offered via the Heroku platform. Individuals are created to express each of the specific elements of the provider's API. An Individual, in the field of ontologies, can be considered as an instance of a class. Specifically, the “purchase” Individual denotes the operation name, which is equivalent to the “ChargeCard” operation of the Template ontology. This justifies the fact that the “purchase” Individual is of type “ChargeCard”. The Individual “amount” denotes the amount to be charged during the transaction and is equivalent to the “ChargedAmount” attribute. Thus it is defined as being of type “ChargedAmount”. Likewise, the Individual “currency_code” is of type “CurrencyCode” and the “payment_method_token”, which identifies the card to be charged, is of type “CardIdentifier”.
In the same way, an Instance ontology is created (Figure 13) to describe the API of the “Stripe” payment service provider offered via Amazon. The Individual “create” denotes the creation of a charge and is equivalent to the “ChargeCard” operation; therefore it is of type “ChargeCard”. Likewise, the Individuals “amount”, “currency” and “card” are of type “ChargedAmount”, “CurrencyCode” and “CardIdentifier” respectively.
In the same way, the rest of the functionality of a platform service can be described. At the same time, the differences in the APIs between the various providers can be captured. The proposed structure of three levels of ontologies can be used to describe the web APIs of additional platform services, such as authentication and message queue services. Initially, a Template ontology is formed to describe the functionality of each of the platform services. Subsequently, the Instance ontologies are created to capture the vendor-specific web APIs.
Execution Phase
During the Execution Phase, the Platform Service Reference and the provider specific API descriptions, which correspond to the Template and
the Instance ontologies respectively, are fed to the API Client Generator (Figure 14). The API Client Generator generates the client code for the
web API invocation of each of the concrete providers which implement the platform service.
Figure 14. Code Generation Process
The process of the code generation is depicted in Figure 14. The API client generator accepts as input the following:
1. The Template and Instance ontologies, which contain the description of the reference API and the mapping of the providers’ specific API
to the reference one.
2. The Template Files. These files contain the source code which is common among the generated classes, also known as boilerplate code.
The Ontology Handler parses the ontologies, creates an object representation and forwards it to the code generator. The code generator reads the Template files and fills in the missing information according to the service description obtained from the Ontology Handler. Subsequently, the
following Java Classes are generated:
1. A set of Java Interfaces which give access to the platform basic services. One Interface is generated for each service supported by the
framework. It contains the operations provided by the services and the reference API as described in the Service Description File.
2. A set of Java Classes which give access to the provider implementations. For each concrete service provider which is supported by the framework, a Java Class is generated which implements the service Interface. It essentially includes the provider's information (URL, credentials, configuration settings) and the concrete parameters as those are specified in the web API.
Therefore, the software engineers can develop their applications against the Service interfaces, which are produced by the API Client Generator, and gain access to the supported providers without having to know the underlying provider-specific implementations.
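As an illustration, the following sketch shows the kind of Java artifacts the API Client Generator could emit for the payment service. The class names, endpoint path, authentication header and JSON payload are assumptions made for this example; they do not reflect the actual generated code or the real provider web API.

import java.math.BigDecimal;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Reference API: one interface per platform service, derived from the Template ontology.
interface PaymentService {
    String chargeCard(String cardIdentifier, BigDecimal chargedAmount, String currencyCode);
}

// Provider adapter: one class per concrete provider, derived from its Instance ontology.
// It maps the reference attributes onto the provider's native fields and endpoint.
class SpreedlyPaymentService implements PaymentService {

    private final HttpClient http = HttpClient.newHttpClient();
    private final String endpoint;   // provider URL taken from the Instance ontology
    private final String apiToken;   // provider credentials / configuration settings

    SpreedlyPaymentService(String endpoint, String apiToken) {
        this.endpoint = endpoint;
        this.apiToken = apiToken;
    }

    @Override
    public String chargeCard(String cardIdentifier, BigDecimal chargedAmount, String currencyCode) {
        // "purchase" is the provider operation mapped to ChargeCard; the three reference
        // attributes are renamed to payment_method_token, amount and currency_code.
        String body = String.format(
                "{\"payment_method_token\":\"%s\",\"amount\":%s,\"currency_code\":\"%s\"}",
                cardIdentifier, chargedAmount, currencyCode);
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint + "/purchase"))
                .header("Authorization", "Bearer " + apiToken)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        try {
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();   // raw provider response, returned as-is in this sketch
        } catch (Exception e) {
            throw new RuntimeException("Charge request failed", e);
        }
    }
}

An application written against PaymentService can then switch between provider adapters such as this one without changing its business logic.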
FUTURE RESEARCH DIRECTIONS
As the field of the cloud application platforms and platform basic services gains momentum, the demand for efficient frameworks and
methodologies for the design of service-based cloud applications rises. This article proposed a prototype implementation of such a development
framework.
The main limitation of the framework is that it is inherently restricted to the abstraction of the common features of the service providers. This
means that the reference API contains the operations which are collectively offered by the supported providers. This is a natural limitation when
dealing with API abstraction, and it is also encountered by similar solutions such as jClouds, mOSAIC and TOSCA, which likewise deal with cloud service API abstraction. One solution is to provide the application developers with direct access to the client adapters of a specific provider when they need provider-specific functionality which is not addressed by the reference API. In addition, the reference API, rather than being static, can be continuously updated to reflect the new features offered by the platform service providers. An alternative solution
suggests that while developers use the reference API for the parameters that are common among the providers, a secondary mechanism is
deployed to enable provider specific parameters to be included in the web requests. The mechanism may involve a call-back operation during
which the framework, depending on the concrete deployed provider, requests from the client the additional parameters to be included in the
request.
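A minimal sketch of such a call-back, with hypothetical interface and parameter names, could look as follows:

import java.util.Map;

// Hypothetical call-back through which the framework asks the application for
// provider-specific parameters that are not part of the reference API.
interface ProviderSpecificParameterCallback {
    Map<String, String> additionalParameters(String providerId, String operationName);
}

class CallbackExample {
    public static void main(String[] args) {
        ProviderSpecificParameterCallback cb = (providerId, operation) ->
                "spreedly".equals(providerId)
                        ? Map.of("provider_specific_field", "example-value")   // illustrative extra field
                        : Map.of();
        // The framework would invoke the call-back for the concrete deployed provider.
        System.out.println(cb.additionalParameters("spreedly", "ChargeCard"));
    }
}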
As mentioned earlier, there is a lack of Integrated Development Environments (IDEs) to assist in and automate part of the design process of service-based cloud applications. Towards this direction, an Eclipse-based plug-in editor has been constructed to allow users to define the mappings between the reference and the specific providers’ APIs and drive the code generation process. In the future the editor will be enriched to enable the definition of CloudActions and CloudMessage and to serve as the main point for the integration of platform services with the
application and the management of the already deployed ones.
Future work also involves the expansion of the framework so that it offers functionality for automatic discovery and recommendation of services.
Furthermore, it can provide billing information about the costs incurred by the application with respect to the platform basic services that it
consumes. Integration with tools associated with Big Data analysis techniques will also be considered. The long term vision of the authors is to
create a development environment which assists the software engineers throughout the whole process of designing, implementing and managing
a service-based cloud application.
CONCLUSION
The unprecedented rate at which data are generated nowadays, and their subsequent analysis, leads to new business opportunities for enterprises. This fact, together with the emergence of service-based cloud applications, paves the way for context-aware applications, which are
able to adapt themselves according to analysis of the collected data.
This article presented a development framework enabling the design of service-based cloud applications. Particularly, the framework facilitates
the integration of platform basic services in a consistent way as well as seamless deployment of the concrete providers implementing those
services. It achieves this by alleviating the variability issues that may arise across the platform services, namely: (i) the differences in the
workflow when executing an operation, (ii) the heterogeneous web API exposed by the providers and (iii) the various configuration settings and
authentication tokens that each provider requires. The main components of the framework are: (i) the reference meta-model, which enables the modelling of the abstract functionality of the platform basic services, and (ii) an ontology-based architecture for alleviating the differences between the providers’ web APIs and automatically generating the client adapters for the API invocation.
The proposed framework puts forward an approach to developing service-based cloud applications which are not tightly coupled with specific providers. The software engineers should focus on the services required by the application rather than on the providers implementing them. The selection of the
concrete service provider can be a secondary automated process, which is based on external stimuli as explained in the case of the payment
service. The collection and analysis of Big Data is able to drive the selection process and pave the way for the design of adaptable and context-
aware service-based cloud applications.
This work was previously published in the International Journal of Systems and Service-Oriented Engineering (IJSSOE), 5(4); edited by Dickson K.W. Chiu, pages 125, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
REFERENCES
Alur, D., Crupi, J., & Malks, D. (2001). Core J2EE Patterns: Best Practices and Design Pattern Strategies . Santa Clara, CA: Prentice Hall / Sun
Microsystems Press.
Ardagna, D., Di Nitto, E., Casale, G., Petcu, D., Mohagheghi, P., Mosser, S., & Sheridan, C. (2012). MODAClouds: A model-driven approach for the design and execution of applications on multiple clouds. In Atlee, J., Baillargeon, R., France, R., Georg, G., Moreira, A., Rumpe, B., & Zschaler, S. (Eds.), Proceedings of the 4th International Workshop on Modeling in Software Engineering. Zurich, Switzerland: IEEE. doi:10.1109/MISE.2012.6226014
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., & Zaharia, M. (2010). A view of cloud computing. Communications of the ACM, 53(4), 50–58. doi:10.1145/1721654.1721672
Binz, T., Breiter, G., Leymann, F., & Spatzier, T. (2012). Portable cloud services using TOSCA. IEEE Internet Computing , 16(3), 80–85.
doi:10.1109/MIC.2012.43
Cusumano, M. (2010). Cloud Computing and SaaS as new computing platforms. Communications of the ACM , 53(4), 27–29.
doi:10.1145/1721654.1721667
Demchenko, Y., Grosso, P., de Laat, C., & Membrey, P. (2013). Addressing big data issues in scientific data infrastructure. In Proceedings of the 2013 International Conference on Collaboration Technologies and Systems. San Diego, CA: IEEE. doi:10.1109/CTS.2013.6567203
Ferry, N., Chauvel, F., Rossini, A., Morin, B., & Solberg, A. (2013). Managing multi-cloud systems with CloudMF. In Solberg, A., Babar, M. A., Dumas, M., & Cuesta, C. E. (Eds.), Proceedings of the 2nd Nordic Symposium on Cloud Computing & Internet Technologies. Oslo, Norway: ACM. doi:10.1145/2513534.2513542
Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., & Berners-Lee, T. (1999). Hypertext Transfer Protocol -- HTTP/1.1. Network Working Group. doi:10.17487/rfc2616
Forrester Research Inc. (2012a). US mobile payments forecast, 2012 to 2017. Cambridge, MA: Carrington, D., Epps, R. S., Doty, C. A., Wu, S., & Campbell, C.
Forrester Research Inc. (2012b). EU mobile commerce forecast, 2012 to 2017. Cambridge, MA: Gill, M., Poltermann, S., Husson, T., O’Grady, M., Evans, P. F., & Da Costa, M.
Forrester Research Inc. (2013). US mobile retail forecast, 2013 to 2017. Cambridge, MA: Mulpuru, S., Evans, P. F., Roberge, D., & Johnson, M.
Gartner Inc. (2012). Forecast: Mobile payment, worldwide, 2009-2016. Stamford, CT: Shen, S.
Gonidis, F., Paraskakis, I., Simons, A. J. H., & Kourtesis, D. (2013). Cloud application portability: An initial view. In Georgiadis, K. C., Kefalas, P., & Stamaris, D. (Eds.), Proceedings of the 6th Balkan Conference in Informatics. Thessaloniki, Greece: ACM. doi:10.1145/2490257.2490290
Guillén, J., Miranda, J., Murillo, J. M., & Canal, C. (2013a). A service-oriented framework for developing cross cloud migratable software. Journal of Systems and Software, 86(9), 2294–2308. doi:10.1016/j.jss.2012.12.033
Guillen, J., Miranda, J., Murillo, J. M., & Canal, C. (2013b). Developing migratable multicloud applications based on MDE and adaptation techniques. In Solberg, A., Babar, M. A., Dumas, M., & Cuesta, C. E. (Eds.), Proceedings of the 2nd Nordic Symposium on Cloud Computing & Internet Technologies. Oslo, Norway: ACM. doi:10.1145/2513534.2513541
Hamdaqa, M., Livogiannis, T., & Tahvildari, L. (2011). A reference model for developing cloud applications. In Proceedings of the 1st International Conference on Cloud Computing and Services Science. Noordwijkerhout, The Netherlands: SciTePress.
Hunter, J., & Crawford, W. (2001). Java Servlet Programming . Sebastopol, CA: O'Reilly & Associates, Inc.
Jacobs, A. (2009). The pathologies of Big Data. Communications of the ACM, 52(8), 36–44.
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 60–69.
Jeffery, K., Horn, G., & Schubert, L. (2013). A vision for better cloud applications. In Ardagna, E., & Schibert, L. (Eds.), Proceedings of the 2013 International Workshop on Multi-cloud Applications and Federated Clouds. Prague, Czech Republic: ACM. doi:10.1145/2462326.2462329
Kourtesis, D., Bratanis, K., Bibikas, D., & Paraskakis, I. (2012). Software co-development in the era of cloud application platforms and ecosystems: The case of CAST. In Camarinha-Matos, L. M., Xu, L., & Afsarmanesh, H. (Eds.), Proceedings of the 13th IFIP WG 5.5 Working Conference on Virtual Enterprises. Bournemouth, UK: Springer Berlin Heidelberg. doi:10.1007/978-3-642-32775-9_20
Ning, L., Pedrinaci, C., Maleshkova, M., Kopecky, J., & Domingue, J. (2011). OmniVoke: A framework for automating the invocation of web APIs. In O’Conner, L. (Ed.), Proceedings of the Fifth IEEE International Conference on Semantic Computing. Palo Alto, CA: IEEE.
Opara-Martins, J., Sahandi, R., & Tian, F. (2014). Critical review of vendor lock-in and its impact on adoption of cloud computing. In Proceedings of the International Conference on Information Society (i-Society 2014). Bournemouth, UK: IEEE. doi:10.1109/i-Society.2014.7009018
Pautasso, C., Zimmermann, O., & Leymann, F. (2008). RESTful web services vs. “big” web services: Making the right architectural decision. In Huai, J., & Chen, R. (Eds.), Proceedings of the 17th International Conference on World Wide Web. Beijing, China: ACM.
Pedrinaci, C., Cardoso, J., & Leidig, T. (2014). Linked USDL: A vocabulary for web-scale service trading. In Presutti, V., d’Amato, C., Gandon, F., d’Aquin, M., Staab, S., & Tordai, A. (Eds.), The Semantic Web: Trends and Challenges: Proceedings of the 11th Extended Semantic Web Conference (LNCS) (Vol. 8465, pp. 68–82). Crete, Greece: Springer International Publishing. doi:10.1007/978-3-319-07443-6_6
Petcu, D. (2014). Consuming Resources and Services from Multiple Clouds. Journal of Grid Computing , 12(2), 321–345. doi:10.1007/s10723-
013-9290-3
Petcu, D., & Vasilakos, V. (2014). Portability in clouds: Approaches and research opportunities. Scientific International Journal for Parallel and
Distributed Computing , 15(3), 251–270.
Ranabahu, A., Maximilien, E., Sheth, A., & Thirunarayan, K. (2013). Application portability in cloud computing: An abstraction driven perspective. IEEE Transactions on Services Computing, PP(99), 1–1.
Rimal, B. P., Jukan, A., Katsaros, D., & Goeleven, Y. (2010). Architectural Requirements for Cloud Computing Systems: An Enterprise Cloud
Approach. Journal of Grid Computing , 9(1), 3–26. doi:10.1007/s10723-010-9171-y
Singhal, M., Chandrasekhar, S., Ge, T., Sandhu, R., Krishnan, R., Ahn, G. J., & Bertino, E. (2013). Collaboration in multicloud computing environments: Framework and security issues. Computer, 46(2), 76–84.
South East European Research Centre. (2013). Experimentation and categorisation of cloud application platform services. Thessaloniki, Greece:
Gonidis, F.
Storage Networking Industry Association. (2014). Cloud Data Management Interface. CDMI.
The Data Warehouse Institute. (2011). Big Data analytics, TDWI best practices report. Renton, WA: Russom, P.
CHAPTER 21
Modeling Big Data Analytics with a Real-Time Executable Specification Language
Amir A. Khwaja
King Faisal University, Saudi Arabia
ABSTRACT
Big data explosion has already happened, and the situation is only going to be exacerbated by the high number of data sources and the high-end technology prevalent everywhere, generating data at a frantic pace. One of the most important aspects of big data is being able to capture, process, and analyze data as it is happening, in real-time, to allow real-time business decisions. Alternate approaches must be investigated, especially ones consisting of highly parallel and real-time computations for big data processing. The chapter presents the RealSpec real-time specification language, which may be used for the modeling of big data analytics due to its inherent language features needed for real-time big data processing, such as concurrent processes, multi-threading, resource modeling, timing constraints, and exception handling. The chapter provides an overview of RealSpec and applies the language to a detailed big data event recognition case study to demonstrate the language's applicability to big data framework and analytics modeling.
INTRODUCTION
Data is growing at a significant rate, in the range of exabytes and beyond. This data is structured and unstructured, text-based as well as richer content in the form of videos, audio, and images, and comes from various sources such as sensor networks, government data holdings, company databases and public profiles on social network sites (Katina, 2013). Along with the various other challenges related to big data, such as storage, cost, security, ethics, or management, processing big data is perhaps even more challenging (Kaisler, 2013). One way to mitigate some of the processing challenges is to build prototypes or models of big data analytics that may allow understanding the underlying complexities and verifying functional correctness before actual implementation of the solutions. The sheer volume and velocity of big data are rendering our traditional systems incapable of performing analytics on data which is constantly in motion (Katal, 2013). One of the most important aspects of big data is being able to capture, process, and analyze data as it is happening, in real-time. Unlike traditional data stored in some database or data warehouse for later processing, real-time data is handled and processed as it flows into the system, along with timing constraints on the validity of, and the business response to, the incoming data. Even though various approaches have been suggested and introduced to address the problem of analyzing big data in the cloud, it has remained a challenge to achieve high performance, better parallelism and real-time efficiency, due to the ubiquitous nature of big data as well as the complexity of analytic algorithms (Osman, 2013). Alternate approaches must be investigated, especially ones consisting of highly parallel and real-time computations for big data processing. This chapter will present one such approach, using an executable real-time specification language based on the dataflow programming model. The language features will be presented, and the language's applicability for modeling big data analytics will be demonstrated through a detailed case study.
BACKGROUND
This section provides an overview of the dataflow programming paradigm and introduces the RealSpec real-time specification language.
Dataflow Programming Model
The dataflow programming paradigm was originally motivated by the possibility of exploiting massive parallelism (Johnston, 2004). The dataflow approach has the potential to exploit large-scale concurrency efficiently, with maximum utilization of computer hardware and distributed networks (Herath, 1988). Dataflow paradigms can employ parallelism both at the fine-grain instruction level and at various coarse-grain levels; however, fine-grain parallelism has significant overhead (Ackerman, 1982). Due to the fine-grain parallelism overhead and the needs of big data analysis and processing, this chapter will focus on coarse-grain parallelism.
Dataflow programs are represented as directed graphs. Nodes in a dataflow graph represent logical processing blocks that have no side effects, work independently, and can be used to express parallelism (Sousa, 2012). Freedom from side effects is necessary for efficient parallel computation (Tesler, 1968). Directed arcs between the nodes represent data dependencies between the nodes. Arcs that flow toward a node are said to be input arcs to that node, while those that flow away are said to be output arcs from that node (Johnston, 2004). The status or intermediate results of each node are kept in a special node-level memory that is capable of executing the node when all of the necessary data
values have arrived (Ackerman, 1982). When a node in a dataflow network receives all required inputs, the node operation is executed. The node
removes the data elements from each input, performs its operation, and places the transformed data on some or all of its output arcs. The node
operation then halts and waits for the next set of input data to become executable again. In this way, instructions are scheduled for execution as soon as their operands are available, in contrast to the control flow execution model, where instructions are executed only when the program counter reaches them, regardless of whether they were ready for execution before that. Hence, in a dataflow model, several instructions
can potentially execute simultaneously providing the potential for massive instruction level parallelism. Figure 1 provides a simple dataflow
graph for a set of programming statements. The boxes represent execution nodes with operations. The directed arrows represent the arcs along
which the data flows. The letters represent streams of data flowing in or out of the program fragment. The numbers represent constant streams that supply the same value at each time instant.
Considering the program fragment execution under control flow model in Figure 1, each statement is executed sequentially, i.e., first B gets the
sum of A and 2, then D gets the product of C and 5, and finally E gets the quotient of B divided by D. The same program fragment under dataflow
model can add A and 2 and multiply C and 5 simultaneously. Both results can then be put on their respective output arcs as B and D. As soon as
both B and D are available at the inputs of the final ‘/’ node, the operation is triggered and the result is placed as E on the output arc. The above
example not only shows the instruction level parallelism of the dataflow model but also the pipelined nature of the dataflow processing. As soon
as first set of B and D are available for the ‘/’ node and it starts its execution, the next set of B and D values can simultaneously be computed by
the ‘+’ and ‘*’ nodes. This may go on as long as streams of input values are available.
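As a rough analogy, the single-shot behaviour of this dataflow graph can be mimicked in Java: the ‘+’ and ‘*’ nodes run as independent asynchronous tasks, and the ‘/’ node fires only once both of its inputs have arrived. All names here are illustrative, and the sketch covers only one set of input values rather than a stream.

import java.util.concurrent.CompletableFuture;

public class DataflowSketch {
    public static void main(String[] args) {
        int a = 8, c = 1;
        // The '+' and '*' nodes fire as soon as their inputs are available and may run in parallel.
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(() -> a + 2);
        CompletableFuture<Integer> d = CompletableFuture.supplyAsync(() -> c * 5);
        // The '/' node is triggered only once both B and D have arrived on its input arcs.
        CompletableFuture<Integer> e = b.thenCombine(d, (bv, dv) -> bv / dv);
        System.out.println("E = " + e.join());   // prints E = 2
    }
}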
Big data processing inherently involves parallel computations and non-determinism. Sequential von Neumann machines are not oriented to such types of computations (Herath, 1988). Stream computing is critical for distributed, real-time analytics on live and dynamic big data streams (Osman, 2013). The transition from batch processing to stream processing inside the data cloud fits various time-sensitive applications and real-world use cases (Osman, 2013). Models, languages, and architectures are needed that support such parallelism. The dataflow paradigm can support such massive parallelism at various levels of granularity. With the evolution of multi-core processors and large connected processing farms, the high parallelism supported by the dataflow paradigm can now be realized, allowing implementation of the highly parallel computations necessitated by big data analytics. Non-determinism, however, is not inherently present in dataflow streams, as the dataflow paradigm is functional, meaning a dataflow program will produce the same set of outputs for a given set of inputs (Johnston, 2004). Some research has been done to add non-deterministic behavior to dataflow models (Arvind, 1977; Kosinski, 1978; Broy, 1988). The non-determinism in dataflow models will be further addressed later in the chapter.
The proposed modeling language, based on the dataflow paradigm, supports parallel computations necessary for big data processing. The next
section provides an overview of the proposed language along with various features needed for modeling big data analytics. The chapter discusses
language support for non-determinism as well as the use of multi-core processors on a cloud in one of the later sections.
RealSpec Real-Time Executable Specification Language
RealSpec is a declarative executable specification language for the prototyping of concurrent and real-time systems, based on a dataflow functional model (Khwaja, 2008a, 2008b, 2008c, 2009, 2010). RealSpec is developed on top of the Lucid dataflow programming language by enhancing Lucid
with features for real-time systems (Wadge, 1985). The statements in a RealSpec specification are equations defining streams and filters, not
commands for updating storage locations as in the case of traditional imperative programming languages. Hence, RealSpec is a definitional
language. The equations in a RealSpec specification are assertions or axioms from which other assertions can be derived using the Lucid axioms
and rules of inference (Ashcroft, 1976). For example, the following RealSpec specification defines x (the output) to be the data stream <1, 2, 3, 4,
5,…> at time index <t0, t1, t2, t3, t4,…>, respectively, an infinitely varying sequence:
x
where {
x = 1 fby x+1;
}
In the above specification, the where clause is an expression together with a set of definitions of variables used in the expression. These
definitions are also called operator nets. The binary operator fby (called followed by) provides abstract iteration over sequences. The first
argument of fby primes the pump that permits successive future values to be generated. Note that, since RealSpec is referentially transparent,
the variable x denotes the same stream in all contexts. In the above example, the x on the right-hand side of the fby is the same as the one being defined, which begins with 1. Since x is defined to be 1 at index 0, the next value of the stream at index 1 can be produced to be 2 using x+1, and so on ad infinitum.
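A loose Java analogy of this definition is an unbounded iterated stream; the sketch below only approximates the Lucid/RealSpec semantics and prints the first five values.

import java.util.stream.IntStream;

public class FbyAnalogy {
    public static void main(String[] args) {
        // Roughly analogous to "x = 1 fby x+1": an unbounded stream 1, 2, 3, 4, 5, ...
        IntStream.iterate(1, x -> x + 1)
                 .limit(5)                  // take the values at time indexes t0..t4
                 .forEach(System.out::println);
    }
}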
User Defined Algebras and Objects in RealSpec
Lucid is based on a few fixed algebras such as integers, reals, Booleans, and strings. However, Iswim (Landin, 1966), which is the basis for Lucid, does not restrict the data types. In fact, Iswim is a family of languages supporting a set of primitive things consisting of data objects, a collection of operations over this set, and a collection of symbols denoting objects and operations (Wadge, 1985). In order to be able to represent processes and resources in RealSpec, Lucid’s semantics were enhanced to include user-defined algebras for representing complex data types and objects. Based on the Iswim family of languages, consider a data object that is not only a collection of data elements but of functions as well. In other words,
data objects may have operator nets or filters in addition to a set of variables. A Lucid program then would be an operator net for this “data with
operator net”. Each instance of the input would be some form of this data with a specific internal state based on the values of its member data
types and the output would be another form of this data with its internal state changed through the internal operator net of the data object. The
internal operator net of the data object is manipulated by the operator net of the main Lucid program to produce the output data object with the
changed state (Freeman-Benson, 1991). Since input/output streams contain streams of data objects which in turn contain instance variables, the
“value” in these streams can be changed in two separate ways (Freeman-Benson, 1991):
• By a stream of the same object with different values in the instance variables.
Thus, the instance variables of a data object must themselves be full-fledged Lucid streams, resulting in streams of data objects which contain streams of data objects, followed by potentially an n number of further streams. For example, consider a data object d with an internal instance variable x. The stream representation for this data object may appear as:
where t’ is the time index for the d stream and di represents object d with an updated state based on its internal instance variable and manipulation functions. For each di, the internal instance variable x goes through a sequence of values indexed by t”. The stream values ai, bi, and ci represent possibly different value streams for x for each di stream element.
System, Processes, and Threads
• System: A RealSpec specification starts with a system construct that provides a context for the rest of the specification. The system
construct consists of declaration of system resources, statically defined processes, process and thread creation order, and global system level
functions, if any. The system definition in RealSpec is specified using a system construct:
system sys {
resources { … }
processes { … }
functions { … }
}
Resources, processes, and global functions are defined within the system construct in their respective blocks. The three types of blocks can be
defined in any order within the system construct. A system may or may not use any resources and globally defined functions. Hence, it is possible
that a system definition does not have these two blocks defined. However, a system must have at least one process defined that will serve as the
main process. For a given problem, active and passive problem components are identified and defined using the process and resource language
constructs, respectively, within the system construct.
• Process: Processes are active components in RealSpec, i.e., processes have their own execution threads and instigate various system
actions. System processes are declared within the processes block of the system construct. All statically created system processes are
required to be declared in the processes block. For example, the system sys declares three processes p1, p2, and p3:
system sys {
processes {
p1;
p2;
p3;
}
…
}
A process object is defined using a process construct. A process construct consists of the keyword process with the process name, followed by the process body within curly brackets. The body of a process definition may consist of declarations of primitive data variables, other active or passive objects, and a set of process functions. The functions in a process are declarative assertions defining the operator nets of the process. All of the functions or operator nets of a process execute simultaneously and synchronously. For example, a process factorial that contains a single function calcfac(int n) to calculate a factorial can be defined as follows:
process factorial() {
calcfac(int n)
where {
calcfac(int x) = if x < 2 then 1 else x * calcfac(x-1);
}
}
The same program written in an imperative language like C using recursion will look something like following:
#include<stdio.h>
long factorial(int);
int main()
{
int n;
long f;
scanf("%d", &n);
f = factorial(n);
printf("%d! = %ld\n", n, f);
return 0;
}
long factorial(int n)
{
if (n == 0)
return 1;
else
return(n * factorial(n-1));
}
While there are some obvious similarities between the two programs, such as the use of variables (x, n) and expressions built up from basic arithmetic operators, the two languages are quite different. In the above example, the two languages appear more similar than they really are, due to the similar mathematical symbolism and the conscious use of C-like notation for RealSpec. These symbols, however, are used in very different ways. In the C program, statements are commands, whereas in RealSpec they are definitions. The variables in the C program are storage locations, whereas in RealSpec they are variables in the true sense of mathematics. In addition, RealSpec provides default process features such as inter-process communication, multi-threading, and process priorities by virtue of the process construct, whereas these features would have to be programmatically added to the C version of the program.
Processes may be created statically or dynamically. All processes declared at the system level are statically created when system execution is started and remain active as long as the system is running. The processes in RealSpec are by default created and executed asynchronously. The functions or operator nets within a process, however, are executed synchronously within that process. Processes may be dynamically created by other processes by calling the start() function of a process. Each process has a pair of implicit functions, start() and end(). Any process that is expected to be dynamically created must be declared with a dynamic qualifier, which notifies the process to defer its creation until its start() function is called.
process p1() {
process dynamic p2;
f()
where {
… p2.start() …
}
}
• Threads: A process has a single execution thread by default. However, a process may have as many threads as needed. Multiple threads
can be defined as part of a process definition. When a process is created, all defined threads are automatically created and start simultaneous
execution. The entire context of a process is duplicated for each created thread except for any shared resources. The order of thread creation
and start is random unless specifically defined within the system definition using the precedence constraints. In the example below, x gets
the value of x+1 if the executing thread is th1, indicated by the property pid, otherwise x gets the value of x*2. Hence, the output for thread
th1 will be <1, 2, 3, 4,…> whereas for th2 that will be <1, 2, 4, 8,…>:
process p() threads th1, th2 {
x
where {
x = 1 fby if pid == 0 then x+1 else x*2;
}
}
Inter-Process Communication
The processes can also communicate with each other via message passing by using a pair of send and receive thread functions. Any two processes
that are trying to communicate with each other using message passing must have matching send and receive calls. A message handshake takes
place when a process or thread sends a message to another process or thread. The message handshake is based on using an implicit message
buffer that is associated with each process to buffer the messages and an acknowledgement sent from the receiving process to the sending
process. The message communications can either be synchronous or asynchronous. In the case of synchronous message passing, the two
processes p1 and p2 are blocked or synchronized at the send and receive pair until the message transfer handshake is complete. The
asynchronous message passing is indicated by specifying the async qualifier with both send and receive calls. The send call will put the message
in p2’s message buffer with p1’s pid and will return immediately. The receive call will get the message from p2’s message buffer if the
message’s pid matches p1’s pid; otherwise it will return. In the following example of synchronous message passing with timeouts, p1 blocks for at most 50 microseconds and p2 blocks for at most 75 microseconds:
process p1() {
… p2.send(data)@tout 50 us; …
}
process p2() {
… x = p1.receive() @tout 75 us; …
}
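For comparison with conventional threading libraries, the following Java sketch mimics the timed send/receive pair with a bounded message buffer; it ignores the acknowledgement handshake and the pid matching, and all names are illustrative (on a real JVM the microsecond timeouts of the example are far below typical thread start-up time).

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class MessageTimeoutAnalogy {
    public static void main(String[] args) throws InterruptedException {
        // A per-process message buffer, loosely mirroring RealSpec's implicit buffer.
        BlockingQueue<String> p2Buffer = new ArrayBlockingQueue<>(16);

        Thread p1 = new Thread(() -> {
            try {
                // Timed send, roughly "p2.send(data) @tout 50 us".
                boolean delivered = p2Buffer.offer("data", 50, TimeUnit.MICROSECONDS);
                System.out.println("p1 delivered: " + delivered);
            } catch (InterruptedException ignored) { }
        });

        Thread p2 = new Thread(() -> {
            try {
                // Timed receive, roughly "x = p1.receive() @tout 75 us".
                String x = p2Buffer.poll(75, TimeUnit.MICROSECONDS);
                System.out.println("p2 received: " + x);
            } catch (InterruptedException ignored) { }
        });

        p1.start(); p2.start();
        p1.join();  p2.join();
    }
}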
Timing Constraints and Exceptions
RealSpec supports specification of both absolute and relative timing constraints for processes, messages, and data. In addition, periodic and
aperiodic constraints can also be specified. The absolute timing constraints consist of minimum, maximum, and durational constraints using the @delay, @tout, and @dur operators. In the following example, the input variable is not read until a minimum time of 25.8 us has passed:
process p() {
… x = input @delay 25.8 us; …
}
In the following example, process p1 waits for a message acknowledgement from process p2 for a maximum of 5.6 ms:
process p1() {
… p2.send(data) @tout 5.6 ms; …
}
In the following example, a process p1 is put to sleep by another process p2 for 10 ms:
process p2() {
… p1.sleep() @dur 10 ms …
}
The events in RealSpec are by default aperiodic. Periodic events may be specified by using the @period operator that allows specification of the
lower and upper period limit as well as period duration, e.g., x is periodically input with a lower period limit of 4.5 ms, an upper period limit of
5.5 ms and duration of 1 second:
Relative timing constraints may be specified using a wrt qualifier that stands for “with respect to”. The wrt qualifier takes as a parameter an
event, process, or message name to be used as a reference for calculating a timing constraint of the associated timing operator. In the following
example, the y input delay is calculated with respect to the delay of x:
x = input_x @delay 10 s;
y = input_y @delay 5 s wrt x;
In the dataflow semantics, the wrt qualifier is applied to the previous value of the reference event. In the above example, x will get input_x after
10 seconds delay and y will get input_y after (10 + 5) seconds delay. The dataflow stream effect of wrt is that the qualifier delays or pushes out
the stream one time index due to the dependence on the previous time value of the reference event stream.
Constraint violations are handled using exception handlers and exceptions are raised within process functions using the keyword throw along
with the exception name. Once raised, an exception must be handled by the process via a special process function called exception handler.
Hence, the scope of exceptions and exception handlers in RealSpec is local to a process. Since the exception handlers are functions defined within
a process scope, these handlers can access process level resources and variables. The exception handler functions are executed at the highest
priority level. If multiple exceptions are raised simultaneously within a process, the order in which the exceptions were raised is used for handler execution. If
multiple exceptions are raised simultaneously by various processes, the priority of processes is used to determine the exception handler execution
order. The following example shows an explicit throwing of an exception when x exceeds a certain threshold value limit where
valueLimitException() is the name of the exception handler used to handle this particular exception:
Abort is used in the case of fatal or hard real-time errors when recovery is not possible and is raised by using the abort command anywhere in
the system including exception handlers. RealSpec terminates the entire system specification execution on executing the abort command.
Resource Modeling
RealSpec supports modeling of system resources using data objects. In RealSpec, resources are considered passive elements that do not have their own execution thread; instead, these resources passively wait for other components to require their services. Passive components are usually activated on receiving messages from other components. The predefined resources in RealSpec consist of: (a) abstract data structure resources such as semaphore, mutex, array, queue, and stack; and (b) hardware resources such as signal and analog IO. The predefined resource objects are multi-thread safe, i.e., all resource objects support simultaneous access through multiple threads by using internal semaphores and a mutual exclusion mechanism. Users can also define new custom resource objects with the following resource construct template:
resource <name>(<parameters>) {
<resource variables>
<resource functions>
}
Once defined, resource objects must be instantiated before being used. In the following example, the system sys declares two resource objects: an
input signal resource for reading a switch state and a five element queue data structure:
system sys {
resources {
signalin switch;
queue q(5);
}
}
Control System Modeling
RealSpec supports modeling of control systems using constructs to model digital and analog IO. A signal resource object is used to model a digital
signal, also called discrete or quantized signal. In most applications, digital signals are represented as binary numbers so their precision of
quantization is measured in bits. These signals are typically an encoding of data using 1’s and 0’s represented by high and low voltage levels. The
signal resource may be used to model interfaces that have two states, on or off, e.g., a valve or a motor on/off button. The RealSpec signal
resource has three variations: signalin, signalout, and signalinout, to model input signals, output signals, and duplex signals, respectively. The following specification outputs true if the input signal is high for at least 30 seconds, otherwise it outputs false:
signalin inputSignal;
x = if inputSignal.pulsein() == 1 && inputSignal.pulseduration() >= 30 s
then true else false;
RealSpec provides an analog IO resource object to model analog interfaces. The resource object analogin is used to model a feedback element or
analog-to-digital convertor. The resource object analogout is used to model feed forward or digital-to-analog convertor. In the following
specification snippet example, an analogin resource named concen is declared to read the methane gas concentration in the surrounding environment. The gas concentration is measured as a current. The input current low and high reference points (4 mA and 20 mA) are passed in as parameters when declaring the resource. The digital code resolution is assumed to be 24 bits, also passed in as a parameter. The specification continuously checks the gas concentration and, if the concentration is between the 30% and 45% thresholds, outputs true; otherwise it outputs false.
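The arithmetic behind such a check can be sketched in Java as follows, assuming the 24-bit code is mapped linearly onto the 4-20 mA range and the 30%-45% window refers to the fraction of that range (all names are illustrative):

public class GasConcentrationCheck {

    // Maps a raw 24-bit ADC code onto the 4-20 mA input range and checks
    // whether the resulting concentration lies between 30% and 45% of full scale.
    static boolean withinThreshold(long rawCode) {
        final double LOW_MA = 4.0, HIGH_MA = 20.0;
        final long FULL_SCALE = (1L << 24) - 1;          // 24-bit resolution

        double currentMa = LOW_MA + (rawCode / (double) FULL_SCALE) * (HIGH_MA - LOW_MA);
        double concentration = (currentMa - LOW_MA) / (HIGH_MA - LOW_MA); // 0.0 .. 1.0
        return concentration >= 0.30 && concentration <= 0.45;
    }

    public static void main(String[] args) {
        System.out.println(withinThreshold(0x600000));   // roughly 37.5% of range -> true
        System.out.println(withinThreshold(0x100000));   // roughly 6.3% of range  -> false
    }
}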
RealSpec Compile Process
A prototype compiler was developed to be able to build and execute RealSpec specifications. The RealSpec compiler was developed in C on a Sun Solaris system running SunOS 5.8. The generated target code was then compiled for an Intel 386 architecture based embedded platform running the Embedded Configurable Operating System (eCos) 3.0 RTOS (Massa, 2003; eCos). The RealSpec compiler has a two-stage compile process. A RealSpec specification is first compiled into equivalent C code. The generated C code is then compiled for the target platform running eCos. eCos has been designed to support applications with real-time requirements, providing features such as multithreading, full preemptability, minimal interrupt latencies, all the necessary synchronization primitives, scheduling policies, and the interrupt handling mechanisms needed for these types of applications. RealSpec’s features and constructs are mapped to eCos features. For example, a RealSpec process or thread declaration
along with priority is mapped to eCos thread create function as follows:
RealSpec:
system sys {
processes {
producer;
consumer;
}
}
eCos:
This section provided an overview of the RealSpec real-time executable specification language and its various features. The next section will provide a discussion on the application of RealSpec to Big Data analytics modeling. In doing so, the RealSpec features and constructs that are appropriate for modeling big data will be identified, along with the limitations of the language and any potential enhancements needed to it for the modeling of big data.
REALSPEC FOR BIG DATA ANALYTICS
As discussed in the previous section, the RealSpec real-time specification language has features that may be used to model the handling and processing of Big Data. Big data analytics modeling can be done at the workflow level or at the algorithm level. Since RealSpec is a functional language based on Lucid, and there have been extensive examples of algorithmic programs written in Lucid (Wadge, 1985), the modeling of big data algorithms in RealSpec is a matter of using Lucid language features to write functional programs. Hence, this chapter will focus on the discussion of RealSpec’s appropriateness at the big data workflow or process flow level, to demonstrate that RealSpec can be used to model big data at the architectural level as well. Lee et al. (2013) presented a workflow framework architecture and a process flow for big data analytics. Lee et al.’s (2013) framework and process flow will be loosely used in this section as a basis for discussing RealSpec’s appropriateness for modeling big data workflows.
A Big Data Workflow Framework and Process Flow
Lee et al.’s (2013) workflow framework architecture consisted of four layers: workflow client, workflow management, compute service, and
compute resource as shown in Figure 2. In the framework architecture, the workflow client layer provides the end-user environment to compose
workflows with various workflow management tools. The workflow management layer provides major functions to support the workflow
execution such as scheduling workflow tasks to remote applications, monitoring the status of workflow tasks and keeping track of computing
resource usage, encryption and decryption of data as well as user authentication, and ensuring data required by a workflow is efficiently
accessible. The compute service layer provides a collection of services to perform specific functions such as importing datasets, filtering data with specific criteria, classifying data with a particular algorithm, and exporting results. Finally, the compute resource layer provides computational platforms
where the executable codes are hosted.
Figure 2. An architecture of workflow framework
Figure 3 shows a process flow diagram proposed by Lee et al. indicating how a workflow with messages and dataset is executed in the above
framework (Lee, 2013). A user composes a workflow with tasks available for data analytics which is then converted into XML. The XML
commands are submitted to the backend workflow execution engine where the resource request and dataset request services are invoked to
provision computing resources and to upload dataset. When the cluster is ready, the workflow planner then utilizes the scheduler to allocate tasks
to the computing resources.
Figure 3. A Process Flow of the Workflow Framework in Figure 2
RealSpec Coverage of the Big Data Framework and Process Flow
Figures 2 and 3 show anticipated RealSpec coverage at the workflow architecture level with the shaded boxes. In the workflow framework in
Figure 2 and the process flow in Figure 3, RealSpec can be used to model the Workflow Management, the Compute Service, and the Compute
Resource layers. The Workflow Client layer does not need modeling in RealSpec, since the workflow tasks, which in Lee et al.’s framework had to be created in an external language such as XML, can be coded directly with RealSpec language constructs.
The Workflow Management layer may be modeled using RealSpec process and thread active components as this layer consists of mostly
functions instigating various workflow execution activities. The inter-process communication needed between these major functions can be represented by RealSpec’s synchronous or asynchronous inter-process message passing mechanism.
The Compute Service layer consists of services performing various algorithmic functions to import datasets, filter data with specific criteria, and classify data. These functions can be implemented using RealSpec’s basic computational operators and functions.
Finally, the Compute Resource layer has to do with actual computational platform hosts on a local cluster or a public cloud. These actual
computational hosts may be modeled as a RealSpec resource data object with the acquisition and provisioning of these computational host
resources being done at the Compute Service layer with a RealSpec provisioning process.
Power Consumption Data Analytics Case Study
This section builds an architectural level RealSpec specification and demonstrates RealSpec modeling of big data analytics by using a simplified
version of the power consumption data analytics example from Lee et al. (2013). The example considers n rooms in a building with various types of power consumption such as lighting, ventilation, low-voltage devices such as printers, and high-voltage devices such as servers. The power consumption workflow may consist of collecting the power consumption data at specific intervals for all rooms, extracting specific features from the data such as sum and max, clustering the data into groups based on some criteria, say, max ranges, and performing some correlation analysis on the clustered data. The power consumption data computation will also require resource provisioning on a private or public cloud.
The first step would be to identify the various types of process and resource objects required to model this example. Table 1 captures the required processes and resources for this example, along with their descriptions. The choice of having a separate thread perform each room analysis in the analyze_room_power process is to improve overall parallel processing. Likewise, the cluster_provision process has a separate provisioning thread complementing each room analysis thread, ensuring that each thread gets dedicated provisioning support and avoiding blocking threads waiting for computational resource allocation. The use of multi-threading in both of these cases helps in exploiting highly parallel processing. In this chapter, the RealSpec keywords and language built-in features are highlighted with bold font in the specification snippets.
Table 1. RealSpec processes and resources for the power consumption analytics
resource room_data {
float sum;
float max;
}
The cluster_provision process uses the compute_cluster resource to get an available platform from the cloud and to schedule the requesting thread's execution on that platform. Each cluster_provision thread (cp_0 to cp_n) executes the same specification with its own execution context, defined by the thread process id, pid. The function provision_resource accepts the requesting thread's id and does the allocation and scheduling of the platform. In the event that a platform is not available on the cloud for the requesting thread, the function throws an exception. The exception handler prints a message followed by (fby) a retry after 30 minutes. This will continue to happen until the thread is allocated a platform. Different handling strategies could also have been used here. For example, after n attempts a hard fail could have been implemented with a system abort. Note that RealSpec does not currently support physical scheduling of resources; this is only a suggested feature that could be added to the language. At present, the actual scheduling act may be simulated by printing a message. The function release_resource frees up the allocated resource for a thread once the thread's execution is completed. The special keyword eod (end of data stream) is used to terminate the perpetual stream of the release_resource function, to ensure the freeing of the platform is done only once for a requesting thread. The process may define other platform provisioning and management functions as necessary.
resource compute_cluster(){
private array int platforms(1000) = 0;
get_platform(int r_id) = i asa (platform_is_found || i<0)
where {
i = platforms.size-1 fby i-1;
platform_is_found = if i>=0 && (platforms[i]>>a) == 0
then true else false;
platforms = platforms[i]<<(r_id+1) when platform_is_found;
}
release_platform(int r_id) = (if platform_is_found then true else false) asa (platform_is_found || i<0)
where {
i = platforms.size-1 fby i-1;
platform_is_found = if i>=0 && (platforms[i]>>a) == r_id+1
then true else false;
platforms = platforms[i]<<0 when platform_is_found;
}
available_platforms() = available asa i<0
where {
i = platforms.size-1 fby i-1;
available = 0 fby available+1 whenever platforms[i]>>a == 0;
}
…
}
The compute_cluster resource uses an internal array for keeping track of cloud platform allocations. The internal array is initialized to all zeroes
indicating that all platforms are available. When the function get_platform is called by the cluster_provision process with the pid of a thread, the function searches for an available platform and assigns the first platform found that has a corresponding zero in its array location. The function then stores the pid of the thread plus one to indicate that this platform is now occupied by that thread. The three equations inside the where
clause are all simultaneously updated for each time index. As explained previously in the chapter, the fby (followed by) operator allows abstract
iteration or looping. So, the first and second equations are updated simultaneously for each time index. The third equation is only updated when
the condition of the when operator is true. The following is a sample of the values attained by the left side of the variables in each equation at
various time indexes assuming there is an empty slot at location 996:
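Assuming additionally that slots 999 down to 997 are already occupied, the equations would evaluate roughly as follows:
t0: i = 999, platform_is_found = false, platforms unchanged
t1: i = 998, platform_is_found = false, platforms unchanged
t2: i = 997, platform_is_found = false, platforms unchanged
t3: i = 996, platform_is_found = true, platforms[996] set to r_id+1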
The function get_platform gets the index of the platform as a value as soon as (asa) a platform is found or -1 if no platform is found. At this point
the equation computation is stopped for all three equations and the result is returned to the caller. The functions release_platform and
available_platforms work in a similar manner.
The analyze_room_power process declares a thread for each room in the building to perform power consumption data accumulation and
analysis in parallel. The order of thread creation and execution start is random unless specified at the system level declaration with precedence
constraints. In this particular case, the order is not important so the default order is accepted. The process uses a predefined RealSpec
resource analogin to sample room power at specific intervals and converts the analog data into digital based on the low and high ranges of
reference input power and the number of bits used for the conversion resolution (Khwaja, 2009). In this particular example, the room power is
sampled every hour.
The function allocate_resource in the analyze_room_power process uses the process id of each of the analyze_room_power threads to call the
corresponding thread in the cluster_provision process, which in turn tries to acquire a computing platform from the cloud and schedules the analyze_room_power thread for execution. Each case condition is executed uniquely by only the thread with the specific pid value on the left side of the case condition. This mechanism allows highly parallel execution of the platform provisioning and scheduling of each thread for computation on the allocated platform. This highly parallel resource manipulation is possible since, in RealSpec, all predefined resources, such as array, are thread safe.
The functions sum and max in the analyze_room_power process perform the necessary statistical computation for each room power data. The
running sum is maintained by reading the previous sum from the corresponding room location, using the pid as index in the global rooms array, and then updating it for each subsequent power reading taken by the analogin resource at one-hour intervals by adding the reading to the previous value of the sum. The running track of peak values is computed in a similar manner. The sum and max statistical computations are used as a sample; it is possible to add more statistics-related computations to the analyze_room_power process in the same manner.
process clustering_data() {
cluster_max() = status asa i==rooms.size where {
i = 0 fby i+1;
status = cond {
a >= RANGE_1_L && a < RANGE_1_H: write_to_cluster(i,a,1);
a >= RANGE_2_L && a < RANGE_2_H: write_to_cluster(i,a,2);
…
a >= RANGE_m_L && a < RANGE_m_H: write_to_cluster(i,a,m);
}
where {
a << rooms[i].max;
RANGE_1_L = …;
RANGE_1_H = …;
RANGE_2_L = …;
RANGE_2_H = …;
…
}
}
cluster_size() = …
…
}
The cluster_max function walks through the global rooms array and clusters the rooms using some predefined range rules. The example uses a simple range method here; however, any clustering algorithm may be used. The correlating_data process may be specified in a similar manner.
process power_consumption_manager() {
analyze_max_building_consumption() = correlating_data.correlate_max()
asa clustering_data.cluster_max() @period 24 h;
analyze_total_building_consumption() = correlating_data.correlate_sum()
asa clustering_data.cluster_sum() @period 24 h;
…
}
The power_consumption_manager process defines the workflow for the building power consumption analysis. The process correlates the
respective type of data as soon as the 24 hours of data is clustered.
DISCUSSION AND FUTURE RESEARCH DIRECTIONS
The case study in the previous section demonstrated the use of the RealSpec real-time specification language for modeling big data analytics. There
are certain areas and features of the language that seem to be a natural fit for big data framework and analytics modeling:
• Ubiquitous Big Data Object Modeling: RealSpec resource construct may be used to model different types of big data objects such as
video, audio, text, structured docs, and unstructured docs, with specific attributes.
• On the Fly Processing and Analysis: Using processes, multi-threading, resources, and timing constraints, RealSpec provides the capability to perform on-the-fly processing and analysis of unlimited streams of big data objects. These features make RealSpec especially suitable for time-sensitive big data capture, manipulation, and analysis.
• Parallel Streams of Big Data Objects: Simultaneous streams of different types of related or unrelated big data objects (video, audio, structured and unstructured documents) may be inherently supported by the dataflow paradigm, and hence by RealSpec, via multiple data streams. Dependent stream objects may be related by time indexes to reflect the ordering of objects. For example, the video stream object at t = 0 may be related to the audio stream object at t = 0. Timing of related objects such as video and audio can be modeled by timing constraints using absolute or relative timing.
• Parallel Processing of Big Data Streams: Using RealSpec concurrent processes and multi-threading, any number of synchronous and/or asynchronous processes/threads with precedence constraints can be created for various types of processing, as demonstrated by the building power consumption case study in the previous section. Inter-process message passing can be used for communication between processes or threads as well as for synchronization.
• Complex Event and Data Handling: Predefined resources such as analogin/out and signalin/out provide simplified handling and manipulation of complex analog data and digital events, as demonstrated in the building power consumption case study.
• Conciseness of the Specifications: The above case study demonstrated that complex big data analytic scenarios may be concisely
specified with RealSpec due to the declarative nature of the language.
• Scalability.
However, there are areas related to big data capture, processing, and analysis that the language currently lacks. Adding explicit support for these features will further enhance the applicability of the language in this domain:
• Actual Platform Provisioning: Currently, RealSpec does not support actual hardware platform level services; all of these are soft modeled. Adding cloud-level services may enhance the language's executability. These services may also help in better modeling of resource provisioning and in early identification of any provisioning-related issues, which is usually the intent of modeling.
• Execution Scheduling: As with the actual hardware provisioning, the language also does not support explicit scheduling of computation
on specific computing platforms. This may be another enhancement similar to the platform provisioning.
• Data Brokerage: Similar to the above two capabilities, RealSpec does not directly support determining the best data location and/or transfer for workflow execution optimization.
• Non-Determinism: Non-determinism is partially supported in RealSpec. The processes, and hence the threads of these processes, run asynchronously with respect to each other. Each process has its own internal set of data streams, and hence the data streams that each process works with are non-deterministic with respect to each other. This asynchronicity causes the data flow between two or more processes, or threads of these processes, to proceed at different periods, as the order of their execution may not be guaranteed due to decisions taken by the scheduler. In addition, the asynchronous events modeled with the analog and signal resources may also introduce non-determinism, since these events may occur at different periods. However, the data streams within a process are deterministic: all data flow within a process is synchronous. Most big data non-determinism should be addressed by the asynchronous processes and events. Non-determinism at the dataflow level within a process may need to be addressed if a need arises to model non-determinism at that level. Some research in this area has been carried out (Arvind, 1977; Kosinski, 1978; Broy, 1988) and may be leveraged if needed.
CONCLUSION
RealSpec real-time specification language may be appropriate for modeling big data workflow and analytics by virtue of inherent language
features needed for concurrent and real-time processing of big data streams. The building power consumption case study demonstrated
applicability of various language features in this context. The language allows development of concise specification models for the big data
analytics. Furthermore, the executability of the language will allow early modeling and prototyping of such complex, real-time big data analytics.
The chapter also highlighted certain features that are lacking in the language. Adding these features may further enhance the language's applicability to big data analytics. Further investigation and evaluation are needed to fully understand the scope and usage of these potential feature enhancements to the RealSpec language.
This work was previously published in the Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence edited by
Noor Zaman, Mohamed Elhassan Seliaman, Mohd Fadzil Hassan, and Fausto Pedro Garcia Marquez, pages 289-312, copyright year 2015 by
Information Science Reference (an imprint of IGI Global).
REFERENCES
Arvind, G. K. P., & Plouffe, W. (1977). Indeterminancy, monitors, and dataflow. In Proceedings of the Sixth ACM Symposium on Operating
System Principles. West Lafayette, IN: ACM. 10.1145/800214.806559
Ashcroft, E. A., & Wadge, W. W. (1976). Lucid – A Formal System for Writing and Proving Programs. SIAM Journal on Computing, 5(3), 336–354. doi:10.1137/0205029
Broy, M. (1988). Nondeterministic Data Flow Programs: How to Avoid the Merge Anomaly. Science of Computer Programming, 10(1), 65–85.
doi:10.1016/0167-6423(88)90016-0
Freeman-Benson B. (1991). Lobjcid: Objects in Lucid. In Proceedings of the 4th International Symposium on Lucid and Intensional
Programming. Menlo Park, CA.
Herath, J., Yamaguchi, Y., Saito, N., & Yuba, T. (1988). Dataflow Computing Models, Languages, and Machines for Intelligence Computations. IEEE Transactions on Software Engineering, 14(12), 1805–1828. doi:10.1109/32.9065
Johnston, W. M., Paul Hanna, J. R., & Millar, R. J. (2004). Advances in Dataflow Programming Languages. ACM Computing Surveys , 36(1), 1–
34. doi:10.1145/1013208.1013209
Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big Data: Issues and Challenges Moving Forward. In Proceedings of the 46th Hawaii
International Conference on System Sciences. Maui, HI: IEEE. 10.1109/HICSS.2013.645
Katal, A., Wazid, M., & Goudar, R. H. (2013). Big Data: Issues, Challenges, Tools and Good Practices. In Proceedings of the Sixth International Conference on Contemporary Computing (IC3). Noida, India. 10.1109/IC3.2013.6612229
Katina, M., & Miller, K. W. (2013). Big Data: New Opportunities and New Challenges. Computer, 46(6), 22-24.
Khwaja A. A. Urban J. E. (2008a). RealSpec: An Executable Specification Language for Prototyping Concurrent Systems. In Proceedings of the
19th IEEE/IFIP International Symposium on Rapid System Prototyping. Monterey, CA: IEEE. 10.1109/RSP.2008.9
Khwaja A. A. Urban J. E. (2008b). RealSpec: An Executable Specification Language for Modeling Resources. In Proceedings of the 20th
International Conference on Software Engineering and Knowledge Engineering (SEKE 2008). San Francisco Bay, CA.
Khwaja A. A. Urban J. E. (2008c). Timing, Precedence, and Resource Constraints in the RealSpec Real-Time Specification Language. In
Proceedings of the 2008 IASTED International Conference on Software Engineering and Applications. Orlando, FL: IASTED.
Khwaja A. A. Urban J. E. (2009). RealSpec: an Executable Specification Language for Modeling Control Systems. In Proceedings of the 12th IEEE
International Symposium on Object/component/service-oriented Real-time Distributed Computing (ISORC 2009).Tokyo, Japan: IEEE.
10.1109/ISORC.2009.36
Khwaja A. A. Urban J. E. (2010). Preciseness for Predictability with the RealSpec Real-Time Executable Specification Language. In Proceedings
of the 2010 IEEE Aerospace Conference. Big Sky, MT: IEEE. 10.1109/AERO.2010.5446788
Kosinski P. R. (1978). A Straightforward Denotational Semantics for Non-determinate Data Flow Programs. In Proceedings of the 5th ACM
SIGACT-SIGPLAN Symposium on Principles of Programming Languages. Tucson, AZ: ACM. 10.1145/512760.512783
Landin, P. J. (1966). The Next 700 Programming Languages. Communications of the ACM, 9(3), 157–166. doi:10.1145/365230.365257
Lee C. Chen C. Yang X. Zoebir B. Chaisiri S. Lee B.-S. (2013). A Workflow Framework for Big Data Analytics: Event Recognition in a Building. In
Proceedings of the IEEE 9th World Congress on Services. Santa Clara, CA: IEEE.
Osman A. El-Refaey M. Elnaggar A. (2013). Towards Real-Time Analytics in the Cloud. In Proceedings of the 2013 IEEE 9th World Congress on
Services. Santa Clara, CA: IEEE. 10.1109/SERVICES.2013.36
Sousa T. B. (2012). Dataflow Programming Concept, Languages and Applications. In Doctoral Symposium on Informatics Engineering.
Tesler L. G. Enea H. J. (1968). A Language Design for Concurrent Processes. In Proceedings of AFIPS 1968 Spring Joint Computer Conference.
Atlantic City, NJ: AFIPS.
Wadge, W. W., & Ashcroft, E. A. (1985). Lucid – The Dataflow Programming Language . London: Academic Press.
KEY TERMS AND DEFINITIONS
Big Data Analytics: Process of examining large data sets containing a variety of data types for specific patterns, correlations, market trends,
customer preferences and other relevant information.
Dataflow Paradigm: Programming paradigm that models programs as directed graphs of the data flowing between operations.
Declarative Language: A non-procedural language in which a program specifies what needs to be done rather than how to do it.
ABSTRACT
In this chapter, the author proposes a hierarchical security model (HSM) to enhance security assurance for multimedia big data. It provides role
hierarchy management and security roles/rules administration by seamlessly integrating the role-based access control (RBAC) with the object-
oriented concept, spatio-temporal constraints, and multimedia standard MPEG-7. As a result, it can deal with challenging and unique security
requirements in the multimedia big data environment. First, it supports multilayer access control so that different access permissions can be conveniently set for various multimedia elements, such as visual/audio objects or segments in a multimedia data stream, when needed. Second, the spatio-temporal constraints are modeled for access control purposes. Finally, its security processing is efficient enough to handle high data volume and rapid data arrival rates.
INTRODUCTION
Currently, multimedia including image, video, and audio accounts for 60% of internet traffic, 70% of mobile phone traffic, and 70% of all
available unstructured data (Smith, 2013). It is considered "big data" not only because of its huge volume, but also because of its increasingly prominent position as a valuable source of insight and information in applications ranging from business forecasting and healthcare to science and
hi-tech, to name a few. Due to the explosion and heterogeneity of the potential data sources that extend the boundary of data analytics to social
networks, real time streams, and other forms of highly contextual data, security assurance becomes one of the critical areas in this big data
environment, which implies that the data be trustworthy as well as managed in a privacy preserving manner (Bhatti, LaSalle, Bird, Grance, &
Bertino, 2012).
In the literature, many studies have been conducted to address security issues in the big data environment using approaches such as role
management and access control (Choi, Choi, Ko, Oh, & Kim, 2012; Nehme, Lim, & Bertino, 2013) and have achieved promising results. However,
several security requirements caused by the special properties of multimedia big data are not yet well addressed, and access control enforcement (the ability to permit or deny a request to perform an operation) is considered one of the most challenging and important aspects of multimedia big data (Nehme et al., 2013). Some of the main challenges are summarized below:
1. Many existing security models mainly focus on protecting documents at the file level (Bertino et al., 2003; Zhao, Chen, Chen, & Shyu, 2008). However, multimedia data often consists of a huge number of elements with different levels of “sensitivity.” For example, patients' personal information in a medical archive, a vehicle plate number in a car image, or a victim's face in a surveillance video often requires a higher level of security protection than other general elements. Therefore, the security mechanism must be flexible and able to support multilevel access control so that it is convenient to set different access permissions for various visual/audio objects or segments in a multimedia data stream when needed;
2. Besides multimedia contents being multi-level and dynamic, users' access privileges may also change due to their mobility as they try to access data from different places, at different times, and using different devices (Kulkarni & Tripathi, 2008). For example, a doctor may only be allowed to access certain medical images through computers inside the hospital local network when on duty, not from computers at home or when off duty. Therefore, the security mechanism should take the spatio-temporal constraints into consideration and model them in a coherent manner;
3. In a multimedia big data environment, many applications, such as patient monitoring, location-based support systems, etc., use large amounts of real-time data to ensure high-quality services, where efficiency is of ultimate importance besides security assurance (Nehme et al., 2013; Sachan, Emmanuel, & Kankanhalli, 2010). Therefore, the security mechanism should be able to handle high data volume and rapid arrival rates, and security processing should be done on the fly or, more realistically, faster than the incoming data rate.
Though multimedia standards like MPEG-7 offer a comprehensive set of audiovisual description tools to describe multimedia contents, the security requirements are left open, without a mechanism to specify who is allowed to access which (parts of the) multimedia data under which mode (Pan & Zhang, 2008). Therefore, it is essential to support security management of multimedia big data and to design security models
accordingly.
In this paper, a hierarchical security model (HSM) is proposed to address all these challenges. It provides role hierarchy management and
security roles/rules administration, which extends the traditional role-based access control (RBAC) by adopting the object-oriented concept and spatio-temporal constraints, as well as by taking into consideration and making full use of the multimedia standard MPEG-7.
The rest of the paper is organized as follows. First, a literature review is conducted on security access control models for multimedia big data and the MPEG-7 properties that can be used to extend RBAC. Then the proposed hierarchical security model is presented, followed by a performance evaluation using two example scenarios. Finally, the paper is concluded.
RELATED WORK
Role-based access control (RBAC) is widely accepted as an alternative to traditional discretionary and mandatory access controls (Chen & Crampton, 2008) and is one of the most widely used access control models for restricting system access to authorized users (Strembeck & Neumann, 2004; Li & Tripunitara, 2006). RBAC is a versatile model that conforms closely to the organizational model used in corporations. Its fundamental feature is to separate users and access permissions: access permissions are assigned to roles, and roles are in turn assigned to users. With such separation, the administration of individual users' access becomes easier, since it amounts to simply assigning appropriate roles to users (Zhao et al., 2008). However, RBAC fails to support dynamic access control or temporal/spatial constraints because it does not include context-aware and spatially aware elements (Bouna & Chbeir, 2008; Choi et al., 2012). It also lacks a hierarchical structure to model multimedia contents, so access control is limited to the file level (Chen, Shyu, & Zhao, 2004; Zhao et al., 2008).
Thereafter, numerous extended RBAC models have emerged to address some of these unresolved security issues. The generalized temporal role-
based access control (GTRBAC) (Joshi, Bertino, Latif, & Ghafoor, 2005) and its XML-based version X-GTRBAC (Bhatti, Ghafoor, Bertino, &
Joshi, 2005) incorporate the content and context aware dynamic access control requirements. However, they only handle multimedia data at file-
level without taking care of the contents inside (Zhao et al., 2008). The context-aware role based access control (C-RBAC) model with multi-
grained constraints was proposed in (Zou, He, Jin, & Chen, 2009) but it fails to support periodic role enabling and disabling and temporal
dependencies among permissions, nor does it support spatial awareness (Nehme et al., 2013). On the other hand, Geo-RBAC (Damiani, Bertino,
Catania, & Perlasca, 2007) is an RBAC with spatial awareness but it lacks support to other essential requirements. More recently, ontology-based
access control model (Onto-ACM) proposed in Choi, Choi, and Kim (2013) enhanced RBAC and C-RBAC by incorporating identity context,
physical context, preference context, behavioral pattern context, and resource context to support dynamic access control with temporal/spatial
constraints. However, it lacks a hierarchical structure to model objects in multimedia contents or model roles in the big data environment.
In our earlier work, a multi-role based access control (MRBAC) model (Zhao et al., 2008) was proposed that not only enables complicated roles
and role hierarchies, but also supports temporal constraints and IP address restrictions. It performs well for traditional multimedia database
systems. However, it requires a multimedia indexing phase that processes and segments multimedia source data before building the hierarchical
structure of multimedia data and performing multi-level security control. This process is not efficient enough to handle multimedia big data. To
address this issue, we adopt and extend the criterion-based multilayer access control (CBMAC) approach proposed in Pan and Zhang (2008) that
enables multi-level multimedia content modeling by using the properties of MPEG-7.
MPEG-7 standard descriptions (Kosch 2004; Manjunath, Salembier, & Sikora, 2002) rely on three main components: descriptors (D),
description schemes (DS), and description definition language (DDL). Here, descriptors define the syntax and the semantics of the feature
representation. Description schemes specify the structure and semantics of the relationships between MPEG-7 components which can be Ds or
DSs. Description definition language is based on XML schema and is used to define the syntax of MPEG-7 description tools (Ds and DSs) to allow
the creation and modification of DSs and creation of new Ds. Two MPEG-7 properties are used in Pan and Zhang (2008) to describe and manage
the multimedia content efficiently. First, the DSs in MPEG-7 can be extended so more kinds and levels of multimedia entities can be described.
Second, the segment description (semantic description) in MPEG-7 can be decomposed further. After decomposition, different security information can be embedded into different description parts, each of which has a multimedia content locator pointing to the corresponding multimedia entity. In HSM, MRBAC will be revised and extended to fully utilize these two properties.
THE PROPOSED HIERARCHICAL SECURITY MODEL
The proposed hierarchical security model (HSM) not only enables hierarchical structure for multi-level security control on multimedia data and
hierarchical architecture for complicated roles and role management, but also supports temporal constraints and IP address restrictions.
According to Zhao et al. (2008) and Adam, Atluri, Bertino, and Ferrari (2002), a security access control criterion in the RBAC model is defined as follows:
Definition 1: A security access control criterion CC = (R, O, A), where R is a user role, O is an object, and A is an access model.
In brief, the security access control criterion defines whether R is allowed or disallowed to perform operation A on O. Here, R can be a role that is represented by a job function or title that defines an authority level; O can be an object identifier or a content expression specifying an object or a group of objects; and A represents allowed operations such as read and write (Pan & Zhang, 2008). In HSM, the parameter A is defined in the
same way as in the traditional RBAC models. To ensure multi-layer security assurance, the idea is to have the multimedia data O modeled by a hierarchical structure in which security criteria can be embedded seamlessly into objects at different levels. In addition, considering that in most multimedia applications a request behavior can be modeled by a 4-tuple <who, what, when, where>, i.e., some user requests some data at some time and some place, the role R is extended to model a role with spatio-temporal information. Furthermore, to ensure processing efficiency, HSM extends the CBMAC approach and takes full advantage of the MPEG-7 standard. The details are discussed in the next two subsections.
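Before turning to the object and role structures, the criterion and the request shape can be summarized in a small sketch. The Python fragment below is only an illustration of the tuples introduced above; the field names and types are assumptions, not part of the HSM definition.

from dataclasses import dataclass
from typing import Optional

# Sketch of the security access control criterion CC = (R, O, A).
@dataclass
class Criterion:
    role: str    # R: a user role, possibly extended with spatio-temporal information
    obj: str     # O: an object identifier or content expression
    access: str  # A: the access model, e.g. "read" or "write"

# Sketch of the <who, what, when, where> request behavior.
@dataclass
class Request:
    who: str                     # requesting role
    what: str                    # requested object
    when: Optional[str] = None   # temporal context, e.g. "on duty"
    where: Optional[str] = None  # spatial context, e.g. the requesting IP address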
Multimedia Data Objects
The definition of the parameter O in RBAC fits well with the hierarchical structure of multimedia data, i.e., O can be an object or a group of subobject(s). For example, as an object itself, an image can contain multiple regions (each a subobject), a video can contain multiple scenes or shots, and a video shot can contain salient moving objects, etc. The metadata (descriptions) created by using MPEG-7 tools have a one-to-one relationship with
the corresponding multimedia entities (objects), which consists of their accessing information ao (e.g., TimePoint, Duration, Quality, etc.) and a
multimedia locator lo pointing to the corresponding object. In addition, descriptions are organized in a tree structure where each node
corresponds to a piece of multimedia data (subobject) that is composed of its direct children. Therefore, the parameter O in the security access
control criterion can be formally modeled below:
Definition 2: A multimedia object O = (ao, lo, po), where ao is object O's accessing information, lo the locator to O, and po an array of pointers pointing to O's direct children nodes, where a NULL pointer is used to represent a missing child.
Assume a multimedia data stream application similar to Pan and Zhang (2008) that contains special and complicated medical cases referenced
by doctors, nurses, patients, and their family members. Each multimedia data stream includes both professional and nonprofessional
information in a heterogeneous form (e.g., text, image, audio, video, etc.) and is described by MPEG-7 Description Tools. To simplify the
discussion, only the object name and its child pointers po are used to represent the corresponding multimedia data and their elements in the
example. The corresponding object structure is shown in Figure 1.
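A minimal Python sketch of Definition 2 is given below. The node names and locators are hypothetical and only loosely echo the medical-stream example; the exact structure of Figure 1 is not reproduced.

from dataclasses import dataclass, field
from typing import List, Optional

# Sketch of a multimedia object O = (ao, lo, po); the object name stands in
# for the full accessing information ao.
@dataclass
class MMObject:
    name: str                                  # simplified accessing information (ao)
    locator: Optional[str] = None              # lo: locator pointing to the media entity
    children: List[Optional["MMObject"]] = field(default_factory=list)  # po

# Hypothetical fragment of a medical data stream.
financial = MMObject("Financial information", "mpeg7://stream/personal/financial")
personal = MMObject("Personal information", children=[financial])
professional = MMObject("Professional information", "mpeg7://stream/professional")
stream = MMObject("Medical data stream", children=[professional, personal])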
Security Criterion Expressions (SCE)
With the hierarchical modeling of the multimedia data, the next step is to embed a proper security criterion expression (i.e., Boolean expression)
to each node. Assume the following security criteria are imposed on the example multimedia data stream shown in Figure 1. For simplicity, no
spatio-temporal restrictions are considered at this stage:
C1: Professional information is not accessible to nonprofessional people;
C2: Patient's diagnosis contents are not accessible to nurses and non-clinic doctors;
C3: Doctors, nurses, and family members are not allowed to access patients' financial information.
To assign the corresponding security criteria to multimedia objects, we adopt and extend two principles: open policy (Al-Kahtani & Sandhu, 2004) and minimal scope. Open policy means access is denied if there exists a corresponding negative authorization and allowed otherwise. The minimal scope concept is to limit the scope of variables as narrowly as possible, which has been widely used and well accepted in the software engineering area (Remy & Vouillon, 1997). In other words, we extract negative authorizations (also called security criterion expressions) from the security criteria and assign them only to the objects at the levels that require protection. So the first step is to extract the roles that are restricted by the security criteria. Similar to Pan and Zhang (2008), a secure role table is built (see Table 1).
Table 1. Secure role table

Role ID | Role Description | Related Security Criteria
R1 | Nonprofessional people | C1
R2 | Doctors | C2, C3
R3 | People not working in clinic office | C2
R4 | Nurses | C2, C3
R5 | Family members | C3
Note that the non-clinic doctors in criterion C2 lead to two secure roles, R2 (Doctors) and R3 (People not working in clinic office), because we adopt the basic atomic rule, i.e., each secure role must be indivisible. The reason for this restriction is to avoid overlap of secure roles, so as to minimize complexity and increase clarity. Correspondingly, a security criterion expression table can be built as in Table 2.
Table 2. Security criterion expression table

Object | Security Criterion Expression | Criterion
Professional information | R1 | C1
Financial information | R2 ∪ R4 ∪ R5 | C3
Accordingly, Figure 2 shows the example multimedia object structure with the security criterion expressions. As we can see, the negative authorization R2 ∪ R4 ∪ R5 is assigned to Financial information because Doctors, Nurses, and Family members cannot access a patient's Financial information. However, following the minimal scope principle, it is not assigned to Personal information even though Financial information is part of Personal information. This differs from the algorithm in Pan and Zhang (2008), where all the security criteria are propagated upwards until the root node (in this case, the root node's negative authorization would be R1 ∪ R2 ∪ R4 ∪ R5 ∪ (R2 ∩ R3)). Obviously, our
algorithm is more efficient to build and easier to check. The structure is capable of handling any possible invoking, revoking, or revising of security rules because all the changes are restricted to the smallest scope possible. In brief, when a security rule is introduced, deleted, or revised for existing nodes in the multimedia object tree, the process is straightforward: update the secure role table and the security criterion expression table shown in Tables 1 and 2. One more step is involved, though, if new security rules are introduced to protect sub-objects of the current leaf node(s). For example, assume Treatment information contains Patient's identity and other treatment information, and a new rule C4 is introduced so that non-clinic Nurses cannot access Patient's identity. In this case, the corresponding leaf node (i.e., Treatment) needs to be decomposed first before this new security criterion can be added to the newly created leaf node (i.e., Patient's identity). As discussed in the Related Work section, MPEG-7 supports this kind of decomposition.
Figure 2. An example multimedia object structure with
negative authorizations
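The tables above can also be captured programmatically. The following Python fragment is only a sketch under the open-policy reading of the example: the role identifiers follow Tables 1 and 2, compound expressions such as R2 ∩ R3 for the diagnosis contents would need a richer boolean representation than the flat sets used here, and the is_denied helper is hypothetical.

# Secure role table (cf. Table 1).
SECURE_ROLES = {
    "R1": "Nonprofessional people",
    "R2": "Doctors",
    "R3": "People not working in clinic office",
    "R4": "Nurses",
    "R5": "Family members",
}

# Security criterion expressions keyed by the lowest node requiring protection
# (minimal scope); nothing is propagated upwards to parent nodes.
SCE_TABLE = {
    "Professional information": {"R1"},           # criterion C1
    "Financial information": {"R2", "R4", "R5"},  # criterion C3
}

def is_denied(role_id, object_name):
    """Open policy: deny only if a matching negative authorization exists."""
    return role_id in SCE_TABLE.get(object_name, set())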
During run time, the authentication and access control processing in the security system can simply be transferred into a test of whether a particular role Rx is restricted by the security criterion expression. For instance, if Rx is restricted by R2 ∪ R4 ∪ R5 in the example, Rx is denied access to Financial information. However, two questions have to be answered before performing such a test. First, the roles discussed so far do not contain spatio-temporal constraints; how can such constraints be modeled in HSM? Second, how can the role hierarchy be established so that the proper level of restrictions can be verified?
Hierarchical Roles with Spatio-Temporal Constraints
To model roles with spatio-temporal constraints, our previous work MRBAC (Zhao et al., 2008) defines two additional roles besides the subject role: 1) spatial roles, which are described using sets of IP addresses to restrict unauthorized accesses from alien computers; and 2) temporal roles, which are defined as groups of effective time periods to control access functionality over time. Studies have demonstrated its effectiveness in modeling security with spatio-temporal constraints. However, MRBAC becomes rather complex and somewhat confusing by having multiple roles. In addition, it is difficult to associate multiple roles with the hierarchical multimedia object structure in a coherent and efficient manner. In this study, we extend MRBAC and the traditional definition of role R by introducing the extended role concept:
Definition 3: An extended role RE = (R, <t>, <s>), where R is the traditionally defined user role, t and s denote temporal and spatial information, respectively, and the angle brackets indicate that they are optional.
For instance, we may define an RE as (Doctor, On duty, Inside hospital), (Doctor, Off duty), or (Doctor), etc. Similar to our previous work (Zhao et al., 2008), the parameter t currently defines time periods or intervals/durations to specify access over time, while s contains sets of IP addresses to control access from computers. However, in the future s may be extended to include other spatial conditions, such as ones based on geo-locations. In addition, t and s may be associated with R to form a positive role or a negative role, i.e., role R under the t and s conditions is allowed or not allowed to access O using the access model A. Correspondingly, an administrator may initially disable a user from accessing anything by default (e.g., the user “Smith” in the “Receptionist” group in Table 3) and then grant positive roles to allow this user's access to some objects. Similarly, an administrator may initially grant all access abilities to a user (e.g., the user “Bailey” in the “Doctor” group in Table 3) and then assign negative roles that deny this user's access.
Table 3. Example subject roles
In HSM, the latter approach with negative roles is adopted for two main reasons. First, by convention, security criteria or authorization rules normally address the access restrictions on data, e.g., a patient's diagnosis and treatment contents are not accessible to receptionists, doctors are not allowed to access patients' financial information, etc. Second, it matches the open policy (Al-Kahtani & Sandhu, 2004) and the security criterion expressions discussed earlier, and it enables early termination, that is, further evaluation of the child nodes is avoided when the evaluation of the security criterion is true for their parent node.
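An extended role and a simple match against it can be sketched as follows. Time is reduced to an hour interval and space to an IP prefix purely for illustration; whether a match means grant or deny depends on whether the extended role was assigned as a positive or a negative role, as described above.

from dataclasses import dataclass
from typing import Optional, Tuple

# Sketch of Definition 3: an extended role RE = (R, <t>, <s>).
@dataclass(frozen=True)
class ExtendedRole:
    role: str                                # R, e.g. "Doctor"
    hours: Optional[Tuple[int, int]] = None  # <t>, e.g. (8, 18) for "on duty"
    ip_prefix: Optional[str] = None          # <s>, e.g. "131.94." for the hospital LAN

def applies(re, role, hour, ip):
    """Check whether an extended role's conditions hold for a request."""
    if re.role != role:
        return False
    if re.hours is not None and not (re.hours[0] <= hour < re.hours[1]):
        return False
    if re.ip_prefix is not None and not ip.startswith(re.ip_prefix):
        return False
    return True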
To check whether a role is restricted by a certain security criterion expression, the role hierarchy is a very important concept, since the system needs to make access permission decisions based on the position of a role in the whole hierarchy. As mentioned above, the “role” concept is extended in our proposed security access control framework to describe a role with spatio-temporal constraints, which can be modeled and managed by revising and extending the hybrid hierarchy introduced in Joshi, Bertino, Ghafoor, and Zhang (2008). The hybrid hierarchy in Joshi et al. (2008) was designed to facilitate specifications of fine-grained RBAC policies in which three hierarchical relations among roles can co-exist: the I-hierarchy (permission inheritance only), the A-hierarchy (role activation only), and the IA-hierarchy (both permission inheritance and role activation).
By contrast, in HSM, the concepts of negative roles and authorizations are used, where the security criterion expression is essentially a set of negative roles. Therefore, instead of defining the permission inheritance relationship, HSM adopts the restriction inheritance concept, and the hierarchical relationships are revised as follows:
1. R-hierarchy (≤r): restriction inheritance only;
2. A-hierarchy (≤a): negative role activation only;
3. IA-hierarchy (≤): both restriction inheritance and role activation.
As discussed earlier, the authentication and access control processing in the security system can be transferred into a test of whether a particular role is restricted by the security criterion expression. By applying these three hybrid role hierarchy definitions (see Table 4) to manage the roles, this problem is further converted into a comparison of roles in the hierarchy.
Table 4. Semantic meaning of hybrid user role hierarchies in HSM
Symbol Descriptions
Security Verification
Given the hierarchical structure of multimedia data objects with embedded security criterion expressions, the access to the data stream will be
granted or denied based on the role of the requested users. Formally, the security verification is performed on the “Object Entity Set” (OES) of the
request object O which is defined below:
Definition 4: Object Entity Set: OES(O) = {O} ∪ {s : s ∈ O}. Here O is the object and s ranges over all its subordinates in the hierarchical object tree.
Table 5 depicts the security verification algorithm. In steps 2 and 3, the user's identity is verified, and no data is returned if the user is not valid. Steps 5 to 8 retrieve the user's assigned role and return no data if the user has no valid role. In step 9, security_process(u_role, object) is a recursive function that traverses the tree structure in depth-first order while the security criterion expressions (SCE) are evaluated to check whether the user can access OES(object) at the specified time from the specified computer. The algorithm is shown in Table 6.
Table 5. Security verification algorithm
Table 6. Evaluate SCE and return data according to user role
As can be seen in Table 6, steps 2 and 3 check whether the role of the particular user has access to the requested object based on the SCE table (an example is shown in Table 2). If it has access and a leaf node is reached, the content is returned (steps 4 and 5). Otherwise, if the node contains child nodes, the security_process function is called recursively (steps 6 and 7). In summary, three kinds of results can be formalized as follows (a sketch of this traversal follows the list):
1. If the role is related by restriction inheritance to the security criterion (i.e., the negative authorization discussed earlier) of the requested multimedia data, access is denied;
2. A user can access the complete multimedia data object iff the role is not related by restriction inheritance to any security criterion of OES(object);
3. A user can access part of the multimedia data, with the prohibited sub-objects removed from the object; here the prohibited sub-objects are the ones whose security criterion expressions are restriction-inherited by the user's role.
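Since Tables 5 and 6 themselves are not reproduced in the text, the following Python fragment sketches the traversal they describe. The node fields (content, sce, children) and the is_restricted callback, which stands in for the role-hierarchy comparison of restriction inheritance, are assumptions made for the example.

def security_process(u_role, node, is_restricted, results=None):
    """Depth-first traversal returning the contents the role may access."""
    if results is None:
        results = []
    # Early termination: if the role is restricted at this node, none of its
    # subordinates in OES(node) needs to be evaluated.
    if is_restricted(u_role, node.sce):
        return results
    if not node.children:            # leaf node reached: return its content
        results.append(node.content)
    for child in node.children:      # otherwise recurse into the child nodes
        if child is not None:
            security_process(u_role, child, is_restricted, results)
    return results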
SECURITY MODEL APPLICATION AND EVALUATION
In this section, two examples in a multimedia big data environment are presented to show that the proposed HSM model is more practical and effective in the security modeling of complicated scenarios. Note that other RBAC models cannot function well in these situations:
Scenario 1: Assume that in the medical data stream application shown in Figure 1, there is a security rule stating that the patient's recovery contents are only accessible by the user role Doctor through computers inside the hospital local network, for example, computers whose IP addresses match "131.94.*.*". Is it then possible to access such contents using a doctor's home computer with IP address "131.95.12.32"?
The traditional RBAC models and many of their extended versions do not check the accessing computers. However, confirming the secure access point by using IP addresses is a useful function, and it is supported in our proposed HSM model. In addition, our model can also support the checking of temporal constraints:
Scenario 2: Again in the medical data stream application, assume user role REy can only access the Personal information (denoted as PI). There is another user role REx where REx ≤r REy. What is the accessibility of REx to the Professional contents (denoted as PC)?
In this scenario, the traditional and extended RBAC models can hardly describe the situation because: 1) no multi-level access control mechanism is defined to process multimedia data; 2) the role hierarchy is not defined; or 3) no security criterion expressions are embedded in the hierarchical object structure. In our proposed HSM model, the scenario can be defined as follows:
1. Since REy can only access the Personal information, REy is restricted by the security criterion expression of the Professional contents, i.e., REy ≤r SCE(PC);
2. In addition, we have REx ≤r REy. Based on the transitive property, we get REx ≤r SCE(PC). Thus REx cannot access the Professional contents.
CONCLUSION
This paper presents a hierarchical security model (HSM) to ensure multilevel security access control in the multimedia big data environment. In
the proposed approach, a hierarchical object structure is used to model the multimedia data with the assistance of MPEG-7 description tools. The
set of security criterion expressions is extracted from the system security policy and embedded into the multi-level object structure. The model also incorporates spatial and temporal constraints by introducing the extended role concept. By introducing the restriction inheritance concept, three types of role hierarchical relationships are defined. Correspondingly, the verification of security access is made efficient by converting it into the problem of making access permission decisions based on the position of a role in the whole hierarchy. In summary, HSM can 1) provide multi-
level security protection for multimedia data; 2) support access control with spatio-temporal constraints; 3) enable efficient access control
management.
This work was previously published in the International Journal of Multimedia Data Engineering and Management (IJMDEM), 5(1); edited by Shu-Ching Chen, pages 1-13, copyright year 2014 by IGI Publishing (an imprint of IGI Global).
REFERENCES
Adam, N., Atluri, V., Bertino, E., & Ferrari, E. (2002). A content-based authorization model for digital libraries. IEEE Transactions on Knowledge
and Data Engineering , 14(2), 296–315. doi:10.1109/69.991718
Al-Kahtani, M. A., & Sandhu, R. (2004). Rule-based RBAC with negative authorization. In Proceedings of the 20th ACM Annual Computer
Security Applications Conference (pp. 405-415).
Bertino, E., Fan, J., Ferrari, E., Hacid, M.-S., Elmagarmid, A. K., & Zhu, X. (2003). A hierarchical access control model for video database
systems. Transactions on Information Systems , 21(2), 155–191. doi:10.1145/763693.763695
Bhatti, R., Ghafoor, A., Bertino, E., & Joshi, J. B. D. (2005). X-GTRBAC: An XML-based policy specification framework and architecture for
enterprise-wide access control. ACM Transactions on Information and System Security , 8(2), 187–227. doi:10.1145/1065545.1065547
Bhatti, R., LaSalle, R., Bird, R., Grance, T., & Bertino, E. (2012). Emerging trends around big data analytics and security. In Proceedings of the
17th ACM Symposium on Access Control Models and Technologies (pp. 67-68).
Bouna, B. A., & Chbeir, R. (2008). MCSE: A multimedia context-based security engine. In Proceedings of the 11th International Conference on
Extending Database Technology: Advances in Database Technology (pp. 705-709).
Chen, L., & Crampton, J. (2008). On spatio-temporal constraints and inheritance in role-based access control. In Proceedings of the 2008 ACM
Symposium on Information, Computer and Communications Security (pp. 205-216).
Chen, S.-C., Shyu, M.-L., & Zhao, N. (2004) SMARXO: Towards secured multimedia applications by adopting RBAC, XML and object-relational
database. In Proceedings of the 12th Annual ACM International Conference on Multimedia (pp. 432-435).
Choi, C., Choi, J., & Kim, P. (2013). Ontology-based access control model for security policy reasoning in cloud computing. The Journal of
Supercomputing , 1–12.
Choi, C., Choi, J., Ko, B., Oh, K., & Kim, P. (2012). A design of onto-ACM (ontology based access control model) in cloud computing
environments. Journal of Internet Services and Information Security , 2(3/4), 54–64.
Damiani, M., Bertino, E., Catania, B., & Perlasca, P. (2007) GEO-RBAC: A spatial aware RBAC. ACM Transactions on Information and System
Security, 10(1), article 2.
Joshi, J. B. D., Bertino, E., Ghafoor, A., & Zhang, Y. (2008). Formal foundations for hybrid role hierarchy in GTRBAC. ACM Transaction on
Information and Systems Security, 10(4), article 2.
Joshi, J. B. D., Bertino, E., Latif, U., & Ghafoor, A. (2005). Generalized temporal role based access control model. IEEE Transactions on
Knowledge and Data Engineering , 7(1), 4–23. doi:10.1109/TKDE.2005.1
Kosch, H. (2004). Distributed multimedia database technologies supported by MPEG-7 and MPEG-21. CRC Press.
Kulkarni, D., & Tripathi, A. (2008). Context-aware role-based access control in pervasive computing systems. In Proceedings of the 13th ACM
Symposium on Access Control Models and Technologies (pp. 113-122).
Li, N., & Tripunitara, M. V. (2006). Security analysis in role-based access control. ACM Transactions on Information and System Security , 9(4),
391–420. doi:10.1145/1187441.1187442
Manjunath, B. S., Salembier, P., & Sikora, T. (2002). Introduction to MPEG-7 multimedia content description interface . John Wiley & Sons, Ltd.
Nehme, R. V., Lim, H.-S., & Bertino, E. (2013). FENCE: Continuous access control enforcement in dynamic data stream environments.
In Proceedings of the Third ACM Conference on Data and Application Security and Privacy (pp. 243-254).
Pan, L., & Zhang, C. N. (2008). A criterion-based multilayer access control approach for multimedia applications and the implementation
considerations. ACM Transactions on Multimedia Computing, Communications, and Applications , 5(2), 1–29. doi:10.1145/1413862.1413870
Remy, D., & Vouillon, J. (1997). Objective ML: A simple object-oriented extension of ML. In Proceedings of the 24th ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages (pp. 40-53).
Sachan, A., Emmanuel, S., & Kankanhalli, M. S. (2010). An efficient access control method for multimedia social networks. In Proceedings of the Second ACM SIGMM Workshop on Social Media (pp. 33-38).
Strembeck, M., & Neumann, G. (2004). An integrated approach to engineer and enforce context constraints in RBAC environments. ACM Transactions on Information and System Security, 7(3), 392–427. doi:10.1145/1015040.1015043
Zhao, N., Chen, M., Chen, S.-C., & Shyu, M.-L. (2008). MRBAC: Hierarchical role management and security access control for distributed multimedia systems. In Proceedings of the IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed
Computing (pp. 76-82).
Zou, D., He, L., Jin, H., & Chen, X. (2009). CRBAC: Imposing multi-grained constraints on the RBAC model in the multi-application
environment. Journal of Network and Computer Applications , 32(2), 402–411. doi:10.1016/j.jnca.2008.02.015
CHAPTER 23
Big Data Warehouse Automatic Design Methodology
Francesco Di Tria
Università degli Studi di Bari Aldo Moro, Italy
Ezio Lefons
Università degli Studi di Bari Aldo Moro, Italy
Filippo Tangorra
Università degli Studi di Bari Aldo Moro, Italy
ABSTRACT
Traditional data warehouse design methodologies are based on two opposite approaches. One is data-oriented and aims to realize the data warehouse mainly through a reengineering process of the well-structured data sources alone, while minimizing the involvement of end users. The other is requirement-oriented and aims to realize the data warehouse only on the basis of the business goals expressed by end users, with no regard to the information obtainable from the data sources. Since these approaches are not able to address the problems that arise when dealing with big
data, the necessity to adopt hybrid methodologies, which allow the definition of multidimensional schemas by considering user requirements and
reconciling them against non-structured data sources, has emerged. As a counterpart, hybrid methodologies may require a more complex design
process. For this reason, the current research is devoted to introducing automatisms in order to reduce the design efforts and to support the
designer in the big data warehouse creation. In this chapter, the authors present a methodology based on a hybrid approach that adopts a graph-
based multidimensional model. In order to automate the whole design process, the methodology has been implemented using logical
programming.
1. INTRODUCTION
Big data warehousing commonly refers to the activity of collecting, integrating, and storing very large volumes of data coming from data sources that may contain both structured and unstructured data. However, volume alone does not imply big data; further, specific issues are related to the velocity at which data are generated, and to their variety and complexity.
The increasing volume of data stored in data warehouses is mainly due to their nature of preserving historical data for performing statistical analyses and extracting significant information, hidden relationships, and regular patterns from data. Other factors that affect the size growth derive from the necessity of integrating several data sources, each of which provides a different variety of data that contributes to enriching the types of analyses by correlating a large set of parameters. Furthermore, some data sources, such as Internet transactions, networked devices, and sensors, generate billions of data items very quickly. These data should update the data warehouse as soon as possible, in order to gain fresh information and make timely decisions (Helfert & Von Maur, 2001).
These issues affect the design process, because big data warehouses must integrate heterogeneous data to be used in analyses that consider many points of view, and they must produce complex schemas having cubes with a high number of dimensions. Furthermore, they must be capable of quickly integrating new data sources through a minimal data modelling process.
To summarize, new aspects of data warehouses supporting analyses of big data have been stated in Cohen et al. (2009). Big data warehouses have to be (i) magnetic, in that they must attract all the data sources available in an organization; (ii) agile, in that they should support continuous and rapid evolution; and (iii) deep, in that they must support analyses more sophisticated than traditional OLAP functions.
1.1. Background Approaches to Automatic Design
In the mentioned scenario, traditional design methodologies, which are based on two opposite approaches—data-oriented and requirement-
oriented— (Romero & Abelló, 2009), are not able to solve problems when facing big data.
In fact, methodologies adopting a data-oriented approach are devoted to defining multidimensional schemas on the basis of a remodelling of the data sources. These data must be strongly structured, since functional dependencies are taken into account in the remodelling phase (dell'Aquila et al., 2009). Thus, these methodologies are not able to create a multidimensional schema from non-structured data sources. Furthermore, in the presence of a high number of data sources, the process of solving semantic and syntactical inconsistencies among the different databases can be a very hard task without using an ontological approach. This reengineering process is individually executed by the designer, who minimizes the involvement of end users and, consequently, risks failing to meet their expectations. In the worst case, the data warehouse is completely useless and the design process must be revised.
On the opposite side, methodologies adopting a requirement-oriented approach define multidimensional schemas using business goals resulting from the decision makers' needs. The data sources are considered later, when the Extraction, Transformation, and Loading (ETL) phase is addressed. In the feeding plan, concepts of the complex multidimensional schema (such as facts, dimensions, and measures) have to be mapped onto the data sources in order to define the procedures that populate the data warehouse with cleaned data. At this point, the definition of these procedures can be very difficult and, in the worst case, the designer may discover that the needed data are not currently available at the sources. On the other hand, some data sources containing interesting information, albeit available, may have been omitted or not exploited.
However, each of these two approaches has valuable advantages. Thus, the necessity has emerged to adopt hybrid methodologies that take into account their best features (Di Tria et al., 2012; Di Tria et al., 2011; Mazón & Trujillo, 2009; Mazón et al., 2007; Giorgini et al., 2008; Bonifati et al., 2001). As a counterpart, hybrid methodologies are more complex because they need to integrate and reconcile both the requirement-oriented and the data-oriented approaches.
Nonetheless, the advantages of adopting hybrid methodologies justify the higher effort to be spent in the multidimensional design. For these reasons, current research is devoted to introducing automatisms able to reduce the design effort and to support the designer in the data warehouse design (Romero & Abelló, 2010a; Phipps & Davis, 2002). On the basis of automatic methodologies, we may be able to include and integrate new data sources on the fly.
The emergent method to manage such integration is based on the ontological approach (Jiang et al., 2011; Thenmozhi & Vivekanandan, 2012).
Indeed, ontologies represent a common and reusable base to compare and align data sources in a fast manner.
1.2. On the Content of the Paper and Its Organization
The contribution of the paper is a hybrid methodology for big data warehouse design, whose steps are completely automatic. It is based on our previous methodology devoted to hybrid multidimensional modelling. Here, we extend the previous work and present the complete methodology. First, the integration of different data sources, both structured and non-structured, is based on an ontological approach. Then, the conceptual design is performed by formal rules that modify an integrated schema according to users' requirements. At last, the conceptual schema is validated against a workload before being transformed into a logical schema.
The paper is organized as follows. Section 2 reports related papers on the use of ontologies in multidimensional design. Section 3 presents an overview of our methodology, focusing on the underlying multidimensional model; a detailed description of the steps of the methodology then follows. Section 4 shows how the users' requirements are investigated and represented. Section 5 describes the ontological approach for data source integration. Section 6 contains the description of the conceptual design. The case study illustrated in Section 7 shows a real example of data source integration and conceptual design. Finally, Section 8 outlines future research directions and Section 9 concludes the paper with some remarks.
2. RELATED WORK
At present, a great deal of effort is devoted to designing data warehouses automatically. Many steps can be done using algorithms and inference rules. Nonetheless, the process of integrating data sources is far from being solved, due to the semantic heterogeneity of the components involved.
In fact, to solve semantic inconsistencies among conceptual schemas, techniques derived from artificial intelligence must be used (Chen, 2001).
So, the current trend in data warehouse design relies on the ontological approach, which is widely used in the semantic web (Sure et al., 2002).
An important work is described in (Hakimpour & Geppert, 2002). The authors’ approach is based on local ontologies for designing and
implementing a single data source, inherent to a specific domain. Next, the data warehouse design process aims to create a global ontology
coming from the integration of the local ontologies. Finally, the global ontology is used along with the logical schemas of the data sources to
produce an integrated and reconciled schema, by mapping each local concept to a global ontological concept automatically. However, the
integration of the local ontologies must be manually done. Moreover, the global ontology needs to be modified each time a data source must be
integrated. On the contrary, as we shall see, in our approach the ontology is pre-existing and never changes.
The work of Romero and Abelló (2010b) is also based on an ontological approach but it skips the integration process and directly considers the
generation of a multidimensional schema starting from a common ontology, namely Cyc. In this case, the approach is semi-automatic since it
requires a user validation to solve inconsistencies.
In Bakhtouchi et al. (2011), the authors propose a methodology to integrate data sources using a common ontology enriched with a set of functional dependencies. These constraints support the designer in the choice of primary keys for dimension tables and allow the integration of similar concepts using common candidate keys. To show this, the authors consider relations R1(id, name, address, telephone), R2(id, name, telephone), and R3(telephone, name, address), and dependencies id → name, id → address, id → telephone, telephone → name, and telephone → address. Then, the integration of the name and address data is possible using telephone, which represents the common candidate key.
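This integration step can be pictured with a small sketch in which tuples from the three relations are merged on the shared candidate key. The tuples and the integrate helper below are illustrative and do not come from Bakhtouchi et al. (2011).

# Toy instances of R1(id, name, address, telephone), R2(id, name, telephone),
# and R3(telephone, name, address).
r1 = [{"id": 1, "name": "Ann", "address": "Elm St", "telephone": "555-0101"}]
r2 = [{"id": 7, "name": "Ann", "telephone": "555-0101"}]
r3 = [{"telephone": "555-0101", "name": "Ann", "address": "Elm St"}]

def integrate(*relations):
    """Merge tuples that agree on the common candidate key 'telephone'."""
    merged = {}
    for relation in relations:
        for tup in relation:
            merged.setdefault(tup["telephone"], {}).update(
                {k: v for k, v in tup.items() if k != "id"}  # drop source-local ids
            )
    return merged

print(integrate(r1, r2, r3))
# {'555-0101': {'name': 'Ann', 'address': 'Elm St', 'telephone': '555-0101'}}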
An interesting proposal to automatically reconcile user requirements and data sources in early stages using ontologies is presented in
(Thenmozhi & Vivekanandan, 2012). First, each source is converted into OWL format. Then, a global ontology is obtained by mapping and
integrating local concepts. On the other hand, user requirements, represented as information requirements in normal text, are converted from
natural language to a logical format. At this point, concepts of interest for analysis are discovered, by matching information requirements against
concepts in the global ontology and extracting those with high similarity values. Finally, the discovered concepts are tagged as multidimensional
elements using reasoning.
This last proposal is very close to our approach, since we share the opinion that converting sentences from natural language to a logical format,
such as clauses of predicate calculus, can be useful for matching concepts on the basis of similarity metrics.
To summarize, we use ontology because this allows us to integrate many different data sources by detecting the similarity of concepts
automatically.
3. METHODOLOGY
Here, we present a hybrid methodology for data warehouse design. The core is a multidimensional model providing a graph-based representation
of a data source. In this way, the traditional operations performed in the reengineering process (such as adding and removing attributes, or
modifying functional dependencies) correspond to basic operations on graphs (such as adding and removing nodes). Nevertheless, this
remodelling activity relies on a set of constraints that have to be derived from the requirement analysis, avoiding the oversight of business needs.
Using these constraints, it is possible to perform the remodelling activity in a supervised and automatic way. Moreover, the conceptual schema is
automatically validated against a preliminary workload before proceeding with further design phases. Also the workload is obtained from the
requirements analysis. All the automatic phases of the methodology have been implemented using logical programming, in order to define a system able to support the designer by producing schemas.
The data warehouse design methodology we propose is depicted in Figure 1 and is based on the GrHyMM (Graph-based Hybrid Multidimensional Model) (Di Tria et al., 2011). In the figure, the phases that are performed automatically and the artifacts that are automatically produced have been highlighted.
Figure 1. Data warehouse design methodology
1. Requirement Analysis: Decision makers' business goals are represented using the i* framework for data warehousing (Mazón et al., 2005). The designer has to detect the information requirements and to translate them into a workload containing the typical queries that allow the extraction of the required information. Then, the goals of the data warehouse must be transformed into a set of constraints defining the facts and dimensions to be considered in the multidimensional schema. To this aim, both the workload and the constraints are given as input to the Conceptual Design;
2. Source Analysis and Integration: The schemas and metadata of the different data sources must be analyzed and then reconciled, in order to obtain a global conceptual schema. The integration strategy is based on an ontological approach and, therefore, we need to work at the conceptual level. To this end, reverse engineering from the data sources to a conceptual schema is necessary in order to deal with concepts when considering structured data sources; metadata are taken into account when considering unstructured data sources. The conceptual schema that results from the integration process must then be transformed into a relational schema, which constitutes the input to the Conceptual Design;
3. Conceptual Design: This phase is based on the multidimensional model that provides a graph-oriented representation of relational databases. In particular, it aims to build attribute trees representing the facts pointed out in the integrated data source (dell'Aquila et al., 2009) and to automatically remodel those attribute trees on the basis of the constraints derived from the Requirement Analysis. Finally, the resulting attribute trees are checked in order to verify whether they are able to support the defined workload (dell'Aquila et al., 2010);
4. Logical Design: The conceptual schema is transformed into a relational schema—for instance, a snowflake schema—considering each
attribute tree present in the conceptual schema as a cube, having the root as the fact and the branches as the dimensions, possibly structured
in hierarchies;
5. Physical Design: The design process ends with the definition of the physical properties of the database on the basis of the specific
features provided by the database system, such as indexing, partitioning, and so on.
3.1. Multidimensional Model
The multidimensional model aims to represent a relational schema using a tree-based representation. This schema can then be remodelled by means of traditional operations on graphs, in order to obtain a multidimensional schema.
Let G = (N, E) be a graph, where:
• N is a set of nodes, each representing an attribute; and
• E = {(Ai, Aj) | oriented edge from Ai ∈ N to Aj ∈ N, i ≠ j} ⊂ N × N is a set of oriented edges.
Assumption 1: Let R(X1, …, Xn) be a relation, and let G = (N, E) be a tree. We assume Xi ∈ N, ∀i = 1, …, n. We also assume (Xi, Xj) ∈ E if Xi is the primary key of R and i ≠ j. We say that G = (N, E) is the attribute tree obtained from the relation R, where Xi is the root of G.
On the basis of Assumption 1, the edge (Xi, Xj) indicates the presence of the non-trivial (i ≠ j) functional dependency Xi → Xj that holds on R (in this case, established by a primary key constraint). It is worth noting that, for the sake of simplicity, we assume the primary key is composed of only one attribute:
Assumption 2: Let R(X1, …, Xn) and S(Y1, …, Ym) be relations, and let G = (N, E) be a tree. We assume (Xi,Yj)∈E, if:
Assumption 3 (Tree Minimization): Let R(X1, …, Xn) and S(Y1, …, Ym) be relations, and let G = (N, E) be a tree. Now, let w be a real function w: N×N → ℝ such that w(u, v) = 1 for all (u, v) ∈ E. Then, we can use the foreign key constraint to minimize the tree G as follows. If:
o Xi → Xj; and
o Xj is a foreign key referencing the primary key Ys of the relation S,
then the tree G' = (N', E') such that:
o N' ⊆ N;
o (Xi, Ys) ∈ E'; and
o (Xj, Ys) ∉ E',
is the minimization of G.
Example 1: With reference to the relational schema depicted in Figure 2(a), the attribute tree shown in Figure 2(b) is the tree obtained on
the basis of Assumption 1 and Assumption 2, while the attribute tree shown in Figure 2(c) is the minimized tree obtained on the basis of
Assumption 3.
Figure 2. (a) Relational schema; (b) Attribute tree; (c)
Minimized attribute tree
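For illustration, consider a hypothetical two-relation schema (not the one of Figure 2): product(productID, name, categoryID), where categoryID is a foreign key referencing category(categoryID, description). A logic program could store the attribute tree rooted in productID as a set of edge/2 facts; after the minimization of Assumption 3, the foreign key node and the referenced primary key collapse into a single categoryID node:
% attribute tree of product, after minimization (hypothetical schema)
edge(productID, name).
edge(productID, categoryID).
edge(categoryID, description).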
Until now, we have considered relational schemas composed of one or two relations. Hereafter, we will be concerned with complex schemas composed of several relations. To this end, we introduce the tree(R) function, which builds an attribute tree starting from the relation R. In fact, the topology of the attribute tree depends on the relation taken as the starting point to navigate in the schema:
Assumption 4 allows building an attribute tree when a relation representing a many-to-many n-ary relationship is taken as the starting point:
Assumption 5 allows building an attribute tree when a relation representing a many-to-many n-ary relationship is encountered while navigating in the schema:
Example 2: With reference to the relational schema in Figure 3(a), we can build four different attribute trees, according to the relation chosen as the starting point. Figure 3(b) shows the attribute tree obtained by invoking the tree(product) function. Notice that Assumption 5 has been applied and the double-headed arrow “↠” represents a multivalued dependency, that is, one productID points to many occurrences of the sale relation. Figure 3(c) shows the attribute tree obtained by invoking the tree(sale) function—Assumption 4 is applied here—while Figures 3(d) and 3(e) show those obtained by invoking tree(order) and tree(category), respectively. Assumption 5 is applied again in Figure 3(d) to navigate from order to sale.
Figure 3. (a) Relational Schema. Attribute trees obtained by
(b) Tree(product); (c) Tree(sale); (d) Tree(order); and
(e) Tree(category).
3.1.1. Operations on the Graph
In what follows, in reference to attributes A and B, A and B denote the nodes representing the corresponding attributes, and A → Bthe edge
existing from the node A to the node B. In the case of branches between nodes A, B simultaneously referring each other (as an example, this
happens when a relation has primary key A and alternative key B, or when two relations are mutually referencing each other via the respective
foreign keys A and B), the “loop” A B is solved algorithmically by a node-splitting and renaming (A, for example). That is, the
loop A B generates the (sub-)tree A → B → A'.
7. graft (A), removing the A node and adding its children to its parent; and
8. change_parent (A,B,C), for the edge B → C and node A,change parent of C from B to A means: (a) delete_edge(B,C), and (b)
create_edge(A,C).
Accordingly, the four basic operations defined on the tree correspond respectively to: creating attribute A, deleting attribute A, adding the
functional dependency A → B, and removing the functional dependency A → B. Moreover, the change parent operation is very useful to modify
hierarchical dimensional levels. Therefore, the basic operations allow a reengineering of schemas to be performed using a completely data-driven approach. So, if we remodel the attribute tree manually, considering not only the designer’s experience and choices but also the user needs coming from the requirement analysis, then we obtain a hybrid approach. As a further evolution, if we define a set of rules that apply the defined operations to a graph on the basis of a set of constraints derived from the requirement analysis, then we can remodel an attribute tree automatically.
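As an illustration, the basic operations can be realized in a logic program as simple assertions and retractions over dynamic node/1 and edge/2 predicates. The following is a minimal sketch: the predicate names follow sub-Section 3.1.1, but the implementation details are our assumption.
:- dynamic node/1, edge/2.

create_node(A) :- assertz(node(A)).
delete_node(A) :- retractall(edge(A, _)), retractall(edge(_, A)), retract(node(A)).
create_edge(A, B) :- assertz(edge(A, B)).
delete_edge(A, B) :- retract(edge(A, B)).

% change_parent(A, B, C): change the parent of C from B to A,
% i.e., delete the edge B -> C and create the edge A -> C.
change_parent(A, B, C) :- delete_edge(B, C), create_edge(A, C).

% graft(A): remove the node A and attach its children to A's parent.
graft(A) :-
    edge(P, A),
    forall(edge(A, Child), change_parent(P, A, Child)),
    delete_edge(P, A),
    delete_node(A).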
4. REQUIREMENT ANALYSIS
In phase 1 of the methodology (cf. Figure 1), the needs of the decision makers are investigated. To this aim, we adopt the i* framework, which allows business goals to be explicitly represented in reference to the actors considered in the system. In its application to data warehousing, we can observe two main categories of actors: the decision makers and the data warehouse itself. Each actor performs specific tasks in order to achieve his/her own goals.
The requirement analysis consists of the following steps:
1. Business goals representation: Representing the tasks and the goals of the different actors using the i* framework;
2. Workload representation: Deriving from the tasks of the decision makers the typical queries of the analytical processing (the workload);
3. Constraints representation: Translating the goals and the tasks of the data warehouse into a set of constraints.
4.1. Business Goals Representation
In the i* framework, user requirements (i.e., business goals) are decomposed into a more detailed hierarchy of nested goals: (a) strategic goals, or high-level objectives to be reached by the organization; (b) decision goals, answering how strategic goals can be satisfied; and (c) information goals, defining which information is needed for decision making. To do this, the designer must produce a model describing the relationships among the main actors of the organization, along with their own interests. This model is the so-called strategic dependency model and aims to outline how the data warehouse helps decision makers to achieve business goals.
Each actor in a strategic dependency model is further detailed in a strategic rationale model, which shows the specific tasks the actor has to perform in order to achieve a given goal.
Then, the strategic rationale models are used to create a workload and a set of constraints.
4.2. Workload Representation
The workload contains a set of queries to be manually derived from the tasks of the decision makers and it helps the designer to identify the
information the final users are interested in. In a few words, it includes the typical queries that will be executed by decision makers in the
analytical processing.
A possible high-level representation of the workload queries is given by the grammar shown in Algorithm 1.
Algorithm 1. Grammar for a high-level representation of the workload queries
<query>::-<function>(<fact_pattern>);
<fact_pattern>::-<fact>[<aggreg_pattern>;<sel_clause>].<measure>
<aggreg_pattern>::-<level> | <aggreg_pattern>,<level>
<sel_clause>::-<attribute> <comp_op> <constant> |
<sel_clause> <logical_op> <sel_clause>
<function>::-avg | sum | count
<logical_op>::-and | or
<comp_op>::-≥ | ≤ | < | > | =
<fact>::-<identifier>
<measure>::-<identifier>
<level>::-<identifier>
<attribute>::-<identifier>
Here, constant is a number or string from the attribute domain, and identifier is a user-defined name corresponding to any valid variable (the
name of a table or column, for example).
Other authors (e.g., Phipps & Davis, 2002) utilize a workload representation based on SQL statements, in order to select a schema, among the set
of conceptual schemas designed by an algorithm, which best supports user requirements. Romero & Abelló (2010a) also use SQL statements for the workload representation, but their aim is to detect the role played by each model element (that is, whether a table is a dimension or a fact table, for example) and to assign it a label accordingly. The grammar we use is based on the one introduced in (Golfarelli & Rizzi, 2009), which allows queries to be represented at the conceptual level. Then, we have a coherent level of abstraction between the workload and the schema we intend to validate. Moreover, it allows a conceptual schema to be checked before proceeding to the subsequent logical design.
4.3. Constraints Representation
For each resource needed by the decision makers, the data warehouse must provide adequate information by achieving its own goals. Moreover, a goal must have measures, which are resources to be used in order to provide the information required for decision making. Therefore, a fact is generated in order to allow the data warehouse to achieve its own goal. Finally, for each measure, a context of analysis must be provided by a task of the data warehouse. So, starting from the measures and dimensions that emerge from the data warehouse’s goals, it is possible to define some constraints the designer must necessarily consider. Such constraints can be represented using the grammar shown in Algorithm 2.
Algorithm 2. Grammar for a high-level representation of constraints
<constraint>::-<fact>[<dimensions>].[<measures>];
<dimensions>::-<dimension> | <dimensions>;<dimension>
<dimension>::-<level> | <dimension>,<level>
<measures>::-<measure> | <measures>,<measure>
<level>::-<identifier>
<measure>::-<identifier>
<fact>::-<identifier>
5. SOURCE ANALYSIS AND INTEGRATION
In phase 2, the preliminary step is the source analysis, devoted to the study of the source databases. If necessary, the designer has to produce, for
each data source, a conceptual schema along with a data dictionary, storing the description in natural language of the concepts modelled by the
database. Then, the integration process proceeds incrementally using a binary operator that, given two conceptual schemas as operands,
produces a new conceptual schema:
G1 = integration (S1, S2); and
Gi = integration (Gi−1, Si+1), for i = 2, …, n−1, where n is the number of data sources.
In detail, the integration process of two databases Si and Sj is composed of the following steps:
1. Ontological representation: In this step, we consider an ontology describing the main concepts of the domain of interest. If such an ontology does not exist, it must be built by domain experts. The aim is to build a shared and reusable ontology;
2. Predicate generation: For each concept in the ontology, we introduce a unary predicate. The output of this step is a set of predicates, which represents a vocabulary to build definitions of concepts using first-order logic;
3. Ontological definition generation: For each concept in the ontology, we also introduce a definition on the basis of its semantic relationships. This definition is the description of the concept at the ontological level (that is, the common and shared definition). The output of this step is a set of ontological definitions;
4. Entity definition generation: For each entity present in the data sources and described in the data dictionary, we introduce a
definition using the predicates. Therefore, an entity definition is a logic-based description of a concept in the database. The output of this
step is a set of entity definitions;
5. Similarity comparison: Assuming that similar entities have a very close description, we can detect whether (a) entities that have
different names refer to the same concept, and (b) entities that have the same name refer to different concepts. To do so, we utilize a set of
inference rules, the so-called similarity comparison rules, to analyze the logic-based descriptions and a metric to calculate the pairwise
similarity of entity definitions. In detail, given two schemas Si (Aι1, Aι2, …, Aιk) and Sj (Aφ1, Aφ2, …, Aφm), where Aτh is the h-th entity of
schema St, we compare the logic definition of Aιh (for h = 1, …, k) with that of Aφq (for q = 1, …, m). For each comparison, we calculate a
similarity degree d and an output list L. The output list contains the possible ontological concepts shared by both the logic definitions.
Assuming that we can compare the logical definitions of entities Aιh and Aφq and calculate both the similarity degree d and the output list L, we can observe one of the following cases: Aιh is equivalent to Aφq, if d ≥ x, where x is a fixed threshold value (for convenience, we fixed x at 0.70):
The entities Aυs of the global conceptual schema are then generated as follows (step 2.6):
a. Aυs = Aιh ≈ Aφq, if we observe case 2.5(a). In this case, all the distinct attributes are merged in Aυs;
b. Aυs = {Aιh, Aφq}, if we observe case 2.5(b). In this case, all the common attributes are associated to Aιh (resp.,Aφq);
c. Aυs = {L, Aιh, Aφq}, if we observe case 2.5(c). In this case, all the common attributes are associated to L;
d. Aυs = {γ, Aιh, Aφq}, if we observe case 2.5(d). In this case, all the entities preserve their own attributes;
e. Aυs = {Aιh, Aφq}, if we observe case 2.5(e). In this case, all the entities preserve their own attributes.
• If a new relationship γ is added, then we use the 0:N cardinality for each entity A which participates in γ. Indeed, this is the most general cardinality;
• If entities Ai and Aj are merged, then we use the minimum of the respective minimum cardinality constraints and the maximum of the
respective maximum cardinality constraints for each relationship in which Ai and/or Aj participate;
It is worth noting that both attributes and cardinality constraints are retrieved from the Data Dictionary without any elaboration. Only entity
definitions are converted into predicates for comparison.
When a further schema Sw has to be integrated, the integration process starts from step 2.4, using the previous result of step 2.6 and schema Sw.
At last, the global conceptual schema Gu is translated into an integrated logical schema by applying the well-known mapping algorithms (Elmasri
& Navathe, 2010).
5.1. Representation of the Integrated Logical Schema
In order to represent the integrated logical schema, we need to define a metamodel. Such a metamodel describes how to organize the so-
called metadata, that is, the data we can use to describe any relational schema.
The language we used for the definition of the metamodel is the Predicate Calculus. In particular, our metamodel consists of the following four predicates:
1. table (X). The predicate states that X is a table of the relational schema;
2. field (X,Y). The predicate states that Y is an attribute (field) of X. As X must be a table, we can use the AND logical operator in order to
impose this constraint. For instance, the expression field(X,Y) ∧ table(X) states that Y must be an attribute of the table X;
3. key (X,K). The predicate states that K is (part of) the primary key of X, where X must be a table and K must be a field of X (for
instance, key(X,K) ∧ field(X,K) ∧ table(X)). Moreover, the predicate allows us to model logical schemas where the primary key is composed
of more than one field;
4. fkey(X1,FK1,X2,K2). The predicate states that FK1 is a foreign key. Moreover, FK1 is an attribute of X1 and references (i.e., points to) the
attribute K2. The constraints on the predicate are that FK1 must be an attribute of the table X1, and K2 must be the primary key of the
table X2.
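For illustration, the following minimal sketch shows these metamodel predicates in a logic program; the schema fragment is hypothetical, and the last rule simply encodes the constraints stated for fkey/4.
% hypothetical fragment of an integrated relational schema
table(customer).
table(order).
field(customer, customerID).
field(customer, city).
field(order, orderID).
field(order, customerID).
key(customer, customerID).
key(order, orderID).
fkey(order, customerID, customer, customerID).

% FK1 must be a field of the table X1, and K2 must be the primary key of the table X2
well_formed_fkey(X1, FK1, X2, K2) :-
    fkey(X1, FK1, X2, K2),
    table(X1), field(X1, FK1),
    table(X2), key(X2, K2).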
6. CONCEPTUAL DESIGN
Phase 3 is based on the multidimensional model and aims to automatically produce a data warehouse conceptual schema. In fact, this phase has
been implemented as a logical program, able to simulate the behaviour and the reasoning of a designer.
In order to use the constraints and the workload in the conceptual design, we need to preliminarily transform these manually-produced artifacts into predicates to be given as input to the logical program. To this end, we realized two compilers, as depicted in Figure 5.
Figure 5. Steps of the conceptual design
The conceptual design consists of the following steps:
1. Compiling the artifacts, by transforming the constraints and the workload into predicates;
2. Identifying facts present in the integrated logical schema on the basis of the constraints;
3. Building an attribute tree for each identified fact;
4. Remodelling the attribute trees on the basis of the constraints;
5. Validating the data warehouse conceptual schema, by verifying whether all the remodelled trees agree with the workload.
6.1. Compiling the Artifacts
In this sub-Section, we show how both the input artifacts are transformed into predicates by a compilation process. Each compiler is composed of a Syntactical Analyzer (SA) that, in turn, uses a Lexical Analyzer (LA) for string pattern recognition.
The SA is a parser that verifies the syntactical structure of a statement. The SA has been developed using Bison (Levine, 2009), which is a tool
that (a) reads a grammar-file, and (b) generates a C-code program. This C-code program represents the SA. In particular, the grammar-file
contains the declaration of a set of terminal symbols (tokens) and a set of grammar rules, expressed according to the Backus Naur Form.
To define a grammar, tokens must be defined first. The tokens include literals (i.e., string constants), identifiers (i.e., string variables), and numeric values. For example, the declaration
%token VAR
introduces a token named VAR. Each grammar rule then has the form:
<result>: <components> { <statement> };
where <result> is a non-terminal symbol, <components> is a set of terminal and/or non-terminal symbols, and <statement> is the C-code statement to be executed when the rule is applied.
The LA is the component used by the SA, in order to obtain an ordered sequence of tokens. The tokens are recognized by the LA inside a string
(i.e., pattern matching on text) and, then, passed to the SA. The LA has been developed using Flex (Paxson et al., 2007), which is a tool that (a)
uses the tokens defined for the SA program, (b) reads a rule-file, and (c) generates a C-code program. This C-code program represents the LA. In
particular, the rule-file is composed of two sections: (a) definition, and (b) rules. The definition section includes the tokens already defined with
Bison, plus further identifiers. The identifiers define how to perform the matching between a sequence of alphanumeric characters and a token. For example, the definition
UVAR [a-z][a-z0-9]*
creates an identifier named UVAR, which represents any string that starts with a lowercase alphabetic character, followed by an arbitrary number of lowercase alphabetic characters or digits from 0 to 9. Some examples of rules for string pattern matching are described next.
The first rule states that, whenever the constant “;” is recognized inside the input string, the token SEMICOL must be returned. In fact, when the LA recognizes an identifier, it returns the corresponding token to the SA. Accordingly, when the LA recognizes the identifier UVAR, it returns the token VAR to the SA. Finally, when the LA encounters the end-of-file symbol, it stops the string scanning. So, using the sequence of tokens returned by the LA, the SA is able to check whether a statement is well-formed against the given grammar.
6.1.1. Compiling the Constraints
In order to transform the constraints coming from the Requirement Analysis into predicates, we describe the metamodel used to represent a starting multidimensional schema. On the basis of this metamodel, the constraints define an “ideal” multidimensional schema, and the logical program has to create a final multidimensional schema that properly satisfies these constraints:
1. cube (C). The predicate states that C is a cube (i.e., a fact) of the multidimensional schema;
2. measure (M,C). The predicate states that M is a measure of the cube C;
3. dimension_level (D,N,E,C,T). The predicate states that E is a dimension level of cube C. Here, D indicates the dimension number
and N the hierarchical level number inside the dimension. T indicates whether E is a time dimension or not.
6.1.2. Compiling the Workload
The queries included in the workload must also be transformed into a set of predicates, using the following query model to represent a query to be checked against the final multidimensional schema:
1. query (C,A,S,M). The predicate states that C is the cube involved in the query, A is the aggregation pattern, S is the selection pattern,
and M is the measure.
6.2. Identifying Facts
As stated before, we developed a logical program able to construct an attribute tree. The logical program contains the metadata of the relational
schema and the navigation rules. It uses the rules for navigating in a recursive way through a logical schema, starting from an initial table marked
as a cube.
Indeed, the main difficulty in this step is to correctly map the multidimensional concepts onto the integrated schema. For example, in a relational schema, the designer must face the problem of identifying which relations can be considered as cubes or dimension tables.
In our methodology, we mainly deal with the correct identification of facts, as these are the starting point to build the initial attribute tree.
We consider the facts involved in the constraints coming from the requirement analysis. Given a fact F1 in a constraint, we choose a candidate fact F2 in the integrated schema such that F2 corresponds to F1:
The rule, which is based on a syntactical matching, builds an attribute tree from the table C, if C is marked as a cube in the requirements. If so, the rule asserts that C is the root of that attribute tree and, then, the navigation starts. (Notice that create_root(C) is one of the functions introduced in sub-Section 3.1.1; we will use these functions also in what follows.)
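A sketch of such a rule, in terms of the predicates used so far (cube/1 from the compiled constraints, table/1 from the metamodel, create_root/1 from sub-Section 3.1.1, and navigate/1 from the next sub-Section), might be the following; the exact formulation is our assumption.
% C is a candidate fact if it is marked as a cube in the requirements and a
% table with the same name exists in the integrated schema
identify_fact(C) :-
    cube(C),
    table(C),
    create_root(C),      % assert C as the root of the attribute tree
    navigate(C).         % start the navigation from C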
6.3. Building the Attribute Tree
The complete attribute tree is now constructed by the navigation rules. These rules allow navigation of the logical schema via the join paths defined by foreign key constraints. The nodes of the constructed tree are the fields of the tables. Therefore, the tree represents the structure of the database according to the multidimensional model introduced in Section 3.1:
∧ key(X2,C2) ∧ navigate(X2)
This first rule allows navigation from table X1 to table X2, via a foreign key constraint, when there is a one-to-many relationship between X1 and X2:
Rule 3: navigate(X1) ⇐ table(X1) ∧ key(X1,C1) ∧ fkey(T,CX1,X1,C1) ∧ fkey(T,CX2,X2,C2) ∧ ¬(X1=X2) ∧ table(X2) ∧ key(X2,C2) ∧ table(T) ∧
key(T,CX1) ∧ key(T,CX2):
This second rule allows navigation from table X1 to table X2, via a foreign key constraint, when there is a many-to-many relationship between X1 and X2; T is the intermediate table. Here, an edge with a double-headed arrow is created from C1 to T and a simple edge from T to C2.
As concerns the first rule, the logical program needs to distinguish whether a table is the root, i.e., the starting fact table, and whether the foreign key is defined on a primary key. The differentiation can be done on the basis of the first navigation rule, by creating three specialized rules. The main difference consists in the creation of the edge. The specialized rules are the following:
If X1 is not the root and the foreign key is not defined on its primary key, then an edge must be created between K and C2, where K is the primary
key of X1 and C2 is the primary key of the target relation X2; create_edge(K,C2) is a function, which asserts that an edge exists
between K and C2:
If X1 is not the root and the foreign key is defined on its primary key, then an edge must be created between C1 and C2, where C1 is the primary
key of X1:
If X1 is the root, then an edge must be created directly between X1 and C2:
• root(X1) ∧ create_edge(X1,C2).
Finally, every time a table has been reached, a new node is created—using the create_node(A) function—and its own fields are listed, by creating edges between the primary key and the fields; thus, these fields form a set of leaf nodes.
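Putting the pieces together, a condensed sketch of the one-to-many navigation is shown below; the helper edge_source/3, the use of forall/2, and the omission of the many-to-many case of Rule 3 and of cycle handling (node splitting, sub-Section 3.1.1) are our simplifications.
% follow every foreign key of X1 and recursively navigate the referenced tables
navigate(X1) :-
    forall(( fkey(X1, C, X2, C2), table(X2), key(X2, C2) ),
           ( edge_source(X1, C, From),
             create_edge(From, C2),
             navigate(X2) )).

% which node the new edge starts from, following the three specialized rules
edge_source(X1, _, X1) :- root(X1), !.       % X1 is the root (the fact table)
edge_source(X1, C, C)  :- key(X1, C), !.     % the foreign key is defined on the PK
edge_source(X1, _, K)  :- key(X1, K).        % otherwise, the primary key K of X1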
The final output of the logical program is a set of assertions, which defines the tree G:
• root (C): C is the root of G;
6.4. Remodelling the Attribute Tree
In this step, an attribute tree is modified in a supervised way using all the constraints coming from requirements.
We denote with (A, B) :- C the fact that the attribute C can be computed using A and B, that is, C is a derived measure. As an example, (price, quantity) :- amount means that there exists an (algebraic) expression to compute amount using price and quantity.
Let us consider a tree G and a constraint coming from the user requirements. Informally, we create as many root children as there are measures in the constraint. Moreover, we add a root child for each first dimensional level in the constraint. In a recursive way, the other dimensional levels are added as children nodes of their own predecessor levels. In general, when we add a new node B to a parent node A, the node B can be created ex novo or can already be present in the tree. In the latter case, the edge between B and the old parent of B must be deleted and a new edge between B and the new parent A must be created; this is the so-called change parent operation.
• Adding measures:
∧ change_parent(C,Z,M)
∧ create_edge(C,M).
For each measure M which is not a root child but is a child of another node Z, a change parent operation is done. Similarly, if M is not a root child but can be derived from a set of nodes Mi, for i = 1, …, n, then the nodes Mi are deleted, a new node M is added, and an edge between the root and M is created. In the other cases, the program fails:
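A sketch of this step is given below; derived/2 (stating that a measure can be computed from a list of existing nodes) is a hypothetical helper, while the graph operations are those of sub-Section 3.1.1.
add_measure(C, M) :-
    root(C),
    measure(M, C),
    (   edge(C, M)                        % M is already a root child: nothing to do
    ->  true
    ;   edge(Z, M)                        % M is a child of another node Z:
    ->  change_parent(C, Z, M)            %   reattach M directly under the root
    ;   derived(M, Nodes)                 % M can be computed from other nodes:
    ->  maplist(delete_node, Nodes),
        create_node(M),
        create_edge(C, M)
    ;   fail                              % in the other cases, the program fails
    ).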
• Adding dimensions:
∧ add_description(K)
For each first dimensional level E, which is not a time dimension and whose primary key K is not a root child but is a child of another node Z, a
change parent operation of K from Z to C is done. If such a dimensional level E does not exist, then the program fails. In this rule,
add_description(K) is the function that creates a node as a descriptive attribute of K, if necessary:
∧ root(C) ∧ create_edge(C,E)
This rule adds a time dimension to C having E as terminal level, if imposed in the requirements:
For each n-th time dimensional level present in the requirements, a node is created as a child of the (n−1)-th time dimensional level:
Deleting attributes:
In the end, all the remaining nodes that do not represent dimensional levels are deleted.
• The root node is the cube and the children of the root represent the measures of the fact table;
• The non-leaf nodes represent dimensional attributes, i.e., entities that represent levels of aggregation. The number of dimensional attributes linked to the root establishes the dimensionality of the data cube. The dimensional attributes linked to each other by an edge form a hierarchy;
6.5. Validating the Data Warehouse Conceptual Schema
The workload is now used in order to perform the validation process. If all the queries of the workload can be effectively executed over the
schema, then such a schema is assumed to be validated and the designer can safely translate it into the corresponding logical schema. Otherwise,
the conceptual design process must be revised.
We define the following issues related to the validation of a conceptual schema in reference to the queries included into the preliminary
workload:
• A query involves a cube that has not been defined as such;
• A query presents an aggregation pattern on levels that are unreachable from the given cube;
• A query requires an aggregation on a field that has not been defined as a dimensional attribute;
• A query requires a selection on a field that has not been defined as a descriptive attribute.
A query is assumed to be validated if there exists at least one attribute tree such that the following conditions hold: (a) the fact is the root of the
tree; (b) the measures are the children nodes of the root; (c) for each level in the aggregation pattern, there exists a path from the root to a node
X, where X is a non-leaf node representing the level; and (d) for each attribute in the selection clause, there exists a path from the root to a node
Y, where Y is a leaf node representing that attribute.
If all the queries are validated, then each attribute tree can be considered as a cube, where the root is the fact, non-leaf nodes are aggregation
levels, and leaf nodes are descriptive attributes belonging to a level. So, the conceptual design ends and the designer can transform the conceptual
schema into a logical one. On the other hand, if a query cannot be validated, then the designer has to modify the tree appropriately. For example, if an attribute of the selection clause is not in the tree, then the designer can decide to add a further node.
The conceptual schema validation is executed by the logical program, via an inferential process that verifies the issues pointed out above. At the end of the inferential process, the logical program states whether the conceptual schema is valid, on the basis of the given preliminary workload and the produced attribute trees.
Cube test:
Measure test:
By the time this rule is performed, the cube test has already been completed and, therefore, we are assured that C is a cube. So, this rule only verifies whether M is a measure of the cube C. If M is not a measure of the cube C, then the logical program fails and the schema is not validated:
Path test:
This test recursively scans the list A of the aggregation pattern, defined inside the query query(C,A,S,M). (The ending condition is represented by
the empty list). For each element D extracted from the head of A, this rule checks whether D is a root child or there is a path from C to D. If a path
from C to D is found, then a successful message is shown and the test is accomplished:
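A minimal sketch of this test, assuming the remodelled tree is stored as edge/2 facts, is the following.
% D is reachable from X if there is a (possibly empty) path of edges
path(X, X).
path(X, Y) :- edge(X, Z), path(Z, Y).

path_test(_, []).                 % empty aggregation pattern: test accomplished
path_test(C, [D|Rest]) :-
    path(C, D),                   % D must be a root child or reachable from C
    path_test(C, Rest).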
Aggregation test:
This test also recursively scans the list A, but it checks whether all the elements of the list are dimensional attributes. If an element D of the list does not have a child node F, then the logical program fails. The underlying assumption is that each dimensional attribute must be equipped with at least one descriptive attribute:
Selection test:
This test recursively scans the list S of the selection pattern, defined inside the query query(C,A,S,M). For each element D extracted from the head of S, this rule first performs the path test from C to D, and then checks whether all the elements of the list are descriptive attributes. If an element D of the list is not a descriptive attribute—i.e., a leaf node—then the logical program fails. To accomplish this test, D must have no child.
7. CASE STUDY
The case study aims at illustrating the application of the methodology in a context where the designer has to create a data warehouse, adopting a
hybrid approach supporting automatic steps.
7.1. Requirement Analysis
In this case study, we assume there is only one decision maker, who manages a company that sells many products, also using e-commerce tools.
7.1.1. Business Goals Representation
The manager is interested in increasing the sales of the products; this represents the strategic goal.
From this high-level goal, the decision goal is “make promotions”, which establishes how to boost sales. To this aim, the information that is necessary to satisfy this goal is “identify the most sold products”. Then, we represent the information requirements as tasks of the decision maker. In this case, the task is “analyze the quantity of sold products per month, region, and category in the last year”. For the sake of simplicity, we omit the other goals and tasks of the decision maker.
On the other hand, we have to represent also the tasks of the data warehouse, in reference to the information to be provided. For the given
requirements, the task is “provide information about selling per time, location, and products”. This task includes quantity as a measure. Furthermore, it is possible to define complete time and location hierarchies on the given dimensions.
7.1.2. Workload Representation
From the task of the decision maker, we define the following query string:
which is well-formed and is devoted to the aggregation on quantity measure by month, region, and category, considering only products sold in
2012.
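The query string itself is not reproduced in the source. A plausible form, written according to the grammar of Algorithm 1, together with its compiled counterpart in the query model of sub-Section 6.1.2 (the internal representation of the selection pattern is our assumption), is the following:
% sum(sales[month, category, region; year = 2012].quantity);
query(sales, [month, category, region], [year = 2012], quantity).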
7.1.3. Constraints Representation
From the task of the data warehouse, we define the following constraint:
which is well-formed and states that we must have a sales cube, having a quantity measure and three dimensions. In the example, the first dimension is composed of two hierarchical levels: product, which is the first level, and category, the second. The second dimension is a time dimension, composed of three levels: namely, day, month, and year. The third dimension is a geographical dimension, composed of two levels: namely, city and region.
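Analogously, the constraint string is not reproduced in the source. A plausible form, written according to the grammar of Algorithm 2, together with its compiled form in the metamodel of sub-Section 6.1.1 (the encoding of the time-dimension flag T is our assumption), is the following:
% sales[product,category; day,month,year; city,region].[quantity];
cube(sales).
measure(quantity, sales).
dimension_level(1, 1, product, sales, no).
dimension_level(1, 2, category, sales, no).
dimension_level(2, 1, day, sales, yes).
dimension_level(2, 2, month, sales, yes).
dimension_level(2, 3, year, sales, yes).
dimension_level(3, 1, city, sales, no).
dimension_level(3, 2, region, sales, no).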
7.2. Source Analysis and Integration
In this Section, we provide a complete example of source analysis and integration, in order to highlight how the ontology supports the designer in
the data warehouse conceptual design when considering a broad variety of data sources.
7.2.1. Ontological Representation
We built our ontology starting from OpenCyc, the open-source version of Cyc (Reed & Lenat, 2002; Foxvog, 2010). To this end, we extracted from OpenCyc the concepts of interest related to business companies and sales activity, which is the most frequent domain in data warehousing. The relationships considered are isA(X, Y), to indicate that X is a specialization of Y, and has(X, Y), to indicate that X has an instance of Y. To make this clearer, we provide an oriented graph. The ontology is partially shown in Figure 6.
Figure 6. Part of the ontology
7.2.2. Predicate Generation
Using the ontology previously introduced, first we defined the predicates to be used as a vocabulary for the logical definitions of database entities.
Each predicate corresponds to a concept present in the ontology. A subset of the predicates is reported below:
7.2.3. Ontological Definition Generation
For each ontological concept, we also provide an extended definition, using the predicates previously introduced. So, we obtained a logical definition for each ontological concept. In what follows, instead of using the isA(X,Y) relationship, we adopt the Prolog notation, in order to state, for instance, that X is a social being if X is an intelligent agent and X has both a social status and a role. A number of the ontological definitions of concepts follow:
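As an example of what such definitions may look like (the predicate names are assumptions drawn from the description above), the vocabulary includes unary predicates such as intelligentAgent/1, socialBeing/1, socialStatus/1, and role/1, plus the binary predicate has/2, and the definition of socialBeing can be written in Prolog notation as:
socialBeing(X) :-
    intelligentAgent(X),
    has(X, S), socialStatus(S),
    has(X, R), role(R).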
7.2.4. Entity Definition Generation
The case study aims to integrate two relational databases, a web log, and social data:
1. MusicalInstruments (see Figure 7a);
2. Fruit&Vegetables (see Figure 7b);
3. MusicalInstrumentsWebLog (see Figure 7c);
4. SocialWebData (see Figure 7d).
For both databases (1) and (2), we provide their essential conceptual schemas.
MusicalInstruments is the database used by an on-line shop, in order to manage the sales of musical instruments and accessories. Fruit&Vegetables is the database used by a farm, in order to manage the wholesale of fruit and vegetables. MusicalInstrumentsWebLog is the log file generated by the web server of the e-commerce software. SocialWebData is the dataset that can be obtained from the most popular social networks.
For each database entity, we created a definition using the predicates we had previously generated. Indeed, such predicates represent the
vocabulary for the construction of the concepts using the first-order logic. Part of the data dictionary along with the logical definitions of database
entities is shown in Table 1.
Notice that these definitions often disagree with the ontological ones. In fact, entities are always defined without considering common and shared
concepts, since entities represent local concepts. This means we assume that the database designer ignores the ontology.
As concerns the web log, it must be transformed into a structured data source, composed of entities related by relationships. Also for these entities, definitions must be provided. The last data source is represented by the social networks, which provide interesting information about users’ trends.
The comparison is done automatically, using inference rules defined in first-order logic. These rules check the similarity degree between two lists L1 and L2 containing the logical definitions of two database entities (Ferilli et al., 2009). The similarity degree d is computed from three parameters: l, the number of successful mappings between the predicates of L1 and those of L2; n, the number of predicates occurring only in L1; and m, the number of predicates occurring only in L2.
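A much-simplified sketch of the comparison is given below. The definitions are assumed to be lists of ground terms (variables already bound), generalization/3 is an assumed ontology-lookup predicate, and the similarity function sf/4 is an assumption modelled on the general framework of Ferilli et al. (2009); with l = 1, n = 3, m = 2 it yields a value of about 0.36, the one reported in the running example that follows.
:- dynamic generalization/3.      % generalization(P, Q, G): P and Q share the generalization G

compare_defs(L1, L2, D, Common) :-
    findall(G, ( member(P, L1), maps_to(P, L2, G) ), Common),
    length(Common, L),            % l: number of successful mappings
    length(L1, N1), length(L2, N2),
    N is N1 - L,                  % n: predicates occurring only in L1
    M is N2 - L,                  % m: predicates occurring only in L2
    sf(L, N, M, D).

maps_to(P, L2, P) :- member(P, L2), !.                          % identical predicates
maps_to(P, L2, G) :- member(Q, L2), generalization(P, Q, G), !. % shared generalization

sf(L, N, M, D) :-                 % assumed similarity function
    D is 0.5 * ((L + 1) / (L + N + 2) + (L + 1) / (L + M + 2)).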
As an example, we show the comparison process between the entities client of S1 and customer of S2. These entities were previously defined as:
First, we bind the variable X of client to the variable X of customer, and then we create two lists L1 and L2 using the predicates present in each logic definition, which, for readability reasons, we informally represent as:
L1 = {socialBeing(X), individualAgent(X), has(X,Y),userAccount(Y)}
L2 = {legalAgent(X), has(X,Y), shop(Y)}
Then, we compare each unary predicate in L1 with all unary predicates in L2, considering the bound variables.
To do the comparison, we introduce the mapping operator ↔. When we map one predicate to another, we obtain:
• 1 for a successful mapping, if the predicates are the same or have a generalization in common in the ontology; or
• 0 for an unsuccessful mapping, otherwise.
In the first case, we add the common concept to the generalization list L (cf. Section 5, step 2.5):
3. Third mapping: userAccount(Y) ↔ shop(Y) = 0, because there is no ontological relationship between the two concepts. So, we increment n, because userAccount(Y) ∈ L1 and userAccount(Y) ∉ L2. Moreover, we increment m, because shop(Y) ∉ L1 and shop(Y) ∈ L2.
Now, we compare each binary predicate in L1 with all binary predicates in L2.
The only binary predicate in both lists L1 and L2 is has(X,Y). In order to explicitly show this binary relationship and the involved entities, we
consider the bound variables and make the substitutions: (i) has(client, userAccount) that means that a client is a social being having a user
account, and (ii) has(customer,shop) that means that a customer is a social being having a shop.
So, has(client, userAccount) ↔ has(customer, shop) = 0, and l = 1, n = 3, m = 2, and L = {socialBeing(X)} is the final result of the comparison. Therefore, d = 0.36 is the similarity degree and L = {socialBeing(X)} is the list containing the common concept(s).
Since 0 < 0.36 < 0.70 and L = {socialBeing(X)} ≠ ∅, the two entities client and customer do not refer to the same concept, but present a common ontological concept (cf. case 2.5(c) of Section 5). This means that both client and customer are social beings.
The complete result of the case study is reported in Table 2. For each comparison between entities, both the similarity degree d (in the top cell)
and the generalization list L (in the bottom cell) are reported. (The symbol “≈” means that the entities are equivalent).
Table 2. Results of the similarity comparison
(Only part of the table is reproducible here: among the reported values, the comparison between client and customer yields the generalization list {socialBeing}, and the two order entities are marked as equivalent, “≈”.)
7.2.6. Integrated Logical Schema Generation
Now we examine the results of the similarity comparison. We note that client and customer are commonly used as synonyms. However, the comparison results indicate that client and customer have not been defined identically and, therefore, they refer to different database entities. In fact, the similarity degree is different from zero but lower than the threshold (fixed at 0.70; cf. rule 2.5(a) in Section 5), and they present a common ontological concept.
This suggests introducing into the global schema G1 the SocialBeing entity and two specializations, corresponding to a client who is a social being with an account (that is, a registered user) and to a client who is a social being with a legal title (that is, a company having a shop). This has been obtained by applying rule 2.6(c) in Section 5.
Another generalization that has been detected is that between product in MusicalInstruments and product in Fruit&Vegetables. Even if there is a syntactical concordance, the terms refer to very different items: the former refers to an instrument, the latter to a fruit or a vegetable. However, both are items having a monetary value and are produced to be sold. Then, we created a generalization, namely product, which is an item having an assigned price. The specific products have been introduced as specializations, each with its own relationships. For example, an instrument is produced by a company, whereas the producer of the vegetables is information missing from the Fruit&Vegetables database. This has also been obtained by applying rule 2.6(c) in Section 5.
Finally, it is worth noting that the order entity has been defined in the same way in both databases because their similarity degree d is greater
than the threshold. So, they do not present a generalization because they refer to the same concept. This is the only overlapping concept. This has
been obtained by applying rule 2.6(a) in Section 5.
The final global conceptual schema G3 is shown in Figure 8. Notice that descriptive attributes are not shown in the figure.
After we have obtained the final global conceptual schema representing an integrated data source, we proceed to the elimination of hierarchies
and the transformation of this schema into a relational one, in order to use it in our hybrid data warehouse design methodology. This
transformation is based on algorithms present in literature and produces a set of relational metadata.
Figure 8. Global conceptual schema
Part of the metadata describing the relational schema is reported below (see the metamodel introduced in sub-Section 5.1):
table(product)
table(page)
key(product, productID)
key(page, pageID)
table(sales)
key(sales, productID)
key(sales, orderID)
7.3. Conceptual Design
In this Section, we show the core of the methodology, which relies on the graph-based multidimensional model for representing the integrated global schema. This schema will be remodelled according to the requirements that emerged from the business goals.
7.3.1. Compiling the Artifacts
First, we consider all the queries included in the workload. As an example, the query:
is translated into the following predicate to be submitted later to the logical program, when validating the data warehouse conceptual schema:
cube(sales)
measure(quantity, sales)
7.3.2. Identifying Facts
At this point, the metadata describing the global schema are considered. The tree function searches for a table C in the data source such that C has been marked as a cube in the requirements. In this running example, C is sales, and this table is also asserted as the root of the tree to be built. Then, the navigation through the logical schema starts from the sales table.
7.3.3. Building the Attribute Tree
The sales table, which has the quantity attribute, is linked to two tables via foreign keys: the one is product, whose primary key is productID, and the other is order, whose primary key is orderID. In turn, order is related to customer, because each order is made by a customer. Each of these tables has its own descriptive attributes, such as price and category. Further nodes, not relevant for the running example, are not reported in the figure.
7.3.4. Remodelling the Attribute Tree
The attribute tree resulting from the remodelling rules is depicted in Figure 9(b); it has been obtained by means of the following operations.
Since quantity is already a root child, no operations are necessary to add measures. Then, dimensions must be defined.
The first dimension is product, which is a root child; therefore, no operation is done. The second dimension is city, which is a customer’s descriptive attribute; in this case, a change parent operation is done, in order to add the city dimension to the root. Since city must be a dimensional attribute—i.e., a non-leaf node—a descriptive attribute is also added (not shown in the figure for readability) and a surrogate key is used for this dimension. As an example, given a tuple t1 in the customer relation such that t1[customerID, city] = <A1, Rome>, we should create a tuple s1[city, description] = <1, Rome>. The third dimension is time. So, a day node is added to the root, as specified in the requirements.
Now, hierarchies must be introduced. To this end, a change parent operation is done to make region a city’s child. As concerns category, it is already a child of productID, but a descriptive attribute is added and a surrogate key is used. Finally, a complete time dimension is built. At the end, the remaining nodes can be safely deleted.
7.3.5. Validating the Conceptual Schema
The aim is to check whether this query can be executed against the conceptual schema depicted in Figure 9(b).
The Cube test is accomplished because sales is the root of the tree. The Measure test is also accomplished because quantity is a root child.
The aggregation pattern is represented by the list [month, category, region]. Therefore, we have to check whether a path from the root to each
element of the list exists. In the tree, paths exist from sales to month, category, and region. Then, the Path test is accomplished. Moreover, all
these are non-leaf nodes (we recall that region has its own descriptive attribute). For this reason, the Aggregation test is also accomplished. Finally, the Selection test is accomplished because year is a leaf node reachable from the root. So, the given attribute tree can be safely transformed into a logical schema during the next logical design phase (see Figure 9(c)). It is worth noting that the rules to transform a conceptual schema into a logical schema are not discussed here, since they are quite intuitive.
Figure 9. (a) Part of the attribute tree of sales fact; (b) Part of
the remodelled attribute tree of sales fact; (c) Logical schema
7.3.6. On Agile Aspects of the Methodology
The main aspect of agile methodologies is the ability to address frequent changes in user requirements (Collier & Highsmith, 2004) with the minimum impact on the design process. Such changes can be classified as follows:
1. Addition of information: Usually, this kind of change is not incoherent with pre-existing requirements, but simply implies the creation
of new cubes, or the addition of further dimensions to a cube, new levels/hierarchies to a dimension, or further attributes to a dimensional
level;
2. Deletion of information: This case requires the deletion of existing attribute(s), dimension(s), or cube(s). To our knowledge, this is the most unusual case, as data warehouses are normally used to preserve historical data;
3. Modification of requirements: This kind of change implies a complete revision of the conceptual/logical schemas, as it may involve specifications that possibly contradict the produced artifacts.
Now, let us assume we have a physical data warehouse schema and that new user requirements are available. If we observe case (b), then we can jump directly to the conceptual design, using the produced logical schema as a source schema. So, by applying to this schema the rules for building the attribute tree, we are able to go back to the conceptual schema. As an example, applying the constraints (reported at the end of Section 7.3.1) to the schema in Figure 9(c), we obtain the tree shown in Figure 9(b). At this point, the new requirements (in the form of new constraints) imply the deletion of nodes.
On the other hand, if we observe case a) or c), then we have to restart the complete design process from the beginning, since we may need some
data that have been discarded. However, since the methodology is completely automatic, this does not represent a drawback of the proposed
approach.
8. FUTURE RESEARCH
One of the hot words in Big Data is Velocity. This refers not only to the rate at which data are produced, but mainly to the capacity of the system to process them for information delivery. This capacity requires that a new data source be integrated into the system in a scalable way, that is, quickly and without affecting the pre-existing system.
The integration process raises two issues: schema integration, which is addressed in the data warehouse design phase, and data integration, which is addressed in the ETL design phase.
While the roadmap for schema integration is established and should be based on an ontological approach, only a few authors focus their attention on the automatic generation of ETL plans.
In (Romero et al., 2011) a semi-automatic process is presented. The explained strategy aims to identify necessary operators for populating the
target schema, after a multidimensional tagging has correctly identified each required concept and mapped it to the data source. Although the
paper presents an interesting method to define a preliminary ETL plan, some questions arise—for example, how the right aggregate function is chosen when an aggregation operation must be executed. So, when dealing with complex schemas, or when finding an algorithmic solution is a hard task, manual intervention is needed.
Another automatic method for ETL design is presented in (Muñoz et al., 2009). It is devoted to code generation, by transforming a conceptual ETL plan into platform-dependent procedures. The method has the merit of being based on well-known standards—such as UML and QVT. Furthermore, if a new ETL process must be added to feed a different target system, only the transformations must be created, and only once. However, the starting conceptual plan must be manually defined by the designer. Therefore, the aim of creating ETL procedures using only a source schema and a target schema still remains unresolved.
Hence, future trends must address the problem of automatically generating the ETL procedures, in order to minimize the time necessary for
effectively including a new data source into the data warehouse system.
9. CONCLUSION
We have presented a hybrid methodology for big data warehouse design. The core of the methodology is a graph-based model, aiming at
producing multidimensional schemas automatically. To this end, the inputs of the conceptual design are a global schema and a set of user
requirements.
The global schema is obtained through an integration process over data sources, which can be structured or unstructured. In order to solve the syntactic and semantic inconsistencies present in the data sources, we used an ontology to represent the domain’s concepts, along with their relationships, in a formal way. The integration strategy aims to detect similar or identical concepts defined in different data sources and to identify common ontological concepts.
The user requirements, which consist of a set of constraints and a workload, are used respectively to perform the reengineering of the global schema in a supervised way and to validate the resulting multidimensional schema.
This work was previously published in Big Data Management, Technologies, and Applications, edited by Wen-Chen Hu and Naima Kaabouch, pages 115-149, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Bakhtouchi, A., Bellatreche, L., & Ait-Ameur, Y. (2011). Ontologies and functional dependencies for data integration and reconciliation . In De
Troyer, O., Bauzer Medeiros, C., Billen, R., Hallot, P., Simitsis, A., & Van Mingroot, H. (Eds.), Advances in Conceptual Modeling. Recent
Developments and New Directions,(LNCS) (Vol. 6999, pp. 98–107). Berlin: Springer. doi:10.1007/978-3-642-24574-9_13
Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A., & Paraboschi, S. (2001). Designing data marts for data warehouses. ACM Transactions on
Software Engineering and Methodology , 10, 452–483. doi:10.1145/384189.384190
Chen, Z. (2001). Intelligent data warehousing: From data preparation to data mining . Boca Raton, FL: CRC Press. doi:10.1201/9781420040616
Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., & Welton, C. (2009). Mad skills: New analysis practices for big data. Proceedings of the VLDB Endowment, 2(2), 1481–1492.
Collier, K., & Highsmith, J. A. (2004). Agile data warehousing: Incorporating agile principles. Business Intelligence Advisory Service Executive
Report, 4(12).
dell’Aquila, C., Di Tria, F., Lefons, E., & Tangorra, F. (2009). Dimensional fact model extension via predicate calculus. InProceedings of the 24th
International Symposium on Computer and Information Sciences (pp. 211-217). IEEE Press.
dell’Aquila, C., Di Tria, F., Lefons, E., & Tangorra, F. (2010). Logic programming for data warehouse conceptual schema validation . In Pedersen,
T. B., Mohania, M. K., & Tjoa, A. M. (Eds.), Data Warehousing and Knowledge Discovery (LNCS) (Vol. 6263, pp. 1–12). Berlin: Springer.
doi:10.1007/978-3-642-15105-7_1
Di Tria, F., Lefons, E., & Tangorra, F. (2011). GrHyMM: A graph-oriented hybrid multidimensional model . In De Troyer, O., Bauzer Medeiros, C.,
Billen, R., Hallot, P., Simitsis, A., & Van Mingroot, H. (Eds.), Advances in Conceptual Modeling. Recent Developments and New
Directions (LNCS) (Vol. 6999, pp. 86–97). Berlin: Springer. doi:10.1007/978-3-642-24574-9_12
Di Tria, F., Lefons, E., & Tangorra, F. (2012). Hybrid methodology for data warehouse conceptual design by UML schemas. Information and Software Technology, 54(4), 360–379. doi:10.1016/j.infsof.2011.11.004
Elmasri, R., & Navathe, S. B. (2010). Fundamentals of database systems (6th ed.). Reading, MA: Addison-Wesley.
Ferilli, S., Basile, T. M. A., Biba, M., Di Mauro, N., & Esposito, F. (2009). A general similarity framework for horn clause logic. Fundamenta Informaticae, 90(1-2), 43–66.
Giorgini, P., Rizzi, S., & Garzetti, M. (2008). GRAnD: A goal-oriented approach to requirement analysis in data warehouses. Decision Support Systems, 45, 4–21. doi:10.1016/j.dss.2006.12.001
Golfarelli, M., & Rizzi, S. (2009). Data warehouse design: Modern principles and methodologies . New York: McGraw-Hill Osborne Media.
Hakimpour, F., & Geppert, A. (2002). Global schema generation using formal ontologies . In Spaccapietra, S., March, S. T., & Kambayashi, Y.
(Eds.), ER (LNCS) (Vol. 2503, pp. 307–321). Berlin: Springer.
Helfert, M., & Von Maur, E. (2001). A strategy for managing data quality in data warehouse systems. In E. M. Pierce & R. Katz-Haas (Eds.), Sixth
Conference on Information Quality (pp. 62-76). Cambridge, MA: MIT.
Jiang, L., Xu, J., Xu, B., & Cai, H. (2011). An automatic method of data warehouses multi-dimension modeling for distributed information
systems. In W. Shen, J.-P. A. Barthès, J. Luo, P. G. Kropf, M. Pouly, J. Yong, Y. Xue, & M. Pires Ramos (Eds.),Proceedings of the 2011 15th
International Conference on Computer Supported Cooperative Work in Design (pp. 9-16). IEEE.
Levine, J. (2009). Flex & bison text processing tools . Sebastopol, CA: O'Reilly Media.
Mazón, J. N., & Trujillo, J. (2009). A hybrid model driven development framework for the multidimensional modeling of data
warehouses. SIGMOD Record , 38, 12–17. doi:10.1145/1815918.1815920
Mazón, J. N., Trujillo, J., & Lechtenbörger, J. (2007). Reconciling requirement-driven data warehouses with data sources via multidimensional
normal forms. Data & Knowledge Engineering, 63, 725–751. doi:10.1016/j.datak.2007.04.004
Mazón, J. N., Trujillo, J., Serrano, M., & Piattini, M. (2005). Designing data warehouses: From business requirement analysis to
multidimensional modeling . In Cox, K., Dubois, E., Pigneur, Y., Bleistein, S. J., Verner, J., Davis, A. M., & Wieringa, R. (Eds.),Requirements
Engineering for Business Need and IT Alignment(pp. 44–53). Wales Press.
Muñoz, L., Mazón, J. N., & Trujillo, J. (2009). Automatic generation of ETL processes from conceptual models. In I.-Y. Song & E. Zimànyi
(Eds.), DOLAP 2009, ACM 12th International Workshop on Data Warehousing and OLAP (pp. 33-40). ACM.
Phipps, C., & Davis, K. C. (2002). Automating data warehouse conceptual schema design and evaluation. In L. V. S. Lakshmanan (Ed.), DMDW:
CEUR Workshop Proceedings, Design and Management of Data Warehouses (pp. 23-32). CEUR.
Romero, O., & Abelló, A. (2010a). Automatic validation of requirements to support multidimensional design. Data & Knowledge
Engineering , 69(9), 917–942. doi:10.1016/j.datak.2010.03.006
Romero, O., & Abelló, A. (2010b). A framework for multidimensional design of data warehouses from ontologies. Data & Knowledge
Engineering , 69(11), 1138–1157. doi:10.1016/j.datak.2010.07.007
Romero, O., Simitsis, A., & Abelló, A. (2011). GEM: Requirement-driven generation of ETL and multidimensional conceptual designs . In
Cuzzocrea, A., & Dayal, U. (Eds.), Data Warehousing and Knowledge Discovery (LNCS) (Vol. 6862, pp. 80–95). Berlin: Springer.
doi:10.1007/978-3-642-23544-3_7
Sure, Y., Erdmann, M., Angele, J., Staab, S., Studer, R., & Wenke, D. (2002). OntoEdit: Collaborative ontology development for the semantic web.
In I. Horrocks & J. A. Hendler (Eds.), International Semantic Web Conference (LNCS), (Vol. 2342, pp. 221-235). Berlin: Springer.
Thenmozhi, M., & Vivekanandan, K. (2012). An ontology-based hybrid approach to derive multidimensional schema for data
warehouse. International Journal of Computers and Applications ,54(8), 36–42. doi:10.5120/8590-2343
KEY TERMS AND DEFINITIONS
Conceptual Representation: A local representation of concepts and their relationships that is valid only in a given context.
Data-Driven Approach: A data warehouse design strategy that produces multidimensional schemas by detecting facts and dimensions in structured data sources.
Hybrid Approach: A multidimensional design strategy that takes into account both user requirements and data sources.
Logical Program: A program composed of facts and deductive rules, expressed in predicate calculus.
Ontological Representation: A formal, general, and widely shared representation of concepts and their relationships in a specific domain.
Reengineering Process: The process that modifies a relational schema by introducing and/or deleting attributes and functional dependencies.
Requirement-Driven Approach: A data warehouse design strategy that produces multidimensional schemas by transforming information requirements into multidimensional concepts.
Schema Integration: The process of defining a global, reconciled schema by resolving syntactic and semantic inconsistencies among data sources.
Workload Validation: The process of checking whether a schema can support the set of queries representing the analytical workload.
CHAPTER 24
Visualization in Learning:
Perception, Aesthetics, and Pragmatism
Veslava Osinska
Nicolaus Copernicus University, Poland
Grzegorz Osinski
College of Social and Media Culture, Poland
Anna Beata Kwiatkowska
Nicolaus Copernicus University, Poland
ABSTRACT
Visualization is currently used as a data analysis tool and is considered a way of communicating knowledge and ideas in many areas of life such as science, education, medicine, marketing, and advertising. The chapter contains complex interdisciplinary material and attempts to construct a general framework of the roles of visualization in learning. The structure of this content is presented in Figure 1 as a kind of mind map. The authors emphasize that visualization mechanisms are designed with analytical and content-related potential, timeliness, online availability, and aesthetics in mind, yet classical (tabular) forms still remain dominant in information presentation. A good portion of the discussion is dedicated to alternative solutions – non-linear layouts such as networks or fractals. Several visualization maps with specifically designed architectures are presented as important elements of contemporary education. The authors consider the potential and implementation of these tools in e-learning platforms. In parallel, they underline the role of interdisciplinary collaboration in map-making processes. Researchers in different fields can apply contemporary trends in visualization, including natural shape perception, 3D representation problems, and aspects of neuroaesthetics.
INTRODUCTION
Visualization methods are currently used for scientific and scholarly presentations and are regarded as communication tools in interactive web applications. At the same time, we must take into consideration the increasing role of mobile technology in communication processes. Visualization methods in education are still underestimated. In most cases, numerical data are presented in tabular form or by two-dimensional graphs and charts. A widening gap has appeared between classic forms of information presentation and the users (students) who rely on new technologies, with a particular emphasis on mobile devices. This can be observed in e-learning, where the emotional component provided by direct communication is lost. Introducing special graphic modules closely tied to new achievements in cognitive science should fill this gap.
The usability of visualization mechanisms depends on several factors, namely their analytical and content-related potential, timeliness, online availability and, of course, aesthetics. Many information visualization (Infovis) projects available on the Web meet these criteria. They provide their users with instant feedback and data manipulation options and offer social collaboration in visual analysis. The virtual exhibition Places & Spaces1 is an interdisciplinary portal for researchers concerned with mapping scientific domains and human activity across global history. The visual explorer IBM ManyEyes2 allows users to visualize their own raw data and to share results and interpretations.
In this chapter, the authors try to show that data visualization techniques are being applied more and more widely in education. Being not only simple forms of visualization but also colorful and often interactive maps, they become a perfect teaching tool. Visualization is not just a methodology that originates from computer graphics and data analysis and is applied in many areas of life (science, education, medicine, marketing, etc.); it is also an effective and popular way of communicating knowledge and human ideas. Thus, visualization is becoming more and more important for today's users – students – readers – consumers. This chapter contains complex interdisciplinary material and attempts to construct a general framework of the role of visualization in learning. This diversity of content requires a special (non-linear) form of representation. Instead of a classical table of contents, a conceptual map with graphical explanations is included in Figure 1. The map shows the issues discussed in the chapter and the history of their origin – first in the authors' minds and then on paper. The sequence, mutual connections, and similarity between specific issues are indicated by arrows or by proximity. Based on the study of visualization, various aspects of the perception of information structure are discussed, as well as science maps and visual elements used in education. Special attention is given to the creation of new qualities of knowledge structures and to dynamic processes in the space of the mind (Duch 1998).
Figure 1. Veslava Osinska. Content map (instead of a table of contents of the current chapter). (© 2014, V. Osinska. Used with permission).
Image perception is discussed here from the user's point of view, including both the ergonomics of the placement of information resources and the aesthetic structure of the presented images. In recent years many works on that subject have appeared, and a rich collection of results from scientific experiments is already available.
Fractal structures are exceptionally intuitive in perception and reception because they originate from (or resemble) nature. This explains why fractal-like visualizations are perceived better; visual communication messages should be constructed by following such patterns. Nature has solved this neuroscience-based issue with fractal structures, which are easy to compute iteratively and which express structural complexity in an aesthetically pleasing form. The discussion focuses on visualization methods as a tool for building a bridge between natural perception and the construction of educational materials, and such methods are already used in scholarly presentations and as communication tools in interactive web applications.
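As a minimal illustration of how a visually complex, fractal-like structure can emerge from a very simple iterative rule, the sketch below uses the well-known Mandelbrot escape-time iteration; this particular fractal is chosen purely for illustration and is not prescribed by the chapter.

MAX_ITER = 80

def escape_time(c, max_iter=MAX_ITER):
    """Iterate z -> z*z + c and count the steps until |z| exceeds 2."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return n
    return max_iter

# Render a coarse text "map" of the set: denser characters mark slower escape.
chars = " .:-=+*#%@"
for im in range(12, -13, -1):          # imaginary axis, 1.5 .. -1.5
    row = ""
    for re in range(-40, 21):          # real axis, -2.0 .. 1.0
        n = escape_time(complex(re / 20.0, im / 8.0))
        row += chars[n * (len(chars) - 1) // MAX_ITER]
    print(row)

Even this handful of lines produces a boundary of unbounded structural richness, which is exactly the property that makes such shapes perceptually engaging.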
BACKGROUND
Visualization as a Map
The term ‘map’ is usually associated with cartography and those are the first maps that we come to know. They describe the world at a two-
dimensional level using a specific language. They use a flat, colorful image to show a great amount of information about the world that we
cannot see directly, but which we can imagine. In the space of our mind we can create an image of the world presented on a map. Children make
their first journeys as ‘armchair travelers.’ This is how they get to know names of places, identify types of landscapes and distances. Technical
geographic maps provide information on climate, economy, social and economic conditions, and many other details. The world that has been
confined to a piece of paper is an indispensable element in the process of school education all over the world.
However, maps abandoned their original, geography-related environment a long time ago. They first appeared naturally in history-related sciences where, just as in geography, they described worlds that no longer exist and showed towns and societies that are no longer present. Nevertheless, in the learning process, during classes or individual study, these no-longer-existing worlds come to life in students' minds, so students can identify themselves with places and events from the past. Every teacher knows that the analysis of a complex problem should start from working with a map, which provides an insight into the whole problem. Maps enable a detailed scrutiny of each element in reference to various factors, both those included in the map and those that students have learnt before. The effectiveness of teaching and learning with the use of good maps is unquestionable, and every teacher and learner realizes it.
How can classic mapping techniques be used now, in the 21st century, to serve teaching and learning beyond geography and history? In the
early 1980s, atlases related to many kinds of science started to appear, besides geographical and historic ones. They related to philosophy,
mathematics, physics, chemistry, and many other domains. In a classic way, maps and graphs attempted to show the entire systems and
methodologies of a given science field. However, they were encyclopedic in nature; it was difficult to find a science map similar to the
geographical one that would be not only interesting to students but also enable them to make individual multi-variant analyses of the presented
content.
Problems Related to Computer Visualization
The development of computing methods has made it possible to facilitate and improve that process. Visualizations made on paper have such limitations as a lack of interaction, so users can neither manipulate the data nor play with different configurations of it. Computers have introduced many possibilities and challenges into communication with users. The first interactive geographical atlases had already appeared before GPS technology became popular. They allowed dynamic viewing of maps on one screen. Limitations resulting from the two-dimensional nature of paper disappeared, but others came into view. Excess of information – redundancy – became a problem. Dynamic maps contained so much information that the user could feel lost. The great amount and diversity of information hindered a quick analysis of a specific problem. The informative factor outweighed the analytical one. The first multimedia computer atlases basically became huge lexicons of data, except that the text there was more often replaced by image, film, or sound. The introduction of GPS has actually only exacerbated the problem.
Maps include information from the real world, which is however ephemeral and temporary in nature, such as information on restaurants,
gasoline stations, or tourist attractions. This does not contribute much to the analytical and creative problem solving. In spite of the application
of computer technologies, the potential analytical capability of maps has not increased, only their informative nature expanded. The situation is
similar in other sciences where computer atlases – map collections – have expanded their informative content, but have not created a new
analytical quality.
Educational Role of Maps
The application of maps in education has an entirely different nature than just a medium for a large collection of information. Everyone knows
that reading maps is something one has to learn. The current tabloidization of information means that analytical elements and semantic descriptions of images are becoming insignificant. All information is expected to be communicated directly, hence the great popularity of pictograms, even for expressing feelings. Emoticons used in text messaging have replaced reflection on one's own emotional state. The
Technical Report of the OECD’s Programme for International Student Assessment (PISA 2009) shows that in many developed societies of
Europe and the USA over 30% of society finds it difficult to understand weather maps presented in everyday TV news. One can actually notice a
growing simplification of communication forms. Five years ago, many TV channels would present weather maps with isobars and isotherms that
showed a direction and type of weather change. Nowadays, more and more weather maps include only primitive icons of sun, clouds, or rain. It
does not solve the problem but only makes it worse. Analysis of a well-prepared weather map not only tells us whether it is necessary to take an umbrella to work the next day, but also lets us know whether to expect weather changes during the following days, whether we will be affected by continental or maritime air masses, and many other details.
Cognitive Aspects
From the cognitive point of view, the problem is that the perception system is excessively loaded with a great amount of visual data while the modules responsible for content-related analysis are simultaneously disengaged. A similar problem has been noticed in the development of e-
learning technologies. Too much information given at the same time and often too easy access to that information result in users being passive in
their individual attempts at information analysis and remaining ‘viewers’ but not ‘readers.’ It is a typical behavior for the generation that grew up
in a TV culture without a possibility to interact.
The reception of a map should be comprehensive, treated as a global resonance between a preconception in the user’s brain and the presented
graphic content (Duch & Grudzinski 2001; Duch, 2007a). Simply speaking, a map should make us think. To make this possible, the user should be taught how to read maps. That is undoubtedly an educational issue, and it requires an interdisciplinary approach by many specialists. The education system should provide separate educational paths directed towards a deep analysis of the presented content, not just towards reading simple messages. Therefore, special attention should be given to the application of preconception
systems that already exist in users (Duch, 2007b). Preconceptions are our intuitive and stereotypical ideas about processes and objects that we
have never reflected upon, but which have appeared spontaneously during the course of our life. Sometimes we do not realize that they exist, but
a specific problem that we face activates those processes in our brain in the first place. When we walk around any art gallery and stop in front of a
selected painting, in a natural way we think about what that painting reminds us, and then we start to analyze the message it carries. A painting
always carries an informative and emotional load. The primary mistake of many educational methods is that they omit or diminish the meaning
of the psychological factor. It is, however, the emotional load that activates these preconceptual meanings, which are crucial to the user.
Childhood memories or personal experience are more significant than a semantic description that often accompanies a presented image.
Aesthetics of Visualization
The aesthetics of the graphic communication is becoming a highly significant element that must be adjusted to an individual user. It is especially
important during map reading. Different colors and a selection of textures can be very attractive to one user, but off-putting to another. It is
difficult to select a universal graphic design, but map presentation technology enables a wide application of various techniques as well as their
combination. If that specific syncretism is used within a proper scope, it will probably help to find the golden mean in selection of an adequate
graphic design for a presented map. This is evident in the versatility of techniques applied, for example, in the Places & Spaces exhibition. However, there is no easy recipe for a proper selection of a graphic design. Application of principles derived from Gestalt psychology is of course necessary but
definitely insufficient. It is difficult to assume that a process of the whole image perception is based just on a simple analysis of shapes, lines,
location, and configuration of specific shapes. The role of the limbic system in the image perception process is often underestimated, and this
influences the effectiveness of didactic communication. What still remains a great mystery is a huge emotional load that is stored in human brain
since the early childhood, and which has a decisive influence on the perception of the world around. We just ‘do not like’ some shapes and colors.
Different shapes can activate feelings of rejection or fear in our brain. On the other hand, positive associations are adopted easily; they activate
cognitive processes that can in turn trigger analytical processes. If the world around us does not have simple Cartesian shapes, why would students be expected to show interest in the quadrilaterals, tables, lines, and rows of figures that are often presented in course books and educational presentations? We should turn towards non-linear methods that show whole shapes and textures resembling real ones. They present
data in the form of multi-dimensional graphs that should be deeply analyzed. Such thorough analysis always requires a specific amount of time.
Nowadays however, immediate solutions are important, and students expect a quick solution to a problem. Can these two seemingly
contradictory goals be combined so that a quick creative analysis of a considerable database is possible? It seems that application of visual
methods can facilitate the achievement of the assumed goal.
VISUAL PERCEPTION
Vision of Shapes or Vision of Structures?
An analysis of visualization is impossible without taking into account the mechanisms of perception occurring in neural correlates of the human
brain (Duch & Diercksen, 1995). Very often, in their works on visualization authors focus on a model of perception directly resulting from the
principles of the Gestalt theory. The assumptions of that theory, especially ‘laws of grouping’ are presented as fundamental principles of visual
perception of complex objects. The success of the application of the Gestalt Law is unquestionable, both in some applications in computer
science, especially in designing graphic interfaces, and computer vision systems. However, in the process of education it is more important to
activate higher cognitive processes by involving visual perception mechanisms (Figure 2).
It is crucial to understand how not only new impressions, but also new ideas appear in the brain. These processes are undoubtedly dynamical,
and the fact that the final stage of creating a new idea is commonly described as a sudden flash shows how remarkably quick they are. Neural-dynamical concepts, which
result from psychological processes examined with neural imaging show a close relation between visual perception and the activity of neural
correlates in different parts of the brain – not only in the cortex area responsible for visual activity. These processes are characterized by very
quick dynamics, and their precise examination would probably require application of experimental methods combining genetic engineering,
nanotechnology, and fluorescent optics (Alivasatos, 2002). The study of dynamic conditions of neural correlates is a fast developing discipline.
These sets of neurons in the human brain take part in various processes, but time-related evolution makes some of their groups create permanent
schemes of links, which is manifested as repeated sequences of arousal; they are called attractors (Cossart, Aronov, & Yuste, 2003). Attractors
can become basic entities describing the behavior of human brain when thinking, remembering, or decision-making. However, they change so
quickly that the currently available experimental methods are not able to investigate them. Human beings can experience visual impressions as
an immediate resonance between a visual pattern and the attractor structure existing in the brain.
In the case of complex graphics, which any color map definitely is, it is not just the shape but also the structure of a perceived object that may be
important during visual analysis. There are many open source applications, which allow users to study an irregular structure of a big dataset. For
example, Gephi3 is a free software package for networks and complex systems that provides visualization and exploration in an impressive and instant way. As in Photoshop, but with data rather than images, users interact with representations; they manipulate structures, shapes, and colors to reveal hidden properties. Later sections present its features in more detail. The Gestalt theory, which deals with shape perception, will not be useful in solving that problem. As early as the 1950s, scientists advocated an approach different from the classic
theory of the description of shape perception (Allport, 1955). However, if we look at our personal experience, we can recall situations when we
spent a lot of time trying to discern a small detail in the analyzed image, but we were not able to see anything; and the other way round, we could
see a small difference in the image that nobody else was able to see. It was definitely the ‘fault’ of our internal attractors; if it were possible to
identify and train them in the right way, our perceptual abilities would surely improve immeasurably. Unfortunately, we are not able to do that yet; it is necessary to focus on methods that would help to approach this ideal state. Applying these methods 'blindly' would force the brain to seek
the most efficient solution.
How does visualization influence the user’s understanding? To answer this question we should take into account mechanisms of perception
occurring in neural correlates in the human brain. Both an eye and a visual cortex are powerful, parallel, wide-bandwidth processors directly
coupled with cognitive centers of information processing (Necka, Orzechowski & Symura, 2006). This indicates that vision and thinking are
closely connected during exploration of the world, and these two aspects are the reference points in the visualization research.
Imitation: Creation as a Cognitive Process
The natural human ability to "see" means that the brain receives a projection of an object from real life, which initially serves as an imitation of reality. Next, in a process that uses various modules, the brain compares that imitation with familiar objects available to it. After that, a new entity appears in the space of the mind and is recognized and manifested as the perceived object. Any theory of visual perception will be incomplete if it does not take into account the problems resulting from the linguistic representation that semantically describes the perceived image. To understand this, try to search in a web browser not for text but for a picture that you remember well as a visual object. Let's assume that
you remember a photo of you and your friends that was taken while sailing on a lake. You remember that you were standing next to each other
against a yacht named ‘Vision Queen.’ The water in the lake was light blue, and you could see some parts of the marina in the background. You
can remember that photo very well, but you are not sure about the names of your friends, and you cannot remember if it was five or six years ago.
How will you start looking for that photo? If you enter into the browser the words describing the view of the photo – yacht, lake, blue water –
these words will unfortunately describe millions of similar photos taken all over the world. You may of course, limit the scope of searching to a
given geographical place by entering the name of the lake, if you remember it. However, it is usually unsuccessful. A semantic description of a
photo does not always enable finding the correct image. Nevertheless, the brain can do it very well. It first compares familiar images, and then
‘attaches’ (although not always) semantic descriptions to them. Don’t you have in your album any photo of people and events that you actually
remember, but you are not able to name the persons or objects? This shows that in the brain there is usually a stronger visual representation than
a subsequent semantic description that appears at the stage of image description and analysis. Although we think that these processes take place
simultaneously and have common features, activating both the specific parts of the brain and cognitive activity related to the perceived objects,
they are indeed separate processes that should be investigated based on various models of behavior.
The process of perception, although not yet entirely investigated, can be studied by means of computer simulations. They are based on the
discoveries in the field of human brain construction and functioning. Application of computer technologies enables the study of natural process
of perception and recognition of real complex objects (Barres & Lee, 2014). Based on the simulation results and studies from experimental
psychology, it is possible to present the whole comprehension process schematically in the form of a simplified scheme (Figure 3). In the
presented model the visual cortex modules responsible for shape recognition are activated first. Representation of the observed objects is formed
in Visual Memory (VM) where interconnections and their mutual arrangement exist. These objects relate to the memorized shapes stored in Long
Term Memory (LTM). They finally create a memory structure that reveals the semantic content. Artefacts appearing in that place can resemble an
observed shape, but they do not reflect its semantics. This is the basis of optical illusions.
Figure 3. Veslava Osinska, A model that shows processes
occurring during map reading. Adapted from Barres and Lee’s
recent work (2014). (© 2014, V.Osińska, Used with
permission).
While reading a visualization map, cognitive perception structures are activated by means of analytical modules in a natural way, in contrast to
the forced use of verbal narrative alignment. By analyzing the scheme presented above and building on the authors’ own study (Osinska, Dreszer,
Osinski, & Gawarkiewicz, 2013), it is possible to draw conclusions about the process of reading and analyzing science maps. The most striking observation is the search for familiar problems, forms, or contents. When scientists view a map, they look for their own disciplines, but students focus
their attention first on vivid and colorful presentations and only then on the content. Such behavior is congruent with the perception model.
Scientists have a vast knowledge acquired for years, so they instinctively understand presentations that match the already existing knowledge
structures. On the other hand, students first react according to the ‘first impression’ rule, and then try to understand and analyze a selected map.
This observation is an important indicator for designers of maps and school illustrations. Graphics should be constructed so that the first-impression effect is exploited and directs the observer's attention along the path of deeper analysis. Web designers and information architecture specialists know and successfully use these rules in their major projects. But in the era of networked big data, attaining full functionality of web applications is not enough. Knowledge of current visualization techniques and their integration with content architecture contributes to
the social success of web services.
VISUALIZATION METHODS IN INFORMATION ARCHITECTURE
The major problem in practical visualization is the presentation of data structure. The natural way to display a set of numbers is to place them in rows and columns, i.e., to make a tabular arrangement. Tables, as the basic form of quantitative dataset representation, were known in ancient times. According to some sources (Few, 2009), the first tables were invented in Egypt to record astronomical observations, and the Greeks used the first multiplication tables. Today, tabular presentations remain fundamental in information architecture, particularly in Web services. Most contemporary website layouts are organized on grids. Such a pattern is characterized by high functionality and legibility because it enables the use of modular
units, which can be developed by editors, specialists and members of other teams. Website templates, which are usually classified by themes,
graphic styles, or color schemes, consist of these constructing blocks. There are also conventions of organization hierarchies regarding the
content. Users should feel comfortable with navigation through the web service structure, as well as exploration and searching its resources. It
means that a design should be consistent and predictable in terms of users’ needs (Lynch & Horton, 2008). A table as a standard in the Web
architecture meets these criteria. Tables expose non-linear cognitive properties due to vertical and horizontal parallel arrangements of data.
Users can comprehend the change and the relationship between data only if they perceive sequential or selected items from the dataset in one or
more groups. How does this part of the data relate to that part? How are they similar, and how are they different? What meanings do the data
carry if they are grouped or taken apart? Ralph Lengler and Martin J. Eppler (2007) presented the visual complexity of visualization methods in the form of a structured table. The project is called A Periodic Table of Visualization Methods (Figure 4). The inventors classified the properties of Infovis techniques according to a rational key, coding them by both color and the position associated with a chemical element in a periodic table
of elements. The authors expressed an “effort of defining and compiling existing visualization methods in order to develop a systematic overview
based on the logic, look, and use of the periodic table of elements.”
Figure 4. Veslava Osinska, An example of tabular organization
of information. The set of known visualization methods takes
form of a periodic table. Each component (cell) represents a
method/technique of visualization. A detail adapted from an
interactive Web application "Periodic Table of Visualization Methods" designed by Lengler & Eppler (2007)4. (© 2014,
V.Osińska. Used with permission).
Thus, it is possible to say that tables allow perception of information in a non-linear way. Designers try to extend the advantages of this simple
and intuitive organizational form. Rows and columns can be interchanged in pivot tables in Excel, which additionally provide a dynamic summarization of the grouped data – the name comes from the pivoting mechanism.
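As a rough illustration of the same pivoting idea outside Excel (the pandas library and the sales figures below are illustrative assumptions, not tools or data named in the chapter), grouped data can be summarized dynamically and the rows and columns of the summary interchanged at will:

import pandas as pd

# Hypothetical sales records used only to illustrate pivoting.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [120, 150, 90, 110],
})

# The same data summarized two ways: rows and columns simply swap roles.
by_region = sales.pivot_table(index="region", columns="quarter",
                              values="revenue", aggfunc="sum")
by_quarter = sales.pivot_table(index="quarter", columns="region",
                               values="revenue", aggfunc="sum")
print(by_region)
print(by_quarter)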
It is worth noting the differences between data visualization and information visualization. Data visualization is focused on measurable data, such
as human body medical examination results or geographical information system data. Information visualization, on the other hand, is
concentrated on abstract data such as text or hierarchical structures. The actual representation of information draws on the natural human ability to recognize images quickly. However, not all information can be reduced to a direct interpretation in the physical
world.
Infovis is a scientific discipline that seeks new graphic metaphors to present information, which lacks a natural and obvious representation. It
uses the achievements of related disciplines, such as scientific visualization, data mining, human-computer interaction, vision, visual perception,
and computer graphics. The idea of this discipline is well reflected in the following quotation: “The eye seeks to compare similar things, to
examine them from several angles, to shift perspective in order to view how the parts of a whole fit together” (Luther, Kelly & Beagle, 2005). Such
a visual analysis is focused on the process of understanding and discovering the meaning of data. Visualization techniques are used as one of the
most effective forms of data mining; they can usually find more correlations than typical statistical methods.
Some scientists consider information visualization in the context of knowledge management as a stimulus to its understanding. Then, a definition
in relation to that concept is: “Infovis – it is the process of knowledge internalization by the perception of information” (Dürsteler, 2007). As a
result, understanding can be interpreted as a continuum, which extends from primary data to wisdom, passing through information and
knowledge combined in the process of visualization.
Data are simple facts; when separated from the context they lose their potential to create knowledge representation in the user’s mind. However,
when formed in a proper structure adjusted to basic principles of human perception, they become more than just facts – they acquire a proper
representational context, and in combination with an aesthetic feeling they gain a new cognitive quality. Data become information provided that
they carry the information that we are able to understand and perceive as valuable.
Together with the development of computer technologies, engineers and scientists were intensively looking for new effective methods of data
structure visualization. Hierarchical trees began to be represented not as branches but as maps; one-dimensional typology was expanded to two dimensions. Ben Shneiderman (2009) introduced the treemap concept in the early 1990s as an approach to the information visualization of hierarchical structures. The interactive data visualization software was named TreeMap (Figure 5).
Figure 5. Veslava Osinska, Representation of hierarchy as a
traditional tree design and contemporary treemap (left).
Screenshot of Treemap’s application for file structure
management (right). (© 2014, V. Osinska. Used with
permission).
The treemap solution is based on nested rectangles whose areas are proportional to the size of the folders' contents (a minimal layout sketch follows this paragraph).
Another idea to move the catalogue tree structure into the two-dimensional space is a hierarchical scheme sketched by means of concentric
circles, in the middle of which there is primary information content while other concentric circles represent related information fields. Such
characteristics as general information content and type of data record are identified by segment angle and color, respectively. Another way of
visualization includes creating special maps, which show not only hierarchies but most importantly the structure. An individual structure can
show a natural dispersion of elements of a presented set. An example of such a map related to the visualization of different branches of computer
science is presented in Figure 6.
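The following sketch is a simplified slice-and-dice variant written only for illustration; the folder names and sizes are hypothetical, and the code is not the TreeMap software itself. It shows how nested rectangles with areas proportional to content size can be computed:

def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Recursively split the rectangle (x, y, w, h) among a node's children,
    alternating the split direction at each depth; areas stay proportional
    to the sizes stored in the tree."""
    if out is None:
        out = []
    out.append((node["name"], x, y, w, h))
    children = node.get("children", [])
    total = sum(c["size"] for c in children)
    offset = 0.0
    for child in children:
        share = child["size"] / total if total else 0.0
        if depth % 2 == 0:   # split horizontally at even depths
            slice_and_dice(child, x + offset * w, y, w * share, h, depth + 1, out)
        else:                # split vertically at odd depths
            slice_and_dice(child, x, y + offset * h, w, h * share, depth + 1, out)
        offset += share
    return out

# Hypothetical folder tree: sizes are the "capacities" to be visualized.
tree = {"name": "root", "size": 100, "children": [
    {"name": "docs",   "size": 60, "children": []},
    {"name": "photos", "size": 30, "children": []},
    {"name": "misc",   "size": 10, "children": []},
]}

for name, x, y, w, h in slice_and_dice(tree, 0, 0, 1.0, 1.0):
    print(f"{name:7s} rectangle at ({x:.2f}, {y:.2f}), width {w:.2f}, height {h:.2f}")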
Figure 6. Veslava Osińska, Computer Science literature
visualization according to main 11 thematic categories
represented by colors. The map represents classified articles
from ACM Digital Library. (© 2014, V. Osinska, Used with
permission).
Network science can also be applied – objects are presented as nodes and relations between them as edges. The most popular tool for network
drawing and exploration is Gephi – open-source software for interactive visualization of large collections of data. Its intuitive interface provides different layouts based on force-directed and multi-level algorithms, a framework of statistics and metrics, as well as manual manipulation of the graph and data. As a result, multicolor graphs with complex structures appear, constituting an example of large-scale data visualization that is easy to read (Figure 7). The analogy of such structures to fractal shapes in the context of visual perception is discussed in a sub-chapter below. The graphic description of such difficult issues as the development of concepts and paradigms in science is an example of the network approach to information presentation.
Figure 7. Veslava Osińska, Social network graph, generated in
Gephi (© 2014, V. Osinska, Used with permission)
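Gephi itself is a desktop application, but the same node-and-edge idea can be scripted. The sketch below uses the networkx library (an assumed, illustrative choice rather than a tool named in the chapter) to build a small, made-up collaboration graph and to compute a force-directed layout of the kind such tools render interactively:

import networkx as nx

# A small, hypothetical collaboration network: nodes are people, edges are ties.
G = nx.Graph()
G.add_edges_from([
    ("Ann", "Bob"), ("Ann", "Cid"), ("Bob", "Cid"),
    ("Cid", "Dee"), ("Dee", "Eve"), ("Eve", "Ann"),
])

# Force-directed (spring) layout: connected nodes attract, all nodes repel.
positions = nx.spring_layout(G, seed=42)

for node, (x, y) in positions.items():
    print(f"{node}: degree {G.degree(node)}, position ({x:.2f}, {y:.2f})")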
Science maps, a useful tool not only in education but also in the work of professionals, allow investigating mutual thematic relations and drawing conclusions about the complementarity of interdisciplinary research. In the era of unification of scientific research that we now live in, science maps are a priceless tool. The 19th-century division of sciences is sinking into oblivion.
Internet technologies have not only improved the access to research findings in various scientific circles, but most importantly they have allowed
those circles to become widely active in different science fields. Similarly, users such as students are less interested in which discipline the problem they analyze belongs to than in how to solve it.
Science map mining (Figure 8) enables an analysis of trends in future directions of research development in any region of the world, and – what
is significant – their comparison. Patterns appearing on the maps of disciplines and scientific specializations prove a growing interdisciplinarity
of science. It should be remembered that, according to Thomas Kuhn (1996), the creator of the concept of the paradigm shift in science, typical scientists
are not “objective and independent thinkers, but conservatives” because they apply knowledge that the theories they have been taught dictate. On
the contrary, users treat information in a practical way. It is up to the map designers whether their maps will be presented to users in an
aesthetically attractive way or whether users will be “closed” in the dry structures of tabular data in a classic way. Similar maps encompassing
different disciplines and using different graphic representations can be found in the Places & Spaces project.
On the other hand, while seeking new solutions, science researchers usually analyze the contents of scientific literature. They should not limit
themselves to the traditionally defined scope of a given discipline or tested and accepted methodologies. The study of sources coming from
thematically remote sciences may facilitate both finding surprising answers and broadening horizons. This, in turn, may definitely have a direct
influence on potential discoveries of new principles and theories.
VISUALIZATION IN LEARNING
Visualization of Knowledge
The history of human activity shows that visual representations support communication of ideas, concepts and information. To explore data and
to “examine them from several angles of view to shift perspective in order to view how the parts of a whole fit together” (Luther et al., 2005), not
only the eye but also the cerebral cortex are engaged. Visual representations facilitate perception of meanings of the data and relationship
between their parts. If the application interface is interactive, it additionally amplifies human cognition through actions such as data sorting and filtering. Generally speaking, visualization is graphic information at some level of abstraction, produced artificially rather than occurring naturally as in a physical landscape. Abstractions can be easily recognized. Visual forms substitute for textual content. They have intuitive attributes such as color, shape, size, position, texture, and luminance. The French cartographer Jacques Bertin, the initiator of the semiology of graphics, defined six such attributes as the main visual variables, the basic building blocks of human vision. Finally, visualization can map data into the space of visual attributes, representing it in a meaningful way. These visual characteristics express the importance of and dependencies between data. End users
apply them for the analysis of data from different perspectives.
One could say the purpose of visualization is to help users acquire knowledge. That property has been known for a long time and applied in
education as a supplementary learning tool. Visualization is indispensable for a teacher who wants to show something that cannot be observed in
the surrounding world. Thus, computer simulations based on 3D mapping and rendering algorithms help physics teachers demonstrate micro-objects, such as atomic structures or the interaction of molecules, as well as macro-objects, for example the birth of galaxies or the origin of black holes. Historical simulations introduce prehistoric eras and also show how the borders of countries changed over time after wars or under critical geopolitical
conditions.
Visualization can be used to reconstruct the past, such as dinosaurs, or to look at things that are difficult to reach, like the human skeleton. Since the National Science Foundation supported the initiative to encourage computer graphics for visualizing the inside of the human body, 3D imaging has become one of the dominant directions in visualization (McCormick, DeFanti, & Brown, 1987). Thus, to prove the usefulness of visualization to an academic audience, one should mention such scientific domains as History, Anthropology, Geography, Anatomy, Medicine, Physics, Biology and
Astronomy.
Mapping techniques such as mind maps are useful for representation of ideas and knowledge but also for quantitative data visualization. When
numerical information is presented visually, it gains a form, which allows users to have an insight into the data and catch differences, trends, and
exceptions. The traditional line or bar charts can provide a combined picture of relationships between data. It would be impossible to exhibit
these characteristics from the same data presented textually.
Gapminder5 is user-friendly software for visualizing large statistical datasets, which supports understanding world developments in socio-economic contexts. It is a useful tool for teachers because it provides a fact-based worldview by means of clear bubble charts animated over time, and it offers free collections of statistical data. Figure 8 shows a screenshot of the Gapminder interface. Students can interactively compare selected countries by economy, education, policies, and societies, and examine how they have changed during the last 200 years. For example, it is possible to conclude that (in 2007) math education in Japan was very effective in contrast to that of the USA, even though the two countries have similar income per person.
To explain functions of visualization, researchers often use the term insight. Seeing visualization in connection with insight is the major reason
for associating it with cognitive processes. Unfortunately, there is no clear-cut translation of this word into some other languages; scientists map
its meaning into a set of terms related to intuition, deep observation, careful look, or instant analysis or just perception.
Maps can serve for making a complementary analysis of the rate of changes in non-numerical information. Traditional linear data
representations – charts – are not sufficient tools in this case. Maps give multi-perspective analytical possibilities and larger interpretative
flexibility (Rafols, Porter, Leydesdorff, 2010).
Science, as a highly dynamic, varied, and unpredictable entity, is a frequent subject of bibliometric mapping. Science maps are used to describe
how specific disciplines or research fields are structured. There are some directions of the mapping study. One is focused on the representation of
collaboration ties within a scientific community. Another one attempts to show the structure of disciplines. There is also research oriented at
science dynamics and detection of future trends. The development of a scientific domain including its researchers is analyzed conceptually,
intellectually, and socially. A lot of useful visual resources on this topic are available in the collection of scientific maps Places & Spaces: Mapping Science (scimaps.org). One can find there conceptual, domain-related, and classic cartographical maps of science. It is not a trivial task to group knowledge visualization maps that vary in subject, methodology, data, and the authors' backgrounds. The exhibition team
guided by Katy Börner (2012) categorized those maps according to specifications like ‘Visual Interfaces to Digital Libraries,’ ‘Reference Systems’
or end users, for example: scholars, economic decision makers, and children.
A characteristic example of a science map that can be explored on the Places & Spaces service is the so-called UCSD (University of California, San Diego) map made by Richard Klavans and Kevin W. Boyack (2006, 2007), demonstrated in Figure 9. This is the most frequently cited and
demonstrated map of worldwide scientific knowledge, probably due to the completeness of the map in terms of bibliographic data. The map is
used globally and universally to visualize the current state of science. Its additional virtue is that the authors have modernized it using an
extended dataset and developed an interactive version so that users can play with different graphic configurations on a sphere6. This spatial
graph shows the global science divided into 13 color-coded disciplines. They are (from the left): Computer Science, Mathematics and Physics,
Engineering, Chemistry, Earth Sciences, Biology, Biotechnology, Medicine, Infectious Diseases, Health Science, Brain Science, Social Sciences,
and Humanities. The highest density indicates a rapid development of Medical Science and Brain Sciences. Computer Science and Engineering
are most commonly used in medicine. Social Sciences and Humanities are linked to Computer Science. An increasing use of information and
computing technology (ICT) in the Humanities is clearly visible. Epidemiology, as a separate category of Medical Science, shows the greatest isolation in
relation to other disciplines (Börner, Klavans, Patek, Zoss, Biberstine, Light, Lariviere, & Boyack, 2012).
When considering numerous examples of multivariate data mapping in the Places & Spaces collection, the authors relied on observations,
assuming that the maps of knowledge and the maps of science could extend and make e-learning methodology more effective. Science maps, as
an example of multidimensional presentation on a simple paper surface, constitute an interesting research material in terms of cognitive aspects
in learning processes (Osinska, Dreszer, Osinski, & Gawarkiewicz, 2013). The conclusions drawn from such observations depend not only on the age, profession, and interests of active viewers but also on their artistic sensibility. Undoubtedly, visualization stimulates learning about
the current state of knowledge. Maps also have an educational value as such applications include interaction mechanisms. The rules of human
perception and understanding are applied in designing a visual interface. Particular attention should be paid to how the brain processes color and
complex shapes (e.g., fractals). Certainly, color is not an absolute factor on which the analysis and design of a visual message can be focused. Because of large individual differences, one should be cautious in anticipating a desired effect. The perception of color changes with age and also
depends on cultural patterns of the users (Chen, 2012; Ware, 2004).
Modern, sophisticated visualizations generated through dimension-reduction and mapping algorithms are a kind of graphic layout allowing for multifaceted and objective data analysis. They have long been used in advanced exploratory analysis and are now an indispensable stage in data mining. Within the existing classification, the term visual mining accurately reflects its scientific-empirical objective and its interaction with the user, including feedback (Soukup & Davidson, 2002). Nowadays, when marketing is increasingly encroaching on scientific activity, visualization methods appear to be most appropriate. While designing, one should ensure not only a good topological representation but also pay attention to the aesthetic context of the visual message. It would seem that this is an area reserved for artists.
Visualization in Computer Science Lessons
New graphic forms and animations, as well as appropriately selected visualizations, are aimed at tackling teaching problems, especially when we think about teaching disciplines that are considered difficult, such as computer science. Information Technology is gaining increasing recognition; it is becoming a universal language of almost all disciplines and provides them with tools and development opportunities. The principal objective of educational literacy in terms of the '3R model' (reading, 'riting, and 'rithmetic) nowadays needs to be extended with computer literacy, called computational thinking, which includes algorithmic thinking, problem-solving, and the skill of programming,
applied in all areas of human activity. However, including programming into the educational canon at almost any level of formal and informal
education essentially involves devising new methods, which will allow for acquiring the skills of this difficult discipline more easily and at an early
age. Once again, what comes to our aid here is visualization, which by combining learning and playing allows students to learn more easily and
quickly, making it possible for them to comprehend complex issues. Steve Jobs once said, "Everybody in this country should learn how to program a computer… because it teaches how to think." These words have become the motto of the global movement The Hour of Code7, reaching tens of millions of students in 170+ countries (hourofcode.com/). In 2013, Mark Zuckerberg developed the Angry Birds Hour of Code game (Rodgers, 2013); he used characters known to the youngest users from the Angry Birds game to teach programming. Increasingly complex commands of a programming language (JavaScript) are acquired as the angry bird solves puzzles in order to catch a
green pig. The commands of the programming language are written in a simplified form, as a combination of interrelated blocks. This way, the
instructions of repetition or decision-making, which are usually hard to understand or write down at the early stages of learning, become user-
friendly and can be properly formulated with little effort by anybody. Since December 2013, almost 37 million people from all over the world have participated in the Hour of Code, and new ones are still coming to solve its tasks. The users' opinions available on the project's websites confirm
the fact that the friendly visualization of a difficult programming language has breached the barriers and made many people, not necessarily the
ones with an Information Technology background, believe that they too can program a computer.
When solving complex tasks, visualizations may evolve towards a simpler model, which makes it easier to notice the properties leading to a solution. One option is to introduce a suitable graph model in order to present the relations between pieces of data. As early as 1736, Leonhard Euler8 used this method to solve the problem of the bridges over the Pregolya River in Konigsberg9. Modern technology allows observing a satellite image of the Konigsberg bridges, for example by using Google Maps, and creating a graph model on its basis (Figure 10), in order to then reason about the visualization created by means of such an abstract model. On the basis of the graph obtained, it is easy to see when it is possible to cross each bridge exactly once and return to the starting point: in a connected graph this is possible exactly when every vertex has an even number of edges attached to it.
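A minimal sketch of this degree check, with the four land masses and seven bridges of the historical Konigsberg layout encoded as an illustrative edge list:

from collections import Counter

# Historical Konigsberg: land masses A, B, C, D joined by seven bridges.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

# Count how many bridge ends touch each land mass (the vertex degree).
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd = [v for v, d in degree.items() if d % 2 == 1]
# The graph is connected, so even degrees everywhere would suffice here.
print(dict(degree))   # A: 5, B: 3, C: 3, D: 3 -- all odd
print("Closed walk crossing every bridge exactly once is possible:", not odd)

Since all four vertices have odd degree, the program reports that no such closed walk exists, which is exactly Euler's conclusion.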
E-Learning Visual Solutions
E-learning materials should be designed in order to activate students, with particular attention to faceted visual messages, such as interactive
maps and illustrations, exploratory statistics (like Gapminder, 2014), and inference on demand. E-learning services designers forget (or do not
know) the basic Infovis principle, “focus+context.” The information architecture of e-learning services is usually linear in structure; it has a
tabular layout. The central column is dedicated to educational resources such as files, links to websites and movies, quizzes, and students' activities. While working through the material in a classroom, the teacher opens new content sections and students see an extended list. At the end of the course, they see a sequence of sections sorted by topic or according to a calendar. During the course, it is impossible to estimate what proportion of the whole content has already been covered – the 'focus+context' rule is not present here. If the teacher decides to make all themes visible at once, they are arranged in a long list, which is not ergonomic for browsing. Students have to scroll down through several screens, which does not help them evaluate the volume of the course content and discourages them from further learning. It would be more comfortable to use a non-linear architecture, for example a circle-segmented or spherical structure (Figure 11), instead of the linear one.
The Cisco Networking Academy10 educational program is an example of global online teaching. The Academy supplements the knowledge essential for learning how computer networks function with visualizations: texts, graphics, and animations. The nonlinear, interactive curriculum provides numerous attractive forms of teaching. The materials that the Academy provides online include very intuitive and simple visualizations, such as the one in Figure 12, which uses the world map to demonstrate the role of modern technology in creating online societies in which international borders and physical limitations are no obstacle.
Cisco Networking Academy also includes more advanced forms of teaching, created thanks to specialized visualizations. The Packet Tracer program allows students to work in a virtual laboratory, where any combination of computers, servers, routers, and switches can be created as virtual networks and the functioning of network devices can be emulated. Visualizing actual solutions together with simulating their operation allows students to learn the rules by which computer networks function, to analyze how information is sent through them, and to study even the most complex topologies and network designs. Figure 13 presents an example of a network topology that needs to be extended by adding a computer, assigning it a proper IP address, subnet mask, and default gateway, and connecting selected computers to ports.
Figure 13. Veslava Osinska. The scheme of a network topology
in Packet Tracer virtual laboratory (© 2014, V. Osinska. Used
with permission).
Having done that, one can test the correctness of the proposed solution by turning on a simulation, which entails sending a packet of data via the network and then observing the process. Visualization has in this case been supplemented with a simulation, creating a learning environment based on augmented reality (AR) (Wojciechowski & Cellary, 2010). By using these solutions, students turn from passive recipients into active participants in the teaching/learning process, learning by doing and experimenting.
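The address-plan check behind such an exercise can also be expressed in a few lines of code. The following is a minimal Python sketch using the standard ipaddress module; the addresses are hypothetical examples, not values taken from the Packet Tracer scenario.

# A minimal sketch: does a host and its default gateway share one subnet?
# The addresses below are hypothetical illustrations only.
import ipaddress

def same_subnet(host_ip, gateway_ip, netmask):
    """Check that a host and its default gateway lie in the same IP subnet."""
    host = ipaddress.ip_interface(f"{host_ip}/{netmask}")
    gateway = ipaddress.ip_address(gateway_ip)
    return gateway in host.network

print(same_subnet("192.168.1.20", "192.168.1.1", "255.255.255.0"))  # True
print(same_subnet("192.168.2.20", "192.168.1.1", "255.255.255.0"))  # False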
An inventive way of avoiding the traditional grid layout in e-learning systems was presented at the World Conference on Computers in Education (http://wcce2013.umk.pl/). An interactive mind map was used as a management tool for didactic resources (Figure 14). This open-source plugin for the Moodle platform has been demonstrated in a series of articles (Debska & Sanokowski, 2013).
Rigid linear architecture has a negative effect on e-learning systems, especially on mobile devices. Students' attention is drawn away from the overall structure of the course material, and they lose sight of the overall task and content orientation. Not only is the role of visual presentations in e-learning underestimated, but too much passive audio-video streaming is also used. This does not diminish the importance of video lectures, but it should be noted that the emotional component during online transmission is much smaller than during standard lectures (Szelag, Dreszer, Lewandowska, Medygral, Osiński, & Szymaszek, 2010). Using appropriate information architecture on e-learning websites, together with a parallel use of visualization methods, can effectively increase it. The rigid information architecture of e-learning platforms such as Blackboard, Moodle, or OLAT should be abandoned in favor of flexible, dynamic, non-linear forms of graphic layout like Prezi.
A typical e-learning system based on an appointed sequence of topics is perceived by students as boring, limited, and emotionally associated with the school learning model. The authors propose applying a dynamic, interactive interface, close to the natural spherical form of vision, that uses a fractal texture (Osinska et al., 2013). A pilot study has been carried out to explore a collection of scientific articles (Figure 15).
Non-linearity in the arrangement of e-learning resources was demonstrated in a 2D mapping approach. A tabular structure is rigid; knowledge maps presented in distinct topologies, in contrast, are built of self-similar shapes and resemble fractal structures. Graphic structures based on fractal patterns help students spontaneously focus their attention on difficult topics and therefore should be used frequently in e-learning systems at different education levels (Szelag, Dreszer, Lewandowska, Medygral, Osiński, & Szymaszek, 2010).
It should be noted that the tabular meme is firmly rooted in human consciousness: unknown, unclassified items, events, and phenomena are often successfully presented in the form of periodic systems (see Figure 4). The authors investigated similar and distinct research fields, as well as the organization of clusters, by means of the graphic patterns obtained. They also analyzed the dynamics of classification using data series for different publishing periods within a 10-year span. The results show that visualization of classified documents reveals the organization of digital library content and allows hierarchical thematic categories to be identified.
TRENDS IN VISUALIZATION
Natural Shapes in Perfect Visualization
One of the ways of activating the ‘first impression’ is to use natural shapes. Typical course books and educational materials are full of Cartesian
shapes. Charts, tables, quadrilaterals, and classic shapes are the elements that students directly associate with the educational process, which is
not always attractive and interesting. A natural shape should attract their attention. Such behavior has also been observed during the Places and
Spaces exhibitions. Daniel Zeller’s (2007) map entitled Hypothetical Model of the Evolution and Structure of Science, presented in Figure 16, has
been very popular.
As the author says, “This drawing conceptualizes science as layers of interconnected scientific fields through a stimulating and creative visual
language. Starting with the very first scientific thought, science grows outwards in all directions. Each year, another layer is added to the meteor-
shaped manifestation of knowledge. New fields emerge (blue), and established fields (brown) merge, split, or die. The cutout reveals a layering of
fat years that produce many new papers and slim years in which few papers are added. Each research field corresponds to a tube-shaped object”
(Places & Spaces).
Why does a seemingly totally unorganized and color-wise unattractive map attract so much attention? The answer should be sought in the already described properties of the human visual perception system. In seeking an adequate, complete visual resonance, fractal forms should be engaged. Fractals are non-Cartesian shapes, first described and named by Benoît Mandelbrot (1977) in 1975. Over the following forty years, thanks to the development of computer technologies, their position in many fields of science was established, although intensive theoretical and experimental study of their properties is still being conducted. Today nobody doubts that the actual natural world is fractal, from the micro-scale, i.e. the structure of the smallest biological organisms, through the animal and plant worlds, to geological, landscape, and cosmological structures.
Evolution-wise, the human brain must be naturally adjusted to the perception of fractal shapes, since on the savannahs and in the forests where man evolved and developed there were only fractal shapes; there were no quadrilaterals, triangles, or straight lines in the natural environment at that time. Therefore, the numerous and easily accessible fractal algorithms should be used in graphic designs and book illustrations to add variety or to create complete shapes similar to natural ones.
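One such easily accessible algorithm is the classic Barnsley fern, an iterated function system whose attractor resembles a natural plant shape. The sketch below is a minimal Python illustration, offered only as an example of how little code a natural-looking fractal requires; it renders to a coarse text grid so that it stays self-contained.

# A minimal sketch: the Barnsley fern iterated function system, rendered
# to a text grid. The affine coefficients are the classic published ones.
import random

def barnsley_fern(points=20000):
    transforms = [  # (a, b, c, d, e, f, probability)
        (0.00,  0.00,  0.00, 0.16, 0.0, 0.00, 0.01),
        (0.85,  0.04, -0.04, 0.85, 0.0, 1.60, 0.85),
        (0.20, -0.26,  0.23, 0.22, 0.0, 1.60, 0.07),
        (-0.15, 0.28,  0.26, 0.24, 0.0, 0.44, 0.07),
    ]
    x, y = 0.0, 0.0
    for _ in range(points):
        a, b, c, d, e, f, _p = random.choices(
            transforms, weights=[t[6] for t in transforms])[0]
        x, y = a * x + b * y + e, c * x + d * y + f
        yield x, y

def render(width=60, height=30):
    grid = [[" "] * width for _ in range(height)]
    for x, y in barnsley_fern():
        col = int((x + 3) / 6 * (width - 1))    # the fern spans roughly x in [-3, 3]
        row = int((1 - y / 10) * (height - 1))  # and y in [0, 10]
        if 0 <= row < height and 0 <= col < width:
            grid[row][col] = "*"
    print("\n".join("".join(r) for r in grid))

render()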
Unfortunately, during school education students are taught that only objects with 'complete' (integer) dimensions can exist in the world. Teachers teach about such objects during geometry lessons, and this system of describing the world is present throughout the whole education system. It is thus difficult to break the existing schemes and explain, even to specialists, that Cartesian geometry is very important but does not describe the whole complexity of the world – because it cannot do so. Nature is more complex than the geometric models invented by man; fractal structures are constantly present in the human brain, coded in neuronal networks, because such structures have surrounded man for millions of years. Fractals are characterized by a non-integer dimension, and their symmetry is often broken, which confuses mathematicians but is very useful to neurobiologists. That feature can be used creatively to make suitable graphics for educational materials and to teach.
We are used to thinking that when an object is rotated by 360 degrees, the same image is always obtained. That is not the case with fractal structures. They emerge in the calculation process in computer memory, and when a rotation transformation is applied, a fractal rotated by 360 degrees does not have the same shape as the original element. Figure 17 shows the evolution of a fractal during the rotation transformation. It can be clearly seen that typical perception properties, which are easily explained with geometric objects, fail completely here. It is not possible to see the similarity of shapes that can easily be seen when rotating a square, for example.
Figure 17. Grzegorz Osinski. Fractal structure transformed in rotational symmetry generates completely new shapes. Graphics generated by means of Apophysis. (© 2014, G. Osinski. Used with permission).
What is more, a fractal rotated by 360 degrees is slightly different from the original one. A situation in which it is not possible to identify inverse transformations is well known to mathematicians; in the case of fractal graphics, however, it constitutes a difficulty when generating new shapes. From a practical point of view, it should be remembered that fractals are usually representations of dynamical processes; therefore, their static representations in the form of images are problematic. Nevertheless, the surrounding reality is not static either; it keeps changing, and photos taken of people or objects are just momentarily captured frames from a continuously changing landscape. The human eyes and brain act the same way. Despite the classic reductionist approach to the perception process, which is best explained using static images, the human perception system is accustomed to ever-changing structures. The mystery of fractals that do not return to their original shapes after a symmetry transformation is also a mystery of nature. It is worth quoting the Nobel Prize winner Wisława Szymborska, who wrote that "nothing can ever happen twice" (Wisława Szymborska, Calling Out to Yeti [Wołanie do Yeti], 1957).
Artists and poets can surely understand such difficult issues better when they create masterpieces of art or write poems. The language of mathematics and the paradigm of contemporary education are only now discovering and trying to understand phenomena that are already well known to the world of art. While painting, taking photos, or creating illustrations, the dynamics of a process is frozen for a moment. Fractal structures, however, cannot be frozen in this way, and thus Zeller's work presented during the Places & Spaces exhibition may have aroused such interest – it was both intriguing and interesting. The shape created by the graphic designer has all the basic features of a fractal structure. A correct application of that knowledge will definitely help to create maps with a natural texture that carry a greater amount of information and, at the same time, are more interesting and analyzed more deeply than classic shapes.
Since color has a significant impact on perception, we use a carefully controlled palette forcing the brain to draw upon previous experiences or
points of reference when viewing art; thus, each viewer has an individual response to a painting. It is not necessary for the viewer to have any
knowledge of fractals to make a connection to the presentation.
While comparing the shapes presented in Figure 16 and the map representation in Figure 7, similarities can definitely be discerned. Keeping in mind the symmetry breaking described above, one can conclude that the structure of network connections presented on the map will not have that characteristic either. It is difficult to predict the development of a network; its recovery, or an attempt to control it, may always fail. Due to fractal properties, it is not easy to control natural, spontaneous processes. Will such a representation, however, be easier for the user to adopt? Will the required faculty of analytical image reception develop in the space of the user's mind? Such questions can be answered in detail only by properly designed experiments based on a large statistical sample. Based on observation, it can already be said that during the study conducted on the participants of the Places & Spaces exhibition, maps with fractal shapes always aroused the greatest interest. The study participants replied in the survey that these maps 'attracted attention'. They therefore meet the condition of the 'first impression'; whether map designers will know how to use it further depends solely on properly applied data visualization techniques and the placement of data in an applicable context.
Obviously, the role of a guide or a teacher in the correct explanation of the map structure cannot be overestimated. It is the guide or the teacher
who should explain and lead the student through the further education process. These issues, however, constitute a totally different problem that
belongs to the methodology of education. Hopefully, it will be appropriately applied and will facilitate further development of knowledge
presentation methods.
3D Visualization: Another Dimension. Help or Hindrance?
We should consider whether popular 3D visualization technologies can be applied in map design. 3D television and cinema have now become standard. We have become used to the fact that when we wear special glasses our brain is tricked; we allow this and often like it. The truth is that 3D projections exploit our natural binocular vision and trick our eyes by delivering a slightly different image to each eye, using various techniques – polarization, chromatic filters, or mixed techniques. This technology requires complex electronic configurations supported by computer systems.
Technology-related difficulties can be avoided if, while creating 3D illusions, we use techniques applied by artists such as Manfred Stadler. As a street-art designer, he shows his works in public places so that viewers can take photos next to them. The 3D effect when viewing his works is achieved thanks to the author's spatial imagination and prior, complex perspective calculations. A graphic artist needs to know at what angle objects should be painted so that they seem three-dimensional when looked at from a specific distance. This can be achieved by using a simple linear perspective with a spatial transformation. The illusion of three-dimensionality can be observed when a real 3D object, such as a person, becomes an element of the layout. When the whole is captured in a photo, an ideal imitation of 3D space is achieved, because the previously described dynamic image analysis effect is eliminated (Figure 18).
Figure 18. Grzegorz Osinski. The kids ‘walking’ on 3D graphics
in a shopping mall. (© 2014, G. Osinski. Used with
permission).
Applying those techniques in designing maps and educational materials is difficult but possible. One can imagine a graphic that is a projection of the subatomic world, for instance, which students walk around during their physics class. Photos taken by students during such a lesson and shared on social networks would definitely become significant educational material, perhaps even more interesting than 'flat' illustrations in a book. However, it seems that a common application of such a technique is beyond the education system at present.
The situation is different for the presentation of data in the form of classic three-dimensional solids. Instead of designing maps on flat sheets, it might be better to create a three-dimensional model. Just as a globe is a three-dimensional map of the Earth, perhaps it will be possible to place a map showing a complex analysis of a large dataset on the surface of a sphere. The authors made such an attempt (Figure 19) and presented the results at a science conference – scientists showed interest, but proper spatial orientation became a problem (Osinska et al., 2013). It turned out that not everyone could naturally identify the North Pole with the top of the axis and the South Pole with the bottom. Has the widespread use of GPS, which changes the map orientation according to the direction of movement, created a stereotype in our brains that orientation is defined relative to our own position?
Have we forgotten which direction a compass shows and how a map should be oriented? Probably yes; however, the P&S exhibition also shows data mapped onto the surfaces of spheres, and these are very popular among students. Perhaps, thanks to contact with presentations in the form of a sphere that shows the actual shape of our planet, students will not have problems with spatial orientation and will remember what the term 'north' and its representation on a flat cartographic map really mean.
Neuroaesthetics Aspects
When we read about Rembrandt, who "loved what he painted and only painted what he loved," we can understand an emotional level of perception strictly connected with creative processes. Sometimes visualization begins to play the same role as an artistic presentation of large structural data. The users of a visual message should show behavior similar to that of art gallery visitors. The desired effect of communication is achieved when the user is deeply engaged during visual perception thanks to his or her aesthetic sensitivity. It has already been emphasized that the state of the brain is indivisible – phonology and semantics are not separate perception systems; they operate at the same time and are mutually coupled. In order to create visual communication in the right way, it has to be understood how current neuroscience attempts to describe the issue of aesthetic experience. This problem belongs to neuroaesthetics, which studies creative processes in art and tries to understand the mechanisms of the human brain during such processes. Neuroaesthetics is not just the study of artistic experiences; it also emphasizes the crucial influence of brain study on the understanding of human nature. The central point of neuroaesthetics is the study of the laws of perception to which the creation of art is subject, both at the level of creating and of viewing.
Studies within neuroaesthetics have been conducted for years; the beginning of the scientific development of neuroaesthetics is arbitrarily dated to the work of Semir Zeki (1999). These are interdisciplinary studies; however, a consensus among specialists from different disciplines is difficult to reach. Currently, there is an ongoing discussion among specialists regarding issues related to neuroarthistory. The pioneering work by John Onians (2008) has opened an entirely new chapter in the study of interaction with beauty in visual communication. Hopefully, the work continuing in a growing circle of scientists, especially art historians, book historians, and neuroaestheticians, will help to discover unknown dependencies and the historical implications of creating universal communication. The first attempts at a precise definition of art from the perspective of neuroscience were made by Ramachandran in his already classical works (Ramachandran & Hirstein, 1999; Ramachandran, 2010). Following his directions, especially 'Ramachandran's nine laws of aesthetics,' it is possible to create visual communication that follows classical rules of aesthetics. However, the principles of creating an aesthetic communication alone do not yet explain why visual artistic impressions are experienced in such a way. In recent years, new study results have been appearing that justify the proposed laws of aesthetics and thus show the complexity of the problem.
Perhaps it is the fractal structure that shows how the brain creates images in the space of the mind on the basis of visual impressions. This dynamic process is driven by perceptual experiences in the limbic system, which can be described as 'aesthetic impressions.' However, it should not be underestimated that one of the oldest structures of the brain, the limbic system, reacts to danger directly, bypassing the cortical pathways. That mechanism, which developed through evolution, allows danger to be avoided; however, its role in the analysis of images is by no means negligible. In the limbic system of the brain, information couples with the system of emotions. This path is probably responsible for the aesthetic experience of the viewed image. Before visual information reaches the neuronal structures located in the neocortex, where the analytical processing of information takes place, it will surely first activate the structures of the limbic system. This results in a specific emotional state, from the perspective of which further analysis is conducted. Thus, special attention should always be paid to maintaining a specific balance between the subject-related content of the communication and its aesthetic universality.
SUMMARY AND CONCLUSION
The authors show that visualization, in parallel with its popularity in web applications, is increasingly useful in learning. The authors concentrate on the role of maps as teaching tools during the education process. At the beginning of the chapter, a content map (Figure 1) visually introduces readers to the discussed issues and problems of visualization, covering knowledge communication through information maps, their practical application, the conception of e-learning architecture, as well as the cognitive, perceptual, technological, and contemporary-art aspects of Infovis.
The process of map reading can be divided into layers (Figure 3), each of which is responsible for the components of the analysis: image
characteristics grouping, shape recognition, third dimension recognition, and assembly of objects into the final scheme. Parallel identification of
the objects in the mind influences long-term memory (LTM), the development of semantic terms, and the process of creation of new ideas. In
many works on visualization, researchers refer to the visual perception model directly resulting from the Gestalt principles. We believe that the
rules defined by psychologists in the 1930s are now being overused, especially in visualizations and infographics, or wrongly interpreted. One
should see that during the visual analysis not only shape is significant, but also the structure of the viewed object or scene.
Tabular shapes dominate in information architecture. This is the structure of website layouts. E-learning courses also use this traditional
convention. The alternatives are non-linear layouts, e.g. network or fractal. Mind maps have non-linear architecture in principle and therefore,
implementation of such tools in e-learning platforms should be supported and developed. Special attention has been given to 3D visualization,
supported with interesting examples, including the discussion on its advantages and disadvantages in the correct interpretation of images.
Fractal structures are exceptionally intuitive in perception and reception because they originate from (or resemble) nature. This explains why fractal-like visualizations are perceived first. Visual communication messages should be constructed following such patterns. This typical neuroscience issue has been solved in nature by fractal structures – easy to compute iteratively, yet reflecting structural complexity in the form of aesthetic communication. The latter is the subject of neuroaesthetics – a cross-disciplinary research field related to art, history of art, cognitive and computer sciences, communication, and mathematics (Zeki, 1998, 1999; Onians, 2008).
It is not possible to mention all the issues and difficulties of visualization that arise in communication and learning processes today. This chapter attempts to bring together the practical aspects of visualization that currently occupy both Infovis researchers and practitioners. Data and/or information visualization has become an interdisciplinary methodology. The authors come from different professional backgrounds and majored in different subjects, but their specializations intersect in the space of visualization. Therefore, only those studies that the authors consider most significant in their area are presented.
This work was previously published in the Handbook of Research on Maximizing Cognitive Learning through Knowledge Visualization edited
by Anna Ursyn, pages 381-414, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
The authors wish to thank Wlodzislaw Duch for inspiration and practical suggestions.
REFERENCES
Alivisatos, A. P., Chun, M., Church, G. M., Greenspan, R. J., Roukes, M. L., & Yuste, R. (2012). The brain activity map project and the challenge of functional connectomics. Neuron, 74(6), 970–974. doi:10.1016/j.neuron.2012.06.006
Allport, F. H. (1955). Theories of perception and the concept of structure: A review and critical analysis with an introduction to a dynamic-
structural theory of behavior . John Wiley & Sons Inc.
Barres, V., & Lee, J. (2014). Template construction grammar: From visual scene description to language comprehension and
agrammatism. Neuroinformatics , 12(1), 181–208. doi:10.1007/s12021-013-9197-y
Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., & Boyack, K. W. (2012). Design and update of a classification
system: The UCSD map of science. PLoS ONE , 7(7), e39464. doi:10.1371/journal.pone.0039464
Chen, C. H. (Ed.). (2012). Emerging topics in computer vision and its application. Hackensack, NJ: World Scientific Publishing.
Cossart, R., Aronov, D., & Yuste, R. (2003). Attractor dynamics of network UP states in the neocortex. Nature , 423(6937), 283–288.
doi:10.1038/nature01614
Debska, B., & Sanokowski, L. (2013). Automatic generation of mindmaps in courses implemented on Moodle platform. In Proceedings of 10th IFIP World Conference on Computers in Education (WCCE 2013). Torun, Poland: Nicolaus Copernicus University Press.
Duch, W. (2007). Creativity and the brain . In Tan, A.-G. (Ed.), A handbook of creativity for teachers . Hackensack, NJ: World Scientific
Publishing. doi:10.1142/9789812770868_0027
Duch, W. (2007). Intuition, insight, imagination and creativity. IEEE Computational Intelligence Magazine, 2(3), 40–52. doi:10.1109/MCI.2007.385365
Duch, W., & Diercksen, G. (1995). Feature space mapping as a universal adaptive system. Computer Physics Communications ,87(3), 341–371.
doi:10.1016/0010-4655(95)00023-9
Duch, W., & Grudzinski, K. (2001). Prototype based rules – A new way to understand the data. In IEEE Proceedings on Neural Networks, 3, 1858–1863.
Dürsteler, J. C. (2007). Diagrams for visualization. The digital magazine of InfoVis.net. Retrieved January 26, 2015, from http://www.infovis.net/printMag.php?num=186&lang=2
Few, S. (2009). Now you see it: Simple visualization techniques for quantitative analysis . Oakland, CA: Analytics Press.
Klavans, R., & Boyack, K. W. (2006). Quantitative evaluation of large maps of science. Scientometrics , 68(3), 475–499. doi:10.1007/s11192-006-
0125-x
Klavans, R., & Boyack, K. W. (2007). Maps of science: Forecasting large trends in science. In K. Börner & J. M. Davis (Eds.), 3rd iteration: The
power of forecasts, places & spaces: Mapping science. Retrieved January 26, 2015, from:
http://scimaps.org/maps/map/maps_of_science_fore_50/
Kuhn, T. S. (1996). The structure of scientific revolutions (3rd ed.). Chicago: University of Chicago Press.
doi:10.7208/chicago/9780226458106.001.0001
Lengler, R., & Eppler, M. J. (2007). Towards a periodic table of visualization methods of management. In Proceedings of the IASTED
International Conference on Graphics and Visualization in Engineering (pp. 83-88). Anaheim, CA: ACTA Press.
Luther, J., Kelly, M., & Beagle, D. (2005). Visualize this. Library Journal , 3(1). Retrieved from http://www.libraryjournal.com/article/
CA504640.html
Mandelbrot, B. (1977). Fractals: Form, chance and dimension . New York, NY: W. H. Freeman and Co.
McCormick, B. H., DeFanti, T. A., & Brown, M. D. (1987). Computer graphics. ACM SIGGRAPH, 21(6). Retrieved January 26, 2015, from
http://www.evl.uic.edu/core.php?mod=4&type=3&indi=348
Necka, E., Orzechowski, J., & Szymura, B. (2006). Cognitive psychology. Warszawa, Poland: PWN (in Polish).
Onians, J. (2008). Neuroarthistory: From Aristotle and Pliny to Baxandall and Zeki . New Haven, CT: Yale University Press.
Osinska, V., Dreszer, J., Osinski, G., & Gawarkiewicz, M. (2013). Cognitive approach to classification visualization. In A. Slavic, A. Akdag Salah, &
S. Davies. (Eds.), Classification & visualization: Interfaces to knowledge:Proceedings of the International UDC Seminar (pp. 273-281).
Würzburg: Ergon Verlag.
PISA. (2009). Technical report. OECD Publishing. Retrieved January 26, 2015, from: www.oecd.org/pisa/pisaproducts/50036771.pdf
Rafols, I., Porter, A. L., & Leydesdorff, L. (2010). Science overlay maps: A new tool for research policy and library management.Journal of the
American Society for Information Science and Technology , 61(9), 1871–1887. doi:10.1002/asi.21368
Ramachandran, V. S. (2010). The tell-tale brain: A neuroscientist's quest for what makes us human . W. W. Norton & Company.
Ramachandran, V. S., & Hirstein, W. (1999). The science of art: A neurological theory of aesthetic experience . Journal of Consciousness
Studies , 6(6-7), 15–51.
Soukup, T., & Davidson, I. (2002). Visual data mining: Techniques and tools for data visualization and mining . John Wiley.
Szelag, E., Dreszer, J., Lewandowska, M., Medygral, J., Osiński, G., & Szymaszek, A. (2010). Time and cognition from the aging brain perspective:
Individual differences . Eliot Werner Publication.
Ware, C. (2004). Information visualization: Perception for design . San Francisco, CA: Morgan Kaufmann.
Wojciechowski, R., & Cellary, W. (2010). Interactive learning environment based on augmented reality. Edu@kcja Magazyn Edukacji
Elektronicznej , 1, 42–48.
Zeki, S. (1998). Art and the brain. Daedalus, 127(2).
Zeller, D. (2007). Hypothetical model of the evolution and structure of science. In K. Börner & J. M. Davis (Eds.), 3rd iteration: The power of
forecasts, places & spaces: Mapping science. Retrieved January 26, 2015, from: http://scimaps.org
ADDITIONAL READING
Börner, K., & Polley, D. E. (2014). Visual insights: A practical guide to making sense of data. Cambridge, MA: The MIT Press.
Burkard, F. P., Wiedmann, F., & Kunzmann, P. (1999). DTV-Atlas philosophie. Warszawa, Poland: Proszynski i S-ka (in Polish).
Burke, J. (1985). The day the universe changed . London, UK: The London Writers.
Cairo, A. (2013). The functional art. An introduction to information graphics and visualization . Berkeley, CA: New Riders.
Chen, Ch. (2006). Information visualization: Beyond the horizon (2nd ed.). PA, USA: Springer Science & Business.
Glass, L., & Mackey, M. C. (1988). From clocks to chaos. The rhythms of life . Princeton, New Jersey: Princeton University Press.
McCandless, D. (2009). The visual miscellaneum: A colorful guide to the world's most consequential trivia. New York, NY: HarperCollins Publishers.
Morville, P., & Callender, J. (2010). Search patterns . Sebastopol, CA: O'Reilly Media.
Morville, P., & Rosenfeld, L. (2006). Information architecture for the World Wide Web (3rd ed.). Sebastopol, CA: O'Reilly Media.
Nielsen, J., & Tahir, M. (2001). Homepage usability: 50 Websites deconstructed . Indianapolis, IN: New Riders Publishing.
Osinska, V., & Bala, P. (2010). New methods for visualization and improvement of classification schemes: The case of computer
science. Knowledge Organization , 37(3), 157–172.
Schroeder, W., Martin, K., & Lorensen, B. (2004). The visualization toolkit (3rd ed.). Kitware Inc.
Spencer, J. P., Thomas, S. C., & McClelland, J. L. (Eds.). (2009).Towards a unified theory of development. Connectionism and dynamic systems
theory re-considered . Oxford, UK: Oxford University Press. doi:10.1093/acprof:oso/9780195300598.001.0001
Stewart, I., & Cohen, J. (1997). Figments of reality. Warszawa, Poland: Proszynski i S-ka (in Polish). doi:10.1017/CBO9780511541384
KEY TERMS AND DEFINITIONS
Attractor: A general activity function that describes large and complex sets of dynamical particles. In cognitive science, the activity of neural correlates can be identified by attractor systems. The attractor is defined as the smallest unit that cannot be decomposed. The general activity of the whole brain may be described as a system of multiple attractors.
Cartesian Shape: A classical geometric shape of integer dimension, such as a square, triangle, or circle, in opposition to a non-Cartesian shape with non-integer dimension.
Graph: A mathematical construction that consists of a set of nodes and edges. Nodes usually represent the investigated objects, while edges represent the links (relations) between them. The graph is a popular visualization method in many areas, such as science, business, medicine, and education.
Neural Correlates: A minimal set of general neuronal events and mechanisms sufficient for a specific activity of the brain. In visual systems we have correlates responsible for perception, idea creation, and other cognitive activities. In neuronal systems we have both types of transfer: information about visual perception and the energy of neural correlates.
Nonlinearity: A feature of a system or structure whose output is not directly proportional to its input – just as in the perception of the human visual system. In general, it is a practical consequence of Aristotle's statement: "The whole is greater than the sum of its parts."
Preconceptions: They are intuitive and preconceived ideas about processes and objects that we have never reflected upon, but which have
appeared spontaneously during learning or another activity.
Resonance: A very general phenomenon that occurs in many natural and artificial systems. In a general sense, it means that energy or information in a system is exchanged from one form into another at a particular rate.
Science Map: A graphical representation of scientific domains using bibliographic, bibliometric, and scientometric data. Science maps reveal the domain structure of science(s) and collaboration ties between researchers, such as co-authorship and co-citation. Eugene Garfield (1994, http://wokinfo.com/essays/scientography-mapping-science/) called these representations scientographs.
ENDNOTES
1 Places & Spaces. Mapping Science. Online exhibition of science maps at: http://scimaps.org
2 Many Eyes is a free site to upload, visualize, discuss, and share visualization datasets and results: http://www-
958.ibm.com/software/analytics/manyeyes/ and http://www.bewitched.com/manyeyes.html
3 Gephi is a free open source interactive visualization and exploration platform for all kinds of networks and complex systems. The webpage:
www.gephi.org
4 The Periodic Table of Visualization Methods – an interactive online application showing visualization techniques with examples:
http://www.visual-literacy.org/periodic_table/periodic_table.html
5 Gapminder is an open source software for statistical data visualization and animation: www.gapminder.org
6 Map of Science, available at: www.mapsofscience.com, is web service developed by SciTechStrategies and dedicated to a specific kind of
visualization – the UCSD maps.
7 The Hour of Code is a web service dedicated to developing students' programming skills: http://code.org
8 Leonhard Euler (1707-1783), Swiss mathematician and physicist, is considered the founder of graph theory.
9 Konigsberg in 1736 belonged to Prussia. At present it is called Kaliningrad and belongs to Russia.
10 Cisco Networking Academy Program helps students develop the foundational ICT skills needed to design, build, and manage networks. The
website at: http://cisco.netacad.net
11 Authors’ online application to explore computer science thematic categories and their dynamics: http://www-
users.mat.umk.pl/~garfi/vis2009v3/
CHAPTER 25
A Cognitive Analytics Management Framework (CAM-Part 3):
Critical Skills Shortage, Higher Education Trends, Education Value Chain Framework, Government Strategy
Ibrahim H. Osman
American University of Beirut, Lebanon
Abdel Latef Anouze
Qatar University, Qatar
ABSTRACT
The main objectives of the chapter are to evaluate the impact of the tsunami of big data, business analytics, and technology on the delivery and diffusion of knowledge around the world through the use of the Internet of Things, and to design future academic education and training programs. Global and local trends are analyzed to evaluate the impact of the digital tsunami on the delivery and diffusion of knowledge; to identify the shortage of critical skills, the drivers of challenges, the hot skills in demand, and salaries in big data/business analytics; and to highlight obstacles to making informed decisions. A CAM education framework is proposed to design customized higher education and training programs that address the current shortage and equip future generations with the relevant and rigorous skills to boost productivity growth and to have an impact on society and professional domains in the digital economy. Finally, new ideas are presented on how governments, academic institutions, technology companies, and professional employers can work together to reform the traditional education value chain and integrate "massive open online courses" to achieve mass diffusion of knowledge, to transform people from loyalty to parties, clergies, and dictatorships towards loyalty to society, and to develop a culture of shared value in a move towards a smarter and fairer planet in the 21st century.
INTRODUCTION
SAMSA - Shared values, Analytics, Mission, Activities, and Structures - was introduced in Chapters 1 and 2. SAMSA is a mission-based framework for creating sustainable cognitive knowledge development and a smarter planet in the 21st century. One of its objectives is the integration of the strategic management and performance measurement fields using cognitive frontier data analytics to measure shared values and guide process improvement at organizations. After the explanatory journey through the SAMSA components in the previous chapters, the roles of academic education institutions, governments, and society stakeholders in preparing future generations with the right critical skills remain to be explored. The aim of this exploration is to guide the development of strategic initiatives to unlock the value of the new abundant digital data using the new advances in management, science, and technology.
Using the research mindset of the previous chapters, reviews of the literature on relevant trends, change drivers, and skills shortages worldwide will be conducted. Best world practices will be identified to determine the essential building blocks of academic rigor and relevance in order to propose the Cognitive Analytics Management (CAM) framework for academic development. The CAM framework would help education institutions develop higher education programs based on strategic shared-value missions to serve social needs and to provide the critical skills in shortage in the fast-growing digital area. The essential components to consider when revising a mission will be highlighted in order to determine the desired strategic positioning on the shared-value spectrum and the offerings in degree programs. There will be no unique shared-value education model that fits all societal needs, but many customized ones based on regional needs and the interests of the academic departments that deliver the CAM initiatives. Since CAM is multidisciplinary in nature, several academic departments can host it according to their own specific focus. However, it is advisable to agree on the suggested common name (CAM) for the purpose of developing a strong unifying brand and avoiding the dilution that happened in the past to fields such as decision science, operations research, management science, and industrial engineering on one side, and artificial intelligence, expert systems, and cognitive systems on the other. Irrespective of this, the right CAM model, which will emerge from assessment, engagement, and consultation with the relevant stakeholders (academics, employers, advisory boards, partners, and policy makers, among others), would be better than relying solely on the traditional view of academic providers alone, who are partly to blame for the current shortage of critical skills. Finally, the chapter provides summaries of the main findings of the three chapters and concludes with a positive futuristic view of the world's development in the 21st century.
GLOBAL EDUCATION CHALLENGES AND TRENDS
McKinsey's Center for Government published in January 2013 a report on "Education to employment: Designing a system that works." It is based on an analysis of more than 100 education-to-employment value chain initiatives from 25 countries and of stakeholders' views (McKinsey, 2013). The countries were selected on the basis of the innovation and effectiveness of their education initiatives. The report surveyed stakeholders, including youth, education providers, and employers, selected from nine countries (Brazil, Germany, India, Mexico, Morocco, Saudi Arabia, Turkey, the United Kingdom, and the United States). The report's findings are summarized in six main highlights, which are explained as follows:
5. Creating a successful integrator for the education-to-employment value chain system, with a specific set of tasks, new incentives, and structures. To increase the rate of success, the education-to-employment HE value chain system needs to operate differently. First, stakeholders need better data to make informed decisions about educational choices and to manage the performance efficiency and effectiveness of education processes, people, and systems. Second, the most transformative solutions are those that involve multiple providers and employers working together within a particular industry. Finally, countries need to have integrators, where each country will have a responsibility to take a high-level view of the entire heterogeneous and fragmented education-to-employment system. The role of the system integrator is to work with education providers, employers, and even students to identify key issues and to recommend skills development solutions. The integrator should gather data and disseminate positive success stories to assure the quality of HE outcomes and outputs. Such integrators can be defined by sector, region, or target population.
6. Education-to-employment solutions need to scale up to meet population growth. There are three changes, with one solution for each change. First, there are constraints on resources and investments, and education providers need to find qualified faculty for expansion. Coupling Internet technology and a highly standardized curriculum into a single solution can help to compensate for the faculty shortage and spread instruction in a consistent way at modest cost. Massive Open Online Courses (MOOCs) can be integrated within any standard program to enhance students' learning. Second, apprenticeships, which traditionally provided youth with hands-on learning experiences, were difficult to scale up. Technology, in the form of "serious game simulations," can now help in offering tailored, detailed, practical experiences to large numbers of students at comparatively low cost; it could become the apprenticeship of the 21st century. Third, employers are hesitant to invest in training students/employees unless the education involves specialized skills; they also do not want to spend money on employees who might take their expertise elsewhere. One proven approach is to offer a standard core curriculum complemented by engaging employers to provide specific top-up skills, achieving mutual benefits such as reduced post-graduation training thanks to graduates who are better prepared for direct employment.
Further, Fraser (2014) discussed the concept of smart education and listed five main drivers of change and five trends for smart education in the world. These drivers and trends are summarized below.
1. The Democratization of Knowledge: Universities are no longer the gatekeepers of knowledge; the combined learning held by their professors and libraries is no longer the sole source of learning.
2. The Digital Age: Digitization will transform education in the same way that it has transformed the media, newspaper, retail, banking,
and the entertainment industry. Digital delivery will replace classroom teaching for the overwhelming majority of learning situations.
3. A Global Marketplace: Universities will compete more intensely for students who are now better equipped to choose based on value
and quality. Governments will begin to “outsource” education provisions in order to use their resources more effectively in targeted
spending.
4. Industry Influence: Industries will negotiate greater influence on research and teaching in order to ensure their resource and research requirements are met. Universities will need to align themselves more strongly with industry if they wish to be the drivers of innovation and growth.
5. Ubiquitous Access and Virtual Mobility: Access to any and every university will increase whilst the requirement to be on campus will
decrease. Global competition will intensify. Traditional importers of education will seek to become exporters: universities dependent on
overseas students will need to re-invent their business models.
1. The Boom in Undergraduate Study: The share of adults receiving tertiary education has grown worldwide from 19% to 29%. By 2025, the number of university places required for students will have risen from 178 million to 262 million.
2. The Growth of Private Provision: Over the past 20 years, HE provision in Los Angeles has flipped from being predominantly public to mostly private. Investment from the private sector is the only means by which aggressive enrolment targets can be met at the speed of growth required.
3. Students Will Have to Pay Their Own Way: Average state funding for HE hit a 25-year low in 2011, whilst course fees increased 42% between 2001 and 2010. The perception of HE has shifted from a public good to a private good.
Fraser reported that 18% of college students in Australia left the institutions in which they enrolled in 2009-2010, 84.7% of first-year students cited being able to get a better job as the top reason for attending college, and 20% of students seeking a bachelor's degree require remediation. All these drivers and trends have pushed academic institutions to embark on efforts to leverage data through analytics across various operating functions.
First, presidents, provosts, and deans have used analytics to monitor activities across the university, to understand how the institution has performed, and to generate applied insights for improvement. Curriculum departments have used analytics to measure and understand degree requirements. Students and instructors have been tracked with analytics to assess performance year over year. Budgeting and finance departments have used analytics to measure and monitor revenues and expenses. Finally, curriculum development has measured the number of classes offered and the number of students in each class. In summary, digitization will transform education in the same way that it has transformed the media, retail, banking, and entertainment industries.
Regional Education Challenges and Trends
At its annual conference (FIKR12) in Dubai in December 2013, the Arab Thought Foundation (ATF) launched two strategic initiatives to promote job creation in the Arab world: vocational training using public-private partnerships for investing in Research and Development (R&D), and promoting the quality of education, as key pillars of the job creation strategy. A roadmap for strategic implementation was discussed to create 80 million jobs by 2020 in the Arab region. The first initiative of the strategy was a move-forward document to facilitate more investments to create jobs. The second initiative was to launch a classification system for Arab universities as part of measures to strengthen the quality of education. The rating system will be applied first in Saudi Arabia, Morocco, and Lebanon. The two initiatives were developed on the basis of discussions of a study report unveiled by the Arab Thought Foundation, 'Enabling Job Creation in the Arab World,' which was based on an extensive survey undertaken by the ATF across the region (ATF, 2013).
Also, Alpen Capital investment banking published a perspective on the Gulf Cooperation Council (GCC) education sector (Alpen, 2012). It examined the current industry status, key market dynamics, and the scope for future growth; it also covered the pre-primary, primary and secondary, tertiary, and special education segments in all GCC member countries: Saudi Arabia, the United Arab Emirates, Kuwait, Qatar, Bahrain, and Oman. The main findings are summarized as follows:
GCC Education Industry Outlook
1. The total number of students in the education sector in the GCC states is expected to grow at a Compound Annual Growth Rate (CAGR) of 2.7% between 2011 and 2016, to reach 11.6 million in 2016 (a minimal sketch of the CAGR arithmetic follows this list). The pre-primary segment will see a growth rate of 11.2%, followed by the tertiary segment at 4.8%, the primary segment at 1.7%, and the secondary segment at 1.6%.
2. Total enrolment in private schools will grow at a CAGR of 10.2% between 2011 and 2016, driven by a growing preference for the better education system in these schools. The share of students in private schools is expected to grow from 21.1% in 2011 to 30.4% in 2016.
3. The share of students in the pre-primary segment is expected to increase from 5.3% in 2011 to 7.9% in 2016; while that of tertiary segment
is expected to increase from 12.0% to 13.4% during the same period. However, the share of students in the primary and secondary education
segment is expected to decline from 82.7% in 2011 to 78.7% in 2016.
GCC Key Growth Drivers
1. The population in the GCC region is expected to increase at a CAGR of 2.5% between 2011 and 2013; while the share of expatriate
population is also likely to increase from 47.8% to 48.4% during the same period.
2. GDP per capita of the region is expected to grow at a CAGR of 2.6% between 2011 and 2016. This increase in individual income will allow the middle-class population to spend more on their children's education, hence driving demand for private-sector education.
3. There has been an increasing awareness about the quality of education as a result of a rising gap between education provided by private
and public schools. This quality gap is likely to push parents in GCC to lean towards private schools in the future.
GCC Key Challenges
1. The shortage of skilled teachers remains the biggest challenge for the GCC education sector. This shortage will pose a serious threat to private school operators' ability to maintain the current quality of education provision.
2. The enrolment rate in the HE segment remains very low compared to developed nations. Thus, there will be a continuous mismatch between the skills taught to graduates and the requirements of the labor market. This gap will be filled by skilled expatriates.
3. The region has witnessed a lack of employment opportunities for fresh graduates, mainly due to the absence of tie-ups between the education sector and private companies in the region. This lack will have an impact on the region's unemployment rate.
4. The strict financing environment in the region, particularly after the 2008 economic recession, has made it difficult to secure development funds for new ventures. Setting up a school involves not only a high capital requirement but also high running costs. Tuition fee increases, however, are regulated in some GCC states, and private schools there do not have the flexibility to change their fee structure in line with high operating costs, which adds to the investment challenge.
GCC Trends
1. An increasing number of GCC nationals are shifting their children from public schools to private schools as a result of better quality of
education provided by private schools.
2. Despite the higher fee structure, private schools offering an international curriculum are extremely popular among the growing expatriate GCC population, and the number of international schools has mushroomed in recent years.
3. Recently, the GCC has witnessed the establishment of several foreign-affiliated universities and branches of foreign universities, providing good potential to enhance the quality of HE in the GCC.
4. The GCC has witnessed the migration of students from other MENA countries, mainly on account of easy access to better-quality HE institutes in the region.
5. GCC has witnessed an increasing use of technology in the HE sector to improve the quality of teaching methods.
6. The GCC private education market is highly fragmented, thus providing substantial opportunities for existing operators to consolidate and develop economies of scale.
Earnings, Higher Education Quality, and Impact of Internet on the Future of Education
The US Bureau of Labor Statistics has recently published earnings and unemployment rates by educational attainment level for 2013 (BLS, 2013). Figure 1 shows how the unemployment rate decreases as the education level increases. It also shows how the increase in weekly income is correlated with the increase in education level.
Higher Education Quality and University Ranking: Lu (2014) carried out a simple regression analysis to determine the correlation between the number of World-Class Top 500 universities (WCTUs) and GDP per capita, as well as GDP growth. The results show that WCTUs per capita are strongly correlated with a nation's GDP per capita. However, WCTUs per capita have an insignificant effect on GDP growth. The results show an increase in the significance level when the ranking lists are expanded from the Top 100 to the Top 500. This suggests that, in order to attain a higher GDP per capita, it is crucial for any country to increase its number of WCTUs (listed in the Top 500) rather than to have a few elite WCTUs in the Top 100. Further, it was found that 'freedom from corruption' is the most significant institutional factor when institutional factors are added to the regression model, followed by 'property rights', 'business freedom', and 'investment freedom'.
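The kind of simple regression described here can be sketched in a few lines of code. The Python illustration below assumes the analyst supplies the two series (for example, WCTUs per capita and GDP per capita) as equal-length lists; no country data from the study is reproduced.

# A minimal sketch of simple (one-variable) ordinary least squares.
# The caller supplies the data; nothing from Lu (2014) is reproduced here.

def simple_ols(x, y):
    """Return slope, intercept, and R^2 of an ordinary least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Usage (with the analyst's own series, e.g. hypothetical placeholders):
# slope, intercept, r2 = simple_ols(wctu_per_capita, gdp_per_capita)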
Impact of the Internet on the Future: On June 28, 2014, The Economist published articles on trends and future developments of HE institutions. The article "Creative destruction: A cost crisis, changing labor markets and new technology will turn an old institution on its head" reported that the lifetime net present value of income for a US graduate with a college degree was estimated at $590,000. Despite this good value for money, the degree does not seem worth it in the eyes of many students, as 47% of students in the US and 28% in the UK did not complete their degrees. One of the reasons was the average 20% increase in tuition fees in the US, due to a 27% drop in government funding per student between 2007 and 2012. Woluchem and George (2014) reported that US student loans have reached a high of $1.1 trillion. The total was less than $200 billion in 2003, and its growth has exceeded that of both auto loans and credit loans since 2010; the rising trend since 2003 can be seen in Figure 2. This financial pressure is one of the main drivers of the emergence of new digital online education. Similarly, in the UK, tuition fees were zero two decades ago and have now risen to over $15,000. The article also listed two more drivers of change. The second driver of change is the labor market: according to an Oxford University study, "47% of occupations are at risk of being automated in the next few decades. As innovation wipes out some jobs and changes others, people will need to top up their human capital throughout their lives."
Regardless of the goal of MOOCs – be it profit or idealism – there are genuine educational concerns that need to be closely monitored by HE administrators, governments, and policy makers. It should be recognized that a course with 10,000 (or even 1,000) enrolled students cannot foster any significant discussion. Further, online programs are seen as a less expensive way of providing degrees, but few faculty members are trained to work with them. Technology experts do not create innovative MOOCs; the ideas, the delivery, and the assessment come from the forgotten faculty members. The recent changes are putting professors under more pressure to maintain excellence in teaching, research, and service, while their salaries are not growing with inflation to permit a middle-class lifestyle. If this trend continues, committed young faculty will leave the academic profession or not enter it at all. Hence, the academic profession should be taken more seriously; otherwise, we will truly be in danger of killing the goose that lays the golden egg, and perhaps MOOCs will then take over, as stated in Altbach & Finkelstein (2014).
In summary, the Higher Education business is about to experience a revolutionary change in order to benefit from technology advances. For instance, MOOCs could be integrated into the standard American four-year degree: students could spend an introductory year learning via MOOCs, followed by two years attending university and a final year starting part-time work while finishing their studies online. The top universities will be able to sell their MOOCs around the world, but mediocre universities may suffer the fate of many newspapers. It is also expected that universities' revenues would fall by more than half, employment in the industry would drop by nearly 30%, and more than 700 institutions would shut their doors in the US. The remaining universities would need to reinvent themselves to survive.
Further, Garcia (2014) focused on the social implications of the increase in student debt. It was found that debt makes students less likely to choose lower-paying careers; there is also a negative correlation between marital satisfaction and student loan debt. Although correlation does not necessarily mean causation, the evidence is suggestive rather than conclusive. One interesting finding is that each $10,000 in additional student debt decreases the borrower's long-term probability of marriage by 7 percentage points. Additionally, graduates are migrating to countries with higher salaries, leading to a brain drain of skilled people from their home countries. Formulating the right policy to deal with higher student loan debt is not easy, given the difficulty of discharging student loans in bankruptcy or refinancing them; for-profit universities do little to help students into employment; and there is a lack of predictive studies to help students choose HE programs in line with the needs of the future job market. Consequently, governments and universities should revisit the loan policy and replace loan-based financial aid with grants, as well as looking at the high spending on things that have little or nothing to do with enhancing education quality. The list includes deeply layered bureaucracy, luxurious administrative facilities, and the ultimate white elephants of higher education.
Hot Skills in Demand and Salaries in Big Data and Business Analytics
Collett et al. (2014) published in ComputerWorld a survey of the 2014 hot skills and their salary changes over 2013. They showed that the hottest demand is for people who have technical IT skills and are business savvy. Figure 3 shows the list of hot jobs in three classes of skills, in different shades/colors: data and business analysts; specialized management; and technical computing and engineering developers. They include the three intelligence outcomes of the INFORMS analytics framework: business information intelligence, business statistical intelligence, and business modeling intelligence. These skills and outcomes are leveraged using advances in information and communication technology (ICT) to capture organizations' big data through the integration of data from various sources. These data are then cleaned, processed, stored, and visualized for descriptive, predictive, and prescriptive analyses that transform the data intelligently into insights to support the decision-making process.
Figure 3. Salary of hottest jobs in analytics and percentage
change over 2013
Figure 3 also shows a need for specific skills at varying levels of the job hierarchy: data analysts, data modelers, information managers, database/network analysts, database/network managers, software engineers and mobile developers, business analysts, project managers, and decision scientists who understand the management and measurement of organizations' ecosystems. These people have to act as brokers between business analytics managers, ICT engineers, and decision makers, providing guidance on the best selection of big data technology tools, models, and applications. They have to be able to integrate transactional data systems and business intelligence tools with operating and service systems such as Customer Relationship Management (CRM) systems, Enterprise Resource Planning (ERP) systems, web analytics tools, and social content and network integration tools (as well as various mobile apps and other information systems) to create shared value for the benefit of all stakeholders and society.
Collett et al. (2014) further provided details on pay-satisfaction trends, along with job security and stress levels. The overall number of “hot skills” jobs rose by more than 5 percent from 2013 as a key hiring area for managers, and some 63% of open jobs are for highly skilled specialists. The average salary increase was 2.1% in 2014, with 0.7% in bonuses, while the average range of salary changes over 2013 was between 2.8% and 7.8%. Moreover, the world average changes varied significantly between developed and emerging countries. For instance, the WorldatWork salary budget survey for 2013-2014, published on July 9, 2013, showed salary budget increases by country and employee category, ranging from the lowest median of 2.5% in Japan, to about 3% for most developed countries (Germany, Canada, UK, and US), to the highest medians of 11.3% in India, 8.8% in China, 7.5% in Brazil, 4.8% in Mexico, and 4.5% in Singapore, WorldatWork (2013). These changes create a huge challenge for human resource management departments, which have to retain and attract new talent. Organizations cannot get their hands on enough workers with data skills, and this shortage of qualified workers is hampering some businesses' analytics initiatives, including social media analysis, migration to the cloud, and using Hadoop. The shortage of skills to execute such initiatives is particularly acute and provides strong opportunities for workers with these skills in the coming few years.
Petronza (2014) reported that legacy companies continue to digitize operations and that new startup companies will continue to pop up everywhere in the world. These patterns will increase the demand for both qualified tech and context professionals. Based on a recent finding from the recruiting company Robert Half Technology, Petronza also mentioned that 16% of chief information officers planned to expand their teams in the first half of 2014, and employers are focusing on filling positions. Tech jobs such as software engineer, web developer, network systems analyst, mobile developer, IT manager, and big data privacy specialist require a background in computer science, while business data analysts do not necessarily have to be engineers. The CEO of a recruiting agency (Marr McGraw) said: “For recent graduates, sales development is perhaps the best opportunity for non-engineers to get into start-ups with a $100K career track”. Hence, sales experience and internships during college are essential.
Finally, Davenport and Patil (2012) reported that “data scientist” is the sexiest job of the 21st century, with an estimated 140,000 to 190,000 positions requiring deep analytical skills, and 1.5 million more data-savvy managers needed, to take full advantage of big data in the US alone. Also, an IDC report on the change in cloud computing between 2012 and 2020 showed that the amount of digital work that chief information officers and information technology staff need to manage will not just become bigger, but will also become more complex. The skills, experience, and resources to manage all these digital bits of data will become scarcer and more specialized, requiring a new, flexible, and scalable IT infrastructure that extends beyond the cloud computing enterprise. It is estimated that the number of servers (virtual and physical) worldwide will grow by a factor of 10, and the amount of information managed directly by enterprise datacenters will grow by a factor of 14, while the number of IT professionals in the world will grow by a factor of less than 1.5. However, by 2020, it seems likely that private clouds and public clouds will be commonplace, exchanging data seamlessly. There won't be one cloud; rather, there will be many clouds, bounded by geography, technology, differing standards, industry, and perhaps even vendors. We may still call it cloud computing, but it will be interconnected air, easy to traverse but difficult to protect securely or manage properly.
Essential Requirements for Startups
This section shows that business, management, and technology savvy are the most essential skills for any successful startup. A top-notch software-based startup needs an integrative mindset, a virtual presence supported by cloud technology, and support from business and management functions. A large local sales force selling products through in-person meetings at different locations may not be needed. This virtually opens the possibility of launching a startup anywhere, boosting economic growth and creating jobs. The following steps need to be fully understood, with appropriate strategic and operational informed decisions in place:
1. Establish a personal support network of family and friends, as building a business involves hard and stress-inducing work. The goal is to build a happy and well-rounded life, not to sacrifice personal happiness to focus solely on a startup.
5. Secure access to a local talent pool to grow a successful business, for which you need to be near a strong academic center that supplies strong graduates as employees and lets you avoid paying relocation costs.
6. Assess the legal perspective so as to establish the business in a location where law and order are respected, and understand the impact on the business of corporate income, franchise, and personal income taxes as well as legal filing fees.
The above steps require business analytics and management skills that technology cannot execute; technology can only facilitate the delivery of the startup's operating and production processes. For more details, refer to Akalp (2014).
Analytics, Management Science, and Operations Research’s Decision Making Obstacles
Analytics is the set of scientific processes for transforming data into applied insights for making better informed decisions. It is a uniquely powerful approach to informing decision making. Analytics gives executives the power to make more effective decisions and build more productive systems based on more complete data; consideration of all available options; careful predictions of outcomes and estimates of risk; and the latest decision tools and techniques. Analytics is built on three intelligence capitals (information, modeling, and statistical). Information intelligence is used to capture, store, and process data, as well as to customize the reporting and visualization of data. Modeling intelligence builds decision models and provides optimization solutions for prescriptive analysis, determining the very best option among innumerable feasible options that are often difficult to identify and compare, even for experienced managers, because of the many constraints; it is used in budget planning to allocate resources to the most efficient and effective options. It also uses simulation to conduct predictive analysis, to measure and quantify risk, to provide confidence intervals of likelihood values, to predict fraud, and to give the ability to experiment with a few ideas/scenarios for further improvement. Statistical intelligence uses probability and statistics to conduct descriptive analysis, to test hypotheses, to make reliable forecasts, and to mine organizations' data and text to find valuable connections and extract insights and conclusions.
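As a concrete, if toy, illustration of the modeling-intelligence idea of choosing the best feasible option under constraints, the sketch below allocates a budget across two hypothetical projects with a linear program; the returns, budget, and risk cap are made-up numbers, not figures from this chapter.

    from scipy.optimize import linprog

    # Maximize 0.12*x1 + 0.08*x2 subject to a total budget of 100 and a
    # risk cap of 60 on project 1. linprog minimizes, so the objective is negated.
    c = [-0.12, -0.08]
    A_ub = [[1, 1],   # x1 + x2 <= 100  (budget)
            [1, 0]]   # x1      <= 60   (risk cap)
    b_ub = [100, 60]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print("optimal allocation:", res.x, "expected return:", round(-res.fun, 2))

The same pattern scales to many more options and constraints, which is where an optimization solver starts to beat intuition.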
Operations Research has been successful in providing a systematic and scientific approach to all kinds of government, military, manufacturing, and service operations, with great impact on managerial decision making. The term “Operations Research” in the US is known as “Operational Research” in Britain and other parts of Europe. It can be defined as the science of decision making that deals with the application of advanced analytical methods to help make better decisions (Blumenfeld, et al. 2001). Operations research overlaps with other disciplines such as “Industrial Engineering”, “Decision Sciences”, and “Operations Management”. It is often concerned with determining the maximum profit, performance, or yield, or the minimum cost, risk, or loss. The multiplicity of names comes primarily from the different academic departments that have hosted courses in this field. It encompasses a wide range of problem-solving techniques in pursuit of improving decision making and efficiency, including simulation, mathematical optimization, queuing theory, Markov decision processes, economic methods, data analysis, statistics, neural networks, expert systems, and decision analysis. Its major sub-fields include: computing; information technologies; environment, energy, and natural resources; financial engineering; manufacturing, service science, and supply chain management; marketing science; policy modeling and public sector work; revenue management; simulation; stochastic models; and logistics and transportation. In general, operations research is very closely related to advanced analytics.
Management science (MS) is an interdisciplinary branch of applied mathematics, engineering, and science that uses various scientific research-based principles, strategies, and analytical methods, including mathematical modeling, statistics, and algorithms, to improve an organization's ability to enact rational and meaningful management decisions. The major sub-fields of MS include: data mining; decision analysis; engineering; forecasting; game theory; industrial engineering; logistics; mathematical modeling; optimization; probability and statistics; project management; simulation; social networks; transportation forecasting models; and supply chain management. The terms analytics and management science are sometimes used as synonyms for operations research. Because of this synonymy, the OR and MS societies were integrated under the Institute for Operations Research and the Management Sciences (INFORMS), while the recently emerging analytics can be seen as the maturity model of INFORMS. OR/MS is defined as the science of operational processes, decision making, and management in the Encyclopedia of OR/MS (Gass & Harris, 2001). OR/MS analytics is used by virtually every business and government throughout the world, and it will remain an active area of academic research and application in banking, health, supply chains, and so on, with great impact on business, industry, and society. For more details on real-life applications, publications, awards, and impacts, see INFORMS (2014).
In general, the decision-making process involves a number of steps, including setting goals; collecting measurements; conducting descriptive, predictive, and prescriptive analyses; and providing reporting, visualization, and storytelling on the best set of actions for effective implementation by decision makers. The decision-making steps for an effective implementation include:
Step 3: Data Collection Process. Observe the operating systems by tracking key performance measures and validate the collected data.
Step 4: Model Solving Process. Experiment with different models to obtain a solution to the problem using exact optimization methods, approximate heuristics/meta-heuristics, network techniques, frontier analysis, multi-criteria methods, or simulation methods, whichever is appropriate.
Step 5: Validation and Analysis. Validate the obtained results, test and calibrate the model if needed, and conduct what-if sensitivity analysis to establish model robustness (a small sketch follows this list).
Step 8: Documenting and Storytelling. The above processes can be automated using interactive mobile apps to give executives real-time insights for informed decisions. Obstacles and best practices should be documented for further enhancement, knowledge sharing, and learning from best-practice experiences within an organization.
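The sketch referred to in Steps 4-5 above is given here: a simulation evaluates a candidate decision under uncertain demand, and the same model is re-run under a perturbed assumption as a simple what-if sensitivity check. All numbers are hypothetical.

    import numpy as np

    rng = np.random.default_rng(42)

    def average_profit(order_qty, mean_demand, n=10000, price=10.0, cost=6.0):
        """Average profit of a single-period stocking decision under random demand."""
        demand = rng.poisson(mean_demand, size=n)
        sales = np.minimum(order_qty, demand)
        return float((price * sales - cost * order_qty).mean())

    baseline = average_profit(order_qty=100, mean_demand=95)
    what_if = average_profit(order_qty=100, mean_demand=80)  # what if demand drops?
    print(f"baseline profit ~ {baseline:.1f}, lower-demand scenario ~ {what_if:.1f}")

If the gap between the two scenarios is large, the decision is not robust and Step 5 would send the analyst back to recalibrate the model.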
Essential to the above decision-making process is the data modeling process in Step 1. Data preparation and data modeling concepts are vital for creating data business models. They still require human modelers with experience and skills to develop useful and accurate (descriptive, predictive, and prescriptive) models for operational and business intelligence applications. Business models can be facilitated by machines, but they still cannot be built by machines. The modeler must also be flexible enough to communicate in the stakeholders' business language. Thus, non-traditional modeling effort is required to understand the business environment and the stakeholders' challenges. It is often unclear to stakeholders whether they have issues until the business is stressed and the critical issues start to emerge. Business analysts often use traditional mathematical modeling concepts with logical and physical notations, but it is important that the audience be comfortable with such notations. It is often better to avoid them and instead use notations that users understand, e.g., spreadsheet modeling when working with financial experts, or pictures to illustrate the concept with non-technical users.
The biggest challenge is to capture the requirements of the data business model correctly. Often, when the project starts, there are only vague requirements (if any requirements at all), yet the data model must represent these requirements completely and precisely. It is therefore a very challenging task to go from ambiguity or vagueness to precision. A lot of questions need to be asked and the answers documented in the model. This process takes time and knowledgeable modelers who can ask questions at a time when projects lack both the time and the expertise to answer them. Data modeling is the thought process of learning about the business, stating the problem clearly, and defining its attributes, rules, and requirements. It is a time-consuming and challenging process.
Martinotti (2013) discussed the data obstacles and their impacts on the decision-making process. Martinotti reported that successful leaders who make decisions based on cognitive analytics do have access to the information needed to make informed decisions. They use big data tools to generate intelligent analytical insights and create significant value for the world economy by enhancing production processes and the competitiveness of organizations, creating substantial economic surplus and shared values for the community. The data-driven asset can create value using systematic and scalable approaches to turn data into actions. Companies with big operations, such as the oil and gas industries, are embarking on an automation journey; they are extending oil recovery beyond 60% and moving from reactive to proactive and from predictive to life automation. Martinotti (2013) defined automation as the use of technology (e.g., sensors, storage, connectivity, internet, processing, and control systems) to perform high-cost/low-cost (even dangerous) tasks and to take decisions with minimal or reduced human intervention. The decision-making elements needed for a successful automation journey to create shared values from data, as well as the information leakage per data type and the required actions, are presented in Figure 4. We have expanded Martinotti's items to create a 5Is-level model (Input, Intelligence, Information, Insight, and Impact), following a bottom-up hierarchy of the decision-making process based on an input-output system approach. At each level, a set of 13 recommended actions is suggested to take full advantage of the decision-making process. Failure at any level may impair the overall expected impact of the analytics decision-making process and may fail to address societal challenges, whether in business, government, or society.
Figure 4. Decision making information obstacles and
corrective actions
Despite the above-mentioned obstacles, there have been many successful OR/MS stories addressing real-world decision-making problems. For instance, IBM obtained a $93 million contract to build a computer system for the US Department of Energy to run exact real-time simulation models of atomic blasts, and United Airlines developed the first automated revenue management system in the travel industry in partnership with DFI-Aeronomics, at a cost estimated at up to $20 million, which is expected to add $50 million annually to United's revenues (Albright & Winston, 2012).
Due to the nature of real-world problems, the solutions obtained may fail to yield the expected performance for one or a combination of reasons at each level of the 5Is model: the model may be wrongly constructed or used; the data used in building the model may be incorrect; the solution may be incorrectly carried out; the system or its environment may have changed in unexpected ways after the solution was applied; or stakeholders' behaviors and views may have changed. Corrective action is always required, and any advancement in the underlying tools and technology will change the landscape with a great positive impact. In this regard, one negative and one positive note can be made. On the negative side, the lack of proper engagement of top managers who are not quantitatively savvy at an early stage of the solution development process leads to non-implementation of the recommended solution; instead, they opt to use their own thinking and intuition, with many successes and failures, rather than accepting the logic of the OR/MS decision-making process in complex decision-making situations. On the positive side, OR/MS success stems from the effectiveness and scientific rigor of its solutions, which depend heavily on the availability of relevant data and on efficient exact and approximate optimization algorithms, including artificial, biological, and nature-inspired meta-heuristic approximate algorithms (Osman & Kelly 2006). Due to the historical interdisciplinary link between operational research and Computer Science (CS) since their inception in the early 1940s, each discipline has contributed to the advances of the other. The high-performance computing software of CS has enabled complex OR optimization algorithms to be executed more efficiently when solving real-life problems, while advanced OR/MS models and the associated algorithmic research have contributed to the development and implementation of more advanced computer systems (Osman 1995). Since past operational researchers faced obstacles related to data availability constraints and the scarcity of computing power, any new advancement in computing power, information and communication technology, and visualization techniques will have a great positive impact on the future of OR/MS informed decision making.
Finally, the development of management information systems (MIS) and decision support systems (DSS) brought operations researchers and industrial engineers to the forefront of business planning. These computer-based systems require knowledge of an organization and its activities in addition to technical skills in computer programming and data handling. The key issues in MIS or DSS include how a system will be modeled, how the model of the system will be handled by the computer, what data will be used, how far into the future trends will be extrapolated, and so on. In much of this work, as well as in more traditional operations research modeling, simulation techniques have proved invaluable. The main hindrances to OR/MS business analytics models were the lack of proper data capture and the limited availability of computing power. These two issues have been eased by new developments in Hadoop distributed computing and the huge increase in computer processing power, as well as by the integration of various data sources that was not possible in the past. Hence, interest in cognitive analytics has increased, to uncover analytical insights and create more shared value that was previously buried in big data.
COGNITIVE ANALYTICS MANAGEMENT EDUCATION TRENDS AND FUTURE
Wladawsky-Berger (2014) claims that we are in the middle of a “digital perfect storm” characterized by the emergence of several IT trends in mobile, cloud, social media, and big data. The main challenge for businesses is how to adapt to the “digitization of the economy”. This digital transformation is affecting the customer experience, which is becoming more and more digitized and is changing the landscape of the services industry.
In fact, the world today is experiencing not just a digital perfect storm but a digital disruptive tsunami. It is creating a lot of changes, including economic favoritism toward a few large vendors and operators, who are using new terminologies for strategic differentiation to attract more attention, and the creation of new associations or adaptive name changes for existing associations and professions. The Digital Analytics Association (DAA) was launched in March 2012 to replace the Web Analytics Association, DAA (2012). The DAA's new mission is to “help organizations illuminate and overcome the challenges of data acquisition, exploration, deduction and application”. Digital analytics was defined as “the science of analysis using data to understand historical patterns with an eye to improving performance and predicting the future”. The analysis of digital data refers to information collected in interactive channels (online, mobile, social, etc.). Digital analytics has become an integral part of core business strategies and of maintaining a competitive edge. Digital data started the big data culture, as signaled by the blitz of data volume, variety, and velocity. Digital analytics is a moving target that is opening the door to new types of correlative discovery, innovation, and exploration, which is what makes it fascinating, with many initiatives. For instance, a new Analytics section was added to the OR and MS sections of INFORMS, Gorman (2011). The Analytics section was added to promote the use of data-driven analytics and fact-based decision making in practice. The new section recognizes that analytics is both (i) a complete business problem-solving and decision-making process and (ii) a broad set of analytical methodologies that enable the creation of business value. To this purpose, the section promotes the integration of a wide range of analytical techniques and the end-to-end analytics process. It supports activities that illuminate significant innovations and achievements in specific steps and/or in the execution of the process as a whole, where success is defined by the impact on the business. Robinson et al. (2010) illustrated the three business analytics intelligence sections (Analytics, OR, and MS) of INFORMS in Figure 5.
Figure 5. The business analytics intelligence sections of
INFORMS
Both OR and MS have the same goals and objectives, but they evolved separately in two different parts of the world. Despite their long-time merger, INFORMS has maintained an almost stable membership. Trick (2009) reported that INFORMS remains a stable society of 10,000-12,000 members, yet there are more than 58,000 OR analysts, a number the US statistics bureau predicts will increase to 65,000 in 2016, and no more than a thousand of them have been members of INFORMS since its inception in 1995. In contrast, the Institute of Business Analysis, founded in 2003, had a membership of 27,000 in 2014, a significant increase of 69% from its 16,000 level in 2011.
Other new names include big data instead of just data; cognitive systems instead of artificial intelligence systems; data science instead of MS/OR; Hadoop for centralized internal data storage and processing over a distributed network (or cloud for external data infrastructure and processing) instead of distributed computing; and visualization software instead of decision support systems. Finally, workforce titles have changed to include: data analyst instead of OR/MS analyst or business analyst; business analyst instead of business intelligence analyst; chief data officer (or chief digital officer) instead of chief information officer; and data modeler instead of OR/MS modeler. Further, professional life has changed along the following 4Ds: “distributed, discontinuous, decentralized, disengaged”. Professionals are more globally connected and more mobile than ever, which disrupts how work gets done (e.g., the whopping 35% of creative professionals who now operate as freelancers). On the philosophical level, a little is heard about the movement to separate the information “I” from the technology “T”, with a portion of the infrastructure and a portion of the data and information moving apart. All the above definitions are difficult and confusing to both academics and professionals; they are hampering many decisions, leading to a wait-and-see strategy instead of being at the forefront of cognitive analytics development.
There are several welcome drivers of change in favor of business analytics. They include: i) turbulent and revolutionary technology improvements that allow real-time visual displays and communications on mobile devices; ii) the new ability to capture data using social media and network tools for a better understanding of business and social needs; iii) new tools for integrating various data sources into real-time analysis in virtual centralized memory; iv) advances in storage and computational processing power; v) new social business models with improved system efficiency and shared value created through training or giveaways around their systems; and finally vi) strategic management differentiation from using new trendy names. For IT organizations to remain relevant in mobile-friendly technology, they have to manage the workforce experience, manage productivity growth efficiently, and give consideration to the sustainable development of society (Laskowski, 2014). These welcome changes are positive moves to remove traditional silos and artificially created boundaries among equally important and necessary interdisciplinary fields that are needed for the development of a smarter society in the 21st century.
It should be noted that the recent action by INFORMS to add Analytics as a third section alongside its traditional OR and MS sections still focuses on quantitative approaches, despite the emergence of behavioral OR, soft OR, and behavioral management science under management. It is a reactive, incremental building approach rather than a fundamental, radical, integrative approach encompassing all the above-mentioned emerging changes. INFORMS seems able to include the new data technology enablers and computing processing power that were among the main 5Is hindrances and that are essential for an effective implementation of the informed decision-making process. Another equally important point to highlight is that, for a better world, we do not need to focus on business only, but also on government, non-governmental organizations (NGOs), and other “do-good” stakeholders. Thus, “organization” is used as a generic term covering associations, businesses, governments, and people. All types of organizations face the same challenges when it comes to making effective informed decisions. Hence, a better, more reflective and informative name should emerge to remove past barriers and silos, leading to a unified effort and approach to addressing societal challenges. In this chapter, we aim to make a modest contribution to unifying such efforts for a smarter world, for which we suggest “Cognitive Analytics Management (CAM)” as an integrative name.
Cognitive Analytics Management (CAM) is an integrative, holistic framework of engaging fields of knowledge. Its main components belong to different rigorous scientific and technological fields, combined with the professional relevance of practical business, engineering, and management. Each field contributes to and complements the others uniquely in order to collectively address complex real-life organizational problems. These CAM components are complementary; they are not parts of any single field, but build on individual best practices to create the combined impact of shared values. The CAM framework's main advantage is that it avoids creating new artificial boundaries and silos similar to what has happened, and is still happening, in the fields of OR/MS and other IT fields.
The main components of the CAM framework are cognitive technology, analytics, and management of a contextual domain. The management component is needed to link the organization's model to its strategic mission, vision, goals, and objectives, in order to combine its strategic management and measurement through analytics and to set data governance strategies for the organization's shared values. The contextual (professional) domains that would be impacted include agriculture, business, government, health, energy, environment, finance, logistics, legal systems, and transportation, thus involving society at large. The analytics component was defined above as the scientific process of transforming data into applied insights for making better decisions. People with analytical skills have the critical thinking to enable them to devise and propose innovative approaches to problem solving. What is new relative to classical OR/MS analytics is sentiment analysis, which refers to the process of identifying, extracting, and classifying opinions in text segments (used in CRM, online advertising, and brand analysis).
The cognitive technology component consists of cognitive science and technology science. Cognition is a mindset for processing information, applying knowledge, and changing preferences; it includes the attention of working memory, comprehending and producing language, calculating, reasoning, problem solving, and decision making. The technology enablers serve not only to capture real-time information about people, places, and things, but also to capture information resulting from internal machine-to-machine interoperability for feature extraction and contextual text analysis. Cognitive systems are intended to provide expert assistance to scientists, doctors, lawyers, and engineers in a fraction of real time. The need for such systems comes from the huge amount of data available nowadays, from which insights can be extracted to enhance the creativity that leads to new innovative products and services.
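As a small, hedged illustration of the sentiment analysis mentioned above under the analytics component, the sketch below scores two made-up customer comments with NLTK's VADER lexicon; a production CRM or brand-analysis pipeline would be considerably more elaborate.

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
    sia = SentimentIntensityAnalyzer()

    comments = [
        "The new mobile app is fast and the support team was very helpful.",
        "Checkout kept failing and nobody answered my complaint for a week.",
    ]
    for text in comments:
        scores = sia.polarity_scores(text)          # neg / neu / pos / compound
        label = "positive" if scores["compound"] > 0 else "negative"
        print(label, scores["compound"], "-", text)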
The CAM framework illustrated in Figure 6 can be defined as the underlying technology process and analytics of the SAMAS framework for delivering organizations' shared values. The creation of cognitive analytics shared values and impacts can follow the input-output system process, leading to managerial insights for decision making and generating specific contextual societal impacts. Those impacts themselves induce new societal changes, leading to the need for a DLIGENTS assessment. Therefore, the input-output-impact cycle of CAM can be restarted in a continuous strategic planning and improvement cycle, as shown in Figure 7. The CAM framework can also have sub-sections such as digital technology, business analytics, government analytics, behavioral management, and so on, so that developments in one sub-section can impact the other sub-sections instead of growing in silos.
Figure 6. Cognitive Analytic Management (CAM) framework
Figure 7. The CAM input-output-impact continuous cycle of
improvement
Organizations proclaiming success with “big data and business analytics” have typically focused on building new technological capabilities rather than building the management analytics capabilities that may already exist within the organization. However, analytics has been hindered by the lack of computing power as well as by the availability of quality data. In general, CAM programs require technical as well as business-savvy and know-how skills. Following the mindset of the SAMAS framework, the process commences with an assessment of the ecosystem environment of an academic HE institution to determine what kind of CAM programs to develop in order to meet the shortage of skills in business, government, and society.
To redesign the mission of a HE institution, its management should look at the characteristics of its ecosystem using the four guiding principles:
Mission 1: Knowledge conversation to observe governance and legal systems, culture, clergy, soldiers, lawyers, doctors, business practitioners, and social needs, involving relevant stakeholders and potential employers;
Once the mission assessment of social needs has been conducted, with feedback from internal as well as external ecosystem stakeholders to identify world best practices, in order to inspire locally and aspire globally, the relevant education programs and academic initiatives can be developed. After conducting a review of world trends and best practices in academia, organizations, and CAM-related fields, it can easily be seen that there are diverse structures and program names. The diversity is primarily caused by the academic emphasis of the custodian department, whether in a business, computing, or technology school, and by the local societal needs for such programs. Hernández-Orallo (2014) discussed the pros and cons of PostGraduate (PG) and UnderGraduate (UG) degree offerings worldwide. A summary is presented as follows.
4. Are 3-year UG programs plus a 2-year master's in the US, or 4-year UG plus 1 PG year, really necessary, given the multidisciplinary nature of the field?
The various pros/cons analyses have different impacts on the development of PG and UG programs. For instance, the US/Canada has experienced tremendous growth in program offerings: in 2012 there were 16 PG programs, and this number reached more than seventy programs in 2014 (KDnuggets, 2014a). Moreover, there are more than 200 programs in Europe (KDnuggets, 2014b) and more than 50 online programs (KDnuggets, 2014c), in addition to hundreds of MOOCs.
The Undergraduate (UG) pros and cons were similarly discussed. The UG pros include
3. Soft skills and attitudes are easier to shape when students are younger;
3. Less flexibility;
The number of UG programs around the world is small, including data science (warwick.ac.uk, usfca.edu, cofc.edu, bellevue.edu, nku.edu, and gmu.edu) and business analytics (nus.edu.sg, mq.edu.au, and unisim.edu.sg). Universities seem to be reluctant or slow to implement new degrees, or they may think that a bachelor's degree plus a one- or two-year master's is a better option.
Sherman (2014) detailed the skill sets and roles that are vital to the success of “big data” analytics initiatives and that should be part of big data analytics deployments. The skill sets follow an organizational perspective on building workforce capacity in this area. Sherman reported that technical skills alone are not enough; business knowledge is also required. The set of skills includes:
1. Business knowledge for setting the organization's strategy and business operations, as well as understanding its competitors, current industry conditions and emerging trends, customer demographics, key performance indicators and metrics, and both macro- and microeconomic factors.
2. Business analysts to gather business and data requirements, to design dashboards, reports and data visualizations for presenting
analytical findings to stakeholders and assist in measuring the business value.
3. Business intelligence (BI) developers to build the required dashboards, reports and visualizations for stakeholders; to enable self-service
BI tools and capabilities and to prepare data BI templates for business executives and other users.
4. Analytics model builders to analyze data based on internal organization needs, to understand how to gather and integrate data into
models, to create and refine models to generate applied insights.
5. Data architects to design the data architecture; to guide the development and incorporation of various database structures into the architecture; and to define processes for data capture and storage, data profiling, data quality, data governance, and access, making data available for various business intelligence analyses.
6. Data integration developers to design and develop integration processes that handle the full spectrum of database structures, ideally ensuring that integration is done not in silos but as part of a comprehensive data architecture. Data integration tools can be used to support multiple forms of structured, unstructured, and semi-structured data sources, avoiding the temptation to develop custom code that cannot scale as data volume, velocity, and veracity continue to grow.
7. Technology architects to design the underlying IT infrastructure that will support big data analytics initiatives, including traditional data warehousing and business intelligence systems. Technology architects need expertise in new technology tools, including open source tools such as Hadoop and cloud computing systems, in order to understand how to configure, design, develop, and deploy them.
8. Data visualization experts to simplify the process of presenting the results of analytics queries and to help corporate executives and organization managers track and monitor progress toward business goals. Executive dashboards, performance scorecards, Tableau, and other data visualization tools provide best-practice knowledge on how to create, develop, and set up data visualization and human interaction tools for learning from and interacting in real time with multidimensional data, across the business, government, healthcare, manufacturing, and service sectors.
9. Information communication skills to communicate with people. If one uncovers a vitally important piece of information in a set of data but is incapable of imparting it to others or convincing them of its importance, the idea will have no impact, as if it had never been discovered.
10. Corporate Performance Management (CPM) to measure business results, to track progress against business goals to improve financial performance, to help make the CPM strategy more successful, and to provide useful information and insights on performance management trends. Knowledge of CPM concepts includes balanced scorecard modeling, balanced scorecard implementation, key performance metrics development, knowledge of enterprise performance, planning and budgeting software, and best-practice deployment of CPM systems.
11. Data warehousing to design enterprise data warehouses and applications, to determine and select the data warehouse tools that fit the organization's business intelligence needs, to handle emerging data warehouse products, and to provide guidance on managing “big data” technologies such as the Hadoop Distributed File System (HDFS) for quickly ingesting and storing data of all shapes and the MapReduce framework for processing data, as well as managing data warehouse projects (a minimal MapReduce sketch follows this list).
12. Social media and networking skills to determine societal needs and social change drivers, to capture media complaints, social trends, and so on, to develop and manage content, and to sustain active engagement for soliciting opinions on government white papers, social concerns, branding, and online advertising, along with the ability to use segment analytics tools.
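The MapReduce sketch promised under item 11 follows: a classic word count written as two small Python scripts that could be run with Hadoop Streaming, or piped together locally for testing (cat input.txt | python mapper.py | sort | python reducer.py). It is only a toy illustration of the processing model, not a production job.

    # --- mapper.py ---
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")           # emit (word, 1) pairs

    # --- reducer.py ---
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:                  # keys arrive grouped/sorted
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

In a real deployment, HDFS holds the input splits and the framework handles shuffling and sorting the mapper output by key before it reaches the reducer.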
Bell (2013) reported that successful business analytics graduates will most likely be buyers of analytic systems from corporate suppliers rather
than developers of analytics. Hence, analysts should have a good understanding of five fundamentals.
2. Be informed well enough about analytics to identify potential value-enhancing projects and provide sufficient direction to a team working
on analytics applications.
It should be noted that those who are interested in the technology side of cognitive management are required to have computer science skills. They need to know about data structures, algorithms, systems, and scripting languages. The programming languages to learn for a successful career in the knowledge discovery journey of cognitive insights include data, internet programming, and server languages for developing front-end interactions and back-end operations, for developing websites and smartphone apps, and for human-to-machine interactions and “Cogs” for server-to-server machine interactions. In particular, they include programming languages (ASP.net, C/C++, Java, Perl, Python, PHP, and Ruby) and database/query technologies (BigTable, MySQL, HBase, Microsoft SQL Server, MariaDB, NoSQL, and Oracle). The starting languages for interactive websites are PHP, MySQL, and jQuery/JavaScript for interactivity. For instance, most big companies use JavaScript for their front-end interactions but different languages for back-end operations, including: Google (C/C++, Java, and Python); Facebook (PHP, Java, Python, C/C++); YouTube (C/C++, Python, Java); Wikipedia (PHP); Blogger (Python); and Twitter (C++, Java, Ruby), while Microsoft uses the .NET platform for its applications. A CAM professional should have an understanding of at least one programming language to enhance his or her integrator role in the business, management, computing, and technology world and to be able to cope with fast-moving developments. For instance, IBM CEO Virginia Rometty tweeted about “creating cognitive machines ‘Cogs' that learn, soon we no longer have Apps but Cogs instead”, and it is expected that 10% of Cogs will be learning from each other by 2017 (Rometty, 2014).
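As a minimal sketch of the back-end database interaction implied above, the snippet below uses Python's built-in sqlite3 module as a stand-in; the same SELECT/GROUP BY pattern applies to MySQL, MariaDB, or Microsoft SQL Server through their own client libraries, and the table and values here are invented for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")            # throwaway in-memory database
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("GCC", 120.0), ("EU", 95.5), ("GCC", 80.0)])

    # The kind of aggregate query a BI dashboard back end might issue
    for region, total in conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(region, total)
    conn.close()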
List of Additional CAM Related Topics
CAM people should have a background in business, computing, management, and information technology, together with relevant professional domain knowledge in their career area of interest, whether banking, energy, engineering, food, health, electronic government, legal, public safety, media, oil and gas, retail, transport, or tourism, and whether in for-profit or non-profit, regional or global organizations. They require hands-on skills and applied insights that come from the appropriate selection of additional topics. The self-selected list is presented in alphabetical order, rather than in any order of importance, for easy reference: accounting and finance; advertising models; business process management; content media management; cloud computing; cognitive leadership; cognitive systems; data modeling techniques; database warehouse management; data governance and security; data/text/video/voice mining analytics; descriptive analytics; energy management; entrepreneurship and innovation; enterprise computing; Hadoop HDFS and distributed processing; digital marketing; forecasting and econometrics; health care models; information management; logistics and supply chain; mobile application development; multimedia; project management; predictive analytics; prescriptive analytics; product development management; operating systems; organizational behavior and human resources; revenue management; sentiment analysis; shared-value social business modeling for sustainable development; wireless technology; and any other specific knowledge in a professional domain. Cognitive technology skills may include the R language, a specialized free open-source programming language that is very easy to learn: you can simply download it and start typing. Its core strengths are data sampling, data manipulation, and simulation. The unstructured data types common in big data systems are often better managed with non-relational (NoSQL) or graph database software than with the relational SQL data management systems from IBM, Oracle, and Microsoft, although the latter are also useful to learn.
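To show what the data sampling and simulation work mentioned above looks like in practice, here is a small Python/numpy sketch (with made-up revenue figures) that bootstraps a confidence interval for a mean; the equivalent R code would be just as short using sample() and replicate().

    import numpy as np

    rng = np.random.default_rng(0)
    revenues = np.array([12.1, 9.8, 15.3, 11.0, 13.7, 10.4, 14.9, 12.6])

    # Bootstrap: resample with replacement to estimate uncertainty in the mean
    boot_means = [rng.choice(revenues, size=revenues.size, replace=True).mean()
                  for _ in range(5000)]
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(f"mean ~ {revenues.mean():.2f}, 95% bootstrap interval ~ ({low:.2f}, {high:.2f})")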
The importance of soft communication skills should never be ignored. Information visualization and storytelling have recently been highlighted due to the complexity of communicating big data with different characteristics (e.g., GPS location information available from different sources and in different formats) and the need to display such information on digital maps. Soft communication skills beyond the traditional display of simple charts, graphs, and tables provided by Microsoft Office or IBM SPSS tools are needed. Traditional quantitative data were available in a few formats and stored on a single server, so there was no need for data integration. Nowadays, data are available on different servers and in different locations, and require more filtering, formatting, and special data management and visualization tools. Mackinlay et al. (2014) reported that, beyond data visualization, storytelling should accompany the data display. Storytelling is a cornerstone of human experience: data tell you what is happening, while stories tell you why it matters, taking advantage of human cognition around facts to make them more memorable. The main purpose of storytelling is not just understanding problem challenges but also changing them. The storytelling process consists of raising a question with a little action (drama, comedy, a serious dilemma, or a conflict) to draw attention, followed by a logical sequence of narration of results and explanation, and ending with a conclusion or resolution of the raised question. Storytelling may therefore be a groundbreaking start, but it must be followed by a sequence of interactive visual dashboard stories of findings instead of traditional scorecard displays.
Telling great stories with data will make a CAM person a better analyst for the following reasons: creating a good story can help senior management focus on what is important and bring data and facts to life; cultures have long used storytelling and quoted wisdom to pass knowledge and content to future generations, to make sense and order, and to give a vision of what the future can look like; interactive storytelling puts people into the stories; and, finally, it makes your job easier. It should be noted that presenting data does not always equate to a good story. Guiding principles include: 1) think of your analysis as a story and use a story structure to present it; 2) be authentic, rooted in facts, with facts rooted in data; make it personal and emotional, and supplement hard data with qualitative data; 3) be visual and think of yourself as a film editor: use pictures, graphs, and charts whenever possible, and design your graph for instant readability so that the audience can focus on meaning and draw conclusions for effective communication; 4) make it easy for your audience by sticking to 2-3 key issues related to them; covering multiple stories within one overall presentation is fine, but asking people to analyze information and draw their own conclusions is like asking them to jump through too many hoops, and you may risk losing their attention (the creation of the story is not only about delivery but also about the audience's comprehension through effective visualization); and finally 5) invite and direct discussion to highlight key facts related to the story, extend the story parameters into questions, and invite audience discussion at the end.
In summary, the demand for data-technical and business-savvy specialists is much higher than the supply, since the discipline is new. But what is new about the discipline, and what skills does one need to succeed in it? The discipline simply consists of extracting knowledge from data. But it is not the simple analysis of data, as that is the realm of statistics, which has been around for centuries. Data analysts rely on the analysis of structured and unstructured data sets from various sources, such as images, text, videos, and so on, while statistics aims to explain phenomena through the identification of patterns in quantitative structured data. Wladawsky-Berger (2014) reported that the world is in the middle of a “digital storm” and that the intelligent use of unstructured data is still minimal. The digital storm is characterized by the emergence of several IT trends in mobile, cloud and Hadoop computing, social media, and big data. The main business challenge is how to adapt to the “digitization of the economy”. This digital transformation is affecting the customer experience; the world is becoming more and more digitized and is changing the landscape of the services industry. The world may see the end of the data analyst as the capacity to manipulate and analyze data moves out of the specialist's hands to become a required skill for every professional.
The importance of e-learning and training using online technology was investigated by Sokoloff (2012), who explored the apparent divide between academic information literacy and the perception of business information literacy in the workplace. It was found that business schools strive to provide a highly relevant, experiential learning curriculum to prepare students for postgraduate employment, but the job market is highly competitive and all kinds of business information flow freely online. The study reveals a clear disparity between the perception of business information skills in school and those needed in the workplace. Sokoloff (2012) further stressed the importance of maintaining a baseline level of information skills and knowledge for business students. An expanded effort is needed to include new, non-traditional training methods on information competencies that support individuals in building skills for locating and using information in their jobs, interpreting it, and applying the results at work. Such requirements need engagement and the fostering of an independent, life-long learning approach to keep pace with the daily expansion of information sources and channels in order to meet employers' expectations. As a consequence, the right balance between information literacy and business information needs to be struck. Further, a sustainable approach to teaching information technology literacy to the mass of students was developed using an online tutorial approach. An online tutorial was included in a core bachelor of commerce course in the business faculty at the University of Auckland. After the assessment period, a formal evaluation of the tutorial was completed by students to answer questions about the usefulness of the skills learnt, the clarity of the instructions, and the difficulty of the content. The findings reported positive outcomes, with almost two thirds of students indicating that they would continue to refer to the tutorial in the future (Tooman & Sibthorpe, 2012).
Finally, the academic shared value is the outcome of the implemented education value chain process. Its quality depends on the efficiency and effectiveness of the input-process-output production at each stage of the whole, system-wide education value chain, and any failure at any stage would lead to sub-quality shared values. Therefore, the engagement of all relevant stakeholders must take place to assure the relevance and rigor of the shared-value outcomes and their impacts on our society. The relevant stakeholders include students and parents; governments (local authorities/bodies, education agencies, and associations); civil society organizations or networks (public or private foundations); the business community (SMEs, large enterprises, micro/social enterprises, and their associations); research and academic funding bodies; national and international accrediting agencies and partnerships; national and international incubators and agencies; and, above all, the leadership team of the university, its academics, and its support staff. The unification of the relevant stakeholders around a common mission, a clear vision, and shared goals and objectives will assure the attainment of the desired shared values. Therefore, the relevant stakeholders should be involved in determining the right intellectual capital needed to address societal challenges at the start of this 21st century. For instance, a UK government report stated that there are typically two types of person working in data analytics jobs: someone with a deep analytical educational background (usually computing, mathematics, or statistics) who acquires specialist knowledge and policy skills through courses and on-the-job training, or someone with a sector specialism who then acquires data analytics skills, whether through a conversion course or on-the-job training (Sadbolt & Dawson 2013). Both types of worker traditionally follow the above-mentioned three-process value chain.
The Future of Higher Education and Its Value Chain Creation Process
The HE and learning processes have recently been exposed to tremendous change drivers from external stakeholders, namely government regulatory agencies, professional accreditation bodies, HE academic collaborators, HE industrial partners and, finally, the distance learning organizations offering “Massive Open Online Courses” (MOOCs). We shall discuss each stakeholder's role and function in turn. The new influence relationships on the traditional HE value chain are shown in Figure 9. At least three stakeholders, indicated in the first row of the figure, are exerting pressure (dashed links) on the traditional education value chain in the second row.
Figure 9. Impact of government, accreditation agencies,
technology companies, MOOCS on the HE future
Massive Open Online Courses: MOOCs are provided free of charge over the Internet. They are based on the accredited Open University model in the UK. MOOCs follow shared value education “do-good” idealism models: self-paced interactive products designed to advance knowledge and skills around the world. The MOOC enterprise is complex and may not be a reliable means of remedying skills shortages. It involves many for-profit and not-for-profit organizations. Access to MOOCs is free of charge, but certification typically carries a fee of between $30 and $100 per MOOC. MOOCs are offered by organizations including Coursera, edX, Edraak, Minerva, Khan Academy and Udacity across America, Asia, Australia, China and Europe. They all offer a wide range of high-quality courses, with varying degrees of online support, machine-graded multiple-choice or text and peer-reviewed assignments, assessment and even certification for those completing programs.
MOOCs have a number of advantages. For instance, a Yale professor said it was amazing for him to see literally 10-year-old students from elementary school and retired professors joining his MOOC. MOOCs are breaking the traditional entry barriers to seeking knowledge and education, which required young and adult learners alike to pass through certain attainment levels. As a result, exceptionally bright young talent and experienced adults with wisdom enrolled in the same class enrich the experience of the young and bring them to maturity earlier, so that newly discovered talent can be used to advance special research areas and adults can be deployed on new shared value initiatives. Unlike traditional university students, who may be pushed into HE degree programs by parents, MOOC students enroll freely; this brings serious students to the virtual classroom who return to education with a particular objective in mind and with the will and the power to succeed. MOOCs also bring students from all over the world into a single class to debate cultural, professional, and social issues. This environment allows greater understanding of different cultures, and more tolerance, through interactive conversations among students from different cultures. As a result, MOOCs provide a means to defuse what has been called a “conflict among cultures” and to build international collaboration on big international initiatives to understand different cultures and to increase global knowledge and social networks, hence advancing peace in the world.
Lewin (2013) reported in the New York Times that the US government intends to create a global network in partnership with Coursera, based on the applauded success of MOOCs. Coursera is the largest MOOC provider, followed by edX; both have started to work with universities to offer blended, or hybrid, courses. They have forged a number of partnerships with universities in Australia, Indonesia, Jordan, France, Switzerland and China, with courses offered in languages such as Arabic, Chinese, Japanese, Kazakh, Portuguese, Russian, Turkish and Ukrainian. Moreover, edX announced that it would be working with the French higher education minister to offer online French courses in France. Its platform was also selected to power China’s new online learning portal, XuetangX, and edX is working with the International Monetary Fund to offer training. Some universities are partnering with MOOC providers to offer degree programs at a fraction of the normal cost (e.g., a $7,000 MOOC-based Master’s degree from a collaboration between Udacity, AT&T and the Georgia Institute of Technology in the US). MOOCs are accessed using tablets and smartphones; Apple also offers free video lectures from eminent authorities, among other non-university sources, through its iTunes U application.
Technology and Technology Companies Shared Value Initiatives
Companies are providing their technology tools as well as academic lectures and training courses through programs such as the IBM Academic Initiative, the Microsoft Educator Network, Oracle Academy, and SAS Higher Education Solutions. HE institutions are integrating such support into their programs. For instance, IBM offers a wide range of business analytics products, management and technology solutions that can help HE academic institutions enhance their curricula and enable students to develop competitive skills on the latest industry-standard software, systems, and tools. In addition, IBM has publicly made its computing resources (IBM Cognos) available to academic partners, who can get the full version of the software and its professionally developed courseware at no charge (IBM Academic Initiatives, 2014). At its 2014 academic days, IBM reported a willingness to hire up to 10 top students per academic program from partner universities that follow its programs, to meet its own professional needs. Microsoft, through its innovative educator network, also provides free tools for teachers around the world and connects them to share learning activities and specialised tutorials, join conversation forums and take online specialized programs on digital literacy, teaching with technology and 21st-century learning design (Microsoft, 2014).
Oracle Academy has helped more than 1.9 million students gain industry-relevant experience through its computer science and engineering education resources (Oracle, 2014). SAS offers higher education solutions to help meet the demand for business analytics professionals. SAS has estimated that US demand for business professionals skilled in data analytics could outstrip supply by 60 percent, or 1.5 million jobs, by 2018. SAS has taken the initiative to prepare students to fill the gap by offering solutions including professor workshops, teaching resources, curriculum development assistance, Master’s program support, certificate program design, free access to a selection of SAS e-learning courses on campus, on-demand software access at no cost to professors for teaching and students for learning data management and analytics, and free access to hands-on visual analytics for learning purposes through its Teradata University Network (SAS, 2014).
The Cisco Networking Academy launched a training program for engineers on the latest networking technologies through partnerships with educational institutions, non-governmental organizations and UN agencies, creating significant social shared value (Cisco, 1997). It adds skills to the local workforce, increasing the pool of trained people in their own business locations. It has trained more than 4 million students since the launch of the program, and more than 750,000 students are now “Cisco Certified”, with more than 50% of those graduates finding new jobs and 70% attaining a new or better job with a higher salary. The program also strengthens Cisco’s business in return, since its growth depends on the supply of qualified technicians who can support and administer its complex networks. Of course, not all trainees will work for Cisco, but the company benefits indirectly from the program's success, as it provides skilled trainees to its customers.
Finally, Starbucks launched its college achievement program to fund its employees’ education (Abbasi, 2014). The plan allows 135,000 Starbucks employees to attend their junior- and senior-level classes free of charge. These students have only one university to choose from, Arizona State University (ASU), but they can study anything they desire online, from business to politics to art. Employees who work at least 20 hours per week at Starbucks and qualify for admission at freshman or sophomore level will have their fees subsidized through government grants and ASU’s financial aid department; this arrangement absorbs much of the extra fees incurred until junior credit level is reached. Students are not required to continue working at Starbucks after they have completed their studies; the program is designed to boost employee engagement and attract better people to the brand. Starbucks is not drawing the line at financial support; it is also providing academic advisors, enrollment coaches and financial aid counselors to employees. Starbucks founder and CEO Howard Schultz said “businesses and business leaders must do more for their people and more for the communities we serve.”
Government and Accreditation Agencies
MOOCs are “jumping on the bandwidth”; the technology cannot be stopped, but educators must ensure that these courses meet academic standards (Caplan, 2013). Ignoring the MOOC revolution would not be possible; it would be like burying one's head in the sand. MOOCs considerably advance students’ performance and are a vital source of knowledge and learning in many developing countries. The main challenge involved in the democratization of education through the MOOC learning process is assuring the quality of the learning outcomes, given the distance e-learning nature of knowledge delivery. Although MOOCs are free, the associated social business models are still unclear and not well developed. Such models will need to emerge for sustainability, and the quality issue can then be resolved in the near future with the help of government regulation. Government must have a role in the assurance of quality education to avoid education laundering and ensure that future students get the best education possible. Politicians will inevitably come under pressure to regulate this revolution. They could make it work better by backing common accreditation standards to assure the quality of education outcomes and avoid the potential for education laundering. In Brazil, a good initiative has started to regulate the e-learning sector: private universities are enjoying success with innovative blended online e-learning, televised and in-person classes at widespread locations, and students completing degree courses take a government-run exam for the assurance of education quality. Proctoring of MOOC exams may take place at regional testing centers, similar to what has been done for English and other knowledge competency exams. It would make sense to have a single independent public or private agency in every country to certify and accredit degree programs. There are already international and national examination agencies delivering IELTS, GMAT, GRE, TOEFL and SAT examination certificates. There are also professional agencies issuing certificates, including the CMA (Certified Management Accountant), the CPA (Certified Public Accountant), Microsoft certifications and others, as well as international and national professional accreditation boards which certify professional programs and graduating students, such as AACSB in business, ABET in engineering, ACGME in medicine and other national medical boards. They all provide assurance that graduating professionals meet an accepted set of educational standards before they are allowed to join the world of practice.
Another potential risk to digital education liberation may come from politicians. Vianyak (2014) reported that “MOOCs could be revolutionary, but US foreign policy is preventing that”. MOOCs are banned in certain countries, such as Cuba, Iran, Sudan and Syria. Students in those countries are suffering because of economic sanctions, yet they are the people most in need of education to increase digital literacy and enhance the critical skills needed to get better jobs. Under the sanctions, students cannot become income-independent and cannot make a positive impact on society and governance. The sanction is political: as the New York Times reported, the US government intends to create a global network in partnership with Coursera based on the applauded success of MOOCs (Lewin, 2013). In fact, it is vital that such a narrow policy be changed, as it inhibits knowledge diffusion, the career ambitions of young people and investment in developing the skills needed to deliver a world-class smart economy in the 21st century.
Government Role in Setting Data Analytics Strategy at National Level
The UK government developed a vision with industry and academia for the UK’s data capability over the next decade. The aim is to make the UK a world leader in extracting insight and value from data for the benefit of citizens and consumers, business and academia, and the public and private sectors. The data strategy focuses on three aspects: 1) human capital (a skilled workforce and data-confident citizens); 2) tools and infrastructure; and 3) data itself as an enabler (the ability of consumers, businesses and academia to access and share data appropriately) (Sadbolt & Dawson, 2013). The aim is to prepare future data analytics professionals, with a strong skills base, able to manage, analyze, interpret and communicate data. This data management and analytics workforce will be essential to underpin any growing innovation economy. The UK strategy has the following action items, which can be used to guide the development of customized strategies elsewhere:
1. Government will work with data stakeholders (employers, e-skills, Universities and Open Data Institute) to explore the skills shortages in
data analytics and set out clear areas for government and industry collaboration. The aim is to encourage people to pursue a career in
analytics.
2. Government will hold workshops bringing together representatives from universities, businesses and other relevant bodies to discuss how
to get the right skills to meet current and future needs, and illustrate the different career pathways in data analytics.
3. Universities were invited to review how data analytics skills are taught across different disciplines and assess whether more work is
required to further embed these skills across disciplines.
4. The research community will convene an Open Science Data Forum to develop proposals to support the access to, and use of, research
data. Data analytics depends on a range of enablers, including the availability of data, the ability of data from different sources to be
combined, and confidence that personal or sensitive data will be protected. Some data can be shared freely, and some can be aggregated or
have sensitive information removed in order to make it usable. Some data cannot, and should not, be made available openly, and so must be
handled appropriately and securely. UK government has released over 10,000 datasets through its data.gov.uk site.
5. E-infrastructure Leadership Council will monitor a program of activities to drive awareness, support, and access to e-infrastructure for
businesses across key sectors, as well as a separate campaign specifically aimed at SMEs. It is creating a National Information Infrastructure,
which will contain the data held by government. It is likely to have the broadest and most significant economic and social impact if made
available and accessible outside of government.
6. The government will convene a working group on widening the MIDATA program, which allows utility companies, such as energy providers, to give consumers access to their consumption and tariff data in an electronic format, allowing a better understanding of energy use, informed decisions about switching to alternative providers or tariffs, and the building of trust and consumer confidence.
7. The Engineering and Physical Sciences Research Council (EPSRC) is developing a proposal for a national network of centers in big data
analytics to be considered as part of the research councils’ UK strategic framework for capital investment and research funding plan.
8. The government will bring into force legislation in 2014 to enable the research community to deliver significant benefit from text and data mining for non-commercial purposes.
9. Working with the Information Economy Council, the government will look at options to promote guidance and advice on the privacy and
data protection rights and responsibilities of data users.
An excellent initiative, “Getstats”, was launched by the Royal Statistical Society with the financial support of the UK government (Getstats, 2013). Getstats is a “statistics literacy for all” campaign. The aim is to improve the handling of numbers in daily life, business and policy, and to turn data into useful information. The objectives are to enable well-informed decisions: informing the state in making just laws; informing business in giving citizens the goods and services they want at affordable prices; and informing individuals and communities in making wise choices for today and for a sustainable future for our children. Getstats reached journalists, politicians, policymakers, school teachers, children and the wider community, who were trained in statistical literacy in response to a data-heavy world. Parliamentarians and policy makers have benefited from regular events in parliament to explore the statistical underpinnings of different policy agendas such as crime, health and education. Similar campaigns could be launched in other countries to achieve the same noble goals.
Finally, in a move towards a more knowledgeable society, governments can help in this digital transformation by paying more attention to three main issues: 1) digitally engaging citizens and government employees; 2) connecting government agencies and various sectors by linking various databases into an open-data government platform for use by all citizens in the co-creation of values; and 3) resourcing government operations. With respect to engagement, the greatest values come from the interaction among stakeholders and not from government; the government's main role is that of a law enabler, encouraging engagement among citizens, organizations within and outside of government, and its own public-sector workforce in more efficient, agile and trustworthy relationships. Connecting governmental agencies at the technical, policy and operational levels via wide and agile electronic service delivery networks has enormous potential to drive down costs and enable shared assets and coordinated services. The value of enhanced system interoperability extends even beyond government agencies to nonprofit and private-sector partners. Analytics can provide insights into containing the cost of government services, while increasing the potential impact on business shared value through proper spending on government IT infrastructure. Finally, resourcing should be a major area of focus for top officials; they need “the will to do” to understand IT-related initiatives through restructuring portfolios, new sourcing alternatives and improved collaboration between public and private industrial partners.
Organization Strategy for Implementing Business Analytics and Big Data Initiatives
An organization must consider the main issues related to management leadership; skilled workforce, and data network infrastructure. The
questions related to these issues include:
1. Who is responsible for leading the various big data initiatives within the organization, how will those people be involved in the big data analytics initiative, and does the organization have the necessary capacity and skills?
2. What are the required data sources and data types (key performance indicators) to share internally and externally with stakeholders?
3. What is the required readiness time (real-time, periodic, and so on), given that not all databases support real-time data availability?
4. What is the interrelatedness of the data, and what is the complexity of the business rules that will be needed to virtually link and integrate the various database sources to get a broad view of current corporate performance, potential opportunities, and risk factors, in addition to other performance metrics and scorecards?
5. What is the amount of data required (amount of historical data that needs to be included for predictive and trend analysis purposes)? If
one data source contains insufficient data (two years available instead of five years required), how will such gap be handled?
6. What is the required technology infrastructure, and who are the technology and software vendors? Do they have experience with big data analytics in your industry, and what are their track records of success?
7. Has a feasibility study of the project been carried out, considering the project's expected cost, benefit, risk, opportunity and impact on the future of the organization?
8. Does the organization have the right commitment and financial support to execute such project?
Answers to the above questions require an in-depth assessment of the requirements. They are the starting points for putting an organization on the road to deploying a big data analytics system and identifying the technology that would best support its operating functions. In general, the management of an organization must understand its own big data issues in order to assess technology needs and then put a big data analytics strategic plan in place. Also, to set a good competitive strategy, a good understanding of market competition and the current trends affecting the drivers of business shared value is needed. For a big data environment to really thrive, businesses will also need strong, committed leadership, which is as important as having the required technical skills. Technology for the sake of technology will never work, and the digital transformation may encompass three specific personnel elements:
• Vision and Leadership: This is made possible through a committed leadership and a vision where stakeholder experience and
organization value complement each other to deliver superior benefits.
• Digital User Experience: This is made possible through the deployment of integration technology and appropriate training to establish digital literacy across all operating functions of an organization.
• The Digital Transformation Team: The digital experience should be owned by everyone in the organization; thus the organizational structure should be built to fit the data analytics strategy, with all talent and expertise leveraged and put to use for success.
Depending on the organization's size, two additional things are required for big data and analytics initiatives to succeed: a full-time senior executive who champions the initiative across the organization and oversees its deployment, and the strong support of the CEO and other top executives. In general, management should not be comforted by the thought that technology tools alone can bring big data success. The real key to success is ensuring that the organization has the right workforce skills to effectively manage large and varied data sets, as well as the analytics that will be used to derive shared value and applied insights from the available organizational data.
Finally, any new education strategy to develop cognitive analytics management skills will take a long time to have an impact on society, but it is needed for society's future success and sustainable development. Patrick Moynihan said 50 years ago, “if you want to build a world class city, build a great university, and wait 200 years”. Universities generate knowledge through research and pass that knowledge to students through teaching, research and service, among other learning and training processes.
CONCLUSION
The Cognitive Analytics Management (CAM) framework is built on early cognitive behavior work, moving through cognitive computing systems and cognitive business analytics to cognitive analytics management. CAM leadership manages organizations using a hybrid combination of a horizontal, flat organization structure to collect bottom-up feedback, coupled with a vertical hierarchy to facilitate execution and provide top-down direction based on analysis of the collected feedback for smart management. In this process, inputs are collected from all relevant, and equally important, stakeholders. These stakeholders have diverse knowledge and skills, are independent from executive leadership, have an equal interest in the organization, and are willing to support the executive leadership team. Hence, organizational knowledge can be better captured from a 360-degree point of view. Consequently, positive and negative biases by managers at the various levels of the vertical hierarchy would be removed. This is especially true for new ideas and initiatives that come from bottom-level managers and regular employees; such ideas are often stopped, changed or taken without credit, in some cases damaging the initiators, when they are not in the best interests of managers at the middle and higher levels of an organization. In cognitive analytics management, all initiatives, people and objectives are given an equal hearing and equal importance as long as they are in line with the mission and goals of the organization. They are then evaluated on their individual relative merits and contributions to the mission and goals of the organization, measured on a common set of metrics covering all aspects, ranging from cost, complexity, capabilities and investment to opportunity, risk, and impact on stakeholders and society. The most efficient and effective initiatives would then be selected for execution by the management teams at all levels. New advances in cognitive analytics and technology would be used to capture and analyze internal and external best practices to generate insights and knowledge for making informed decisions for the continuous development of a knowledge-based organization. Additionally, cognitive analytics executive leadership would be more accountable for its actions, being provided with the data, scientific methods and tools to do things right rather than relying on gut feeling and intuition, which may lead to financial, personnel and other damage that cannot easily be corrected even with a change of the leadership team. The corrective and preventive approach is embedded in the CAM quality processes to capture new ideas or spot wrongdoing at an early stage using all stakeholders’ skills and knowledge.
In the previous two chapters, the Cognitive Analytics Management (CAM) roadmap and the new SAMAS framework (Shared values, Analytics, Mission, Activities, and Structures) were introduced. SAMAS integrates the strategic management and measurement fields with technology to generate cognitive, evidence-based insights using business and big data analytics models, powered by smart devices and technology software tools. The associated cognitive analytics management models and supporting technological tools to attain an organization's mission and goals were explained. The general integration philosophy of marrying two or more components to capture their best values and introduce a new innovative shared value process was demonstrated; the new shared value could not have been obtained by any single component alone. Such innovations are often found at the interface of related fields, but they are often locked away by the narrow silos and barriers among fields. They are, however, unlocked using our SAMAS integrated system-wide approach, which is inspired by the way the human brain processes information, draws conclusions, and codifies best-practice experiences into learning to improve the decision-making process.
The applications of the CAM framework and its analytics modeling processes can extend to all sectors, thus impacting humans and society. They can be applied to investigate different measures of satisfaction, happiness, joy, retention, turnover and shared value, among others, to induce more engagement and facilitate the negotiation process among pairs or groups of people. The applied insights would lead to better-informed decisions on complex strategic alternatives, including marriage, selecting the best school for children, the best location for buying a home, the best location for a holiday, and the best retirement, investment and health-care plans. The same approach can be extended to community initiatives at large: cities, governments and societies around the world. The CAM framework is mission-based, with clear goals and objectives that should be agreed upon by the stakeholders of an organization. These common goals would determine targets for shared value, provide objective rankings of alternative choices to ease negotiation, and identify collaborative partnerships among people and countries sharing the same views. New developments in social media and wireless technologies are the data enablers that remove the barriers which were hindering the use of analytical approaches. The main contributions include:
Second, a frontier data envelopment analysis approach which uses the DILIGENTS metadata is proposed to capture stakeholders’ views. The DILIGENTS metadata are grouped into short-term cost and benefit categories to be balanced against the long-term risk and opportunity, outcome and impact value of each alternative. Frontier data envelopment analysis minimizes Cost and Risk while maximizing Benefit and Opportunity. The combined COBRA-DEA analytical framework generates relative shared-value scores for alternatives based on the degree of stakeholders’ views and satisfaction. Hence, each alternative initiative of a strategic plan is evaluated on its own relative merit, according to its contribution to the mission and goals of the organization. An alternative is deemed efficient if it has the highest aggregate weighted measure of the multiple outputs it produces relative to the aggregate weighted measure of the multiple inputs it uses. The aggregation weights are variables (unlike traditional fixed weights) that are optimized each time in the best interest of the evaluated alternative, to encourage engagement as well as to facilitate the negotiation process; a minimal computational sketch of this weighting logic is given below. The logic of best-frontier data envelopment analysis is recommended to derive evidence-based applied insights to manage performance, boost growth, reward talent, and make informed decisions.
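The following is a minimal sketch, not the authors' implementation, of how the relative-efficiency logic described above can be computed with the classical CCR multiplier model of data envelopment analysis. The COBRA grouping is assumed for illustration: Cost and Risk scores are treated as inputs, Benefit and Opportunity scores as outputs, and the example figures are purely hypothetical.

import numpy as np
from scipy.optimize import linprog

def dea_efficiency(inputs, outputs, unit, eps=1e-6):
    """CCR multiplier model: the weights u (outputs) and v (inputs) are
    decision variables chosen in the best interest of the evaluated unit."""
    X = np.asarray(inputs, dtype=float)   # (n_units, n_inputs), e.g. [Cost, Risk]
    Y = np.asarray(outputs, dtype=float)  # (n_units, n_outputs), e.g. [Benefit, Opportunity]
    n, m = X.shape
    s = Y.shape[1]
    # Decision vector [u_1..u_s, v_1..v_m]; maximize u . y_unit (linprog minimizes)
    c = np.concatenate([-Y[unit], np.zeros(m)])
    # Normalization constraint: v . x_unit = 1
    A_eq = np.concatenate([np.zeros(s), X[unit]]).reshape(1, -1)
    # For every unit j: u . y_j - v . x_j <= 0, so no unit can score above 1
    A_ub = np.hstack([Y, -X])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(eps, None)] * (s + m), method="highs")
    return -res.fun  # relative efficiency in (0, 1]; 1 means frontier-efficient

# Hypothetical COBRA scores for four alternative initiatives
cost_risk = [[10, 2], [8, 4], [12, 3], [9, 5]]              # inputs to minimize
benefit_opportunity = [[60, 7], [55, 9], [70, 6], [40, 8]]  # outputs to maximize
scores = [dea_efficiency(cost_risk, benefit_opportunity, j) for j in range(4)]
print([round(v, 3) for v in scores])

Because each alternative is scored under its own most favourable weights, a low score cannot be attributed to an unfavourable weighting scheme, which is what eases the negotiation among stakeholders.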
Third, innovative social business models, emerging best-practice social analytics models, and the mobile applications and technology tools used in products and services to deliver shared values are reviewed. Briefs on real-life case studies and their associated characteristics in different sectors of society are discussed to disseminate best-practice experiences, to inspire locally and to aspire globally; i.e., we inspire knowledge to be transformed into entrepreneurship, to develop innovative products, and to launch new small and medium enterprises and startups to serve global markets.
Fourth, guidelines on how to develop an appropriate set of key performance indicators (KPIs) to assess and monitor an organization's performance and progress over time are presented. Challenges and pitfalls to avoid when developing and implementing key performance indicators and data collection strategies are highlighted.
Fifth, the emergence of social networking, sensor networks and huge information storehouses creates an overabundance of data, which requires new computing systems, cognitive analytics and management capabilities. Cognitive analytics management (CAM) is the integration of computing systems, cognitive analytics and management sciences to learn, draw conclusions, generate greater insights into hidden trends, and predict behavior with greater accuracy, leading to improved human decision making. CAM takes advantage of cognitive computing’s vast data processing power and the newly added channels for data collection (such as sensing applications), analytical management modeling, and social context to provide practical insights. It transcends the limitations of traditional data management and analysis. The real value of cognitive analytics management lies in its ability to capture, process and understand, in real time, the exploding volumes of structured and unstructured data found on the internet, including data with wide variations in format, quality and structure as well as unstructured content (e.g. voice, images, social media channels, videos and blog posts, among others). CAM applications analyze big data to monitor performance; to conduct descriptive analysis that identifies patterns and trends in the data; to perform predictive analytics that forecast future risk, fraud, and failure or success with confidence estimates, finding relationships and trends in the data that may not be readily apparent from descriptive analysis; and to apply prescriptive analytics to determine new ways to operate existing business processes efficiently and to allocate scarce resources based on the relative merit of each activity. Additionally, recent advances in technology, including cloud computing and Hadoop distributed file and processing systems, with real practical applications in various sectors of the digital economy, are discussed. A compact sketch of the three analytics layers follows.
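The sketch below is illustrative only and is not from the original chapter: it uses a tiny, hypothetical performance table to show descriptive analysis (summarizing past performance), predictive analytics (a simple trend forecast), and prescriptive analytics (allocating a fixed budget by relative forecast merit). All column names and figures are assumptions.

import numpy as np
import pandas as pd

# Hypothetical monthly performance values for three business activities
df = pd.DataFrame({
    "activity": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "month":    [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "value":    [10.0, 12.0, 13.5, 8.0, 8.5, 9.5, 15.0, 14.0, 12.5],
})

# Descriptive: monitor performance by summarizing each activity's history
descriptive = df.groupby("activity")["value"].agg(["mean", "std"])

# Predictive: fit a simple linear trend per activity and forecast month 4
forecasts = {}
for name, grp in df.groupby("activity"):
    slope, intercept = np.polyfit(grp["month"], grp["value"], deg=1)
    forecasts[name] = slope * 4 + intercept

# Prescriptive: allocate a fixed budget in proportion to forecast merit
budget = 100.0
total = sum(forecasts.values())
allocation = {name: round(budget * f / total, 1) for name, f in forecasts.items()}

print(descriptive)
print("Month-4 forecasts:", {k: round(v, 2) for k, v in forecasts.items()})
print("Budget allocation:", allocation)

In a real CAM deployment the same three steps would run over distributed stores and far richer models, but the division of labour between the layers is the same.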
Sixth, cognitive analytics management education and future trends are discussed. The critical skills shortage is analyzed using the same mindset applied throughout the previous chapters: a bottom-up feedback and top-down analysis approach is used to create better shared value for all involved. The hottest jobs are identified and the value of quality education is highlighted. An integrative management and technology framework (CAM) to identify the right balance among computing, business, management, and technology skills is presented. The CAM framework would guide higher education institutions in designing appropriate degree programs to meet the professional needs of the society they intend to serve. Finally, the impact of Massive Open Online Courses (MOOCs) and technology providers, and the role of government and accreditation agencies in assuring the quality of education outcomes, are discussed. Our aim was to shed light on the need to design new education programs based on shared value education models.
On a futuristic note regarding continuous development toward the goal of a smart and sustainable world in the 21st century, the ESPAS (2012) report emphasized the central role of the individual in our interconnected and polycentric world; the growing sense of belonging to a single human community, with a greater stress on sustainable development against a backdrop of greater resource scarcity and persistent poverty compounded by the consequences of climate change; and the shift of power away from states due to inadequate governance and the failure to meet global public demands. Franklin and Andrews (2012) discussed the sweeping trends that are changing the world faster than at any time in human history and the fundamental trends of the four decades to come. Their work offers the hope that the world in 2050 will be richer, healthier, more connected, more sustainable, more innovative, better educated, and with less inequality between rich and poor and between men and women. The world faces enormous challenges, from managing climate change to feeding 9 billion people in 2050 and coping with a multitude of new security threats. Therefore, a fresh look at every aspect of life, from health and religion to outer space, is needed; these mega-changes provide fascinating insights into the future.
Additionally, the old civilizations, which stressed the individual's role in society, have provided sets of insights and wisdom which are currently being rediscovered. Had they been followed properly, the current challenges would have been much smaller and the world would be in a much better state. The holy religious books are full of guidance for humanity. For instance, the Bible contains more than 50 verses about “doing good deeds”, including Colossians (3:23): “whatever you do, work heartily, as for the Lord and not for men”; Hebrews (10:24): “let us consider how to stir up one another to love and good works”; and Quran (22:77): “do good that you may prosper”. The necessity of finding the truth is highlighted in (John 8:44): “God instructs us to guard our thoughts. Satan is the ‘father of lies’; if our minds are not firmly grounded in truth, then we are more susceptible to his deceptions”; and (Quran 49:6): “O you who have believed, if a rebellious evil person comes with a news, verify it, (investigate it), lest you harm a people out of ignorance and afterword you become regretful, over what you have done”.
The importance of increasing people's knowledge was also highlighted in Quran (20:114): “My Lord, increase me in knowledge.” For the world to prosper, differences among people should be acknowledged and used for learning and further development. People are newly empowered with information about how the world works and are able to express themselves in powerful new ways. The cognitive analytics enabling infrastructure to collect, store, and analyze massive amounts of data, leveraging cloud networks, high-end servers and distributed architectures, together with cognitive analytics applications in business, health care, finance, supply chain and government, would improve the quality of human life. Cognitive analytics inspired by human brain intelligence will have a great impact on the sustainability and development of our interconnected planet in the 21st century. The essential elements for such sustainability are provided in Figure 10.
Trips, cultures, and religions should be used to enrich the world's experiences through open dialogues at international and national conferences; cross-region intermarriages; and communication and virtual meetings using social media networks, voice and image over IP and other Internet of Things channels to share best-practice knowledge, understand cultural differences, and share happiness, joy, love, peace, and sorrow in cyberspace. These activities are the drivers of more sustainable development and growth based on the shared value concept, leading to smart cities, smart societies and a smart planet in the 21st century. The do-good people will cooperate among themselves to combat the evils and mischief that will co-exist until the end of the earth. This general view can be found in Quran (49:13): “O mankind, we have created you from male and female and made you peoples and tribes that you may know one another. Indeed, the most noble of you in the sight of Allah is the most righteous “do-good” of you. Indeed, Allah is all-knowing and all-acquainted”. Further, without the existence of such great world leaders and connected people who believe in “doing good” and tolerance, pushing for “rightness” on the earth, the earth would have been corrupted: Quran (2:251), “If Allah did not check one set of people by means of another, the earth would indeed be full of mischief, but Allah is full of bounty to mankind”. Finally, independent people defend the rights of humanity, have ethical standards and share “do-good” principles and more, whereas dependent people defend only their own rights and their own promoters. Fowler et al. (2014) recently studied the genetic similarity between friends. Maps of the friendship networks showed clustering of genotypes, indicating that friends are as genetically similar as fourth cousins. The results suggest that “do-good” people have more in common than shared values and ethical behavioral standards; they may also share up to fifteen thousand common genes.
In summary, cognitive analytics management is still in its early stages and is by no means a replacement for traditional information and analytics programs. But it helps people to understand the world around them and to pursue our goal of developing a smarter society with a more equal income distribution and no poverty. It may contribute to a growing sense of belonging to a single human community and to the development of a sustainable planet. It is the result of the engagement of people with people, people with machines and machines with machines to exchange and generate better applied insights for the sustainability of our world. The co-existence of different pairs of people and things, including (male, female), (good, evil), (love, hate) and (business profit, charitable money), requires a better understanding in order to articulate goals and objectives and to innovate for shared value. These shared-value innovations cannot be obtained by any member of a pair alone. They require the integration of people, processes and technologies to assure the sustainable development of our planet. Despite all regulations, people will continue to have different views, interests and conflicts, as shown in two angry cartoons with bloody fights among people and nations in the absence of tolerance and “do-good” people to mediate. The world is fortunate to have great leaders dedicated to the “do-good” principle. The existence of such do-good believers will spread knowledge and bring more collaboration on good initiatives for a better, more sustainable, smarter world. Examples already exist, including the academic leadership at MIT, Harvard, Yale, and the Open University, among others, who are supporting the “Massive Open Online Courses” (MOOCs) initiative to diffuse knowledge and discoveries across the world using the Internet. Also, some chief executives of top technology companies have established “do-good” foundations and programs, including those at IBM (Watson), Microsoft, Oracle and SAS. They are providing their technology software tools and know-how in partnerships with universities to give graduates better hands-on skills to meet the current shortage of skilled professionals in the digital economy. Finally, other great non-technology leaders at world organizations and in various countries are setting up foundations to support science, technology, education and humanity around the world. The world leaders behind the “do-good” organizations are the hope for a continuing sustainable planet in the 21st century.
This work was previously published in the Handbook of Research on Strategic Performance Management and Measurement Using Data
Envelopment Analysis edited by Ibrahim H. Osman, Abdel Latef Anouze, and Ali Emrouznejad, pages 190-234, copyright year 2014 by
Business Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
This publication was made possible by a grant (NPRP 09-1023-5-158) from the Qatar National Research Fund (a member of Qatar Foundation)
to support the research assistant Maher Itani. The help of Dr Dariush Khezrimotlagh in editing the chapter is really appreciated. The statements
made herein are solely the responsibility of the authors. Finally, equal appreciation goes to the H.E. Husni Al-Sawwaf Chair in Business and
Management for the support of Dr. Ibrahim H. Osman to carry out the research of this project.
REFERENCES
Albright, S. C., & Winston, W. L. (2012). Management Science Modeling (4th ed.). South Western Cengage Learning.
Bell, P.C. (2013). Innovative Education: What every business student needs to know about analytics. ORMS Today, 40(4).
BLS. (2013). Earnings and unemployment rates by education attainment. Retrieved June 20, 2014 from
http://www.bls.gov/eymp/ep_chart_001.htm
Blumenfeld, D. E., Elkins, D. A., & Alden, J. M. (2001).Mathematics and Operations Research in Industry. Retrieved June 30th, 2014, from
http://www.maa.org/mathematics-and-operations-research-in-industry#sthash.tHElA7SG.dpuf
Caplan, S. (2013). MOOCs - massive open online courses: jumping on the bandwidth. Guardian. Retrieved July 10th 2014, from
http://www.theguardian.com/science/occams-corner/2013/jun/06/moocs-massive-open-online-courses
Davenport, T. H., & Patil, D. J. (2012). Data Scientist. Harvard Business Review, 90, 70–76.
ESPAS. (2012). Global trends 2030 – Citizens in an interconnected and polycentric world. European Union Institute for Security Studies.
Condé-sur-Noireau.
Fowler, J. H., Settle, J. E., & Christakis, N. A. (2014). Correlated genotypes in friendship networks. Proceedings of the National Academy of Sciences of the United States of America, 108(5), 1993–1997. doi:10.1073/pnas.1011687108
Fraser, K. G. (2014). Education Transformation: Digital, Personalized, Driving Better Outcomes Assuring individual, institutional and societal
success. Paper presented at Smarter Education Session, EMEA Academic Days 2014, Cognitive Systems for Smarter Business and Computing.
Politecnico di Milano, Italy.
IBM Academic Initiatives. (2014). IBM Academic Initiative Building skills for a Smarter Planet. Retrieved on July 10th, 2014, from http://www-
304.ibm.com/ibm/university/academic/pub/page/academic_initiative
Lu, C. T. K. (2014). University rankings game and its relation to GDP per capita and GDP growth. International Journal for Innovation Education and Research, 2(4), 1–33.
Martinotti, S. (2013). The growing importance of technology in the Oil and Gas industry. Paper presented at the Big Data, Big Computing & The
Oil Industry: Opportunities for Lebanon & The Arab World Conference. Beirut, Lebanon. Retrieved on June 20, 2014 from
https://cms.aub.edu.lb/units/masri_institute/workshops/Pages/BigData.aspx
McKinsey. (2013). Education to employment: Designing a system that works. McKinsey Center for Government. Retrieved June 20 2014 from
http://www.mckinsey.com/client_service/public_sector/mckinsey_center_for_government/education_to_employment
Osman, I. H. (1995). An introduction to metaheuristics. In Operational Research Tutorial Papers (pp. 92-122). Hampshire, UK: Stockton Press.
Osman, I. H., & Kelly, J. P. (1996). Metaheuristics: An overview. In Osman, I. H., & Kelly, J. P. (Eds.), Metaheuristics: Theory and Applications. Boston: Kluwer Academic Publishers.
Rometty, V. (2014). Creating machines that learn, soon we will no longer have apps but cogs instead -cognitive machines #mwc14 @IBM .
ThinkBig.
Sadbolt, N., & Dawson, P. (2013). Seizing the data opportunity a strategy for UK data capability. Retrieved on July 10th, 2014, from
https://www.gov.uk/government/publications/uk-data-capability-strategy
Sokoloff, J. (2012). Information Literacy in the Workplace: Employer Expectations. Journal of Business & Finance Librarianship, 17(1), 1–17. doi:10.1080/08963568.2011.603989
Tooman, C., & Sibthorpe, J. (2012). A Sustainable Approach to Teaching Information Literacy: Reaching the Masses Online. Journal of Business & Finance Librarianship, 17(1), 77–94. doi:10.1080/08963568.2012.629556
Trick, M. (2009). INFORMS: 30,000 members or 5000? Retrieved July 10th, 2014, from http://mat.tepper.cmu.edu/blog/?p=712
Vianyak, A. (2014). MOOCs could be revolutionary, but US foreign policy is preventing that. Guardian. Retrieved July 10th 2014, from
http://www.theguardian.com/commentisfree/2014/feb/05/us-must-lift-ban-on-moocs
KEY TERMS AND DEFINITIONS
Business Analytics Concept: The usage of business information intelligence, business statistical intelligence, and business modeling intelligence to make informed business decisions.
Cognitive Analytics Management: The scientific processes of acquiring data and transforming them into applied insights to make informed decisions, using data analytics models, cognitive systems and tools in a specific contextual domain, whether business, government, for-profit or not-for-profit organizations. These processes are inspired by human brain intelligence to make informed decisions in real time.
DILIGENTS: Demography, Innovation, Legal, Internal-Governance, Environmental, Needs, Technological, and Stakeholders.
Education Value Chain: The total value of an education-wide system incorporating values to students, academic institutions and to society.
Innovation: The processes of generating, developing and implementing new ideas and behaviors.
Internet of Things: Brings people, processes, data, and devices together online to create the world's networked society, to enrich their knowledge and experiences, and to turn data into actions with unprecedented economic opportunity and capabilities for businesses, individuals, and countries.
Shared Value: The sum of the business value to internal shareholders and the economic, environment and social impact values to external
stakeholders.
Smart Organization: An organization that can learn, identify challenges, process critical experiences into new learning to draw conclusions, and make informed real-time decisions using the Internet of Things anytime, anywhere, and anyplace.
Societal Needs: Include the essential elements for human survival, ranging from energy, food, and water to health, jobs, security, and transport.
CHAPTER 26
Big Data Paradigm for Healthcare Sector
Jyotsna Talreja Wassan
University of Delhi, India
ABSTRACT
The digitization of the world in various areas, including the health care domain, has brought remarkable changes. Electronic Health Records (EHRs) have emerged for maintaining and analyzing real health care data online, unlike the traditional paper-based system, to accelerate the clinical environment and provide better healthcare. These digitized health care records are a form of Big Data, not only because they are voluminous but also because they are real-time, dynamic, sporadic and heterogeneous in nature. It is desirable to extract relevant information from EHRs to facilitate the various stakeholders of the clinical environment. The role, scope and impact of the Big Data paradigm on health care are discussed in this chapter.
INTRODUCTION
Health care data are a valuable resource which may consist of patient demographics (age, sex, etc.), treatment plans provided by a clinician, a patient's medical history, laboratory reports, radiology reports, billing data, insurance claim requests, and so on. But the electronic storage, management and retrieval of health care data are difficult tasks, as health data are complex, voluminous, dynamic, sporadic, unstructured and heterogeneous (Wasan, Bhatnagar, & Kaur, 2006). The data generated by health care systems are reaching terabytes, and in various cases even petabytes and more. It is important to store such Big Data in an efficient, distributed manner over computing nodes. Big Data analytics has the potential to improve health care at lower costs by gaining insights and discovering associative patterns within real-time health care data. The aim of this chapter is to review a trial of modelling big data analytics to expedite the large-scale processing of electronic health data for various stakeholders.
BACKGROUND
The Big Data revolution is nascent and there is a lot of scope for new innovations and discoveries. It has set a path of rapid change in the technological world. Big Data is impacting various areas, such as social networking and online education, including the health industry. The major 5 V’s associated with Big Data (Figure 1) are listed as follows (Marr, 2014):
1. Volume: Health care data are generated in massive quantities, from clinical records to imaging and monitoring streams, and must be stored and processed at scale.
2. Velocity: Much health care data arrives continuously and in real time, for example from medical monitors, and must be captured and analyzed quickly.
3. Variety: Health care data span structured records, laboratory results, images and free-text clinical notes from heterogeneous sources.
4. Veracity: Health care data is enormous; thus it is important to maintain its relevance and trustworthiness to give the best possible benefits to the patients.
5. Value: Health care data is a rich source of information and is useful if it can be turned into valuable knowledge.
Various Big Data platforms may prove beneficial for the decision-making process in treatment planning under digitized health care systems. The applications of real-time health care systems, e.g., detecting viral infections as early as possible, identifying various symptoms and parameters swiftly, and reducing patient morbidity and mortality electronically, would revolutionize healthcare.
Premier analyzes data from various healthcare providers, enabling its members with high-performance, integrated and trusted information. The University of Ontario Institute of Technology (UOIT) is using IBM big data technology to capture and analyze real-time data from medical monitors, alerting stakeholders to potential health problems for patients (retrieved from http://www-01.ibm.com/software/data/bigdata/industry-healthcare.html). Various repositories, such as Microsoft HealthVault and Dossia, are supporting health data analytics (Steinbrook, 2008). Various technology-driven applications based on the big data paradigm, such as Asthmapolis, developed for monitoring asthmatic patients via a GPS-enabled tracker, are emerging online. Dashboard technology is also emerging, and RiseHealth is one such example of a dashboard (Groves, Kayyali, Knott, & Van Kuiken, 2013). WikiHealth is an emerging personal health training application.
Cost reduction for health care services managed electronically (such as the few mentioned above) based on big data paradigms is a promising feature.
MAIN FOCUS OF THE CHAPTER
Big Data platforms supporting distributed computing provide storage capacity and computing power over high-speed networks to extract valuable information from large medical data sets. Big Data basically deals with two main concepts: data storage and data analytics. The main focus of the chapter is to propose how data storage with Big Data stores and analytics with the MapReduce paradigm may be performed on simulated health data; a minimal sketch of the MapReduce step is given below.
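As an illustration only (not the chapter's own code), the following Hadoop Streaming-style mapper and reducer count diagnosis codes in simulated EHR records. The CSV field layout (patient_id, age, sex, diagnosis_code, ...) and the script name are assumptions made for the sketch; a real EHR feed would need its own parsing and de-identification.

#!/usr/bin/env python3
"""Count diagnosis codes in simulated EHR records, MapReduce style.
Assumed record layout: patient_id,age,sex,diagnosis_code,... (one per line)."""
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit (diagnosis_code, 1) for every well-formed record
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) > 3 and fields[3]:
            print(f"{fields[3]}\t1")

def reducer(lines):
    # Reduce phase: input arrives sorted by key, so counts can be summed per code
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for code, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{code}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Run as "ehr_counts.py map" for the mapper or "ehr_counts.py reduce" for the reducer
    (mapper if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer)(sys.stdin)

Under Hadoop Streaming the same script would typically be supplied as both the -mapper and the -reducer command; the framework handles the distributed storage, shuffling and sorting between the two phases, which is exactly the division of labour between storage and analytics that the chapter describes.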
Issues, Controversies, Problems
The structure and nature of the health industry pose some challenges for the concepts supporting Big Data. It is difficult to easily share and distribute data across various health service providers due to privacy and security concerns regarding patients (Ash & Bates, 2005). Issues and problems also arise from semantic and legal barriers. Often, even data within the same hospital is provided only to the concerned department due to a lack of integration support for big data platforms. The lack of public support related to privacy and security issues has also hindered the progress of Electronic Health Records with big data needs. Thus it is important to abide by stringent privacy protection while accessing EHRs. Many people feel more comfortable with traditional evidence-based care than with accepting technology-driven models in health care, which is a sensitive area. Data standardization, cost and ease-of-use factors also need to be taken care of before EHRs can be accessed globally. Today, various medical standards (like HL7) are emerging to accelerate the global usage of EHRs. Measures like the Health Insurance Portability and Accountability Act (HIPAA) have been created to enforce privacy rules for patients’ health information (Cheng & Hung, 2006). It is important to aim for a clear understanding of the health care data sources and to provide the best possible solution.
ELECTRONIC HEALTH RECORDS
The healthcare industry is impacted by various challenges, such as high costs, rising expenditures, inconsistent quality of care, delays in providing care, and limited access in various areas across the world. It is widely believed that the use of information technology in health care, rather than paper-based processes, can reduce the cost of health care while improving its quality. Electronic health records (EHRs) have been seen as possible solutions to these problems, as they store all the information about a patient and make it interoperable and shareable among different health care providers. Thus EHRs are longitudinal collections of electronic health information about patients, capable of being shared across various stakeholders to provide better health care (Yina, 2010).
All healthcare stakeholders (patients, health care providers, payers, researchers, governments, etc.) are being impacted by the analysis of data stored in EHRs, which may help in predicting how these players are likely to act. As populations across countries increase, the health-related data stored in EHRs is also increasing. EHRs aid in representing this data in a comprehensive, summarized form, including demographics, medical history, medication and allergies, immunization status, laboratory test results, radiology images, and billing information. The following points emphasize how electronic data available in large amounts will enable the healthcare industry to reduce costs and improve quality:
1. Improved Care: The integration and application of big data tools promote evidence-based care, because all health care providers have the same information about a given patient and are working towards a common goal. This can improve outcomes and reduce medical errors. EHRs can also provide drug recommendations and verify medications and dosages to ensure the right drug for the patient. Big Data analysis can help match the skills and specializations of health care providers with the requirements of the patient. Service providers can identify patients who are due for preventive visits and screenings and can monitor how patients measure up to certain parameters, such as vaccinations, sugar levels, and blood pressure readings.
2. Improved Standard of Living: Knowledge drawn from the large amounts of data available online can help patients play an active role in monitoring their own health, not only treating and managing their current conditions but also taking precautions against future ones by following informative recommendations obtained online. Today there are many patient portals that provide online interaction with health service providers.
3. Improved Value: Big data can be used to ensure the cost-effectiveness of healthcare through different means, such as better health insurance, patient medical reimbursements, and the elimination of fraud and waste in health systems.
4. Improved Personalized Care: Patients are empowered to control their health issues through day-to-day lifestyle measures and personal care, such as diet, exercise, and medication adherence. EHRs provide a means to share information so that patients and their families can take a fuller part in decisions about their health care.
A 300-450 billion dollar reduction in U.S. healthcare costs through big data interventions has been estimated (statistics available at http://rockhealth.com). The data available at http://www.healthit.gov/ reflect evidence of the advantages of EHR adoption; a few examples listed on the website are quoted as follows:
• “Researchers at the Center for IT Leadership (2010) Web Site Disclaimers studied the U.S. Department of Veterans Affairs, estimated the
savings of $4.64 billion from preventing adverse drug events.”
• “In Indianapolis, Finnell and Overhage (2010) found medical professionals have been benefited from access to pre-existing health
information like medication lists, allergies, and medical histories— via quick electronic exchange. This also proved useful in cases of
emergency.”
• “Shapiro et al. (2011) examined health information exchange projects in 48 States which depicted enormous potential for improving public
health reporting and investigation, emergency response, and communication between public health officials and clinicians besides some
financial and technical hurdles.”
• “Persell et al. (2011) found that EHRs can use information on patients' medical histories to improve quality significantly by providing best
methods of care for specific patients.”
The true potential of digitized information lies in big data, and EHRs are built to provide structured output. EHRs play a major part in healthcare reform and can be mined with big data analytics tools and techniques to detect useful health-related patterns. Health care providers with busy practices and patients with busy schedules appreciate the convenience that EHRs bring to their health care transactions.
HEALTH DATA STREAMS
Health data sets that grow and expand continuously and rapidly over time are a pertinent form of clinical data, collected via digitized treatment plans for patients, whether as clinical notes, laboratory reports, or otherwise. Various medical domains include health sensors or information systems that generate real-time data, and in many medical situations the monitoring of continuous real-time data is of utmost importance, e.g. heart beat monitoring. Clinicians make use of large amounts of time-sensitive data while providing effective medication to patients. Because this real-time data is huge, traditional sequential systems are not efficient enough; dealing with large, continuous flows of health care data requires big data management and processing. Health Data Stream Analytics (HDSA) could play an important role in clinical decision systems (Zhang, Q., Pang, C., Mcbride, S., Hansen, D., Cheung, C., & Steyn, M., 2010). The ultimate goal is to provide the best treatment plan to the patient in real time and at reduced cost. Streaming is gaining importance as new paradigms of Big Data analytics emerge to produce the best results in reasonable time through incremental tasks.
One of the IBM Stream Computing press releases states the following example of stream analytics (http://www-
03.ibm.com/press/us/en/pressrelease/42362.wss):
Emory University Hospital is using software from IBM and Excel Medical Electronics (EME) for a pioneering research project to create
advanced, predictive medical care for critical patients through realtime streaming analytics. Emory is exploring systems that will enable
clinicians to acquire, analyze and correlate medical data at a volume and velocity. The research application developed by Emory uses IBM’s
streaming analytics platform with EME’s bedside monitor data aggregation application to collect and analyze more than 100,000 realtime
data points per patient per second. The software developed by Emory identifies patterns that could indicate serious complications like sepsis,
heart failure or pneumonia, aiming to provide realtime medical insights to clinicians.
“Accessing and drawing insights from realtime data can mean life and death for a patient,” says Tim Buchman, MD, PhD, director of critical
care at Emory University Hospital. “Through this new system we will be able to analyze thousands of streaming data points and act on those
insights to make better decisions about which patient needs our immediate attention and how to treat that patient. It’s making us much
smarter in our approach to critical care.” Emory's vision of the “ICU of the Future” is based on the notion that the same predictive capabilities
possible in banking, air travel, online commerce, oil and gas exploration and other industries can also apply in medicine.
“As the medical community increasingly embraces the power of technology to help improve health outcomes for patients, predictive medicine is finally becoming reality,” says Martin S. Kohn, MD, chief medical scientist at IBM. “The ability to pull actionable insights from patient monitors in realtime is truly going to transform the way doctors take care of their sickest patients.”
SOURCES OF BIG DATA IN HEALTH CARE
Big data in healthcare can come from Electronic Health Records, Clinical Decision Support Systems, laboratories, pharmacies, and medical claim and insurance companies, in multiple formats (text, images, graphs, etc.), and can be geographically distributed. The various sources include the following:
1. Logs of clickstream and interaction data from social networks discussing health care (e.g., discussions on Facebook, Twitter, or various blogs). This can also include health care apps on electronic devices.
3. Billing Data in semi structured format, provided by medical insurance agencies electronically.
4. Genomic Data
5. Biometric Data
MHEALTH: GENERATING BIG DATA STREAMS
mHealth deals with the generation and dissemination of health information via mobile or wireless devices. With advances in technology, mobile apps and devices are proving useful in helping clinicians and patients collaborate to provide more personalized, preventive care. This data also has the potential to enable a 'learning health system' that anticipates health risks before they become a problem. Using mobile devices to monitor patient activity can save clinicians' and patients' time and make patients healthier by providing instant solutions for health issues. The massive amount of data being collected on mobile healthcare devices may lead to big data initiatives among the various stakeholders. Many people today track their daily activity levels and health statistics such as blood pressure and body temperature, and clinicians may increasingly use this data to improve levels of health care. Ubiquitous mobile devices present opportunities to improve health services with clinical efficacy. Used judiciously, they may prove effective in realizing remote patient monitoring and providing better healthcare. There may be some privacy and security concerns, for which health standards need to be followed.
BIG DATA PARADIGM FOR EHRs
Big Data platforms focus primarily on three aspects: large-scale data storage, data analytics, and a supporting query language for data retrieval (Figure 2). Various NoSQL data stores and related platforms, such as MongoDB, Cassandra, and Hadoop, are emerging to acquire, manage, store, and query big data.
Figure 2. Big data stack for EHRs (IGI, 2014)
NoSQL stands for “Not Only SQL” or “Not Relational”. The movement toward developing schema-less data stores, parallel programming models, and various analytical platforms in response to increasing amounts of data is known as the NoSQL movement. NoSQL platforms work in distributed environments and are horizontally scalable. These platforms may aid in aggregating and storing patient information in a dynamic form, enhancing portability and accountability across hospitals. Various NoSQL stores are emerging to support the Big Data paradigm; a few of them are listed in Table 1.
Table 1. NoSQL Data stores (Source: http://nosqldatabase.org/ last seen on October 2014)
Category: Document Data Stores
Databases: MongoDB, Elasticsearch, Couchbase Server, CouchDB, RethinkDB, RavenDB, Clusterpoint Server, Terrastore, SisoDB, djondb, EJDB, densodb
All these NoSQL data stores could be explored for storing and analyzing health care data. Big data storage and analytics provide more accurate knowledge about patients, clinicians, and diagnostic or insurance claim operations for EHRs. By processing large streams of real-time or static health care data with different Big Data tools and techniques, clinicians can monitor emerging trends, make time-sensitive decisions better, and pursue new and better health opportunities. Big data could help greatly in the following domains:
1. Creating more clinically relevant and cost-effective treatment plans and diagnostic measures.
2. Analyzing patient records to identify symptoms, responses, the best-suited vaccines, etc., and providing patients with the best services.
6. Capturing real-time health care data from devices to monitor health parameters, predicting the risk of developing a particular disease, and providing preventive care.
The conceptual framework of big data analytics in healthcare differs from traditional ones: processing is broken down and executed across multiple computing nodes rather than on one machine, and analytics can be performed in parallel fashion with the help of MapReduce, enabling better-informed health-related decisions. Furthermore, open-source platforms such as Hadoop/MapReduce and MongoDB, available on the cloud, have encouraged the application of big data analytics in healthcare. These platforms also support the concept of sharding.
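To make the schema-less storage idea concrete, the following minimal Python sketch (using the pymongo driver; the database name, collection name, and field values are illustrative assumptions, not taken from this chapter) stores and retrieves simple EHR-style documents in MongoDB without any fixed schema.

```python
# Minimal sketch of schema-less EHR storage in MongoDB (assumes a local
# MongoDB instance and the pymongo driver; all names are illustrative only).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
ehr = client["ehr_demo"]["patients"]   # database and collection are created lazily

# Documents need not share a fixed schema: each record carries only the
# fields that apply to that patient.
ehr.insert_one({
    "patient_id": "p001",
    "demographics": {"age": 54, "gender": "F"},
    "allergies": ["penicillin"],
    "vitals": [{"ts": "2014-10-01T10:00:00", "bp": "130/85", "pulse": 78}],
})
ehr.insert_one({
    "patient_id": "p002",
    "immunizations": ["influenza"],    # different fields, same collection
})

# Simple retrieval: all patients with a recorded penicillin allergy.
for doc in ehr.find({"allergies": "penicillin"}):
    print(doc["patient_id"])
```

In a sharded deployment, such a collection could be partitioned by a key such as patient identifier, so that storage and queries scale horizontally across nodes.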
Advantages of Using Big Data Paradigm in EHRs
Big Data is a nascent technology. It has set the stage for innovation, experimentation, and data analysis in almost every domain, including EHRs.
2. Big Data platforms are easy to scale and work well in distributed environments, providing more efficiency.
3. The Big Data paradigm gives both speed and capacity for cloud storage.
4. No fixed schemas are required in most Big Data platforms, such as MongoDB, unlike in traditional database systems.
6. They may support dynamic query engines over Big Data stores.
7. The new wave may aid greatly in generating new health care opportunities.
MAPREDUCE MODELLING FOR BIG DATA WORLD
Google devised an approach called MapReduce (Dean, J., & Ghemawat, S., 2008) to deal with exponentially growing Web data. The MapReduce model, as shown in Figure 3, is designed to work with a massively distributed file system. MapReduce became the motivation for an open-source technology called Hadoop, with an associated file system called HDFS (Borthakur, 2007), and for various other platforms as listed in Table 2. The MapReduce framework is capable of taking a big task and dividing it into small, discrete tasks that can be performed in parallel, as illustrated in Figure 3. The Map and Reduce functions of MapReduce are both defined with respect to data structured as (key, value) pairs.
Map Function: It accepts an input pair and produces a set of intermediate key/value pairs.
The Map function is applied in parallel to every pair in the input dataset, producing a list of intermediate pairs for each call. The MapReduce framework then collects all pairs with the same key from all the lists and groups them together, creating one group for each distinct key.
Reduce Function: This function accepts an intermediate key and a set of values for a particular key.
The Reduce function is applied in parallel to each group, which in turn produces a collection of values in the same domain.
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.
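As an illustration of these definitions, the following minimal pure-Python sketch (not tied to Hadoop; the diagnosis records are invented for the example) emits (key, value) pairs in a Map step, groups them by key, and aggregates each group in a Reduce step.

```python
# Minimal illustration of the Map -> group-by-key -> Reduce pipeline
# (pure Python; the records below are invented for the example).
from collections import defaultdict

records = [
    {"patient": "p1", "diagnosis": "asthma"},
    {"patient": "p2", "diagnosis": "diabetes"},
    {"patient": "p3", "diagnosis": "asthma"},
]

def map_fn(record):
    # Emit an intermediate (key, value) pair per input record.
    yield (record["diagnosis"], 1)

def reduce_fn(key, values):
    # Aggregate all values that share the same intermediate key.
    return (key, sum(values))

# Group intermediate pairs by key, as the framework would do between phases.
groups = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        groups[key].append(value)

results = [reduce_fn(key, values) for key, values in groups.items()]
print(results)   # e.g. [('asthma', 2), ('diabetes', 1)]
```

In a real cluster, the Map calls, the grouping (shuffle), and the Reduce calls would each run in parallel across many machines rather than in a single loop as here.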
An efficient scenario can be achieved by running many Map tasks and many Reduce tasks in parallel. Health care data, being big, may also use the MapReduce framework in various activities and may achieve significant improvements from doing so. This paradigm could be used effectively in analyzing EHRs, as discussed in later sections.
QUERYING BIG DATA WORLD
Big Data paradigms also support various HLQLs (higher-level query languages) to simplify both the specification of MapReduce operations and the retrieval of the results. Several of these HLQLs, such as HiveQL, Pig Latin, and JAQL (Stewart, R.J. et al., 2011), have emerged to query Big Data stores. Languages such as HiveQL play a role similar to that of SQL in traditional database systems, and all of them translate higher-level jobs into MapReduce jobs. Many platforms, such as MongoDB, also provide their own query languages for retrieving results.
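As a small illustration of such platform-specific querying, the sketch below (pymongo again; the "encounters" collection and its "diagnosis" field are invented for the example) uses MongoDB's aggregation pipeline to count records per diagnosis without hand-writing MapReduce jobs.

```python
# Sketch of a platform-specific query: count records per diagnosis using
# MongoDB's aggregation pipeline (collection and field names are invented).
from pymongo import MongoClient

encounters = MongoClient("mongodb://localhost:27017")["ehr_demo"]["encounters"]
pipeline = [
    {"$match": {"diagnosis": {"$exists": True}}},             # keep records with a diagnosis
    {"$group": {"_id": "$diagnosis", "count": {"$sum": 1}}},  # group and count per diagnosis
    {"$sort": {"count": -1}},                                 # most frequent first
]
for row in encounters.aggregate(pipeline):
    print(row["_id"], row["count"])
```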
DATA MINING FOR EHRs
Data Mining deals with extracting relevant patterns or items of interest from heterogeneous data sets.
EHRs are useful in capturing and utilizing health care data, but it is also important to extract useful information from EHR data to support good medical care and medical research. Mining Electronic Health Records (EHRs) has the potential to find relevant patterns and to establish new clinician, patient, or disease correlations. It also helps in evaluating the effectiveness of treatment plans, in customer relationship management, and in the detection of fraudulent insurance claims.
Data Mining may also involve predictive modelling with respect to health care data. The purpose is to transform information into knowledge that can be utilized in the health care industry. Data mining approaches can be characterized into two main classes: i) supervised learning and ii) unsupervised learning, which model the data differently. A supervised learning approach deals with a data set of labelled examples, from which a model is derived to predict the labels of future data from the existing features; examples are classifiers such as naive Bayes and artificial neural networks. Unsupervised methods, such as clustering algorithms, take an unlabelled data set and try to group data vectors on the basis of feature similarity.
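The contrast can be seen in the following minimal sketch (it assumes scikit-learn is available; the numeric features, labels, and risk interpretation are invented for the example and are not from the chapter).

```python
# Contrast between supervised and unsupervised learning on toy numeric
# health features (assumes scikit-learn; data are invented for the example).
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

# Features: [systolic blood pressure, fasting glucose]
X = [[120, 90], [160, 150], [118, 85], [165, 160], [130, 100], [158, 145]]
y = [0, 1, 0, 1, 0, 1]   # labels for the supervised case (e.g. 0 = low risk, 1 = high risk)

# Supervised: learn a model from labelled examples, then predict new labels.
clf = GaussianNB().fit(X, y)
print(clf.predict([[155, 140]]))   # predicted risk label for a new patient

# Unsupervised: group the same points by similarity, without using the labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                  # cluster assignment per data point
```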
Another important aspect of data mining is correlation mining; Association Rule Mining (ARM) is one example. It tries to uncover hidden or previously unknown connecting patterns. A rule of the form X => Y denotes the implication of an element Y by an element X, i.e., how the two items X and Y are correlated with each other. This usually amounts to finding simple if-then rules in a data set that can be used to formulate hypotheses for further study. The subsequent sections focus mainly on modelling a clustering technique for streaming health data. In computer science, data stream clustering is defined as the clustering of data that arrive continuously, such as data from social networks, online health records, etc. Data stream clustering aims at constructing a good clustering (grouping) of the stream, i.e., of a given sequence of input data points, using a reasonable amount of resources such as memory and time. Data stream clustering has recently attracted attention for emerging applications that involve large amounts of streaming data. An algorithm for data stream clustering must be able to detect changes in evolving data streams and to group the data points that represent a cluster.
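As a small worked example of such rules, the sketch below (pure Python; the patient-condition transactions are invented) computes the support and confidence of a candidate rule X => Y, the two measures commonly used to decide whether a co-occurrence is worth keeping.

```python
# Support and confidence of a candidate association rule X => Y
# (pure Python; the transactions are invented for the example).
transactions = [
    {"hypertension", "diabetes", "obesity"},
    {"hypertension", "obesity"},
    {"diabetes"},
    {"hypertension", "diabetes"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # Estimate of P(Y | X): how often Y appears when X appears.
    return support(x | y) / support(x)

x, y = {"hypertension"}, {"diabetes"}
print(support(x | y))      # support of the rule: 0.5 here
print(confidence(x, y))    # confidence of the rule: about 0.67 here
```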
PROPOSED MAPREDUCE MODELLING FOR CLUSTERING HEALTH DATA STREAMS
The real-time analysis of digitized health data involves large volumes of multi-dimensional data arriving at high rates. The processing of this kind of data needs to be efficient in order to provide real-time care to the patient. Thus, scalable clustering of multidimensional data according to users' real-time requirements is preferable for providing better health care. Clustering may form coherent groups of patients on the basis of the similarity of the symptoms they exhibit. To solve the clustering problem on multidimensional health care data streams, a scalable paradigm is preferable. One of the basic structures that can be used for storing the multi-dimensional data coming from a real-time health care application data stream is a GRID. The proposed approach maps the incoming data stream into grid cells and then clusters the grid cells to find similar or coherent patients. The basic flow of the proposed algorithm is shown in Figure 4.
Figure 4. Flow of Proposed Algorithm (IGI, 2014)
The approach has been simulated on dummy two-dimensional data. Table 2 shows the sample dummy data with two attributes, their respective domains, and the chosen granularities.
Attribute I is discretized according to a user-defined granularity, g = 4, and categorical attributes are assigned granularities according to the distinct values in their respective domain sets. An incoming data point is inserted into the appropriate cuboid of the grid using its dimensional values. In this way, all data points within physical proximity of each other are placed together in the same cuboid region. The input sample data file is shown in Table 3.
Table 3. Sample input data file (columns: data point, categorical attribute, numeric attribute, arrival time)
d1 a 12 0.25
d2 b 14 0.5
d3 c 40 0.75
d4 d 60 1
d5 a 40 1.25
d6 a 62 1.5
d7 b 45 1.75
d8 c 18 2
d9 d 42 2.25
d10 a 89 2.5
d11 b 63 2.75
d12 c 72 3
d13 d 09 3.25
d14 b 92 3.5
d15 c 98 3.75
The input file consists of data points, with their arrival times recorded as the last parameter. The arrival time is used to calculate the speed of the stream, which is averaged out (aatc) to measure the recency of the data points. A threshold value is also assumed to be associated with each grid cell to decide whether the cell is dense enough to be considered for clustering: a grid cell is considered for clustering only if its weight is greater than the predefined threshold.
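To make these definitions concrete, the following minimal Python sketch (an illustration only, standing in for the Perl transformation mentioned below; the numeric domain bounds and the threshold value are assumptions) discretizes the numeric attribute with granularity g = 4, indexes the categorical attribute by its distinct values, and keeps only the grid cells whose weight exceeds the threshold.

```python
# Sketch of mapping stream points to grid cells and pruning sparse cells
# (illustrative only; domain bounds and threshold are assumed values).
from collections import defaultdict

points = [("a", 12, 0.25), ("b", 14, 0.5), ("c", 40, 0.75), ("a", 40, 1.25), ("a", 18, 2.0)]
g = 4                      # user-defined granularity for the numeric attribute
lo, hi = 0, 100            # assumed domain of the numeric attribute
categories = ["a", "b", "c", "d"]
threshold = 1              # assumed density threshold on cell weight

def cell(cat, num):
    # Grid index: (categorical index, discretized numeric bucket).
    bucket = min(int((num - lo) / ((hi - lo) / g)), g - 1)
    return (categories.index(cat), bucket)

cells = defaultdict(lambda: {"weight": 0, "arrivals": []})
for cat, num, arrival in points:
    c = cells[cell(cat, num)]
    c["weight"] += 1
    c["arrivals"].append(arrival)   # later used for the aatc/recency check

# Only cells denser than the threshold are passed on to clustering.
dense = {idx: c for idx, c in cells.items() if c["weight"] > threshold}
print(dense)
```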
Considering the above definitions, the input file (shown in Table 3) is mapped to the GRID and an output file is created with the help of a data transformation (here, a Perl script was used). The output sample data file is shown in Table 4.
Table 4. Sample output data file
The implementation of the discussed clustering approach can be done with three Mappers and Reducers under the MapReduce framework. The initial sample input to the first MapReduce job is shown in Table 3. Each MapReduce process transforms its <key, value> input, and the output of each MapReduce process is the input of the next one. The three MapReduce jobs are as follows:
1. The initial Mapper emits the <Dimension, {gridIndex, Dimensions, weight, aatc, recent}> pairs of the grid. The output of the first MapReduce process is a list of grid IDs (i.e., <g1, g2, ...>) and their corresponding local cluster IDs, obtained after the Reduce function prunes grid cells by comparing each cell's weight with the minimum threshold and its recency factor with the aatc factor. The Reduce function keeps a grid cell only if its weight is greater than the threshold value and its recency factor is greater than the aatc (a minimal sketch of this step follows the list below).
2. The second MapReduce job clusters the grids into local clusters and produces a list of local cluster IDs corresponding to the grid IDs.
3. The third MapReduce job combines the local clusters into global clusters, based on the rule that if two local clusters contain the same grid cell, they form one global cluster.
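A minimal sketch of the first of these jobs is given below (pure Python standing in for the MapReduce runtime; the field values and the threshold are illustrative assumptions, not the chapter's actual data).

```python
# Sketch of the first MapReduce job: emit grid-cell records in the Map step
# and prune sparse or stale cells in the Reduce step (pure Python stand-in
# for the MapReduce runtime; field values are illustrative).
from collections import defaultdict

grid_cells = [
    {"gridIndex": "g1", "dimension": (0, 0), "weight": 3, "aatc": 1.0, "recent": 2.5},
    {"gridIndex": "g2", "dimension": (1, 0), "weight": 1, "aatc": 1.0, "recent": 0.4},
]
THRESHOLD = 2   # assumed minimum weight for a dense cell

def map_fn(cell):
    # Emit <dimension, cell record> pairs.
    yield (cell["dimension"], cell)

def reduce_fn(dimension, cells):
    # Keep only dense (weight > threshold) and recent (recent > aatc) cells.
    kept = [c for c in cells if c["weight"] > THRESHOLD and c["recent"] > c["aatc"]]
    return [(c["gridIndex"], dimension) for c in kept]

groups = defaultdict(list)
for cell in grid_cells:
    for key, value in map_fn(cell):
        groups[key].append(value)

surviving = [pair for key, values in groups.items() for pair in reduce_fn(key, values)]
print(surviving)   # e.g. [('g1', (0, 0))] -- becomes the input to the local-clustering job
```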
The proposed approach has been implemented on the MongoDB platform on a virtual machine, and the results are shown in Table 5.
Table 5. Results after applying MapReduce
SOLUTIONS AND RECOMMENDATIONS
Every stakeholder has a unique role in accessing health data. It is important to leverage comprehensive information across stakeholders to understand the potential benefits of accessing EHRs. The Big Data paradigm also has great potential to move the health care industry into an electronic world. The following recommendations are proposed for sharing and working on EHRs:
1. Designing Big Data governance models to manage and share health data across organizations, making health care efficient while balancing care quality and costs.
2. Ensuring consistent data storage by using Big Data stores and a robust distributed environment for sharing EHR data.
3. Designing use cases based on big data analytics to facilitate each stakeholder of health data.
4. Building base models based on Big Data technology to ensure research and development in medicine and health care domain.
5. Establishing efficient communication between the various health care providers in EHR systems.
6. Analyzing health data streams using efficient data mining algorithms w.r.t clustering, classification or association rule mining.
Exploring platforms for big data management and analysis of EHRs is useful for designing an effective health paradigm. These platforms include NoSQL databases such as MongoDB, frameworks such as Hadoop, programming models like MapReduce, and architectures like ASTERIX (Borkar, V. R., Carey, M. J., & Li, C., 2012), which differ from traditional SQL-based database systems and support the management of, and access to, heterogeneously structured, voluminous data. The decision may depend on choosing an architecture (e.g., distributed, clustered machines) for storing the data and programming models for developing data-parallel, distributed big data applications, considering factors like scalability limits and query speed. It is also important to focus on cost effectiveness and real-time support for data, from strategy to implementation. Big data analytics also requires governance, privacy, and security.
FUTURE RESEARCH DIRECTIONS
Many hospitals have started embracing the meaningful use of EHRs, but the concern now is to make EHRs easy for patients to use and to ensure the interoperability of the data they contain. It is beneficial to aim at two main areas: i) a good, easy-to-use interface for patients accessing EHRs, and ii) data analytics on EHR data. Since EHR data is big data, MapReduce could prove to be a very efficient paradigm for analytics. The continuous efforts toward timely and cost-effective management of Big Data are appreciable and could become a key ingredient of success in managing health records electronically. We must aim for the design and use of more appropriate and efficient big data tools to gain insight into their usage in the medical industry. This may aid in reducing costs, saving time, and enabling quick access to patient records for efficient health care. Various data mining techniques could also be implemented via big data analytic procedures such as MapReduce and aggregation to facilitate efficient access. Computing over a cluster of nodes, instead of relying on the capability of one main server, is the future in many industries, and it can help online medical records too. Big Data is a new paradigm to explore, underpinning many future avenues of innovation and experimentation with vast amounts of data. Future research proposals include evaluating the proposed stream clustering approach with various metrics and implementing other algorithms to solve such problems in a scalable, distributed environment.
CONCLUSION
Both EHRs and the techniques for the storage, management, and analysis of big data are still emerging. EHRs have played a role in moving health records from paper-based information to electronic form. Big Data has become a ubiquitous technology, and it could prove useful in generating new knowledge in the health care industry as well, by analyzing the heterogeneous, unstructured, schema-less health care data that traditional database management systems cannot handle. Big Data analytics in the form of the distributed MapReduce paradigm allows the various stakeholders of EHRs to develop efficient models for the health care domain. This could provide access to health data at speed and in a cost-effective manner. Various monitoring devices, clinicians, patient access platforms, and other parts of the health domain generate flows of thousands of data points per unit time, known as health streams. Health Data Stream Analytics (HDSA), with the help of MapReduce or an aggregation framework, could play an important role in clinical decision systems. EHRs have vast scope for clinical practice, and data mining paradigms implemented via big data technology could make good use of this scope in today's world of information overload. They may prove beneficial in implementing good clinical practices and a high degree of correlation between clinicians and patients. Big Data is accelerating the usage of medical products and services among all stakeholders through sharing over networks, unlike paper-based healthcare records, which were not accessible through networks across the world. Another advantage is that various Big Data platforms are open source and freely available. Big Data paradigms also support easy, dynamic, high-level query languages over big data stores.
This work was previously published in Managing Big Data Integration in the Public Sector, edited by Anil Aggarwal, pages 169-186, copyright year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Ash, J. S., & Bates, D. W. (2005). Factors and forces affecting EHR system adoption: Report of a 2004 ACMI discussion. Journal of the American
Medical Informatics Association , 12(1), 8–12. doi:10.1197/jamia.M1684
Borkar, V. R., Carey, M. J., & Li, C. (2012). Big data platforms: What's next? XRDS: Crossroads, The ACM Magazine for Students, 19(1), 44–49. doi:10.1145/2331042.2331057
Cheng, V. S., & Hung, P. C. (2006). Health Insurance Portability and Accountability Act (HIPPA) Compliant Access Control Model for Web
Services. International Journal of Healthcare Information Systems and Informatics , 1(1), 22–39. doi:10.4018/jhisi.2006010102
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM , 51(1), 107–113.
doi:10.1145/1327452.1327492
Duan, L., Street, W. N., & Xu, E. (2011). Healthcare information systems: Data mining methods in the creation of a clinical recommender
system. Enterprise Information Systems , 5(2), 169–181. doi:10.1080/17517575.2010.541287
Ebadollahi S. Coden A. R. Tanenblatt M. A. Chang S. F. Syeda-Mahmood T. Amir A. (2006, October). Concept-based electronic health records:
opportunities and challenges. In Proceedings of the 14th annual ACM international conference on Multimedia (pp. 997-1006).
ACM.10.1145/1180639.1180859
Feldman, B., Martin, E. M., & Skotnes, T. (2012). Big Data in Healthcare Hype and Hope. October 2012. Dr. Bonnie, 360.
Groves, P., Kayyali, B., Knott, D., & Van Kuiken, S. (2013). The ‘big data ‘revolution in healthcare. McKinsey Quarterly. Retrieved from
http://www-01.ibm.com/software/data/bigdata/industry-healthcare.html
Jensen, P. B., Jensen, L. J., & Brunak, S. (2012). Mining electronic health records: Towards better research applications and clinical care. Nature
Reviews. Genetics , 13(6), 395–405. doi:10.1038/nrg3208
Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of Healthcare Information Management ,19(2), 65.
Lee, C. O., Lee, M., Han, D., Jung, S., & Cho, J. (2008, July). A framework for personalized Healthcare Service Recommendation. In ehealth
Networking, Applications and Services, 2008. HealthCom 2008. 10th International Conference on (pp. 90-95). IEEE.
Lee, K. K. Y., Tang, W. C., & Choi, K. S. (2013). Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data
storage. Computer Methods and Programs in Biomedicine , 110(1), 99–109. doi:10.1016/j.cmpb.2012.10.018
Lomotey, R. K., & Deters, R. (2013, June). Terms extraction from unstructured data silos. In System of Systems Engineering (SoSE), 2013 8th
International Conference on (pp. 19-24). IEEE. 10.1109/SYSoSE.2013.6575236
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (n.d.). Big data: The next frontier for innovation,
competition, and productivity. Retrieved from
http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
Patra, D., Ray, S., Mukhopadhyay, J., Majumdar, B., & Majumdar, A. K. (2009, December). Achieving e-health care in a distributed EHR system.
In eHealth Networking, Applications and Services, 2009. Healthcom 2009. 11th International Conference on (pp. 101-107). IEEE.
10.1109/HEALTH.2009.5406205
Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets. Cambridge University Press. doi:10.1017/CBO9781139058452
Ramakrishnan, N., Hanauer, D., & Keller, B. (2010). Mining electronic health records. Computer , 43(10), 77–81. doi:10.1109/MC.2010.292
Stewart, R. J., Trinder, P. W., & Loidl, H. W. (2011). Comparing high level mapreduce query languages . In Advanced Parallel Processing
Technologies (pp. 58–72). Springer Berlin Heidelberg. doi:10.1007/978-3-642-24151-2_5
Strauch, C., Sites, U. L. S., & Kriha, W. (2011). NoSQL databases. Lecture Notes. Stuttgart Media University.
Wang F. Ercegovac V. Syeda-Mahmood T. Holder A. Shekita E. Beymer D. Xu L. H. (2010, November). Large-scale multimodal mining for
healthcare with mapreduce. In Proceedings of the 1st ACM International Health Informatics Symposium (pp. 479-483).
ACM.10.1145/1882992.1883067
Wasan, S. K., Bhatnagar, V., & Kaur, H. (2006). The impact of data mining techniques on medical diagnostics. Data Science Journal ,5, 119–126.
doi:10.2481/dsj.5.119
Yina, W. (2010, April). Application of EHR in health care. In Multimedia and Information Technology (MMIT), 2010 Second International Conference on (Vol. 1, pp. 60-63). IEEE. 10.1109/MMIT.2010.32
Zhang, Q., Pang, C., Mcbride, S., Hansen, D., Cheung, C., & Steyn, M. (2010, July). Towards health data stream analytics. In Complex Medical
Engineering (CME), 2010 IEEE/ICME International Conference on (pp. 282-287). IEEE. 10.1109/ICCME.2010.5558827
KEY TERMS AND DEFINITIONS
Clustering: Clustering deals with grouping of objects based on some similarity measure. Items in the same group (called a cluster) are more
similar to each other than to those in other groups (clusters).
Data Analytics: Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful
information, suggesting conclusions, and supporting decision making.
Data Intensive Domain: An area comprising parallel computing applications that use a data-parallel approach to process huge volumes of data, terabytes or petabytes in size, commonly referred to as Big Data.
Data Mining: Data Mining is an analytic process designed to explore data for extracting interesting and relevant patterns.
Distributed System: A system consisting of autonomous machine nodes connected in a network that share and coordinate their activities via message passing to achieve a common system goal.
Hadoop: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the
Apache v2 license. It supports the running of applications on large clusters of commodity hardware. Hadoop was derived from Google's
MapReduce and Google File System (GFS) papers. It implements MapReduce paradigm.
HLQL: High Level Query languages designed for extracting data from Big Data stores.
MapReduce: A parallel programming model, proposed by Google, used to process large data sets by distributing data over a cluster of machines while maintaining load balancing, parallel processing, and data sharing.
NoSQL: NoSQL systems are also called “Not only SQL” to emphasize non-adherence to fixed schema structures like traditional relational
systems. NoSQL systems are simple to use, dynamic and support various big databases via sharding.
Sharding: It is a principle supporting horizontal database partitioning to separate very large datasets into smaller, faster, easily manageable
chunks known as shards.
CHAPTER 27
Knowledge as a Service Framework for Collaborative Data Management in Cloud Environments: Disaster Domain
Katarina Grolinger
Western University, Canada
Emna Mezghani
Université de Toulouse, France
Miriam A. M. Capretz
Western University, Canada
Ernesto Exposito
Université de Toulouse, France
ABSTRACT
Decision-making in disaster management requires information gathering, sharing, and integration by means of collaboration on a global scale
and across governments, industries, and communities. A large volume of heterogeneous data is available; however, current data management
solutions offer few or no integration capabilities and limited potential for collaboration. Moreover, recent advances in NoSQL, cloud computing,
and Big Data open the door for new solutions in disaster data management. This chapter presents a Knowledge as a Service (KaaS) framework for
disaster cloud data management (Disaster-CDM), with the objectives of facilitating information gathering and sharing; storing large amounts of
disaster-related data; and facilitating search and supporting interoperability and integration. In the Disaster-CDM approach NoSQL data stores
provide storage reliability and scalability while service-oriented architecture achieves flexibility and extensibility. The contribution of Disaster-
CDM is demonstrated by integration capabilities, on examples of full-text search and querying services.
INTRODUCTION
Each year, a number of natural disasters strike across the globe, killing hundreds and causing billions of dollars in property and infrastructure
damage. As the number of such events increases, minimizing the impact of disasters becomes imperative in today’s society.
The role of information and communication technology in disaster management has been evolving. Large quantities of disaster-related data are
being generated. Behavior of critical infrastructures is being explored through simulation, response plans are being created by government
agencies and individual organizations, sensory systems are providing potentially relevant information, and social media (Twitter, Facebook) have
been flooded with disaster information (Hristidis, Chen, Li, Luis, & Deng, 2010). Traditional storage and data processing systems are facing
challenges in meeting the performance, scalability and availability needs of Big Data. Current disaster data storage systems are disparate and
provide few or no integration capabilities and limited potential for collaboration. To meet the needs of Big Data and make the most of available
information, a reliable and scalable storage system supported by information sharing, reuse, integration, and analysis is needed.
Another vital element of successful disaster management is collaboration among a number of teams, including firefighters, first aid, police,
critical infrastructure personnel, and many others. Each team or recovery unit is responsible for performing a well-defined task, but their
collaboration is essential for decision-making and execution of well-organized and successful recovery operations. Such diverse disaster
participants generate large quantities of heterogeneous data, making information gathering, storage, and integration especially challenging.
The activities of various disaster participants can be observed through four disaster management phases, as illustrated in Figure 1: mitigation,
preparedness, response, and recovery (Coppola, 2011). Mitigation includes all activities undertaken to reduce disaster effects by avoiding or
decreasing the impact of a disaster. The preparedness phase is concerned with preparing for disaster occurrence and includes activities such as
planning, establishing procedures and protocols, training, and exercises. The transition from the preparedness to the response phase is triggered
by disaster occurrence. The response is focused on addressing the direct, short-term effects of a disaster and includes immediate actions to save
lives, protect property, and fulfill basic human needs. The transition to the recovery phase starts when the direct disaster threat subsides and
includes activities focused on bringing society into a normal state. The approach presented in this chapter carries out both data collection and
delivery through all four phases; however, the focus is on data collection during the mitigation and preparedness stages, while during the
response and recovery phases, the focus is on data delivery. In other words, the main intent is not real-time collection of information during
disaster response, but better use of the information collected in different phases.
Figure 1. Disaster management phases
Recent advances in NoSQL, cloud computing, and Big Data have been changing how data are captured, stored, and analyzed. NoSQL solutions
have been especially popular in Web applications (Sakr, Liu, Batista, & Alomari, 2011), including Facebook, Twitter, and Google. However, the
use of NoSQL solutions and cloud technologies in disaster management has been sparse. NoSQL data stores are suitable for use as cloud data
management systems (Grolinger, Higashino, Tiwari, & Capretz, 2013) and therefore, many of the Database as a Service offerings, such as
Amazon’s SimpleDB and DynamoDB, are NoSQL data stores. Moreover, NoSQL data stores have a number of characteristics that can benefit
disaster data management, including: simple and flexible data model, high availability, horizontal scalability, and low initial investment.
The research significance of this chapter is in providing a data management framework which effectively supports information needs of disaster
management as well as other disaster-related activities. The proposed solution facilitates disaster preparedness, response, and recovery efforts by
providing a flexible and expandable storage solution for diverse disaster data. Supporting global information sharing, reuse, and integration, the
proposed solution provides improved and informed decision-making and therefore reduces the impact of disasters on human lives and property.
This chapter first introduces the motivating scenario and investigates related work. Next, a Knowledge as a Service (KaaS) framework for disaster
cloud data management (Disaster-CDM) is presented. Disaster-CDM has the objectives of 1) facilitating information gathering and sharing
through collaboration, 2) storing large amounts of disaster-related data from diverse sources, and 3) facilitating search and supporting
interoperability and integration. This research aims to facilitate better decision-making in disaster situations by supporting better use of
information through global information sharing, reuse, and integration. Moreover, by using NoSQL data stores, Disaster-CDM provides a
flexible, highly scalable, and reliable storage for diverse disaster data. Adopting service-oriented architecture, Disaster-CDM achieves flexibility
and extensibility while allowing for distributed deployments. Disaster-CDM was motivated by disaster scenarios and it was designed for the
management of disaster-related data; however, it could potentially be applied for data management in other domains.
The case study presented at the end of this chapter illustrates the use of Disaster-CDM on the data collected during the Disaster Response
Network Enabled Platform (DR-NEP) project. Disaster-CDM benefits are demonstrated by integration capabilities, on examples of full-text
search and querying services.
MOTIVATING SCENARIO
This work was motivated by a CANARIE sponsored Disaster Response Network Enabled Platform (DR-NEP) project (The University of British
Columbia, 2011). The project combined the expertise of a number of research groups, industry, government agencies, and response teams in
multiple geographical locations with the aim of improving the capability to prepare for and respond to large disasters.
A crucial element of disaster management, and of the DR-NEP project in particular, is simulation, because it provides a means of studying the behavior of critical infrastructures, as well as a way of exploring disaster response “what-if” scenarios. Therefore, disaster modeling and simulation played a major role in the DR-NEP project, with a special focus on critical infrastructure (CI) interdependency simulation.
The participation of Western University in the DR-NEP project involved the investigation of critical infrastructure interdependencies in an
incident that happened on its campus. As the event involved various infrastructures, it was simulated using several simulators including EPANET
(United States Environmental Protection Agency, 2008) water distribution simulator and the I2Sim (Rahman, Armstrong, Mao, & Marti, 2008)
interdependency simulator. Different disaster response strategies were explored and compared with decisions made during the event. Western
University collected information directly related to the event such as the event reports and timelines, data pertaining to the involved
infrastructures and a variety of other data that could help in better understanding and modeling the event.
As the DR-NEP project progressed, the quantity of available information grew, and it became difficult to manage and even more difficult to find information relevant in a specific context or to locate correlated pieces of information. For example, finding information about a specific
historic incident or locating all information about a stakeholder was largely a manual process and involved the user knowing the storage location
of the information and/or searching for it using operating system search capabilities. This approach was very time consuming, unreliable and
error prone.
A content management system (CMS) could alleviate this problem by providing a way to collect, organize, manage, and publish diverse content
including documents, workflow information, and multi-media content. Commercial products such as HP’s Autonomy (Hewlett-Packard, 2014)
and IBM’s Enterprise Content Management (ECM) (IBM, 2014) provide generic content management solutions for an enterprise. In contrast,
Disaster-CDM focuses on providing knowledge as a service for the disaster management domain. High availability is essential as a system needs
to remain operational even when a region is affected by a disaster; if a local data center fails, the system still needs to provide knowledge services
to disaster responders. Disaster-CDM achieves high availability by taking advantage of NoSQL storage solutions.
Another requirement in the context of the DR-NEP project relates to simulation. As simulation models are often stored in binary files, the success
of traditional CMS is limited. To include simulation models in text search or data analytics, simulation models need to be either extensively
annotated or represented in another form more suitable for data integration and analysis. Disaster-CDM achieves this by transforming
simulation models into an ontology-based representation.
Our previous work (Grolinger, Mezghani, Capretz, & Exposito, 2015) used Disaster-CDM in the context of Collaborative Knowledge as a Service
(CKaaS) and demonstrated collaboration among distributed KaaS entities. This chapter complements the previous one by addressing knowledge
acquisitions and storage aspects of the collaboration framework.
BACKGROUND
Research in disaster management involves many fields, including health science, environmental science, computer science, and a number of
engineering disciplines. Crisis informatics (Palen et al., 2010; Schram & Anderson, 2012), the area of research concerned with the role of
information and technology in disaster management, has been attracting increased research attention recently.
Hristidis et al. (2010) surveyed data management and analysis in the disaster domain. The main focus of their survey was on data analysis
techniques without the storage aspect. In contrast, in Disaster-CDM, storage and analysis are considered as integral parts. Hristidis et
al. identified the following data analysis technologies as relevant in disaster data management: information extraction, information retrieval,
information filtering, data mining, and decision support. Similarly, Disaster-CDM uses a number of technologies from information extraction
and retrieval. Their survey reveals that the majority of research has focused on a very narrow area of disaster management, for example, a specific
disaster event such as an earthquake or a flood, or specific disaster-related activities such as communication among actors, estimating disaster
damage, and use of mobile devices. Hristidis et al. recognized the need for flexible and customizable disaster management solutions that could be
applied in different disaster situations. Disaster-CDM aims to provide such a solution using NoSQL data stores.
Othman and Beydoun (2013) pointed out the importance of providing sharable disaster knowledge in facilitating better disaster decision-making.
They proposed a Disaster Management Metamodel with the objective of improving knowledge sharing and supporting the integration and
matching of different disaster management activities. Othman et al. (2014) analyzed the existing disaster models and created a unified view of
disaster management in the form of a metamodel across the four phases of disaster management. Although Othman et al. highlighted the large
amount of information generated in the disaster domain, disaster data storage was not considered in their study. November and Leanza (2015)
studied collecting and sharing information related to disasters, risks, and crisis; they described how information was gathered, processed,
distributed, and used in different disaster and crisis situations. Their work highlighted the importance of information reformatting and
reformulation in order to ensure that information reaches the stakeholders and that it is understood as intended. Similarly, Disaster-CDM is
concerned with data collecting and sharing; but, the focus is on facilitating search and supporting interoperability and integration.
Silva et al. (2011) aimed to integrate diverse, distributed information sources by bringing them into a standardized and exchangeable common
data format. Their approach focused on data available on public Web sites. Chou et al. (2011) proposed an ontology for developing Web sites for
natural disaster management. Web elements contained in the ontology were identified using a grounded theory approach with an inventory of disaster management Web sites. While Silva et al. (2011) and Chou et al. (2011) addressed disaster Web sites, Disaster-CDM is concerned with a
variety of diverse data sources.
Palen et al. (2010) presented a vision of technology-supported public participation during disaster events. They focused on the role of the public
in disasters and how information and communication technology can transform that role. Similarly to Hristidis et al. (2010), they recognized
information integration as a core concern in crisis informatics.
Anderson and Schram (2011) also studied the role of public and social media in disaster events. They proposed a crisis informatics data analytic
infrastructure for the collection, analysis, and storage of information from Twitter. In their initial study (Anderson & Schram, 2011), data were
stored in a relational database, specifically MySQL. Later, after encountering scalability challenges, they transitioned to a hybrid architecture that
incorporates relational database and NoSQL data store (Schram & Anderson, 2012). Similarly, Disaster-CDM also uses a combination of
relational database and NoSQL data stores; however, a combination of several NoSQL data stores has been used to address the storage
requirements of diverse data. The works of Choi and Bae (2015), Ilyas (2014), and de Albuquerque et al. (2015) also considered Twitter information in the context of disaster management. Choi and Bae (2015) presented a real-time system which monitors Twitter feeds, analyzes
disaster-related tweets, and displays disaster situations in a map. Ilyas (2014) focused on image data; the proposed system scrapes tweets for
images and classifies the images using machine learning in order to assess the severity of damage. De Albuquerque et al. (2015) combine Twitter
data with authoritative data including sensor and hydrological data to identify useful information for disaster management. In contrast to those
works (Choi & Bae, 2015; de Albuquerque et al., 2015; Ilyas, 2014), Disaster-CDM is a generic approach suitable for a variety of diverse data
sources.
Disaster-CDM incorporates the KaaS approach to make disaster-related knowledge available as services and to enable the collaboration between
consumers and providers. Within KaaS, a knowledge provider answers requests presented by knowledge consumers through knowledge services
(Khoshnevis & Rabeifa, 2012). Generally, KaaS publishes knowledge models that represent a collection of learned lessons, best practices, and
case studies as services that help consumers get knowledge from a distributed computing environment. This approach has been used in various
domains: Lai et al. (2012) presented a KaaS model for business network collaboration in the medical context, and Qirui (2012) introduced the
KaaS in the agricultural domain to provide farming recommendations. While those works store data in the relational database or do not address
data management layer, the KaaS in Disaster-CDM accommodates both structured and unstructured data by taking advantage of relational
databases and NoSQL data stores.
The collaboration aspect in the disaster management has been emphasized in the work of Waugh and Streib (2006) in which they discussed the
importance of collaboration and argued why command and control approaches are problematic. The collaboration aspect in general has been
extensively studied by Computer-Supported Cooperative Work (CSCW) (Association for Computing Machinery, 2014) and groupware
researchers. Mittleman et al. (2008) classified collaboration technologies into four categories: jointly authored pages, streaming technologies,
information access tools and aggregated systems. Disaster-CDM is somewhat similar to information access tools which provide ways to store,
share and find related content; however, Disaster-CDM focuses on providing high availability in disaster management situations. Synchronous
groupware such as chat systems, whiteboards, and video conferencing enables users to interact in real time, but requires simultaneous presence
of participants. In contrast, Disaster-CDM entails an asynchronous approach in which participants contribute at different times. An
asynchronous approach was chosen because the main objective of Disaster-CDM is not the online communication among disaster participants,
but an approach that can make better use of data collected during the mitigation and preparedness stages.
A typical groupware system addresses various aspects of generic enterprise collaboration, while Disaster-CDM focuses on the disaster
management domain. Moreover, the Disaster-CDM framework is customizable as it can be easily expanded to include new data processing
services capable of handling new data sources. The presented case study shows how the framework addresses simulation models and integrates
them with other data sources which is not possible using general-purpose content management or CSCW tools.
DISASTER-CDM FRAMEWORK
A successful disaster management relies on the collaboration among participants; however, the diversity of the involved participants and their
activities results in massive data heterogeneity. This heterogeneity of data, together with their volume, is one of the main challenges in providing
a comprehensive solution that could be used by various stakeholders in diverse disaster situations. Disaster-CDM addresses those Big Data
challenges by integrating NoSQL storage with the KaaS approach which provides disaster-related knowledge as a service.
The Disaster-CDM framework illustrated in Figure 2 is an adaptation of the framework proposed in our previous work (Grolinger, Mezghani, Capretz, & Exposito, 2013). It consists of two parts: knowledge acquisition and knowledge delivery services. Knowledge acquisition is responsible for acquiring knowledge from diverse sources, processing it to add structure to unstructured or semi-structured data, and storing it.
Heterogeneous data from sources like documents, simulation models, social media, and web pages, are handled by applying processes such as
text extraction, file metadata separation, and simulation model transformation. This results in outputs including extracted text, annotated data,
and ontology-based simulation models. Processed data are stored in a variety of relational databases and NoSQL data stores. Knowledge
delivery services are responsible for integrating information from different data stores and delivering knowledge to consumers as a service. In
contrast to the initial proposition (Grolinger, Mezghani, Capretz, & Exposito, 2013), the framework presented in Figure 2 has been extended to
include the Key-value store as a NoSQL storage option and the full-text search as an additional knowledge service.
Figure 2. Disaster-CDM framework
The following two subsections provide an overview of the two main parts of Disaster-CDM: knowledge acquisition and knowledge delivery.
Knowledge Acquisition
The knowledge acquisition services obtain data from heterogeneous data sources, process them, and store them in the cloud environment. It was decided to process the information and to store the processed, enriched data because this allows a shorter query response time than performing the processing “on the fly”.
Heterogeneous Data Sources
A few examples of information related to disasters are disaster plans, incident reports, situation reports, social media, and simulation models, including infrastructure and health-care simulations. As for representation formats, examples include MS Word, PDF, XML, a variety of image formats (jpeg, png, tiff), and simulation model formats specific to particular simulation packages. From our experience working with local disaster management agencies, the majority of information is stored in unformatted documents, primarily MS Word and PDF files. This agrees with the work of Hristidis et al. (2010), who reported that most information is in MS Word and PDF files.
Data Processing Services
The processing is driven by the input data and by data processing rules, as illustrated in Figure 2. Data processing rules specify what data
processing services are to be applied to which input data and in which order. According to the KaaS approach, Disaster-CDM provides data
processing services which can be composed by means of processing rules. The representative services with their associated outputs are included
in Figure 2:
• File Metadata Separation Service makes use of file and directory attributes, including file name, creation date, last modified date, and owner. For example, the creation date and last modified date can assist in distinguishing newer and potentially more relevant information from older and possibly outdated information (a small sketch of this service follows after this list).
• Text Extraction Service recognizes the text in an image and separates it (Sumathi, Santhanam, & Devi, 2012). This step prepares images and PDF files for other processing steps such as tagging. Text extraction is especially important in the case of diagrams such as flowcharts or event-driven process chains because these documents contain large amounts of text that can be used for tagging.
• Pattern Processing Service makes use of existing patterns within documents to extract the desired structure. Hristidis et al. (2010) observed that most of the available disaster-related information is stored in unstructured documents, but that “typically the same organization follows a similar format for all its reports”.
• Simulation Model Service converts simulation models into a representation which enables model queries and integration with other disaster-related data. To extract as much information as possible from simulation model files, an ontology-based representation of simulation models has been used (Grolinger, Capretz, Marti, & Srivastava, 2012).
• Tagging and Semantic Annotation Services. Tagging is the process of attaching keywords or terms to a piece of information with the objective of assisting in classification, identification, or search. Semantic annotations additionally specify how entities are related. In disaster management data tagging, both manual and automated tagging are needed.
The presented data processing services are common processes for addressing file-style data; nevertheless, Disaster-CDM can be easily expanded
to include new data processing services.
Data Storage
Relational databases (RDBs) are traditional data storage systems designed for structured data. They have been used for decades due to their reliability, consistency, and query capabilities through SQL; however, RDBs face many challenges in meeting the requirements of Big Data. New storage solutions, namely NoSQL data stores (Sakr et al., 2011), have emerged in an attempt to address those challenges in cloud environments.
Disaster-CDM, as illustrated in Figure 2, accommodates both relational database and NoSQL data stores. The following discussion introduces the
four NoSQL data store categories:
• Key-Value Data Stores are used for fast and simple operations. They have the simplest data model: they provide a simple mapping from the key to the corresponding value. When using a key-value data store, relations between data are handled at the application level. This data model greatly restricts integration capabilities, and therefore it is avoided in Disaster-CDM.
• Document Data Stores offer a flexible data model with query possibilities. They focus on optimized storage and access for semi-
structured documents as opposed to rows or records. Document data stores are considered an evolution of key-value data stores because
they include all the benefits of the key-value data stores while adding query capabilities.
• Column-Family Data Stores are on the surface similar to relational databases, as both have the notions of rows and columns. However, in a relational database the columns are predefined and each row contains the same fixed set of columns, whereas in a column-family data store the columns that form a row are determined by the client application and each row can have a different set of columns. Column-family data stores provide query capabilities.
• Graph Data Stores are specialized for efficient management of heavily linked data. Applications based on data with many relationships are well suited for graph data stores because costly operations such as recursive “join” operations can be replaced by efficient graph traversals.
Despite the advantages of NoSQL data stores, the Disaster-CDM framework also accommodates relational databases. RDBs are still an appropriate solution for many applications because of characteristics such as ACID (Atomicity, Consistency, Isolation, Durability) transactions, their status as an established technology, and their advanced query capabilities. Moreover, existing data in relational databases do not need to be migrated.
Knowledge Delivery
The Disaster-CDM knowledge delivery services answer information requests submitted by service consumers by integrating data stored in the cloud environment. In this stage, collaboration is achieved by providing the integrated knowledge as a service to collaboration participants. As presented in Figure 2, data access is mainly composed of two parts:
• Data interfaces: Data interfaces enable translation of the generic query into a specific language that corresponds to the underlying data
store system. Thus, the data stored in heterogeneous sources can be accessed, analyzed, and administered. An attempt to unify access to
NoSQL systems is proposed in the work of Atzeni et al. (2013) where NoSQL models and their programming tactics are reconciled within a
single framework.
• Services: This is the access layer for users. It provides services independently of how the data are stored. Thus, users are unaware of the
storage architecture and are provided with a unified view of the data.
The application of the presented Disaster-CDM approach to data formats commonly present in the disaster management domain, i.e. file-style data formats, is further detailed in the following section.
DATA MANAGEMENT FOR FILE-STYLE DATA
The Disaster-CDM framework is designed to accommodate heterogeneous data sources, including PDF files, MS Word documents, simulation
models, Web pages, and social media data. The introduction of a new data source to the framework requires:
1. Adding a new processing block to existing data processing capabilities. For example, video processing would require a new process which
would attach textual context to videos.
2. Defining data processing rules for the new data source. For instance, a video processing rule might specify that video files first undergo
metadata extraction followed by a new video-specific process.
3. Determining the data storage appropriate for the new data source. Disaster-CDM does not define storage data structure or even the type of
data store; in this step, the data store type suitable for the new data source is determined.
From the authors’ experience working with local disaster management agencies, which agrees with the work of Hristidis et al. (2010), the majority of information is stored in unformatted documents, primarily MS Word and PDF files. Another crucial element of disaster management is simulation because it provides a means of studying the behavior of critical infrastructures. Consequently, this chapter focuses on processing information stored in files, including plain text, image, and PDF files, MS Office documents including Word, PowerPoint, Excel, and Visio, and simulation model files. The common element among these information sources is that information is typically stored in self-contained and largely unrelated files.
Data Processing Services for File-Style Data
The main data processing services required to handle information stored in files are included in Figure 2. The file metadata separation service is used in processing anything that is stored as a file. Tagging and semantic annotation services are applied to textual data; in the case of images or PDF files, text is first extracted from the image or PDF file and then passed on for tagging and semantic annotation. All files are tagged and semantically annotated unless no text could be extracted from the file. Other data processing services are more format-specific, such as optical character recognition (OCR) or simulation model transformation.
Data Processing Rules
Data processing rules define how a category of data sources needs to be processed before being stored in a data store. They are influenced by the
format of the data source and the available processing services.
For example, Algorithm 1 illustrates a data processing rule for all MS Office files. First, metadata are separated, and text is extracted. Next, if
there are images in the file, they are extracted (line 4). For each image, text is extracted using OCR methods (lines 5 to 7). Finally, text extracted
from the file and from the images is tagged (lines 8 and 9).
Algorithm 1. Data processing rule for MS Office files
1: if file = MSOfficeFile then
2: processMetadata(file)
3: fileText = extractText(file)
4: images = extractImages(file) //extract all images
5: for each image in images
6: imageText += OCRProcess(image)
7: end for
8: tagText(fileText)
9: tagText(imageText)
10: end
The presented algorithm represents generic processing for all MS Office files regardless of file type. However, some MS Office files, such as Excel
files, possess additional formatting that can be exploited to add additional structure to data. For example, since Excel organizes data in tabular
form, data processing can take advantage of this formatting and create table-like structures in a data store. In this case, a service needs to be
added which can take advantage of this specific formatting, and the data processing rule needs to be refined to include Excel-specific processing.
Another category of files that is especially significant in disaster data management is simulation model files. An example of a processing rule for simulation models is presented as Algorithm 2. Like the MS Office rule, it starts with metadata separation. Next, the simulation model is transformed to its corresponding ontology-based representation (line 5), which is described in an ontology representation language.
Then postProcessOntology deals with specifics of the ontology representation language and prepares ontology-based representation for tagging;
for example, it replaces special characters and separates compound words such as those in camel-case naming. Finally, as with MS Office files,
the simulation model processing rule ends with text tagging.
Algorithm 2. Data processing rule for simulation models
1: if file = SimulationModel then
2: processMetadata(file)
3: //Transform simulation model to its corresponding
4: //ontology-based representation
5: ontModel = transformSimModelToOntology(file)
6: fileText = postProcessOntology(ontModel)
7: tagText(fileText)
8: end
Similar to these rules for MS Office files and simulation model files, rules are defined for other file categories that need to be processed, including PDF files and a variety of image formats. Overall, generic file processing consists of separating metadata and extracting text from source files using file type-specific processing, followed by tagging of the extracted text. When a source file contains additional formatting, as in Excel documents, data processing rules can use this formatting to add additional structure to the processed data.
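Viewed this way, a data processing rule is essentially an ordered composition of services per file category. The sketch below illustrates the idea only; the category names, service names, and dispatch logic are illustrative assumptions rather than part of the Disaster-CDM specification.
// Hypothetical rule registry: each file category maps to an ordered list of
// data processing services (names are illustrative, not Disaster-CDM APIs).
var processingRules = {
  msOffice:        ['separateMetadata', 'extractText', 'extractImages', 'ocrImages', 'tagText'],
  pdf:             ['separateMetadata', 'extractText', 'ocrImages', 'tagText'],
  image:           ['separateMetadata', 'ocrImages', 'tagText'],
  simulationModel: ['separateMetadata', 'transformToOntology', 'postProcessOntology', 'tagText']
};

// Apply the rule for a given category by invoking the registered services in order.
// 'services' is assumed to be an object whose properties are functions that take
// and return a processing context (extracted text, tags, metadata, and so on).
function applyRule(category, file, services) {
  var context = { file: file };
  (processingRules[category] || []).forEach(function (serviceName) {
    context = services[serviceName](context);
  });
  return context;
}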
Data Storage
Flexibility of data storage is the core of the Disaster-CDM framework because it enables a choice of storage according to the characteristics of the
data to be stored. For each data source category, two steps must be performed: determining the type of data store, and designing the data model.
Determining the type of data store consists of choosing among relational database, key-value, document, column-family, and graph data stores.
The file-style data considered in this section are stored in self-contained, apparently unrelated files. Although the file contents might be related,
this relation is not explicitly specified. Therefore, storage models focusing on relations, including relational database and graph data stores, are
not the best suited for such data. The document data store model has been chosen here for the storage of file data because it is designed around
the concept of a document, providing flexible storage while allowing structure specification within a document.
The data model design in the case of a document data store consists of defining a document structure. Document data store implementations differ in their internal representations of documents; however, they all encapsulate and encode data in some standard format. Therefore, the data model design is independent of the choice of data store implementation, provided that the data store belongs to the document category.
Table 1 depicts the data model designed for storing file data in a document data store. It is a generic model for storing a variety of file-style data, with flexibility that enables it to accommodate different file types and a variety of attributes. The presented data model is relatively standardized to support querying abilities. In contrast, allowing uncontrolled naming of fields within documents would negatively impact querying abilities. Several fields, such as fileName or origFileLocation, are mandatory because they are common to all file types and must exist in each document in the data store. On the other hand, other fields such as docImageText and tag are optional and exist only in documents that need to record those attributes. Two fields, metaData and tag, have a number of child fields for storing different attributes of the parent field. The number and names of the child fields differ among files of different types: for example, an image file might have metaData child fields such as imageWidth or resolutionUnits, but these child fields will not exist for other file types. With respect to tag fields, the number and names of the child fields depend on the tagging approach used.
Table 1. File storage data model: Document data store
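To make the Table 1 model concrete, a stored document might look as follows. This is a minimal illustrative sketch: the field values are invented, and only the field names mentioned in the text (fileName, origFileLocation, fileExtension, docText, docImageText, metaData, tag) are taken from the model.
// Hypothetical document following the Table 1 data model; all values are illustrative.
var exampleDocument = {
  fileName: "flood_response_plan.docx",            // mandatory field
  origFileLocation: "/drnep/plans/",               // mandatory field
  fileExtension: "docx",
  docText: "In case of flooding the power house is ...",  // extracted text
  docImageText: "Pump station A - Reservoir B",            // optional; present only when OCR found text
  metaData: {                                      // child fields vary with file type
    "dcterms:created": "2011-03-14T09:30:00Z",
    "dcterms:modified": "2011-05-02T16:05:00Z"
  },
  tag: {                                           // child fields depend on the tagging approach
    Organization: ["power plant"]
  }
  // The original file can additionally be kept as an attachment to the document.
};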
CASE STUDY
The motivating scenario described at the beginning of this chapter was used for this case study. Specifically, Disaster-CDM was applied to data collected by Western University during the two-year Disaster Response Network Enabled Platform (DR-NEP) project (The University of British Columbia, 2011). Public databases, such as the Emergency Events Database (http://www.emdat.be) and a number of databases from the Global Risk Information Platform (http://www.gripweb.org/gripweb/?q=disaster-database), were considered; however, those databases contain only public information. In contrast, the data set from the DR-NEP project includes public data as well as sensitive data which are not accessible to the general public.
The collected data set is heterogeneous and includes data sources such as different institutions’ disaster plans, reports of previous incidents,
incident timelines, minutes of DR-NEP meetings as well as various other disaster response meetings, information about different critical
infrastructures, risk analysis documents, and information about a number of disaster-related stakeholders. These data sources are owned by
various participants who need to collaborate and share the information they own in order to achieve successful disaster management. Because
simulation was of special interest in the DR-NEP project, the collected data include simulation models that were used to explore critical
infrastructure interdependencies, specifically EPANET (United States Environmental Protection Agency, 2008) water-distribution models and
the I2Sim (Rahman et al., 2008) interdependency simulator models.
With respect to format, the data set includes plain text files, image files in a variety of formats, PDF files, and MS Office documents, including Word, Excel, PowerPoint, and Visio. Simulation model files are simulator-specific: I2Sim models are stored in a Simulink-style .mdl file format, while EPANET models are stored in .net or .inp files.
Because this chapter focuses on knowledge acquisition and storage, the presented case study shows how knowledge from the DR-NEP data set is acquired and stored. The benefits are demonstrated through the knowledge delivery services, specifically the full-text search and querying services.
Our previous work (Grolinger, Mezghani, Capretz, & Exposito, 2015) demonstrated simulation model processing and storage of the results in a
graph data store, and illustrated simulation model querying for the purpose of model checking and validation. In contrast, this chapter is
concerned with a variety of file-style sources.
Implementation
A Web application was implemented to provide access to the Disaster-CDM system using a Web browser. Specifically, this Web application provides access to KaaS, including knowledge acquisition and knowledge delivery services, from anywhere and from a variety of devices. Apache Wicket (Apache wicket, 2015), a component-based Web application framework for the Java programming language, was used for front-end development. The following subsections describe the implementation of the two main Disaster-CDM components: data processing services and data storage.
Data Processing Services
Disaster-CDM provides data processes as services: each data processing component is implemented as a separate Web service. This choice of architecture enables flexible deployment of services in the cloud environment and service composition for the provision of knowledge acquisition services. Specifically, the RESTful (Representational State Transfer) Web service architecture was used with the GlassFish application server (GlassFish, 2014). In this case study, data processing services were deployed on a Windows machine with an Intel Core i7 processor and 16 GB of RAM; however, they can be deployed on a cluster or a cloud.
This case study focuses on data stored in a variety of file formats, and therefore it implements the data processing required for such data sources. Implementations of most of the data processing approaches from Figure 2 are available either as open source or commercial products. This case study uses open source products, adapts them when needed, and wraps them as RESTful Web services. The following data processing entities have been implemented:
• File metadata separation uses the Apache Tika Toolkit (The Apache Foundation, 2013).
• Text extraction from MS Office documents was also performed by Apache Tika; however, Tika is incapable of extracting text from images. Therefore, text extraction from image files and from images embedded in MS Office files was performed using the Tesseract (Smith, 2007) Optical Character Recognition (OCR) engine.
• Tagging was carried out using the General Architecture for Text Engineering (GATE) tool suite (Cunningham et al., 2013). Specifically, an
information extraction system called ANNIE (A Nearly-New IE system), which is distributed with GATE, was used.
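Because each processing step is exposed as a RESTful Web service, knowledge acquisition amounts to a sequence of HTTP calls. The sketch below shows what invoking a text-extraction service might look like; the host, path, and response fields are hypothetical and not part of the implementation described above.
// Hypothetical call to a text-extraction service wrapped as a RESTful endpoint
// (endpoint URL and response fields are assumptions for illustration only).
var fs = require('fs');

fetch('http://localhost:8080/disaster-cdm/services/text-extraction', {
  method: 'POST',
  headers: { 'Content-Type': 'application/octet-stream' },
  body: fs.readFileSync('flood_response_plan.docx')
})
  .then(function (response) { return response.json(); })
  .then(function (result) {
    // The extracted text would then be passed on to tagging and to storage.
    console.log(result.docText);
  });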
Data Storage
This case study focused on data stored as files, and accordingly the storage model chosen was the document data store, as presented in the Data Storage subsection. The data model portrayed in Table 1 is designed for NoSQL document data stores and can be realized in any document data store implementation. This case study used the Apache CouchDB document data store (Anderson, Lehnardt, & Slater, 2010).
CouchDB is designed for Web applications; it uses HTTP as its API and JavaScript Object Notation (JSON) to represent documents. The primary reasons for choosing CouchDB were its scalability, high availability, and partition tolerance. Its ability to scale over many commodity servers enables CouchDB to store large amounts of data, while its high availability ensures system operation even when a region is affected by a disaster and a local data centre fails. Partition tolerance refers to the ability of the system to remain operational in the presence of network partitions, which is especially relevant in disaster-related applications because it can be expected that parts of the network will fail. CouchDB achieves partition tolerance using an asynchronous replication approach; multiple replicas, possibly placed in geographically distant locations, have their own copies of data, and in case of a network partition, each replica modifies its own copy. At a later time, when network connectivity is restored, the changes are synchronized.
The primary way of querying and reporting on CouchDB documents is through views which use the MapReduce model with JavaScript as a query
language. In the MapReduce model, the Map function performs filtering and sorting while the Reduce function carries out grouping and
aggregation operations.
The Apache Lucene library (McCandless, Hatcher, & Gospodnetic, 2010) provides full-text search of data stored in CouchDB. In general, Lucene is an open-source, high-performance text search engine library written in Java. It is suitable for almost any application which requires full-text search and has been recognized for its utility in Internet search engines. With respect to Disaster-CDM, Lucene enables ranked searches and field-specific searches, such as searching for a specific file name or an author. This case study takes advantage of the CouchDB-Lucene project (CouchDB-lucene project, 2012), which integrates Lucene with CouchDB. For the purpose of the presented case study, CouchDB was deployed on a single machine with an Intel Core i7 processor and 16 GB of RAM; however, this setup can be changed to a cluster or a cloud.
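With CouchDB-Lucene, the fields to be indexed are declared through an index function placed in a design document. The sketch below follows CouchDB-Lucene conventions but is only an assumption about how the index for this case study could be declared; it indexes the extracted text, the OCR text, and the file name.
// Hypothetical CouchDB-Lucene index function (the "fulltext" section of a design document).
function (doc) {
  if (!doc.docText && !doc.docImageText) {
    return null;                                  // nothing searchable in this document
  }
  var ret = new Document();
  if (doc.docText) { ret.add(doc.docText); }                          // indexed in the default field
  if (doc.docImageText) { ret.add(doc.docImageText); }                // OCR text, also in the default field
  if (doc.fileName) { ret.add(doc.fileName, { field: "fileName" }); } // enables field-specific searches
  return ret;
}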
Knowledge Acquisition Services
Western University stored the data collected as part of the DR-NEP project on a server in a dedicated area. It was the responsibility of the
individual participants to place data that needed to be shared among participants onto the server. Therefore, this case study uses data from this
DR-NEP server as its data source. In the knowledge acquisition stage, these data were processed by data processing services and loaded into the
Disaster-CDM system, specifically into the CouchDB. A total of 1129 files were successfully loaded into the Disaster-CDM system, resulting in the
same number of documents in the data store. A number of files failed to load; however, further review revealed that they were in file formats
which are outside the scope of this case study, including pub, zip, mat, dll, and exe. Nevertheless, the number of these files is small, and including
them in the knowledge acquisition process would not have resulted in a major system improvement.
Table 2 shows the number of files of each type loaded into the system. As expected, there were many MS Word and PDF files. Furthermore, the number of PowerPoint presentation files (pptx) was large, which may be explained by the nature of the DR-NEP project: a multidisciplinary project involving a large number of stakeholders, in which presentations were often used to transfer knowledge or convey findings. In addition, a large number of .m and .h text files were found, but their significance in knowledge delivery is minor because they are MATLAB and C-language program files. With regard to simulation data, there were 20 EPANET model files (.net) and 12 I2Sim model files (.mdl).
Table 2. Loaded file types
File type    Number of files
pdf          247
m            149
pptx         104
h            73
jpg          64
docx         60
txt          54
png          51
…            …
net          20
mdl          12
Presently, the knowledge from each file is acquired once, and the system does not keep track of subsequent changes to the file. New files can be
loaded into the system at any time.
Knowledge Delivery Services
Two knowledge delivery services are illustrated: full-text search and querying. The two are complementary approaches for accessing data stored
in a NoSQL data store, with each one exhibiting strengths for specific data access tasks.
Full-Text Search
Storing data in a document data store enables variants of full-text search. Three variants of full-text search were investigated:
• Searching attached documents: This search relied solely on document attachments in the CouchDB data store. Because original files
were attached to the CouchDB document in their initial form, this search was somewhat similar to using an indexing and search engine,
Lucene in this case, directly on the original files. This strategy does not take advantage of any data processing performed during knowledge
acquisition and is the baseline for comparison with other strategies.
• Searching extracted text: This strategy includes only the contents of docText fields. Because text extracted from images is stored in the docImageText field, this strategy ignores text contained in images, including text in image files and text in images embedded in other documents. Note that ontology-based simulation models are stored in docText fields and therefore are included in this strategy.
• Searching extracted text, including text from images: This approach takes full advantage of Tika text extraction as well as OCR text
extraction by including both fields, docText and docImageText, in the search. This strategy takes full advantage of the data processing
performed in the knowledge acquisition stage.
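The three strategies then differ only in which indexed fields the Lucene query targets. The query strings below are illustrative; they use standard Lucene syntax (quoted phrases, field:term prefixes, boolean OR) and assume that the attachments and the extracted-text fields are indexed separately.
// Illustrative Lucene query strings for the three strategies (field names follow Table 1).
var attachmentQuery = '"power house"';                               // searching attached documents
var extractedTextQuery = 'docText:"power house"';                    // searching extracted text only
var extractedTextAndImageQuery =
    'docText:"power house" OR docImageText:"power house"';           // including text from images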
A full-text search screen from the implemented Web application is displayed in Figure 3. This application enables users to choose among the three described search strategies; on the screen in Figure 3, the extracted text strategy is selected. The results of searching for the term “power house” are displayed in a table with two columns: document and last modified. It can be noted that the search result is made up of various file types, including PDF and text files, MS Word, PowerPoint, and simulation model files. Some of the files appear several times with different last modified dates; this is caused by files residing in different folders but having the same name. Disaster-CDM does not check whether files with the same name have identical content, but rather creates a new document in the data store for each loaded file.
Figure 3. Full-text search
Table 3 provides an overview of the different strategies with respect to the main file categories addressed in this case study. For three file categories, PDF, text, and I2Sim model files, all three search strategies produced virtually the same results. Even though searching I2Sim models produced the same result set, the ranking of the documents was different because the searches were based on different text content. The attached document strategy searched mdl files, which are text files, directly, while the other two strategies searched the ontology-based simulation models. Consequently, the attached document strategy ranked simulation models lower than the other two strategies.
Table 3. Search strategies (columns: attached documents / extracted text / extracted text including image text)
PDF files: ✓ / ✓ / ✓
MS Office files: ✓ (does not include text from images) / ✓ (does not include text from images) / ✓
Image files: ✗ / ✗ / ✓
Text files: ✓ / ✓ / ✓
Simulation model files – I2Sim (.mdl): ✓ (mdl files are text files) / ✓ / ✓
Simulation model files – EPANET: ✗ / ✓ / ✓
With regard to MS Office files, the difference among the various searches depended on whether or not they were using text extracted from
images. The data set for this case study contained 82 MS Word files (doc and docx), of which only 8 contained images from which text was
successfully extracted. In contrast, out of 140 PowerPoint files (ppt and pptx), only 6 did not benefit from the OCR process. Therefore, the OCR
process had a higher impact on processing PowerPoint files than on processing Word files. With respect to image files, out of 116 images, text was
successfully extracted from 75; however, some of the extracted text did not contain readable words and therefore was not beneficial for searching.
Therefore, the OCR process had greater impact on PowerPoint files than on image files, which can be explained by the common use of diagram-
style graphs in PowerPoint presentations.
Transforming simulation models into their corresponding ontology-based representations did not have a major impact on I2Sim models, but was essential for including EPANET models in the text search. The attached document strategy did not search EPANET models because they are represented in .net binary files; however, the extracted text strategies searched EPANET models by taking advantage of the ontology-based simulation models stored in docText fields.
Note that the attached document search strategy took advantage of the CouchDB-Lucene project (CouchDB-lucene project, 2012), which uses
Apache Tika (The Apache Foundation, 2013) to search the attached documents. This case study also used Tika to extract text from files, and
therefore the only major difference between the attached-document and the extracted text strategies was with respect to the EPANET model files.
Only the extracted text strategy searched the EPANET model files.
Full-text search can also be achieved by applying a text search engine such as Lucene directly to the file system containing disaster-related data; however, such a search ignores text contained in images as well as text in images embedded in other documents. In contrast, full-text search in Disaster-CDM includes image text because the OCR performed in the knowledge acquisition stage extracted text from images. Moreover, direct full-text search on the file system does not include EPANET .net model files, as they are binary files. Disaster-CDM transforms EPANET model files into an ontology-based representation and consequently includes them in the full-text search. Additionally, storing data in a NoSQL data store facilitates querying file-style data and allows Disaster-CDM to take advantage of the scaling and replication capabilities provided by the NoSQL store.
Querying File-Style Data
The documents contained in the document store are semi-structured: the data within a document is encoded, but each document can have a
different structure. The data model designed for storage of file-style data, as presented in Table 1, was flexible enough to enable storage of diverse
data, but at the same time was relatively standardized to support querying abilities. In this case study, querying was used to obtain aggregate
information about the contents of the data store, such as the number of documents of each type or the number of documents containing images.
Aggregate querying is illustrated here on a simple example, that of counting the documents of each type. In CouchDB, this is achieved by views which make use of the MapReduce approach. The Map function extracts the value of the fileExtension field from within each document, while the Reduce function groups by fileExtension (which is in the key argument passed to the Reduce function) and counts the entries for each fileExtension.
Map function:
function(doc) {
emit(doc.fileExtension, 1);
}
Reduce function:
function (key, values) {
return sum(values);
}
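Such a view is queried over CouchDB's HTTP API with grouping enabled, so that the Reduce function returns one count per file extension. A minimal sketch follows; the database, design document, and view names are assumptions.
// Hypothetical HTTP call to the view above; group=true groups rows by fileExtension.
fetch('http://localhost:5984/disaster_cdm/_design/files/_view/count_by_type?group=true')
  .then(function (response) { return response.json(); })
  .then(function (result) {
    result.rows.forEach(function (row) {
      // Each row pairs a file extension (key) with its document count (value).
      console.log(row.key + ': ' + row.value);
    });
  });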
The data presented in Table 2 were obtained by executing this query. As illustrated, obtaining such information from the Disaster-CDM system is
very simple; however, doing this without the Disaster-CDM system would require extensive manual efforts or use of specialized (custom or off-
the-shelf) software.
The full-text search described in the previous subsection did not take full advantage of the tagging performed during data acquisition. When text
was extracted from documents, tagging was performed, and the results were stored within different tag fields. Because the tag fields are encoded
within the document, they facilitate querying. For example, as part of the DR-NEP project, Western University explored an accident on the
university campus which involved a local power plant. During data acquisition, the text extracted from documents was forwarded to the tagging
processes. If a power plant was mentioned in a document, the ANNIE tagging process recognized ‘power plant’ as an organization and therefore
tagged it as organization=’power plant’. Consequently, the resulting document in the data store contained the following entry: tag:
{organization: [“power plant”]}. This document structure can be used to find all documents referring to power plants, and the CouchDB view created for this purpose is displayed in Listing 1. The Map function outputs the organization tag as the first array element because this is the search criterion. In addition, this view includes fileName to identify the original file and creationDate to distinguish more recent documents. The Reduce function eliminates duplicates produced by the Map function. After this view is created, data can be queried by specifying values in HTTP calls. A few rows of the results of searching for the organization tag “power plant” are displayed in Table 4.
Table 4. Query results for “Power Plant”
Listing 1. Querying for “Power Plant” - Map and Reduce functions for CouchDB view
Map function
function(doc) {
  // Emit one row per organization tag; documents without organization tags are skipped.
  if (doc.tag && doc.tag.Organization && Array.isArray(doc.tag.Organization)) {
    doc.tag.Organization.forEach(function (organizationTag) {
      // Prefer the last modified date; fall back to the creation date.
      var creationDate = doc.metaData["dcterms:modified"];
      if (creationDate == null) {
        creationDate = doc.metaData["dcterms:created"];
      }
      emit([organizationTag.toLowerCase(), doc.fileName, creationDate], null);
    });
  }
}
Reduce function
function (key, values) {
  // Used only to collapse duplicate keys when the view is queried with grouping; no aggregation is needed.
  return null;
}
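A minimal sketch of such an HTTP call is given below. The database, design document, and view names are assumptions; group=true invokes the Reduce function so that duplicate keys are collapsed, and the composite start and end keys limit the result to the organization tag of interest.
// Hypothetical query of the Listing 1 view for the organization tag "power plant".
var base = 'http://localhost:5984/disaster_cdm/_design/tags/_view/by_organization';
var url = base + '?group=true' +
    '&startkey=' + encodeURIComponent(JSON.stringify(['power plant'])) +
    '&endkey=' + encodeURIComponent(JSON.stringify(['power plant', {}]));

fetch(url)
  .then(function (response) { return response.json(); })
  .then(function (result) {
    result.rows.forEach(function (row) {
      // row.key is [organizationTag, fileName, creationDate], as emitted by the Map function.
      console.log(row.key[1] + ' (' + row.key[2] + ')');
    });
  });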
In this case study, only automated tagging was used, and therefore tags typically resembled phrases found in the text extracted from documents. In this situation, querying as described in the example gave similar results to the full-text search described in the previous subsection. However, Disaster-CDM was designed to allow manual tagging by end users in addition to automated tagging. In a manual tagging scenario, the effectiveness of queries similar to the organization tag example would be increased.
Discussion
Two knowledge delivery services were explored: full-text search and querying. This section further discusses the benefits and drawbacks of the two approaches.
• Various full-text search strategies were investigated, which allowed analysis of the effects of the data processing performed during knowledge acquisition on the full-text search results. Overall, the benefits of the data processes vary according to the file format as well as the file content. For example, as expected, the OCR process had a major impact on image file searching; however, the experiments showed that PowerPoint files also benefited greatly from this process. Full-text search does not take advantage of automated tagging, and therefore, if knowledge delivery relies only on full-text search, the tagging process can be omitted.
• The querying service proved advantageous in obtaining various types of aggregate information about the stored contents. Some of the query tasks explored in this case study, such as searching for a word or a phrase, can also be achieved by full-text search. In those circumstances, full-text search has the advantage over querying due to its simple call interface and its ability to rank documents according to their relevance. However, the querying approach is promising with respect to manual tagging, as it allows easy and fast access to tagged data.
Consequently, the two knowledge delivery services explored in this case study, full-text search and querying, are complementary services suitable
for different tasks. Knowledge delivery services, together with knowledge acquisition services, facilitate collaboration by providing a platform for
sharing and integrating disaster-related information.
The main limitation of the proposed approach is related to the knowledge acquisition services; data need to be processed according to data processing rules before they are stored in the cloud storage and used. This means that data are loaded into Disaster-CDM and not used in their original form or from their original locations. Consequently, the proposed approach does not carry out real-time data collection during disaster response, but is focused on data collection in other disaster phases. However, it is important to highlight that this design choice achieves shorter query response times than performing the processing “on the fly”.
The challenges of implementing the proposed framework are twofold. Firstly, for diverse data sources, different data processing services need to
be implemented. The quality of the services provided to the end users is highly dependent on the implemented data processing services.
Secondly, it is challenging to provide knowledge delivery services with diverse storage systems due to the difficulty of integrating different data
models. Future research directions aim to address those challenges.
FUTURE RESEARCH DIRECTIONS
The knowledge acquisition and data storage components of the Disaster-CDM framework were the focus of this chapter. Directions for future
research related to this aspect include data acquisition from other sources such as Web sites and social media, dynamic data processing rule
specification, changes to existing knowledge (knowledge evolution), knowledge conflicts, and NoSQL data store comparison in the context of the
presented framework.
This chapter has presented the main design of the knowledge delivery component without addressing details; thus future work will involve
various aspects of knowledge delivery. Since NoSQL data stores were designed for different purposes, they differ greatly in their data models and
querying abilities, which presents an obstacle to integration. Integration of NoSQL stores will be explored in order to accommodate different data
stores within the framework and provide integrated knowledge to consumers.
The case study presented in this chapter involved query and full-text services, but analytics services were not addressed. Various Big Data
analytics approaches will be explored with the objective of providing a better insight into disaster-related data and therefore better disaster
management.
To successfully support collaboration on a global scale and across governments, industries, and communities, privacy and security will be
addressed. This is challenging for a number of reasons, including cloud storage on third-party premises and in a shared multi-tenant
environment, diversity of the storage models involved, and the large number of collaboration participants.
The presented Disaster-CDM framework is designed for use with disaster-related data; however, it could potentially be applied in other domains. For example, Disaster-CDM for file-style data could be applied to any file-based data and is not restricted to disaster-related data. Future work will explore the potential of using the same framework, possibly with some adaptations, in other domains.
CONCLUSION
In recent years, we have witnessed a number of extreme weather events and natural disasters. At the same time, changes in software and
hardware have created opportunities for new solutions in disaster management.
This chapter has presented Disaster-CDM, a framework for disaster data management. Disaster-CDM stores large amounts of data while
maintaining high availability by using NoSQL solutions. Collaboration among partners, knowledge sharing, and integration are facilitated
through knowledge acquisition and knowledge delivery services. The knowledge acquisition service is responsible for acquiring knowledge from
diverse sources and storing it in the cloud environment, while knowledge delivery services integrate diverse information and deliver knowledge to
consumers.
Special attention has been paid to data management for file-style data such as MS Office documents, PDF files and images because these are the
formats most commonly present in the disaster management domain. File processing, data processing rules, and data storage for file-style data
have been described.
The case study presented in this chapter demonstrated the use of Disaster-CDM on data collected during the Disaster Response Network Enabled
Platform (DR-NEP) project. Specifically, knowledge was acquired from the DR-NEP data set and stored in a NoSQL document data store.
Knowledge delivery was illustrated by querying and full-text search examples.
This work was previously published in Managing Big Data in Cloud Computing Environments edited by Zongmin Ma, pages 183-209, copyright year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Anderson, J. C., Lehnardt, J., & Slater, N. (2010). CouchDB: The definitive guide . Sebastopol, CA, USA: O'Reilly Media.
Anderson, K. M., & Schram, A. (2011). Design and implementation of a data analytics infrastructure in support of crisis informatics research: NIER track. Proceedings of the 33rd International Conference on Software Engineering, Honolulu, Hawaii (pp. 844-847). 10.1145/1985793.1985920
Association for Computing Machinery. (2014). Conference on computer-supported cooperative work and social computing (CSCW). Retrieved
from http://cscw.acm.org/2015/index.php
Atzeni, P., Bugiotti, F., & Rossi, L. (2013). Uniform access to NoSQL systems. Information Systems , 43, 117–133. doi:10.1016/j.is.2013.05.002
Choi, S., & Bae, B. (2015). The real-time monitoring system of social big data for disaster management. Computer Science and its Applications,
330, 809-815.
Chou, C., Zahedi, F., & Zhao, H. (2011). Ontology for developing web sites for natural disaster management: Methodology and
implementation. IEEE Transactions on Systems, Man, and Cybernetics. Part A, Systems and Humans , 41(1), 50–62.
doi:10.1109/TSMCA.2010.2055151
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., . . . Heitz, T. (2013). Developing language processing
components with GATE. University of Sheffield department of computer science. Retrieved from http://gate.ac.uk/sale/tao/split.html
de Albuquerque, J. P., Herfort, B., Brenning, A., & Zipf, A. (2015). A geographic approach for combining social media and authoritative data towards identifying useful information for disaster management. International Journal of Geographical Information Science, 29(4), 667-689. doi:10.1080/13658816.2014.996567
Grolinger, K., Capretz, M. A. M., Marti, J. R., & Srivastava, K. D. (2012). Ontology-based representation of simulation models. Proceedings of the 24th International Conference on Software Engineering and Knowledge Engineering, San Francisco Bay, CA, USA (pp. 432-437).
Grolinger, K., Higashino, W. A., Tiwari, A., & Capretz, M. A. (2013). Data management in cloud environments: NoSQL and NewSQL data stores. Journal of Cloud Computing: Advances, Systems and Applications, 2. doi:10.1186/2192-113X-2-22
Grolinger, K., Mezghani, E., Capretz, M. A. M., & Exposito, E. (2013). Knowledge as a service framework for disaster data management. Proceedings of the 22nd WETICE Conference (pp. 313-318). Hammamet, Tunisia. 10.1109/WETICE.2013.48
Grolinger, K., Mezghani, E., Capretz, M. A. M., & Exposito, E. (2015). Collaborative knowledge as a service applied to the disaster management
domain. International Journal of Cloud Computing ,4(1), 5–27. doi:10.1504/IJCC.2015.067706
Hristidis, V., Chen, S., Li, T., Luis, S., & Deng, Y. (2010). Survey of data management and analysis in disaster situations. Journal of Systems and
Software , 83(10), 1701–1714. doi:10.1016/j.jss.2010.04.065
Ilyas, A. (2014). MicroFilters: Harnessing twitter for disaster management. Paper presented at the IEEE Global Humanitarian Technology
Conference (pp. 417-424). 10.1109/GHTC.2014.6970316
Khoshnevis, S., & Rabeifa, F. (2012). Toward knowledge management as a service in cloud-based environments. International Journal of Mechatronics, Electrical and Computer Technology, 2(4), 88–110.
Lai, I., Tam, S., & Chan, M. (2012). Knowledge cloud system for network collaboration: A case study in medical service industry in China. Expert
Systems with Applications , 39(15), 12205–12212. doi:10.1016/j.eswa.2012.04.057
McCandless, M., Hatcher, E., & Gospodnetic, O. (2010). Lucene in action . Stamford, CT, USA: Manning Publications.
Mittleman, D. D., Briggs, R. O., Murphy, J., & Davis, A. (2008). Toward a taxonomy of groupware technologies. Lecture Notes in Computer
Science , 5411, 305–317. doi:10.1007/978-3-540-92831-7_25
November, V., & Leanza, Y. (2015). Risk, disaster and crisis reduction: Mobilizing, collecting and sharing information . Springer International
Publishing; doi:10.1007/978-3-319-08542-5
Othman, S. H., & Beydoun, G. (2013). Model-driven disaster management. Information & Management , 50(5), 218–228.
doi:10.1016/j.im.2013.04.002
Othman, S. H., Beydoun, G., & Sugumaran, V. (2014). Development and validation of a disaster management metamodel (DMM). Information
Processing & Management , 50(2), 235–271. doi:10.1016/j.ipm.2013.11.001
Palen, L., Anderson, K. M., Mark, G., Martin, J., Sicker, D., Palmer, M., & Grunwald, D. (2010). A vision for technology-mediated support for public participation & assistance in mass emergencies & disasters. Proceedings of the ACM-BCS Visions of Computer Science Conference, Edinburgh, UK (pp. 1-12).
Qirui, Y. (2012). KaaS-based intelligent service model in agricultural expert system. Proceedings of the 2nd International Conference on Consumer Electronics, Communications and Networks, Yichang, China (pp. 2678-2680). 10.1109/CECNet.2012.6201763
Rahman, H. A., Armstrong, M., Mao, D., & Marti, J. R. (2008). I2Sim: A matrix-partition based framework for critical infrastructure interdependencies simulation. Proceedings of the Electrical Power and Energy Conference, Vancouver, BC, Canada (pp. 1-8). 10.1109/EPC.2008.4763353
Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments.IEEE
Communications Surveys and Tutorials , 13(3), 311–336. doi:10.1109/SURV.2011.032211.00087
Schram, A., & Anderson, K. M. (2012). MySQL to NoSQL: Data modeling challenges in supporting scalability. Proceedings of the 3rd Annual Conference on Systems, Programming, Languages and Applications: Software for Humanity, Tucson, AZ, USA (pp. 191-202). 10.1145/2384716.2384773
Silva, T., Wuwongse, V., & Sharma, H. N. (2011). Linked data in disaster mitigation and preparedness. Proceedings of the Third International Conference on Intelligent Networking and Collaborative Systems, Fukuoka, Japan (pp. 746-751). 10.1109/INCoS.2011.113
Smith, R. (2007). An overview of the Tesseract OCR engine. Proceedings of the Ninth International Conference on Document Analysis and Recognition, Curitiba, Brazil. 10.1109/ICDAR.2007.4376991
Sumathi, C. P., Santhanam, T., & Gayathri Devi, G. (2012). A survey on various approaches of text extraction in images. International Journal of Computer Science and Engineering Survey, 3(4), 27–42. doi:10.5121/ijcses.2012.3403
The Apache Foundation. (2013). Apache Tika toolkit. Retrieved from http://tika.apache.org/
The University of British Columbia. (2011). DR-NEP (disaster response network enabled platform) project. Retrieved from
http://drnep.ece.ubc.ca/index.html
United States Environmental Protection Agency. (2008). EPANET - water distribution modeling. Retrieved from
http://www.epa.gov/nrmrl/wswrd/dw/epanet.html
Waugh, W. L., & Streib, G. (2006). Collaboration and leadership for effective emergency management. Public Administration Review , 66(1),
131–140. doi:10.1111/j.1540-6210.2006.00673.x
ADDITIONAL READING
Brewer, E. (2012). CAP twelve years later: How the “rules” have changed. Computer , 45(2), 23–29. doi:10.1109/MC.2012.37
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., & Gruber, R. E. (2008). Bigtable: A distributed storage system for
structured data. ACM Transactions on Computer Systems , 26(2), 1–26. doi:10.1145/1365815.1365816
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters . New York: ACM; doi:10.1145/1327452.1327492
Hwang, K., & Hwang, K. (2012). Distributed and cloud computing: From parallel processing to the internet of things . Waltham, MA, USA:
Elsevier/Morgan Kaufmann.
Kannimuthu, S., Premalatha, K., & Shankar, S. (2012). Investigation of high utility itemset mining in service oriented computing: Deployment of knowledge as a service in E-commerce. Proceedings of the Fourth International Conference on Advanced Computing, Chennai, India (pp. 1-8). 10.1109/ICoAC.2012.6416812
Kapucu, N., & Garayev, V. (2011). Collaborative decision-making in emergency and disaster management. International Journal of Public
Administration , 34(6), 366–375. doi:10.1080/01900692.2011.561477
Lakshman, A., & Malik, P. (2010). Cassandra: A decentralized structured storage system. Operating Systems Review , 44(2), 35–40.
doi:10.1145/1773912.1773922
Sadalage, P. J., & Fowler, M. (2013). NoSQL distilled: A brief guide to the emerging world of polyglot persistence . Upper Saddle River, NJ, USA:
Addison-Wesley.
Stonebraker, M., Madden, S., Abadi, D. J., Harizopoulos, S., Hachem, N., & Helland, P. (2007). The end of an architectural era: (It’s time for a complete rewrite). Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria (pp. 1150-1160).
Sumathi, C. P., Santhanam, T., & Gayathri Devi, G. (2012). A survey on various approaches of text extraction in images. International Journal of Computer Science and Engineering Survey, 3(4), 27–42. doi:10.5121/ijcses.2012.3403
Waugh, W. L., & Streib, G. (2006). Collaboration and leadership for effective emergency management. Public Administration Review , 66(1),
131–140. doi:10.1111/j.1540-6210.2006.00673.x
KEY TERMS AND DEFINITIONS
Big Data: Collection of massive data sets too big to be processed using traditional approaches. It is characterized by the 3Vs: volume, velocity
and variety.
Cloud Computing: A computing paradigm in which large groups of computing resources are networked in order to provide on-demand access
to a shared resource pool.
Disaster Management: The organization and management of responsibilities and resources for dealing with emergencies in order to reduce
the impact of disasters.
Knowledge as a Service: An on-demand, self-service computing paradigm in which a knowledge service provides answers to questions presented by knowledge consumers.
NoSQL Data Store: NoSQL is used as an acronym for “Not only SQL”, which emphasizes that SQL-style querying is not the crucial objective of
these data stores. The term is used as an umbrella classification that includes a large number of immensely diverse data stores.
Ontology: An explicit formal specification of a conceptualization. It describes concepts in the domain and relations among them.
CHAPTER 28
Building a Visual Analytics Tool for Location-Based Services
Erdem Kaya
Sabanci University, Turkey
Mustafa Tolga Eren
Sabanci University, Turkey
Candemir Doger
Sabanci University, Turkey
Selim Balcisoy
Sabanci University, Turkey
ABSTRACT
Conventional visualization techniques and tools may need to be modified and tailored for analysis purposes when the data is spatio-temporal. However, there could be a number of pitfalls in the design of such analysis tools when they rely entirely on well-known techniques with well-known limitations, possibly due to the multidimensionality of spatio-temporal data. In this chapter, an experimental study to empirically test whether the widely accepted advantages and limitations of 2D and 3D representations are valid for spatio-temporal data visualization is presented. The authors implemented two simple representations, namely density map and density cube, and conducted a laboratory experiment to compare these techniques from task completion time and correctness perspectives. Results of the experiment revealed that the validity of the generally accepted properties of 2D and 3D visualization needs to be reconsidered when designing analytical tools to analyze spatio-temporal data.
INTRODUCTION
Over the past few years, the visualization community has worked on problems closely related with the cartographic and geographic information
system (GIS) communities. The cross disciplinary connection between these fields facilitates the visual display and interactive exploration of
geospatial data and the information derived from it. Analysis of geographic information in time and space is becoming an important subject with
the increasing use of location data in everyday life. Some key challenges are understanding the dynamics of data in time and space, identifying
spatial and temporal data patterns, and correlating spatio-temporal data to other data such as sales.
In their extensive work, Andrienko and Andrienko (2006) emphasize the need for visualization techniques and analytical tools that will support spatio-temporal thinking and contribute to solving a large range of problems. Nevertheless, due to the sophisticated nature of spatio-temporal data analysis, current visualization techniques and analytical tools are not fully effective and need to be improved (Andrienko et al., 2010).
In this work, we are not proposing novel visualization techniques for spatio-temporal data. Instead, we examine whether well-known aspects of 2D and 3D representations are also valid in spatio-temporal visualization. To support our findings, we conducted an experiment with highly representative scenarios and tasks that could emerge in spatio-temporal data analysis.
The main contribution of this work is a novel empirical study leading to the conclusion that 3D visualizations should be considered a valid option in spatio-temporal data visualization. To our knowledge, this is the first work providing evidence opposing the findings of previous research against 3D techniques in this domain. A particular kind of visualization technique is not completely advantageous compared to others, as suggested by previous work (Andrienko & Andrienko, 2006; Hicks, O'Malley, Nichols, & Anderson, 2003; Kjellin, Pettersson, Seipel, & Lind, 2010; Munzner, 2008; Robertson, Fernandez, Fisher, Lee, & Stasko, 2008). On the contrary, 2D and 3D visualizations seem to be counterparts
completing each other. The advantages of 2D representations over 3D for various kinds of data seem to be well understood, which might mislead one into assuming that 3D has more drawbacks than 2D in spatio-temporal visualization. Based on our study and that of Kjellin et al. (2010), there appears to be enough evidence to reject the idea that 3D visualization should only be considered a secondary option in the visualization of spatio-temporal data.
We have analyzed 2D and 3D density visualization techniques, namely density map (Figure 1a and 1b), and density cube (Figure 1c). Before
designing our evaluation methodology, we have interviewed system administrators from a Location Based Services (LBS) company and identified
most likely scenarios based on which we performed a laboratory experiment to compare density map and density cube techniques from time (to
complete the tasks) and correctness perspectives.
Figure 1. Views from our experimental tool showing the 2D and 3D representations of spatio-temporal data from a commercial friend finder application: 2D density map (on the left) and 3D density cube (on the right). Note that the third dimension for the density cube is time.
Based on the scenario-wise analysis of our collected data, we found that participants were able to analyze the data faster with the density cube technique in cases where they needed to view a given data window as a whole (e.g. trend detection). However, they were able to answer the questions more accurately overall when they viewed the data with the density map technique. In particular, the density map technique was significantly better than the density cube technique in the case of cluttered data. The density map technique assisted participants better, in terms of accuracy, in minimum- and maximum-finding and comparison questions, while no significant difference was observed for trend questions. Our scenario-wise analysis revealed that the density cube and density map techniques were superior to each other depending on the scenario. These results suggest the reconsideration of well-known properties of 2D and 3D representations, particularly when the data to be analyzed is spatio-temporal.
The chapter is organized as follows. First, a brief background on 2D and 3D representations, along with spatio-temporal data and its visualization, is given. After the specification of the data and the techniques is presented in the next section, we report the methodology of the experiment. Finally, we present a discussion of the results of our experiment and concluding remarks, as well as future research directions, in the last three sections of the chapter.
BACKGROUND
Properties of 2D and 3D Representations
The choice between 2D and 3D representations in information visualization seems to be a highly debatable and complex issue. Furthermore, the previous body of work suggests that well-understood, commonly accepted advantages and drawbacks of one representation over the other may depend on the context and data analysis objectives. As Munzner (2008) states, the 3D choice becomes meaningful when the 3D representation is implicit in the dataset and when that representation fits the mental model of the user about the phenomenon.
Enumerating all the studies comparing 2D and 3D representations would be the subject of a different work. The commonly known properties of these representations are briefly explained below with a few example studies.
The strength of 2D representations in terms of both efficiency (e.g. time to complete a task) and accuracy (e.g. error rate) has been reported in a number of studies. For example, users were able to locate regions of interest more efficiently and accurately when they viewed a blood flow visualization with 2D representations (Borkin et al., 2011). In their experimental study, Hicks et al. (2003) compared various performance levels of 2D and 3D visualizations of an e-mail usage log, which they consider temporal data. They reported that 3D representations performed poorly in terms of task completion time, whereas the very same representations aided their participants in answering comparison questions more accurately when the participants had to view the whole data. Their findings differ from ours, possibly due to the natural differences between temporal and spatio-temporal data, which we discuss in the Discussion section.
Another property of 2D visualizations is related to “spatial memory”, as suggested by previous work. In their study, Cockburn and McKenzie (2002) report that their subjects’ ability to quickly locate web page images deteriorated as their freedom to use the third dimension increased. Case studies and formal user studies have demonstrated that 2D data encodings and representations are generally more effective than 3D for tasks involving spatial memory, spatial identification, and precision.
The advantage of 3D representations becomes clear for tasks requiring a holistic view of the data (Hicks et al., 2003; Plumlee & Ware, 2006; Robertson et al., 2008). As Hicks et al. (2003) state, cognitive errors decrease when users are able to view the data as a whole. The experimental findings of Kjellin et al. (2010) also support this idea, as their participants were able to track multiple vehicles better when they were equipped with 3D visualization techniques. Nevertheless, the holistic view might bring its own issues, such as occlusion, one of the common problems associated with 3D representations (Hicks et al., 2003). Perspective foreshortening is another problem with 3D representations: users may have difficulty judging the sizes of objects in 3D visualizations due to the perspective projection while performing trend comparison tasks.
Spatio-Temporal Data
In their extensive work, MacEachren and Kraak explore different techniques and the research challenges in geovisualization field (Kraak, 2003;
MacEachren, 2004; MacEachren & Kraak, 2001). They address the important points of geovisualization such as representation of geospatial
information, integration of visualization with computational methods, interface design for geovisualization environments and cognitive/usability
aspects of geovisualization. Different cartographic techniques have been used to represent geospatial information. Among the many visualizations that use geographic maps, thematic mapping techniques are designed to show a particular theme connected with a specific geographic area (Slocum, 2009). The density map technique has also been adopted for geographic data visualization and analysis in several important examples (Wilkinson & Friendly, 2009). Fisher (2005) proposes an interactive heat map system that visualizes the geographic areas receiving high user attention in order to understand the use of online maps. Mehler et al. (2006) also use a geographic visualization technique similar to a heat map, in which they geographically analyze news sources. Another interactive framework taking advantage of heat maps is introduced by Scheepens et al. (2011a, 2011b), in which they aim to visualize the trajectory data of vessels. The density map implementation employed in our experimental study is heavily influenced by the heat map visualization technique.
Perhaps one of the most important and ubiquitous data types is data with references to both time and space, usually referred to as spatio-temporal data. The concept of spatio-temporal data has been defined in geographic information systems (GIS) (Yuan, 1996), data mining (Roddick, Hornsby, & Spiliopoulou, 2001), and visualization (Andrienko, Andrienko, & Gatalsky, 2003). Visualization of spatio-temporal data involves the direct depiction of each record in a data set so as to allow the analyst to extract noteworthy patterns by viewing the representations on the display and interacting with them. The increasing number of studies on the management (Abraham & Roddick, 1999) and analysis (Andrienko & Andrienko, 2006; Rhyne, MacEachren, & Dykes, 2006) of spatio-temporal data in the last decade indicates the importance of analyzing this data type. The analysis of such data, with references both in space and in time, is a challenging research topic. Major research challenges include scale, as it is often necessary to consider spatio-temporal data at different spatio-temporal scales; the uncertainty of the data, as data are often incomplete, interpolated, collected at different times, or based upon different assumptions; the complexity of geographical space and time, since in addition to the metric properties of space and time and the topological/temporal relations between objects, it is necessary to take into account the heterogeneity of space and the structure of time; and finally the complexity of spatial decision making processes, since a decision process may involve actors with different roles, interests, and levels of knowledge of the problem domain and the territory (Andrienko et al., 2007). When the spatio-temporal data sets are very large and complex, existing techniques may not be effective in allowing the analyst to extract
important patterns. Users may also have difficulty in perceiving, tracking, and comprehending numerous visual elements that change simultaneously. One way to deal with this problem is the aggregation or summarization of data prior to graphical representation and visualization (Demšar & Virrantaus, 2010; Scheepens et al., 2011a, 2011b). Numerous visualization studies have been conducted on spatio-temporal data. For visualizing spatial change over time, Scheepens et al. propose an interactive visualization framework that analyzes the trajectory data of vessels to understand their behavior and risks (Scheepens et al., 2011b). After the space-time cube method was revisited for the analysis of geographic data in many works (Fisher, 2005; Kraak, 2003), it has frequently been used for visualizing spatio-temporal data (Gatalsky, Andrienko, & Andrienko, 2004; Turdukulov, Kraak, & Blok, 2007). The space-time cube approach introduced the idea of using the third axis to represent time. 3D visualization techniques have also been used for visualizing hierarchies that change over time in a geo-spatial context (Hadlak, Tominsky, Schulz, & Schumann, 2010). Several time-oriented visualization methods have been presented (Boyandin, Bertini, Bak, & Lalanne, 2011; Shanbhag, Rheingans, & desJardins, 2005) to analyze and support the effective allocation of resources in a spatio-temporal context. In their analytic review, Andrienko et al. (2003) discuss various visualization techniques for spatio-temporal data, with a focus on exploration. They categorize the techniques by the kind of data they can be used for and the kinds of exploration questions that can be asked. As this very brief summary of the existing literature shows, these works include evaluations of the visualizations to some extent; however, whether and when their findings for spatio-temporal data also apply to other types of data has received little attention. The effects of the representation type may be different than expected when the data to explore is spatio-temporal, where the dimensionality inherently increases. Kjellin et al. (2010) report different cases in which 2D and 3D representations outperform each other. Their 2D representations of movement plots helped participants perform better when estimating where two vehicles would intersect. Conversely, the 3D representation helped users more when they had to estimate a similar measure for more vehicles, supporting our idea that the nature of spatio-temporal data might affect the generally accepted drawbacks of 3D representations.
DOMAIN PROBLEM SPECIFICATION
Many spatio-temporal visualization techniques and tools have been designed for the analysis of trajectory data. However, the geographic data
employed in our visualizations are spatial events with GPS-accuracy geographic coordinates and a seconds-precision timestamp. In other words,
we do not have any trajectories and our analysis of spatio-temporal data solely depends on the spatial correlation between regions of consecutive
density representations. Our methods are designed considering these characteristics of the data.
Data Description
Real Data
The geographic data employed in our study comprises spatial event records with GPS-accuracy geographic coordinates as well as seconds-precision timestamps. Approximately 2.5 million such records were collected by an LBS company. Each data record corresponds to when and where a mobile phone application was invoked. The dataset contains records saved between February 2nd, 2011 and April 1st, 2012 from almost all provinces of Turkey. Even though the density of the data was adequate to infer possibly interesting trends, we opted to use fictitious data so that we could generate a number of patterns that were not present in our real data.
Figure 2. Data point generation: (a) given an origin point, a data point is generated with a randomized angle and a radius; (b) the area of interest is divided into concentric annular subareas.
Artificially Generated Data
For our experimental study, we decided to employ artificial data merely to generate a wider spectrum of scenarios that were not available in the real data.
A data generator has been designed to reflect the nature of spatio-temporal data. For a given region centered at an origin point and a specific time interval, the generator creates data records, each of which corresponds to a point in the data space with latitude, longitude, and time dimensions.
For a given time period, sub-intervals with even durations are created. Based on the requirements of a predefined scenario, a distribution function f[t] is defined as the planned total number of data points per sub-interval. The actual total number of data points per sub-interval is then obtained by perturbing the planned count with a noise parameter that controls the distortion of the number of data points to be generated: a random value is drawn within the range defined by the noise parameter and added to the planned count.
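As an illustration of this count perturbation, the following minimal Python sketch distorts a planned per-interval count with a noise parameter; the function name and the exact perturbation formula are assumptions, since the chapter's original equation is not reproduced here:

    import random

    def actual_count(planned, noise):
        """Perturb the planned number of points for a sub-interval.

        planned -- f[t], the planned total number of data points
        noise   -- noise parameter in [0, 1] controlling the distortion
        """
        # Draw a random distortion bounded by +/- noise * planned
        # (assumed form; the chapter's exact formula is not shown).
        delta = random.uniform(-noise * planned, noise * planned)
        return max(0, int(round(planned + delta)))

    # Example: a planned count of 200 points with 10% noise
    print(actual_count(200, 0.1))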
Point Generation—Data points, each comprised of a latitude, longitude, and timestamp tuple, are generated as follows. Starting from the origin point on which the data generation calculation is based (see Figure 2a), a random angle and a radius within the selected annular subarea are drawn to obtain the latitude and longitude, and the timestamp is drawn between the beginning and the ending time stamps of the given sub-interval. The inner and outer radii of the selected annular subarea will be explained further in the next section.
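A Python sketch of one plausible form of this point generator is shown below; since the original equations are not reproduced here, the polar-coordinate formulation and the meters-to-degrees conversion are our own assumptions made for illustration:

    import math
    import random

    METERS_PER_DEGREE = 111_320.0  # rough conversion near the equator (assumption)

    def generate_point(origin_lat, origin_lon, r_inner, r_outer, t_begin, t_end):
        """Generate one (latitude, longitude, timestamp) record around an origin.

        The radius is drawn from the selected annular subarea [r_inner, r_outer]
        (in meters) and the timestamp from the sub-interval [t_begin, t_end].
        """
        angle = random.uniform(0.0, 2.0 * math.pi)
        radius = random.uniform(r_inner, r_outer)
        lat = origin_lat + (radius * math.sin(angle)) / METERS_PER_DEGREE
        lon = origin_lon + (radius * math.cos(angle)) / (
            METERS_PER_DEGREE * math.cos(math.radians(origin_lat)))
        timestamp = random.uniform(t_begin, t_end)
        return lat, lon, timestamp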
Scatter of the Generated Points—The geographical scatter of the generated data points is based on the “Lottery Scheduling Algorithm” developed by Waldspurger and Weihl (1994). In the lottery scheduling algorithm, each process to be run on a CPU is assigned a number of tickets. At the beginning of each run period, the scheduler picks a ticket randomly and selects the process holding that ticket. Naturally, a process holding more tickets than its rival processes has a greater chance of being scheduled on the CPU. Our data generator benefits from this algorithm for the scatter of data: the area of interest is divided into concentric annular nested areas, as in Figure 2b, and each area is assigned a number of tickets, in other words a scatter parameter, associated with that area. The number of tickets assigned to an area defines the probability of a data point being generated in that area. For a group of sub-intervals, the scatter of the data can be arranged by simply modifying the scatter parameters. The number of scatter parameters (and also the number of nested areas) is specified manually depending on the requirements of the task scenario. Each selected area is defined by an inner and an outer radius, which are calculated from the diameter of the circular area for which the data will be generated.
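To make the ticket-based area selection concrete, the following minimal Python sketch performs the lottery-style draw, assuming the scatter parameters are supplied as a list of ticket counts, one per annular area; the function name is illustrative only:

    import random

    def pick_area(tickets):
        """Select an annular area index by lottery scheduling.

        tickets -- list of ticket counts (scatter parameters), one per
                   concentric annular area, ordered from innermost outward.
        """
        winner = random.randint(1, sum(tickets))
        running = 0
        for index, count in enumerate(tickets):
            running += count
            if winner <= running:
                return index
        return len(tickets) - 1  # defensive fallback

    # Example: the innermost area is five times as likely as each outer one
    print(pick_area([50, 10, 10, 10]))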
Scatter Noise—To generate data that looks more realistic, we considered another noise factor, namely scatter noise. The scatter noise parameter creates distortions in the scatter of the data by randomizing the latitude and longitude values of the origin point within a predefined limited area. The randomized origin is calculated before the generation of each data point around the central point of the region, within a radius selected by the experimenters according to the requirements of the scenarios.
Spatio-Temporal Data Analysis Needs
We conducted interviews with system operators at an LBS company in order to form a practical basis for our experimental study. We found that two spatial analysis cases were of critical importance to their firm: (a) measuring the effectiveness of a service promotion (e.g. analyzing the effects of a promotion for five consecutive days after the advertisement) and (b) identifying local service disruptions (e.g. being able to find a 30-minute-long disruption in a week's worth of data). It was important for them to quantify the effective breadth and duration of a promotion in order to understand the market dynamics. Fast and correct spatial analysis of service disruption cases was also found to be crucial. For example, a local service drop could trigger a false alarm, which may lead to a legal dispute in particular cases. Even though tracing back through the usage logs of the service might be considered a reasonable approach, that process had usually been judged exhaustive and unfruitful.
It was also important for the service providers to locate and identify general usage patterns, customer behavior patterns, and user profiles, since data specifically tailored to some special user groups might bring potential benefits.
Yet another important aspect of the data at hand was the need to correlate it with spatial landmarks (e.g. business districts, tourist sites, shopping areas) and temporal landmarks (e.g. days of the week, holidays, tax time). Such analysis would require the spatio-temporal data to be correlated with other data and would add further challenges. Despite these challenges, such correlations would significantly improve customer segmentation and would lead to more effective services.
Visualization Techniques
In order to represent 2D and 3D techniques in our experimental study, density map and density cube representations were implemented.
Density Map Animation—Density map animation has been implemented as the sequential animation of raster images. First, the density map of the artificial data is created by means of kernel density estimation: we model each geographic location as a radial gradient, a filled circle whose intensity gradually decreases as the distance from the center increases. To calculate the intensity of each pixel, an additive blending technique is used, in which the intensity contributions of geographical points occupying the same pixel are summed. At the beginning of the colorization process, grayscale maps are created by scaling the intensity values of each pixel into the range 0-255. Finally, appropriate color schemes are applied to the gray-scale intensity maps (Willems, Van De Wetering, & Van Wijk, 2009). The still images generated with this method are used as the frames of the animation.
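As a rough illustration of the additive radial-gradient splatting and grayscale scaling described above (a minimal sketch under an assumed grid size and kernel radius, not the authors' implementation):

    import numpy as np

    def density_map(points, width=256, height=256, radius=8):
        """Accumulate radial-gradient 'splats' for each point and scale to 0-255.

        points -- iterable of (x, y) pixel coordinates of geographic events.
        """
        intensity = np.zeros((height, width), dtype=np.float64)
        # Precompute a radial kernel that falls off linearly from the center.
        ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
        kernel = np.clip(1.0 - np.sqrt(xs**2 + ys**2) / radius, 0.0, None)
        for x, y in points:
            x, y = int(x), int(y)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            kx0, ky0 = x0 - (x - radius), y0 - (y - radius)
            # Additive blending: overlapping splats sum up.
            intensity[y0:y1, x0:x1] += kernel[ky0:ky0 + (y1 - y0), kx0:kx0 + (x1 - x0)]
        if intensity.max() > 0:
            intensity = intensity / intensity.max() * 255.0
        return intensity.astype(np.uint8)  # gray-scale frame, ready for colorization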
Density Cube—As a space-time cube variant for spatio-temporal data visualization, the density cube represents the data of interest as a whole on the display, so that the cognitive load on analysts is reduced by obviating the need to remember previous frames in order to track changes. The density cube technique achieves this by utilizing a 3D texture, which is in fact a stacked version of consecutive density maps (Figure 3). The 3D texture is then rendered by a GPU ray-caster to visualize all time slices in a single frame. Similar visualization techniques have been used in the medical imaging field in recent years (Kruger & Westermann, 2003). Rendering the 3D texture conveys the continuity of the data. Since the frames had already been created for the density map animation technique, the density cube required little additional computation. As for interaction, the participant was able to change the orientation of the cube to better analyze the desired portions of the data. The participant could also change the time scale, which would either compress or extend the visualization along the time axis (the y-axis in Figure 3).
Figure 3. Generation of 3D texture by stacking density maps:
Density maps are generated for each time interval with the
data lying in the corresponding interval. Then, these density
maps are stacked according to the order of the time intervals
to generate the 3D texture.
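A minimal sketch of this stacking step, assuming the per-interval density maps are available as equally sized grayscale arrays (for instance, frames produced by a routine like the density map sketch above):

    import numpy as np

    def build_density_cube(frames):
        """Stack consecutive 2D density maps into a 3D texture.

        frames -- list of 2D uint8 arrays, one per time interval, in time order.
                  The stacking axis becomes the time dimension of the cube,
                  which a GPU ray-caster can then render as a volume.
        """
        return np.stack(frames, axis=0)

    # Example with three dummy 256x256 frames
    cube = build_density_cube([np.zeros((256, 256), dtype=np.uint8) for _ in range(3)])
    print(cube.shape)  # (3, 256, 256): time x height x width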
EXPERIMENT
The methodology used to analyze the effectiveness of the visualization techniques has benefited much from Munzner's Nested Model for visualization design and validation (Munzner, 2009). While many studies on information visualization demonstrate “how” they measure the potential usability of their techniques, we considered it useful to specify “which” evaluation method should be employed based on the contribution category of our findings. As Munzner states, the contribution of an information visualization study should be well defined, since each contribution venue requires a different evaluation technique. According to her evaluation model, possible contributions of information visualization studies can be classified as (a) domain problem characterization, (b) operation and data type abstraction, (c) visual encoding and interaction design, and (d) algorithm design. To help readers understand how our study fits into the existing literature, we defined the level of the possible contributions of our study. The density map and density cube techniques have been implemented to help users notice possible trends and outliers in spatio-temporal data. On the one hand, the change in the data is abstracted into particular visual formations, which can be visually analyzed by the users. On the other hand, users are provided with a number of interactions with which they can rotate, manipulate, and scale the visualization to make more sense of the data.
Methodology
Much research has been done to investigate how effectively different classes of visualization techniques aid analysis. While many researchers have suggested that 2D representations should not be the only technique of choice (Ghani, Elmqvist, & Yi, 2012; Robertson et al., 2008), there is a considerable number of studies questioning the potential benefits of 3D representations. We believe that reconsidering the well-known characteristics of 2D (e.g. reasonable level of detail) and 3D (e.g. occlusion) representations in the context of spatio-temporal data visualization might yield unexpected results. In other words, we aim to provide empirical evidence on whether the commonly accepted characteristics of these two classes of representations remain valid in spatio-temporal data visualization. Typical metrics for the performance of a visualization technique are the “time” spent by a user on a given task and the “correctness rate” of the inferences made based on the observed data. From this vantage point, a study aiming to explore the possible advantages of one technique over another can benefit from these metrics.
Experimental Design
To explore the advantages of the visualization techniques, we designed a within-subject experiment with a single independent factor (type of visualization technique: density map vs. density cube). As implied by the design of the experiment, each participant viewed the same scenario twice, each time with a different visualization technique. The performance of the participants was measured in terms of time and correctness rate.
We conducted interviews with three data analysts at a location-based services company in order to formulate our experimental scenarios and tasks. To better simulate the cases that might exist in real data, five scenarios were created. As can be seen in Figure 4, the experiment comprised two sections, each of which included five tasks and was dedicated to one of the two visualization techniques, namely density map or density cube. According to our experimental design, participants were expected to finish both sections, which include tasks generated from the same scenarios; this could have caused a learning effect. To prevent possible learning effects, each scenario was realized with two datasets sharing the same data formation but generated for different locations and sampling frequencies. For example, if scenario 1 emphasizes an increasing trend in the data, the two datasets (dataset 1A and 1B) generated for this scenario must have the same trend, but they should be generated for different locations and should span different periods of time.
Figure 4. Scenarios, tasks, and datasets: Task sequence for
experiment Group 1. Sequences for other groups can be
derived from Figure 7.
The order of the visualization techniques viewed by the participants was important to prevent bias in the collected data. Furthermore, the dataset-visualization technique combinations had to be counter-balanced such that each combination was viewed by the participants with the same frequency. To address these problems, we created four participant groups. As shown in Figure 7, the dataset-visualization technique combinations and their ordering were fully stratified. For example, participants in Group 1 viewed the scenarios realized with Datasets A and visualized with the density map technique in the first half of the experiment, while in the second half they viewed the same scenarios realized with Datasets B and visualized with the density cube technique. Similarly, participants in Group 2 viewed the same sequence except for the order of the datasets.
Figure 7. Experiment conduct flow
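A compact way to express this counterbalancing in code is sketched below; the pairings assumed for Groups 3 and 4 are our reading of Figure 7 and are shown purely for illustration:

    # Each tuple is (first-half dataset, first-half technique); the second half
    # uses the other dataset with the other technique.
    GROUPS = {
        1: ("A", "density map"),
        2: ("B", "density map"),
        3: ("A", "density cube"),   # assumed pairing for Groups 3 and 4
        4: ("B", "density cube"),
    }

    def task_sequence(group, scenarios=(1, 2, 3, 4, 5)):
        """Return the ordered list of (scenario, dataset, technique) tasks."""
        first_ds, first_tech = GROUPS[group]
        second_ds = "B" if first_ds == "A" else "A"
        second_tech = "density cube" if first_tech == "density map" else "density map"
        first_half = [(s, f"{s}{first_ds}", first_tech) for s in scenarios]
        second_half = [(s, f"{s}{second_ds}", second_tech) for s in scenarios]
        return first_half + second_half

    print(task_sequence(1))  # Group 1: 1A-5A with density map, then 1B-5B with density cube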
Scenarios—We believed that the evaluation of visualization techniques should benefit from the formations (e.g. trends) that could exist in real-world data. Such formations informed the generation of the scenarios used during the experiments. After examining the real data and the results derived from the interviews, we established five scenarios whose characteristics were, we believe, rich and diverse enough to help us investigate how and in which ways one visualization technique would outperform the other.
In the first scenario, as depicted in Figure 5(a), the usage of the application in a city is demonstrated. By employing this scenario we aimed to create an evaluation setting in which users were expected to locate different levels of increase in the usage of the LBS service. The first scenario dictated a data generation process with a moderate increase followed by an exponential one.
Figure 5. Scenarios and accompanying trends: Five scenarios created for the experiment. In the left column of the figure, the general spatial distribution of the data is shown. In the right column, noise-free trends of the application usage rate as a function of time are depicted. The actual data represented in the experimental tool includes some level of noise in order to complicate the tasks and to imitate real-world data.
A typical workday in a city’s downtown and suburban areas was simulated in the second scenario. Moreover, a service failure in the downtown area between 12-3pm was injected into the scenario design. As can be seen in Figure 5(b), a steep increase in the usage rate between 3-9pm reflects a common case where the application is used more frequently at the end of a work day. In the suburban area, a sudden increase in the usage rate at around 3pm is shown.
The third scenario, similar to the second scenario, demonstrates a usage rate pattern in a city’s downtown and suburban areas (Figure 5(c)).
However, in this scenario a weekend day has been simulated where the usage rate in downtown increased consistently during the day. A slight
increase followed by a decrease in the suburban area has also been added to the design.
The commute of farmers from their villages to farm fields in rural areas was simulated in the fourth scenario (Figure 5(d)). In this scenario, a decrease in the usage rate in all villages early in the day, followed by an increase in the afternoon, was demonstrated. The usage rate peak around 1pm in the farm field was generated to reflect the idea that farmers could use the application more often while working in the field.
Finally, in the last scenario (Figure 5(e)), participants were expected to infer a periodic trend in the usage rate in a city of moderate size. A typical workday usage rate trend was repeated three times, with random noise added to the slope of the trend.
Datasets—To demonstrate the scenarios in the experimental tool, we generated datasets with the data generator explained in the Domain Problem Specification section. Our design dictates that participants view each scenario twice, but with different combinations of visualization technique and dataset (Figure 4). To prevent possible learning effects, two separate datasets, hereinafter twin datasets, were generated for each scenario. The twin datasets share the same data formation specified by the corresponding scenario. However, they differ in the noise they include, in the location and time where and when the scenario takes place, and in the length of the time period the scenario spans. For example, twin datasets 1A and 1B have identical trends, as specified by Scenario 1; however, the scenario takes place in the cities of Gonen and Tomarza in October 2011 and November 2012 in datasets 1A and 1B, respectively.
Tasks and Groups—As illustrated in Figure 4, each run of the experiment comprised 10 tasks, each of which is a combination of a scenario, an accompanying dataset, and the technique used to visualize the dataset. The order of the scenarios was specified as shown in Figure 4, and this order was never modified throughout the experiments. However, the dataset and the accompanying visualization technique used for a given task were determined according to the experiment groups (Figure 7), which were created to prevent learning effects. For example, participants in Group 1 viewed tasks 1 through 5, which were created with datasets 1A through 5A and visualized with the density map technique, while they viewed tasks 6 through 10, generated with datasets 1B through 5B, visualized with the density cube technique. All possible dataset and technique orderings were stratified among the four experiment groups.
Questions—In each task, users were expected to answer four or five multiple-choice questions. Please see Figure 6 for the number of questions
asked in each task. We asked three types of questions to our participants during the experiment:
Figure 6. The number of questions asked per task is
demonstrated. Participants were expected to answer four or
five questions in total for each task: Two MinMax questions,
one Comparison question, and finally one or two questions
related to Trend analysis.
MinMax, finding the minimum and maximum usage rate in the data set, e.g.:
Q) In which of the following time periods was the service used most?
Comparison, comparing the usage rates at different points in time, e.g.:
Q) Which of the following is correct about the usage rates shown to you?
a) The usage rate on July 13th, 2011 is higher than the one on July 18th, 2011.
b) The usage rate on July 18th, 2011 is higher than the one on July 13th, 2011.
c) The usage rates on July 18th, 2011 and July 13th, 2011 are almost at the same level.
Trend, identifying how the usage rate changes over time, e.g.:
Q) Which of the following is correct about the variation of the usage rate around the city shown to you?
a) The usage rate increases moderately during the first half of the period. During the second half of the period, the usage rate increases much faster than it does in the first half.
b) The usage rate decreases moderately during the first half of the period. During the second half of the period, the usage rate increases saliently.
c) The usage rate decreases moderately during the first half of the period. During the second half of the period, the usage rate remains at an almost constant level.
d) The usage rate increases moderately with a constant acceleration during the whole period.
In total, the participants answered 46 multiple-choice questions (23 for each of density map and density cube techniques) during the experiment.
Procedure
The experimental procedure is shown in Figure 7. At the beginning of the experiment, participants were briefly informed about the experiment and then took a visual capability test. Upon completion of the test, participants were randomly assigned to one of the experiment groups. Depending on the experiment group, the first section of the experiment started with an instruction part for either the density map or the density cube technique. In the instruction part, each participant received about 10 minutes of training with our tool so that they had an adequate amount of time to explore a small example dataset. Every question asked by the participants was answered by the experimenter to ensure that they understood the functioning of the tool and gained a sense of how the data should be interpreted. Following the instruction part, participants were shown the first five tasks in the sequence explained in the previous subsection. For example (Figure 7), a participant in Group 2 received training on the density map technique and completed five tasks generated with Datasets B and visualized with the density map technique. The second section of the experiment was completed with the same sequence of steps.
Measures
Our measures included both objective and subjective data. We defined the descriptive capability of each technique in terms of the time taken to solve the questions and a correctness value indicating how accurately the participants answered the questions. The time measure was collected for each task separately. The correctness value, collected both per task and in total, was normalized to range between 0 and 100.
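For instance, the normalization could be as simple as the following (a sketch; equal weighting of the questions is an assumption):

    def correctness_score(correct_answers, total_questions):
        """Normalize the number of correct answers to a 0-100 scale."""
        return 100.0 * correct_answers / total_questions

    # Example: 18 correct answers out of the 23 questions asked per technique
    print(round(correctness_score(18, 23), 1))  # 78.3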
Participants
14 participants (4 female), graduate and undergraduate students at a university in Turkey, volunteered for the experiment. However, one case was discarded due to the participant's color vision deficiency (CVD) (Hardy, Rand, & Rittler, 1945).
Results
Analysis of the collected data was conducted in three steps: (1) effects of the 3D visual capability baseline, (2) analysis of task completion time, and finally (3) analysis of the correctness rate. In line with the question types discussed in the previous subsection, task completion time and correctness rate were evaluated separately for minimum-maximum, comparison, and trend questions.
Effects of 3D Visual Capability Baseline
We performed general linear model (GLM) repeated measures analyses in order to analyze the variance of the correctness and time measures with 3D visual capability as a covariate. Our analysis showed that visual capability did not have a significant effect on the time measure, F(1,6) = 1.82, p < .23, while it was marginally effective on the correctness measure, F(1,6) = 4.652, p < .1, suggesting an opportunity for further investigation with visual capability as a factor of interest.
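As a rough Python analogue of this covariate check, a mixed-effects model with a random intercept per participant could be fit as sketched below; this approximates rather than reproduces the authors' GLM repeated-measures procedure, and the file name and column names are assumptions:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Long-format data: one row per participant x technique, with the
    # participant's 3D visual capability score as a covariate (assumed layout).
    df = pd.read_csv("experiment_long.csv")

    # The random intercept per participant stands in for the repeated-measures design.
    model = smf.mixedlm("correctness ~ technique + visual_capability",
                        df, groups=df["participant"])
    result = model.fit()
    print(result.summary())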
Analysis of Task Completion Time
Task completion time performance of our participants was evaluated separately for minimum-maximum, comparison and trend questions. For
each question type, analysis of each scenario (3 x 5 paired t-tests) and total time spent across all scenarios (3 x 1 paired t-tests) was conducted.
The comparison of the total time periods spent on each technique (for each question type) showed that neither technique assisted participants in
solving the tasks in less time than the other technique did as demonstrated in Table 1. However, a notable exception occurred for the minimum-
maximum questions of the second scenario, t(12) = 1.862, p < .1, where the density cube technique (M = 18.95, SD = 14.85) outperformed the
density map technique (M = 35.58, SD = 31.2) (Figure 8a). Another exceptional case existed for the comparison questions of the fifth scenario,
t(12) = -1.861, p < .1, where the density map technique (M = 25.72, SD = 11.07) aided users to better compare the usage rates than the density
cube technique did (M = 37.97, SD = 21.33) (Figure 8b).
Table 1. Statistical results for analysis of overall time measure for each question type: No significant difference could be found between the task
completion time measures of the two visualization techniques (all ps > .1) (Means of task completion times are in seconds.).
Trend: Density Map M = 53.3, SD = 18.5; Density Cube M = 38.5, SD = 31.9; t(12) = 1.283, p > .1
Minimum-Maximum: Density Map M = 38.0, SD = 17.6; Density Cube M = 37.6, SD = 18.6; t(12) = .054, p > .1
Comparison: Density Map M = 21.1, SD = 18.6; Density Cube M = 23.0, SD = 22.8; t(12) = -.276, p > .1
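The per-question-type comparisons reported above (and in Table 2 below) are paired t-tests; in Python they could be reproduced along these lines, where the synthetic numbers are placeholders generated from the reported means and standard deviations rather than the actual experiment data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Placeholder per-participant mean completion times (seconds), paired by
    # participant; in the real analysis these come from the experiment logs.
    density_map_times = rng.normal(53.3, 18.5, size=13)
    density_cube_times = rng.normal(38.5, 31.9, size=13)

    t_stat, p_value = stats.ttest_rel(density_map_times, density_cube_times)
    print(f"t({len(density_map_times) - 1}) = {t_stat:.3f}, p = {p_value:.3f}")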
Figure 8. Exceptional cases in the analysis of task completion time: (a) Task completion time analysis of minimum-maximum questions of the second scenario: Participants answered the questions in the minimum-maximum category faster when using the density cube visualization. (b) Task completion time analysis of comparison questions of the fifth scenario: Participants answered the questions in the comparison category faster when using the density map visualization.
Analysis of Correctness Measure
Analysis of the correctness measure revealed significant findings about how the visualization technique was predictive of our participants’ success. For each question type (trend, minimum-maximum, and comparison), the overall success rate of our participants was analyzed. According to the results of paired t-tests applied to the correctness measures, participants were able to answer both trend (t(12) = 2.215, p < .05) (Figure 9) and comparison (t(12) = 3.482, p < .01) (Figure 10) questions more correctly when they viewed the data with the density map technique. The statistical results are presented in Table 2. However, there was no significant difference between the correctness measures of the two visualization techniques for the minimum-maximum question type (t(12) = 1.443, p > .1).
Figure 9. Comparison of density map and density cube
techniques according to overall correctness rate in trend
questions: Participants were able to answer trend questions
more accurately when they viewed the data with density map
technique.
Figure 10. Comparison of density map and density cube
techniques according to overall correctness rate in
comparison questions: Participants were able to answer
comparison questions more accurately when they viewed the
data with density map technique.
Table 2. Statistical results for the analysis of the overall correctness measure for each question type: A value of zero means that all users answered the questions incorrectly, whereas a value of 100 means the opposite. Participants answered the questions more accurately when they viewed the data with the density map animation technique for the trend and comparison question types. Note that the p values for trend and comparison questions are significant.
Trend: Density Map M = 74.62, SD = 12.66; Density Cube M = 59.23, SD = 21.78; t(12) = 2.215, p* < .05
Minimum-Maximum: Density Map M = 70.00, SD = 19.15; Density Cube M = 58.46, SD = 18.64; t(12) = 1.443, p > .1
Comparison: Density Map M = 80.00, SD = 20.00; Density Cube M = 58.46, SD = 20.75; t(12) = 3.482, p** < .01
DISCUSSION
Listing the benefits and drawbacks of 2D and 3D representations for every kind of data and task is beyond the scope of this work. We believe that there exists enough evidence to argue that 3D representations are more capable in the visualization of spatio-temporal data than the existing literature generally accepts them to be. That is to say, some common and well-known drawbacks of 3D visualizations do not seem to be significantly present when spatial and temporal data are to be analyzed. Moreover, 3D representations of spatial and temporal activities can help users perform significantly better in particular tasks for which 2D representations are accepted as the de facto leading option (Kjellin et al., 2010).
The analytical use cases that can occur in spatio-temporal data are inexhaustible in terms of the possible formations (i.e. the patterns lying under the data) and the tasks to be performed. The setup of the visualization and the role of its users are other factors contributing to this variety. In our experiment, we were able to cover only five use cases, those found by the field experts to be the best representatives of LBS data analysis. As suggested by Andrienko and Andrienko (2006), the tasks remain analogous across various visualizations; hence their task typology is divided into two main groups, “elementary” and “synoptic” tasks. Some tasks may require a holistic view of the data, while others might lead users to base their analysis on a particular point in the visualization. According to this typology, our minimum-maximum, comparison, and trend questions correspond to the inverse comparison, direct comparison, and pattern identification task groups, respectively. This correspondence forms a basis for our discussion of whether and why 2D and 3D representations fail for particular tasks in spatio-temporal data visualization compared to the visualizations of other data types.
While analyzing the data represented in the second scenario, users observed the whole dataset as a set of 3D objects in our experimental tool, enabling them to reach decisions more quickly than they did with the density map animation. This is most likely because users tend to model the data in their mind as a whole rather than interpreting it in smaller chunks, leading them to browse through all the data as quickly as possible (Plumlee & Ware, 2006). Similar findings have also been reported in previous work (Hicks et al., 2003; Robertson et al., 2008). Hicks et al. (2003) claim that the “computational offload” that occurs during holistic observation of the data helps users complete undirected comparison tasks in less time. Earlier works of Tversky et al. (2002) and Baudisch et al. (2006) imply that a static representation of motion may be more effective than animation. Robertson et al. (2008) state that “static depictions of trends appear to be more effective” (p. 1332), supporting our findings about the density cube visualization, which is in fact a static representation of all the data for a given time period and region. In light of this previous work and our findings, the holistic overview capability of 3D representations suggested in these studies seems to be valid also for spatio-temporal data visualization.
In their experimental study, Hicks et al. (2003) report various performance levels of 2D graph, 3D plot, and 3D helix visualizations of an e-mail usage log, which they classify as temporal data. They conducted a laboratory experiment to compare the performances of these visualizations on three groups of tasks, namely information retrieval, directed comparisons, and undirected comparisons, which can be classified as information look-up, direct comparison, and inverse comparison, respectively, according to the typology of Andrienko and Andrienko (2006). They reported that 3D representations performed poorly in terms of overall task completion time compared to the 2D graph. In our experiment, however, we were not able to observe any significant superiority of the 2D representation over 3D in terms of overall task completion time. The scenario-wise task completion times of the two representations were also not significantly different. The fact that the 2D representation was not able to outperform 3D could potentially be due to the higher dimensionality of our spatio-temporal data compared to the 2D temporal data used by Hicks et al. (2003). Our participants probably spent more effort while analyzing with the density map technique because they had to navigate to the other timestamps of the animation; in the study of Hicks et al. (2003), on the other hand, participants were able to view the 2D temporal data without such navigation. The density map animation technique leads analysts to focus on a representation of the data for a single moment in time, creating the need to traverse back and forth along the time dimension of the visualization tool during the analysis process, which was also reported by Robertson et al. (2008). Kjellin et al. (2010) also reported that 2D visualization of spatio-temporal data led to better analysis performance, particularly when the tasks required detecting structures in the visualization that were invariant to Euclidean and similarity transformations. Nevertheless, we could rarely observe significant superiority of the 2D representation in terms of either time or accuracy in our experiment.
As Hicks et al. (2003) reported, a 3D plot of temporal data facilitates convenience and accuracy in undirected comparison tasks. We did not observe similar results in our experiment, however, where our participants were able to make more accurate inferences with the density map technique (our 2D implementation). In particular, they answered more accurately while analyzing the fourth scenario, where the data were more cluttered compared to the other scenarios. As seen in Figure 11, locating minimum or maximum usage rates (an inverse comparison task) with the 3D representation was challenging due to occlusion, as reported by both our participants and the experiment results. On the other hand, the density map technique, as our 2D representation, allowed users to delve into the details of the data and draw more accurate conclusions. The occlusion problem inherent in 3D visualizations is more apparent in the visualization of spatio-temporal data, possibly due to the increased dimensionality compared to temporal data.
Figure 11. a) A frame from density cube visualization: The
problematic occlusion in Scenario 4 is depicted. b) The
density map animation of the same scenario: Occluded
regions are viewed more clearly.
Another difficulty with 3D visualization is the complexity of making accurate size estimates due to perspective foreshortening (Munzner, 2008; Tory et al., 2006). Heights and widths at different distances from the user complicate the comparison of patterns (categorized as behavior comparison by Andrienko and Andrienko (2006)), as suggested by Ware (2012). Our participants were able to locate trends (named behavior identification by Andrienko and Andrienko (2006)) significantly more accurately with the 2D representation of the location-based services data. This result is in line with those of Andrienko and Andrienko (2006), Tory et al. (2006), and Ware (2012), suggesting that the perspective projection effect in 3D representations is also a deleterious effect in the visualization of spatio-temporal data.
Subtle regular patterns (e.g. the periodically repeating formation in our fifth scenario and the information retrieval tasks of Hicks et al. (2003)) might yield invaluable inferences of prominent importance to analysts. As suggested by our results and those of Hicks et al. (2003), 3D techniques might better unveil patterns that change subtly over time. However, the choice of 3D technique is important, since not all kinds of 3D representations can aid the detection of regular patterns without introducing occlusion.
As discussed above, inferences previously made about the visualization techniques seem to fail for the visualization of spatio-temporal data. Given the intrinsic properties of spatio-temporal data and the findings of our experiment, its analysis requires a reconsideration of the widely accepted properties of 2D and 3D representations. Nevertheless, more research is needed to investigate how, and for which tasks, 2D and 3D representations affect the visualization of spatio-temporal data.
CONCLUSION
Visualization of spatio-temporal data remains a daunting problem, not only because of the variability of the cases that can be observed, but also because particular cases usually require their own intrinsic approaches. Moreover, well-known aspects of 2D and 3D representations in the visualization of other kinds of data might not directly apply when the data at hand is spatio-temporal. Given the complexity of the visualization context space, comprising the task type, data type, and user role, it is almost infeasible to cover the entire research space in order to conclude that 2D representations should be the first option to be considered. We conducted an experiment in which we presented our participants with five scenarios defined by field experts, each of which was visualized with both the density map and density cube techniques, relatively simple examples of 2D and 3D representations. Based on the findings of our experimental study, along with the findings of previous work (Hicks et al., 2003; Kjellin et al., 2010), we claim that there is enough evidence that 2D representations are not significantly better than 3D representations for analyzing spatio-temporal data.
This work was previously published in GeoIntelligence and Visualization through Big Data Trends edited by Burçin Bozkaya and Vivek Kumar Singh, pages 150-180, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Abraham, T., & Roddick, J. F. (1999). Survey of spatio-temporal databases. GeoInformatica , 3(1), 61–99. doi:10.1023/A:1009800916313
Andrienko, G., Andrienko, N., Demsar, U., Dransch, D., Dykes, J., Fabrikant, S. I., & Tominski, C. (2010). Space, time and visual
analytics. International Journal of Geographical Information Science , 24(10), 1577–1600. doi:10.1080/13658816.2010.508043
Andrienko, G., Andrienko, N., Jankowski, P., Keim, D., Kraak, M.-J., MacEachren, A., & Wrobel, S. (2007). Geovisual analytics for spatial
decision support: Setting the research agenda.International Journal of Geographical Information Science , 21(8), 839–857.
doi:10.1080/13658810701349011
Andrienko, N., & Andrienko, G. (2006). Exploratory analysis of spatial and temporal data . Berlin, Germany: Springer.
Andrienko, N., Andrienko, G., & Gatalsky, P. (2003). Exploratory spatio-temporal visualization: An analytical review. Journal of Visual
Languages and Computing , 14(6), 503–541. doi:10.1016/S1045-926X(03)00046-6
Baudisch P. Tan D. Collomb M. Robbins D. Hinckley K. Agrawala M. Ramos G. (2006, October). Phosphor: explaining transitions in the user
interface using afterglow effects. In Proceedings of the 19th annual ACM symposium on User interface software and technology (pp. 169-178).
ACM.10.1145/1166253.1166280
Borkin, M., Gajos, K., Peters, A., Mitsouras, D., Melchionna, S., Rybicki, F., & Pfister, H. (2011). Evaluation of artery visualizations for heart disease diagnosis. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2479–2488.
Boyandin, I., Bertini, E., Bak, P., & Lalanne, D. (2011, June). Flowstrates: An Approach for Visual Exploration of Temporal Origin Destination
Data. Computer Graphics Forum , 30(3), 971–980. doi:10.1111/j.1467-8659.2011.01946.x
Cockburn A. McKenzie B. (2002, April). Evaluating the effectiveness of spatial memory in 2D and 3D physical and virtual environments. In
Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 203-210). ACM.10.1145/503376.503413
Demšar, U., & Virrantaus, K. (2010). Space–time density of trajectories: Exploring spatio-temporal patterns in movement data. International
Journal of Geographical Information Science ,24(10), 1527–1542. doi:10.1080/13658816.2010.511223
Ghani, S., Elmqvist, N., & Yi, J. S. (2012, June). Perception of Animated Node Link Diagrams for Dynamic Graphs. In Computer Graphics
Forum (Vol. 31, No. 3pt3, pp. 1205-1214). Hoboken, NJ: Blackwell Publishing Ltd. doi:10.1111/j.1467-8659.2012.03113.x
Hadlak, S., Tominski, C., Schulz, H. J., & Schumann, H. (2010). Visualization of attributed hierarchical structures in a spatiotemporal
context. International Journal of Geographical Information Science , 24(10), 1497–1513. doi:10.1080/13658816.2010.510840
Hardy, L. H., Rand, G., & Rittler, M. C. (1945). Tests for the detection and analysis of color-blindness. JOSA , 35(4), 268–271.
doi:10.1364/JOSA.35.000268
Hicks, M., O'Malley, C., Nichols, S., & Anderson, B. (2003). Comparison of 2D and 3D representations for visualising telecommunication
usage. Behaviour & Information Technology ,22(3), 185–201. doi:10.1080/0144929031000117080
Kjellin, A., Pettersson, L. W., Seipel, S., & Lind, M. (2010). Evaluating 2D and 3D visualizations of spatiotemporal information. ACM Transactions on Applied Perception, 7(3), 19. doi:10.1145/1773965.1773970
Kraak M. J. (2003, August). The space-time cube revisited from a geovisualization perspective. In Proc. 21st International Cartographic
Conference (pp. 1988-1996).
Kruger J. Westermann R. (2003, October). Acceleration techniques for GPU-based volume rendering. In Proceedings of the 14th IEEE
Visualization 2003 (VIS'03) (p. 38). IEEE Computer Society.
MacEachren, A. M. (2004). How maps work: representation, visualization, and design . New York, NY: Guilford Press.
MacEachren, A. M., & Kraak, M. J. (2001). Research challenges in geovisualization. Cartography and Geographic Information Science , 28(1), 3–
12. doi:10.1559/152304001782173970
Mehler, A., Bao, Y., Li, X., Wang, Y., & Skiena, S. (2006). Spatial analysis of news sources. IEEE Transactions on Visualization and Computer Graphics, 12(5), 765–772.
Munzner, T. (2008). Process and pitfalls in writing information visualization research papers . In Information visualization (pp. 134–153). Berlin,
Germany: Springer. doi:10.1007/978-3-540-70956-5_6
Munzner, T. (2009). A nested model for visualization design and validation. IEEE Transactions on Visualization and Computer Graphics, 15(6), 921–928.
Plumlee, M. D., & Ware, C. (2006). Zooming versus multiple window interfaces: Cognitive costs of visual comparisons. ACM Transactions on
Computer-Human Interaction , 13(2), 179–209. doi:10.1145/1165734.1165736
Rhyne, T. M., MacEachren, A., & Dykes, J. (2006). Guest Editors' Introduction: Exploring Geovisualization. IEEE Computer Graphics and
Applications , 26(4), 20–21. doi:10.1109/MCG.2006.80
Robertson, G., Fernandez, R., Fisher, D., Lee, B., & Stasko, J. (2008). Effectiveness of animation in trend visualization. IEEE Transactions on Visualization and Computer Graphics, 14(6), 1325–1332.
Roddick, J. F., Hornsby, K., & Spiliopoulou, M. (2001). An updated bibliography of temporal, spatial, and spatio-temporal data mining research .
In Temporal, Spatial, and Spatio-Temporal Data Mining (pp. 147–163). Berlin, Germany: Springer. doi:10.1007/3-540-45244-3_12
Scheepens, R., Willems, N., van de Wetering, H., Andrienko, G., Andrienko, N., & van Wijk, J. J. (2011). Composite density maps for multivariate trajectories. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2518–2527.
Shanbhag, P., Rheingans, P., & desJardins, M. (2005, October). Temporal visualization of planning polygons for efficient partitioning of geo-
spatial data. In Information Visualization, 2005. INFOVIS 2005. IEEE Symposium on (pp. 211-218). IEEE. 10.1109/INFOVIS.2005.32
Slocum, T. A. (2009). Thematic cartography and geovisualization. Upper Saddle River, NJ: Prentice Hall.
Tory, M., Kirkpatrick, A. E., Atkins, M. S., & Moller, T. (2006). Visualization task performance with 2D, 3D, and combination displays. IEEE Transactions on Visualization and Computer Graphics, 12(1), 2–13.
Turdukulov, U. D., Kraak, M. J., & Blok, C. A. (2007). Designing a visual environment for exploration of time series of remote sensing data: In
search for convective clouds. Computers & Graphics , 31(3), 370–379. doi:10.1016/j.cag.2007.01.028
Tversky, B., Morrison, J. B., & Betrancourt, M. (2002). Animation: Can it facilitate? International Journal of Human-Computer Studies , 57(4),
247–262. doi:10.1006/ijhc.2002.1017
Waldspurger C. A. Weihl W. E. (1994, November). Lottery scheduling: Flexible proportional-share resource management. In Proceedings of the
1st USENIX conference on Operating Systems Design and Implementation (p. 1). USENIX Association.
Wilkinson, L., & Friendly, M. (2009). The history of the cluster heat map. The American Statistician , 63(2), 179–184. doi:10.1198/tas.2009.0033
Willems, N., Van De Wetering, H., & Van Wijk, J. J. (2009, June). Visualization of vessel movements. Computer Graphics Forum, 28(3), 959–966. doi:10.1111/j.1467-8659.2009.01440.x
Yuan M. (1996, January). Temporal GIS and spatio-temporal modeling. In Proceedings of Third International Conference Workshop on
Integrating GIS and Environment Modeling, Santa Fe, NM.
KEY TERMS AND DEFINITIONS
Data Analysis Tasks: A unit of work that is performed by data analysts during the general process of extracting insights from given data.
Data Visualization: A mapping from data to a set of visual elements, typically with the purpose of conveying information lying under the data.
Density Cube: Three-dimensional visualization of spatio-temporal data created by stacking two-dimensional density map representations of an attribute of the data. Typically, the axis lying along the stacking direction is used to represent the time dimension.
Density Map: Visualization of data with references to 2D spatial aspects, emphasizing the density of a dimension of concern, generated with a density estimation algorithm.
Laboratory Experiment: One of the methods to formally verify a hypothesis regarding a phenomenon under controlled conditions.
Space-Time Cube: The concept of representing two-dimensional spatial data with references to time in a three-dimensional analytical space.
Spatio-Temporal Data: Data having references to time and space, possibly with high dimensionality.
Section 3
Jayalakshmi D. S.
M. S. Ramaiah Institute of Technology, India
R. Srinivasan
S. R. M. University, India
K. G. Srinivasa
M. S. Ramaiah Institute of Technology, India
ABSTRACT
Processing Big Data is a huge challenge for today’s technology. There is a need to find, apply, and analyze new ways of computing to make use of Big Data so as to derive business and scientific value from it. Cloud computing, with its promise of seemingly infinite computing resources, is seen as the solution to this problem. Data-intensive computing on the cloud builds upon already mature parallel and distributed computing technologies such as HPC, grid, and cluster computing. However, handling Big Data in the cloud presents its own challenges. In this chapter, we analyze issues specific to data-intensive cloud computing and provide a study of available solutions in programming models, data distribution and replication, and resource provisioning and scheduling with reference to data-intensive applications in the cloud. Future directions for further research enabling data-intensive applications in the cloud environment are identified.
INTRODUCTION
Massive amounts of data are being generated in scientific, business, social network, healthcare, and government domains. The “Big Data” so generated is typically characterized by the three Vs: Volume, Variety, and Velocity. Big Data comes in large volumes, from a large number of domains, and in different formats. Data can be in structured, semi-structured, or unstructured format, though most Big Data is unstructured; the data sets might also grow in size rapidly. There are many opportunities to utilize and analyze Big Data to derive value for business, scientific, and user-experience applications. These applications need to process data in the range of many terabytes or petabytes and are called data-intensive applications. Consequently, computing systems capable of storing and manipulating massive amounts of data are required, as are related software systems and algorithms to analyze the data so as to derive useful information and knowledge in a timely manner.
In this chapter we present the characteristics of data intensive applications in general and discuss the requirements of data intensive computing systems. Further, we identify the challenges and research issues in implementing data intensive computing systems in a cloud computing environment. Later in the chapter, we also present a study of programming models, data distribution and replication, and resource provisioning and scheduling with reference to data intensive applications in the cloud.
Data Intensive Computing Systems
Data Intensive Computing is defined as “a class of parallel computing applications which use a data parallel approach to processing large volumes of data” (“Data Intensive Computing”, 2012). Such applications devote most of their processing time to I/O and the manipulation of data rather than to computation (Middleton, 2010). According to the National Science Foundation, data intensive computing requires a “fundamentally different set of principles” than other computing approaches. There are several important common characteristics of data intensive computing systems that distinguish them from other forms of computing (Middleton, 2010).
• Data and applications or algorithms are co-located so that data movement is minimized, which is essential to achieving high performance in data intensive computing.
• Programming models that express high-level operations on data, such as data flows, are used; the runtime system transparently controls the scheduling, execution, load balancing, communication, and movement of computation and data across the distributed computing cluster.
Challenges and Research Issues for Data Intensive Computing Systems
Parallel processing using a data-parallel approach is widely accepted as the way to architect data intensive applications. Many different system architectures, such as parallel and distributed relational database management systems, have been implemented for data intensive applications and big data analytics. However, these assume that data is in structured form, whereas most big data is in unstructured or semi-structured form.
Typical data intensive applications include scientific applications handling large amounts of geo-distributed data, for which grid architectures have been used extensively; hence, loosely coupled distributed systems with message passing are preferred over typical, tightly coupled HPC systems. The challenge is to architect and implement applications that can scale to handle voluminous and geo-distributed data in different forms in a reliable manner, and in some applications in real time.
Cloud computing systems with their promise of seemingly infinite, elastic resources lend themselves to these requirements and hence data-
intensive cloud applications are the focus of current research. Building data-intensive applications in cloud computing environments is different
due to the levels of scale, reliability, and performance.
• The first challenge is the type of data management solution. Data-intensive applications may be built upon conventional frameworks, such as shared-nothing database management systems (DBMSs), or new frameworks, such as MapReduce (Dean & Ghemawat, 2004), and so have very different resource requirements.
• New programming models are required to express data intensive computations in the cloud and to enable fast and timely execution.
• Since large-scale data-intensive applications use a data-parallel approach on possibly geo-distributed data, scheduling and resource allocation should be done so as to avoid data transfer bottlenecks.
The main research issues can be classified as platform- and programming-model-centric issues, data-centric issues, and communication-centric issues (Shamsi, Khojaye, & Qasmi, 2013). Much research work is being carried out to address these issues, such as implementing efficient algorithms and techniques for storing, managing, retrieving, and analyzing data, dissemination of information, placement of replicas, data locality, and data retrieval. The remainder of this chapter provides a study of recent research addressing these issues.
PROGRAMMING ABSTRACTIONS FOR DATA INTENSIVE APPLICATIONS
There are a number of programming paradigms and platforms available, each addressing different application characteristics. MapReduce is extensively used to implement large-scale data processing in distributed systems. However, it restricts the way processing can be expressed by forcing computations to be written as a map-reduce pair. MapReduce has further limitations: the data set has to be staged in local storage before processing, intermediate data has to be materialized in local files, and the model is best suited for batch processing of data (Sakr, Liu, Batista, & Alomari, 2011).
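To make the map-reduce restriction concrete, the sketch below expresses a word count as nothing but a map function that emits (word, 1) pairs and a reduce function that sums them, with a shuffle step grouping intermediate pairs by key. This is a framework-free Python illustration of the programming model, not Hadoop or Google MapReduce code.

from collections import defaultdict

# Minimal illustration of the MapReduce restriction: the computation must be
# expressed as a map phase followed by a reduce phase. Framework-free sketch.

def map_phase(record):
    # Emit (key, value) pairs; here (word, 1) for a word count.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Combine all intermediate values emitted for one key.
    return key, sum(values)

def run_mapreduce(records):
    groups = defaultdict(list)            # "shuffle": group intermediate pairs by key
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

if __name__ == "__main__":
    docs = ["Big Data on the cloud", "data intensive cloud computing"]
    print(run_mapreduce(docs))            # {'big': 1, 'data': 2, 'on': 1, ...}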
MapReduce faces challenges in handling Big Data with respect to data storage (relational databases and NoSQL stores), Big Data analytics (machine learning and interactive analytics), online processing, and security and privacy (Grolinger et al., 2014). There have been many adaptations of Apache Hadoop, the popular implementation of the basic MapReduce model, including hybrid implementations using SQL-like constructs on Hadoop and extensions that adapt Hadoop to process structured, graph, and streaming data as well as perform iterative, interactive, and in-memory computations (refer to Table 1).
Table 1. Programming platforms for different data-intensive application types

MapReduce-based models: Google MapReduce, Hadoop
Streaming: STORM
Interactive: Google Dremel/BigQuery, Apache Drill
Iterative: HaLoop
Graph-based: Pregel
Structured: Hive, Tenzing
Relational: Map-Reduce-Merge
In-memory: Piccolo
A large amount of computing power and resources is available to end users by means of cloud computing, but they may not have the expertise to harness these effectively. They may not be able to express their computing workloads in a manner that is intuitive in their subject domains, nor be able to optimize resource usage. End users need to be provided with high-level programming abstractions that allow them to easily express their data intensive workloads and that execute those workloads efficiently. Hence the challenge is to find programming abstractions that allow different types of data intensive computations on a common platform and allow application developers to express application needs without concern for system management aspects. We now discuss some of the programming abstractions available to handle this issue (refer to Table 2).
Table 2. Programming models for data-intensive cloud applications
All-Pairs: An Abstraction for Data-Intensive Cloud Computing
All-Pairs (Moretti, Bulosan, Thain, & Flynn, 2008) is a programming abstraction that fits the needs of several data-intensive scientific applications such as biometrics and data mining. The objective is to provide abstractions that allow non-expert users to express large, data-intensive workloads so that resources are used effectively.
All-Pairs(set A, set B, function F) returns a matrix M: it compares all elements of set A to all elements of set B via function F, yielding matrix M such that M[i,j] = F(A[i], B[j]). In All-Pairs the workflow is modeled so that execution can be predicted from grid and workload parameters such as the number of hosts, and a spanning tree is used to distribute data to the compute nodes. Based on the model, batch jobs are structured to provide good results and sent to the compute nodes. Once the batch jobs have completed, the results are collected into a canonical form for the end user, and the scratch data left on the compute nodes is deleted. The workload's data requirement is served by implementing demand paging similar to a traditional cluster, and by using active storage, wherein data is pre-staged to local file systems on each compute node.
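The abstraction itself is compact enough to sketch directly. The following Python fragment implements the serial semantics of All-Pairs (M[i,j] = F(A[i], B[j])) for illustration only; the actual All-Pairs engine partitions this work across compute nodes, pre-stages the data, and collects the results, as described above. The similarity function and inputs are invented toy data.

# Serial sketch of the All-Pairs semantics: M[i][j] = F(A[i], B[j]).
# The real All-Pairs system distributes this work across a cluster.

def all_pairs(A, B, F):
    return [[F(a, b) for b in B] for a in A]

if __name__ == "__main__":
    def similarity(x, y):
        # Toy stand-in for a biometric or data-mining comparison function.
        matches = sum(1 for cx, cy in zip(x, y) if cx == cy)
        return matches / max(len(x), len(y))

    A = ["GATTACA", "GATTCCA", "TATTACA"]
    B = ["GATTACA", "CATTACA"]
    for row in all_pairs(A, B, similarity):
        print(["%.2f" % value for value in row])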
Problems in science and engineering can be expressed as variations of the All-Pairs problem. However, All-Pairs is not a universal abstraction, and it may be possible to express only a few specific applications with it. It also has serious drawbacks in handling large data sets that do not fit into a single compute node. The model also makes assumptions about the homogeneity and availability of compute nodes which might not always hold true in a distributed environment. In addition, the All-Pairs state engine needs to use local state to ensure that it can complete the job within the local state limits, as well as clean up the scratch data after the completion of the job.
Meandre Data-Intensive Application Infrastructure
Data-driven execution revolves around the idea of applying transformation operations to a flow or stream of data as it becomes available. Meandre (Ács et al., 2011) provides a semantic-driven data flow execution infrastructure to construct, assemble, and execute components and flows. Whereas MapReduce requires processes to be expressed as directed acyclic graphs, flows in Meandre are aggregations of basic computational tasks in a directed graph, cyclic or acyclic.
Meandre provides an application infrastructure including tools for creating components and flows, a high-level language to describe flows, and a multi-core and distributed execution environment based on a service-oriented architecture. The programming paradigm creates complex tasks by linking together specialized components. Meandre's publishing mechanism allows components developed by third parties to be assembled into a new flow.
There are two ways to develop flows: Meandre's Workbench visual programming tool and Meandre's ZigZag scripting language. Modeled on Python, ZigZag is a simple declarative language for describing data-intensive flows using directed graphs. Command-line tools allow ZigZag files to be compiled and executed. A compiler is provided to transform a ZigZag program (.zz) into a Meandre archive unit (.mau), which can then be executed by a Meandre engine. The Meandre server can mutate transparently from standalone to clustered mode without any extra effort, providing scalability. It also uses virtualization techniques for rapid deployment in the cloud.
Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids
Large-scale distributed systems typically use a custom language for describing applications, so the applications are not portable across systems. Makeflow, based on the Unix Make utility, provides a language for expressing highly parallel data intensive applications (Albrecht, Donnelly, Bui, & Thain, 2012). Makeflow applications are portable and run without any modification on grids, HPC clusters, and storage clouds. Makeflow scales in data size, in time, across systems, and across levels of user expertise.
Blue: A Unified Programming Model for Diverse Data-Intensive Cloud Computing Paradigms
Cluster manager frameworks such as Apache YARN (Vavilapalli, 2013) and Mesos (Hindman et al., 2011) introduce a resource management layer on which different application paradigms can be implemented. A unified programming framework called Blue (Varshney, 2013) introduces an intermediate layer between the resource management layer and the application layer and provides an abstraction for distributing computation to cluster applications. It supports different computational paradigms such as batch processing, streaming, iterative, graph-based, structured, and in-memory computing.
The Blue programming framework works under the assumptions of decomposability, determinism, and parallelism. A cluster program is modeled as a collection of interconnected “tasks”. For example, a MapReduce program comprises four tasks: reader, mapper, reducer, and writer. The tasks communicate by sending data over unidirectional links, so the program can be viewed as a directed graph. The links are represented as an output queue and an input queue at the source and destination processes, respectively.
The Blue model can be implemented by analyzing this graph and scheduling resources so as to exploit data and network locality. At runtime, the tasks are launched as processes that read data from input queues and write data to output queues as discrete records. Multiple processes are launched on different machines for a given task to provide parallelism. However, though program developers can control some high-level properties, they do not control how data from output queues is assigned to input queues.
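To make the task-and-queue structure concrete, the sketch below wires four tasks together with simple in-memory queues and pushes records through them in order. It is a hypothetical, single-machine illustration of the general model described above, not the Blue framework's API; the Task class and run_pipeline helper are invented for the example.

from collections import deque

# Hypothetical single-machine sketch of the "tasks connected by unidirectional
# queues" model (reader -> mapper -> reducer -> writer). Not the Blue API.

class Task:
    def __init__(self, fn):
        self.fn = fn                 # maps one input record to zero or more output records
        self.output_queue = deque()  # unidirectional link toward the next task

    def process(self, record):
        for out in self.fn(record):
            self.output_queue.append(out)

def run_pipeline(records, tasks):
    # Push records through the tasks in order, draining each output queue in turn.
    stream = list(records)
    for task in tasks:
        for record in stream:
            task.process(record)
        stream = list(task.output_queue)   # becomes the next task's input queue
    return stream

if __name__ == "__main__":
    reader  = Task(lambda line: [int(x) for x in line.split(",")])
    mapper  = Task(lambda n: [n * n])               # square each value
    total   = []
    reducer = Task(lambda n: total.append(n) or []) # accumulate; emit nothing
    writer  = Task(lambda rec: [])                  # a real writer would persist output

    run_pipeline(["1,2,3", "4,5"], [reader, mapper, reducer, writer])
    print(sum(total))                               # 1 + 4 + 9 + 16 + 25 = 55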
Blue supports in-memory caching of data, either explicitly by the programmer or opportunistically by the system, thereby improving the latency and throughput of interactive and iterative programs. Finally, the Blue model provides simple and consistent semantics for fault tolerance of acyclic as well as cyclic dependency graphs.
The Blue framework is targeted at data-intensive computational problems and is not best suited for task parallelism. It is also not suited for query processing with bounded response delays, or for message-queue-based architectures. Processes cannot reopen or rewind their input queues, and for algorithms with cyclic dependencies, processes cannot persist across iterations.
epiC: An Extensible and Scalable System for Processing Big Data
The MapReduce programming model manages unstructured data such as plain text effectively, but it is inconvenient and inefficient for processing structured and graph data and for iterative computation. Systems like Dryad and Pregel were built to process those kinds of analytical tasks.
The problem of data variety can be handled by a hybrid system. In a hybrid system whose dataset consists of sub-datasets of different formats, the different data types can be stored in a variety of systems chosen by type; for example, structured data can be stored in a database and unstructured data in Hadoop. The data can then be processed by splitting the entire job into sub-jobs executed on the appropriate systems based on the data types. The final result is obtained by aggregating the results of the sub-jobs, loaded into a single system, either Hadoop or a database, with appropriate data conversions.
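The sketch below illustrates the hybrid idea in miniature: records are routed to per-format sub-jobs and the partial results are then merged. It is a generic, hypothetical illustration (the dispatch_subjobs and handler names are invented), not code from epiC or any of the systems named above.

# Generic sketch of the hybrid approach: split a mixed dataset into sub-jobs
# by data type, run each on the "system" suited to it, then merge the results.
# Hypothetical illustration only; not epiC, Hadoop, or Pregel code.

def structured_subjob(rows):
    # Stand-in for a database/SQL-style sub-job: count rows per category.
    counts = {}
    for row in rows:
        counts[row["category"]] = counts.get(row["category"], 0) + 1
    return counts

def unstructured_subjob(texts):
    # Stand-in for a Hadoop-style sub-job: count words across documents.
    return {"words": sum(len(t.split()) for t in texts)}

def dispatch_subjobs(dataset):
    # Route each record to the sub-job matching its format, then merge results.
    rows  = [r["payload"] for r in dataset if r["format"] == "structured"]
    texts = [r["payload"] for r in dataset if r["format"] == "unstructured"]
    merged = {}
    merged.update(structured_subjob(rows))
    merged.update(unstructured_subjob(texts))
    return merged

if __name__ == "__main__":
    data = [
        {"format": "structured",   "payload": {"category": "sales"}},
        {"format": "structured",   "payload": {"category": "sales"}},
        {"format": "unstructured", "payload": "big data in the cloud"},
    ]
    print(dispatch_subjobs(data))   # {'sales': 2, 'words': 5}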
The complexity in such a hybrid approach lies in maintaining several clusters (a Hadoop cluster, a Pregel cluster, a database cluster, etc.) and in the overhead of frequent data formatting and data loading when merging the outputs of sub-jobs during processing. However, the different systems (Hadoop, Dryad, databases, Pregel) designed for different types of data all share the same shared-nothing architecture and decompose the whole computation into independent computations for parallelization. Therefore, epiC (Jiang, Chen, Ooi, Tan, & Wu, 2014) proposes an architecture that decouples the computation and communication patterns and enables users to process multi-structured datasets in a single system. It provides a common runtime for running independent computations, with plug-ins for implementing specific communication patterns.
epiC introduces a general “Actor-like concurrent programming model” for specifying parallel computations, independent of the data processing models. Users process multi-structured datasets by using appropriate data processing models for each dataset, mapping those data processing models onto epiC's model, and writing appropriate epiC extensions. As with Hadoop, programs written in this way are automatically parallelized, and the runtime system takes care of fault tolerance and inter-machine communication. Table 2 presents the different programming abstractions.
DATA DISTRIBUTION AND REPLICATION
The main issue with respect to data intensive applications is that the granularity of data partitions and the placement of replicas are decided by the underlying distributed file system. Existing data-parallel frameworks, e.g., Hadoop or Hadoop-based clouds, distribute data using a random placement method for simplicity and load balance.
Many data intensive applications exhibit interest locality: they sweep only part of a big data set, and data that are often accessed together owe this to their grouping semantics. Without taking data grouping into consideration, random placement performs well below the efficiency of an optimal data distribution. When semantic boundaries or interest locality are ignored, problems may also arise in applications that read binary files such as images and video. The challenge lies in identifying optimal data groupings and re-organizing data layouts to achieve maximum parallelism per group subject to load balance. DRAW (J. Wang, Xiao, Yin, & Shang, 2013) dynamically scrutinizes data accesses from system log files, extracts optimal data groupings, and re-organizes data layouts to achieve maximum parallelism per group subject to load balance.
The authors in (Guo, Luo, & Cui, 2014) use genetic algorithms to evolve a data placement strategy based on the cost of distributed transactions. If a transaction needs to access two different slices of data stored on different data nodes, a collaboration cost between the two data nodes is incurred. This information, obtained from data log files and applications, is represented as a matrix. A genetic algorithm is then used to find a data slice placement that minimizes the total collaboration cost while not exceeding the capacity limits of the data nodes.
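A minimal sketch of the objective such a genetic algorithm minimizes is shown below: given an assignment of data slices to nodes and a slice-to-slice collaboration-cost matrix, the fitness is the total cost incurred by co-accessed slices that end up on different nodes. The helper names and the random search standing in for a full genetic algorithm are illustrative assumptions, not the cited authors' implementation.

import random

# Illustrative sketch of the placement objective: total collaboration cost of
# co-accessed data slices that land on different nodes. A real GA would evolve
# placements; here a random search stands in to keep the example short.

def placement_cost(placement, cost_matrix):
    # placement[i] = node holding slice i; cost paid only when slices i, j differ.
    n = len(placement)
    return sum(cost_matrix[i][j]
               for i in range(n) for j in range(i + 1, n)
               if placement[i] != placement[j])

def random_search(cost_matrix, num_nodes, capacity, iters=2000, seed=0):
    rng = random.Random(seed)
    n = len(cost_matrix)
    best, best_cost = None, float("inf")
    for _ in range(iters):
        cand = [rng.randrange(num_nodes) for _ in range(n)]
        # Respect per-node capacity (maximum slices per node).
        if any(cand.count(node) > capacity for node in range(num_nodes)):
            continue
        c = placement_cost(cand, cost_matrix)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost

if __name__ == "__main__":
    # Slices 0 and 1 are frequently co-accessed (high cost if separated).
    costs = [[0, 9, 1],
             [9, 0, 1],
             [1, 1, 0]]
    print(random_search(costs, num_nodes=2, capacity=2))
    # Expected: slices 0 and 1 end up on the same node, e.g. ([0, 0, 1], 2)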
With a large number of nodes in a cloud computing system, it is difficult to require that all nodes have the same performance and capacity in their CPUs, memory, and disks. If the nodes in a data center are not homogeneous, the data of a high-QoS application may be replicated on a low-performance node with slow communication and disk access latencies. Later, if the data on the node running the high-QoS application is corrupted, the application's data will have to be retrieved from the low-performance node, and because of its slow communication and disk access latencies, the QoS requirement of the high-QoS application may be violated.
The QoS-aware data replication (QADR) problem for data-intensive applications in cloud computing systems is addressed in (Lin, Chen, & Chang, 2013). The main goal of the QADR problem is to minimize the data replication cost and the number of QoS-violated data replicas. To solve the QADR problem, the authors propose a greedy algorithm called high-QoS first-replication (HQFR): an application with a higher QoS requirement takes precedence over other applications in performing data replication. Since the HQFR algorithm cannot guarantee this minimum objective, the optimal solution of the QADR problem is formulated as an integer linear program (ILP). However, the ILP formulation involves complicated computation, so to find the optimal solution efficiently, the QADR problem is transformed into a minimum-cost maximum-flow (MCMF) problem. Compared to the HQFR algorithm, the optimal algorithm takes more computational time, but both proposed replication algorithms run in polynomial time, with time complexities dependent on the number of nodes in the cloud computing system. To accommodate large-scale cloud computing systems, scalable replication is addressed through node combination techniques that keep the computational time of the QADR problem from growing linearly with the number of nodes.
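The greedy high-QoS-first idea can be sketched as follows: serve replication requests in order of strictness and give each one the lowest-latency qualified node that still has capacity, recording a violation when no node qualifies. This is a simplified, hypothetical rendering of the principle described above, not the published HQFR algorithm; the data structures and field names are invented.

# Simplified sketch of a "high-QoS first" greedy replication policy.
# Hypothetical illustration, not the published HQFR algorithm.

def hqfr(requests, nodes):
    # requests: list of dicts {"block": id, "max_latency_ms": QoS requirement}
    # nodes:    list of dicts {"name": str, "latency_ms": float, "capacity": int}
    placements, violations = {}, []
    # Strictest QoS requirement (smallest allowed latency) first.
    for req in sorted(requests, key=lambda r: r["max_latency_ms"]):
        qualified = [n for n in nodes
                     if n["latency_ms"] <= req["max_latency_ms"] and n["capacity"] > 0]
        if qualified:
            node = min(qualified, key=lambda n: n["latency_ms"])
            node["capacity"] -= 1
            placements[req["block"]] = node["name"]
        else:
            violations.append(req["block"])   # QoS-violated replica
    return placements, violations

if __name__ == "__main__":
    nodes = [{"name": "fast", "latency_ms": 5,  "capacity": 1},
             {"name": "slow", "latency_ms": 50, "capacity": 2}]
    requests = [{"block": "A", "max_latency_ms": 10},
                {"block": "B", "max_latency_ms": 100},
                {"block": "C", "max_latency_ms": 8}]
    print(hqfr(requests, nodes))
    # Strict blocks C and A compete for the fast node; B fits on the slow node.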
Cloud computing is based on commodity servers, and hence failures of nodes in data centers are the norm rather than the exception. When a failure occurs, the intermediate data of the workflow executed up to the point of failure that was stored on the failed node is lost. To provide fault tolerance, the workflow is re-executed so that the lost intermediate data is recovered. CARDIO trades off the cost of replicating intermediate data against the cost of re-executing a given dataflow with a set of stages (Castillo, Tantawi, Arroyo, & Steinder, 2012). In CARDIO the minimum reliability cost problem is formulated as an integer programming optimization problem with a nonlinear convex objective function. CARDIO takes into account the probability of losing data, the cost of replication, the storage capacity available for replication, and potentially the current resource utilization in the system. CARDIO is implemented as a decision layer on top of Hadoop that makes intelligent replication decisions as the dataflow advances towards completion; it reconsiders its replication strategy at the completion of every stage in a workflow.
Geo-replication of data across multiple datacenters offers numerous advantages: improved accessibility and reduced access latency for users, and fault tolerance and disaster recovery for service providers. Simple static replica creation strategies that assign the same number of replicas to all data are not suitable in such scenarios. To address this issue, Ye et al. propose a two-layer geo-cloud based dynamic replica creation strategy called TGstag (Ye, Li, & Zhou, 2014). TGstag combines two strategies: policy-constrained heuristic inter-datacenter replication and load-aware adaptive intra-datacenter replication. It aims to minimize both cross-datacenter bandwidth consumption and average access time under constraints of policy and commodity node capacity.
Geo-replication of key-value stores is relatively easier because the atomicity of accesses is limited to a single key, whereas the traditional data management approach takes a holistic view of data, which makes it complex to scale commercial database management systems (DBMSs) in a distributed setting. Google's Megastore, a storage system for interactive online services, provides a sharded data model which combines the scalability of a NoSQL datastore with the convenience of a traditional RDBMS (Baker et al., 2011). It provides both strong consistency guarantees and high availability, with fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows each write to be synchronously replicated across a wide-area network with reasonable latency, using Paxos-based replica consistency. Megastore supports seamless failover between datacenters but suffers from poor write throughput. Google's Spanner builds a globally distributed database over multiple datacenters (Corbett et al., 2012). It uses a sharded data model and a synchronous replication layer based on Paxos. It provides versioned data and supports SQL transactions as well as key-value reads and writes, external consistency, and automatic data migration across machines (even across datacenters) for load balancing and fault tolerance. Table 3 presents some research work addressing data placement and replication issues in data intensive cloud systems.
Table 3. Data placement and replication for data-intensive cloud systems
RESOURCE PROVISIONING AND SCHEDULING
The main resource in contention in data intensive applications is the storage for large amounts of data. Existing scheduling techniques focus on task scheduling based on processing time; storage needs must be considered as well. It is also necessary to scale out for resources across the boundaries of the data centre into hybrid-cloud, multi-cloud, or federated-cloud scenarios. Both time- and cost-aware execution of data-intensive applications in such cloud settings is the need envisaged in the papers reviewed here. The main challenges are therefore the use of hybrid, multi-, and federated clouds, topology-aware resource allocation to minimize costs, and QoS awareness.
The authors in (Lee, Tolia, Ranganathan, & Katz, 2011) propose an architecture called TARA for optimized resource allocation for data intensive
workloads in Infrastructure-as-a-Service (IaaS)-based cloud systems. The idea is to allocate VMs considering network topology so as to route
inter-VM traffic away from bottlenecked network paths. They use a “what if” methodology to guide allocation decisions taken by the IaaS. The
architecture uses a prediction engine with a lightweight simulator to estimate the performance of a given resource allocation and a genetic
algorithm to find an optimized solution in the large search space. The prediction engine is the entity responsible for optimizing resource
allocation. When it receives a resource request, the prediction engine iterates through the possible subsets of available resources (each distinct
subset is known as a candidate) and identifies an allocation that optimizes estimated job completion time. However, even with a lightweight
prediction engine, exhaustively iterating through all possible candidates is infeasible due to the scale of IaaS systems. A genetic algorithm-based
search technique allows TARA to guide the prediction engine through the search space intelligently.
A dynamic federation of data intensive cloud providers is proposed in (Hassan & Huh, 2011) to gain economies of scale and to enlarge their virtual machine (VM) infrastructure capabilities (i.e., storage and processing capacity) to meet the requirements of data intensive applications. The authors develop an effective dynamic resource management mechanism based on game theory for data intensive IaaS cloud providers, modeling the economics of VM resource supply in a federated environment. Both cooperative and non-cooperative games using price-based resource allocation are analyzed.
Modern servers have multiple cores and a range of disk storage devices presented to the user as a single logical disk. Multiple parallel processes and virtual machines are provisioned on these servers so as to efficiently utilize the computing power. However, disk I/O resources are still scarce, and the multiple parallel processes competing for I/O interfere with each other, degrading performance. The authors in (Groot, Goda, Yokoyama, Nakano, & Kitsuregawa, 2013) propose a model for predicting the impact of I/O interference which can be used for efficient resource allocation and scheduling of data intensive applications.
Jrad et al. (Jrad, Tao, Brandic, & Streit, 2014) propose a multi-dimensional resource allocation scheme to automate the deployment of data-intensive large-scale applications in multi-cloud environments. A two-level approach is used: first, the target clouds are matched with respect to Service Level Agreement (SLA) requirements and user payment; then the application workloads are distributed to the selected clouds, taking data locality into consideration during scheduling. Table 4 lists various resource allocation strategies specifically targeting data intensive cloud systems.
Table 4. Resource allocation for data-intensive cloud systems
A further challenge is supporting dynamic load balancing of computations and dynamic scaling of compute resources. Large-scale data centers leverage virtualization technology to achieve excellent resource utilization, scalability, and high availability. Ideally, the performance of an application running inside a virtual machine (VM) should be independent of co-located applications and VMs sharing the physical machine.
However, adverse interference effects exist and are especially severe for data intensive applications in such virtualized environments. TRACON (Chiang & Huang, 2011) is a Task and Resource Allocation CONtrol framework that mitigates the interference effects of concurrent data-intensive applications and greatly improves overall system performance. TRACON utilizes modeling and control techniques from statistical machine learning and consists of three major components: an interference prediction model that infers application performance from resource consumption observed across different VMs, an interference-aware scheduler designed to utilize the model for effective resource management, and a task and resource monitor that collects application characteristics at runtime for model adaptation. The reported results indicate that TRACON can achieve up to a 25 percent improvement in application throughput on virtualized servers.
In scientific applications such as high energy physics and bioinformatics, we encounter applications involving numerous, loosely coupled jobs
that both access and generate large data sets. These data sets may be available at multiple, geo-distributed locations and an application might
seek to use all these data sets. Data Grids make it possible to access geographically distributed resources for such large-scale data-intensive
problems(Mansouri, 2014). Yet effective scheduling in such environments is challenging, due to a need to address a variety of metrics and
constraints such as resource utilization, response time, global and local allocation policies.
It is a well-known result in grid systems that while it is necessary to consider data locality when scheduling, job scheduling and data replication can be effectively decoupled (Ranganathan & Foster, 2002). Data intensive scientific workflows process files in the range of terabytes and generate voluminous intermediate data. Many recent works have focused on scheduling workflows so as to minimize data transfer costs. An evolutionary approach to task scheduling proposed in (Szabo, Sheng, Kroeger, Zhang, & Yu, 2014) optimizes the allocation and ordering of tasks in the workflow such that the data transferred between tasks and the execution runtime are minimized together. A similar concept is used in (Xiao, Hu, & Zhang, 2013), where a heuristic called Minimal Data-Accessing Energy Path schedules data-intensive workflows so as to reduce the energy consumed by intensive data accesses.
Data intensive applications need to process large amounts of intermediate data, whereas schedulers typically consider only the processing demands of an application and ignore its storage needs. This can cause performance degradation and potentially increase costs. An integer linear programming scheduler that considers disk storage scheduling in addition to processor-time-based task scheduling is proposed in (Pereira, Bittencourt, & da Fonseca, 2013). The proposed scheduler aims to meet a deadline set by the user while minimizing costs.
When scientific data intensive workflows process data that is geographically distributed and the web services used are themselves distributed, the workflow orchestrator itself can be moved to a location close to the data source and the web service nodes (Luckeneder & Barker, 2013). Here the data transfer time is minimized, in turn reducing execution times. There have also been many attempts in the Hadoop community to bring computation closer to geo-distributed data sources; many recent works have implemented MapReduce across data centres (L. Wang et al., 2013; Mattess, Calheiros, & Buyya, 2013; Heintz et al., 2014). Table 5 lists some recent work on scheduling data intensive applications in the cloud.
Table 5. Scheduling data-intensive applications in the cloud
Data intensive computing is gaining a lot of attention due to the “Data Deluge”. Apart from the works discussed here, a dispersed cloud infrastructure that uses voluntary edge resources for both computation and data storage is proposed in (Ryden, Oh, Chandra, & Weissman, 2014). The lightweight Nebula architecture enables distributed data-intensive computing through a number of optimizations, including location-aware data and computation placement, replication, and recovery. There is also renewed interest in hybrid (Bicer, Chiu, & Agrawal, 2012) and multi-cloud (Jung & Kettimuthu, 2014) architectures for data intensive cloud applications.
FUTURE RESEARCH DIRECTIONS
Deploying data intensive applications in a cloud environment is still fraught with many challenges. The complications arise from the diverse nature of data intensive applications, the geographically distributed data sources, and the legal issues of jurisdiction over accessing and storing data. Each application type needs a different application architecture and different performance guarantees. Some possible research directions in this regard are identified below.
• The geo-distribution of data sources can be handled by either taking the computation to the data or transferring data to the computation.
o A single cloud might not have enough resources to hold the entire application data. Hence an inter-cloud architecture (multi-cloud, hybrid-cloud, or federated clouds) is an attractive option to process application data in situ, avoiding costly data transfers.
o Efficient data transfer techniques can be devised to enable large-scale inter-datacentre data transfers across WANs. These techniques could also be useful in transferring data from its source to the cloud data centre.
• Minimizing the overhead associated with data transfers by performing efficient, dynamic data replication across data centers.
• QoS-aware resource management and resource allocation within and/or across data centres.
• Providing simple and intuitive programming models to help ease application development across multiple data centres and interoperable
clouds.
CONCLUSION
Data intensive computing provides enormous benefits in science, governance, social, and business applications. The technology for data intensive cloud computing is constantly improving, but a large number of problems are also being encountered. This chapter provides a study of current work in the area of data intensive cloud computing. Some directions for further research enabling data intensive applications in cloud environments are identified. The challenges are many and open up a lot of opportunities for research in this area.
This work was previously published in Advanced Research on Cloud Computing Design and Applications edited by Shadi Aljawarneh, pages 305-320, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Ács, B., Llorà, X., Capitanu, B., Auvil, L., Tcheng, D., Haberman, M., … Welge, M. (2011). Meandre Data-Intensive Application Infrastructure: Extreme Scalability for Cloud and/or Grid Computing. In New Frontiers in Artificial Intelligence (pp. 233–242). Academic Press.
Albrecht, M., Donnelly, P., Bui, P., & Thain, D. (2012). Makeflow: a portable abstraction for data intensive computing on clusters, clouds, and
grids. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (SWEET '12).
10.1145/2443416.2443417
Baker, J., Bond, C., Corbett, J., & Furman, J. (2011). Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR,
223–234.
Bicer, T., Chiu, D., & Agrawal, G. (2012). Time and Cost Sensitive Data-Intensive Computing on Hybrid Clouds. In 2012 12th IEEE/ACM
International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), (pp. 636-643). 10.1109/CCGrid.2012.95
Chiang, R. C., & Huang, H. H. (2014). TRACON: Interference-Aware Scheduling for Data-Intensive Applications in Virtualized Environments. 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 25(5), 1-12. doi:10.1109/TPDS.2013.82
Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., … Woodford, D. (2012). Spanner: Google's Globally-Distributed Database. In Proceedings of OSDI'12: Tenth Symposium on Operating System Design and Implementation (pp. 251–264). doi:10.1145/2491245
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating Systems Design and Implementation, OSDI '04 (pp. 137–149).
Grolinger, K., Hayes, M., Higashino, W. A., L'Heureux, A., Allison, D. S., & Capretz, M. A. M. (2014). Challenges for MapReduce in Big Data. In Proc. of the IEEE 10th 2014 World Congress on Services (SERVICES 2014) (pp. 182-189). IEEE. doi:10.1109/SERVICES.2014.41
Groot, S., Goda, K., Yokoyama, D., Nakano, M., & Kitsuregawa, M. (2013). Modeling I/O interference for data intensive distributed applications.
In Proceedings of the 28th Annual ACM Symposium on Applied Computing SAC '13. ACM.
Hassan, M. M., & Huh, E. (2011). Resource Management for Data Intensive Clouds Through Dynamic Federation: A Game Theoretic Approach .
In Furht, B., & Escalante, A. (Eds.), Handbook of Cloud Computing (pp. 169–188). Boston, MA: Springer US; doi:10.1007/978-1-4614-1415-5_7
Heintz, B., Chandra, A., Sitaraman, R. K., & Weissman, J. (2014). End-to-end Optimization for Geo-Distributed MapReduce. IEEE Transactions on Cloud Computing, 7161(c), 1–14. doi:10.1109/TCC.2014.2355225
Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R. H., . . . Stoica, I. (2011, March). Mesos: A Platform for Fine-Grained
Resource Sharing in the Data Center. Proceedings of the 8th USENIX conference on Networked systems design and implementation (NSDI'11).
USENIX Association.
Jiang, D., Chen, G., Ooi, C., Tan, K., & Wu, S. (2014). epiC: An Extensible and Scalable System for Processing Big Data. Proceedings of the VLDB Endowment, 7(7), 541–552. doi:10.14778/2732286.2732291
Jrad, F., Tao, J., Brandic, I., & Streit, A. (2014). Multi-dimensional Resource Allocation for Data-intensive Large-scale Cloud Applications. Proceedings of the 4th International Conference on Cloud Computing and Services Science (pp. 691–702). doi:10.5220/0004971906910702
Jung, E., & Kettimuthu, R. (2014). Towards Addressing the Challenges of Data-Intensive Computing on the Cloud. Computer ,47(12), 82–85.
doi:10.1109/MC.2014.347
Lee, G., Tolia, N., Ranganathan, P., & Katz, R. H. (2011). Topology-aware resource allocation for data-intensive workloads. Computer
Communication Review , 41(1), 120. doi:10.1145/1925861.1925881
Lin, J., Chen, C., & Chang, J. (2013). QoS-aware data replication for data intensive applications in cloud computing systems. IEEE Transactions
on Cloud Computing, 1(1), 101–115. Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6562695
Luckeneder, M., & Barker, A. (2013). Location, Location, Location: Data-Intensive Distributed Computing in the Cloud. 2013 IEEE 5th
International Conference on Cloud Computing Technology and Science, (pp. 647–654). doi:10.1109/CloudCom.2013.91
Mansouri, N. (2014). Network and data location aware approach for simultaneous job scheduling and data replication in large-scale data
grid. Frontiers of Computer Science , 8(3), 391–408. doi:10.1007/s11704-014-3146-2
Mattess, M., Calheiros, R. N., & Buyya, R. (2013). Scaling MapReduce Applications Across Hybrid Clouds to Meet Soft Deadlines. 2013 IEEE 27th
International Conference on Advanced Information Networking and Applications (AINA), (pp. 629–636). doi:10.1109/AINA.2013.51
Middleton, A. M. (2010). Data-Intensive Technologies for Cloud Computing. In Furht, B., & Escalante, A. (Eds.), Handbook of Cloud Computing (pp. 83–136). Boston, MA: Springer US. doi:10.1007/978-1-4419-6524-0_5
Moretti, C., Bulosan, J., Thain, D., & Flynn, P. J. (2008). All-pairs: An abstraction for data-intensive cloud computing. 2008 IEEE International Symposium on Parallel and Distributed Processing (pp. 1–11). doi:10.1109/IPDPS.2008.4536311
Pereira, W. F., Bittencourt, L. F., & da Fonseca, N. L. S. (2013). Scheduler for data-intensive workflows in public clouds. 2nd IEEE Latin American Conference on Cloud Computing and Communications (LatinCloud) (pp. 41-46). doi:10.1109/LatinCloud.2013.6842221
Ranganathan, K., & Foster, I. (2002). Decoupling computation and data scheduling in distributed data-intensive applications. In Proceedings 11th IEEE International Symposium on High Performance Distributed Computing (pp. 352–358). IEEE Computer Society. doi:10.1109/HPDC.2002.1029935
Ryden, M., Oh, K., Chandra, A., & Weissman, J. (2014). Nebula: Distributed edge cloud for data-intensive computing. 2014 International Conference on Collaboration Technologies and Systems (CTS) (pp. 491–492). doi:10.1109/CTS.2014.6867613
Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A Survey of Large Scale Data Management Approaches in Cloud Environments. IEEE
Communications Surveys and Tutorials ,13(3), 311–336. doi:10.1109/SURV.2011.032211.00087
Shamsi, J., Khojaye, M. A., & Qasmi, M. A. (2013). Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and
Solutions. Journal of Grid Computing , 11(2), 281–310. doi:10.1007/s10723-013-9255-6
Szabo, C., Sheng, Q. Z., Kroeger, T., Zhang, Y., & Yu, J. (2014). Science in the Cloud: Allocation and Execution of Data-Intensive Scientific
Workflows. Journal of Grid Computing , 12(2), 245–264. doi:10.1007/s10723-013-9282-3
Vavilapalli, V. K. (2013). Apache Hadoop YARN: yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud
Computing (SOCC '13). ACM. 10.1145/2523616.2523633
Wang, J., Xiao, Q., Yin, J., & Shang, P. (2013). DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications
With Interest Locality . IEEE Transactions on Magnetics , 49(6), 2514–2520. doi:10.1109/TMAG.2013.2251613
Wang, L., Tao, J., Ranjan, R., Marten, H., Streit, A., Chen, J., & Chen, D. (2013). G-Hadoop: MapReduce across distributed data centers for data-
intensive computing. Future Generation Computer Systems , 29(3), 739–750. doi:10.1016/j.future.2012.09.001
Xiao, P., Hu, Z.-G., & Zhang, Y.-P. (2013). An Energy-Aware Heuristic Scheduling for Data-Intensive Workflows in Virtualized
Datacenters. Journal of Computer Science and Technology , 28(6), 948–961. doi:10.1007/s11390-013-1390-9
Ye, Z., Li, S., & Zhou, J. (2014). A two-layer geo-cloud based dynamic replica creation strategy. Applied Mathematics and Information
Sciences , 8(1), 431–440. doi:10.12785/amis/080154
KEY TERMS AND DEFINITIONS
Data Placement: To store appropriate pieces of data locally at the node, rack, data centre, region or availability zone depending on the context;
the aim is to allow the application to operate on data available locally, so as to avoid communication and data transfer costs.
Data Transfer: Large scale data movement across geographically distributed data centres using Wide Area Network (WAN).
Geo-Distributed Data: Data that are generated across countries from simulations, observations, experiments, etc., and are stored at their site of origin.
Intercloud: Globally interconnected clouds, or a cloud of clouds, analogous to the definition of the Internet as a network of networks. Common future use cases and functional requirements for Intercloud computing are published in a white paper by the Global Inter-Cloud Technology Forum (GICTF).
Multicloud: A type of Intercloud in which the clouds operate independently of each other, in contrast to federated clouds, which have an agreement to use each other's resources.
Programming Models: An abstraction of computing systems that serves as an intermediary between the hardware architecture and the
software layers available to applications; allows algorithms and data structures to be expressed independently of the programming language.
Replication: Creating multiple copies of a file, database, or object with a view to increasing data locality, availability, and fault tolerance. In the cloud computing scenario, data replication techniques need to deal with consistency and replica lifetime issues while keeping down storage and data transfer costs.
Scheduling: Allocating resources to jobs so that the work specified in the jobs is completed while maximizing throughput and fairness among contending jobs and minimizing response time and latency.
CHAPTER 30
Techniques for Sampling Online Text-Based Data Sets
Lynne M. Webb
University of Arkansas, USA
Yuanxin Wang
Temple University, USA
ABSTRACT
The chapter reviews traditional sampling techniques and suggests adaptations relevant to big data studies of text downloaded from online media
such as email messages, online gaming, blogs, micro-blogs (e.g., Twitter), and social networking websites (e.g., Facebook). The authors review
methods of probability, purposeful, and adaptive sampling of online data. They illustrate the use of these sampling techniques via published
studies that report analysis of online text.
INTRODUCTION
Studying social media often involves downloading publicly available textual data. Based on studies of email messages, Facebook, blogs, gaming websites, and Twitter, this essay describes sampling techniques for selecting online data for specific research projects. As previously noted (Webb & Wang, 2013; Wiles, Crow, & Pain, 2011), research methodologies for studying online text tend to follow or adapt existing research methodologies, including sampling techniques. The sampling techniques discussed in this chapter follow well-established sampling practices, resulting in representative and/or purposeful samples; however, the established techniques have been modified to apply to sampling online text, where unusually large populations of messages are available for sampling and the population of messages is in a state of constant growth. The sampling techniques discussed in this chapter can be used for both qualitative and quantitative research.
Rapidly advancing internet technologies have altered daily life as well as the academic landscape. Researchers across disciplines are interested in
examining the large volumes of data generated on internet platforms, such as social networking sites and mobile devices. Compared to data
collected and analyzed through traditional means, big data generated around-the-clock on the internet can help researchers identify latent
patterns of human behavior and perceptions that were previously unknown. The richness of the data brings economic benefits to diverse data-
intensive industries such as marketing, insurance, and healthcare. Repeated observations of internet data across time amplify the size of already
large data sets; data gathered across time have long interested academics. Vast-sized data sets, typically called “big data,” share at least four traits: the data are unstructured, growing at an exponential rate, transformational, and highly complicated.
As more big data sets become available to researchers through the convenience of internet technologies, the ability to analyze the big data sets can weaken. Many factors can contribute to a deficiency in analysis. One major obstacle can be the capability of the analytical systems. Although software developers have introduced multiple analytical tools for scholars to employ with big data (e.g., Hadoop, Storm), the transformational nature of big data requires frequent software updates as well as increases in relevant knowledge. In other words, analyzing big data requires specialized knowledge. Another challenge is selecting an appropriate data-mining process. As Badke (2012, p. 47) argued, seeking “specific results for specific queries” without employing the proper mining process can further complicate the project instead of helping manage it. Additionally, multi-petabyte data sets that include millions of files from heterogeneous operating systems might be too large to back up through conventional computing methods. In such a case, the choice of the data mining tool becomes critical in determining the feasibility, efficiency, and accuracy of the research project.
Many concerns raised regarding big data collection and analysis duplicate concerns surrounding conventional online data collection:
• Credibility of Online Resources: Authors of the online text often post anonymously. Their responses, comments, or articles are
susceptible to credibility critiques;
• Privacy Issues: Internet researchers do not necessarily have permission from the users who originally generated the text. Users are particularly uncomfortable when data generated from personal information, such as Facebook posts or text messages on mobile devices, are examined without their explicit permission. No comprehensive legal system currently exists that draws a clear distinction between publicly available data and personal domains;
• Security Issues: While successful online posters, such as bloggers, enjoy the free publicity of the internet, they can also be victimized by co-option of their original work and thus violation of their intellectual property rights. It is difficult for researchers to identify the source of a popular Twitter post that is re-tweeted thousands of times, often without acknowledgment of the original author. Therefore, data collected from open-access online sources might infringe authors' copyrights.
Despite these concerns, researchers and entrepreneurs collect large data sets from the internet and attempt to make sense of the trends contained
therein. Howe et al. (2008) issued a call to action for scientists to assist in coping with the complexities of big data sets. Bollier (2010) observed
that “small samples of large data sets can be entirely reliable proxies for big data” (p. 14). Furthermore, boyd (2010) raised serious questions
about representative sampling of big data sets. Indeed, such incredibly large and complex data sets cry out for effective sampling techniques to
manage the sheer size of the data set, its complexity, and perhaps most importantly, its on-going growth. In this essay, we review multiple
sampling techniques that effectively address this exact set of issues.
BACKGROUND
Because millions of internet venues exist, containing thousands of posts with an ever increasing number of messages, a sampling plan is essential
for any study of new media. Indeed, this wealth of data awaiting harvest can be bewildering in its complexity (Hookway, 2008) and thus studies
of online textual data require methodical planning and procedures for sampling. We adopt Thompson’s definition of sampling as “selecting part
of a population to observe so that one may estimate something about the whole population” (2012, p. 1). In this section, we describe well-
established sampling practices that are widely used in the social sciences and explain how internet researchers working with big data can employ
such practices to produce representative and/or purposeful samples.
Population, Census, and Sampling
In an ideal world, every research project would conduct a census of its objects of study. That is, in an ideal world, the researcher would examine
the entire population and report the findings. We define population as every instance of the phenomenon under study to which the researcher
desires to generalize (e.g., all Twitter posts about a new product during the week following its launch). A census (sampling all relevant instances
of a phenomenon) eliminates any concerns about whether a sample is representative versus biased or inaccurate, because all instances are
examined.
On rare occasion, an approximation of a census is possible. For example, Webb, Hayes, Chang, & Smith (2012) examined every blog post on the
topic of brides or weddings that appeared in the five longest threads of conversation on every fan website of AMC’s drama Mad Men (n = 11) to
describe how fans interpret and describe the weddings that appeared on the TV show across its first three seasons. Note, however, that Webb,
Hayes et al. approximated a census. They examined only the five longest threads of conversation. Furthermore, they did not examine all fan
websites—only those surrounding one television program. Finally, the researchers did not examine every mention of brides or weddings
on all fan websites. Because their sample, even though comprehensive in certain ways, examined only the fan websites for one object of
fandom, Mad Men, they can generalize their findings only to what they sampled, fans of this one television show who posted on fan websites
during the first three seasons. Such is the dynamic relationship between samples, populations, and generalizations—the researcher’s results only
apply to the given sample; however, if the sample is comprehensive or representative, an argument can be made for generalizing beyond the
sample to the larger population.
In the real world of research, censuses are rarely attempted, as time and budgets limit the number of incidences examined. Additionally,
researchers often employ analyses so detailed that only a finite number of incidences can be subjected to analysis for the project to be completed
in a timely manner. For this reason, researchers might sample the population of incidents to select a limited number of incidences for analysis.
The subset selected for analysis is called “the sample.” The set of all incidences of the object of study from which the sample is selected is called
“the population.”
SAMPLING AND BIG DATA SETS
The Question of Sample Size
Whenever a researcher draws a sample, the question of sample size arises. Researchers desire to draw samples large enough to represent the
diversity in the population, but no larger than necessary. Each sample draw requires time, effort, and potentially money to both gather and
analyze the data. Thus, efficient researchers aspire to draw samples of sufficient size, but no larger. What is the ideal sample size? Exact formulas
exist for calculating appropriate sample size for quantitative analyses, depending on three factors: population size, statistics of interest, and
tolerance for error (see, for example, Chapter 4 in Thompson, 2012). Appropriate sample size for qualitative analyses is determined typically by
the saturation process. When no new themes emerge from qualitative analyses of the data set, then the analyses are considered complete and the
sample is determined. The difficulty, of course, with using saturation as a guideline for sampling internet data is that by the time the researcher
knows that more data is needed for inclusion in the sample, the time of data collection can be long past. Therefore, as a practical matter, most
qualitative researchers rely on previously published reports of sample size as a guideline for how much data to download. As the researcher
conducts the relevant literature review, a mean sample size employed in previous studies can be ascertained by taking special notice of the
sample sizes in previous studies of similar phenomena using similar methods. In addition to the typical sample size employed in published work
on similar objects of study, cautious researchers add a cushion of an additional 10 – 20% of incidences in case saturation comes later than usual
in the given sample.
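For the quantitative case, one widely used formula of this kind is Cochran's sample-size formula with a finite population correction; the sketch below computes it and then adds the 10-20% cushion mentioned above. The population size and cushion used here are illustrative, and the formula is a standard one rather than one prescribed in this chapter.

import math

# Cochran's sample-size formula with finite population correction,
# plus the 10-20% cushion discussed above. Illustrative values only.

def cochran_sample_size(population, margin_of_error=0.05, z=1.96, p=0.5):
    # Sample size for estimating a proportion p within +/- margin_of_error,
    # at the confidence level implied by z (1.96 corresponds to roughly 95%).
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                 # finite population correction
    return math.ceil(n)

if __name__ == "__main__":
    N = 250_000                       # e.g., tweets mentioning a product in one week
    n = cochran_sample_size(N)
    cushioned = math.ceil(n * 1.15)   # add a ~15% cushion against late saturation
    print(n, cushioned)               # roughly 384 and 442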
Sampling Techniques
The techniques the researcher employs to select incidences from the population and into the sample have received intense scholarly scrutiny
(Daniel, 2012; Kalton, 1983; Scheaffer, Mendenhall, Ott, & Gerow, 2012; Thompson, 2012). Any acceptable sampling technique must be carefully
selected and defended using the principles discussed in this section of the chapter. The goal of all sampling techniques is to obtain a
representative sample (i.e., a sample in which the incidences selected accurately portray the population). Of course, obtaining a representative
sample from a big-data set is especially challenging, given its ever changing and ever growing nature as well as its potential complexity. However,
widely-accepted methods for obtaining a representative sample are multiple; the most common techniques are discussed below.
Critical Decision Point: Probability vs. Convenience Sampling
The first choice before the researcher is whether to employ a probability sample or a convenience sample. An argument can be made that both
kinds of samples are representative of their respective populations; however, the argument relies on the phenomenon under study.
“Sample designs based on planned randomness are called probability samples” (Schaeffer et al., 2012, p. 10). In probability sampling, each
element has a known, nonzero probability of inclusion into the sample (Kalton, 1983). In the case where every instance has an equal probability
of selection into the sample (simple random sampling), the researcher can offer a statistical argument for the representativeness of the sample.
The argument is based primarily on two points: “selection biases are avoided” (Kalton, 1983, p. 7), and statistical theory allows for the prediction
that the sample is likely to be representative of the population from which it was drawn.
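As a practical illustration, once every message in the sampling frame has been listed (for example, as rows of a CSV file), a simple random sample can be drawn in a few lines of code. The file name, column name, and sample size below are hypothetical.

import csv
import random

# Simple random sampling: every message in the sampling frame has an equal,
# known probability of selection. File and field names are hypothetical.

def simple_random_sample(frame_csv, n, seed=42):
    with open(frame_csv, newline="", encoding="utf-8") as f:
        frame = list(csv.DictReader(f))   # one row per message in the frame
    random.seed(seed)                     # fixed seed makes the draw reproducible
    return random.sample(frame, n)        # selection without replacement

if __name__ == "__main__":
    sample = simple_random_sample("tweets_week1.csv", n=400)
    print(len(sample), sample[0]["message_id"])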
In convenience sampling, the researcher includes easy-to-access incidences in the sample, thus saving time and money involved in drawing a
probability sample. “The weakness of all nonprobability sampling is its subjectivity” (Kalton, 1983, p. 7). Conversely, convenience sampling is
always the best choice when a compelling argument can be made that the phenomenon under study is likely to be equally represented in all
incidences of the object of study. For example, the studies to determine human body temperature were not conducted with random samples of
the human population, but rather with the incidences “at hand,” specifically medical residents and students as well as nurses and nursing
students. The researchers reasoned that if human beings as a species retain an average body temperature, it does not matter which human
specimens were included in the sample; the average temperature of any sample of humans would serve as an accurate approximation for the
average temperature in the population. Given the tremendous diversity of internet text, researchers examining this object of study rarely have the
opportunity to argue for the universality of the phenomenon under study and thus typically employ probability rather than convenience
sampling.
Probability Sampling
How do researchers engage in probability sampling? Such sampling involves two steps described in detail below: (a) selection of the population
and sampling frame as well as (b) choosing the sampling technique that selects incidences for inclusion in the sample.
Selecting a Population and Sample Frame
The researcher must decide on a population to study. For example, will a study of political blogs examine a sample drawn from all political blogs
in the world or in a given nation? Will the study examine only filter blogs, campaign blogs, popular blogs, political party blogs, or another type of
blog of interest?
Most sampling experts describe selecting a population as a “first step” in methodological design (e.g., Kalton, 1983, p. 6). In reality, for most
research projects, the selected population under study exists in a dynamic relationship to its sampling frame. That is, researchers often define the
population via the sampling frame and vice versa. The term sampling frame refers to a list of all incidences in the population—all the incidences
available for sampling (Scheaffer et al., 2012). For example, Waters and Williams (2011) defined their population and their sampling frame as all
U. S. government agencies; they reported randomly selecting from this sampling frame 60 government agencies to examine their recent tweets.
Given the dynamic nature of internet texts, discovering an accurate sampling frame can prove challenging. Researchers may employ convenience
sampling when “there is no single systematic register from which to select randomly” (Thelwall & Stuart, 2007, p. 530). This practice was more
common in early studies on online text (e.g., Bar-Ilan, 2005). Contemporary researchers can select among multiple techniques to address this
problem:
• Rely on pre-existing services and companies to provide sampling frames. As Tremayne, Zheng, Lee, and Jeong (2006) noted, “A number of
websites provide rankings of blogs, usually according to the total number of web links pointing to each blog” (p. 296). Thus, help is available
to identify existing sampling frames. For example, Thelwall, Buckley, and Paltoglou (2011) “downloaded [data] from data company Spinn3r as part
of their (then) free access program for researchers” (p. 410). Similarly, Xenos (2008) captured blog discussion on topics of interest using “an
early incarnation of the Blogrunner (also known as the Annotated New York Times)” (p. 492). Alternatively, Subrahmanyam et al. (2009)
used a simple Google search to locate blog-hosting websites that were then examined for adolescent blogs; the Google search results
provided their sampling frame. Finally, Huffaker and Calvert (2005) retrieved their sampling frame “using two weblog search engines, as
well as from Blogspot and LiveJournal” (p. 4);
• Rely on features built into online technologies, such as Twitter’s keyword search feature (e.g., Cheng, Sun, Hu, & Zeng, 2011). Websites
typically contain search features that allow both random sampling and purposeful sampling by keyword. For example, Webb, Wilson,
Hodges, Smith, and Zakeri (2012) reported using a Facebook feature that provides on request a random page from the user’s network.
Similarly, Ji and Lieber (2008) reported sampling profiles on a dating website by using its search feature. Some blog host sites require
researchers to join the blogging community and establish an empty blog to access the search feature, but the service is typically free with
open access after joining;
• Use commercial software packages with sampling functions. For example, Greer and Ferguson (2011) report using Webpage Thumbnailer
“to capture digital pictures of Web pages from a batch list” (p. 204) and then sampling the captured pages. See Boulos, Sanflippo, Corley, and
Wheeler (2010) for detailed descriptions and reviews of multiple social web mining applications;
• Define the population quite narrowly such that a census or near census is possible and/or a sampling frame can be readily ascertained (e.g.,
Webb, Hayes, et al., 2012);
• Define the population as the sampling frame. For example, Thoring (2011) defined her population as the Twitter feeds of “all UK trade
publishers that were members of the Publishers Association (PA) and/or Independent Publishers Guild (IPC) at the time of surveying” (p.
144-145);
• Employ multiple, overlapping sampling frames that are likely to capture the vast majority of the population. For example, Hale (2012)
reported using three overlapping search engines to locate blogs for sampling and analysis;
• Define the population and sampling frame, in large part, by a given time frame that might or might not be tied to the phenomenon
under study. For example, Ifukor (2010) sampled blog posts during three time periods: pre-, during, and post-election. Alternatively,
Thelwall, Buckley, and Paltoglou examined Twitter posts between February 9, 2010 and March 9, 2010; they identified “the top 30 events
from the 29 selected days using the time series scanning method” (2011, p. 410);
• Sampling across time (see Intille, 2012, for three options). For example, to correlate Twitter mood to stock prices, Bollen, Mao, and Zeng
(2011) collected all tweets posted across a 9.5-month period and then used a software text analytic tool “to generate a six-dimensional daily
time series of public mood” (p. 2);
• Systematic sampling at fixed intervals (e.g., every day at noon for 2 weeks). For example, McNeil et al. (2012) collected tweets via keyword
searches each day at one specific time across a seven-day period.
Selecting a Probability Sampling Technique
After a sampling frame is identified, the researcher can engage in “pure random sampling,” meaning that numbers can be drawn from a random
number table and used as the basis for selecting incidences into the sample. For example, as noted above, Waters and Williams (2011) randomly
selected 60 government agencies from their sampling frame of all U.S. government agencies to examine their recent tweets. Random number
tables appear in the back of most statistics books; they are available free of charge on the internet
and can be discovered via a Google search. Using a random number table is equivalent to assigning numbers to all incidences in the sampling
frame, printing the numbers on individual pieces of paper, tossing the pieces of papers into a bowl, thoroughly mixing the pieces of paper, and
then drawing the numbers from the bowl. “To reduce the labor of the selection process and to avoid such problems as pieces of paper sticking
together, the selection is more commonly made using a random number table” (Thompson, 2012, p. 11). In random sampling, each incidence has
an equal probability of inclusion in the sample (Scheaffer et al., 2012).
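In software, a pseudo-random number generator plays the role of the printed random number table. The following minimal Python sketch, built on a hypothetical frame of tweet identifiers, illustrates drawing incidences into a sample so that each has an equal, known probability of selection:

```python
import random

# Hypothetical sampling frame: every incidence available for sampling
# (e.g., the 963 tweets containing a keyword sent on a given day).
sampling_frame = [f"tweet_{i}" for i in range(1, 964)]

random.seed(42)  # fixing the seed makes the draw reproducible and auditable
sample = random.sample(sampling_frame, k=60)  # each tweet has an equal chance of selection

print(len(sample), sample[:5])
```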
Simple Random Sampling
Researchers employ simple random sampling when the sampling frame is stable and can be examined at a fixed point in time, such as customers’
incoming email messages during the first week following the launch of a new product. In simple random sampling, each incident has an equal
probability of inclusion in the sample and that probability is based on population size. For example, if the researcher identified 963 tweets
containing a key word sent on a given day, simple random sampling of those tweets would allow each tweet a 1 in 963 chance of selection into the
sample. Fullwood, Sheehan, and Nicholls (2009) reported randomly selecting open MySpace pages for analysis. Similarly, Williams, Trammell,
Postelnicu, Landreville, and Martin (2005) downloaded the front page of the Bush and Kerry campaign websites every day after the conventions,
and then randomly sampled blog posts from the downloaded pages for analysis.
Simple random sampling can be done with or without replacement—meaning a researcher can either allow or fail to allow an incident to be
selected more than once. With larger samples, replacement seems unnecessary, as many incidences are available for selection. Also, the more
diverse the population, the more each incident potentially represents a unique set of characteristics; the researcher would then be less inclined to
include multiple copies of that unique set of characteristics in the sample and thus less likely to employ replacement sampling. For example,
McNeil, Brna, and Gordon (2012) collected data via keyword searches of Twitter posts, but excluded re-tweets
and duplications to avoid skewing the data. Random sampling with and without replacement can be conducted within a large data set via
standard statistical analysis software packages such as SPSS and SAS.
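As a rough illustration of how such a draw might look outside SPSS or SAS, the pandas sketch below (with hypothetical column names and toy data) removes duplicate texts in the spirit of McNeil et al. (2012) and then contrasts sampling without and with replacement:

```python
import pandas as pd

# Hypothetical data set of tweets harvested via keyword search
tweets = pd.DataFrame({
    "tweet_id": range(1, 964),
    "text": [f"tweet text {i}" for i in range(1, 964)],
})

# Drop rows with duplicated text (e.g., re-tweets) before sampling, to avoid skewing the data
tweets = tweets.drop_duplicates(subset="text")

# Without replacement: no tweet can enter the sample twice (the usual choice)
sample_no_repl = tweets.sample(n=100, replace=False, random_state=1)

# With replacement: the same tweet may be drawn more than once
sample_repl = tweets.sample(n=100, replace=True, random_state=1)

print(sample_no_repl["tweet_id"].is_unique, sample_repl["tweet_id"].is_unique)
```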
Systematic Sampling
When a population is quite fluid, such as tweets posted about a product recall, the researcher might employ a systematic sample, also called a “1
in K.” “In general, systematic sampling involves random selection of one element from the first k elements and then selection of every kth
element thereafter” (Scheaffer et al., 2012, p. 219). For example, a researcher could elect to analyze all email messages generated in a given
organization during a given week by sampling every 12th email message generated from 12:00 AM Sunday to 11:59 PM the following Saturday.
Usually the researcher decides what portion of the population to sample, typically between one and ten percent, sets k accordingly, and then uses
a random number table to select a starting point before sampling every kth incident. For example, if a collection of comments on a gaming website
contains a total of 997 posts, a researcher seeking a 10% sample could set k = 10 and then select a random starting number between 0 and 9 from a
random number table. If we assume that the selected number was 6, then the researcher samples posts number 6, 16, 26, 36, etc., until post number
996. In this way the researcher will randomly sample roughly 100 posts, or about 10% of the population. van Doorn, van Zoonen, and Wyatt (2007) reported employing a systematic sampling
technique in their examination of online gender identifiers displayed in Dutch and Flemish weblogs: From their sample frame, “every 11th, 12th,
and 13th weblog was searched. On this basis, 97 weblogs were selected” (p. 148).
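A 1-in-k selection is straightforward to express in code. The sketch below, using a hypothetical list of 997 gaming-site posts, mirrors the example above: a random start between 0 and k-1, then every kth post thereafter.

```python
import random

def systematic_sample(frame, k):
    """Select every kth element after a random start drawn from 0 to k-1."""
    start = random.randrange(k)  # software stand-in for the random number table
    return frame[start::k]

posts = [f"post_{i}" for i in range(997)]   # hypothetical 997 comments on a gaming website
sampled = systematic_sample(posts, k=10)    # roughly a 10% sample
print(len(sampled))                         # 99 or 100 posts, depending on the random start
```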
Systematic sampling has multiple advantages that simple random sampling does not share (Kalton, 1983)—advantages that can make it more
useful for big data research:
• The population can be fluid and contain fluctuations in time, space, and frequency that would make random sampling difficult. For
example, in their research on adolescent blogging, Subrahmanyam et al. (2009) examined “the last three entries posted between April 15 and
May 15” in each identified blog (p. 226) regardless of how many posts each blog contained. Similarly, Greer and Ferguson (2011) analyzed
the first page of Twitter accounts, thus reading the latest posts and reports by account holders;
• Compared to random sampling from a random number table, a sample can be selected with little effort or complexity;
• Systematic sampling is easy to apply and thus involves no complex mathematics or training for the sample selector;
• Systematic sampling can “yield more precise estimates than a simple random sample” (Scheaffer et al., 2012, p. 11). For example, in attempting
to identify a network of bloggers who regularly post about the Iraq War, Tremayne et al. (2006) “used a systematic sample to isolate posts
from each political blog” discussing the War. Specifically, they selected posts from 16 days for coding, because “a Lexis-Nexis search revealed
these months to be higher than average for war in Iraq news, which served as assurance that a considerable number of [relevant] post[s] would
be found” (p. 296).
Cluster Sampling
“A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements” (Scheaffer et al., 2012, p. 252). For
example, a researcher might desire to sample the longest strings of conversation on fan blogs—a string composed of an original post and all
subsequent commentary (e.g., Webb, Chang et al., 2012).
How does the researcher draw a cluster sample? “The population is partitioned into primary units, each primary unit being composed of
secondary units. Whenever a primary unit is included in the sample, every secondary unit within it is observed.… Even though the actual
measurement may be made on secondary units, it is the primary units that are selected” (Thompson, 2012, p. 157). For example, Webb, Chang et
al. (2012) observed 11 fansites, discovered all strings on each fansite, and then downloaded all elements within the five longest strings on each
fansite. As with any form of sampling, the first step is to specify the population and sampling frame—in this case, appropriate clusters. Typically,
incidences within a cluster are “close” in some way (e.g., geographically, chronologically, emotively) and “hence tend to have similar
characteristics” (Scheaffer et al., 2012, p. 253). Indeed, on fansites, all elements in a string of conversation appear next to one another in reverse
chronological order and tend to discuss the same topic. Cluster sampling is preferable under two conditions: (1) when the phenomenon under
study cannot be easily sampled any other way, such as email messages in a trail, given that the messages appear attached to one another; and
(2) when the phenomenon under study can only be observed in clusters, such as blog interaction, which by definition is composed of clusters of posts.
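A minimal sketch of cluster sampling follows, assuming a hypothetical mapping of conversation strings (primary units) to their posts (secondary units): a few strings are selected at random, and every post within a selected string is retained.

```python
import random

# Hypothetical clusters: each conversation string (primary unit) maps to its posts (secondary units)
strings = {
    "string_01": ["original post", "comment 1", "comment 2"],
    "string_02": ["original post", "comment 1"],
    "string_03": ["original post", "comment 1", "comment 2", "comment 3"],
    "string_04": ["original post", "comment 1", "comment 2"],
}

random.seed(7)
chosen = random.sample(list(strings), k=2)  # randomly select primary units (clusters)

# Whenever a primary unit is selected, every secondary unit within it is observed
cluster_sample = [post for s in chosen for post in strings[s]]
print(chosen, len(cluster_sample))
```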
Multi-Staged Designs
Big data sets easily accommodate multi-staged designs, in which sampling occurs in a repeated fashion across time. “If, after selecting a sample of
primary units, a sample of secondary units is selected from each of the primary units selected, the design is referred to as two-stage sampling. If
in turn a sample of tertiary units is selected from each selected secondary unit, the design is three-stage sampling. Higher-order multistage
designs are also possible” (Thompson, 2012, p. 171). For example, Subrahmanyam et al. (2009) used a simple Google search to locate blog-
hosting websites (N=9) that were then examined for adolescent blogs (N=201). The Google search results provided the sampling frame. After
adolescent blogs were located within the blog-hosting websites, the team harvested the three most recent entries within a 3-month time
frame from each adolescent blog as their data for analysis (N=603 entries). Multi-staged designs are a useful alternative for digging into a
complex data set to find exact incidences of the phenomenon under study.
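The two-stage logic can be sketched in a few lines, here with hypothetical blog-hosting sites as primary units and the blogs they host as secondary units; a three-stage design would simply repeat the second step within each selected blog (e.g., sampling entries).

```python
import random

# Hypothetical frame: blog-hosting sites (primary units) and the blogs they host (secondary units)
hosts = {
    "host_a": [f"blog_a{i}" for i in range(40)],
    "host_b": [f"blog_b{i}" for i in range(25)],
    "host_c": [f"blog_c{i}" for i in range(60)],
}

random.seed(11)
stage_one = random.sample(list(hosts), k=2)  # stage 1: sample primary units (hosting sites)

# Stage 2: sample secondary units (blogs) within each selected primary unit
stage_two = {host: random.sample(hosts[host], k=5) for host in stage_one}
print(stage_two)
```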
Critical Decision Point 2: Balancing the Goals of Representativeness vs. Incident-Rich Sampling
The second choice before the researcher is how to balance the competing goals of drawing a representative sample but also a purposeful (incident-rich)
sample. The researcher’s ultimate desire is a representative sample, but the object of study can be relatively rare (e.g., how does a researcher
locate the Facebook pages of pre-adolescents among the sea of open Facebook pages available for examination?). Within a set of big data, access
can be difficult to obtain (e.g., Pre-adolescents might be quite secretive about their Facebook posts and thus send updates to friends only, thus
effectively blocking parental viewing as well as researcher viewing of their posts). In such cases, the researcher might opt for an in-depth analysis
of an incident rich sample without a known probability. To ameliorate the concerns of representativeness raised with a nonprobability sample,
researchers often select incidences in a way that infuses the sample with diversity that they believe exists in the population. For example,
Subrahmanyam, Garcia, Harsono, Li, and Lipana (2009) gathered adolescent blogs from nine hosting sites in the belief that different hosting
sites attract different kinds/types of adolescent bloggers. When researchers employ non-probability samples, they often acknowledge in the
limitations section of their discussion that the generalizability of the study’s conclusions remains unknown.
Purposeful Sampling
How does a researcher go about carefully selecting incidences that illustrate a rare phenomenon under study and thus create an incident-rich
study? Multiple sampling techniques exist for such purposes:
• Samples can account for the rate of message production and/or number of views. For example, Ferguson and Greer (2011) desired to
sample “normally active” Twitter feeds and therefore dropped from their sample frame Twitter feeds from accounts they considered inactive
and overactive, as defined by the number of tweets per day;
• Samples can be defined by time frames. For example, Sweetser Trammell (2007) analyzed blog posts aimed at young voters on campaign
websites during the intense period of campaigning (Labor Day through Election Day 2004). Similarly, Gilpin (2010) studied blog posts and
Twitter messages within “two financial quarters, an important unit of measure for publically traded companies” (p. 271);
• The object of study can be defined so narrowly that all incidences of the phenomenon can be included in the sample. For example, Gilbert,
Bergstrom, and Karahalios (2009) sampled authoritative blogs, defined by “the number of distinct blogs linking to each indexed blog over
the last six months” (N=100; p. 7). Similarly, Subrahmanyam et al. (2009) used a simple Google search to locate blog-hosting websites (N=9)
that were then examined for adolescent blogs (N=201);
• Samples can be defined in response to user initiatives. For example, Trammell (2006) examined only those campaign blog posts that
mentioned the opponent, issues, or political matters directly. Similarly, Hardey (2008) examined conversational threads relating to eDating
on “23 public English language newsgroups that were hosted by UK websites from April to June 2005” (p. 1114) and ignored the vast
majority of conversational threads that failed to discuss the online dating website eDating;
• Samples can be defined by the web location itself. Because Schiffer (2006) examined news coverage of one political incident and desired to
learn which journalists broke which parts of the story when, he purposefully sampled text from blogs known for cutting edge reporting—blog
posts on “ten of the leading Weblogs and the Daily Kos diaries” (p. 498);
• Exemplar sampling allows the researcher to select information-rich internet text to serve as case studies. For example, Hayes (2011)
carefully selected one mommy blog to examine in depth to explain how the blogger and her readers created a sense of community through
the expression of identity and discussion on the blog. Hayes chose one specific blog for analysis exactly because its posts and comments
provided clear examples of the notions she discussed. Similarly, Keating and Sunakawa (2010) observed online interactions between two
carefully selected gaming teams as a case study of coordinated online activities;
• Context-sensitive sampling is “purposive sampling in response to automatically detected user behavior, context, or physiological response,
as measured passively or semi-automatically using sensors” (Intille, 2012, p. 268). Examples include keywords relative to weather, place,
proximity to others, or other relevant phenomena (e.g., blogroll). The sampling can be two-tiered in that detection of an initial keyword can
trigger searches for additional or secondary keywords, such as rain-lightning, sale-discount, or Republican-Romney (a minimal sketch of this two-tiered approach appears below).
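The following is a minimal, hypothetical sketch of that two-tiered idea: a post enters the sample only if an initial keyword and at least one of its associated secondary keywords both appear. The keyword pairs and posts are illustrative, not drawn from any published study.

```python
# Hypothetical two-tier trigger lists: an initial keyword and its secondary keywords
TRIGGERS = {
    "rain": ["lightning", "flood"],
    "sale": ["discount", "coupon"],
}

def two_tier_match(post, triggers=TRIGGERS):
    """Return True when an initial keyword and one of its secondary keywords co-occur."""
    text = post.lower()
    return any(primary in text and any(s in text for s in secondaries)
               for primary, secondaries in triggers.items())

posts = [
    "Heavy rain and lightning over the stadium tonight",
    "Big sale this weekend",
    "Flash sale: 40% discount on all items",
]
sample = [p for p in posts if two_tier_match(p)]
print(sample)  # only the posts matching both tiers are retained
```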
Adaptive Sampling: Using Probability Techniques to Obtain Purposeful Samples
Researchers often use probability sampling techniques within studies of narrowly defined or rare phenomena to obtain incident-rich samples
that are generalizable because they have known probabilities. Such techniques are called adaptive sampling because they employ either
systematic or random sampling techniques that “depend on observed values of the variable of interest” (Thompson, 2012, p. 6). We consider
adaptive sampling an ideal sampling choice for big data sets when a researcher desires to study a specific aspect of a data set but also to
generalize the findings to a population. Adaptive sampling offers the advantages of both probability sampling and purposive sampling with the
disadvantages of neither. Below we discuss multiple techniques for adaptive sampling:
• Creating comparison groups based on characteristics of interest. Creating comparison groups is essential when the researcher desires to
examine contrasts such as the differences between popular versus unpopular websites or male- versus female-authored texts. For example,
Reese, Rutigliano, Hyun, and Jeong (2007) used Technorati ratings to locate the most popular conservative political blogs and the most
popular liberal political blogs to analyze text from blogs with opposing viewpoints and thus likely capture national discussion on political
issues. Alternatively, Burton and Soboleva (2011) examined the Twitter feeds of multi-national businesses in two English-speaking countries
to discover differing uses of the interactive features of Twitter to accommodate the culturally-different expectations of consumers. Finally,
Jansen, Zhang, Sobel, and Chowdury (2009) examined tweets about representative companies from segments of major industries; their
created groups were the segments of industry (a minimal sketch of sampling within such comparison groups follows below).
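A minimal sketch of this group-wise logic, with a hypothetical frame of blogs tagged by political leaning: incidences are first partitioned on the characteristic of interest, and an equal-sized random sample is then drawn from each comparison group.

```python
import random

# Hypothetical frame: blogs tagged with the characteristic of interest (political leaning)
blogs = ([{"name": f"conservative_blog_{i}", "leaning": "conservative"} for i in range(80)] +
         [{"name": f"liberal_blog_{i}", "leaning": "liberal"} for i in range(80)])

# Partition the frame into comparison groups
groups = {}
for blog in blogs:
    groups.setdefault(blog["leaning"], []).append(blog)

# Draw an equal-sized random sample from each comparison group
random.seed(5)
comparison_sample = {g: random.sample(members, k=25) for g, members in groups.items()}
print({g: len(s) for g, s in comparison_sample.items()})
```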
Stratified Random Sampling
“In stratified sampling, the population is partitioned into regions or strata, and a sample is selected by some design within each stratum”
(Thompson, 2012, p. 141). Strata are defined in such a way that they function as nonoverlapping groups; the researcher selects specific incidences
from each stratum into the sample, typically via random sampling (Scheaffer et al., 2012). For example, Boupha, Grisso, Morris, Webb, and Zakeri
(2013) reported randomly selecting open Facebook pages from each U. S. state; their 50 strata were the 50 United States. Alternatively,
Trammell, Williams, Postelnicu, and Landreville (2006) analyzed blog posts from the 10 Democratic primary presidential candidates in the 2004
election; they stratified their sampling across time by gathering posts from 14 target days spanning the beginning of the primary season (Labor
Day 2003) through the Iowa caucus (January 2004). Stratified random sampling can be conducted within a large data set using standard
statistical analysis software packages such as SPSS and SAS, if each sample entry is coded for the strata of interest. This sampling technique is
appropriate when researchers desire to compare multiple groups or strata across variables of interest.
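If each entry in the frame is coded for its stratum, a stratified random sample reduces to a random draw within every stratum. The pandas sketch below (with hypothetical page identifiers and state codes) draws an equal number of open pages per state:

```python
import pandas as pd

# Hypothetical frame of open Facebook pages, each coded for its stratum (U.S. state)
pages = pd.DataFrame({
    "page_id": range(5000),
    "state": [f"state_{i % 50}" for i in range(5000)],  # 50 strata
})

# Stratified random sampling: an independent random draw of equal size from every stratum
stratified = pages.groupby("state", group_keys=False).sample(n=10, random_state=2)

print(len(stratified), stratified["state"].value_counts().head())
```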
Stratified Sampling at Random Times Across a Fixed Period (e.g., 12 Randomly Selected Times Across a 48 Hour Period)
This sampling technique is appropriate when researchers desire to compare multiple groups or strata across variables of interest—and across
time; researchers use this technique to observe how the comparisons between the strata change across time. For example, Williams, Trammell,
Postelnicu, and Martin examined hyperlinks posted in a stratified random sample of “10% of the days in the hot phase of the general election
period, from Labor Day through Election day 2004” (2005, p. 181). In another study of candidates’ 2004 websites, Trammell, Williams,
Postelnicu, and Landreville (2006) examined blog posts from ten Democratic presidential candidates’ websites during the 2004 primary
campaigns. The researchers reported using a stratified sampling method, specifically sampling “10% of the days spanning the beginning of the primary
season (Labor Day 2003) through the Iowa caucuses (January 2004)” (p. 29); thus, they identified a total of 14 target days (or 14 strata) for
analysis.
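Selecting the strata themselves (here, target days) at random across a fixed period can be sketched as follows; the date range mirrors the 2004 “hot phase” example above, and the 10% figure is illustrative:

```python
import random
from datetime import date, timedelta

# Fixed period: Labor Day 2004 (Sept. 6) through Election Day 2004 (Nov. 2)
start, end = date(2004, 9, 6), date(2004, 11, 2)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Randomly select roughly 10% of the days as target days (temporal strata)
random.seed(9)
target_days = sorted(random.sample(days, k=max(1, len(days) // 10)))

print(len(days), len(target_days), target_days[:3])
```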
Sampling Nodes Based on Attributes of Interest (e.g., Location, Activity, Longevity)
For example, Thelwall, Buckley, and Paltoglou examined Twitter posts between February 9, 2010 and March 9, 2010. Then “the top 30 events
from the 29 selected days were identified using the time series scanning method” (2011, p. 410). The researchers then employed a “3-hour burst
method” (p. 410) to identify topics that sustained increases in commentary for three consecutive hours. Tweets about such topics were included
in the sample. This technique is especially appropriate for complex phenomena that occur in nodes or clusters based on location, activity, or
longevity.
Critical Decision Point 3: Balancing the Goals of Representing Typical, Popular, and Rare Phenomena
A third choice before the researcher is whether to examine (a) typical phenomena, view-points, media, and texts or (b) popular occurrences, or
(c) rare and unusual occurrences. All three objects of study are worthy of examination and represent important segments of the human
experience. Big data sets contain sufficient numbers of incidences to allow for random sampling to reveal typical phenomena as well as careful
sampling of the unusual. More challenging is operationally defining popularity. Researchers have assessed website popularity via the number of
links to the page, Google PageRank, number of hits, and number of unique page views (Webb, Fields, Boupha, & Stell, 2012). Researchers can
select and defend choices based on conceptual thinking or simply employ multiple assessment techniques in the given study. For example, Webb,
Fields et al. (2012) measured blog popularity in two ways: number of comments and number of hits. Their analyses yielded two paths to
popularity; one set of variables was associated with popularity as assessed by number of comments (length of homepage and number of
comment opportunities), whereas a different set of characteristics was associated with popularity as assessed by number of hits (number of
tabs, links, and graphics as well as the website’s internal accessibility). Thus, the operationalization of popularity is no small matter, as it can
influence sampling decisions and ultimately the results of the study.
Managing and Analyzing Samples of Big Data
As big data gradually became available to web researchers, major methodological concerns emerged, such as how to store and analyze such large
quantities of data simultaneously—even samples carefully drawn from such large data pools. Specific technologies and software emerged to
address these concerns. Big data technology includes big data files, database management, and big data analytics (Hopkins & Evelson, 2011).
One of the most popular and widely available data management systems for dealing with hundreds of gigabytes to petabytes of data is Hadoop,
an open-source framework built around the MapReduce programming model popularized by Google. Its strengths include providing reliable shared
storage for large amounts of multiple-sourced data through the Hadoop Distributed Filesystem (HDFS), analysis through MapReduce (a batch query
processor that abstracts the problem away from disk reads and writes), and transforming data into a map-and-reduce computation over sets of keys and values (White, 2012).
MapReduce works well with unstructured or semi-structured data because it is designed to interpret the data at processing time (Verma,
Cherkasova, & Campbell, 2012). While the MapReduce system is able to analyze a whole “big data” set and large samples in batch fashion, the
Relational Database Management System (RDBMS) shows more strength in processing point queries where the data is structured into entities
with a defined format (i.e., structured data), as may occur in keyword or key-characteristic sampling (White, 2012). Unlike MapReduce’s linearly
scalable programming model, which scales gracefully with the size of the data and the cluster, an RDBMS does not scale linearly, although it
supports complex functions such as quadratic or cubic terms in a model (Sumathi & Esakkirajan, 2007). One widely used open-source RDBMS,
MySQL, can be retrieved from http://mysql-com.en.softonic.com/. Google’s success in text processing and its embrace of statistical machine learning
was interpreted as an endorsement that facilitated Hadoop’s widespread adoption (Cohen, Dolan, Dunlap, Hellerstein, & Welton, 2009). Hadoop,
the open-source software, can be downloaded from http://hadoop.apache.org/releases.html.
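The map-and-reduce idea itself is independent of the Hadoop machinery. The toy Python sketch below is not the Hadoop API; it only illustrates the shape of the computation: a map step that emits key-value pairs from raw text, and a reduce step that aggregates the values per key.

```python
from collections import defaultdict

documents = ["big data sampling", "sampling big data sets", "big samples"]

def map_step(doc):
    # Emit a (key, value) pair for every word occurrence
    for word in doc.split():
        yield (word, 1)

def reduce_step(pairs):
    # Group the emitted pairs by key and sum the values
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

mapped = [pair for doc in documents for pair in map_step(doc)]
print(reduce_step(mapped))  # e.g., {'big': 3, 'data': 2, 'sampling': 2, 'sets': 1, 'samples': 1}
```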
On the other hand, additional technologies and software are available for use with big data sets and samples. They represent reasonable
alternatives to Hadoop, especially when data sets display unique characteristics that can be best addressed with specialized software. Such
alternatives include the following:
• High Performance Computing (HPC) and Grid Computing also are designed to process large-scale data. This system works well for
compute-intensive jobs, but becomes problematic when the compute nodes need to access larger data volumes (White, 2012), making it a
good candidate for complex analyses of smaller samples of big data;
• A competitive pool of analytical tools for big data is provided by a diversity of software developers. For example, Ayasdi launched its
Topological Data Analysis to seek the fundamental structure of massive data sets in 2008. This software could be used in the areas of life
sciences, oil and gas, public sector, financial services, retail, and telecom (Empson, 2013) and works well with large and small samples;
• SAS also provides a Web-based solution, SAS Visual Analytics, that leverages SAS high-performance analytics technologies to explore huge
volumes of data by showing the correlations and patterns within big data sets and samples, thus identifying opportunities for further analysis
(Troester, 2012). Such software is quite useful in exploratory analyses;
• More commonly used in social science research are software programs that specialize in analyzing the statistical patterns of big data, after
the sampling process. For example, the U. S. computer technology company Oracle released a version of R integrated into its database and
big data appliance (Harrison, 2012);
• For those who are not familiar with syntax-based statistical tools such as STATA, SAS, or R and are handling a relatively small dataset,
Minitab and IBM SPSS could be a more practical choice. However, it is important to note that Minitab allows a maximum of 150 million
cells, or a maximum of 4,000 columns and 10 million rows, whereas SPSS holds up to 2 billion cases in a dataset. Thus, SPSS
can process larger samples than Minitab.
Solutions and Recommendations
This chapter reviewed a wide variety of sampling options and techniques applicable to big data sets of text harvested from the internet. Based on
studies of email messages, Facebook, blogs, gaming websites, and Twitter, the chapter described sampling techniques for selecting online data for
specific research projects. We considered the essential characteristics of big data generated from diverse online resources—that the data might be
unstructured, transformational, complicated, and growing at a fast pace—as we discussed sampling options. A rationale for the use of each
technique was offered as the options were presented through an examination of important sampling principles, including representativeness,
purposefulness, and adaptiveness. Three critical decision points were reviewed and multiple examples of successful sampling were presented. This
information can assist the researcher in making an informed and appropriate choice of sampling techniques, given the goals of the research
project and its object of study. However, because of the ever-changing nature of the online data sets, it is necessary to update our understanding
and analytical techniques for big data in an on-going way. Additionally, issues regarding privacy, credibility, and security must be carefully
monitored during the online data collection process.
FUTURE RESEARCH DIRECTIONS
As researchers continue to publish reports of analyses of big data sets, many positive developments are likely to occur. Researchers
examining internet text are likely to:
• Develop new and innovative sampling techniques not yet invented and thus not discussed in this chapter. Such innovative techniques are
likely to be based on the time-honored set of sampling principles reviewed above;
• Develop a set of sampling conventions to effectively manage big data sets. As these conventions are repeatedly replicated, they will become
normative and sampling big data sets could become simplified;
• With improved sampling techniques, examine smaller and smaller data sets to gain a deeper understanding of the latent patterns in big
data sets.
CONCLUSION
The development of internet technology has made large comprehensive data sets readily available, including publicly available, textual, online
data. Such data sets offer richness and potential insights into human behavior, but can be costly to harvest, store, and analyze as huge data sets
can translate into big labor and computing costs for mining, screening, cleansing, and textual analysis. Incredibly large and complex data sets cry
out for effective sampling techniques to manage the sheer size of the data set, its complexity, and perhaps most importantly, its on-going growth.
In this essay, we review multiple sampling techniques that effectively address this exact set of issues.
No sampling method works best in all studies. This chapter assists researchers in making critical decisions that result in appropriate sample
selections. Identifying feasible and appropriate sampling methods can assist the researcher in uncovering otherwise invisible patterns of human
behavior in big data sets.
Big data sets present sampling challenges that can be addressed with a working knowledge of sampling principles and a “tool box” of sampling
techniques reviewed in this chapter. The chapter reviewed traditional sampling techniques and suggested adaptations relevant to big data studies
of text downloaded from online media such as email messages, online gaming, blogs, micro-blogs (e.g., Twitter), and social networking websites
(e.g., Facebook). Specifically, we reviewed methods of probability, purposeful, and adaptive sampling of online data. We illustrated the use of
these sampling techniques via published studies that report analysis of online text. As more big data analyses are published, sampling
conventions are likely to emerge that will simplify the decision-making process; the emergent conventions are likely to follow the guiding
principles of sampling discussed in this chapter.
This work was previously published in Big Data Management, Technologies, and Applications edited by WenChen Hu and Naima Kaabouch,
pages 95-114, copyright 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Bollen, J., Mao, H., & Zeng, X. J. (2011). Twitter mood predicts the stock market. Journal of Computational Science , 2(1), 1–8.
doi:10.1016/j.jocs.2010.12.007
Boulos, M. N. K., Sanfilippo, A. P., Corley, C. D., & Wheeler, S. (2010). Social web mining and exploitation for serious applications: Technosocial
predictive analytics and related technologies for public health, environmental and national security surveillance. Computer Methods and
Programs in Biomedicine , 100, 16–23. doi:10.1016/j.cmpb.2010.02.007
Boupha, S., Grisso, A. D., Morris, J., Webb, L. M., & Zakeri, M. (2013). How college students display ethnic identity on Facebook. In R. A. Lind
(Ed.), Race/gender/media: Considering diversity across audiences, content, and producers (3rd ed., pp. 107-112). Boston, MA: Pearson.
boyd, d. (2010, April). Privacy and publicity in the context of big data. Retrieved from http://www.danah.org/papers/talks/2010/WWW2010.html
Burton, S., & Soboleva, A. (2011). Interactive or reactive? Marketing with Twitter. Journal of Consumer Marketing , 28(7), 491–499.
doi:10.1108/07363761111181473
Cheng, J., Sun, A., Hu, D., & Zeng, D. (2011). An information diffusion-based recommendation framework for micro-blogging.Journal of the
Association for Information Systems , 12, 463–486.
Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., & Welton, C. (2009). MAD skills: New analysis practices for big data. Paper presented at the
International Conference on Very Large Data Bases (VLDB), Lyon, France.
Daniel, J. (2012). Sampling essentials: Practical guidelines for making sampling choices . Thousand Oaks, CA: Sage.
Ferguson, D. A., & Greer, C. F. (2011). Local radio and microblogging: How radio stations in the U.S. are using Twitter.Journal of Radio & Audio
Media , 18, 33–46. doi:10.1080/19376529.2011.558867
Fullwood, C., Sheehan, N., & Nicholls, W. (2009). Blog function revisited: A content analysis of Myspace blogs. CyberPsychology & Behavior,
12(6), 685-689. doi:10.1089/cpb.2009.0138
Gilbert, E., Bergstrom, T., & Karahalios, K. (2009). Blogs are echo chambers: Blogs are echo chambers. In Proceedings of the 42nd Hawaii
International Conference on System Sciences. IEEE. Retrieved from http://comp.social.gatech.edu/papers/hicss09.echo.gilbert.pdf
Gilpin, D. (2010). Organizational image construction in a fragmented online media environment. Journal of Public Relations Research , 22, 265–
287. doi:10.1080/10627261003614393
Greer, C. F., & Ferguson, D. A. (2011). Using Twitter for promotion and branding: A content analysis of local television Twitter sites.Journal of
Broadcasting & Electronic Media , 55(2), 198–214. doi:10.1080/08838151.2011.570824
Hale, S. A. (2012). Net increase? Cross-lingual linking in the blogosphere. Journal of Computer-Mediated Communication , 17, 135–151.
doi:10.1111/j.1083-6101.2011.01568.x
Harrison, G. (2012). Statistical analysis and R in the world of big data. Data Trends & Applications , 26(3), 39.
Hassid, J. (2012). Safety valve or pressure cooker? Blogs in Chinese political life. The Journal of Communication , 62, 212–230.
doi:10.1111/j.1460-2466.2012.01634.x
Hayes, M. T. (2011). Parenting children with autism online: Creating community and support online . In Moravec, M. (Ed.),Motherhood online:
How online communities shape modern motherhood (pp. 258–265). Newcastle upon Tyne, UK: Cambridge Scholars Publishing.
Hookway, N. (2008). Entering the blogosphere: Some strategies for using blogs in social research. Qualitative Research , 8, 91–113.
doi:10.1177/1468794107085298
Hopkins, B., & Evelson, B. (2011). Expand your digital horizon with big data . Washington, DC: Forrester Research, Inc.
Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., & Hide, W. (2008). Big data: The future of biocuration. Nature, 455(4), 47–50.
doi:10.1038/455047a
Huffaker, D. A., & Calvert, S. L. (2005). Gender, identity, and language use in teenage blogs. Journal of Computer-Mediated Communication, 10.
Retrieved September 10, 2008, from http://www3.interscience.wiley.com/cgi-bin/fulltext/120837938/HTMLSTART
Ifukor, P. (2010). Elections or selections? Blogging and twittering the Nigerian 2007 general elections. Bulletin of Science, Technology &
Society , 30(6), 398–414. doi:10.1177/0270467610380008
Intille, S. S. (2012). Emerging technology for studying daily life . In Mehl, M. R., & Conner, T. S. (Eds.), Handbook of research methods for
studying daily life (pp. 267–283). New York, NY: Guilford Press.
Jansen, B. J., Zhang, M., Sobel, K., & Chowdury, A. (2009). Twitter power: Tweets as electronic word of mouth. Journal of the American Society
for Information Science and Technology , 60(11), 2169–2188. doi:10.1002/asi.21149
Ji, P., & Lieber, P. S. (2008). Emotional disclosure and construction of the poetic other in a Chinese online dating site.China Media
Research , 4(2), 32–42.
Keating, E., & Sunakawa, C. (2010). Participation cues: Coordinating activity and collaboration in complex online gaming worlds. Language in
Society , 39, 331–356. doi:10.1017/S0047404510000217
McNeil, K., Brna, P. M., & Gordon, K. E. (2012). Epilepsy in the Twitter era: A need to re-tweet the way we think about seizures.Epilepsy &
Behavior , 23(2), 127–130. doi:10.1016/j.yebeh.2011.10.020
Reese, S. D., Rutigliano, L., Hyun, K., & Jeong, J. (2007). Mapping the blogosphere: Professional and citizen-based media in the global news
arena. Journalism , 8, 235–261. doi:10.1177/1464884907076459
Scheaffer, R. L., Mendenhall, W. III, Ott, R. L., & Gerow, K. G. (2012). Elementary survey sampling (7th ed.). Boston, MA: Brooks/Cole.
Schiffer, A. J. (2006). Blogsworms and press norms: News coverage of the Downing Street memo controversy. Journalism & Mass
Communication Quarterly , 83, 494–510. doi:10.1177/107769900608300302
Subrahmanyam, K., Garcia, E. C., Harsono, L. S., Li, J. S., & Lipana, L. (2009). In their words: Connecting on-line weblogs to developmental
processes. The British Journal of Developmental Psychology , 27, 219–245. doi:10.1348/026151008X345979
Sumathi, S., & Esakkirajan, S. (2007). Fundamentals of relational database management systems . New York, NY: Springer. doi:10.1007/978-3-
540-48399-1
Sweetser Trammell, K. D. (2007). Candidate campaign blogs: Directly reaching out to the youth vote. The American Behavioral Scientist , 50,
1255–1263. doi:10.1177/0002764207300052
Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and
Technology , 62(2), 406–418. doi:10.1002/asi.21462
Thelwall, M., & Stuart, D. (2007). RUOK? Blogging communication technologies during crises. Journal of Computer-Mediated
Communication , 12, 523–548. doi:10.1111/j.1083-6101.2007.00336.x
Thoring, A. (2011). Corporate tweeting: Analysing the use of Twitter as a marketing tool by UK trade publishers. Publishing Research Quarterly , 27,
141–158. doi:10.1007/s12109-011-9214-7
Trammell, K. D. (2006). Blog offensive: An exploratory analysis of attacks published on campaign blog posts from a political public relations
perspective. Public Relations Review , 32, 402–406. doi:10.1016/j.pubrev.2006.09.008
Trammell, K. D., Williams, A. P., Postelnicu, M., & Landreville, K. D. (2006). Evolution of online campaigning: Increasing interactivity in
candidate web sites and blogs through text and technical features. Mass Communication & Society , 9, 21–44. doi:10.1207/s15327825mcs0901_2
Tremayne, M., Zheng, N., Lee, J. K., & Jeong, J. (2006). Issue publics on the web: Applying network theory to the war blogosphere. Journal of
Computer-Mediated Communication , 12, 290–310. doi:10.1111/j.1083-6101.2006.00326.x
Troester, M. (2012). Big data meets big data analytics: Three key technologies for extracting realtime business value from the big data that
threatens to overwhelm traditional computing architectures. SAS Institute. Retrieved from www.sas.com/resources/whitepaper/wp_46345.pdf
van Doorn, N., van Zoonen, L., & Wyatt, S. (2007). Writing from experience: Presentations of gender identity on weblogs. European Journal of
Women's Studies , 14, 143–159. doi:10.1177/1350506807075819
Webb, L. M., Chang, H. C., Hayes, M. T., Smith, M. M., & Gibson, D. M. (2012). Mad men dot com: An analysis of commentary from online fan
websites . In Dunn, J. C., Manning, J., & Stern, D. (Eds.), Lucky strikes and a three-martini lunch: Thinking about television's Mad Men (pp.
226–238). Newcastle upon Tyne, UK: Cambridge Scholars Publishing.
Webb, L. M., Fields, T. E., Boupha, S., & Stell, M. N. (2012). U. S. political blogs: What channel characteristics contribute to popularity? In
Dumova, T., & Fiordo, R. (Eds.), Blogging in the global society: Cultural, political, and geographic aspects (pp. 179–199). Hershey, PA: IGI
Global.
Webb, L. M., Thompson-Hayes, M., Chang, H. C., & Smith, M. M. (2012). Taking the audience perspective: Online fan commentary about the
brides of Mad Men and their weddings. In A. A. Ruggerio, (Ed.), Media depictions of brides, wives, and mothers(pp. 223-235). Lanham, MD:
Lexington.
Webb, L. M., & Wang, Y. (2013). Techniques for analyzing blogs and micro-blogs. In N. Sappleton (Ed.), Advancing research methods with new
technologies (pp. 183-204). Hershey, PA: IGI Global.
Webb, L. M., Wilson, M. L., Hodges, M., Smith, P. A., & Zakeri, M. (2012). Facebook: How college students work it. In H. S. Noor Al-Deen & J. A.
Hendricks (Eds.), Social media: Usage and impact(pp. 3-22). Lanham, MD: Lexington.
White, T. (2012). Hadoop: The definitive guide (3rd ed.). Sebastopol, CA: O’Reilly Media.
Wiles, R., Crow, G., & Pain, H. (2011). Innovation in qualitative research methods: A narrative review. Qualitative Research , 11, 587–604.
doi:10.1177/1468794111413227
Williams, A. P., Trammell, K. P., Postelnicu, M., Landreville, K. D., & Martin, J. D. (2005). Blogging and hyperlinking: Use of the web to enhance
viability during the 2004 US campaign. Journalism Studies , 6, 177–186. doi:10.1080/14616700500057262
Xenos, M. (2008). New mediated deliberation: Blog and press coverage of the Alito nomination. Journal of Computer-Mediated
Communication , 13, 485–503. doi:10.1111/j.1083-6101.2008.00406.x
ADDITIONAL READING
Aggarwal, C. C. (Ed.). (2011). Social network data analytics . Hawthorne, NY: IBM Thomas J. Watson Research Center. doi:10.1007/978-1-4419-
8462-3
Daniel, J. (2012). Sampling essentials: Practical guidelines for making sampling choices . Thousand Oaks, CA: Sage.
Eastin, M. S., Daugherty, T., & Burns, N. M. (2011). Handbook of research on digital media and advertising: User generated content
consumption . Hershey, PA: IGI Global.
Franks, B. (2012). Taming the big data tidal wave: Finding opportunities in huge data streams with advanced analytics . Hoboken, NJ: John
Wiley & Sons.
Hine, C. (Ed.). (2005). Virtual methods: Issues in social research on the internet . Oxford, UK: Berg.
Kolb, J. (2012). Business intelligence in plain language: A practical guide to data mining and business analytics . Chicago, IL: Applied Data Labs.
Marin, N. (2011). Social media: Blogging, social networking services, microblogging, wikis, internet forums, podcasts, and more . New York, NY:
Webster’s Digital Services.
Prell, C. (2012). Social network analysis: History, theory & methodology . Thousand Oaks, CA: SAGE.
Scheaffer, R. L., Mendenhall, W. III, Ott, R. L., & Gerow, K. G. (2012). Elementary survey sampling (7th ed.). Boston, MA: Brooks/Cole.
Smolan, R., & Erwitt, J. (2012). The human face of big data. Sausalito, CA: Against all odds.
KEY TERMS AND DEFINITIONS
Articulated Networks: A social networking source where large data sets could be retrieved by recording interactions between users who are
connected by explicitly announced relationships, for example, a public friend list on Facebook.
Behavioral Networks: A social networking source for collecting large data sets by extracting reply relations between user comments.
Behavioral networks also are called behavior-driven networks. Examples of behavioral networks include instant messaging services and text
messaging through mobile devices.
Big Data: Diverse and complex data in rapidly expanding volume drawn from an unusually large population. Big data sets are usually produced
by and harvested from new media technology such as the Internet.
Big Search: The behavior of locating or generating a large amount of data to bring a wide scope of results for a single query.
Blog: A medium established through the Internet, which enables people to publish their personal stories, opinions, product reviews and many
other forms of texts in real time.
Blog Audience: boyd (2010) introduced four analytical categories of blog audiences. First, the intended audience comprises a blogger’s general
idea of the audience she or he wants to address. Second, the addressed audience comprises those people that are addressed in a specific blog
posting, which can be the same as the intended audience in general but can also be a specific subset. The third category contains the empirical
audience, who actually take notice of any given posting or tweet. The final category includes the potential audience, who are determined by the
technological reach of a blog within the wider context of networked communication.
Blogosphere: The totality of all blogs and their interconnections, which implies that blogs are connected to each other in a virtual community.
Convenience Sample: A sample containing easy-to-access incidences with no known probability of inclusion.
Front-End Approach to Big Search: A method used to carry out a big search by throwing a large amount of data into the search query and
allowing themes to emerge.
Micro-Blog: An online medium that inherits the features of traditional blogging but differs from it by imposing a limit on the number of characters
in a single posting (140 characters) and facilitating more instant updating through more flexible platforms (web, text messaging, instant messaging,
and other third-party applications).
Random Sampling: Every instance in the sampling frame has an equal probability of selection into the sample.
RSS: A standardized format, often dubbed Really Simple Syndication, that automatically syndicates blog content into summarized text sent to
readers who subscribe to the blogger.
Representative Sample: A sample in which the incidences selected accurately portray the population.
Semi-Structured Data: Raw data arranged with hierarchies or other signs of distinctness within the data, but that does not conform to the formal
structure widely accepted for other data. Examples of semi-structured data include email messages and other text-based online data.
Structured Data: Highly organized data retrieved from databases or other sources that process and manage large quantities of data. Data
listed in Google search results could be regarded as structured data.
Social Networking Service: An online platform that builds social structures among people sharing common interests, activities, and other
social connections. The term is often abbreviated as SNS.
Un-Structured Data: Raw data directly extracted from online applications without being organized into effective formats. Examples of un-
structured data include mobile text data and mp3 files.
Web 2.0: A combination of different web applications that facilitate participatory information sharing and collaboration in a social media
dialogue (such as social networking sites and blog sites) within a virtual community.
CHAPTER 31
Sentiment Analysis for Health Care
Muhammad Taimoor Khan
Bahria University, Pakistan
Shehzad Khalid
Bahria University, Pakistan
ABSTRACT
Sentiment analysis for health care deals with the diagnosis of health care related problems identified by the patients themselves. It takes the
patients’ opinions into account to make policies and modifications that could directly address their problems. Sentiment analysis is used with
commercial products to great effect and has expanded into other application areas. Aspect-based analysis of health care not only recommends
services and treatments but also presents the strong features for which they are preferred. Machine learning techniques are used to analyze
millions of review documents and distill them into an efficient and accurate decision. The supervised techniques have high accuracy but do not
extend to unknown domains, while unsupervised techniques have low accuracy. More work is targeted at improving the accuracy of the
unsupervised techniques, as they are more practical in this time of information flooding.
INTRODUCTION
In this time of technology, people share their issues online and seek advice about them, just as they previously did from their friends and family.
This online data can be found on various sources such as blogs, forums, and social media websites, covering a vast range of topics. There are
health-related blogs and forums where people discuss their health issues, symptoms, diseases, medication, and so on. Experiences with local health
care centers are also shared in terms of availability, service, environment, satisfaction, and comfort. It is of great value to new patients to learn
from others’ experiences when making decisions regarding their health and medication or when choosing a health care center. This information is
also very important to health care centers for identifying and addressing patients’ concerns. Patients share this information wrapped in their own
sentiments and emotions, which is the driving force of this type of analysis. Liu (2010) explained sentiment analysis as identifying the sentiments
of people about a topic and its features. The health-related content available online is free and huge in volume; therefore, it is impractical to
analyze all this information manually and distill it into a rapid and efficient decision. Sentiment analysis techniques perform this task through
automated processes with minimal or no user support.
Surveys and questionnaires were previously used for this purpose, but they were expensive and time-consuming. The professional articles produced
by experts are few in number, and they rarely address the problems faced by patients or consider the patients’ perspective. Sentiment analysis
takes into account the opinions of patients expressed in millions of documents that are spread over multiple platforms. The output of sentiment
analysis can take the form of categorizing health decisions into two classes, recommended or not recommended. By digging deeper, the aspects or
features of the health problem can also be extracted. The aspects of a target entity (e.g., a medicine) can be price, taste, packaging, availability,
side effects, time to take effect, and so on. This led to the foundation of aspect-based sentiment analysis (Liu & Hu, 2004). Aspect-based sentiment
analysis performs sentiment analysis at the aspect level, thus aggregating users’ opinions toward each aspect of the target entity. This type of
analysis is more realistic, as a good medicine or treatment may not have all aspects rated as good. It empowers patients to look for medication and
treatment procedures that have high ratings for the aspects of their concern. New studies in the field of sentiment analysis try to reveal the reasons
behind sentiment orientation. Such a system will not only reveal the satisfaction level of patients but will also show the reasons behind their
feelings. It provides highly targeted information, as the reasons to address for improvement are also specified.
The objective of this article is to highlight the importance of opinions expressed by millions of patients regarding their illnesses, treatments,
medication, and so on. Recent advancements in hardware technologies have made it possible to process large-scale sentiment data through
automatic machine learning techniques. These techniques perform heavy statistical evaluations to predict prominent semantic patterns. Utilizing
this information, health care centers and the government health ministry can make policies that address these issues and directly impact the
masses. It empowers patients to raise their own problems directly with the higher authorities without following painful procedures. Such feedback
systems based on sentiment analysis have already been used for governance, university management systems, and the like. A sentiment dataset
possessing timestamps can be partitioned into time slots, with sentiment analysis performed on each slot separately. This type of analysis reveals
a trend in public opinion over a period of time. It can be used to track the performance of a patient, instrument, or health center, where those with
dropping performance can be pointed out. Normally people are reluctant to adopt new treatment procedures, and such analysis can track the
change in people’s perception.
BACKGROUND
Sentiment analysis (also known as opinion mining) techniques have been used successfully for commercial products over the last decade. The
approach has gained popularity because people prefer to know others’ opinions before making a decision. Sentiment analysis explores popular
opinion patterns and presents them in a way that is easy to understand. In the context of health care, these patterns may point to the practices and
decisions that the majority of patients used to beat their illnesses. Sentiment analysis has been a hot research topic and has grown from business
intelligence for commercial products into other disciplines such as the social sciences, politics, geography, management sciences, health care, and
the stock market. Sentiment analysis is separated into various sub-streams, including trend analysis, bias analysis, danger analysis, and emotion
analysis. Muhammad et al. (2011) reported interesting findings on identifying the gender of an email sender through sentiment analysis. Sentiment
analysis has also been applied to novels and fairy tales to identify emotion patterns (Muhammad, 2011). Most of the work on sentiment analysis
has been carried out from the ML (machine learning) and AI (artificial intelligence) perspective. However, it intersects with other disciplines,
including NLP (natural language processing), computational linguistics, and psychology. The results of ML and AI techniques may not improve
considerably without taking these disciplines into account. NLP has many open challenges that are not answered satisfactorily because of the
richness of natural language, and this behavior transfers to sentiment analysis as well.
The ML techniques used for sentiment analysis train classifiers on labeled datasets and classify the test review documents accordingly. The commonly used classifiers for sentiment analysis are Naïve Bayes, kNN (k-nearest neighbor), centroid-based classifiers, and SVM (support vector machine). These classifiers have produced promising results in text categorization and summarization when dealing with objective content based on factual information, and they fall under the umbrella of information extraction and knowledge discovery. Although ML classifiers achieve high accuracy, they lack generalization and therefore need to be trained for each domain separately. Labeled datasets are limited and expensive to produce, following an exhaustive process. They also do not cover recent issues, while in some problems the results lose their significance if not produced in time. Therefore, classifiers are mainly used for sensitive issues, while for efficient analysis unsupervised techniques are preferred. The unsupervised techniques can be applied directly to any type of data, at the cost of some accuracy. They follow either a dictionary-based or a corpus-based approach. The dictionary-based approach requires access to an external sentiment dictionary to identify the orientation of the sentiments expressed. The sentiment dictionary provides sentiment polarity based on common language. Certain sentiment polarities are domain specific; for example, “being positive on a medical test” is a negative sentiment. The corpus-based approach does not have this problem, as it draws sentiments from a probabilistic analysis of word co-occurrence and reveals domain-specific sentiments only. However, it requires a large corpus, and accuracy drops as the size of the corpus becomes smaller. In order to improve on the results of corpus-based techniques, semi-supervised techniques have also been proposed, where the probabilistic model requires some domain-specific user intervention. These hybrid approaches require a small training set to identify initial values for parameters that are later used with an unsupervised probabilistic topic model to achieve better results.
Supervised Techniques
The supervised techniques consist of machine learning classification techniques that achieve high accuracy when trained with a labeled dataset for a specific domain. The labeled dataset is expected to contain cases representing all categories, in equal proportion in the ideal case. In binary classification there are two categories, positive and negative. Introducing a neutral class in a multi-class classification problem has been shown to improve results, and for finer analysis more classes are used. Let D = {d1, . . ., dn} be a set of review documents, F = {f1, f2, . . ., fm} be the set of features or aspects, and C = {c1, . . ., ck} be the set of possible classes. The task is to identify all the sentiment polarities expressed in a review document and aggregate them at the document level. Based on the cumulative score of the document, it is classified into one of the available classes. This task is performed differently by the various classifiers.
The Naïve Bayes classifier is extensively used for text classification. It computes cumulative aspect probabilities in association with the class labels, and a new document receives the class label with which it has the highest probability. The probability score itself is also preserved; it can be used to express the confidence of a feature vector in a label. Equation 1 is used to calculate the probability score of a feature vector for each class. If the value for an attribute is missing, the product of scores collapses to zero; therefore, logarithms of the attribute scores are summed instead to deal with this problem. Smoothing variables are also used to stabilize the classifier and make it robust to noise. In a weighted scheme used with Naïve Bayes, the contribution of prominent features towards classification can be highlighted. The Naïve Bayes classifier assumes that all sentences are subjective and that the features of the review document are independent of each other. Despite this unrealistic assumption, it produces good results and is used in various practical applications.
P(c \mid d) \propto P(c) \prod_{i=1}^{m} P(f_i \mid c) (1)
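As a rough illustration of the supervised pipeline described above, the following minimal sketch trains a Naïve Bayes sentiment classifier with scikit-learn. The tiny review set, its labels, and the test sentence are purely hypothetical placeholders, not data from the studies discussed here.

```python
# Minimal sketch of a supervised sentiment classifier (Naive Bayes).
# Assumes scikit-learn is installed; the toy reviews and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled review documents (D) and their classes (C).
train_docs = [
    "The medicine relieved my pain quickly and had no side effects",
    "The treatment was effective and the doctor was very helpful",
    "The medicine tastes awful and caused severe headaches",
    "Booking a checkup takes weeks and the staff is rude",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features (F) feeding a multinomial Naive Bayes classifier;
# Laplace smoothing (alpha=1.0) guards against zero probabilities.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_docs, train_labels)

test_doc = ["The medicine worked but the side effects were severe"]
print(model.predict(test_doc))        # predicted class label
print(model.predict_proba(test_doc))  # per-class probability scores
```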
The k-nearest neighbor (kNN) classifier assigns a label to a document based on the labels of its k nearest neighbors. The kNN classifier suffers from a bias towards bigger classes, which have more influence because they have more training examples. This problem was later addressed by using a variable value of k for each class. Finding a suitable value of k for a domain is a challenge, and the most suitable value is usually chosen after trying a range of values. Since kNN consults all training examples to label a test document, it takes more time. Variations of kNN, e.g., tree-fast kNN, have been proposed to improve the efficiency of these techniques. The centroid-based classifier calculates a centroid vector for each class, to which the test document vectors are compared. Since it does not consult the training data to resolve the label of a test document, it has better performance; its cost is proportional to the number of classes rather than the number of training documents. There are different approaches to calculating the centroid of a class, e.g., the Rocchio algorithm, the average score, or the sum of positive cases. The centroid-based classifier is sensitive to noise, and variations have therefore been proposed to make it robust. The support vector machine classifier finds a margin of separation between the classes, called the hyper-plane. The hyper-plane is used to classify test documents without consulting the training data each time. To give better results, the hyper-plane should have maximum separation between the classes. The performance of SVM depends on the choice of a suitable kernel function, such as a linear, polynomial, or Gaussian kernel. SVM is sensitive to noisy data close to the hyper-plane, so slack variables are added to mitigate its effect.
Unsupervised Techniques
The dictionary-based techniques do not require any training data; instead, they assign polarity based on the semantic orientation of a review document. The orientation or polarity of sentiment or opinion words is identified from an external sentiment dictionary, and the polarities are aggregated to find the overall polarity of the document. These techniques are also known as semantic-orientation-based or lexicon-based techniques. They can only be applied to languages for which a sentiment dictionary has been developed. The dictionary takes a sentiment word and returns its polarity along with a numeric polarity strength. For words with no entry in the dictionary, online sources are consulted through search engines, where the top N results are used to resolve the polarity of the unknown sentiment word. This approach is domain independent; however, it produces better results on general domains. Princeton University's WordNet is one popular sentiment dictionary. In a semi-supervised approach, some domain-specific seed words are provided, for which synonyms and antonyms are identified. The newly found words are again explored for synonyms and antonyms until no new words are extracted. The sentiment orientation (SO) of a subjective term t can be identified by finding its distance from the reference points good and bad, as shown in Equation 2.
SO(t) = \frac{d(t, \mathrm{bad}) - d(t, \mathrm{good})}{d(\mathrm{good}, \mathrm{bad})} (2)
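The dictionary-based procedure described above can be sketched as follows. The tiny sentiment lexicon, its polarity strengths, and the simple negation handling are illustrative assumptions, not the contents of any particular sentiment dictionary.

```python
# Minimal sketch of dictionary (lexicon) based sentiment scoring.
# The lexicon entries and their strengths are hypothetical placeholders.
LEXICON = {
    "good": 1.0, "effective": 1.0, "comforting": 0.8,
    "bad": -1.0, "painful": -0.9, "congested": -0.6,
}
NEGATIONS = {"not", "no", "never"}

def document_polarity(text: str) -> float:
    """Aggregate word-level polarities into a document-level score."""
    tokens = text.lower().split()
    score, negate = 0.0, False
    for token in tokens:
        if token in NEGATIONS:
            negate = True          # flip the polarity of the next sentiment word
            continue
        if token in LEXICON:
            polarity = LEXICON[token]
            score += -polarity if negate else polarity
        negate = False
    return score

review = "the treatment was effective but the ward was congested"
print(document_polarity(review))   # positive overall if the sum is > 0
```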
The corpus-based techniques consist of probabilistic topic models that perform analysis based on word co-occurrence in the corpus. Word co-occurrence can be measured through point-wise mutual information (PMI), shown in Equation 3. Probabilistic latent semantic analysis (pLSA), shown in Equation 4, and latent Dirichlet allocation (LDA) are also used to model word co-occurrence. Words are grouped into topics, where each topic represents a cluster of words with high co-occurrence probability. LDA (Blei et al., 2003) makes use of a three-level hierarchical Bayesian model, separating documents into topics and topics into words. LDA has outperformed pLSA because it has a more reliable model and its corpus-based hyper-parameters can help tune the model for a specific domain. The hyper-parameters control the coarse or fine distribution of documents over topics and of topics over words. These models require matrices with words as columns and documents or paragraphs as rows. The results of corpus-based techniques improve with the size of the corpus. In a semi-supervised approach with corpus-based techniques, some domain-specific seed words are provided by domain experts. Word co-occurrence with the provided words is then explored in an attempt to find more coherent topics, which results in improved accuracy. Although user intervention improves accuracy, it requires manual tuning by domain experts, which limits its application to partially explored, sensitive data.
\mathrm{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\, P(w_2)} (3)
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d) (4)
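The following short sketch estimates PMI for word pairs from a toy corpus of review sentences; the sentences are hypothetical, and co-occurrence is counted at the sentence level purely for illustration.

```python
# Minimal sketch of corpus-based word co-occurrence scoring with PMI.
# Sentence-level co-occurrence and the toy corpus are illustrative assumptions.
import math
from collections import Counter
from itertools import combinations

corpus = [
    "the medicine reduced the pain quickly",
    "the medicine had a bitter taste",
    "the doctor prescribed the medicine for pain",
]

word_counts, pair_counts = Counter(), Counter()
for sentence in corpus:
    words = set(sentence.split())           # presence/absence, bag-of-words style
    word_counts.update(words)
    pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

n = len(corpus)

def pmi(w1: str, w2: str) -> float:
    """PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) ), estimated from counts."""
    p_joint = pair_counts[frozenset((w1, w2))] / n
    p1, p2 = word_counts[w1] / n, word_counts[w2] / n
    return math.log(p_joint / (p1 * p2)) if p_joint > 0 else float("-inf")

print(pmi("medicine", "pain"))   # the two words co-occur in 2 of 3 sentences
```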
HEALTH CARE SENTIMENT DATA
A health care sentiment dataset is expected to contain subjective data that represents the authors' own opinions about the matter discussed. However, since this data is collected from online sources that do not follow any regulation, all sorts of other content must also be expected. There can be review documents that do not indicate any sentiment polarity and merely pass on general information. Such reviews are of little use to sentiment analysis and are therefore filtered out, in order to focus on opinion-rich content only. Subjective reviews may also contain objective statements representing facts and figures, which are filtered out for the same purpose, for example “The health care center for children specialists is on Parliament Road”. One study on health care analysis used clinical narratives as its dataset, while another study on deafness used a dataset extracted from online blogs about deafness. Because of the popularity of sentiment analysis and its impact on the minds of people, it is a hotspot for spammers who promote their own agenda and demote others. Spamming problems and their possible countermeasures are discussed in a later section.
Since these review documents are produced by amateur authors, they are expected to contain all sorts of inconsistencies: spelling mistakes, over- or under-use of capitalization, grammatical errors, word shortening, regional slang, swearing, and so on. Chau and Xu (2007) discuss the different writing styles of online content writers that need to be considered. Qiang et al. (2009) claim that blogs are widely used for information-rich analysis because they contain more detail. Blog reviews at times turn into a discussion between the author and commentators; such discussions carry more information but are harder to evaluate. The order of the comments is also very important, as the discussion gradually moves towards common ground. These discussions not only express sentiments about a health care problem but also support the opinions with strong reasons, which makes them highly useful.
There are different types of sentences that can be found in a review document. Although this problem has its roots in NLP, it has diverse effects on the performance of sentiment analysis. A simple sentence is one that contains a sentiment target (the target entity or one of its aspects) along with a sentiment word, for example “The XYZ health care center provides medicines at subsidized rates”. These are the easiest to evaluate. A compound sentence is one in which multiple sentiments about multiple aspects are discussed together, for example “Doctor XYZ is very experienced but it takes weeks to book a checkup”. Multi-link refers to a single aspect associated with multiple sentiments, or multiple aspects sharing the same sentiment word, as in “They have the best available doctors, equipment and treatment facilities” or “Its doctors are well qualified, approachable and vastly experienced”. Comparative sentences express sentiments about two target entities or their aspects in comparison to each other, for example “Medicine X is very comforting but medicine Y is available with all chemists”. Complex sentences here refer to sarcastic sentences that carry an implicit sentiment polarity opposite to the one explicitly shown, for example “What a great treatment! The illness circled back in a few weeks”. Sentiment analysis techniques struggle with such hard sentences, which are either simplified or filtered out at the pre-processing stage.
HEALTH CARE ANALYSIS
Health care sentiment analysis is the application of sentiment analysis techniques to health-related textual data. This data is preferably extracted from online sources to explore commonly mentioned patterns. Health care sentiment analysis helps identify the areas that are appreciated, criticized, suggested for improvement, or reasoned about in terms of performance. Sentiment analysis employs machine learning techniques to mine such patterns with high efficiency and accuracy. As noted earlier, sentiment analysis intersects with other disciplines such as NLP (natural language processing), computational linguistics, and psychology; without considering them, their effects cannot be resolved effectively. NLP has many challenges open for research, which are hard to address due to the richness of natural languages, and this behavior is transferred to sentiment analysis as well. Details of the NLP challenges are covered in a later section.
The performance of sentiment analysis techniques drops with noise in the dataset. Therefore, dataset cleansing is performed during pre-processing in order to make the data suitable for the analysis technique. Online datasets are assumed to contain all sorts of inconsistencies, because the web is an unmonitored medium of communication. Irrelevant tags and advertisements that contribute nothing to the analysis process are removed. Review documents that show no sentiment polarity are also filtered out. Sentiment-bearing documents that contain objective statements are either cleaned at pre-processing, or the objective statements are assigned to a neutral class used alongside the positive and negative classes in multi-class classification. Sentences that show dual polarity are often removed as inconclusive; they still carry polarities, but these are hard to evaluate. For example, “RMI has better doctors in Peshawar but the place is too congested to find one for parking” contains both positive and negative sentiments, so it is ignored as hard to classify. Labeling such a sentence with the neutral class is not a good decision either, as a positive and a negative sentiment together do not necessarily amount to a neutral sentiment. In practice, the features or aspects should be identified and the analysis performed per feature to provide aspect-level results. This type of information is not preserved in entity-level analysis, which simply concludes that the target entity is positive, negative, or neutral. Aspect-based sentiment analysis addresses this problem by performing sentiment analysis at the aspect level; it would rate RMI good for doctor quality but bad for parking space. Aspect-based sentiment analysis is discussed in detail in a later section.
In a multilingual system, the first step is to identify the language of the review document; text categorization, which is a mature area, can be used for this purpose. Since not every word contributes towards sentiment polarity, unwanted words are removed to speed up the process. POS (part-of-speech) tagging is performed for this purpose: POS libraries attach a POS tag to each word in a sentence, and after tagging the stop words are filtered out. In natural language, words are sequenced for a specific purpose, which would not be fulfilled if the same words were re-ordered; this shows that order information is also very important for improving the accuracy of the system. The corpus-based techniques follow a BOW (bag-of-words) approach, in which only the presence or absence of words is considered, not the order in which they occur. The dictionary-based approach also ignores word sequence, as it checks the polarity of a sentiment word against the dictionary irrespective of the word's position in a sentence. Supervised classifiers can preserve some sequence information, as words in a specific order are matched in order to identify the label. Some patterns can be provided to the unsupervised techniques in the form of determinants, to give a specific meaning to a certain sequence of words; however, it is impractical to identify all such cases.
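As a rough sketch of this pre-processing step, the snippet below tags a sentence with parts of speech and removes stop words using NLTK; the example sentence is hypothetical, and the required NLTK resources are assumed to be downloadable in the environment.

```python
# Minimal sketch of POS tagging and stop-word filtering at pre-processing.
# Assumes NLTK is installed and its resources can be downloaded.
import nltk
from nltk.corpus import stopwords

for resource in ("punkt", "averaged_perceptron_tagger", "stopwords"):
    nltk.download(resource, quiet=True)

sentence = "The XYZ health care center provides medicines at subsidized rates"

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)                 # [(word, POS tag), ...]
stop_words = set(stopwords.words("english"))

# Keep only content words; nouns (NN*) are later candidates for aspects,
# adjectives (JJ*) are typical carriers of sentiment.
content = [(w, t) for w, t in tagged
           if w.lower() not in stop_words and t[:2] in ("NN", "JJ", "VB", "RB")]
print(content)
```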
A sentiment document can be analyzed at three different levels. Document-level analysis aggregates all the sentiment scores used in the document and assigns that averaged polarity towards the target entity to the document as a whole. Since this type of approach does not provide enough information about the individual features of the target entity, sentiments were subsequently analyzed at the sentence level. With this technique, the sentiment towards an aspect of the target entity expressed in a sentence is used to aggregate all the sentiments for that aspect. However, such techniques also have problems, since a sentence may describe multiple aspects with different sentiment polarities. It is therefore more feasible to perform aspect-based sentiment analysis at the phrase level, where each sentiment phrase conveys sentiment polarity towards one aspect. Phrase-level analysis goes well with aspect-based analysis, and is therefore sometimes also called aspect-level analysis.
OUTCOME OF SENTIMENT ANALYSIS
The purpose of applying sentiment analysis techniques is to process the health-related opinions of millions of users and distill them into useful information. Therefore, the outcome of sentiment analysis has to be simple and conclusive, so that it can be used for decision making. The outcome can take the form of binary classes representing the percentages of positive and negative sentiments. If finer-grained categorization is required, the aggregated outcome is distributed among categories such as excellent, good, mediocre, bad, and worse. It has been observed that this type of numeric outcome is more useful for machines to consume than for humans. In order to make the outcome more user-friendly, it is transformed into a short summary, which gives more information about why people hold such opinions and is in a format that humans are more familiar with. The summary can be either extractive or abstractive: an extractive summary reuses popular opinion sentences taken from the review documents, while an abstractive summary is generated, for example from the numeric evaluations. Since the summary is in natural language, it is hard to produce in an error-free manner that is concise and makes sense to the end user.
NLP CHALLENGES
The open NLP challenges affect the performance of sentiment analysis on multiple fronts. Some of these challenges are specific to the type of data, while others are common to any type of textual analysis. The problems common in NLP can be grouped into four categories based on the level at which they are faced in the analysis process, as shown in Table 1. The document-level problems relate to review documents that contain discussions, which are mainly found in blogs: blog reviews allow comments that usually grow into a discussion. These documents contain opinions that are specific to the domain. Owing to human psychology and the variety of natural languages, people express themselves differently. Opinion spamming is also a very sensitive issue, where people post false sentiments intended to promote or demote specific target entities. There may also be advertisements posted as review documents that have nothing to do with the target entity under review. Unfortunately, there are individuals and even companies in the business of sentiment review spamming; experts in sentiment analysis expect as many as half of the reviews on some commercial review sources to be spam. Over- or under-use of certain sentiment gestures is also to be expected, as the content writers are amateurs rather than professional writers. Spelling mistakes, regional slang, word shortening, and the like are other commonly faced problems.
Table 1. The NLP related challenges faced in sentiment analysis
ASPECT BASED SENTIMENT ANALYSIS
Aspect-based sentiment analysis was introduced as a research area in (Liu & Hu, 2004). The authors identified aspects and argued that they, rather than the target entity itself, are the direct sentiment targets. In aspect-based sentiment analysis, aspects are identified for the target entity, and instead of performing sentiment analysis for the target entity as a whole, it is performed for each aspect. The key task in aspect-based sentiment analysis is therefore the extraction of aspects. Aspects can be of various types depending on the perspective considered. Based on prior knowledge about the domain, aspects can be known or unknown. Known aspects are those identified by domain experts before the analysis; they are of key importance in supervised analysis for constructing training data. Since it is hard to enumerate all possible aspects of a target entity, unsupervised aspect extraction techniques can be used; they explore the domain without any prior knowledge and can extract aspects for any type of domain, but they have lower accuracy and may identify incoherent aspects. Aspects can be common, with frequent mentions in the domain, or rare, mentioned by only a few authors. Aspects can be single-term or multi-term; for example, environment is a single-term aspect while service quality is a multi-term aspect. Multi-term aspects are hard to identify with an unsupervised technique. Aspects can also be explicit or implicit, depending on whether the aspect term is mentioned in the sentence. For example, “The environment was peaceful” mentions the environment aspect explicitly, whereas in “It's cheap to have an appointment for a checkup in the morning” there is a positive sentiment for the aspect price or cost, which is not explicitly mentioned.
Aspect Extraction
Aspect extraction is the most challenging part of aspect-based sentiment analysis because of the different variations of aspects discussed above. Unsupervised techniques are preferred for aspect extraction, as they explore the whole domain for aspects rather than being guided towards a particular type of aspect. With a supervised approach, the known aspects are identified with high accuracy, but all unknown aspects are missed. Aspect extraction consists of two steps:
• Aspect Identification: Aspect identification refers to the process of identifying nouns and noun phrases as candidate aspects, the probable aspects from which the final aspects are extracted. The identification process is supported by frequency of occurrence or by relational patterns in which aspects may occur; for example, the aspect treatment may occur as “____ of the treatment” or “treatment having ____”. These patterns are called determinants, and this type of approach is called the frequency-relation based approach. As the name suggests, it builds a long list of candidate aspects based on frequency and then filters out candidates that are not found in the specified relation patterns. Low-frequency aspects are added to the list when they appear alongside a sentiment word, which is taken as an indication of being an aspect (a minimal sketch of this step appears after this list). The probabilistic topic models extract aspects through word co-occurrence. These models are based on LDA, a hierarchical Bayesian model successfully used for topic extraction in text articles. Since aspects have a specific nature and, unlike topics, not every word can be an aspect, these models are extended in different ways to focus on aspects only; MG-LDA, MaxEnt-LDA, and ILDA are some of the variations. From the list of candidate aspects, multi-term aspects are generated by pairing candidate aspects that appear together in a sentence; pairs are formed only in the order in which the terms appear in the sentence, and the generated multi-term aspects are added to the list. Pruning steps are then applied through various filters to remove candidate aspects with little support from the corpus.
• Aspect Aggregation: The aspect aggregation step receives a list of aspect terms, both single- and multi-term, that have passed the filters applied in the previous step. The purpose of aspect aggregation is to group aspect synonyms and near-synonyms together, as they refer to the same aspect using different words. A dictionary is required to find word synonyms; it returns the distance between two words, based on which the words are clustered. Probabilistic topic models do not require this as a separate step, as they identify and aggregate aspects simultaneously. Some aspect synonyms are common while others are specific to a particular domain and cannot be resolved accurately by the dictionary. When aspects are clustered, each cluster represents a single aspect and the words in the cluster are simply different ways of referring to it. Each aspect cluster is given a suitable name: the cluster name, which becomes the aspect name, may be picked from within the cluster by the unsupervised technique, or it may be assigned by an expert as a different word that more closely represents the aspect. Aspect-based sentiment analysis is then performed by aggregating the users' sentiments towards each aspect, i.e., each cluster of aspect terms.
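The sketch below illustrates only the frequency-based part of aspect identification: nouns are collected as candidate aspects and infrequent candidates are pruned. It uses NLTK for POS tagging; the toy reviews and the frequency threshold are hypothetical choices, and the relation-pattern filtering and topic-model variants described above are not reproduced here.

```python
# Minimal sketch of frequency-based candidate aspect identification.
# Assumes NLTK and its tagger resources; reviews and threshold are illustrative.
import nltk
from collections import Counter

for resource in ("punkt", "averaged_perceptron_tagger"):
    nltk.download(resource, quiet=True)

reviews = [
    "The doctors are experienced but the parking is terrible",
    "Great doctors and modern equipment, though the parking is congested",
    "The equipment is new and the staff is helpful",
]

candidates = Counter()
for review in reviews:
    tagged = nltk.pos_tag(nltk.word_tokenize(review.lower()))
    # Nouns and noun phrases are the usual candidate aspects.
    candidates.update(word for word, tag in tagged if tag.startswith("NN"))

MIN_SUPPORT = 2   # prune candidates with little support from the corpus
aspects = [word for word, count in candidates.items() if count >= MIN_SUPPORT]
print(aspects)    # e.g. ['doctors', 'parking', 'equipment']
```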
DISCUSSION AND RECOMMENDATIONS
Sentiment analysis is a diverse field that needs improvement on multiple fronts. Human psychology comes into play when considering what patterns of words people produce or over-use when they lie or are over-excited. The NLP challenges discussed here are rarely addressed in sentiment analysis studies, which mostly focus on using ML techniques to improve results; however, no considerable progress can be achieved without considering them. Ignoring the complex types of sentences is not a solution either, as many such sentences occur in real datasets, and removing them means the analysis misses many important details. These challenges affect the discussed analysis techniques with different levels of sensitivity. The purpose of health sentiment analysis is to identify the key health facilities and what people like or dislike about them. It is a very powerful way to convey patients' appreciation directly to those concerned and motivate them to keep performing well. It also identifies the problems patients are having, and thereby helps the authorities focus on them in the future. Such a model can convey the most discussed health concerns in the form of buzz words, categories, or summaries that pinpoint the problem. Adopting such a system is key to the progress of an institution or government. Politicians already perform sentiment analysis on their speeches to identify the policies most appreciated by the public, so that they can focus on them more closely and re-iterate them in coming campaigns.
Since the purpose of health care sentiment analysis is to empower patients to make safe and preferred health-related decisions, the presentation of the results is also an important concern. No matter how well a technique performs, if the results are not presented to patients in a way that helps them make a decision, the whole process has lost its purpose. Graphical presentation tools should accompany the generated summaries and figures to help users understand the outcome. Similarly, if suggested recommendations are also mined and shown alongside the low-scoring aspects, the output becomes more actionable for improvement. Aspect-based sentiment analysis helps patients find a health-related solution that closely suits their needs: suggesting the best health facility to a patient who cannot afford it or practically visit it is useless, whereas patients would rather find a suitable solution, among many, that is within their reach. Similarly, the presentation of summaries may be refined with more quantitative information in addition to qualitative details, as quantitative information is more helpful to a patient in making up his or her mind.
A big challenge for such systems is handling big data. The problem of big data is knocking at the doors of various disciplines, and health analysis is no exception. There is a very large volume of health-related content available online, and it keeps pouring in at a high rate. This information is available on a range of platforms that vary in the way they store information, and these sources have different levels of authenticity for the content they carry. This raises many questions about the health recommendations produced by these techniques. The system must be able to handle a huge volume of data, as considering only a sample will not fulfill the needs of the user. Similarly, the technique has to be efficient enough to handle such a volume, and the hardware has to keep up with it as well. For up-to-date analysis, streaming data has to be considered so that the most recent opinions are taken into account. This requires real-time analysis, which has many issues of its own, as data extraction, pre-processing, analysis, and results presentation all have to be performed in real time or near real time. The data extracted from different online sources comes in different formats, and the system has to be tuned to deal with all of them. For example, on Facebook people can like and comment, while on Twitter they can tweet and retweet; therefore, the data extracted from these two sources will have attribute values that are not common to other sources. Last but not least is the validity of the data, which concerns the processes that identify spam or false review documents and discard them. Content validity also has psychological factors, as people tend to use stronger words when they lie. Structural information such as timestamps, IP addresses, location longitude and latitude, and user identity can help tune the analysis to a finer level; it can also be used to track spamming and spammers.
FUTURE RESEARCH DIRECTIONS
Patients searching for a specific solution may want to limit the analysis to their own region. A region-based analysis will therefore provide more realistic results for that region than averaging it with all other regions. Similarly, a time-based analysis will show the time-wise performance of the services and policies in practice. This type of analysis can alert the authorities to revisit the structure of a department or service whose performance or popularity is declining, and to save it before it is completely abandoned. Sentiment search systems would provide aggregated subjective results for health-related queries. A recommender system can be built on top of such query analysis to recommend medicines, treatments, specialists in the locality, relevant health care centers, and so on, based on the subjective information provided by other patients or users. More sophisticated techniques can be employed in such systems to discourage spammers; psychology, structural information, and ML techniques have to be combined to reduce their impact on results. More parameters through which the validity of content can be verified remain to be discovered. Authors may also be ranked based on the validity of the content they produce, where authors with a doubtful background may be ignored or watched more carefully for future content.
CONCLUSION
Sentiment analysis meets a need of any organization: giving people access to information and empowering them to have their say. The trend has already started through social media, where a highlighted issue may only reach the higher authorities if they notice it directly; it is impractical to traverse social media and other user content sources manually to find issues, and sentiment analysis automates this process. Aspect-based sentiment analysis and sentiment reason analysis are more informative and help users make sensible decisions. The sentiment analysis techniques borrowed from machine learning and data mining are tuned for this problem. Supervised techniques, with their high accuracy, can be used for more sensitive purposes and tightly bounded problems. Unsupervised techniques are less expensive and can be used to explore large domains for possible problems. The output of the sentiment analysis process is typically presented as percentage values for the possible classes; it can be aided with graphical tools to be more conclusive for end users. Summaries may also be used to lay out the findings, and buzz words are used with summaries to highlight the most frequently used sentiment words in the given studies. Sentiment analysis was introduced in the early 2000s, while aspect-based sentiment analysis was introduced in 2004. The area is still far from mature, with new sub-streams identified such as emotion analysis, behavior analysis, bias analysis, and danger analysis. The popularity of and need for these findings can be observed from the fact that sentiment analysis is already used for commercial products while it is still fresh as a research problem.
This work was previously published in the International Journal of Privacy and Health Information Management (IJPHIM), 3(2); edited by Muaz A. Niazi and Ernesto Jimenez-Ruiz, pages 78-91, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
REFERENCES
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Chau, M., & Xu, J. (2007). Mining communities and their relationships in blogs: A study of online hate groups. International Journal of Human-Computer Studies, 65(1), 57–70. doi:10.1016/j.ijhcs.2006.08.009
Chen, Z., & Liu, B. (2014, August). Mining topics in documents: Standing on the shoulders of big data. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1116–1125). ACM. doi:10.1145/2623330.2623622
Chen, Z., Mukherjee, A., & Liu, B. (2014). Aspect extraction with automated prior knowledge learning. Proceedings of ACL (pp. 347–358). doi:10.3115/v1/P14-1033
Cilibrasi, R. L., & Vitanyi, P. M. B. (2007). The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370–383. doi:10.1109/TKDE.2007.48
Ding, X., & Liu, B. (2007). The utility of linguistic rules in opinion mining. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. doi:10.1145/1277741.1277921
Mohammad, S. (2011). Once upon a time to happily ever after: Tracking emotions in novels and fairy tales. Proceedings of the ACL Workshop on LaTeCH.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 1(2), 1–135. doi:10.1561/1500000011
Pavlopoulos, J., & Androutsopoulos, I. (2014). Aspect term extraction for sentiment analysis: New datasets, new evaluation measures and an improved unsupervised method. Proceedings of LASM@EACL (pp. 44–52). doi:10.3115/v1/W14-1306
Qiang, Y., Ziqiong, Z., & Rob, L. (2009). Sentiment classification of online reviews to travel destinations by supervised machine learning
approaches. Expert Systems with Applications , 36(1), 6527–6535.
Qingliang, M., Qiudan, L., & Ruwei, D. (2009). AMAZING: A sentiment mining and retrieval system. Expert Systems with Applications , 36(3),
7192–7198. doi:10.1016/j.eswa.2008.09.035
Wang, T., Cai, Y., Leung, H. F., Lau, R. Y., Li, Q., & Min, H. (2014). Product aspect extraction supervised with online domain
knowledge. Knowledge-Based Systems , 71, 86–100. doi:10.1016/j.knosys.2014.05.018
KEY TERMS AND DEFINITIONS
Aspect: Aspects are features of the target entity. The sentiments are mostly expressed for the aspects of the target entity and therefore they are
also called sentiment targets.
Lexicon: Lexicon is the sense that a word gives when used at a specific position with a group of accompanying words. It refers to the semantics of the word.
Negation Handling: The handling of the word not in a review document. It has been an open challenge in NLP, as not takes many forms and does not necessarily invert the sentiment.
Objective Content: Objective content consists of facts and figures. Unlike subjective content, they can be either true or false which can be
verified from other sources.
Parts-of-Speech Tagging: Parts-of-speech (POS) tagging refers to the process of assigning a POS tag, such as verb, adjective, or noun, to each word in a sentence. POS taggers are used for this purpose.
Review Document: A document that has subjective sentences expressed by an author about a target entity.
Sentiment Aggregation: Sentiment aggregation is averaging all the sentiments expressed for the sentiment target.
Sentiment Dictionary: The external source used for identifying the orientation and strength of sentiment words. WordNet is one such lexical dictionary, produced by Princeton University.
Sentiment Orientation: Sentiment or Opinion orientation refers to the polarity of the sentiment word that is used for classification.
Orientation strength along with polarity is required for multi-class classification.
Subjective Content: Subjective content consists of sentences that express the author's own sentiments or opinions. They cannot be marked as wrong or correct due to their subjective nature.
Target Entity: Target entity refers to the place, product, service, procedure etc. that is under observation for which the subjective data is
collected.
CHAPTER 32
RESTful Web Service and Web-Based Data Visualization for Environmental Monitoring
Sungchul Lee
University of Nevada Las Vegas, USA
JuYeon Jo
University of Nevada Las Vegas, USA
Yoohwan Kim
University of Nevada Las Vegas, USA
ABSTRACT
The Nevada Solar Energy-Water-Environment Nexus project collects a large amount of environmental data from a variety of sensors covering soil, atmosphere, biology, and ecology. Most of the environmental data relates to the development of renewable energy resources in the Nexus project. The environmental data can have an impact on other research fields if it can easily be shared with other researchers, students, teachers, and general users. The Nevada Climate Change Portal (NCCP) site was therefore created for the Nexus project with the purpose of sharing such data. However, there are challenges in utilizing, collecting, and sharing this data among users. In this research, the authors propose an Extended Web Service Architecture to address these challenges. The authors implement Arduino instead of the CR1000 as a collector because of its cost effectiveness, and use a REST API to overcome the limitations of Arduino. Moreover, the authors experiment with popular Web-based data visualization tools, such as Google Chart, Flex, OFC, and D3, to visualize NCCP data.
1. INTRODUCTION
Developing renewable energy resources is a national priority (U.S. Office of Management and Budget, 2012). In order to reach the national goal of expanding renewable energy resources, the University of Nevada, Las Vegas (UNLV), the University of Nevada, Reno (UNR), and other Nevada institutions are collaboratively conducting the Energy-Water-Environment Nexus project. Nexus project researchers have been collecting large amounts of environmental data for decades. This data is valuable for people in various fields, such as engineers, hydrologists, biologists, ecologists, soil scientists, atmospheric scientists, and economists. Hence, numerous organizations have focused on creating their own data centers to share such data. Accordingly, the NCCP site was built to accommodate the Energy-Water-Environment Nexus project; it has been collecting data and constructing data publications since 2011. The NCCP site established the Nevada climate-eco-hydrology assessment network (NevCAN) for collecting environmental sensor data such as precipitation, plant canopy interception of snow, subsurface soil water flow, soil water content, snow depth, soil temperature, thermal flux, and solar radiation (Nevada Climate Change Portal, 2014). Through NevCAN, the NCCP site stores more than four hundred million environmental sensor data points per year. Visualizing this data in real time requires a large amount of memory and resources on the web server, and pre-processing it is difficult because there are so many combinations of sensor data, chart types, and period lengths that could be requested by users.
The proposed architecture is composed of three main parts: sensor network, Web Service, and visualization. In this research, we suggest Arduino-based sensors for a reduced cost compared to the CR1000. We also propose an advanced RESTful Web Service for the environmental monitoring system. REST performs better in a sensor network than the Simple Object Access Protocol (SOAP) currently utilized by NCCP, and it is flexible enough to fit numerous scales. Moreover, REST is well suited to web services based on Arduino. We test popular web-based data visualization tools with the large amount of sensor data obtained from the Nevada Nexus project.
2. RELATED WORK
Environmental data is growing bigger and becoming more important due to significant developments in environmental monitoring. Therefore, data portals such as the Climate Data Portal (Soreide, Sun, Kilonsky, & Denbo, 2001), NCCP, and the GPS Explorer data portal are becoming more important for sharing such data. Sensor Web Services-based observation, analysis, and modeling focus on collecting and sharing environmental sensor data at the portal (Xianfeng, Chaoliang, Kagawa, & Raghavan, 2010; Bock, Crowell, Prawirodirdjo, & Jamason, 2008).
However, the majority of portals need to improve their data visualization to offer a real-time visualization Web Service. Most visualization research is carried out with off-line tools. For example, mathematical tools such as MATLAB (Azemi & Stook, 1996), Mathematica (Savory, 1995), and GODIVA (Xiaosong, Winslett, Norris, & Xiangmin, 2004) typically include visualization routines, as do off-line visualization tools such as Origin (Yingqi, 2011), Mayavi (Ramachandran & Varoquaux, 2011), and R (Voulgaropoulou, Spanos, & Angelis, 2012). These tools are not suitable for a data portal, as they are not based on on-line visualization.
In this research, we advanced the Web Service using the second-generation web service style, the REST Web Service. Additionally, we connected Arduino as a sensor platform to reduce the cost of collecting data. We also examined the data processing frameworks of popular web-based data visualization tools and compared their performance, in order to suggest suitable visualization tools for each data portal.
3. CURRENT EMPLOYED SENSOR SYSTEM AND NCCP DATA
In the state of Nevada, various environmental data have been collected by the Nexus project, NCCP, and their predecessor projects. Two main regions in Nevada, the Snake Range and the Sheep Range, with eight and five sites respectively, have been collecting sensor data through NevCAN. In addition, some of the sensors are installed at university campuses and other organizations for measuring precipitation, plant canopy interception of snow, subsurface soil water flow, soil water content, snow depth, soil temperature, thermal flux, solar radiation, and so on. Figure 1 is an actual picture of the CR1000 currently installed as a sensor data collector at the UNLV campus. The CR1000 has local storage to store the sensor data.
Figure 2 displays sample data collected by NevCAN and structured in Extensible Markup Language (XML) format. Basic information, such as the name of the location, the machine, the kinds of data collected, and the time format, is stored in the head tag. The measured data along with the measuring time are contained in the data tag. NevCAN sends the data to NCCP, where it is saved in a Microsoft SQL Server database. The data can be accessed via the NCCP website, which provides a search engine allowing users to select data types, locations, and the length of the period. Users can download the data in either CSV or text format.
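Since Figure 2 is not reproduced here, the element and attribute names in the following sketch are only assumptions about what such a head/data structured record might look like; the snippet simply shows how a record of this shape could be parsed on the server side.

```python
# Minimal sketch of parsing a head/data structured sensor record.
# The XML layout and element names are assumptions, not the actual NevCAN schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<record>
  <head location="UNLV campus" machine="CR1000"
        data-type="soil temperature" time-format="ISO8601"/>
  <data>
    <point time="2014-06-01T00:00:00" value="24.1"/>
    <point time="2014-06-01T00:01:00" value="24.0"/>
  </data>
</record>
"""

root = ET.fromstring(SAMPLE)
head = root.find("head")
print(head.get("location"), head.get("machine"))

for point in root.find("data"):
    # Each point carries the measuring time and the measured value.
    print(point.get("time"), float(point.get("value")))
```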
However, professional users can experience delays when using the NCCP website. Such delays can occur, for instance, when users search for data compiled over a long period of time or select many data types; in other words, the delay depends on the size of the requested data. Consequently, we examined various web visualization tools for visualizing large amounts of data, in order to find the most efficient tool for the NCCP web site.
In our research, we use Arduino as a data collector due to its substantially lower purchase cost compared to the CR1000. We utilize a Web Service to store the data in a database instead of using local storage, and the REST Web Service bridges the data to the database in order to overcome Arduino's shortcomings. We store the data based on the NCCP data structure in order to test the various visualization tools.
4. ARDUINO WITH RESTFUL WEB SERVICE
Arduino can collect data instead of the CR1000, and it can work as a bridge between the web server and the collector (Campbell Scientific, 2014). Arduino is an open-source electronics prototyping platform that focuses on flexible, convenient hardware and software. Arduino can create interactive objects and environments for beginners, such as researchers, artists, designers, and hobbyists (Arduino, 2014).
Table 1 shows the characteristics of the CR1000 and Arduino. The CR1000 has a higher-performance microcontroller and larger memory; accordingly, it performs better than Arduino. In an uncontrolled outdoor environment, the collector is always exposed to theft or sabotage; because Arduino is much cheaper than the CR1000, the impact of a theft is much smaller. Arduino-based sensors can be remotely monitored and controlled through an Arduino Ethernet shield, and an Arduino with an Ethernet shield can serve a web service over HTTP. As stated in Table 1, Arduino can handle a similar amount of sensor data to the CR1000 using its I/O pins (20 vs. 24 analog and digital). Arduino can also work as a data logger by connecting to a variety of sensors (Vikatos, Theodoridis, Mylonas & Tsakalidis, 2011). It can control the sensors, for example the measurement time resolution, and can control switches or actuators. Arduino's microcontroller is programmed in the C language (based on Wiring). Arduino programs can run either in stand-alone mode or be controlled by software running on a host computer (e.g., Flash, Processing, MaxMSP). Figure 3 shows an Arduino Uno 3.0 board with an Ethernet Shield V2.0 and a temperature sensor along with their connections. Arduino measures the voltage from the sensor using a polling method: it sends 5 volts to the temperature sensor to read its temperature-dependent resistance. The measuring interval is established by a timer library in the program. Arduino converts the analog data to digital data to reduce the workload of the web server. The data is sent to the server via the RJ45 port on the Arduino Ethernet Shield V2.0 (Arduino, 2014). To reduce the overhead cost, we use a REST API to send the data from Arduino to the server.
Table 1. CR1000 vs. Arduino specifications

                      CR1000         Arduino
Input voltage         -5 V ~ 5 V     7 ~ 12 V
Analog pins           16             6
Digital I/O pins      8              14
The program has two main parts, initialization and the main operation, written as an Arduino sketch. Figure 4 and Figure 5 show sample Arduino sketch code. Figure 4 shows a sample Arduino program segment: it first sets up the Ethernet connection and configures the timer using the setTime variable, after which the main operation runs inside loop(). Figure 5 shows the functions that convert the analog signal into digital data and then send the data to the server. getData() is called every second by the timer variable 't' within the loop. The getVoltage() function reads an analog signal from the sensor using the polling method and converts it to a digital value. The value is converted to temperature in Fahrenheit and Celsius in getData(). restConnection() connects the collector to the server using the server's IP address, and then sends the collector ID, the sensor ID, and the measured value to a destination URL (Uniform Resource Locator) over REST.
Figure 4. Setup, loop and getVoltage
Figure 5. RestConnect and getData
REST stands for Representational State Transfer. Roy Fielding proposed it as a software architecture design for distributed computing platforms on the web in his Ph.D. dissertation published in June 2000 (Hongjun Li, 2011). The REST architectural style is an abstract model of the web architecture built around HTTP methods, Uniform Resource Identifiers (URIs), and statelessness (Franco, Norbert, Roberto, & Sandro, 2010). Recently, REST has been chosen for web services and for remotely accessing equipment instead of SOAP; Google and Amazon have already switched from SOAP to REST, as REST is more flexible, scalable, and simpler than SOAP (Aihkisalo, 2012). There are three reasons why REST is more suitable than SOAP in a sensor network. First, REST is well organized via URIs: although we use different sensors and collectors in the sensor network, a REST web service can classify the service for each kind of equipment. Second, SOAP uses a SOAP envelope to send data, which always creates overhead even for small payloads, whereas REST does not involve this overhead, so the payload of REST can be substantially reduced. Normally a collector connects to more than one sensor, and each sensor collects more than one reading in every polling interval, depending on the researcher's setup. Collectors such as Arduino or the CR1000 consequently receive the data from the sensors and need sufficient memory to process it, so the message size plays an important role; a REST-based web service is therefore well positioned for remotely accessing collectors. Lastly, REST requires a shorter Round Trip Time (RTT) than SOAP does. Figure 6 shows that Web Services based on REST can handle service or operation requests, such as Register a client (REG), Send a message (SEND), Receive all messages (RCV), or Unregister a client (UREG), faster than SOAP can (Xuelei, Coll, & Bilan, 2009).
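As a rough illustration of this REST-based data path, the following sketch shows a server-side endpoint that a collector could POST readings to. It uses Flask; the URL path, JSON field names, and port are illustrative assumptions rather than the actual NCCP interface.

```python
# Minimal sketch of a REST endpoint receiving sensor readings from a collector.
# Uses Flask; the route, JSON field names, and storage are illustrative assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)
readings = []          # stand-in for the Middle-tier database

@app.route("/collectors/<collector_id>/sensors/<sensor_id>/readings",
           methods=["POST"])
def add_reading(collector_id, sensor_id):
    payload = request.get_json(force=True)   # e.g. {"time": "...", "value": 24.1}
    readings.append({
        "collector": collector_id,
        "sensor": sensor_id,
        "time": payload.get("time"),
        "value": payload.get("value"),
    })
    return jsonify(status="stored"), 201

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```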
5. WEB SERVICE FOR SENSOR NETWORK
Web Services are frequently used for communicating with sensors and collectors. In this section, we show our Extended Web Service architecture for handling sensor data. Typically, a Web Service consists of a Client, a Middle, and a Server tier (Botts & Robin, 2007). The user interface is handled by the Client tier: users access the data service via a web page. The Middle tier works as a mediator and handles the users' requests; for example, data requested by the users comes from the Server tier, is processed by the Middle tier, and is sent to the users (Xuelei, Coll, & Bilan, 2009).
In our project, we extended the Middle tier to control remote sensors and collectors. Figure 7 illustrates the Extended Web Architecture with Sensor Network. In a typical environmental monitoring system, the collector stores the measured data collected by the sensors; the data is stored temporarily in the collector before being moved to the Middle tier. While it is possible to move the sensor data directly to the database in the Server tier, we avoid this for security reasons. Remote sensors are generally exposed to risks such as being stolen, damaged, or modified, so we protect the newest data by moving it off the collector. All the data is sent to a database in the Middle tier for data integrity, data verification, and data classification. To overcome the collector's limitations, we delegate some of its functions, such as configuration parameters and processing, to the Middle tier. By delegating functions to the server, the collector (Arduino) can focus its full resources on gathering sensor data and essential functions. Additionally, the sensors and collectors can be upgraded transparently to the rest of the system.
Figure 7. Extended web architecture with sensor network
We also extend the Middle tier for scalability and flexibility. Most other Web Service research for environmental data portals focuses on constructing a Web Service suited to one specific type of collector and sensor (Xianfeng, Chaoliang, Kagawa, & Raghavan, 2010; Bock, Crowell, Prawirodirdjo, & Jamason, 2008). However, a Web Service for an environmental sensor network in the real world works with various types of collectors and user devices; for example, NCCP collects over fifty different data types through NevCAN. Therefore, it is crucial to extend the Middle tier in order to control various devices. Consequently, we added two components to the Middle tier: a Profile Manager (PM) and an Asset Manager (AM). Users request services via various devices such as PCs, cell phones, and smart devices; such devices are handled by the PM, whose roles include processing, handling, and describing users' requests. The AM is created for collectors and sensor data. The sensor data from collectors such as the CR1000 and Arduino is processed by the AM and transformed into Sensor Model Language (SensorML), an XML framework for standardizing sensor data, capabilities, and systems (Botts & Robin, 2007). The AM also controls the collectors through command orders such as time, unit, update, and cancel (Quint, 2003), so users with permission to control a collector can do so through the Client tier. The order command is received by the PM and passed on to the AM to handle the collector.
We follow the Model-View-Controller (MVC) design pattern to build the Web Service (Xuelei, Coll, & Bilan, 2009). Figure 8 shows the interactions among the modules. They work in the following sequence:
Figure 8. Web service flow with the sensor network
3. Whenever the status changes, the View and Controller get the updated status of the Model, programmed as a Java bean;
4. The View produces the updated output, and the Controller changes the available set of commands based on the Model's notification;
5. The View sends the data from the Model and the associated formatting parameters to a 3rd-party service (the Google chart server);
6. The 3rd-party service (Google chart server) renders a graph image in Scalable Vector Graphics (SVG) format using the data and sends it back to the View for data visualization.
Figure 9 is an example of a Google chart image in SVG format created with actual NCCP sensor data. SVG is an XML-based vector image format for two-dimensional graphics, interactivity, and animation (Ying Zhu, 2012). The SVG image can therefore be re-created depending on the data period or the filter selected by the user, without an additional connection to any server. Google supports various chart types such as line, pie, tree map, and 3D charts, so a developer needs no additional effort to render the same data in different chart types. Google also provides a data table for data manipulation: data sorting, modification, and filtering are easy to implement by converting sensor data to a Google data table. This flexibility helps users understand the nature of the data. However, Google chart incurs a delay when generating SVG for a large data set.
Figure 9. Example of Google chart image in SVG format
The next section explains the difference between web-based visualization and standalone, application-based visualization. We present popular web-based data visualization tools and test them with a large volume of NCCP data to find the most suitable visualization tool for a Web Service in environmental monitoring.
7. DATA VISUALIZATION
Scientific data visualization is useful for understanding the nature of the data and its underlying systems (Jianghui, Gracanin & Chang-Tien, 2004), and it is an effective way to describe a large volume of data. However, real-time visualization of large-scale data is difficult because of communication delay and processing overhead (Ma, 2000). The Solar Energy-Water-Environment Nexus project in Nevada generates a large amount of data from sensors (Sungchul, Juyeon, Yoohwan & Haroon, 2014).
The Nexus project installed NevCAN at nine different sites in Nevada; NevCAN collects 85 different types of sensor data. The sensors measure environmental data every minute, and as a result over four hundred million data points are collected at the NCCP database server every year. The data can be downloaded in textual format from the NCCP (Sungchul, Juyeon & Yoohwan, 2014). NCCP also needs a graphical representation of the data on the web, but processing data at this scale for visualization is a challenge: transforming it into a graphical format in real time on the web server is quite demanding. We surveyed other work on this problem, but the majority of data visualization research has focused on off-line visualization, whereas our purpose is to investigate large-scale data visualization on the web. Off-line and web-based visualization tools have quite different mechanisms.
Figure 10 illustrates the different styles of visualization for web-based and standalone, application-based processing. Figure 10(a) shows the process of web-based visualization: the web server retrieves data from storage when it receives a client's request via the web browser, processes the data into a graphical format, and returns it to the client's web browser. In contrast, the standalone application in Figure 10(b) does not send a request to a server or receive graphics from it; it can process and visualize all the data on its own without additional communication. The delay may be negligible for small data, but the delay of web-based visualization becomes crucial when processing a large volume of data (Rong, Wang, & Ding, 2009).
We broke the data visualization processing time down into multiple stages, such as display time, data transfer time, system execution time, and rendering time. We then analyzed popular web-based data visualization tools in terms of these stages and measured the delay at each step, using various volumes of data collected by NevCAN.
8. WEB VISUALIZATION TOOLS
We selected four widely used web-based visualization tools: Google Charts, Open Flash Chart (OFC), Adobe Flex, and D3.js. Visualization is fundamentally a series of data transformations (Figure 16). Each tool uses a specific input data format, such as XML, JSON, or DataTable, and the raw data is transformed into a set of parameters specifying the graphics, which are then rendered into a graphics file such as SVG or JPG. The tools differ in their rendering technology: OFC and Flex use the Flash library as their rendering engine, while Google Charts and D3 use the built-in rendering functions of HTML5. Hyper Text Markup Language 5 (HTML5), the latest version of HTML, is designed for structuring and presenting data on the World Wide Web (WWW) and supports modern multimedia, such as animation and audio, natively without additional plug-ins. Google Charts and D3 run in Google Chrome and in MS Internet Explorer 9 or later.
All four tools produce output in SVG format, an XML-based vector image format supporting interactivity, animation, and so on (Xixi, Yuehui, Yidong, & Tan, 2012). SVG can reduce additional connections between client and server; for example, an SVG image can update itself when users select a time period or filter the data in the image. SVG supports various interactive features such as zooming in and out, highlighting visual entities in the chart, category pickers, and range sliders.
JFreeChart with NCCP data is used to compare the performance of the web-based visualization tools with that of a standalone off-line tool. JFreeChart is a free Java chart library and a popular tool for off-line data visualization (Quint, 2003); it is well suited to visualizing raw data on a client's desktop and can also generate static images on the server. Figure 11 shows the NCCP data visualized by each of the web-based visualization tools. The simplest graph format (a line graph) was chosen for a fair comparison:
Figure 11. Visualization outputs: (a) Google Chart; (b) OFC; (c) Flex; (d) D3
1. Google Charts: Google Charts is a free web-based data visualization tool. Figure 11(a) shows a sample Google Charts graph with a dashboard. Google Charts is optimized for charting, so it supports various chart types such as line, calendar, maps, and tree map (Young, Jeong & Chang, 2007). It also offers dashboards for high-level interaction with users; the dashboard has a wide selection of rendering features that respond to user actions such as selecting a data period. Additionally, developers who use Google Charts do not need to consider the end-user's device, thanks to cross-browser compatibility (e.g., IE or Firefox) and cross-platform portability (e.g., iPad, iPhone, or Android) (Ying, 2012). Google Charts can use raw data directly and can also use a DataTable to build a chart. The DataTable class sends the transformed data to the visualization engine (Sakamoto, Matsumoto & Nakamura, 2012) and can be reused to switch between different chart types and dashboards. Developers can also easily implement sorting, modifying, finding (max, min), and filtering functions using the DataTable, and Google Charts can exchange data with Google applications such as Google Spreadsheets and Fusion Tables via the DataTable;
2. Open Flash Chart: OFC is also a free web chart tool (Open Flash Chart, 2014). Figure 11(b) shows a sample OFC graph with the same data as Google Charts. It has 12 basic chart types, such as bar, line, 3D, glass, fade, sketch, area, pie, scatter, high-low-close, candle, and mixed scatter. Although OFC lacks the advanced features of other chart tools, it is easier to learn for drawing charts. The chart uses the SVG image format, so the user can interact with it; for example, the user can display information about an entity by hovering the mouse pointer over a point on the line. Like Google Charts, it is also cross-browser and cross-platform. Even though OFC has not been updated since December 2007, many web sites still use it because its source code is freely available;
3. Adobe Flex: Figure 11(c) shows a sample Adobe Flex graph with the same data. Flex is a software development kit (SDK) for the deployment of cross-platform Rich Internet Applications (RIA) based on the Adobe Flash platform (Juszkiewicz, Sakowicz, Mazur & Napieralski, 2011). With RIA, the online application is executed on the client side like a desktop application (Heidenbluth & Schweiggert, 2009), which reduces the workload on the server (Peintner, Kosch & Heuer, 2009). Adobe Flex works cross-platform and uses the SVG format, so a chart created with Adobe Flex can be interacted with locally by the user. Since NCCP's data format is XML (Dittrich, Dascalu & Gunes, 2013), the data can be used natively by Adobe Flex, which uses MXML, an XML-based user interface markup language (Nammakhunt, Romsaiyud, Porouhan & Premchaiswadi, 2012). Data can be visualized in various forms, e.g., pictures, animations, or charts. Adobe Flex requires Adobe Flash to be running on the client machine. The Adobe Flex application used for creating MXML files (Adobe Flex Builder 3.0) is not free, but the SDK is available free of charge;
4. D3.js: Figure 11(d) shows a D3 chart using the same data as Google Charts. D3.js is freely available under the Berkeley Software Distribution (BSD) license. It supports various types of visualization in W3C-compliant computing and is a JavaScript library for data-driven documents (Xiaosong, Winslett, Norris & Xiangmin, 2004). For graphical representation it uses SVG, HTML5, and Cascading Style Sheets (CSS) (Bostock, Ogievetsky & Heer, 2011). D3 is more difficult to use for creating charts than Google Charts, OFC, or Adobe Flex, but it is more powerful for scientific data analysis. D3 can easily illustrate complicated data using its array operations, and it has various functions to assist visualization and data analysis, such as math, ordering, shuffling, permuting, merging, bisecting, nesting, manipulating, mapping, and set operations on collections.
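For the standalone baseline introduced above, a line chart comparable to those in Figure 11 can be produced off-line with JFreeChart. The sketch below is a minimal example using the JFreeChart 1.0 API, not the chapter's actual test harness; the series values are made up for illustration.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.xy.XYSeries;
import org.jfree.data.xy.XYSeriesCollection;

public class JFreeChartBaseline {
    public static void main(String[] args) throws Exception {
        // Build a simple series of (minute index, temperature) points.
        XYSeries series = new XYSeries("Air temperature");
        series.add(0, 21.4);
        series.add(1, 21.6);
        series.add(2, 21.3);

        JFreeChart chart = ChartFactory.createXYLineChart(
                "NCCP temperature (sample)", "Minute", "Temperature",
                new XYSeriesCollection(series), PlotOrientation.VERTICAL,
                true, false, false);

        // JFreeChart renders entirely on the local machine; here the chart is
        // saved as a static image, as a server-side deployment would do.
        ChartUtilities.saveChartAsPNG(new File("temperature.png"), chart, 800, 400);
    }
}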
Table 2 shows the specifications of the four web visualization tools and one off-line chart tool. All of these visualization tools are free of charge. Among the web visualization tools, Google Charts and OFC specialize in creating charts; compared to D3 and Flex, they offer a limited number of data visualization methods. Yet, because they are easy to implement, Google Charts and OFC are often used for creating charts. The data visualization methods of Flex and D3 are quite diverse and flexible, so Flex and D3 are suitable for complex data visualization; they can implement vector graphics, charts, and animations.
Table 2. Comparison of five chart tools
(Of the five tools compared, the four web-based ones, Google Charts, OFC, Flex, and D3, support SVG output and run on the web; JFreeChart is the off-line exception.)
However, a web visualization tool's performance can degrade as the data size grows. For instance, Flex can stop working or slow down dramatically with more than 50,000 data points, whereas Google Charts and D3 can visualize charts with more than 100,000 data points without stopping.
We observed a trade-off between capability and reliability. As shown in Figure 10(a), the web-based visualization process has two parts, one on the client side and one on the server side, and each web visualization tool uses different technologies on each side. Figure 12(a) shows the interaction sequence in Google Charts: the Google server generates the SVG image using the client's parameters and data. Flex and OFC, by contrast, do not need another server to create an SVG image; they use Flash software on the client side to visualize the data, as shown in the interaction diagram of Figure 12(b). In the case of D3, the D3 visualization library is embedded in the client-side web browser to render the visualization; as shown in Figure 12(c), the server responds with the D3 visualization library.
9. PERFORMANCE TESTING OF VISUALIZATION
We measure the visualization tools from both sides. On the server side, we check the data handling time, such as sending the data and formatting the charts; on the client side, we measure the rendering time, such as drawing the background and the charts. Even though each test was executed with the same tools and the same data, the data visualization time varied because of background tasks. To reduce external environmental variables, the browser cache on the client side was cleared, and we averaged the test results. Figure 13 shows the results of the performance comparison. As shown in Figure 13, there are large delays when charts are drawn with more than 100,000 data points using the web visualization tools. We could reduce the delay by pre-processing the data points with an averaging or binning method, but this would lose data fidelity.
Figure 13. Performance comparison of visualization tools
To test the baseline performance of the various web visualization tools, we used the raw data without any pre-processing or additional manipulation. We drew charts with Google Charts, OFC, Flex, and D3 using over 150,000 temperature data points, testing the tools with NCCP data in Google Chrome. We used the Chrome Developer Tools and Speed Tracer to measure the time spent in each stage. Figure 14 shows the Chrome Developer Tools tracking the memory usage of the visualization tools; memory usage, CPU, and the heap area can all be tracked this way. Figure 15 shows Speed Tracer measuring event times in the web visualization tools. Speed Tracer can trace event times such as the XMLHttpRequest time, the data request time, and the duration of calls to the server (Google, 2014), and its event duration information can be divided into several parts. For a more accurate analysis, we divided the processing time into three parts, i.e., Layout Time, Data Transformation Time, and Rendering Time:
Figure 14. Chrome developer with memory timeline
1. Layout Time: Layout Time is the time spent drawing the chart background and chart labels. The layout is drawn first, as soon as the client requests a chart. Figure 13(a) shows the layout times of the web visualization tools; the X-axis indicates the number of data points and the Y-axis indicates the time. According to our test results, Google Charts has a slower layout time than the other tools. As shown in Figure 12(a), the client browser must download the Google Charts library from the Google server, and Google Charts' terms of service require that the visualization libraries be dynamically loaded from the Google server before each use, which adds a delay to the Layout Time. In contrast, the visualization libraries of Flex, OFC, and D3 are either already installed or already present on the client side: the client machine has Flash installed to use OFC and Flex, and the D3 library is embedded within the webpage. In summary, these tools use client-side resources rather than server-side resources, whereas the google.load() method is needed to load the Google Charts library from the Google server;
2. Data Transformation Time: Data Transformation Time is the processing time, mostly on the server side, for data manipulation and visualization. It includes the time to fetch the data from the database, transfer it to the client web browser, transform the raw data (XML) into the input data type (e.g., DataTable), and generate the SVG parameters. OFC, Flex, and D3 do not need to transform the raw data into another input data type: for the test we used NCCP data in XML, and these tools use the raw XML directly (a parsing sketch is given after this list). Google Charts, however, uses a DataTable instead of XML; the Google server receives the converted data, transforms it into SVG parameters, and delivers them to the client web browser. This extra processing increases Google Charts' Data Transformation Time. Figure 13(b) shows the Data Transformation Time of the web visualization tools; Google Charts takes the longest. Figure 16 shows the data transformation flow: the visualization tools convert the raw data into SVG parameters. In D3 and Google Charts, the SVG parameters are passed to the HTML5-enabled web browser, whereas OFC and Flex pass the parameters to Flash for data visualization. In the test we assumed that Flash was already installed by the user, so OFC and Flex are faster than D3 or Google Charts in Data Transformation Time;
3. Rendering Time: Rendering Time is the time taken to draw the chart from the SVG data in the client web browser; it is normally the longest stage in data visualization. Figure 13(c) shows the rendering times of the web tools and also of JFreeChart, which is used to compare off-line standalone programs with on-line web programs. As shown in Figure 13(c), JFreeChart performs more slowly than web visualization tools such as D3 and Google Charts in the Rendering Time test. The Heap Profile function in the Developer Tools can inspect the dynamic memory usage of the visualization tools, and we used it to understand their internal behavior (Google, 2014). Table 3 shows the shallow sizes of memory usage in the heap area for the visualization tools. Over three hundred constructors, such as raw data, date, and number, can be tracked with Chrome Developer Tools; we chose three of them to understand the rendering time of the web visualization tools: Object (SVG), Array (containing the SVG parameter data), and total memory usage. These account for most of the memory used during rendering on the web.
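As referenced in the Data Transformation Time discussion, the sketch below shows the kind of raw-XML-to-table step that D3, OFC, and Flex can skip but that a DataTable-based pipeline must perform. The element and attribute names (record, time, value) are hypothetical; the actual NCCP XML layout is not reproduced here.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlToRows {
    /** One row of the intermediate table handed to the charting layer. */
    record Row(String time, double value) {}

    public static List<Row> parse(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);

        // Hypothetical layout: <record time="..." value="..."/> elements.
        NodeList records = doc.getElementsByTagName("record");
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            Element r = (Element) records.item(i);
            rows.add(new Row(r.getAttribute("time"),
                             Double.parseDouble(r.getAttribute("value"))));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        for (Row row : parse(new File("nccp-sample.xml"))) {
            System.out.println(row.time() + " -> " + row.value());
        }
    }
}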
Table 3. Memory usage of Google Charts, OFC, D3, and Flex (in bytes)
As shown in Table 3, Flex and OFC use less memory than the other tools. The memory usage of Google Charts and D3 grows with the number of data points, whereas Flex and OFC grow only slightly as the data size increases. Flex uses 1.3 MB of memory for 10,000 data points and only 1.7 MB for 50,000 data points, whereas D3 uses 4.2 MB for 10,000 data points and 10.1 MB for 50,000 data points. We do not know exactly how the memory is used during visualization since we do not have the source code; it appears that OFC and Flex convert the data into numbers that Flash then handles for visualization, whereas Google Charts and D3 load the XML file and keep it whole in memory in the client web browser. We leave this analysis as future work.
We also tested the web visualization tools in IE and Firefox and observed the total processing time for data visualization; however, it is difficult to measure the processing time by stage in those browsers. The total processing time in IE and Firefox was essentially the same as in Chrome, so we present only the Chrome data.
Through this test, we found that every tool has an environment to which it is well suited. D3 is best for handling complicated or large amounts of data with reasonable memory usage. Flex is suitable for devices with limited memory and works well for complicated but small data sets. OFC makes it easy to implement various charts using open source code and also uses little memory, so it is the best choice for users who are concerned about memory usage and want a chart-friendly tool. The communication overhead with the server is a disadvantage of Google Charts, but its DataTable is a benefit in the web environment: through the DataTable, Google Charts data can easily be reused with other Google web tools and apps such as Google Sheets, Maps, and so on.
10. CONCLUSION AND FUTURE WORKS
The Extended Web Architecture with Sensor Network is designed to accommodate the environmental data collection and data sharing needs of the Nevada Nexus project. We built the PM and AM as interfaces to a range of client devices and collectors, and we used the MVC design pattern to build the Extended Web Service. In our study, the Arduino proved to be as capable a collector as the CR1000 for environmental sensor data; it can replace a traditional data logging device in some cases and serve as a more cost-effective solution in the sensor network. We used a RESTful Web Service to overcome the limitations of the Arduino: REST is used to send data from the Arduino, and that data is then used by the Google Charts service for visualization. With these environmental monitoring systems, environmental data can be successfully shared among all researchers.
We also tested popular web-based data visualization tools to find the best visualization method for the large volume of data in the NCCP. Using NCCP data, we tested four web visualization tools, namely Google Charts, Flex, OFC, and D3, and we also tested JFreeChart to compare standalone visualization tools with web visualization tools. To compare these tools we divided the data visualization processing time: Layout Time is measured for the background and labels, Transformation Time for data manipulation and data transfer, and Rendering Time for the actual visualization in the client browser. The tests were executed using actual data collected at the NCCP, which has a large volume and variety of data. Given the volume and complexity of the data, D3 proved to be the best tool for NCCP data visualization; it processes large amounts of data quickly and can represent the data in various ways.
This work was previously published in the International Journal of Software Innovation (IJSI), 3(1); edited by Roger Y. Lee and Lawrence Chung, pages 75-94, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
ACKNOWLEDGMENT
This material is based upon work supported by the National Science Foundation under grant number IIA-1301726. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National
Science Foundation.
REFERENCES
Aihkisalo, T. (2012). Latencies of service invocation and processing of the REST and SOAP web service interfaces. In Proceedings of the Services (SERVICES), 2012 IEEE Eighth World Congress (pp. 100-107). IEEE.
Arduino. (2014). Arduino ethernet shield. Retrieved August, 21, 2014, from http://arduino.cc/en/Main/ArduinoEthernetShield
Azemi, A., & Stook, C. (1996). Utilizing MATLAB in undergraduate electric circuits courses. In Proceedings of the Frontiers in Education
Conference, (pp. 599-602). Academic Press.
Bock, Y., Crowell, B., Prawirodirdjo, L., & Jamason, P. (2008). Modeling and on-the-fly solutions for solid earth sciences: Web services and data portal for earthquake early warning system. In Proceedings of Geoscience and Remote Sensing Symposium, IGARSS: IEEE International (vol. 4, pp. 124-126). IEEE. 10.1109/IGARSS.2008.4779672
Bostock, M., Ogievetsky, V., & Heer, J. (2011). D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2301-2309.
Dittrich, A., Dascalu, S., & Gunes, M. (2013). ATMOS: A data collection and presentation toolkit for the Nevada climate change portal.
In Proceedings of the Int’l Conf. on Software Eng. and Applications (ICSOFTEA 2013), (pp. 206-213). Academic Press.
Franco, D., Meyer, N., Pugliese, R., & Zappatore, S. (2010). Remote instrumentation services on the e-infrastructure. Springer Science & Business Media.
Heidenbluth, N., & Schweiggert, F. (2009). Status sensitive components: Adapting rich internet applications to their runtime context.
In Proceedings of Digital Society, (pp. 133-138). Academic Press.
Jianghui, Y., Gracanin, D., & Lu, C-T. (2004). Web visualization of geo-spatial data using SVG and VRML/X3D. In Proceedings of the Multi
Agent Security and Survivability, (pp. 497-500). IEEE.
Juszkiewicz, P., Sakowicz, B., Mazur, P., & Napieralski, A. (2011). The use of Adobe Flex in combination with Java EE technology on the example
of ticket booking system. In Proceedings of CAD Systems in Microelectronics (CADSM), (pp. 317-320). Academic Press.
Ma, X., Winslett, M., Norris, J., & Jiao, X. (2004). GODIVA: Lightweight data management for scientific visualization applications.
In Proceedings of Data Engineering, (pp. 732-743). Academic Press.
Ma, K-L. (2000). Visualizing visualizations: User interfaces for managing and exploring scientific visualization data. IEEE Computer Graphics
and Applications, 16-19.
Peintner, D., Kosch, H., & Heuer, J. (2009). Efficient XML interchange for rich internet applications. In Proceedings of Multimedia and
Expo, (pp. 149-152). IEEE.
Ramachandran, P., & Varoquaux, G. (2011). Mayavi: 3D visualization of scientific data. Computing in Science & Engineering , 13(2), 40–51.
doi:10.1109/MCSE.2011.35
Rong, Y., Wang, J., & Ding, J. (2009). RIA-based visualization platform of flight delay intelligent prediction. In Computing, communication,
control and management, (pp. 94-97). Academic Press.
Sakamoto, Y., Matsumoto, S., & Nakamura, M. (2012). A integrating service oriented MSR framework and Google chart tools for visualizing
software evolution. In Proceedings of Empirical Software Engineering in Practice (IWESEP), (pp. 35-39). Academic Press.
Savory, P. A. (1995). Using mathematica to aid simulation analysis. In Proceedings ofSimulation Conference, (pp. 1324-1328). Academic Press.
Soreide, N. N., Sun, C. L., Kilonsky, B. J., & Denbo, D. W. (2001). A climate data portal. In Proceedings of OCEANS, 2001: MTS/IEEE Conference and Exhibition (pp. 2315-2317). IEEE.
Sungchul, L., Juyeon, J., & Yoohwan, K. (2014). Performance testing of web-based data visualization. In Proceedings of IEEE International Conference on Systems, Man, and Cybernetics. IEEE.
Sungchul, L., Juyeon, J., Yoohwan, K., & Haroon, S. (2014). A framework for environmental monitoring with Arduino-based sensors using restful
web service. In Proceedings of the Services Computing (SCC). IEEE.
Vikatos, P., Theodoridis, E., Mylonas, G., & Tsakalidis, A. (2011). PatrasSense: Participatory monitoring of environmental conditions in urban
areas using sensor networks and smartphones. In Proceedings of the Informatics (PCI), 2011 15th Panhellenic Conference, (pp. 392-396).
Academic Press.
Voulgaropoulou, S., Spanos, G., & Angelis, L. (2012). Analyzing measurements of the R statistical open source software. In Proceedings of Software Engineering Workshop (SEW), 2012 35th Annual IEEE (pp. 1-10). IEEE. 10.1109/SEW.2012.7
Xixi, G., Yuehui, J., Yidong, C., & Tan, Y. (2012). Web visualization of distributed network measurement system based on HTML5. In Proceedings of Cloud Computing and Intelligent Systems (CCIS) (pp. 519-523). IEEE.
Xuelei, W. C., Chen, J., & Rong, B. (2009). Web service architecture and application research. In Proceedings of the EBusiness and Information
System Security. Academic Press.
Yingqi, H. (2011). Application research on deformation monitoring of SMW method H steel pile using BOTDA. In Proceedings of Multimedia
Technology, (pp. 3862 – 3865). Academic Press.
Young, L., Jeong, Y., & Chang, K.H. (2007). Metrics and evolution in open source software. In Proceedings of Quality Software, (pp. 191-197).
Academic Press.
Zhu, Y. (2012). Introducing Google chart tools and Google maps API in data visualization courses. IEEE Computer Graphics and Applications, 6-
9.
CHAPTER 33
Web Intelligence:
A Fuzzy Knowledge-Based Framework for the Enhancement of Querying and Accessing Web Data
Jafreezal Jaafar
Universiti Teknologi PETRONAS, Malaysia
Kamaluddeen Usman Danyaro
Universiti Teknologi PETRONAS, Malaysia
M. S. Liew
Universiti Teknologi PETRONAS, Malaysia
ABSTRACT
This chapter discusses the veracity of data. The veracity issue is the challenge of imprecision in big data due to the influx of data from diverse sources. To overcome this problem, the chapter proposes a fuzzy knowledge-based framework that enhances the accessibility of Web data and resolves inconsistency in the data model. D2RQ, Protégé, and fuzzy Web Ontology Language applications were used for configuration and performance evaluation. The chapter also provides a completeness fuzzy knowledge-based algorithm, which was used to determine the robustness and adaptability of the knowledge base. The results show that D2RQ is the more scalable option in the performance comparison. Finally, the conclusion and future lines of research are provided.
INTRODUCTION
There is growing hype in today's world of data: data that changes human interaction by leveraging the power of accessibility. Pervasive applications and tools such as phones, computers, and cars have built up knowledge bases that require large data stores and careful management. MapReduce, the Resource Description Framework (RDF), and the SPARQL Protocol and RDF Query Language (SPARQL) are current data science technologies that enable Web users to access and query information in a suitable way.
Because of this plethora of data, database structures need to be enhanced in ways that ease querying for proper processing and big data exploration. This is a challenge in big data management, where redundant and unorganized information is the concern. A schema is the backbone of every database, but it is not sufficient when the data is unstructured. Similarly, the reusability of data declines as it grows every day on the Web and becomes increasingly unstructured: the presence of unstructured data reduces data integrity and makes reuse difficult. Many distributed databases and database schemas are now connected to large bodies of information, and large amounts of information on the Web cause problems of uncertainty and imprecision when the data is accessed and queried. Uncertainty, imprecision, and inconsistency in the data model are among the tactical challenges of big data (Jewell et al., 2014; Savas, Sagduyu, Deng, & Li, 2014). This uncertainty or imprecision of data is called veracity, which is among the dimensions of big data. Significantly, big data tools such as MapReduce and Hadoop deal with both structured and unstructured data to simplify accessibility. The main contribution of this chapter is therefore a fuzzy knowledge-based framework that serves as a channel for accessing and querying big data. The framework guides the provision of precise information to the user by querying multiple data sources, allowing the machine to reason intelligently. The reasoning approach of this work specifically uses fuzzy logic-based systems, and the proposed work addresses the scalability of the fuzzy knowledge base (KB).
The chapter proceeds as follows. The second section discusses background knowledge on Web intelligence, big data, and uncertainty. The third section gives an overview of the proposed framework, including the general architecture of the system and the fuzzy ontology. Section four presents the implementation process, section five discusses the results of the proposed framework, and section six provides the conclusion and future work.
WEB INTELLIGENCE
Web information has a great impact on human life, especially where world knowledge involves uncertainty and imprecision. Web intelligence concerns the use of the WWW to retrieve information from storage efficiently. It acts through agents by which the machine can reason and convey messages using Web tools: an agent goes around and integrates resources, finally presenting the information to the user through a Web page. The resources, or things, depend on RDF, which links the concepts together through Uniform Resource Identifiers (URIs). Web intelligence is a well-known research area at the convergence of subjects such as artificial intelligence, databases, Web science, the Semantic Web, and information retrieval (Berners-Lee et al., 2006; Camacho et al., 2013; Shadbolt & Berners-Lee, 2008; Shroff, 2013; Williams et al., 2014). Reasoning over Web data is therefore the first step in finding the solution to a problem in a knowledge-based system.
To work with intelligence, it is necessary to define knowledge acquisition, knowledge inference, deduction, and knowledge representation so that the conceptualization of a knowledge model is suitable for reasoning (Camacho et al., 2013; G'abor, 2007; Russell et al., 2010; Zadeh, 2004). This makes the ontology machine-processable and allows a precise interpretation of the knowledge representation. Knowledge representation has three major flavors: concepts, relations, and instances (C, R, and I). Generally, these representations are understood by humans; when knowledge is represented through concepts and related to humans using symbols for a particular group, the process is referred to as conceptualization. In accordance with Grimm et al. (2007), axioms are sets of ontological statements that can be expressed in the form of vocabularies, for example, "the weather is hot" or "only those who are registered with MetOcean can know the wind direction at the South China Sea". Both consist of instances and concepts in a particular domain. Encoding such axiomatic concepts using semantic logic notations reduces the complexity of the information, and the Web Ontology Language (OWL) is widely accepted by the Semantic Web community as the compromise tool for reasoning between human and machine (Thomas & Sheth, 2006).
Information processing in a big data environment, with both structured and unstructured data, is an essential aspect of today's world of data. It is a data-to-human relationship established through accessing, sharing, and visualizing raw data. Nevertheless, this decentralizes structural information and knowledge representation. Implementing fuzziness here is challenging, but it offers a solution for data-related industries. Fuzzy knowledge representation deals with rules that can be interpreted as:
IF a is X AND b is Y THEN c is Z
This statement describes how the logic infers from the rule base in an inference system that finally produces the result (antecedent to consequent). To achieve and validate the veracity of big data, such a statement must hold in the knowledge base.
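A minimal sketch of how such a rule can be evaluated numerically is shown below, using the minimum for AND and clipping the consequent by the rule strength (a common Mamdani-style reading). The membership values and variable names are made-up illustrations, not part of the chapter's knowledge base.

public class FuzzyRuleSketch {
    /** Goedel/Zadeh t-norm used for AND: the minimum of the antecedent degrees. */
    static double and(double a, double b) {
        return Math.min(a, b);
    }

    public static void main(String[] args) {
        // Degrees to which "a is X" and "b is Y" hold (illustrative values).
        double aIsX = 0.8;
        double bIsY = 0.6;

        // Firing strength of: IF a is X AND b is Y THEN c is Z.
        double firingStrength = and(aIsX, bIsY);

        // The consequent "c is Z" is asserted to at most this degree,
        // i.e. the conclusion is clipped by the antecedent strength.
        double cIsZ = firingStrength;
        System.out.println("Degree to which 'c is Z' holds: " + cIsZ);   // prints 0.6
    }
}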
BIG DATA AND THE UNCERTAINTY
Uncertainty and inconsistency in the data model are among the tactical big data challenges (Savas et al., 2014). As the Web extends, the demand for precision (precise information) also increases, which is clearly a problem of ambiguity: "The more uncertainty in a problem, the less precise we can be in our understanding of that problem" (Ross, 2010). Indeed, uncertainty is also the veracity issue of big data (Jewell et al., 2014). There are currently five dimensions of big data, called the 5 Vs: Volume, Velocity, Value, Veracity, and Variety. This chapter attempts to contribute to one of these Vs, namely Veracity. Veracity is described as the quality or trustworthiness of the data (Hurwitz, Nugent, Halper, & Kaufman, 2013; Zikopoulos, Parasuraman, Deutsch, Giles, & Corrigan, 2012). In addition, Demchenko et al. (2013) described Veracity as the certainty or consistency of the data arising from its statistical reliability.
Establishing trust or consistency in the database widens the gap between the Web reasoning approach and human thinking. People retrieve information via Web browsers, which leads to an incessant increase of data in databases. The collection of large datasets from different repositories, managed in a suitable way, can therefore be referred to as big data; the aim is to reduce the complexity caused by the growth of massive data from different applications. Mohanty et al. (2013) likewise defined big data as the combination of these interactivities and transactions of data.
The tools currently used in big data ecosystems are emerging technologies such as MapReduce, the Hadoop Distributed File System (HDFS), and non-relational (NoSQL) databases. The database engine, or agent, in the knowledge base transforms the data values. MapReduce is the product of an algorithm developed by Google to manage its large datasets; Apache then used this algorithm to produce an open source project called Hadoop, an Apache-managed system derived from Google BigTable and MapReduce. Hadoop basically contains two parts: MapReduce, the programming part, and the Hadoop Distributed File System (HDFS), the file system. Hadoop also has a number of prominent related projects that manage big data, including HBase, Cassandra, Hive, Pig, Chukwa, ZooKeeper, Amazon SimpleDB, Cloudata, CouchDB, AllegroGraph, MongoDB, and others. Some of these databases run on top of HDFS and are also called NoSQL databases. Since this chapter deals with fuzzy concepts, Hive and HBase were selected for better scalability in big data management (see Figure 1). Similarly, HBase and HDFS have been found to be fault-tolerant applications when integrated with Hadoop (Zaharia et al., 2012).
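The MapReduce programming part of Hadoop can be illustrated with the classic word-count mapper and reducer, written against the standard org.apache.hadoop.mapreduce API. This is a generic sketch of the programming model, not code from the chapter's framework.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    /** Map step: emit (word, 1) for every token in the input line. */
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce step: sum the counts emitted for each word. */
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}

The input and output of such a job live in HDFS, the file-system half of Hadoop described above.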
Hive
Hive is a data warehouse system layer built on Hadoop. It allows the developer to define a structure for unstructured big data, and this approach simplifies analysis and queries with an SQL-like scripting language called HiveQL. Hive queries have high latency because Hive processes data kept in HDFS. This makes the system easy to use without knowing MapReduce. Apache Hive is one of the most effective standards for SQL queries over petabytes of data in Hadoop (Connolly, 2012).
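HiveQL can be issued from Java through the HiveServer2 JDBC driver, as in the following minimal sketch. The host, database, table, and column names are placeholders, not the chapter's actual deployment, and the Hive JDBC driver jar is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (shipped with Hive).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC endpoint (placeholder host and database).
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // SQL-like HiveQL over data kept in HDFS (hypothetical table).
            ResultSet rs = stmt.executeQuery(
                "SELECT station, COUNT(*) FROM metocean_readings GROUP BY station");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " : " + rs.getLong(2));
            }
        }
    }
}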
HBase
In big data management, HBase is a columnar database engine whose interactive shell is based on a Ruby query language. It uses the MapReduce engine and the Hadoop framework for its data processing and is efficient for working with non-relational data (Hurwitz et al., 2013); such a NoSQL database is therefore a good complement to relational data. Connolly (2012) considers Apache HBase a NoSQL database that runs on top of HDFS. Owing to this proficiency, HBase supports Web-based implementations of big data.
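A minimal HBase client interaction, using the standard org.apache.hadoop.hbase.client API, looks roughly as follows. The table name, column family, row key, and value are hypothetical and chosen only to illustrate the column-oriented write and read path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("readings"))) {

            // Store one value under row key "station1#2014-08-21T10:15", family "d".
            Put put = new Put(Bytes.toBytes("station1#2014-08-21T10:15"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.4"));
            table.put(put);

            // Read the value back.
            Result result = table.get(new Get(Bytes.toBytes("station1#2014-08-21T10:15")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
            System.out.println("temp = " + Bytes.toString(value));
        }
    }
}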
NoSQL
NoSQL, or "not only SQL", is an approach to big data management that covers resource-, column-, and graph-oriented database systems. It deals with relational and non-relational databases and maps onto the architecture for better scalability.
FRAMEWORK OVERVIEW
This part gives an overview of the process and functionality of the proposed fuzzy knowledge-based framework for enhancing the querying and accessing of Web data. Figure 2 describes the overall architecture as well as the process of integrating fuzzy ontology into big data.
Figure 2. The framework architecture
Fuzzy Ontology
The Web Ontology Language (OWL) is the compromise tool for reasoning between human and machine that has been widely accepted by the Semantic Web community. An ontology is an explicit specification and representation of a particular domain of interest that allows the domain to be understood in terms of specific concepts (Gruber, 1993). The current standard of OWL is OWL 2, standardized by the World Wide Web Consortium (W3C) in 2009. It mechanizes the interpretability of information on the Web. The standard is well suited to applications that need to process the content of Web data rather than simply present information to humans. Several studies have shown that OWL relates closely to the expressiveness and computational nature of a Web knowledge base (Bobillo, Delgado, Gómez-Romero, & Straccia, 2009; Lukasiewicz & Straccia, 2009; Sugumaran & Gulla, 2012; Thomas & Sheth, 2006; Yu, 2011). This notion has therefore become the philosophical basis of knowledge representation.
To design the knowledge base, the TBox (terminological part) and ABox (assertional part) must be defined with respect to the vocabularies of the datasets. The fuzzy TBox provides the general rules for data retrieval in the database. The triples (S, P, O) are the instances in the database; the database space contains the terminological files, and the instances are the input to the system. The fuzzy ABox, on the other hand, starts from the basic facts or axioms used for the syntactic decomposition of the rules over the instances in the database. Fuzziness here describes each triple and unifies it with a rule body of one or more atoms (see Figure 3).
Available Resource
The resources available in the knowledge base, such as the time series data, are envisioned to carry out their tasks by integrating with the relevant concepts. The data is allocated to distributed files with the aid of APIs. For instance, the resources generated from the Malaysian time series data may include all the connections between people and companies as well as the variables (wind speed, wind direction, etc.).
Intelligent Tasks
The available resources are processed by humans, and the applications then generate the tasks, which are configured by the Hadoop systems. Optimality is based on both the human and the machine sides of the knowledge base, and reliability follows from the automated processing power of Hadoop.
Map
Hadoop takes over the automation of the intelligent tasks; the mapping step follows the automated tasks and maps each specified concept to an available reducer.
Reduce
In this part Hadoop makes its major contribution: it processes the nodes from the available distributed files.
Web User Interface
The Web user interface for this framework is a SPARQL (SPARQL Protocol and RDF Query Language) endpoint. SPARQL is used to ease data accessibility by pulling data from the available resources at the HBase level; it enables data expressions, or queries, across different sources. It is also standardized by the W3C, with SPARQL 1.1 as the current version. The data can be viewed as raw RDF or through optional graph patterns.
Querying the non-RDF MetOcean data through the D2R server's SPARQL endpoint allows the database to be accessed as a graph. The data is dumped and accessed using the general API. In the case study of meteorological and oceanographic data (MetOcean), the database's entities and properties connect the entities together, which finally yields completely interlinked data in the MetOcean knowledge base.
Fuzzy Knowledge Base
The concepts are divided into two segments: (i) the TBox, which consists of intensional information, i.e., terms that identify common properties of the concepts; and (ii) the ABox, which contains extensional knowledge about the individuals associated with the domain. To decide whether the knowledge base is entailed, it is necessary to identify which level of information complexity can be used with respect to the knowledge domain. This work therefore checks the logical inference for every concept in the knowledge base and enumerates its truth value for a particular sentence. Note that to establish the robustness of the fuzzy KB, the system must be satisfiable (Russell et al., 2010).
The Resource Description Framework (RDF) represents information in the form of triples. A triple is a statement consisting of a subject, a predicate, and an object (S, P, O). For example, in "I love PETRONAS", "I" is the subject, "love" is the predicate, and "PETRONAS" is the object. Fuzzy RDF arises when a triple statement has a truth value, i.e., is true to some degree, say 0.8. In fuzzy logic a statement therefore takes a value in the interval [0, 1]: rather than being simply false or true, its truth is a matter of degree.
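The example triple, together with a truth degree, can be represented with Apache Jena, for instance by reifying the statement and attaching the degree as a literal. The namespace and the degree property below are hypothetical; fuzzy RDF itself does not prescribe a single encoding.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.ReifiedStatement;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;

public class FuzzyTripleSketch {
    public static void main(String[] args) {
        String ns = "http://example.org/metocean#";   // hypothetical namespace
        Model model = ModelFactory.createDefaultModel();

        Resource i = model.createResource(ns + "I");
        Property love = model.createProperty(ns, "love");
        Resource petronas = model.createResource(ns + "PETRONAS");

        // The crisp triple (S, P, O): "I love PETRONAS".
        Statement stmt = model.createStatement(i, love, petronas);
        model.add(stmt);

        // Attach a truth degree of 0.8 to the statement via reification
        // (ex:degree is an illustrative, made-up property).
        ReifiedStatement reified = stmt.createReifiedStatement();
        reified.addLiteral(model.createProperty(ns, "degree"), 0.8);

        model.write(System.out, "TURTLE");
    }
}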
For the fuzzy knowledge base, it is supposed that an RBox (the role part) is also defined, so that three parts of the KB work with subsumption over the datasets: the fuzzy TBox, the fuzzy ABox, and the fuzzy RBox. The degree of trust lies in [0, 1]. With respect to fuzzy rules, the ABox contains assertions of the form ⟨C(x) ≥ n⟩ for individual (concept) assertions and ⟨R(x, y) ≥ n⟩ for property assertions, together with x ≈ y for individual equivalence. For example:
⟨PETRONAS ⊑ ∃participate.MetOcean ⊓ ∃produce.TimeSeriesData ≥ 0.7⟩
This means that the PETRONAS concept is subsumed by the concept of MetOcean participants (Participate MetOcean) that produce time series data, with at least 0.7 as the truth degree. This is a general concept inclusion (GCI) axiom as defined by Horrocks et al. (2006): a GCI relates two concepts C and D in the form C ⊑ D. Under a fuzzy interpretation I, a fuzzy GCI ⟨C ⊑ D ≥ n⟩ is satisfied when the implication C^I(x) → D^I(x) holds to degree at least n for every individual x, that is, when the infimum over x of C^I(x) → D^I(x) is at least n. The semantics of equivalent classes C ≡ D requires C^I(x) = D^I(x) for all x, and the fuzzy ABox and RBox assertions are satisfied analogously: ⟨C(a) ≥ n⟩ is satisfied iff C^I(a) ≥ n, and ⟨R(a, b) ≥ n⟩ is satisfied iff R^I(a, b) ≥ n. Hence the fuzzy interpretation satisfies the knowledge base.
The Query Concept
With respect to the above knowledge base, suppose for example that we want to find the MetOcean participants that produce time series data:
Query (?x, ?y): Participant(?y) ⋀ has time series data(?y, 1) ⋀ PETRONAS(?x) ⋀ produce(?x, ?y).
Figure 4 shows the corresponding SPARQL query for MetOcean participants, following the algorithm.
The above query result is achieved with the aid of the fuzzy reasoning algorithm. When the data satisfies line 2 (the ABox) and line 5, the algorithm computes and determines whether the knowledge is entailed (line 7). The query thus shows that the MetOcean participant has time series data, which is true (since it satisfies the "1" condition). This gives the user precise information about the time series data, with the given dates, in RDF format.
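The query pattern above corresponds roughly to the following SPARQL SELECT, here executed against the D2R server's endpoint with Jena ARQ. The endpoint URL, prefix, and property names are placeholders introduced for illustration; the chapter's actual query appears in Figure 4 and the Appendix.

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class MetOceanQuerySketch {
    public static void main(String[] args) {
        // Placeholder endpoint and vocabulary.
        String endpoint = "http://localhost:2020/sparql";
        String queryText =
            "PREFIX mo: <http://example.org/metocean#>\n" +
            "SELECT ?x ?y WHERE {\n" +
            "  ?y a mo:Participant ;\n" +
            "     mo:hasTimeSeriesData \"1\" .\n" +
            "  ?x a mo:PETRONAS ;\n" +
            "     mo:produce ?y .\n" +
            "}";

        Query query = QueryFactory.create(queryText);
        // Send the SELECT to the remote SPARQL endpoint and print the bindings.
        try (QueryExecution exec = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution soln = results.next();
                System.out.println(soln.get("x") + " produces " + soln.get("y"));
            }
        }
    }
}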
Fuzzy Reasoning Algorithm
Implementation
An interactive installation of the SAP BusinessObjects Business Intelligence Platform was run on Windows 7 Home Premium Service Pack 1 with a 32-bit operating system. The hardware consisted of 4.00 GB of RAM and an Intel i5 CPU at 3.20 GHz. The software included the XAMPP package, D2RQ, Jena, and AllegroGraph 4.10; VMware (VMware Player) was also installed on the same machine.
D2RQ and XAMPP were installed to create the localhost and to handle access to the relational data as RDF, the mapping processes, and some supporting URIs. Similarly, Protégé 4.1 and other extensions were installed for ontologies and reasoning. Java 7.2, the JDK, Eclipse 4.3, and JDBC had been installed earlier. Finally, the meteorological datasets were loaded on the same machine.
EVALUATION
This part discusses the results of querying the knowledge base through the SPARQL endpoint. The goal of this work was to present a framework for accessing and querying the data in the KB, so the evaluation focuses on the performance of the KB framework and compares the system using D2RQ, Jena ARQ, and AllegroGraph. At the end of this section, an evaluation of the fuzzy knowledge base is discussed to give a comprehensive account of the reasoning results.
Performance
As described in the experimental settings above, the testing was conducted on the same machine via the SPARQL endpoint. There are many methods for evaluating a SPARQL query. Since the target of this SPARQL setup is to retrieve data from the database based on the user's decision, the F-measure is suitable here; the precision and recall method has been found to be the most suitable way of evaluating the user's decision (Auer et al., 2013; Makhoul et al., 1999).
Performance Metrics
The triplestore and the date concepts are the two major concepts considered for these metrics. The loading time was measured as shown in Table 1, and the F-measure was used to evaluate the accuracy of the concepts (see Table 2). The F-measure (F-score) is a well-known method used in information retrieval for weighing the accuracy of the variables (Ehrig & Staab, 2004; Makhoul et al., 1999; Sasaki, 2007; Truong, Duong, & Nguyen, 2013). It combines precision and recall as its two component metrics, weighted by a non-negative value α. Let Ncorrect be the number of correct (relevant) results and Nincorrect the number of incorrect (irrelevant) results. The F-measure Fα is then computed from precision and recall as defined below:
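A standard information-retrieval formulation of these quantities is the following; here N_relevant (the total number of relevant items) is an auxiliary symbol introduced only for clarity, and the α-weighted form shown is one common parameterization, with α = 1 giving the balanced F1 score.

P = \frac{N_{\mathrm{correct}}}{N_{\mathrm{correct}} + N_{\mathrm{incorrect}}}, \qquad
R = \frac{N_{\mathrm{correct}}}{N_{\mathrm{relevant}}}, \qquad
F_{\alpha} = \frac{(1+\alpha)\, P\, R}{\alpha P + R}, \qquad
F_{1} = \frac{2PR}{P + R}.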
Table 1. Loading time in millisecond of triples and date concepts
10 825 350
Table 2. Concept retrieval optimization
Results
When a user queries through the SPARQL interface, the result appears based on the user's selection. For instance, a MetOcean user may want to query from the non-information resources to the corresponding MetOcean information resources; the result of query 1 (Appendix) presents the distinct MetOcean URI representations (see Figure 5).
Similarly, suppose the user wants to find the date concept of the Malaysian state time series. The domain user may say: "show me the dates used in the Malaysian time series (limited to only 10 triples)". The result of this statement appears in Figure 6, which shows that the approach is sufficient for querying the data in the form of triples. Furthermore, with respect to the performance metrics of these concepts, they attained the general optimization of precision and recall (Figure 7). At point 1 on the x-axis (metrics), the precision is 0.74 (Fα) while the recall is 0.60 (Fα); at a precision of 0.87 (Fα) the recall is 0.55 (Fα); and when the precision is 0.89 (Fα) the recall is 0.64 (Fα). This is the required optimization at each of the precision and recall points, and the results are promising: the retrieved results are both optimized and relevant to the user's query.
Figure 6. Triple result limited with only 10 queries
Furthermore, considering the loading time of the MetOceanSemWeb date concepts, the triples appear in Table 1 according to the execution limit. This shows that the MetOcean data can be mapped to create rich pathways for querying the resources; the concepts thus allow smooth connections between datasets. Table 3 provides the evaluation of the system without a limit, using the code in the Appendix.
Table 3. Performance comparisons in milliseconds
Table 4. Evaluation query result
In contrast, the system is very scalable when querying the SPARQL database directly. This is essential, as found in Hassanzadeh et al. (2009) and Samwald et al. (2011), where the D2R server automatically queried clinical trial data from 158 countries comprising 7,011,000 triples. The D2RQ annotation facets over the relational data also proved more scalable than in Vavliakis et al. (2011), who used Jena and Sesame for their benchmarks. Correspondingly, Vandervalk et al. (2009, 2013) tested their Bio2RDF SPARQL queries over relational data but without seamless visualization, whereas this framework provides more interesting visualization. Thus, the method employed in this work can reasonably be called scalable with respect to the existing approaches.
Fuzzy Knowledge Base Evaluation
Since SPARQL does not perform reasoning by itself, the query language needs inference. This part evaluates the results using fuzzy KB reasoning and a proof about the algorithm. Notably, managing the database, instance creation, and knowledge annotation are the right instruments for knowledge assessment (Stuckenschmidt & Van Harmelen, 2004). The benchmark here therefore follows Gödel's method for evaluating the algorithm (Bobillo et al., 2009; Kleene, 1976; Van Heijenoort, 1977). Three criteria were evaluated: expressivity, consistency, and satisfiability.
Completeness of the Reasoning Procedure for Fuzzy ALC
This section uses the completeness method to evaluate the performance of the algorithm. There are four ways to assess an algorithm's performance: completeness, optimality, time complexity, and space complexity (Ertel, 2011; Russell et al., 2010). This chapter considers the completeness theorem in order to support the processes given above; some parts of this have been discussed in Danyaro et al. (2012). The target is to show the consistency and satisfiability of the given knowledge base. The reader should note that the fuzzy ALC (Attributive Language with Complements) expressed here depends on fuzzy subsumption over a fuzzy-concept knowledge base, which generalizes the modeling of uncertain knowledge.
Proof:
Suppose the knowledge base KB is satisfiable; then there is a fuzzy interpretation I that satisfies every constraint in the KB. Let A be the (complete) ABox and define I from A as follows:
1. The domain of I consists of the individual elements occurring in A.
2. Each atomic class in A is interpreted by assigning to every individual the truth degree asserted for that class in A.
3. Each atomic role Ro is interpreted analogously, assigning to every pair of individuals x, y in A the degree asserted for Ro(x, y).
4. All remaining degrees are left unspecified.
Now, using an inductive technique on the structure of concepts and roles, every concept and role interpreted by I agrees with A, so I is a model (Klir & Yuan, 1995). For an atomic concept C the claim holds by definition, since A is complete and contains the assertion for C together with its degree, so I contains the corresponding interpretation; applying concept negation preserves the claim, so that case is immediate.
Next, it must be shown that I is satisfiable. Assume I is not satisfiable; then, without loss of generality, some constraint is unsatisfiable, and its truth value is given by a supremum (or infimum) over the domain. If that supremum or infimum did not exist, the value would be undefined, which contradicts the construction above; the assumption would therefore make the given KB unsatisfiable, a contradiction. Hence I is satisfiable. This proof indicates that the logic is decidable under Gödel's semantics, and the approach supports the interpretation of different types of imprecise knowledge related to real-world applications.
However, it differs from Józefowska et al. (2008, 2010), whose proof is valid in the query trie for Skolem substitution only; their completeness was also proved by induction on the length of a query, assuming that for any query of length ≥ 1 an equivalent query exists in the trie. Similarly, Zhao and Boley (2010) proved completeness, but restricted to constraint sets. Hence, the approach in this chapter exhibits the desirable completeness property, which can derive any entailed statement. This could minimize the imprecision in the database and improve trustworthiness.
CONCLUSION
This chapter analysed the benchmark of the database query approach using precision and recall. The results show that the smooth connections make the transformation of the data a good path for utilizing the MetOcean data (as a case study) and benefiting its users. They also show that D2RQ is the most scalable API among those compared. The approach can therefore contribute to solving the problem of determining relevant information in big data. Moreover, the chapter demonstrated the application of fuzziness as a knowledge instrument for managing the database: the fuzzy KB was constructed using a rule-based approach and ultimately achieved satisfiability, consistency, and decidability of the knowledge base.
In summary, executing this framework for decision making will reduce cost and improve big data performance, that is, effective intelligence. The framework also provides the essential steps toward reducing the problem of imprecision. The basics of big data and the architecture of the fuzzy knowledge base have been elucidated, with the system built on the TBox and ABox. The algorithm is found to be sufficient for the kind of knowledge base that will help in querying Web data, and the method satisfied the operations required for dealing with the fuzzy knowledge base. Thus, this serves as a way of enhancing the querying of big data. In future work, the analysis of big data evaluation will be discussed extensively, and heavier computations such as statistical analysis as well as Extract, Transform and Load (ETL) processing will be conducted.
This work was previously published in a Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence edited by Noor Zaman, Mohamed Elhassan Seliaman, Mohd Fadzil Hassan, and Fausto Pedro Garcia Marquez, pages 83-104, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Auer, S., Lehmann, J., Ngonga Ngomo, A. C., & Zaveri, A. (2013). Introduction to Linked Data and its lifecycle on the Web. In Rudolph, S., Gottlob, G., Horrocks, I., & Harmelen, F. (Eds.), Reasoning Web: Semantic Technologies for Intelligent Data Access (Vol. 8067, pp. 1–90). New York, NY: Springer Berlin Heidelberg. doi:10.1007/978-3-642-39784-4_1
Berners-Lee, T., Hall, W., & Hendler, J. A. (2006). A framework for web science. Foundations and Trends in Web Science , 1(1), 1–130.
doi:10.1561/1800000001
Bobillo, F., Delgado, M., Gómez-Romero, J., & Straccia, U. (2009). Fuzzy description logics under Gödel semantics. International Journal of
Approximate Reasoning , 50(3), 494–514. doi:10.1016/j.ijar.2008.10.003
Bobillo, F., & Straccia, U. (2011). Fuzzy ontology representation using OWL 2. International Journal of Approximate Reasoning ,52(7), 1073–
1094. doi:10.1016/j.ijar.2011.05.003
Bobillo, F., & Straccia, U. (2013). Aggregation operators for fuzzy ontologies. Applied Soft Computing , 13(9), 3816–3830.
doi:10.1016/j.asoc.2013.05.008
Camacho, D., Moreno, M. D., & Akerkar, R. (2013). Challenges and issues of web intelligence research. Paper presented at the 3rd International
Conference on Web Intelligence, Mining and Semantics, Madrid, Spain. 10.1145/2479787.2479868
Danyaro, K., Jaafar, J., & Liew, M. (2012). Completeness Knowledge Representation in Fuzzy Description Logics . In Lukose, D., Ahmad, A., &
Suliman, A. (Eds.), Knowledge Technology (Vol. 295, pp. 164–173). New York: Springer Berlin Heidelberg. doi:10.1007/978-3-642-32826-8_17
Demchenko, Y., Grosso, P., de Laat, C., & Membrey, P. (2013).Addressing big data issues in scientific data infrastructure. Paper presented at the
2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, CA. 10.1109/CTS.2013.6567203
Ehrig, M., & Staab, S. (2004). QOM – Quick Ontology Mapping . In McIlraith, S., Plexousakis, D., & Harmelen, F. (Eds.), The Semantic Web –
ISWC 2004 (Vol. 3298, pp. 683–697). New York: Springer Berlin Heidelberg. doi:10.1007/978-3-540-30475-3_47
Gábor, N. (2007). Ontology Development. In Studer, R., Grimm, S., & Abecker, A. (Eds.), Semantic Web Services (pp. 107–134). New York:
Springer Berlin Heidelberg. doi:10.1007/3-540-70894-4_4
Grimm, S., Hitzler, P., & Abecker, A. (2007). Knowledge Representation and Ontologies Logic, Ontologies and Semantic Web Languages . In
Studer, R., Grimm, S., & Abecker, A. (Eds.),Semantic Web Services (pp. 51–105). New York: Springer.
Hassanzadeh, O., Kementsietsidis, A., Lim, L., Miller, R. J., & Wang, M. (2009). Linkedct: A linked data space for clinical trials.arXiv preprint
arXiv:0908.0567.
Hassanzadeh, O., Xin, R., Miller, R. J., Kementsietsidis, A., & Wang, M. (2009). Linkage Query Writer. Proc. VLDB Endow., 2(2), 1590-1593. doi:
10.14778/1687553.1687599
Hurwitz, J., Nugent, A., Halper, F., & Kaufman, M. (2013). Big Data For Dummies . Hoboken, NJ: John Wiley & Sons.
Jewell, D., Barros, R. D., Diederichs, S., Duijvestijn, L. M., Hammersley, M., Hazra, A., & Plach, A. (2014). Performance and Capacity
Implications for Big Data . IBM Redbooks.
Józefowska, J., Ławrynowicz, A., & Łukaszewski, T. (2008). On Reducing Redundancy in Mining Relational Association Rules from the Semantic
Web . In Calvanese, D., & Lausen, G. (Eds.),Web Reasoning and Rule Systems (Vol. 5341, pp. 205–213). New York: Springer Berlin Heidelberg.
doi:10.1007/978-3-540-88737-9_16
Józefowska, J., Ławrynowicz, A., & Łukaszewski, T. (2010). The role of semantics in mining frequent patterns from knowledge bases in
description logics with rules. Theory and Practice of Logic Programming , 10(3), 251–289. doi:10.1017/S1471068410000098
Klir, G. J., & Yuan, B. (Eds.). (1995). Fuzzy sets and fuzzy logic: theory and applications . New Jersey: Prentice Hall.
Lukasiewicz, T., & Straccia, U. (2009). Description logic programs under probabilistic uncertainty and fuzzy vagueness. International Journal of
Approximate Reasoning , 50(6), 837–853. doi:10.1016/j.ijar.2009.03.004
Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. (1999).Performance measures for information extraction. Paper presented at the
DARPA Broadcast News Workshop.
Mohanty, S., Jagadeesh, M., & Srivatsa, H. (2013). Big Data Imperatives: Enterprise ‘Big Data’ Warehouse, ‘BI’ Implementations and Analytics.
Apress. doi:10.1007/978-1-4302-4873-6
Ross, T. J. (2010). Fuzzy Logic with Engineering Applications . John Wiley & Sons, Ltd.doi:10.1002/9781119994374.index
Russell, S. J., Norvig, P., & Davis, E. (2010). Artificial intelligence: a modern approach (Vol. 2). Prentice Hall.
Samwald, M., Jentzsch, A., Bouton, C., Kallesøe, C. S., Willighagen, E., Hajagos, J., . . . Pichler, E. (2011). Linked open drug data for
pharmaceutical research and development. Journal of Cheminformatics, 3(1), 19.
Shadbolt, N., & Berners-Lee, T. (2008). Web Science Emerges. Scientific American, 299(4), 76–81. doi:10.1038/scientificamerican1008-76
Shroff, G. (2013). The intelligent web: search, smart algorithms, and big data . Oxford, UK: Oxford University Press.
Stuckenschmidt, H., & Van Harmelen, F. (2004). Information sharing on the semantic web . Springer.
Sugumaran, V., & Gulla, J. A. (2012). Applied semantic web technologies . CRC Press.
Thomas, C., & Sheth, A. (2006). On the expressiveness of the languages for the semantic web — Making a case for ‘a little more’. In S. Elie
(Ed.), Capturing Intelligence (Vol. 1, pp. 3-20). Netherlands: Elsevier.
Truong, H. B., Duong, T. H., & Nguyen, N. T. (2013). A Hybrid Method For Fuzzy Ontology Integration. Cybernetics and Systems, 44(2-3), 133–
154. doi:10.1080/01969722.2013.762237
Van Heijenoort, J. (1977). From Frege to Gödel: a source book in mathematical logic, 1879-1931 (Vol. 9). Harvard University Press.
Vandervalk, B., McCarthy, E. L., & Wilkinson, M. (2009). SHARE: A Semantic Web Query Engine for Bioinformatics . In Gómez-Pérez, A., Yu, Y.,
& Ding, Y. (Eds.), The Semantic Web (Vol. 5926, pp. 367–369). New York, NY: Springer Berlin Heidelberg. doi:10.1007/978-3-642-10871-6_27
Vavliakis, K. N., Symeonidis, A. L., Karagiannis, G. T., & Mitkas, P. A. (2011). An integrated framework for enhancing the semantic
transformation, editing and querying of relational databases.Expert Systems with Applications , 38(4), 3844–3856.
doi:10.1016/j.eswa.2010.09.045
Williams, K., Li, L., Khabsa, M., Wu, J., Shih, P., & Giles, C. L. (2014). A web service for scholarly big data information extraction.
In Proceeding of the 2014 IEEE International Conference on Web Services (ICWS), (pp. 105 – 112). Anchorage, AK: IEEE.
10.1109/ICWS.2014.27
Yu, L. (2011). A Developer’s Guide to the Semantic Web . New York, NY: Springer. doi:10.1007/978-3-642-15970-1
Zadeh, L. A. (2004). A Note on Web Intelligence, World Knowledge and Fuzzy Logic. Data & Knowledge Engineering ,50(3), 291–304.
doi:10.1016/j.datak.2004.04.001
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., . . . Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant
abstraction for in-memory cluster computing. Paper presented at the 9th USENIX conference on Networked Systems Design and
Implementation, San Jose, CA.
Zhao, J., & Boley, H. (2010). Knowledge Representation and Reasoning in Norm-Parameterized Fuzzy Description Logics . InCanadian Semantic
Web (pp. 27–53). Springer. doi:10.1007/978-1-4419-7335-1_2
Zikopoulos, P., Parasuraman, K., Deutsch, T., Giles, J., & Corrigan, D. (2012). Harness the Power of Big Data The IBM Big Data Platform .
McGraw Hill Professional.
KEY TERMS AND DEFINITIONS
Big Data Management: This is the process of managing, administering or organizing large volumes of both structured and unstructured data.
Completeness: In Artificial Intelligence (AI), the completeness theorem is among the methods used for checking the validity of axioms and logical inference in the knowledge base. A knowledge base is said to be complete if no formula can be added to it.
Fuzzy Knowledge Base: In fuzzy logic systems, the fuzzy knowledge base represents the facts, rules, and linguistic variables based on fuzzy set theory, so that the knowledge base system allows approximate reasoning.
Gödel’s Semantics: In mathematical logic, Gödel’s completeness theorem is a fundamental theorem that relies on the semantic completeness of first-order logic. In the proof of the algorithm, when the interpretation I is satisfiable, the logic is shown to be decidable under Gödel’s semantics.
Satisfiability: In this chapter, satisfiability is the property of the Web data concepts that allows testing the value of a semantic query over atomic concepts. Similarly, satisfiability determines an assignment of the variables of a given Boolean formula.
Veracity: Veracity can be described as the quality of trustworthiness of the data. In other words, veracity is the consistency of data in terms of its statistical reliability. It is also among the five dimensions of big data, which are volume, velocity, value, variety, and veracity.
Web Data: In this chapter, Web data is data that comes from a large and diverse number of sources. Web data are developed with the help of Semantic Web tools such as RDF, OWL, and SPARQL. Web data also allow sharing of information through the HTTP protocol or a SPARQL endpoint.
Web Intelligence: Web intelligence refers to the use of the World Wide Web (WWW) to retrieve information from data storage in an intelligent, efficient manner. Web intelligence is a well-known research area that converges subjects such as artificial intelligence, databases, Web science, the Semantic Web, and information retrieval.
ENDNOTES
1 http://www.w3.org/TR/owl2-overview/
2 http://www.w3.org/TR/sparql11-query/
3 XAMPP is an open source package that contains an Apache HTTP Server, a MySQL database, and interpreters for the PHP and Perl programming languages. It can be found at: http://www.apachefriends.org/index.html
4 https://jena.apache.org/
5 http://franz.com/agraph/allegrograph/
6 VMware provides virtual machine images that allow users to run Linux in a virtual environment. Available at: http://www.vmware.com/
APPENDIX: EVALUATION QUERIES
Table 5.
Cem Özdoğan
Çankaya University, Turkey
Dan Watson
Utah State University, USA
ABSTRACT
Data reduction is perhaps the most critical component in retrieving information from big data (i.e., petascale-sized data) in many data-mining
processes. The central issue of these data reduction techniques is to save time and bandwidth in enabling the user to deal with larger datasets
even in minimal resource environments, such as in desktop or small cluster systems. In this chapter, the authors examine the motivations behind
why these reduction techniques are important in the analysis of big datasets. Then they present several basic reduction techniques in detail,
stressing the advantages and disadvantages of each. The authors also consider signal processing techniques for mining big data by the use of
discrete wavelet transformation and server-side data reduction techniques. Lastly, they include a general discussion on parallel algorithms for
data reduction, with special emphasis given to parallel wavelet-based multi-resolution data reduction techniques on distributed memory systems
using MPI and shared memory architectures on GPUs along with a demonstration of the improvement of performance and scalability for one
case study.
INTRODUCTION
With the advent of information technologies, we live in the age of data – data that needs to be processed and analyzed efficiently to extract useful
information for innovation and decision-making in corporate and scientific research labs. While the term ‘big data’ is relative and subjective
and varies over time, a good working definition is the following:
• Big data: Data that takes an excessive amount of time/space to store, transmit, and process using available resources.
One remedy in dealing with big data might be to adopt a distributed computing model to utilize its aggregate memory and scalable computational
power. Unfortunately, distributed computing approaches such as grid computing and cloud computing are not without their disadvantages (e.g.,
network latency, communication overhead, and high-energy consumption). An “in-box” solution would alleviate many of these problems, and
GPUs (Graphical Processing Units) offer perhaps the most attractive alternative. However, as a cooperative processor, GPUs are often limited in
terms of the diversity of operations that can be performed simultaneously and often suffer as a result of their limited global memory as well as
memory bus congestion between the motherboard and the graphics card. Parallel applications as an emerging computing paradigm in dealing
with big datasets have the potential to substantially increase performance with these hybrid models, because hybrid models exploit both
advantages of distributed memory models and shared memory models.
A major benefit of data reduction techniques is to save time and bandwidth by enabling the user to deal with larger datasets within the minimal resources available at hand. The key point of this process is to reduce the data without making it statistically distinguishable from the original data, or at least to preserve the characteristics of the original dataset in the reduced representation at a desired level of accuracy. Because of the huge amounts of data involved, data reduction becomes a critical element of the data mining process in the quest to retrieve meaningful information from those datasets. Reducing big data also remains a challenging task: a straightforward approach that works well for small data might end up with impractical computation times for big data. Hence, designing the software and the architecture together is crucial when developing data reduction algorithms for processing big data.
In this chapter, we will examine the motivations behind why these reduction techniques are important in the analysis of big datasets by focusing
on a variety of parallel computing models ranging from shared-memory parallelism to message-passing parallelism. We will show the benefit of
distributed memory systems in terms of memory space for processing big data because of their aggregate memory. However, although many of
today’s computing systems have many processing elements, we still lack data reduction applications that benefit from multi-core technology.
Special emphasis in this chapter will be given to parallel clustering algorithms on distributed memory systems using the MPI library as well as
shared memory systems on graphics processing units (GPUs) using CUDA (Compute Unified Device Architecture developed by NVIDIA).
GENERAL REDUCTION TECHNIQUES
Significant CPU time is often wasted because of the unnecessary processing of redundant and non-representative data in big datasets. Substantial
speedup can often be achieved through the elimination of these types of data. Furthermore, once non-representative data is removed from large
datasets, the storage and transmission of these datasets becomes less problematic.
There are a variety of data reduction techniques in current literature; each technique is applicable in different problem domains. In this section,
we provide a brief overview of sampling – by far the simplest method in implementation but not without intrinsic problems – and feature
selection as a data reduction technique, where the goal is to find the best representational data among all possible feature combinations. Then we
examinefeature extraction methods, where the aim is to reduce numerical data using common signal processing techniques, including discrete
Fourier transforms (DFTs) and discrete wavelet transforms (DWTs).
Sampling and Feature Selection
The goal of sampling is to select a subset of elements that adequately represents the complete population. At first glance, sampling may appear to
be overly simplistic, but these methods are nonetheless a very usable class of techniques in data reduction, especially with the advent of big data.
With the help of sampling, algorithms performed on the sampled population retain a reasonable degree of accuracy while yielding a significantly
faster execution.
Perhaps the most straightforward sampling method is simple random sampling (Figure 1), in which sample tuples are chosen randomly using a pseudorandom number generator. In simple random sampling, units are drawn at random from the original big data population and added to the (initially empty) sample set, either with or without replacement. Without replacement, a selected unit cannot be drawn again, so each unit is selected with probability 1/N on the first draw, where N is the size of the population, and there are no duplicate elements in the sampled population. With replacement, sampled units are added to the sample set regardless of whether they have already been selected, and thus identical elements may exist in the resulting set. In simple random sampling without replacement, because there are C(N, n) possible samples, where n is the sample size, the probability of selecting any particular sample is 1/C(N, n) (Lohr, 1999).
Figure 1. Simple random sampling algorithm
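As a concrete illustration, the following short Python sketch (not the pseudo-code of Figure 1) implements both variants using the standard random module; the function name and the toy population are purely illustrative.

```python
# Minimal sketch of simple random sampling; assumes the dataset fits in a list.
import random

def simple_random_sample(population, n, with_replacement=False):
    """Return n tuples drawn uniformly at random from `population`."""
    if with_replacement:
        # Each draw is independent and uniform, so duplicates may occur.
        return [random.choice(population) for _ in range(n)]
    # Without replacement: no duplicates; each size-n subset is equally likely.
    return random.sample(population, n)

# Example usage on a toy population of 10 tuples.
tuples = [(i, i * i) for i in range(10)]
print(simple_random_sample(tuples, 3))
```

With replacement the same tuple may appear more than once in the sample; without replacement it cannot.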
It is worth noting that random number generation on parallel computers is not straightforward, and care must be taken to avoid exact duplication of the pseudorandom sequence when multiple processing elements are employed. The reader is directed to Aluru et al. (1992), Bradley et al. (2011), and de Matteis & Pagnutti (1990) for further reading on parallel pseudorandom number generation.
Another sampling method is systematic sampling, illustrated in Figure 2. The advantage of this method over simple random sampling is that systematic sampling does not need a parallel random number generator. Let k = N / n and let R be a random number between 1 and k; then every ith tuple, 0 ≤ i < n, in the database is selected according to the expression R + i × k. Because systematic sampling selects the sampled units in a well-patterned manner, the tuples must be stored in a random fashion in the database; otherwise, systematic sampling may introduce unwanted bias into the sampled set. For example, as a web server records the information of its visitors, the order of the information stored in the database might produce bias if the employment status of the visitors affects this ordering because of heterogeneous characteristics of the population. See Figure 3 for illustrations of both simple random sampling and systematic sampling.
Figure 2. Systematic sampling algorithm
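The selection rule R + i × k translates into a few lines of Python; treating k = N / n as an integer interval is an assumption made for the illustration.

```python
# Minimal sketch of systematic sampling using the rule R + i*k, 0 <= i < n.
import random

def systematic_sample(population, n):
    N = len(population)
    k = N // n                        # sampling interval (assumes N divisible by n)
    r = random.randint(1, k)          # random start, 1 <= r <= k
    # Select every k-th tuple starting from position r (1-based indexing).
    return [population[r + i * k - 1] for i in range(n)]

tuples = list(range(1, 101))          # N = 100
print(systematic_sample(tuples, 10))  # e.g., [7, 17, 27, ...]
```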
In stratified sampling, shown in Figure 4, population units are divided into disjoint, homogeneous groups called strata, and a sampling technique such as simple random sampling or systematic sampling is applied to each stratum independently. One of the advantages of stratified sampling is that it minimizes estimation error. For example, suppose the manager of a web site wishes to obtain sampled users that represent the population fairly, and that the gender proportions are known in advance. The stratum consisting of all males is then sampled on its own, and the same sampling operation is applied independently to the remaining portion of the database tuples (i.e., females).
Figure 4. Stratified sampling algorithm
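A minimal Python sketch of the stratified procedure is given below, with proportional allocation of the sample across strata; the stratum key (a hypothetical gender field) and the allocation rule are illustrative assumptions rather than part of the original algorithm.

```python
# Minimal sketch of stratified sampling with proportional allocation per stratum.
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, total_n):
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)   # partition into disjoint strata
    sample = []
    for units in strata.values():
        # Allocate the sample size proportionally to stratum size, at least 1 unit.
        n_h = max(1, round(total_n * len(units) / len(population)))
        sample.extend(random.sample(units, min(n_h, len(units))))
    return sample

visitors = [{"id": i, "gender": "M" if i % 3 else "F"} for i in range(30)]
print(stratified_sample(visitors, lambda v: v["gender"], 6))
```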
While sampling techniques are straightforward and efficient in reducing data, they may introduce a biased result when the sampled set does not represent the population fairly. Thus, feature subset selection (FSS) techniques emerge as a promising way to deal with this problem.
A feature is defined as an attribute of the data specifying one of its distinctive characteristics. The objective of FSS is to find the feature subset of size d that represents the data “best” among all possible feature subset combinations, C(n, d), over n features. To measure how well the selected feature subset represents the data, an objective function is used to evaluate the “goodness” of the candidate subset. Let the set of features of the data be X = {x1, x2, …, xn} and the feature subset be Y ⊆ X, where |Y| = d. Then the selected feature subset Y maximizes the objective function J(Y).
Exhaustive search, shown in Figure 5, is one straightforward way to find the optimum feature subset of size d. This algorithm obtains the optimal feature set of size d by examining all possible feature subsets. The algorithm can be considered a “top-down” approach that starts from an empty feature subset and incrementally builds candidate solutions. The optimal feature set is the set having the maximum goodness value. While exhaustive search guarantees the best solution, it obviously exhibits poor performance due to its exponential complexity of O(J(Yi)) × O(C(n, d)) = O(J(Yi) × C(n, d)). Therefore, exhaustive search is practical only if the data has a small number of features. The search space of exhaustive search is illustrated in Figure 6, where every leaf node is traversed to find the optimum solution.
Figure 5. Exhaustive feature selection algorithm
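The exhaustive procedure can be sketched in Python as follows; the objective function used here (total variance of the selected columns) is only a stand-in for whatever goodness measure J is chosen in practice.

```python
# Minimal sketch of exhaustive feature subset selection over C(n, d) subsets.
from itertools import combinations
import statistics

def exhaustive_fss(rows, n_features, d, objective):
    best_subset, best_score = None, float("-inf")
    for subset in combinations(range(n_features), d):   # every size-d subset
        score = objective(rows, subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

def variance_objective(rows, subset):
    # Stand-in goodness measure: total variance of the selected columns.
    return sum(statistics.pvariance([row[j] for row in rows]) for j in subset)

data = [[1.0, 2.0, 0.1], [2.0, 4.1, 0.1], [3.0, 5.9, 0.2]]
print(exhaustive_fss(data, n_features=3, d=2, objective=variance_objective))
```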
In contrast to the exhaustive search approach, the branch and bound algorithm for feature subset selection (Narendra & Fukunaga, 1977) finds the optimal feature set in a “bottom-up” manner using a tree structure. Hence, the algorithm discards m features to find the feature subset with d dimensions, where m = n – d. The monotonicity property of feature subsets with respect to the objective function J is satisfied at each iteration, in which only one feature is discarded. Thus, by monotonicity, discarding features can never increase the objective function, as shown in Equation 1:

J(X − {x1}) ≥ J(X − {x1, x2}) ≥ … ≥ J(X − {x1, x2, …, xm}) (1)
The search space of the branch and bound algorithm is illustrated as a tree structure form in Figure 7. The tree has m + 1 levels where 0 ≤ i ≤ m.
Each leaf node represents a candidate feature subset. The label near each node indicates the feature to be discarded. Traversing begins from the
root node and switches to the next level. When the goodness-of-fit value of the feature set is less than the current maximum value or there are no
nodes left to be visited at level i, the algorithm backtracks up to the previous level. The maximum goodness-of-fit value is calculated when the leaf
node is reached for the corresponding candidate feature set. The algorithm finishes when it backtracks to level zero and the candidate feature set
with highest goodness-of-fit value is considered the optimal feature set.
Although the branch and bound algorithm is superior to the exhaustive search algorithm in terms of algorithmic efficiency, the objective function is invoked not only for the leaf nodes, but also for the other nodes in the tree. In addition, there is no guarantee that the algorithm will avoid visiting every node. To alleviate this problem, Somol et al. (2000) introduced the fast branch and bound (FBB) algorithm for optimal subset selection. The FBB algorithm uses a prediction mechanism to save the computation time of excessive invocations of the objective function by monitoring changes in goodness-of-fit.
Another data reduction approach is to use heuristic feature selection algorithms such as sequential forward selection (SFS, shown in Figure 8) and sequential backward selection (SBS, shown in Figure 9), which employ a greedy approach by respectively adding or removing the best features in an incremental fashion. The SFS algorithm starts with an empty set and then adds features one by one, whereas the SBS algorithm starts with the full feature set and then yields a suboptimal feature set by incrementally discarding features. Because heuristic algorithms substantially reduce the search space (for example, by three orders of magnitude when n = 20, d = 10 (Whitney, 1971)), they are faster than exhaustive search at the expense of optimality. Furthermore, because forward selection starts with small subsets and then enlarges them, while backward selection starts with large subsets and then shrinks them, experiments show that forward selection is typically faster than backward selection because of the computational cost of repeatedly processing large subsets (Jain & Zongker, 1997). As a forward selection approach, FOCUS by Almuallim & Dietterich (1991) is a quasi-polynomial algorithm that starts searching from the bottom, with candidate feature subsets of size one, and then increments the subset size until the minimum goodness criterion is satisfied.
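A minimal Python sketch of SFS is shown below; the stand-in objective and the fixed target size d are assumptions made for the illustration. SBS is the mirror image: it starts from the full feature set and repeatedly removes the feature whose removal hurts the objective least.

```python
# Minimal sketch of sequential forward selection (greedy, may be suboptimal).
import statistics

def variance_objective(rows, subset):
    # Stand-in goodness measure, as in the exhaustive-search sketch above.
    return sum(statistics.pvariance([row[j] for row in rows]) for j in subset)

def sfs(rows, n_features, d, objective=variance_objective):
    selected = []                                     # start from the empty set
    remaining = set(range(n_features))
    while len(selected) < d:
        # Add the single feature that improves the objective the most.
        best = max(remaining, key=lambda f: objective(rows, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

data = [[1.0, 2.0, 0.1], [2.0, 4.1, 0.1], [3.0, 5.9, 0.2]]
print(sfs(data, n_features=3, d=2))   # greedy choice of two features
```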
The drawback of those heuristic algorithms is a problem called nesting, in which the operation of adding (SFS) or removing (SBS) a feature cannot be undone once it is finished. This situation is a primary cause of suboptimal solutions. To overcome the nesting problem, several algorithms have been proposed, such as the Plus-L-Minus-R algorithm introduced by Stearns (1976), which performs l adding operations and r removing operations. If l > r, it functions like the forward selection algorithm; otherwise, it functions like the backward selection algorithm. One drawback of the Plus-L-Minus-R algorithm is that the parameters l and r must be determined in advance to obtain the best feature set. To compensate for this problem, a floating search method has been proposed in which the algorithm changes the l and r values at run time to maximize the optimality of the feature set (Pudil et al., 1994).
As stated by John et al. (1994), any data reduction algorithm falls into one of two models for selecting relevant feature subsets. The first is the filter model (see Figure 10), in which the feature selection phase is applied independently of the induction algorithm as a preprocessing step. This contrasts with the wrapper model (see Figure 11), which “wraps” the induction algorithm into the feature selection step. In the wrapper model, the goodness function for a feature subset is evaluated using the induction algorithm itself. Each model has its own strengths and weaknesses: because it runs separately from the induction algorithm, the filter model may introduce irrelevant features that impose performance problems on the induction algorithm, but it still runs relatively faster than the wrapper model. The wrapper model, in turn, has the ability to find more relevant subsets at a reasonable cost in performance.
The wrapper model utilizes an n-fold cross-validation technique in which the input feature set is divided into n partitions of equal size. Using those partitions, the algorithm uses n – 1 partitions as training data and performs a validation operation on the remaining partition. This operation is repeated n times, once for each partition, and the feature subset with the maximum averaged goodness-of-fit value is regarded as the best solution. Cross-validation can also avoid the phenomenon of over-fitting, the problem of fitting noise into the model. While the wrapper model may not be ideal for big data due to its performance inefficiencies, feature pruning techniques or sampling techniques applied to the training data may yield faster execution times.
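The wrapper evaluation loop might be sketched as follows; induce and score are placeholders for the induction algorithm and its accuracy measure, and the equal-sized fold split is a simplification.

```python
# Minimal sketch of wrapper-style evaluation of one feature subset
# using n-fold cross-validation; `induce` and `score` are placeholders.
def cross_validated_goodness(rows, labels, subset, induce, score, n_folds=5):
    fold_size = len(rows) // n_folds
    scores = []
    for k in range(n_folds):
        lo, hi = k * fold_size, (k + 1) * fold_size
        test_X = [[r[j] for j in subset] for r in rows[lo:hi]]
        test_y = labels[lo:hi]
        train_X = [[r[j] for j in subset] for r in rows[:lo] + rows[hi:]]
        train_y = labels[:lo] + labels[hi:]
        model = induce(train_X, train_y)              # train on n - 1 folds
        scores.append(score(model, test_X, test_y))   # validate on the held-out fold
    return sum(scores) / len(scores)                  # averaged goodness-of-fit
```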
Data Reduction using Signal Processing Techniques
Another way to reduce big data to a manageable size is to use contemporary signal processing techniques. Although these methods fall into the class of feature extraction methods in the context of dimensionality reduction (Pudil et al., 1998), they are considered here as an alternative to feature selection techniques. Feature extraction methods reduce dimensionality by mapping a set of elements in one space to another, whereas feature selection methods perform the reduction task by finding the “best” feature subset. Additionally, because feature selection methods require an objective function to evaluate the goodness of the subset, the output of the reduction operation depends on that objective function. By contrast, feature extraction methods solve the reduction problem in another space, on transformed data, without using an objective function. Because signal processing techniques do not work on non-numerical data, a quantization step is required as a preprocessing step before such techniques can be applied.
The Discrete Fourier Transform (DFT) method is used in many signal-processing studies that are seeking a reduction in data, for example in
reducing time series data (that naturally tends toward large amounts of data) without sacrificing trend behavior (Faloutsos et al., 1994). The main
aim of applying a DFT is to convert a series of data from the time domain into the frequency domain by decomposing the signal into a number of
sinusoidal components. The DFT extracts f features from sequences in which each feature in the transformed feature space refers to one DFT
coefficient.
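A minimal NumPy sketch of this idea keeps only the first f coefficients of the (real) DFT as the reduced representation; which coefficients to retain in practice depends on the application.

```python
# Minimal sketch of DFT-based reduction of a numeric sequence with NumPy.
import numpy as np

def dft_reduce(series, f):
    """Keep the first f DFT coefficients as the reduced representation."""
    return np.fft.rfft(series)[:f]

def dft_reconstruct(coeffs, length):
    """Approximate the original series from the retained coefficients."""
    full = np.zeros(length // 2 + 1, dtype=complex)
    full[:len(coeffs)] = coeffs
    return np.fft.irfft(full, n=length)

t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(256)
approx = dft_reconstruct(dft_reduce(signal, f=8), length=256)
```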
While the DFT is a useful tool for analyzing signal data, it often fails to properly analyze the non-stationary data that is frequently encountered in time series. To overcome this problem, another signal processing technique, the Discrete Wavelet Transform (DWT), can be employed to analyze the components of non-stationary signals (Sifuzzaman et al., 2009). Moreover, while DFTs transform the signal from the time/space domain into the frequency domain, DWTs transform the signal into the time/frequency domain. Thus, unlike DFTs, DWTs preserve the spatial properties of the data in the transformed feature space. Wu et al. (2000) show that the DWT is a better tool than the DFT for time series analysis because DWTs reduce the error of distance estimates on the transformed data. Additionally, the DWT (like the DFT) produces no false dismissals (Chan & Fu, 1999). This important property allows us to use the wavelet transform for data reduction.
DWTs decompose signals into high-frequency and low-frequency components using filtering techniques. The low-frequency component represents a lower-resolution approximation of the original feature space on which the data-mining algorithm is applied, yielding (upon re-application of the DWT on the low-frequency component) solutions at different scales, from fine (as in the case of the original dataset) to coarse (as in the case of multiple applications of the DWT). In addition, a lookup table can be used to map the units of the transformed feature space back to units in the original feature space, providing a useful indexing capability for the resulting units in the transformed space.
As implied above, wavelet transforms can be applied to the input data multiple times on the low-frequency component in a recursive manner, creating a multi-resolution sampling of the original feature set. This recursive operation of decomposing the signal is called a filter bank. Each application of the wavelet transform produces a transformed feature space with half as many objects in each dimension as the previous feature space.
Figure 12 illustrates the low-frequency component extracted from an image of A. Lincoln. As shown in the figures, as the scale level of the wavelet transform increases, we obtain a coarser representation of the original data, to which a data-mining algorithm is typically applied to extract information from the reduced data. For example, in Figure 12(b), the DWT has been applied three times on the low-frequency component (the scale level is 3), and in Figure 12(c), the DWT is applied one more time to Figure 12(b), so the scale level becomes four. Thus, the reduced data in Figure 12(c) has half as many cells in each dimension as Figure 12(b) and is a coarser representation of the input data than the previous low-frequency component (Figure 12(b)). For comparison, Figure 12(d) is a rather famous illustration by Leon Harmon (1973), created to illustrate his article on the recognition of human faces.
There are many popular wavelet algorithms, including Haar wavelets, Daubechies wavelets, Morlet wavelets, and Mexican Hat wavelets. As stated by Huffmire & Sherwood (2006), Haar wavelets (Stollnitz et al., 1995) are used in many studies because of their simplicity and their fast and memory-efficient characteristics. It should also be noted that it is necessary to scale source datasets linearly to a power of two in each dimension, because the signal rate decreases by a factor of two at each level. If we let X = {x1, x2, …, xn} be an input dataset of size n, then the low-frequency component A = {a1, a2, …, an/2} of the data can be extracted via a simple averaging operation as shown below:

a_i = (x_{2i−1} + x_{2i}) / 2, for 1 ≤ i ≤ n/2 (2)

where each object in the low-frequency component of the feature space V^j is also contained in the low-frequency component of the previously transformed feature space V^{j−1}. Hence, the feature spaces are nested (Stollnitz et al., 1995). That is:

V^j ⊂ V^{j−1} ⊂ … ⊂ V^1 ⊂ V^0 (3)

where V^0 denotes the original feature space and V^j the feature space after the jth application of the transform.
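Equation 2 translates directly into a short NumPy routine, and applying it recursively produces the nested sequence of low-frequency components described by Equation 3; the sample values below are illustrative.

```python
# Minimal sketch of Equation 2: pairwise averaging extracts the
# low-frequency (approximation) component of a Haar transform.
import numpy as np

def haar_low_frequency(x):
    x = np.asarray(x, dtype=float)          # length assumed to be a power of two
    return (x[0::2] + x[1::2]) / 2.0        # a_i = (x_{2i-1} + x_{2i}) / 2

def multi_resolution(x, scale):
    """Apply the averaging `scale` times, yielding the nested spaces of Eq. 3."""
    components = [np.asarray(x, dtype=float)]
    for _ in range(scale):
        components.append(haar_low_frequency(components[-1]))
    return components

levels = multi_resolution([4, 6, 10, 12, 8, 6, 5, 3], scale=3)
# levels[1] == [5., 11., 7., 4.]; levels[3] is the coarsest approximation.
```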
Popivanov & Miller (2002) list the main advantages of using DWTs over DFTs:
• DWTs are able to capture time-dependent local properties of data, whereas DFTs can only capture global properties;
• The algorithmic complexity of DWTs is linear in the length of the data, which is superior even to the Fast Fourier Transform with its O(n log n) complexity;
• While DFTs provide the set of frequency components in the signal, DWTs allow us to analyze the data at different scales. The low-frequency component of the transformed space represents the approximation of the data at another scale;
• Unlike DFTs, DWTs have many wavelet functions. Thus, they provide access to information that can be obscured by other methods.
PARALLEL DATA REDUCTION ALGORITHMS
Despite substantial and continuing improvements in processor technology in terms of speed, data reduction algorithms may still not complete the required task in a reasonable amount of time for big data problems. In addition, there may not be enough available memory resources to hold all the
data on a single computer. One of the most promising approaches for overcoming those issues is to make use of parallel computing systems and
algorithms (Hedberg, 1995). Parallel processing techniques are extensively used in divide-and-conquer approaches in which the task is divided
into smaller subtasks, all of which are then executed simultaneously, to cope with memory limits and to decrease the overall execution time of a
given task.
In this section, we briefly overview parallel processing as applied to big data. Then, we present studies in the literature that examine the task of
data reduction on data centers in parallel. As a case study, we investigate the parallel WaveCluster algorithm on distributed memory systems
using the Message-Passing Interface (MPI) model and on GPUs as an instance of the shared memory system using CUDA. WaveCluster
algorithms employ DWTs to reduce the dimension of the data before finding clusters in the dataset, and so are well-suited to this discussion.
Parallel Processing
Parallel computer architectures can be classified according to the memory sharing models. The first of these is the Distributed Memory
System (DMS) model in which independent processors or nodes, each with their own memory and address space, are connected to each other by
means of high-speed network infrastructure. In this model, data is shared among nodes via explicit message-passing, either between individual
nodes, or via a broadcast mechanism. Typically, programs written for DMS systems use library function calls, such as those provided by the Message Passing Interface (MPI) libraries (Gropp et al., 1994), to construct the communication patterns among the processors by sending and receiving messages.
An alternative parallel architecture model is the Shared Memory System (SMS) model. Under the SMS model, there is a single address space –
and thus the appearance of a shared memory bank – into which any of the nodes may read or write at any time. Data sharing among processors occurs without any explicit message passing; each node simply reads from or writes to the shared space, and the result is immediately available to all other processors. In practice, SMS systems are usually built out of individual processing elements that each contain their own private memory, and the appearance of a shared address space is created with a set of protocols that are often written for DMS-type systems. One notable exception to this (as
discussed below) is in the realm of GPUs, which often contain some mix of shared and distributed memory hierarchies.
Each memory model has its own advantages over the other. One advantage of SMS over DMS is that there is no need to store duplicate
items in separate memory. Additionally, SMS provides much faster communication among the processors. On the other hand, distributed
memory architectures fully utilize the larger aggregate memory that allows us to solve larger problems, which typically imply large memory
needs. Hybrid memory models represent a combination of the shared and distributed memory models. With an increasing number of clusters
with hybrid architectures, the MPI library is used to achieve data distribution between GPU nodes whereas CUDA functions as “main computing
engine” (Karunadasa & Ranasinghe, 2009). Hence, they have been considered complementary models in order to maximally utilize system
resources.
As a distributed memory system, MapReduce (Dean & Ghemawat, 2008) has the potential to provide a novel way of dealing with big data. In this model, the programmer is not required to specify the parallelism in code. At its core, there are only two functions: map and reduce. Map processes key/value pairs and divides the data into groups of sub-data, where each sub-data item is represented by a key. The sub-data, collectively known as intermediate values, are stored in a distributed file system. Reduce then merges all the intermediate values across the distributed file system associated with the same key. In the starting phase of a job, the library splits the input files into pieces of typically 16 to 64 megabytes per piece. It then starts up many copies of the program on a cluster. The implementation is based on a master/worker model: the master node assigns map and reduce tasks to available workers. Each worker writes the intermediate results of map and reduce operations to local disks, which can be accessed by all nodes in the cluster.
One of the characteristics of the map-reduce model is that processors are not allowed to communicate with each other directly. Thus, while this
strategy has been criticized as being too low-level, the model provides us with scalability to support ever-growing datasets. It also ensures fault-
tolerance by storing intermediate files on the distributed file system and by taking advantage of its dynamic job scheduling system. One popular implementation of the map-reduce framework is Apache Hadoop, which provides for the distributed processing of large data sets across clusters of computers. Hadoop implements a distributed file system to provide high-throughput data access. Although Hadoop’s byte-stream data model leads to inefficiencies in data access for highly structured data, SciHadoop, a Hadoop plugin, attenuates this effect by allowing users to specify logical queries over array-based data models (Buck et al., 2011). See Figure 13 for an illustration of the map-reduce model.
Figure 13. Illustration of map-reduce model where the number
of pieces, M = 6
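The data flow of the model can be mimicked in a few lines of single-process Python (here counting words per key); the sketch reproduces only the grouping of intermediate values by key, not the distributed file system, scheduling, or fault tolerance.

```python
# Toy, single-process simulation of the map/reduce data flow described above.
from collections import defaultdict

def map_phase(records, map_fn):
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):    # map emits (key, value) pairs
            intermediate[key].append(value)  # grouped as intermediate values
    return intermediate

def reduce_phase(intermediate, reduce_fn):
    # Reduce merges all intermediate values sharing the same key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

records = ["a b a", "b c", "a"]
mapped = map_phase(records, lambda line: [(w, 1) for w in line.split()])
print(reduce_phase(mapped, lambda key, values: sum(values)))  # {'a': 3, 'b': 2, 'c': 1}
```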
There are many programming model choices for shared memory systems. Threading is one way to run concurrent processes on multi-core processors, where each thread may run on a different core within a single process. Threads share the process address space and can access other process resources as well. Another choice is OpenMP (Chapman et al., 2007): as an application programming interface (API), OpenMP provides the programmer flexibility, ease of use, and transparency in the creation of CPU threads, which speed up loops through the use of pragma directives.
Recent work in big data processing coincides with the advent of many-core graphical processing units (GPUs), which are used for general-
purpose computation as a co-processor of the CPU (in addition to its traditional usage of accelerating the graphic rendering). In 2008, NVIDIA
introduced CUDA (Compute Unified Device Architecture) (Nickolls et al., 2008) to enable general-purpose computations on NVIDIA GPUs in an
efficient way. In CUDA programming, the programmer must partition the problem into sub problems that can be solved independently in parallel
on GPU cores. In practice, the programmer writes a function called a kernel and accesses the data using thread IDs in a multidimensional manner. The advantage of this model is that once the parallel program is developed, it can scale transparently to hundreds of processor cores on any CUDA-capable NVIDIA graphics processor.
There are three main stages in CUDA program flow. First, input data is copied from local memory (RAM) to the global memory of the graphic
processor so that processing cores can access the input data in a fast manner. Second, the CUDA machine executes the parallel program given.
Third, the data residing in the global memory of the graphics processor is transferred back to local memory (RAM). Thus, the
main disadvantage of this model is that this mechanism may incur performance bottlenecks due to the extra memory copy operations, eventually
decreasing the speedup ratio of the parallel program when compared to the sequential one. The other disadvantage is that the CUDA kernel does
not allow accessing the disk and calling the functions of the operating system or libraries compiled for CPU execution.
Because GPU hardware implements the SIMD (Single Instruction stream, Multiple Data stream) parallel execution model, each control unit in
the GPU broadcasts one instruction among all the processor units inside the same processor group in which the instruction is executed. This
architectural difference from the traditional CPU architecture requires non-trivial parallel programming models. It also comes with its own
intrinsic restrictions in developing a parallel program such as the synchronization issue among the GPUs and the complications of having
different levels of memory models. Furthermore, most of the classical programming techniques such as recursion might not be applied in GPU
computing. Thus, GPU programs have to be developed in accordance with the underlying architectures.
Server-Side Parallel Data Reduction Methods for Data Centers
Due to massive bandwidth requirements, the task of moving data between hosting datacenters and workstations requires a substantial amount of time. One remedy is to run data reduction applications on the data center to take advantage of data locality. Wang et al. (2007) introduced the Script Workflow Analysis for MultiProcessing (SWAMP) system. In the SWAMP system, the user writes a shell script, and the script is then submitted to the OpenDAP data server using the Data Access Protocol (DAP). The SWAMP parser has the ability to recognize netCDF options and
parameters. The execution engine of SWAMP manages the script execution and builds dependency trees. The SWAMP engine can then
potentially exploit parallel execution of the script when possible.
Today, map-reduce frameworks are beginning to see increased use in reducing big data for further processing. Singh et al. (2009) introduced a parallel feature selection algorithm to quickly evaluate billions of potential features for very large data sets. In the mapping phase, each map
worker iterates over the training records of the block and produces intermediate data records for each new feature. Then, each reduce worker
computes an estimated coefficient for each feature. Finally, post-processing is performed to aggregate the coefficients for all features in the same
feature class.
Another map-reduce study is presented by Malik et al. (2011) to accelerate the performance of subset queries on raster images in parallel.
Inherently, geographic information data consists of collections of multidimensional, petabyte-sized data that are becoming more common. To fetch subsets of data from a geographic information database efficiently, each of the spatial, time, and weather variables of the data is stored in a column-oriented storage format, instead of in a row-oriented storage format where multidimensional data is stored as a single record. Column-
oriented storage format leads to performance benefit by allowing each subset to be loaded into memory without loading unnecessary data when
subset query is issued. The algorithm indexes data in Hilbert order (Song et al., 2000) to take advantage of data locality.
Parallel WaveCluster Algorithm
WaveCluster Algorithm
Clustering is a common data mining technique that is used for data retrieval by grouping similar objects into disjoint classes or clusters (Fayyad
et al., 1996). There has been an explosive growth of very large or big datasets in scientific and commercial domains with the recent progress in
data storage technology. Because cluster analysis has become a crucial task for the mining of the data, a considerable amount of research is
focused on developing sequential clustering analysis methods. Clustering algorithms have been highly utilized in various fields such as satellite
image segmentation (Mukhopadhyay & Maulik, 2009), unsupervised document clustering (Surdeanu et al., 2005) and clustering of
bioinformatics data (Madeira & Oliveira, 2004). The WaveCluster algorithm is a multi-resolution clustering algorithm introduced by
Sheikholeslami et al. (1998). The algorithm is designed to perform clustering on very large spatial datasets by taking advantage of discrete
wavelet transforms. Thus, the algorithm has the ability to detect arbitrary shape clusters at different scales and can handle dataset noise in an
appropriate way.
The WaveCluster algorithm contains three phases. In the first phase, the algorithm quantizes the feature space and then assigns objects to the
units. This phase also affects the performance of clustering for different values of interval size. In the second phase, the discrete wavelet
transform is applied on the feature space multiple times. Discrete wavelet transforms are a powerful tool for time-frequency analysis that
decomposes the signal into average subband and detail subbands using filtering techniques. The WaveCluster algorithm gains the ability to
remove outliers with wavelet transforms and detects the clusters at different levels of accuracy (multi-resolution property) from fine to coarse by
applying the wavelet transform multiple times. Following the transformation, dense regions (clusters) are detected by finding connected
components and labels are assigned to the units in the transformed feature space.
Also in the second phase, a lookup table is constructed to map the units in the transformed feature space back to the original feature space.
In the third and last phase, the WaveCluster algorithm assigns a cluster number to each object in the original feature space. In Figure 14, the effect
of the wavelet transformation on source dataset is demonstrated.
Figure 14. WaveCluster algorithm multi-resolution property:
(a) Original source dataset, (b) ρ = 2 and 6 clusters are detected,
(c) ρ = 3 and 3 clusters are detected (where ρ is the scale level)
The results of the WaveCluster algorithm for different values of ρ are shown in Figure 14; where ρ (scale level) represents how many times
wavelet transform is applied on the feature space. There are 6 clusters detected with ρ = 2 (Figure 14(b)) and 3 clusters are detected with ρ = 3
(Figure 14(c)). In the performed experiments, connected components are found on average sub bands (feature space) using a classical two-pass
connected component labeling algorithm (Shapiro & Stockman, 2001) at different scales.
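A serial sketch of these phases is given below, with scipy.ndimage.label standing in for the two-pass connected component labeling; the input is assumed to be an N×2 NumPy array of points, and the grid size and density threshold are illustrative parameters rather than values used in the chapter.

```python
# Serial sketch of the WaveCluster phases: quantize, transform, label, map back.
import numpy as np
from scipy import ndimage

def wavecluster(points, grid=64, scale=2, density_threshold=1.0):
    # Phase 1: quantize the feature space and count points per unit (cell).
    counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=grid)
    # Phase 2: apply the Haar averaging `scale` times (low-frequency component).
    for _ in range(scale):
        counts = (counts[0::2, 0::2] + counts[1::2, 0::2] +
                  counts[0::2, 1::2] + counts[1::2, 1::2]) / 4.0
    # Label connected dense units in the transformed feature space.
    labels, n_clusters = ndimage.label(counts >= density_threshold)
    # Phase 3: map each original point back to a cluster via the lookup grid.
    factor = 2 ** scale
    ix = np.clip(np.digitize(points[:, 0], xedges) - 1, 0, grid - 1) // factor
    iy = np.clip(np.digitize(points[:, 1], yedges) - 1, 0, grid - 1) // factor
    return labels[ix, iy], n_clusters
```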
Parallel WaveCluster Algorithm on the Distributed Memory System
We reported our parallel WaveCluster algorithm based on the message-passing model using the MPI library in (Yıldırım & Özdoğan, 2011). The algorithm scales linearly with the increasing number of objects in the dataset. One may conclude that the parallel algorithm makes mining big datasets possible on distributed memory systems, without restrictions such as dataset size.
In the message-passing approach, each copy of the single program runs independently on its processor, and communication is provided by sending and receiving messages among nodes. We followed a master-slave model in the implementation. Each processor works on a specific partition of the dataset and executes the discrete wavelet transform and the connected component labeling algorithm. The results obtained by each processor are locally correct, but might not be globally correct. Therefore, processors exchange their local results and then check for correctness. This operation is achieved simply by sending the border data of the local results to the parent node, called the master node. This border data consists of transformed values and their corresponding local cluster numbers. After the processors send the border data to the parent node, the parent node creates a merge table for all processors with respect to the global adjacency of all cells. This requirement for sustaining consistency in effect creates a barrier primitive. Accordingly, all processors wait to receive the merge table from the parent node. Then, all processors update the cluster numbers of the units in their local feature space. Lastly, processors map the objects of the original feature space to the clusters using the lookup table. In the lookup table, each entry specifies the relationship of one unit in the transformed feature space to the corresponding units of the original feature space. The flow of the algorithm is depicted in Figure 15.
Figure 15. Parallel WaveCluster algorithm for Distributed
Memory Systems
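A skeletal mpi4py version of the gather-and-merge step is sketched below; the local clustering itself is assumed to have been done already, the structure of local_result is hypothetical, and the merge-table construction is reduced to assigning globally unique labels (a real implementation would also unify clusters that are adjacent across partition borders).

```python
# Skeletal mpi4py sketch of the merge step: workers send border data to the
# master, which builds a merge table and broadcasts it back.
from mpi4py import MPI

def parallel_merge(local_result):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each processor ships only its border units (cell id + local cluster number).
    borders = comm.gather(local_result["border"], root=0)

    merge_table = None
    if rank == 0:
        # The master records a global cluster id for every (rank, local cluster)
        # pair; unifying clusters adjacent across borders is omitted here.
        merge_table = {}
        next_global = 0
        for r, border in enumerate(borders):
            for local_cluster in sorted({c for _, c in border}):
                merge_table[(r, local_cluster)] = next_global
                next_global += 1

    # The broadcast acts as the barrier primitive mentioned in the text.
    merge_table = comm.bcast(merge_table, root=0)
    return {cell: merge_table[(rank, c)] for cell, c in local_result["border"]}
```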
Experiments for this technique were conducted, and the observed performance behaviors were presented for the 1, 2, 4, 8, 16, and 32 core cases on a cluster system having 32 cores with a 2.8 GHz clock speed and fast Ethernet (100 Mbit/sec) as the underlying communication network. Our experiments have shown that the cluster shape complexity of the dataset has minimal impact on the execution time of the algorithm. The datasets below are named DS32 (1073741824 objects) and DS65 (4294967296 objects) according to the dataset size. The DS65 case is beyond the limit of dataset size that fits into the memory of a single processor within the available hardware, and is studied by running that dataset with 8, 16, and 32 processors.
Table 1 shows execution times and speed-up ratios with respect to the number of objects in the dataset for varied numbers of processors. Analysis
of the dataset DS65 is only possible with eight or more processors in our computer cluster system. The benefit of the distributed memory system in processing big data is easily observed here. Experimental results have shown that this parallel clustering algorithm provides superior
speed-up and linear scaling behavior (time complexity) and is useful in overcoming space complexity constraints via aggregate memory of the
cluster system.
Table 1. Execution times and speedup ratios for varying dataset sizes and number of processors (np)
Parallel WaveCluster Algorithm on the Shared Memory System
The parallel implementation of the WaveCluster algorithm on GPUs is presented in (Yıldırım & Özdoğan, 2011). In this study, we used a Haar wavelet because of its small initial window width and easy implementation. Then, a lookup table is built which maps more than one unit in the
original dataset to one unit in the transformed dataset. After this phase, the connected component labeling algorithm is performed on the low-
frequency component to find the connected units and the resulting units are transformed back to the original space using look-up table. In the
WaveCluster algorithm, a cluster is defined as a group of connected units in the transformed feature space.
Figure 16 shows the pseudo-code of the extraction of the low-frequency component on the CUDA machine. The kernel is invoked ρ times, and each invocation results in the extraction of a coarser component: the low-frequency component of scale j is produced from the previous component of scale j − 1, which serves as the input of the following kernel invocation. The kernel also removes outliers on the transformed feature space, for which the threshold value TH is chosen arbitrarily. Each CUDA thread is responsible for finding one approximation value extracted from a 2×2 region of the feature space. Before kernel
invocation, the input feature space is transferred from host memory to the CUDA global memory and the output buffer is allocated in the device
memory to store the transformed feature space. Because the input data is 2-dimensional, the wavelet transform is applied twice for each
dimension. Each CUDA thread applies a one-dimensional wavelet transform to each column of the local feature space and stores the intermediate
values in the shared memory buffer H. The final approximation value is eventually calculated by applying a second one-dimensional discrete
wavelet transform to each row of H. Hence, buffer H is used to store temporary results. To facilitate the data access operation, the algorithm
takes advantage of shared memory located on the CUDA machine.
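A sketch of the kernel in Python syntax using Numba's CUDA support is shown below; it collapses the two one-dimensional Haar passes described above into a direct 2×2 average (equivalent for the averaging step, but without the shared-memory staging), and the threshold value, launch configuration, and function names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the low-frequency extraction kernel using Numba's CUDA support.
import numpy as np
from numba import cuda

TH = 0.1  # arbitrary outlier threshold, mirroring the TH mentioned in the text

@cuda.jit
def low_freq_kernel(a_in, a_out):
    # Each thread computes one approximation value from a 2x2 block of a_in.
    i, j = cuda.grid(2)
    if i < a_out.shape[0] and j < a_out.shape[1]:
        avg = (a_in[2 * i, 2 * j] + a_in[2 * i, 2 * j + 1] +
               a_in[2 * i + 1, 2 * j] + a_in[2 * i + 1, 2 * j + 1]) / 4.0
        # Cells below the threshold are treated as outliers and zeroed out.
        a_out[i, j] = avg if avg >= TH else 0.0

def extract_low_frequency(feature_space, scale):
    """Apply the averaging kernel `scale` times; each output feeds the next call."""
    d_in = cuda.to_device(feature_space.astype(np.float32))
    for _ in range(scale):
        out_shape = (d_in.shape[0] // 2, d_in.shape[1] // 2)
        d_out = cuda.device_array(out_shape, dtype=np.float32)
        threads = (16, 16)
        blocks = ((out_shape[0] + 15) // 16, (out_shape[1] + 15) // 16)
        low_freq_kernel[blocks, threads](d_in, d_out)
        d_in = d_out
    return d_in.copy_to_host()
```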
We implemented the CUDA-based algorithms of the wavelet transform, connected component labeling and look-up phases and then evaluated
the performance of each CUDA kernel on very large synthetic datasets. CUDA experiments were conducted on a Linux workstation with 2 GB
RAM, an Intel Core2Duo processor (2 cores, 2.4 GHz, 4 MB L2 cache), and an NVIDIA GTX 465 GPU (1 GB memory, 352 computing cores, each core
runs at 1.215 GHz) with compute capability 2.0 and runtime version 3.10. In this section, we present the extraction phase of the low frequency
component as a data reduction algorithm on GPUs. In the experiments, two-dimensional fog datasets (DataSet1; DS1 and DataSet2; DS2) have
been used. These datasets were obtained from the web site of NASA Weather Satellite Imagery Viewers (2011). As a distinctive property, DS1 has
more clusters than DS2, but the clusters in DS2 are larger. The larger datasets (DS1 and DS2 at sizes of 4096, 2048, 1024, and 512) have been obtained by scaling the original datasets linearly.
Execution times (in microseconds) and corresponding kernel speed-up values for the low-frequency extraction algorithm are presented in Table
2.
Table 2. Performance results of CUDA and CPU versions of Low-Frequency Component Extraction for scale level one on datasets DS1 and DS2; times in microseconds
The obtained results indicate a kernel speed-up ratio of up to 165.01x for dataset DS1 at size 4096 in the low-frequency extraction kernel. We obtain higher speed-up ratios as the number of points in the dataset increases. This result indicates that the operation of signal component extraction in the wavelet transform is well suited to execution on CUDA devices.
The detected numbers of clusters (K) for varying scale level ρ are depicted in Figure 17. As shown in the figures, there is a negative correlation between the scale level ρ and the number of clusters K. Because resizing the dataset results in a coarser representation of the original dataset, the number of clusters decreases with increasing scale level ρ. This phenomenon is a natural consequence of applying wavelet transforms to the low-frequency component of the original signal.
CONCLUSION
In this chapter, we have investigated the study of data reduction methodologies for big datasets both from the point of theory as well as
application with special emphasis on parallel data reduction algorithms. Big datasets need to be processed and analyzed efficiently to extract
useful information for innovation and decision-making in corporate and scientific research. However, because the process of data retrieval on big
data is computationally expensive, a data reduction operation can result in substantial speed up in the execution of the data retrieval process. The
ultimate goal of the data reduction techniques is to save time and bandwidth by enabling the user to deal with larger datasets within minimal
resources without sacrificing the features of the original dataset in the reduced representation of the dataset at the desired level of accuracy.
Despite substantial improvements in processor technology in terms of speed, data reduction algorithms may still not complete the required task in a reasonable amount of time for big data. Additionally, there may not be enough available memory resources to hold all the data on a single computer. One of the possible solutions for overcoming those issues and efficiently mining big data is to make use of parallel algorithms (Hedberg, 1995). Parallel processing techniques are extensively used by dividing the task into smaller subtasks and then executing them simultaneously to cope with memory limits and to decrease the execution time of a given task.
Graphical processing units (GPUs) are regarded as co-processors of the CPU and have tremendous computational power. NVIDIA introduced
CUDA, providing a programming model for parallel computation on GPUs. In this chapter, we have presented a CUDA algorithm of wavelet
transform that is used in the WaveCluster algorithm for reducing the original dataset. The WaveCluster approach first extracts the low-frequency
component from the signal using wavelet transform and then performs connected component labeling on the low-frequency component to find
the clusters present on the dataset. The algorithm has a multi-resolution feature. Thus, the algorithm has the ability of detecting arbitrarily-
shaped clusters at different scales and can handle noise in an appropriate way.
Due to the limited memory capacity of video cards, a CUDA device may not be a remedy for all problems in processing big data. Much larger
datasets could be mined on distributed memory architectures with the aggregate memory of the system using message passing APIs such as MPI.
We have also implemented the WaveCluster algorithm for distributed memory models using MPI (Yıldırım & Özdoğan, 2011) in which a master-
slave model and replicated approach are followed. A dataset that does not fit into the available memory is processed by taking advantage of the
aggregate memory on the distributed memory system.
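The following is a minimal sketch of that master-slave style of processing, assuming the mpi4py binding; the local density grid stands in for the actual WaveCluster kernels, and the dataset size, grid resolution, and threshold are illustrative only.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        # For brevity the master generates the whole dataset in memory; a real run
        # would read it from disk and scatter it chunk by chunk.
        data = np.random.rand(1_000_000, 2)
        chunks = np.array_split(data, size)
    else:
        chunks = None

    local = comm.scatter(chunks, root=0)      # each node holds only a slice of the data

    def local_reduce(points, bins=64):
        # Placeholder reduction step: a local density grid over the unit square.
        grid, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                    bins=bins, range=[[0, 1], [0, 1]])
        return grid

    partial = local_reduce(local)
    total = comm.reduce(partial, op=MPI.SUM, root=0)   # element-wise sum of the local grids

    if rank == 0:
        print("grid cells above threshold:", int((total > 5).sum()))

Run, for example, with mpiexec -n 4 python reduce_mpi.py (the script name is hypothetical); each worker reduces only its own slice, and the partial grids are combined on the master.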
Each parallel memory architecture has intrinsic advantages over the other. With the rise of hybrid architectures, the MPI model (or the map-reduce model) can be used to distribute data among GPU nodes, whereas CUDA functions as the “main computing engine” (Karunadasa & Ranasinghe, 2009). Hence, the two have come to be considered complementary models for utilizing system resources to the maximum.
This work was previously published in Big Data Management, Technologies, and Applications edited by Wen-Chen Hu and Naima Kaabouch, pages 7293, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Almuallim, H., & Dietterich, T. G. (1991). Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial
Intelligence (AAAI91), (vol. 2, pp. 547–552). Anaheim, CA: AAAI Press.
Aluru, S., Prabhu, G. M., & Gustafson, J. (1992). A random number generator for parallel computers. Parallel Computing, 18(8), 839–847. doi:10.1016/0167-8191(92)90030-B
Bradley, T., Toit, J. D., Tong, R., Giles, M., & Woodhams, P. (2011). Parallelization techniques for random number generators. In W. Hwu
(Ed.), GPU Computing Gems Emerald Ed., (pp. 231–246). Boston: Morgan Kaufmann.
Buck, J. B., Watkins, N., LeFevre, J., Ioannidou, K., Maltzahn, C., Polyzotis, N., & Brandt, S. (2011). SciHadoop: Array-based query processing in
Hadoop. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11). ACM.
Chan, K.-P., & Fu, A. W.-C. (1999). Efficient time series matching by wavelets. In Proceedings of the International Conference on Data Engineering (pp. 126–133). IEEE.
Chapman, B., Jost, G., & Ruud, R. V. D. P. (2007). Using OpenMP: Portable shared memory parallel programming . Cambridge, MA: The MIT
Press.
de Matteis, A., & Pagnutti, S. (1990). A class of parallel random number generators. Parallel Computing, 13(2), 193–198. doi:10.1016/0167-8191(90)90146-Z
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113. doi:10.1145/1327452.1327492
Faloutsos, C., Ranganathan, M., & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. SIGMOD Record, 23(2), 419–429. doi:10.1145/191843.191925
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17, 37–54.
Gropp, W., Lusk, E., & Skjellum, A. (1999). Using MPI: Portable parallel programming with the message-passing interface . Cambridge, MA: MIT
Press.
Hedberg, S. R. (1995). Parallelism speeds data mining. IEEE Parallel & Distributed Technology Systems & Applications, 3(4), 3–6. doi:10.1109/88.473600
Jain, A., & Zongker, D. (1997). Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2), 153–158. doi:10.1109/34.574797
John, G. J., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Proceedings of the International Conference
on Machine Learning (pp. 121–129). IEEE.
Karunadasa, N. P., & Ranasinghe, D. N. (2009). Accelerating high performance applications with CUDA and MPI. In Proceedings of the International Conference on Industrial and Information Systems (ICIIS) (pp. 331–336). IEEE.
Madeira, S. C., & Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1, 24–45. doi:10.1109/TCBB.2004.2
Malik, T., Best, N., Elliott, J., Madduri, R., & Foster, I. (2011). Improving the efficiency of subset queries on raster images. In Proceedings of the ACM SIGSPATIAL Second International Workshop on High Performance and Distributed Geographic Information Systems (HPDGIS '11). ACM.
Mukhopadhyay, A., & Maulik, U. (2009). Unsupervised satellite image segmentation by combining SA-based fuzzy clustering with support vector machine. In Proceedings of the International Conference on Advances in Pattern Recognition (pp. 381–384). IEEE.
Narendra, P. M., & Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers , 26(9),
917–922. doi:10.1109/TC.1977.1674939
Nickolls, J., Buck, I., Garland, M., & Skadron, K. (2008). Scalable parallel programming with CUDA. Queue, 6(2), 40–53. doi:10.1145/1365490.1365500
Popivanov, I., & Miller, R. J. (2002). Similarity search over time-series data using wavelets. In Proceedings of the International Conference on Data Engineering (pp. 212–221). IEEE.
Pudil, P., & Novovičová, J. (1998). Novel methods for feature subset selection with respect to problem knowledge. In Feature Extraction, Construction and Selection (pp. 101–116). New York: Springer. doi:10.1007/978-1-4615-5725-8_7
Pudil, P., Novovičová, J., & Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15(11), 1119–1125. doi:10.1016/0167-8655(94)90127-9
Shapiro, L. G., & Stockman, G. C. (2001). Computer vision. Englewood Cliffs, NJ: Prentice Hall.
Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the International Conference on Very Large Data Bases (pp. 428–439). IEEE.
Sifuzzaman, M., Islam, M. R., & Ali, M. Z. (2009). Application of wavelet transform and its advantages compared to Fourier transform. Journal of Physical Sciences, 13, 121–134.
Singh, S., Kubica, J., Larsen, S., & Sorokina, D. (2009). Parallel large scale feature selection for logistic regression. In Proceedings of the SIAM
International Conference on Data Mining (SDM). SDM.
Somol, P., Pudil, P., Ferri, F. J., & Kittler, J. (2000). Fast branch & bound algorithm in feature selection. In B. Sanchez, M. J. Pineda, & J.
Wolfmann, (Eds.), Proceedings of SCI 2000: The 4th World Multiconference on Systemics, Cybernetics and Informatics (pp. 646–651).
Orlando, FL: IIIS.
Song, Z., & Roussopoulos, N. (2000). Using Hilbert curve in image storing and retrieving. In Proceedings of the 2000 ACM Workshops on
Multimedia (MULTIMEDIA '00). ACM.
Stollnitz, E. J., DeRose, T. D., & Salesin, D. H. (1995). Wavelets for computer graphics: A primer, part 1. IEEE Computer Graphics and Applications, 15(3), 76–84. doi:10.1109/38.376616
Surdeanu, M., Turmo, J., & Ageno, A. (2005). A hybrid unsupervised approach for document clustering. In Proceedings of the Eleventh ACM
SIGKDD International Conference on Knowledge Discovery in Data Mining (pp. 685–690). New York, NY: ACM.
Wang, D. L., Zender, C. S., & Jenks, S. F. (2007). Server-side parallel data reduction and analysis. Advances in Grid and Pervasive Computing, 4459, 744–750. doi:10.1007/978-3-540-72360-8_67
Whitney, A. W. (1971). A direct method of nonparametric measurement selection. IEEE Transactions on Computers, 20(9), 1100–1103. doi:10.1109/T-C.1971.223410
Wu, Y., Agrawal, D., & Abbadi, A. E. (2000). A comparison of DFT and DWT based similarity search in time-series databases. In Proceedings of the Ninth International Conference on Information and Knowledge Management, CIKM ’00 (pp. 488–495). New York, NY: ACM.
Yıldırım, A. A., & Özdoğan, C. (2011). Parallel WaveCluster: A linear scaling parallel clustering algorithm implementation with application to very large datasets. Journal of Parallel and Distributed Computing, 71(7), 955–962. doi:10.1016/j.jpdc.2011.03.007
Yıldırım, A. A., & Özdoğan, C. (2011). Parallel wavelet-based clustering algorithm on GPUs using CUDA. Procedia Computer Science, 3, 396–400. doi:10.1016/j.procs.2010.12.066
KEY TERMS AND DEFINITIONS
Big Data: Data that takes an excessive amount of time/space to store, transmit, and process using available resources.
Clustering: A common data mining technique that is used for data retrieval by grouping similar objects into disjoint classes or clusters.
CUDA (Compute Unified Device Architecture): A parallel programming model that is used for general-purpose computation on NVIDIA
GPUs (Graphical Processing Units).
Distributed Memory System (DMS) Model: A model in which independent processors or nodes, each with its own memory and address space, are connected to one another by means of a high-speed network infrastructure.
Feature Selection Technique: A data reduction technique in which the goal is to find the best representative data among all possible feature combinations.
Feature Extraction Technique: A technique that solves the reduction problem by transforming input data into another space in which the
task is performed using reduced transformed data.
Parallel Processing Techniques: Techniques that are extensively used in divide-and-conquer approaches in which the task is divided into
smaller subtasks, all of which are then executed simultaneously to decrease the overall execution time of a given task.
Shared Memory System (SMS) Model: A model in which there is a single address space – and thus the appearance of a shared memory bank – that any of the nodes/processors may read from or write to at any time.
CHAPTER 35
The IT Readiness for the Digital Universe
Pethuru Raj
IBM India Pvt Ltd, India
ABSTRACT
The implications of the digitization process, among a bevy of trends, are definitely many and memorable. One is the abnormal growth in data generation, gathering, and storage due to a steady increase in the number of data sources, structures, scopes, sizes, and speeds. In this chapter, the authors show some of the impactful developments brewing in the IT space: how the tremendous amount of data being produced and processed all over the world impacts the IT and business domains, how next-generation IT infrastructures are accordingly being refactored, remedied, and readied for the impending big data-induced challenges, how the big data analytics discipline is likely to move towards fulfilling the digital universe requirements of extracting and extrapolating actionable insights for the knowledge-parched, and finally, how the smarter planet can be established and sustained.
INTRODUCTION
One of the most visible and value-adding trends in IT is none other than digitization. All kinds of common, casual, and cheap items in our personal, professional, and social environments are being digitized systematically to be computational, communicative, sensitive, and responsive. That is, all kinds of ordinary entities in our midst are instrumented differently to be extraordinary in their operations, outputs, and offerings. These days, due to the unprecedented maturity and stability of a host of path-breaking technologies such as miniaturization, integration, communication, computing, sensing, perception, middleware, analysis, actuation, and articulation, everything has grasped the inherent power of interconnecting with other things in its vicinity, as well as with remote objects via networks, purposefully and on an as-needed basis, to uninhibitedly share its distinct capabilities towards the goal of business automation, acceleration, and augmentation. Ultimately, everything will become smart, electronics goods will become smarter, and human beings will become the smartest.
The Trickling and Trend-Setting Technologies in the IT Space
As widely reported, there are several delectable transitions in the IT landscape. The consequences are vast and varied: incorporation of nimbler and next-generation features and functionalities into existing IT solutions; grand opening of fresh possibilities and opportunities; eruption of altogether new IT products and solutions for humanity. The Gartner report on the top-ten trends for the year 2014 describes several scintillating concepts (Forbes, 2014). These have the inherent capability to bring forth numerous subtle and succinct transformations for businesses as well as people. In this section, the most prevalent and pioneering trends in the IT landscape are discussed.
IT Consumerization and Commoditization
The much-discoursed and deliberated Gartner report details the diversity of mobile devices (smartphones, tablets, wearables, etc.) and their management to be relevant and rewarding for people (Vodafone, 2010). That is, it is all about the IT consumerization trend that has been evolving for some time now and is peaking these days. IT is steadily becoming an inescapable part of consumers' lives, directly and indirectly. With the powerful emergence of Bring Your Own Device (BYOD), the need for robust and resilient mobile device management software solutions is being felt and insisted upon across the industry. Another aspect is the emergence of next-generation mobile applications and services across a variety of business verticals. There is a myriad of mobile applications, maps and services development platforms, programming and mark-up languages, architectures and frameworks, tools, containers, and operating systems in the fast-moving mobile space. Commoditization is another cool trend penetrating the IT industry. With the huge acceptance and adoption of cloud computing and big data analytics, the value of commodity IT is decidedly on the rise.
IT Digitization and Distribution
As explained in the beginning, digitization has been an on-going and overwhelming process and it has quickly generated and garnered a lot of
market and mind shares. Digitally enabling everything around us induces a dazzling array of cascading and captivating effects in the form of
cognitive and comprehensive transformations for businesses as well as people. With the growing maturity and affordability of edge technologies,
every common thing in our personal, social, and professional environment is becoming digitized. Devices are being tactically empowered to be
computational, communicative, sensitive, and responsive. Ordinary articles are becoming smart artifacts in order to significantly enhance the
convenience, choice, and comfort levels of humans in their everyday lives and works.
Therefore it is no exaggeration to state that lately there have been a number of tactical as well as strategic advancements in the edge-technologies
space. Infinitesimal and invisible tags, sensors, actuators, controllers, stickers, chips, codes, motes, specks, smart dust, and the like are being
produced in plenty. Every single tangible item in our midst is being systematically digitized by internally as well as externally attaching these
minuscule products onto them. This is for empowering them to be smart in their actions and reactions. Similarly, the distribution aspect too gains more ground. Due to its significant advantages in crafting and sustaining a variety of business applications while ensuring the hard-to-realize Quality of Service (QoS) attributes, there is a bevy of distribution-centric software architectures, frameworks, patterns, practices, and platforms for Web, enterprise, embedded, analytical, and cloud applications and services.
Ultimately all kinds of perceptible objects in our everyday environments will be empowered to be self-, surroundings-, and situation-aware,
remotely identifiable, readable, recognizable, addressable, and controllable. Such a profound empowerment will bring forth transformations for
the total human society, especially in establishing and sustaining smarter environments, such as smarter homes, buildings, hospitals, classrooms,
offices, and cities. Suppose, for instance, a disaster occurs. If everything in the disaster area is digitized, then it becomes possible to rapidly
determine what exactly has happened, the intensity of the disaster, and the hidden risks inside the affected environment. Any information
extracted provides a way to plan and proceed insightfully, reveals the extent of the destruction, and conveys the correct situation of the people
therein. The knowledge gained would enable the rescue and emergency team leaders to cognitively contemplate appropriate decisions and plunge into action straightaway to rescue as many people as possible, thereby minimizing damage and losses.
In short, digitization will enhance our decision-making capability in our personal as well as professional lives. Digitization also means that the
ways we learn and teach are to change profoundly, energy usage will become knowledge-driven so that green goals can be met more smoothly and
quickly, and the security and safety of people and properties will go up considerably. As digitization becomes pervasive, our living, relaxing,
working, and other vital places will be filled up with a variety of electronics including environment monitoring sensors, actuators, monitors,
controllers, processors, projectors, displays, cameras, computers, communicators, appliances, gateways, high-definition IP TVs, and the like. In
addition, items such as furniture and packages will become empowered by attaching specially made electronics onto them. Whenever we walk
into such kinds of empowered environments, the devices we carry and even our e-clothes will enter into collaboration mode and form wireless ad
hoc networks with the objects in that environment. For example, if someone wants to print a document from their smartphone or tablet and they enter a room where a printer is situated, the smartphone will automatically begin a conversation with the printer, check its competencies, and send the document to be printed. The smartphone will then alert the owner.
Digitization will also provide enhanced care, comfort, choice and convenience. Next-generation healthcare services will demand deeply connected
solutions. For example, Ambient Assisted Living (AAL) is a new prospective application domain where lonely, aged, diseased, bedridden and
debilitated people living at home will receive remote diagnosis, care, and management as medical doctors, nurses and other care givers remotely
monitor patients’ health parameters.
People can track the progress of their fitness routines. Taking decisions becomes an easy and timely affair with the prevalence of connected solutions, which benefit knowledge workers immensely. All the secondary and peripheral needs will be accomplished in an unobtrusive manner, enabling people to nonchalantly focus on their primary activities. However, there are some areas of digitization that need attention, one being energy efficiency. Green solutions and practices are being insisted upon everywhere these days, and IT is one of the principal culprits in wasting a lot of
energy due to the pervasiveness of IT servers and connected devices. Data centers consume a lot of electricity, so green IT is a hot subject for
study and research across the globe. Another area of interest is remote monitoring, management, and enhancement of the empowered devices.
With the number of devices in our everyday environments growing at an unprecedented scale, their real-time administration, configuration,
activation, monitoring, management, and repair (if any problem arises) can be eased considerably with effective remote correction competency.
At a fundamental level, there are three distinct, but deeply interrelated, domains (Devlin, 2012):
• Human-Sourced Information: People are the ultimate source of all information. This is our highly subjective record of our personal experiences. Previously recorded in books and works of art, and later in photographs, audio, and video recordings, human-sourced information has now been largely digitized and electronically stored everywhere, from tweets to movies. This information is loosely structured, ungoverned, and may not even be a reliable representation of “reality,” especially for business. Structuring and standardization, i.e., modelling, are required to define a common version of the truth. We convert human-sourced information to process-mediated data in a variety of ways, the most basic being data entry in systems of record.
• Process-Mediated Data: Every business and organization is run according to processes, which, among other things, record and monitor
business events of interest, such as registering a customer, manufacturing a product, or taking an order. This data includes transactions,
reference tables and relationships, as well as the metadata that sets its context, all in a highly structured form. Traditionally, process-
mediated data formed the vast majority of what IT managed and processed, including both operational and BI data. Its highly structured and
regulated form makes it ideal for performing information management, maintaining data quality and so on.
• Machine-Generated Data: We have become increasingly dependent on machines to measure and record the events and situations that we experience physically. Machine-generated data is the well-structured output of machines, from simple sensor records to complex computer logs, and is considered a highly reliable representation of reality. It is an increasingly important component of the information stored and processed by many businesses. Its volumes are growing as sensors proliferate and, although its structured nature is well suited to computer processing, its size and speed are often beyond the capacity of traditional approaches, such as the enterprise data warehouse (EDW), that were designed for process-mediated data.
Extreme Connectivity
The connectivity capability has risen dramatically and become deeper and extreme. The kinds of network topologies are consistently expanding
and empowering their participants and constituents to be highly productive. There are unified, ambient and autonomic communication
technologies from research organizations and labs drawing the attention of executives and decision-makers. All kinds of systems, sensors,
actuators, and other devices are empowered to form ad hoc networks for accomplishing specialized tasks in a simpler manner. There are a variety of network and connectivity solutions in the form of load balancers, switches, routers, gateways, proxies, firewalls, etc. For higher performance, network solutions are also being embedded in appliance (software as well as hardware) mode.
Device middleware or Device Service Bus (DSB) is the latest buzzword enabling a seamless and spontaneous connectivity and integration
between disparate and distributed devices. That is, device-to-device (in other words, Machine-to-Machine [M2M]) communication is the talk of
the town. The interconnectivity-facilitated interactions among diverse categories of devices precisely portend a litany of supple, smart and
sophisticated applications for people. Software-Defined Networking (SDN) is the latest technological trend captivating professionals to have a
renewed focus on this emerging yet compelling concept. With clouds being strengthened as the core, converged and central IT infrastructure,
device-to-cloud connections are fast-materializing. This local as well as remote connectivity empowers ordinary articles to become extraordinary objects by making them distinctively communicative, collaborative, and cognitive.
Service Enablement
Every technology invariably pushes for its adoption. Internet computing forced Web-enablement, which is the essence behind the proliferation of Web-based applications. Now, with the pervasiveness of sleek, handy, and multifaceted mobiles, every enterprise and Web application is being mobile-enabled. That is, all kinds of local and remote applications are being accessed through mobiles on the move, thus fulfilling real-time interactions and decision-making economically. With the overwhelming approval of the service idea, every application is being service-enabled. That is, we often read about, hear of, and experience service-oriented systems. The majority of next-generation enterprise-scale, mission-critical, process-centric, and multi-purpose applications are being assembled out of multiple discrete and complex services.
Not only applications, but also physical devices at the ground level are being seriously service-enabled in order to uninhibitedly join mainstream computing tasks and contribute to the intended success. That is, devices, individually and collectively, could become service providers or publishers, brokers and boosters, and consumers. The prevailing and pulsating idea is that any service-enabled device in a physical environment could interoperate with others in the vicinity as well as with remote devices and applications. Services could abstract and expose only specific capabilities of devices through service interfaces, while service implementations remain hidden from user agents. Such smart separation enables any requesting device to see only the capabilities of target devices, and then to connect, access, and leverage those capabilities to achieve business or people services. Service enablement removes dependencies and deficiencies so that devices can interact with one another flawlessly and flexibly.
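As a small illustration of this separation of interface and implementation (echoing the smartphone-and-printer example earlier in this chapter), the Python sketch below defines a hypothetical PrinterService interface; the class and method names are inventions for illustration, not any particular device middleware API.

    from abc import ABC, abstractmethod

    class PrinterService(ABC):
        # The service interface: the only view a requesting agent gets of the device.
        @abstractmethod
        def capabilities(self) -> dict:
            """Advertise what the device can do (e.g. colour, duplex)."""

        @abstractmethod
        def submit(self, document: bytes) -> str:
            """Accept a print job and return a job identifier."""

    class OfficeLaserPrinter(PrinterService):
        # Concrete implementation; its internals stay hidden behind the interface.
        def capabilities(self) -> dict:
            return {"colour": False, "duplex": True}

        def submit(self, document: bytes) -> str:
            # A real device would spool the job; here we only acknowledge it.
            return "job-{}".format(len(document))

    def print_from_phone(service: PrinterService, doc: bytes) -> str:
        # The smartphone-side agent checks the advertised capabilities, then delegates.
        if service.capabilities().get("duplex"):
            return service.submit(doc)
        raise RuntimeError("no suitable printer found")

    print(print_from_phone(OfficeLaserPrinter(), b"quarterly-report"))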
The Internet of Things (IoT)/Internet of Everything (IoE)
Originally, the Internet was the network of networked computers. Then, with the heightened ubiquity and utility of wireless and wired devices, the scope, size, and structure of the Internet have changed to what they are now, making the Internet of Devices (IoD) concept a mainstream reality. With the service paradigm being positioned as the most optimal, rational, and practical way of building enterprise-class applications, a gamut of services (business and IT) are being built by many, deployed in worldwide Web and application servers, and delivered to everyone via an increasing array of input/output devices over networks. The increased accessibility and auditability of services have propelled interested software architects, engineers, and application developers to realize modular, scalable, and secure software applications by quickly choosing and composing appropriate services from those service repositories. Thus, the Internet of Services (IoS) idea is fast-growing. Another interesting phenomenon getting the attention of the press these days is the Internet of Energy. That is, our personal as well as professional devices get their energy through their interconnectivity. Figure 1 clearly illustrates how different things are linked with one another in order to conceive, concretize, and deliver futuristic services for mankind (Intel, 2012).
As digitization gains more accolades and success, all sorts of everyday objects are being connected with one another as well as with scores of
remote applications in cloud environments. That is, everything is becoming a data-supplier for the next-generation applications thereby
becoming an indispensable ingredient individually as well as collectively in consciously conceptualizing and concretizing smarter applications.
There are several promising implementation technologies, standards, platforms, and tools enabling the realization of the IoT vision. The probable output of the IoT field is a cornucopia of smarter environments, such as smarter offices, homes, hospitals, retail, energy, government, cities, etc.
Cyber-Physical Systems (CPS), Ambient Intelligence (AmI), and Ubiquitous Computing (UC) are some of the related concepts encompassing the
ideals of IoT.
In the upcoming era, unobtrusive computers, communicators, and sensors will facilitate decision making in a smart way. Computers of different sizes, looks, capabilities, and interfaces will be fitted, glued, implanted, and inserted everywhere to be coordinative, calculative, and coherent. The intervention and involvement of humans in operationalizing these smarter and sentient objects will be almost nil. With autonomic IT infrastructures, more intensive automation is bound to happen. The devices will also be handling all kinds of everyday needs, with humanized robots extensively used in order to fulfil our daily physical chores. With the emergence of specific devices for different environments, there will similarly be hordes of services and applications becoming available for making the devices smarter, which will in turn make our lives more productive.
In summary, the Internet is expanding into enterprise assets and consumer items such as cars and televisions. Gartner identifies four basic
usage models that are emerging:
• Manage,
• Monetize,
• Operate, and
• Extend.
These can be applied to people, things, information, and places. That is, the Internet of everything is all set to flourish unflinchingly.
Infrastructure Optimization
The entire IT stack has been undergoing a periodic makeover. Especially on the infrastructure front, due to the closed, inflexible, and monolithic nature of conventional infrastructures, concerted efforts are being undertaken by many in order to untangle them into modular, open, extensible, converged, and programmable infrastructures. Another worrying factor is the underutilization of expensive IT infrastructures (servers, storage, and networking solutions). With IT becoming ubiquitous for automating most of the manual tasks in different verticals, the problem of IT sprawl is set to grow, and these resources are mostly underutilized, sometimes even sitting unutilized for long periods. Having understood these prickling issues pertaining to IT infrastructures, those concerned have plunged into unveiling versatile and venerable measures for enhanced utilization and for infrastructure optimization. Infrastructure rationalization and simplification are related activities. That is, next-generation IT infrastructures are being realized through consolidation, centralization, federation, convergence, virtualization, automation, and sharing. To bring in more flexibility, software-defined infrastructures are being prescribed these days.
With the faster spread of big data analytics platforms and applications, commodity hardware is being insisted upon to accomplish data- and process-intensive big data analytics quickly and cheaply. That is, we need low-priced infrastructures with supercomputing capability and near-infinite storage. The answer is that all kinds of underutilized servers are collected and clustered together to form a dynamic and huge pool of server machines to efficiently tackle the increasing and intermittent needs of computation. Precisely speaking, clouds are the new-generation infrastructures that fully comply with these expectations elegantly and economically. The cloud technology, though not a new one, represents a cool and compact convergence of several proven technologies to create a spellbinding impact on both business and IT in realizing the dream of virtual IT, which in turn blurs the distinction between the cyber and the physical worlds. This is the reason for the exponential growth being attained by the cloud paradigm. That is, the tried and tested technique of “divide and conquer” in software engineering is steadily percolating to hardware engineering.
Decomposition of physical machines into a collection of sizable and manageable virtual machines and composition of these virtual machines
based on the computing requirement is the essence of cloud computing.
Finally, software-defined cloud centers will see the light of day soon, with the faster maturity and stability of competent technologies towards that goal. There are still some critical inflexibility, incompatibility, and tight dependency issues among various components in cloud-enabled data centers; thus, full-fledged optimization and automation are not yet possible within the current setup. To attain the originally envisaged goals, researchers
are proposing to incorporate software wherever needed in order to bring in the desired separations so that a significantly higher utilization is
possible. When the utilization goes up, the cost is bound to come down. In short, the target of infrastructure programmability can be met with the
embedding of resilient software so that the infrastructure manageability, serviceability, and sustainability tasks become easier, economical and
quicker.
Real-Time, Predictive, and Prescriptive Analytics
As we all know, the big data paradigm is opening up a fresh set of opportunities for businesses. With the data explosion forecast by leading market research and analyst reports, the key challenge in front of businesses is how to efficiently and rapidly capture, process, analyse, and extract tactical, operational, as well as strategic insights in time to act upon them swiftly, with all the necessary confidence and clarity. In the
recent past, there were experiments on in-memory computing. For a faster generation of insights out of a large amount of multi-structured data,
the new entrants such as in-memory and in-database analytics are highly reviewed and recommended. The new mechanism insists on putting all
incoming data in memory instead of storing it in local or remote databases so that the major barrier of data latency gets eliminated. There are a
variety of big data analytics applications in the market and they implement this new technique in order to facilitate real-time data analytics.
Timeliness is an important factor for information to be beneficially leveraged. The appliances are in general high-performing, thus guaranteeing
higher throughput in all they do. Here too, considering the need for real-time emission of insights, several product vendors have taken the route
of software as well as hardware appliances for substantially accelerating the speed with which the next-generation big data analytics get
accomplished.
In the Business Intelligence (BI) industry, apart from realizing real-time insights, analytical processes and platforms are being tuned to bring forth insights that predict what is likely to happen for businesses in the near future. Therefore, executives and other serious stakeholders can proactively and pre-emptively formulate well-defined schemes and action plans, fresh policies, new product offerings, premium services, and viable, value-added solutions based on these inputs. Prescriptive analytics, on the other hand, is meant to assist business executives in prescribing and formulating competent and comprehensive schemes and solutions based on the predicted trends and transitions.
IBM has introduced a new computing paradigm “stream computing” in order to capture streaming and event data on the fly and to come out with
usable and reusable patterns, hidden associations, tips, alerts and notifications, impending opportunities as well as threats, etc. in time for
executives and decision-makers to contemplate appropriate countermeasures (Kobielus, 2013).
The Recent Happenings in the IT Space
• Extended Device Ecosystem: Trendy and handy, slim and sleek mobile, wearable, implantable and portable, and energy-aware devices
(instrumented, interconnected and intelligent devices).
• Sentient and Smart Materials: Attaching scores of edge technologies (invisible, calm, infinitesimal and disposable sensors and
actuators, stickers, tags, labels, motes, dots, specks, etc.) on ordinary objects to exhibit extraordinary capabilities.
• Extreme and Deeper Connectivity: Standards, technologies, platforms, and appliances for device-to-device, device-to-cloud, cloud-to-cloud, and on-premise to off-premise interactions.
• Infrastructure Optimization: Programmable, Consolidated, Converged, Adaptive, Automated, Shared, QoS-enabling, Green, and Lean
Infrastructures.
• New-Generation Databases: SQL, NoSQL, NewSQL, and Hybrid Databases for the Big Data World.
• Real-Time, Predictive, and Prescriptive Analytics: Big Data Analytics, In-Memory Computing, etc.
• Next-Generation Applications: Social, Mobile, Cloud, Enterprise, Web, Analytical, and Embedded Application Categories.
The Big Picture
With the cloud space growing fast as the next-generation environment for application design, development, deployment, integration,
management, and delivery as a service, the integration requirement too has grown deeper and broader, as pictorially illustrated in Figure 2. All kinds of physical entities at the ground level will have purpose-specific interactions with services and data hosted on enterprise as well as cloud servers and storage to enable scores of real-time and real-world applications for society. This extended and enhanced integration
would lead to data deluges that have to be accurately and appropriately subjected to a variety of checks to promptly derive actionable insights
that in turn enable institutions, innovators and individuals to be smarter and speedier in their obligations and offerings.
Newer environments such as smarter cities, governments, retail, healthcare, energy, manufacturing, supply chain, offices, and homes will
flourish. Cloud, being the smartest IT technology, is inherently capable of meeting all kinds of infrastructural requirements fully and firmly.
ENVISIONING THE DIGITAL UNIVERSE
The digitization process has gripped the whole world today as never before and its impacts and initiatives are being widely talked about. With an
increasing variety of input and output devices and newer data sources, the realm of data generation has gone up remarkably. It is forecasted that
there will be billions of everyday devices getting connected, capable of generating an enormous amount of data assets which need to be processed.
It is clear that the envisaged digital world is to result in a huge amount of bankable data. This growing data richness, diversity, value and reach
decisively gripped the business organizations and governments first. Thus, there is a fast-spreading of newer terminologies such as digital
enterprises and economies. Now it is gripping the whole world and this new world order has tellingly woken up worldwide professionals and
professors to formulate and firm up flexible, futuristic strategies, policies, practices, technologies, tools, platforms, and infrastructures to tackle
this colossal yet cognitive challenge head on. Also, IT product vendors are releasing refined and resilient storage appliances, new types of
databases, distributed file systems, data warehouses, etc. to store the growing volume of business, personal, machine, people, and online data and to enable specific types of processing, mining, slicing, and analysis of the data being collected. This pivotal
phenomenon has become a clear reason for envisioning the digital universe.
IDC (Gantz & Reinsel, 2012) defines “digital universe” as a measure of all the digital data created, replicated, and consumed in a single year. It is
also a projection of the size of that universe to the end of the decade (2020). The digital universe is made up of all kinds of data such as images
(still and running) in cameras, audio albums in digital players, video games in game consoles, and digital movies in high-definition television
sets, environmental data extracted and transmitted by scores of smart sensors, banking data through card swiping, security footage at major
establishments such as airports and in major events such as the Olympic Games, subatomic collisions recorded by the Large Hadron Collider at
CERN, transponders recording highway tolls, voice calls, emails and texting for communications, etc. In this context, at the midpoint of a
longitudinal study starting with data collected in 2005 and extending to 2020, the IDC analysis shows a continuously expanding, increasingly
complex, and ever more interesting digital universe. Here is some statistical information.
From 2005 to 2020, the digital universe will grow by a factor of 300 from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than
5,200 gigabytes for every man, woman, and child in 2020). From now until 2020, the digital universe will almost double every two years.
Investment in IT hardware, software, services, telecommunications, and staff, which together are considered the “infrastructure” of the digital universe, is set to grow by 40% between 2012 and 2020. As a result, the investment per gigabyte (GB) during that same period will
drop from $2.00 to $0.20. Of course, investment in targeted areas like storage management, security, big data, and cloud computing will grow
considerably faster.
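A back-of-the-envelope check of the quoted figures, assuming a 2020 world population of roughly 7.6 billion (the population figure is an assumption; the other numbers come from the text):

    EB_2005, EB_2020 = 130, 40_000        # exabytes
    growth_factor = EB_2020 / EB_2005     # ~308, i.e. "a factor of 300"
    gb_2020 = EB_2020 * 1e9               # 1 EB = 10^9 GB, i.e. 40 trillion GB
    per_capita_gb = gb_2020 / 7.6e9       # ~5,263 GB per person ("more than 5,200")
    cost_ratio = 2.00 / 0.20              # spending per GB falls roughly tenfold
    print(round(growth_factor), round(per_capita_gb), cost_ratio)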
Impact of the Digital Universe on IT: Information managed by enterprises will grow by 14X by 2020, the average number of servers will increase by 10X, and the number of IT professionals is expected to grow by only 1.5X.
The world’s ‘digital universe’ will grow to 8 ZB by 2015. While this figure is very daunting, the bigger challenge is the different data forms and
formats within that 8 ZB data. IDC predicts that by 2015, over 90 percent of that data will be unstructured. Just think, every 60 seconds, the
world generates massive amounts of unstructured data:
• 98,000+ tweets.
There will be hitherto unforeseen applications in the digital universe in which all kinds of data producers, middleware, consumers, storages,
analytical systems, virtualization and visualization tools and software applications will be seamlessly and spontaneously connected with one
another. Especially there is a series of renowned and radical transformations in the sensor space. Nanotechnology and other miniaturization
technologies have brought in legendary changes in sensor design. The nano-sensors can be used to detect vibrations, motion, sound, colour, light,
humidity, chemical composition and many other characteristics of their deployed environments. These sensors can revolutionize the search for
new oil reservoirs, structural integrity for buildings and bridges, merchandise tracking and authentication, food and water safety, energy use and
optimization, healthcare monitoring and cost savings, and climate and environmental monitoring. The point to be noted here is the volume of
real-time data being emitted by the army of sensors and actuators.
The steady growth of sensor networks increases the need for one million times more storage and processing power by 2020. It is projected that
there will be one trillion sensors by 2030 and that every single person on this planet will be assisted by approximately 150 sensors. Cisco has predicted that there will be 50 billion connected devices by 2020, and hence the days of the Internet of Everything (IoE) are not too far off. All these scary statistics convey one thing: IT applications, services, platforms, and infrastructures need to be substantially and smartly invigorated to meet all sorts of business and people's needs in the ensuing era of deepened digitization. Here is a pictorial representation (Figure 3) of
some of the data-driven revolutions that are to happen soon in the forthcoming digital universe (HP, 2012).
Precisely speaking, the data volume is going to be humongous as digitization grows deep and wide. The resulting digitization-induced digital universe will therefore be awash with data to be collected and analysed. The data complexity arising from data heterogeneity and multiplicity will be a real challenge for enterprise IT teams. Therefore, big data is being positioned and projected as the right computing model to effectively tackle the data revolution challenges of the ensuing digital universe.
BIG DATA IN 2020
The big data paradigm has become a big topic across nearly every business domain. IDC defines big data computing as a set of new-generation
technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity
capture, discovery, and/or analysis. There are three core components in big data: the data itself, the analytics of the data captured and
consolidated, and the articulation of insights oozing out of data analytics. There are robust products and services that can be wrapped around one
or all of these big data elements. Thus there is a direct connectivity and correlation between the digital universe and the big data idea sweeping
the entire business scene. The vast majority of new data being generated as a result of digitization is unstructured or semi-structured. This means
there is a need to somehow characterize or tag such kinds of multi-structured big data to make them useful and usable. This empowerment
through additional characterization or tagging results in metadata, which is one of the fastest-growing sub-segments of the digital universe
though metadata itself is a minuscule part of the digital universe. IDC believes that by 2020, a third of the data in the digital universe (more than
13,000 exabytes) will have big data value, but only if it is tagged and analysed. There will be routine, repetitive, and redundant data, and hence not all data is necessarily useful for big data analytics. However, there are some specific data types that are particularly ripe for big data analysis, such as:
• Surveillance Footage: Generic metadata (date, time, location, etc.) is automatically attached to video files. However, as IP cameras continue to proliferate, there is a greater opportunity to embed more intelligence into the cameras at the edge so that footage can be captured, analysed, and tagged in real time. This type of tagging can expedite crime investigations for security insights, enhance retail analytics for consumer traffic patterns, and of course improve military intelligence, as videos from drones across multiple geographies are compared for pattern correlations, crowd emergence and response, or for measuring the effectiveness of counterinsurgency.
• Embedded and Medical Devices: In the future, sensors of all types, including those that may be implanted into the body, will capture vital and non-vital biometrics, track medicine effectiveness, correlate bodily activity with health, monitor potential outbreaks of viruses, etc., all in real time, thereby realising automated healthcare with prediction and precaution.
• Entertainment and Social Media: Trends based on crowds or massive groups of individuals can be a great source of big data to help
bring to market the “next big thing,” help pick winners and losers in the stock market, and even predict the outcome of elections all based on
information users freely publish through social outlets.
• Consumer Images: We say a lot about ourselves when we post pictures of ourselves or our families or friends. A picture used to be worth
a thousand words but the advent of big data has introduced a significant multiplier. The key will be the introduction of sophisticated tagging
algorithms that can analyse images either in real time when pictures are taken or uploaded or en masse after they are aggregated from
various Websites.
Explaining the Big Data Era (Intuit 2020, 2013)
The growth of the Internet, wireless networks, smartphones, social media, sensors and other digital technology is collectively fuelling a data
revolution. Over the next decade, analysts expect the global volume of digital data to increase more than 40-fold. Previously the exclusive domain
of statisticians, large corporations and Information Technology (IT) departments, the emerging availability of data and analytics (call it a new
democratization) gives small and medium businesses and individual consumers greater access to cost-effective, sophisticated, data-powered tools and analytical systems. This new data democracy will deliver meaningful insights on markets, competition, and bottom-line business results for businesses, as well as shape many of the decisions we make as individuals and families in our daily life journey. In a nutshell, the arrival of digital data in big numbers will empower individuals, innovators, and institutions.
• Data Empowers Consumers: Besides organizations, digital data helps individuals to navigate the maze of modern life. As life becomes
increasingly complex and intertwined, digital data will simplify the tasks of decision-making and actuation. The growing uncertainty in the
world economy over the last few years has shifted many risk management responsibilities from institutions to individuals. In addition to this
increase in personal responsibility, other pertinent issues such as life insurance, health care, retirement, etc. are growing ever more intricate, increasing the number of difficult decisions we all make very frequently. Data-driven insights come in handy in such difficult situations, helping consumers to wriggle out of them. Digital data hence is the foundation and fountain of the knowledge society.
• Power Shifts to the Data-Driven Consumers: Data is an asset for all. Organizations are sagacious and successful in promptly
bringing out premium and people-centric offerings by extracting operational and strategically sound intelligence out of accumulated
business, market, social, and people data. There is a gamut of advancements in data analytics in the form of unified platforms and optimized
algorithms for efficient data analysis, etc. There are plenty of data virtualization and visualization technologies. Data-aggregation, analysis
and articulation platforms are making online business reviews commonplace and powering smartphone applications that evaluate and
compare products and service prices in double quick time. These give customers enough confidence and ever-greater access to pricing
information, service records and specifics on business behaviour and performance. With the new-generation data analytics being performed
easily and economically in cloud platforms and transmitted to smart phones, the success of any enterprise or endeavour solely rests with
knowledge-empowered consumers.
• Consumers Delegate Tasks to Digital Concierges: We have been using a myriad of digital assistants (tablets, smartphones, wearables, etc.) for a variety of purposes in our daily lives. These electronics are of great help, and crafting applications and services for these specific as well as generic devices empowers them to be more right and relevant for us. Data-driven smart applications will enable these new-generation digital concierges to be expertly tuned to help us with many things in our daily lives.
Big data is driving a revolution in machine learning and automation. This will create a wealth of new smart applications and devices that can
anticipate our needs precisely and perfectly. In addition to responding to requests, these smart applications will proactively offer information and
advice based on detailed knowledge of our situation, interests and opinions.
This convergence of data and automation will simultaneously drive a rise of user-friendly analytic tools that help make sense of the information
and create new levels of ease and empowerment for everything from data entry to decision making. Our tools will become our data interpreters,
business advisors and life coaches, making us smarter and more fluent in all subjects of life.
• Data Fosters Community: Due to the growing array of extra facilities, opportunities, and luxuries being made available and accessible
in modernized cities, there is a consistent migration to urban areas and metros from villages. This trend has displaced people from their
roots and there is a huge disconnect between people in new locations also. Now with the development and deployment of services (Online
location-based services, local search, community-specific services, and new data-driven discovery applications) based on the growing size of
social, professional and people data, people can quickly form digital communities virtually in order to explore, find, share, link and
collaborate with others. The popular social networking sites enable people to meet and interact with one another purposefully. Governments use data and analytics to establish citizen-centric services, improve public safety, and reduce crime. Medical practitioners use them to diagnose better and treat diseases effectively. Individuals are tapping into online data and tools for help with everything from planning their careers and retirement, to choosing everyday service providers, picking places to live, finding the quickest way to get to work, and so on. Data, services
and connectivity are the three prime ingredients in establishing and sustaining rewarding relationships among diverse and distributed
people groups.
• Data Empowers Businesses to be Smart: Big data is changing the way companies conduct business. Streamlining operations, increasing efficiencies to boost productivity, improving decision making, and bringing premium services to market are some of the serious turnarounds due to big data concepts. It is all about “more with less.” A lot of cost savings are being achieved by leveraging big data technologies smartly, and this in turn enables businesses to incorporate more competencies and capabilities. Big data is also being used to better target customers, personalize goods and services, and build stronger relationships with customers, suppliers, and employees. Businesses will see intelligent devices, machines, and robots taking over many repetitive, mundane, difficult, and dangerous activities.
Monitoring and providing real-time information about assets, operations, employees, and customers, these smart machines will extend and augment human capabilities. Computing power will increase as costs decrease. Sensors will monitor, forecast, and report on
environments; smart machines will develop, share and refine new data into knowledge based on their repetitive tasks. Real-time, dynamic,
analytics-based insights will help businesses provide unique services to their customers on the fly. Both sources will transmit these rich
streams of data to cloud environments so that all kinds of implantable, wearable, portable, fixed, nomadic, and any input / output devices
can provide timely information and insights to their users unobtrusively. There is a gamut of improvisations such as the machine learning
discipline solidifying an ingenious foundation for smart devices. Scores of data interpretation engines, expert systems, and analytical
applications go a long way in substantially augmenting and assisting humans in their decision-making tasks.
• Big Data Brings in Big Opportunities: The big data and cloud paradigms have collectively sparked a stream of opportunities as both
start-ups and existing small businesses find innovative ways to harness the power of the growing streams of digital data. As the digital
economy and enterprise mature, there can be more powerful and pioneering products, solutions and services.
BIG DATA ANALYTICS: THE IT INFRASTRUCTURE CHARACTERISTICS
• A Brief on Stream Computing: Earlier we described three kinds of data being produced. The processing is of mainly two types: batch and online (real-time) processing. As far as the speed with which data needs to be captured and processed is concerned, there are both low-latency and high-latency data. The core role of stream computing (introduced by IBM) is therefore to process extremely low-latency data without relying on high-volume storage to do its job. By contrast, conventional big data platforms involve a massively parallel processing architecture comprising Enterprise Data Warehouses (EDW), the Hadoop framework, and other analytics databases. This setup usually requires high-volume storage that can have a considerable physical footprint within the data center. A stream computing architecture, on the other hand, uses smaller servers distributed across many data centers. Therefore there is a need for blending and balancing stream computing with the traditional approach. It is all about choosing a big data fabric that elegantly fits the purpose at hand. The big data analytics platform has to have specialised “data persistence” architectures for both short-latency persistence (caching) of in-motion data (stream computing) and long-latency persistence (storage) of at-rest data. Stream computing is for extracting actionable insights in time out of streaming data; it prescribes an optimal architecture for real-time analysis of data in flight (a minimal sketch of this windowed, in-memory pattern appears at the end of this section).
• Big Data Analytics Infrastructures: And as IT moves to the strategic center of business, CXOs at organizations of all sizes turn to
product vendors and service providers to help them extract more real value from their data assets, business processes and other key
investments. IT is being primed for eliminating all kinds of existing business and IT inefficiencies, slippages and wastages etc. Nearly 70
percent of the total IT budget is being spent on IT operations and maintenance alone. Two-thirds of companies go over schedule on their
project deployments. Hence this is the prime time to move into smarter computing through the systematic elimination of IT complexities
and all the inflicted barriers to innovation. Thus there is a business need for a new category of systems. Many prescribe different
characteristics for next-generation IT infrastructures. The future IT infrastructures need to be open, modular, dynamic, converged, instant-
on, expertly integrated, shared, software-defined, virtualised, etc.
However, organizations often analyse only narrow slivers of the full data sets available. Without a centralized point of aggregation and integration, data is collected in a fragmented way, resulting in limited or partial insights. Considering the data- and process-intensive nature of big data storage and analysis, cloud compute, storage and network infrastructures are the best course of action, and private, public and hybrid clouds are the smartest way of proceeding with big data analytics. Because social data are transmitted over the public and open Internet, public clouds are a good fit for some specific big data analytical workloads, and WAN optimization technologies further strengthen the case for public clouds for effective and efficient big data analysis. Succinctly speaking, cloud environments with all the latest advancements in the form of software-defined networking, storage and compute infrastructures, cloud federation, etc. are the future of fit-for-purpose big data analysis. State-of-the-art cloud centers are right for a cornucopia of next-generation big data applications and services.
The IBM Expert Integrated Systems (IBM, 2012)
IBM has come out with expert integrated systems designed to eliminate inflexibilities and inefficiencies. Cloud services and applications need to be scalable, and the underlying IT infrastructures need to be elastic. Business success depends squarely on IT agility, affordability, and adaptability. Hence attention has turned towards the new smarter computing model, in which IT infrastructure is simpler, more efficient and more flexible.
There are three aspects to smarter computing. The first is to tune IT systems, using the flexibility of general-purpose systems, to optimize them for a specific business environment. The second is to take advantage of the simplicity of appliances, and the third is to leverage the elasticity of cloud infrastructures. The question is how organizations can get the best of all these options in one system. Expert integrated systems are therefore much more than a static stack of pre-integrated components: a server here, some database
software there, serving a fixed application at the top, etc. Instead, these expert integrated systems are based on “patterns of expertise,” which can
dramatically improve the responsiveness of the business. Patterns of expertise automatically balance, manage and optimize the elements
necessary from the underlying hardware resources up through the middleware and software to deliver and manage today’s modern business
processes, services, and applications. Thus as far as infrastructures are concerned, expertly integrated systems are the most sought-after in the
evolving field of big data analytics. In order to deliver fully on this economic promise, systems with integrated expertise must possess the
following core capabilities:
• Built-In Expertise: When embedded expertise and best practices are captured and automated in various deployment forms, it is possible
to dramatically improve the time-to-value.
• Integration by Design: When one deeply tunes hardware and software in a ready-to-go, workload optimized system, it becomes easier to
“tune to the task.”
• Simplified Experience: When every part of the IT lifecycle becomes easier with integrated management of the entire system, including a
broad, open ecosystem of optimized solutions, business innovation can thrive.
The HP Converged Infrastructure (HP, 2011)
There are other players in the market producing adaptive infrastructures for making big data analytics easy. For example, HP talks about
converged cloud infrastructures. At the heart of HP converged infrastructure (Figure 4) is the ultimate end state of having any workload,
anywhere, anytime. This is achieved through a systematic approach that brings all the server, storage, and networking resources together into a
common pool. This approach also brings together management tools, policies, and processes so that resources and applications are managed in a
holistic, integrated manner. And it brings together security, and power and cooling management capabilities so systems and facilities work
together to extend the life of the data center.
This all starts by freeing assets trapped in operations, or by deploying a new converged infrastructure, to establish a services-oriented IT organization that better aligns IT with the wide variety of fluctuating business demands. This is exactly what a converged infrastructure does. It
integrates and optimizes technologies into pools of interoperable resources so they can deliver operational flexibility. And as a business grows, a
converged infrastructure provides the foundation for an instant-on enterprise. This type of organization shortens the time needed to provision
infrastructure for new and existing enterprise services to drive competitive and service advantages – allowing the business to interact with
customers, employees, and partners more quickly, and with increased personalization.
A converged infrastructure has five overarching requirements. It is virtualized, resilient, open, orchestrated, and modular.
• Virtualized: A converged infrastructure requires the virtualization of all heterogeneous resources: compute, storage, networking, and I/O.
Virtualization separates the applications, data, and network connections from the underlying hardware, thereby making it easier and faster
to reallocate resources to match the changing performance, throughput, and capacity needs of individual applications. This end-to-end
virtualization improves IT flexibility and response to business requests, ultimately improving business speed and agility.
• Resilient: A converged infrastructure integrates non-stop technologies and high availability policies. Because diverse applications share
virtualized resource pools, a converged infrastructure must have an operating environment that automates high-availability policies to meet
SLAs. A resilient, converged infrastructure provides the right level of availability for each business application.
• Open: Products are built using open standards, which avoids vendor lock-in and makes interoperability and portability easy to accomplish.
• Orchestrated: A converged infrastructure orchestrates the business request with the applications, data, and infrastructure. It defines the
policies and service levels through automated workflows, provisioning, and change management designed by IT and the business.
Orchestration provides an application-aligned infrastructure that can be scaled up or down based on the needs of each application.
Orchestration also provides centralized management of the resource pool, including billing, metering, and chargeback for consumption.
• Modular: A converged infrastructure is built on modular design principles based on open and interoperable standards. A modular
approach allows IT to integrate new technologies with existing investments without having to start over. This approach also gives IT the
ability to extend new capabilities and to scale capacity over time.
Dell talks about shared infrastructure, a straightforward innovation (Acosta, 2013): all kinds of infrastructure resources are pooled and shared, thereby increasing their utilization levels significantly and saving a lot of IT budget. For high-end applications such as Web 2.0 social sites, big data analytics, Machine-to-Machine (M2M) services, and high-performance applications such as genome research, climate modelling, drug discovery, energy exploration, weather forecasting, financial services, new materials design, etc., shared infrastructure is recommended.
Integrated Platform for Big Data Analytics (Devlin, 2012)
Previously we talked about versatile infrastructures for big data analytics. In this section, we stress the need for integrated platforms for cohesive big data analysis. An integrated platform (Figure 5) has to bring together compatible and optimized technologies, platforms, and other ingredients to adaptively support varying business requirements.
The first is central core business data, the consistent, quality-assured data found in EDW and MDM systems. Traditional relational databases,
such as IBM DB2, are the base technology. Application-specific reporting and decision support data often stored in EDWs today are excluded.
Core reporting and analytic data covers the latter data types. In terms of technology, this type is ideally a relational database. Data warehouse
platforms such as IBM InfoSphere Warehouse, IBM Smart Analytics System and the new IBM PureData System for Operational Analytics, play a
strong role here. Business needs requiring higher query performance may demand an analytical database system built on Massively Parallel
Processing (MPP) columnar databases or other specialized technologies, such as the new IBM PureData System for Analytics (powered by
Netezza Technology).
Deep analytic information requires highly flexible, large scale processing such as the statistical analysis and text mining often performed in the
Hadoop environment.
Fast analytic data requires such high-speed analytic processing that it must be done on data in-flight, such as with IBM InfoSphere Streams, for
example. This data is often generated from multiple sources that need to be continuously analyzed and aggregated with near-zero latency for real-
time alerting and decision-making.
At the intersection of speed and flexibility, we have specialty analytic data, using specialized processing such as NoSQL, XML, graph and other
databases and data stores.
Metadata, shown conceptually as a backdrop to all types of information, is central to this new architecture to define information context and to
enable proper governance. In the process-mediated and machine-generated domains, metadata is explicitly stored separately; in the human-
sourced domain it is more likely to be implicit in the information itself. This demands new approaches to modelling, discovering and visualizing
both internal and external sources of data and their inter-relationships within the platform.
Transitioning from Traditional BI to Big Data BI (Persistent, 2013)
Business Intelligence (BI) has been a key requirement for every aspiring enterprise across the globe. Consultants, service organizations and product vendors collaborate to establish BI competency for businesses worldwide (small, medium, and large), as this competency helps them plan and execute appropriate actions for business optimization and transformation. The BI horizon has now broadened sharply with distributed and diverse data sources, and these changes have propelled industry professionals, research labs and academicians to bring in the required technologies, tools, techniques and tips. The traditional BI architecture is shown in Figure 6.
CONCLUSION
The brewing trends clearly point to a digital universe by 2020. The distinguishing characteristic of this digitized universe is the huge volume of data (big data) collected from a variety of sources. This voluminous data production, and the clarion call to squeeze workable knowledge out of it to empower society at large, is driving IT experts, engineers, evangelists, and exponents to incorporate more subtle and successful innovations in the IT field. The slogan 'more with less' is growing louder, and the expectations placed on IT to resolve various social, business, and personal problems are on the rise. In this chapter, we have discussed how next-generation Business Intelligence (BI) needs are to be met in the context of growing data size, speed, and scope. The big data analytics architecture was discussed in detail to invigorate the interest of readers in this fast-evolving topic. There are briefs on the readiness of IT assets, resources and applications to be a right and strategic partner in the projected digital universe. In the ensuing chapters, you can find deeper thoughts on sub-topics from acclaimed and accomplished authors.
This work was previously published in the Handbook of Research on Cloud Infrastructures for Big Data Analytics edited by Pethuru Raj and
Ganesh Chandra Deka, pages 121, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
I work for IBM India, and my chapters in this book do not represent or reflect the views of IBM directly or indirectly.
REFERENCES
Gantz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east . IDC View.
Yen Pei Tay
Quest International University Perak, Malaysia
Vasaki Ponnusamy
Quest International University Perak, Malaysia
Lam Hong Lee
Quest International University Perak, Malaysia
ABSTRACT
The meteoric rise of smart devices dominating the worldwide consumer electronics market, complemented with data-hungry mobile applications and widely accessible heterogeneous networks, e.g. 3G, 4G LTE and Wi-Fi, has elevated Mobile Internet from a 'nice-to-have' to a mandatory feature on every mobile computing device. This has spurred serious data traffic congestion on mobile networks as a consequence. The nature of mobile network traffic today is like a little Data Tsunami, unpredictable in terms of time and location while pounding the access networks with waves of data streams. This chapter explains how Big Data analytics can be applied to understand the Device-Network-Application (DNA) dimensions in annotating mobile connectivity routine and how Simplify, a seamless network discovery solution developed at Nextwave Technology, can be extended to leverage crowd intelligence in predicting and collaboratively shaping mobile data traffic towards achieving real-time network congestion control. The chapter also presents the Big Data architecture hosted on Google Cloud Platform powering the backbone behind Simplify in realizing its intelligent traffic steering solution.
INTRODUCTION
On a cold spring day in March 2011, Japan was struck by a deadly tsunami following a strong earthquake. The aftermath was devastating: not only was destruction brought to its physical terrain, the Internet also suffered a huge 'data tsunami' as millions of online readers swarmed major news websites, following every emergency update. With smart communication devices dominating the global consumer electronics market, posting high-resolution pictures and streaming high-definition videos over wireless networks has become the norm for smartphone users (Tay, 2012). Complemented with more affordable mobile broadband packages and widely accessible high-speed data networks, wireless networks today are constantly plagued by daily 'data tsunamis'.
In suppressing mobile network congestion, mobile operators often impose data limits and fair usage policies at a costly trade-off. Not only do data capping and network throttling significantly impede user experience, implementing such mechanisms also incurs heavy monitoring costs on the mobile operators. While understanding network traffic trends is far beyond any straightforward mathematical equation, the sporadic nature of mobile connectivity makes network congestion complex, unpredictable and difficult to eradicate. As such, traditional radio network planning and progressive network capacity upgrades may no longer be sufficient to serve the fluctuating connectivity demand. This poses serious threats to mobile operators in need of a more effective solution for real-time congestion control.
In this chapter, we will examine Simplify, a mobile data solution developed at Nextwave Technology aiming to solve network congestion woes by
applying Big Data technologies in forecasting, shaping and routing mobile network traffic based on analysis of real-time data collected from
mobile devices. By analysing human mobility patterns and understanding their connectivity routines, Simplify is able to predict network behaviours and prescribe personalized network policies to each mobile device, realizing dynamic network traffic steering while improving the Mobile Internet experience.
BACKGROUND
In the effort to curb mobile network congestion, the challenges facing mobile operators are far more complex than just scaling up their network
infrastructure. As mobile data demand fluctuates from area to area, on-demand network capacity allocation is almost a mandatory
requirement. Despite the advancements in software-defined radio network technologies, which allow mobile operators to flexibly configure
network capacity on the fly, such deployment requires costly upgrade to existing radio base stations. Instead, the immediate priority should focus
on optimizing existing mobile network traffic by reducing the cost per megabyte while maintaining good user experience.
One immediate remedy to ease mobile congestion is to employ a Wi-Fi offloading solution, diverting mobile data traffic towards Wi-Fi networks.
Cisco (2014) has reported that approximately 45 percent of global mobile data traffic (1.2 Exabytes per month) was offloaded onto the fixed
network through Wi-Fi and Femtocell in 2013. This figure is expected to reach 51 percent (17.3 Exabytes per month) by 2018. In this section, as a
prelude to our work on Simplify, we will first focus our evaluation on contemporary Wi-Fi offloading solutions and other related work in relieving
mobile network congestion.
Access Network Discovery and Selection Function
Acknowledging the tedious task of discovering and switching between mobile data and Wi-Fi on mobile devices, the 3rd Generation Partnership Project (3GPP), a global telecommunications standardization body that defines cellular network specifications, has proposed the Access Network Discovery and Selection Function (ANDSF) as a solution to optimize the mobile broadband experience in a heterogeneous network environment. By
introducing an ANDSF server in the Evolved Packet Core (EPC) complemented by ANDSF clients installed on mobile devices, the solution
enables mobile operators to send network policies containing a list of preferred networks available for connection within the immediate vicinity
of the mobile devices. Using over-the-air provisioning, Wi-Fi hotspot locations and security credentials can be sent directly to mobile devices,
eliminating the need to manually search and connect to Wi-Fi networks (3GPP, 2014b). In addition, mobile users traversing different locations
may enjoy seamless network experience while switching in between a variety of wireless networks based on pre-determined network priority e.g.
4G LTE has a higher connection priority over Wi-Fi and 3G. To mobile operators who are in the midst of extending 4G LTE coverage, having an
ANDSF system to interactively indicate possible 4G networks in an area may completely eliminate the need to constantly advertise or publish 4G
network coverage map.
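For illustration only, the sketch below models the kind of prioritized access-network list an ANDSF policy conveys, together with a simple client-side selection rule. It is a hedged, hypothetical simplification: the dictionary layout, field names and values are invented for this example and do not follow the actual 3GPP ANDSF managed-object schema.

```python
# Hypothetical, simplified stand-in for an ANDSF policy; not the 3GPP MO schema.
policy = {
    "validity_area": {"cell_ids": ["310-260-1234"], "radius_m": 500},
    "networks": [
        {"type": "LTE",  "id": "operator-lte",  "priority": 1},
        {"type": "WIFI", "id": "CampusWiFi",    "priority": 2,
         "credentials": "<encrypted-blob>"},
        {"type": "3G",   "id": "operator-umts", "priority": 3},
    ],
}

def select_network(policy, visible_network_ids):
    """Pick the highest-priority policy network the device can currently see."""
    candidates = [n for n in policy["networks"] if n["id"] in visible_network_ids]
    return min(candidates, key=lambda n: n["priority"]) if candidates else None

print(select_network(policy, {"CampusWiFi", "operator-umts"}))
# -> the Wi-Fi entry: LTE is not visible here, and Wi-Fi outranks 3G in the policy
```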
However, standards-based ANDSF exposes some critical limitations. Firstly, ANDSF is fundamentally designed to cater only for SIM-based
mobile devices, which means devices with Wi-Fi-only capability cannot be supported by the system. Moreover, all ANDSF policies require pre-
configuration on the server and periodic provisioning to the targeted devices, making it impractical to suppress ad-hoc network congestion
scenarios.
Furthermore, the propagation of ANDSF policy is always one-way: originating from the network to mobile devices via a standard ANDSF S14
interface. This prevents valuable data gathered on mobile devices from being used in network quality analysis. In compliance with the 3GPP architecture deployment model, the ANDSF server is deeply entrenched within the Evolved Packet Core (EPC), where the signalling traffic traversing between the ANDSF server and mobile devices is likely to congest the operator's network even more. The deployment of an ANDSF server is also operator-specific, so roaming between multiple operators' networks would be difficult to achieve. In addition, it is almost impossible for an operator to
provision all public and consumer Wi-Fi hotspots into the ANDSF server, limiting its capability only to include operator-deployed or partner-
owned Wi-Fi.
Nevertheless, to many mobile operators, deploying Wi-Fi networks to cover mobile network blind spots is inevitable especially with the
prominent evolution of Wi-Fi standards namely the IEEE 802.11u and IEEE 802.21, which are fast-transforming Wi-Fi to become a ‘mission
critical’ technology interoperable with cellular networks.
Network Crowdsourcing
Over the years, Wi-Fi technology has gained solid popularity through mass-market adoption and its cost-effectiveness in deployment. According
to Cisco, there are 800 million new Wi-Fi devices shipped every year with more than 700 million consumers already having access to Wi-Fi out
there. The public hotspots will be approaching 5.8 million globally by 2015 (Wireless Broadband Alliance, 2011).
While companies like iPass and Boingo have long been able to aggregate Wi-Fi for enterprise customers, network crowdsourcing has become widely popular in recent years to federate public hotspots for consumer use. WeFi, a market leader in consumer Wi-Fi aggregation, has crowdsourced over 200 million hotspots from its mobile application installed on user smartphones worldwide. In another context, a company called Fon customizes its Wi-Fi router to broadcast both a private SSID and a public FonSpot access point, granting Fon subscribers complimentary access to Fon hotspots globally. Unlike operator-owned Wi-Fi networks, most consumer Wi-Fi offers short-range coverage and lacks carrier-grade quality, and intermittent network performance causes variation in quality-of-service. The heterogeneity of Wi-Fi security authentication methods further complicates the seamless access experience. To make matters worse, the existence of rogue Wi-Fi, which gives free Internet access to the public in disguise in order to hijack user traffic, poses serious security concerns.
Public hotspots, like those found at Starbucks and McDonald's, have literally no integration with mobile networks at all. Access to these complimentary networks usually requires a separate user identity. With such a limitation, it is tedious for mobile operators and Wi-Fi service providers to correlate the multiple identities used to access their networks in realizing a seamless Wi-Fi offloading experience. The effort to choreograph network load-balancing and real-time network traffic routing between cellular and Wi-Fi proves extremely challenging. With more
and more Wi-Fi service providers and aggregators trying to safeguard their market shares, the choice of proprietary technologies and business
models make the entire ecosystem complex and highly fragmented. As a consequence, consumers will have to install different mobile applications
to access different Wi-Fi services.
Human Mobility Patterns
In a human mobility study, Gonzalez et al. (2008) found that cell-tower location information can be used to characterize human mobility and that humans follow simple, reproducible mobility patterns. In another study led by the Massachusetts Institute of Technology, Schneider et al. (2013) concluded that 90 percent of the 40,000 sample population used only 17 trip configurations, out of a million possible commuting chains, in their daily travels. This premise held true across the board for commuters residing in two megacities, namely Chicago and Paris. These findings reiterate that user mobility behaviour is largely influenced by daily routine. Further, analysis of mobile phone data reveals a positive correlation between user routines, intra-city travel patterns and network usage behaviour. As such, it is absolutely possible to construct simple predictive models to describe user connectivity behaviour.
It is also worth noting that native and non-native inhabitants of a city exhibit distinct mobility patterns, allowing location prediction with high accuracy even in the absence of demographic information (Yang et al., 2014). Moreover, by employing clustering techniques, e.g. k-means and Markov models (Liu, 2009), user health risk profiles or pandemic exposure could be predicted by correlating network connectivity patterns with their social network ties. For instance, people who frequently use free Wi-Fi at fast food chains (it is also possible to identify which fast food outlet by looking at the Wi-Fi SSID) are more prone to obesity and diabetes than people who only connect to the Internet at home.
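As a hedged illustration of the clustering idea mentioned above, the sketch below groups users by a toy connectivity feature vector (daily Wi-Fi hours, daily cellular hours, distinct locations visited) using scikit-learn's k-means. The features and values are fabricated for demonstration only and are not drawn from any study cited here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature vectors: [daily Wi-Fi hours, daily cellular hours, distinct locations/day]
users = np.array([
    [6.0, 1.0, 2],   # mostly home/office Wi-Fi users
    [5.5, 0.8, 2],
    [1.0, 4.5, 8],   # heavy on-the-move cellular users
    [0.8, 5.0, 9],
    [3.0, 3.0, 5],   # mixed routine
    [2.8, 2.5, 4],
])

# Partition users into three routine clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(users)
for vec, label in zip(users, kmeans.labels_):
    print(vec, "-> routine cluster", label)
```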
NETWORK TRAFFIC STEERING WITH CROWD INTELLIGENCE
Issues, Controversies and Problems
Battling mobile network congestion falls well beyond the responsibility of a network department. Without a concerted strategy and ecosystem support, it is very difficult to cover all the key challenges spread across the networks, devices, services and ultimately the subscribers that must be addressed to render the best Mobile Internet experience. While solving the mobile congestion puzzle requires complex data science to crunch high-volume network data collected in real time, the measurement of quality-of-experience from the end-user perspective cannot be ignored. Understanding the correlation between users' mobile connectivity routines and network quality provides valuable precursors and a more holistic approach to dynamic network traffic management.
Solutions and Recommendations
Visualizing Mobile Connectivity Routine with Big Data
Leveraging crowd intelligence to bring social good is not entirely a new concept. Waze, a community-based road traffic and navigation mobile
application, relies on real-time traffic information posted by drivers to bypass congestion and optimize travel routes. Similarly, this concept can
be applied in mobile network context.
Simplify is the world's first community ANDSF solution, designed to leverage mobile consumer intelligence to collaboratively discover, access and steer network traffic for the benefit of the online community. The solution comprises a centralized Simplify on Cloud server and Simplify mobile applications installed on mobile devices, automating network discovery, selection and switching in the heterogeneous network environment. Built in with network crowdsourcing capability, good networks, e.g. Wi-Fi, 3G and 4G LTE, are automatically discovered, tagged and uploaded to Simplify on Cloud. In return, the cloud-based server sends an ANDSF policy containing a list of the best available networks nearby, with encrypted security credentials to facilitate seamless access to these networks. In other words, Simplify users no longer need to manually scan for good networks or key in cryptic Wi-Fi passwords; such tedious tasks are handled automatically in the background, enabling a totally hands-free connectivity experience.
Eco Surf, the fundamental principle guiding Simplify's solution design, is envisioned to reduce our carbon footprint while surfing the Mobile Internet. The key idea is simply to leverage collective intelligence to report and avoid poor-quality networks, reducing unnecessary network scans and minimizing the waste of device battery and network resources in the process. Inspired by this eco-friendly principle, product teams at Nextwave Technology made several enhancements to the ANDSF standards (3GPP, 2014a) to facilitate real-time data collection on mobile connectivity. These include the introduction of (i) a location-tagging feature in the Simplify application, (ii) a proprietary ANDSF S14 Post interface to Simplify on Cloud, (iii) a Network Analytics Engine, and (iv) a Network Traffic Controller powering Simplify's Big Data architecture backbone.
Location Tagging
With location-tagging enabled, Simplify tags every connected network with its nearby cell tower identifiers and geo-location to form a network location triplet, i.e. (SSID, Cell-ID, GPS). These location datasets enable Simplify to remember frequented network locations, including home, workplace, café and campus, auto-connecting to previously tagged networks when the device comes into proximity. The mobile application intelligently manages the device's wireless radios based on user location and connectivity preference. Unlike most conventional smartphones, which leave the Wi-Fi radio in a low-powered passive scanning mode, Simplify turns Wi-Fi off completely when the user leaves a frequented location, optimizing network connection while minimizing energy use. On the other hand, by associating the location information with the types of network connected in chronological order, we are able to capture user mobility patterns and chart their mobile connectivity routines on a time series map.
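The sketch below illustrates, under assumed field names, how a client might record (SSID, Cell-ID, GPS) triplets and decide whether a previously tagged network is nearby. It is not Simplify's actual implementation; the tags, coordinates and proximity radius are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class NetworkTag:
    ssid: str      # Wi-Fi network name
    cell_id: str   # serving cell tower identifier
    lat: float     # GPS latitude
    lon: float     # GPS longitude

def distance_m(lat1, lon1, lat2, lon2):
    """Rough haversine distance in metres."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical previously tagged locations.
tagged = {
    NetworkTag("HomeWiFi",   "cell-4411", 3.1390, 101.6869),
    NetworkTag("CampusWiFi", "cell-9021", 3.1412, 101.6901),
}

def nearby_known_network(lat, lon, current_cell, radius_m=150):
    """Return a tagged network worth auto-connecting to, if the device is in proximity."""
    for tag in tagged:
        if tag.cell_id == current_cell or distance_m(lat, lon, tag.lat, tag.lon) <= radius_m:
            return tag
    return None

print(nearby_known_network(3.1391, 101.6870, "cell-4411"))  # -> the HomeWiFi tag
```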
ANDSF S14 Post Interface
Having Simplify hosted on the cloud provides two key advantages. Firstly, it prevents ANDSF data and signalling traffic from overloading the mobile operator's core network: the data exchange between Simplify applications and the server can bypass the mobile packet core using a local IP breakout mechanism, e.g. Local IP Access (LIPA), for all Internet-bound traffic to Simplify on Cloud. Secondly, by putting the platform on the
cloud, the ANDSF solution is no longer operator-specific. This opens up the entire infrastructure for global Mobile Internet traffic steering.
Essentially, smaller operators are now able to provide international data and Wi-Fi roaming with other service providers leveraging on a common
platform.
Unlike a 3GPP-based ANDSF server, which only allows mobile devices to pull network policies over the standard ANDSF S14 Pull interface, the Simplify mobile application captures and sends device information, data usage patterns and network quality data to the server using the Nextwave-proprietary ANDSF S14 Post interface (Tay, 2013). With this enhancement, a complete feedback loop is formed, enabling two-way information exchange between the server and the mobile applications sparsely deployed in the field and providing a dynamic view of real-time network conditions. With the Simplify on Cloud server subscribing to continuous streams of data sent from thousands of mobile devices, these invaluable datasets can help network service providers gain greater insight into every mobile consumer's connectivity pattern.
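To make the two-way exchange concrete, here is a hedged sketch of the kind of JSON report a client could send over an S14 Post-style interface, using the common requests library. The endpoint URL and every field name are hypothetical; the real Nextwave interface (Tay, 2013) is proprietary and not documented here.

```python
import requests  # third-party HTTP client; pip install requests

# Hypothetical report combining the four data classes discussed in this chapter.
report = {
    "device":   {"model": "example-phone", "os": "Android 4.4"},
    "location": {"lat": 3.1390, "lon": 101.6869, "cell_id": "cell-4411"},
    "network":  {"type": "WIFI", "ssid": "CampusWiFi"},
    "qos":      {"rssi_dbm": -58, "latency_ms": 42, "throughput_kbps": 11500},
    "usage":    {"bytes_up": 120_000, "bytes_down": 2_400_000},
}

# Placeholder endpoint standing in for the proprietary ANDSF S14 Post interface.
try:
    resp = requests.post("https://example.invalid/andsf/s14/post", json=report, timeout=5)
    resp.raise_for_status()
    print("policy returned:", resp.json())  # the server may reply with an updated policy
except requests.RequestException as exc:
    print("report not delivered:", exc)     # expected here, since the endpoint is a placeholder
```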
Network Analytics Engine
On the Simplify on Cloud server, the Network Analytics Engine implements the descriptive analytics feature, gathering, filtering, storing and analysing the collected information. Every single user thus contributes data to formulate the Device-Network-Application (DNA) dimensions (see Figure 1) illustrating their own connectivity routine. Subsequently, a collection of these user routines on the same network can be extrapolated to derive network congestion patterns and predict network behaviour in densely populated areas (a simplified sketch of this aggregation step appears after Figure 1).
Figure 1. The Device-Network-Application (DNA) dimensions
of Internet Connectivity Behaviour
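A descriptive-analytics step of this kind can be pictured, in heavily simplified form, as grouping raw reports into per-user Device, Network and Application dimensions. The sketch below is illustrative only; the record fields and values are assumptions, not the engine's real schema.

```python
from collections import defaultdict

# Hypothetical raw reports as they might arrive from devices.
reports = [
    {"user": "u1", "device": "phone-A",  "network": "CampusWiFi",  "app": "YouTube",  "mb": 320},
    {"user": "u1", "device": "phone-A",  "network": "operator-3g", "app": "WhatsApp", "mb": 4},
    {"user": "u2", "device": "tablet-B", "network": "CampusWiFi",  "app": "YouTube",  "mb": 210},
]

def dna_profile(reports):
    """Aggregate usage into Device-Network-Application dimensions per user."""
    profile = defaultdict(lambda: {"device": defaultdict(int),
                                   "network": defaultdict(int),
                                   "app": defaultdict(int)})
    for r in reports:
        for dim in ("device", "network", "app"):
            profile[r["user"]][dim][r[dim]] += r["mb"]
    return profile

for user, dims in dna_profile(reports).items():
    print(user, {d: dict(v) for d, v in dims.items()})
```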
Network Traffic Controller
However, deciphering network and user behaviours would be meaningless without pre-emptive measures to optimize the rapidly growing network traffic. Like a traffic police officer stationed in the middle of a crossroads diverting road traffic at peak hours, the Network Traffic Controller provides prescriptive analytics to proactively shape, optimize and divert traffic flows toward alternative, less congested networks. With thousands of devices sending data simultaneously, Simplify on Cloud gains greater insight not only into mobile connectivity patterns and network behaviours, but also into the effectiveness of its prescriptions for steering Mobile Internet traffic.
Inside Simplify: The Big Data Architecture
With the aim of realizing network traffic steering to ease network congestion, the massive volume and high velocity of probe data received from Simplify applications require almost zero latency in processing. Powered by Google App Engine, both the Network Analytics Engine and the Network Traffic Controller constitute the core of Simplify on Cloud's Big Data architecture (see Figure 2). The system exposes the ANDSF S14 Post web service interface using Google Cloud Endpoints to receive real-time data streams sent from mobile devices. The data collected can be broadly classified into: (i) Device Location, (ii) Network Discovery Information, (iii) Network Quality-of-Service (QoS), and (iv) User Data Usage. The corroboration of these data further describes when, where and how users connect to the networks, what devices they carry, their connectivity patterns and the present network conditions.
Unlike traditional business intelligence, which follows a store-process-then-analyse approach, with massive raw data flooding in at high speed the Network Analytics Engine uses Google Cloud Datastore, an auto-scalable NoSQL database, to provide high-performance and robust storage for device locations and network quality-of-service. This enables the Network Analytics Engine to store the continuous flow of data as it comes in, making new datasets readily available for instantaneous processing.
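As a rough illustration of storing one such report in Cloud Datastore, the snippet below uses the present-day google-cloud-datastore Python client rather than the App Engine-era APIs the original deployment would have used. The entity kind, property names and project ID are assumptions, and running it requires Google Cloud credentials.

```python
from datetime import datetime, timezone

from google.cloud import datastore  # pip install google-cloud-datastore

client = datastore.Client(project="simplify-demo")  # hypothetical project ID

# Hypothetical entity kind and properties for a single QoS report.
entity = datastore.Entity(key=client.key("NetworkReport"))
entity.update({
    "ssid": "CampusWiFi",
    "cell_id": "cell-9021",
    "latency_ms": 42,
    "rssi_dbm": -58,
    "reported_at": datetime.now(timezone.utc),
})
client.put(entity)  # the write is immediately available for later queries

# Fetch recent reports for the same network.
query = client.query(kind="NetworkReport")
query.add_filter("ssid", "=", "CampusWiFi")
for report in query.fetch(limit=10):
    print(dict(report))
```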
Coupled with Google BigQuery, a tool that can query terabytes of data within seconds, this speed advantage gives the Network Traffic Controller a critical edge in handling time-sensitive use cases such as instant network quality analysis and near real-time network traffic steering. For example, upon detecting poor-quality networks, alerts are triggered to the ANDSF Policy Manager to blacklist the affected networks in its policy database. As a result, only good-quality networks are propagated via the ANDSF policy sent to mobile devices, delighting mobile consumers with an excellent online experience. Besides, the Network Traffic Controller can focus on a subset of network data based on different contexts such as location and time, e.g. analysing the past hour's data or querying network conditions at specific locations. Such flexibility allows Simplify on Cloud to create timely network snapshots and identify the key drivers behind certain network behaviours during peak and off-peak hours (see Figure 3).
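A time- and location-scoped query of the kind described above could look roughly like the following, using the google-cloud-bigquery client. The dataset, table and column names are placeholders invented for this sketch, and credentials are required to run it.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="simplify-demo")  # hypothetical project ID

# Hypothetical table of QoS reports; rank networks near one cell over the past hour.
sql = """
    SELECT ssid,
           AVG(latency_ms) AS avg_latency_ms,
           COUNT(*)        AS samples
    FROM `simplify-demo.telemetry.network_reports`
    WHERE reported_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
      AND cell_id = @cell_id
    GROUP BY ssid
    ORDER BY avg_latency_ms
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("cell_id", "STRING", "cell-9021")]
)
for row in client.query(sql, job_config=job_config).result():
    print(row.ssid, round(row.avg_latency_ms, 1), "ms over", row.samples, "samples")
```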
While it is not difficult to find a linear correlation between consumer data demand and network behaviour, what is interesting to note from our analysis is that there is also a positive correlation between the type of device used and network data consumption. Generally, iPhone users consume 34 percent more data than other smartphone users. More insightful still, even for the same user, usage behaviour on Wi-Fi and on the mobile data network varies considerably. Many users are far heavier on Wi-Fi, streaming heavy-duty multimedia content, while staying conservative on the 3G data network, perhaps due to the data capping policies imposed by mobile operators. Once we found that the type of network reacts predictably to such mobile consumer behaviour, statistical models could be built to predict network behaviours.
Using the Google Prediction API, the Network Traffic Controller learns and predicts network behaviours based on device types and locations as well as generalized user routines. Not only can the occurrence of frequently congested networks be known in advance by factoring in these variables, effective ANDSF policies can also be crafted to shape, optimize and divert mobile network traffic towards alternative networks. In addition, personalized connection profiles can be prescribed to enhance the Mobile Internet experience, tailored specifically to individual routines.
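The Google Prediction API referred to here has since been retired, so the sketch below substitutes scikit-learn's logistic regression to show the same idea: learning whether a network is likely to be congested from device type, hour of day and location cluster. The features, encoding choice and training rows are fabricated for illustration and do not represent Simplify's actual model.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Fabricated training rows: (device_type, hour_of_day, location_cluster) -> congested?
X_raw = [("iphone", 18, "campus"), ("android", 3, "campus"),
         ("iphone", 19, "mall"),   ("android", 11, "suburb"),
         ("iphone", 20, "campus"), ("android", 2, "suburb")]
y = [1, 0, 1, 0, 1, 0]  # 1 = network observed congested

# Treat each feature (including the hour) as a category and one-hot encode it.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform([(d, str(h), loc) for d, h, loc in X_raw])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Predict for an evening iPhone user on campus.
probe = encoder.transform([("iphone", "19", "campus")])
print("congestion probability:", round(model.predict_proba(probe)[0, 1], 2))
```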
Like a live traffic control centre, the Network Analytics Dashboard is built with Google Charts API to display real-time network conditions.
Highly congested networks are plotted on a heat map, indicating the number of devices attached to them, and the list of nearby networks
available for diversion. The power of combining descriptive and predictive analytics gives mobile operators a holistic view of, and more control over, mobile network congestion while improving user experience in real time. This was simply not possible before Big Data technologies. Today, Simplify is setting another technology frontier, opening up its platform to steer mobile network traffic worldwide.
Emulating a honeybee colony, Simplify applications mimic the role of worker bees, discovering good access networks in the field and sending precious information back to the Simplify on Cloud server ("the hive") to share with the colony. The aim of employing such an architecture is more than just allowing massive data to flow into the system rapidly; it also forms the basis of collaborative network discovery and collective decision-making. These mechanisms are critical in realizing what Pentland (2014) coined 'Idea Machines', enabling diverse exploration of new networks and facilitating high-quality, dense engagement within each cluster of networks to increase social intelligence.
Findings, Issues, and Limitations
While real-time network quality data captures instant snapshots of actual network conditions, the deployment of Simplify at such a large scale has contributed significant network signalling traffic along the ANDSF S14 interface to the Internet. To mitigate this, the Simplify application's reporting frequency, interval and payload size need to be optimized and regulated according to peak and off-peak hours to ensure optimum network performance.
Furthermore, the specifications of mobile devices and network access points are not homogeneous. For instance, we found that top-end devices tend to record better network quality-of-service than average ones over the same network, due to their higher-end radio chipsets. Also, only the network quality data reported by mobile devices with Simplify installed is taken into our prediction model, rendering a partial view of a network. Drawing on such incomplete data may distort the view of actual network conditions and result in inconsistent network quality patterns being observed.
The device information gathered reveals that the average data consumption ratio for Wi-Fi over mobile data stands at 3:1. This finding is consistent with our Mobile Connectivity Survey (Nextwave, 2013b), in which 76 percent of respondents perceived that Wi-Fi gives faster Internet speed than the mobile data network, with 68 percent preferring Wi-Fi as their connection of choice. However, as we zoom into specific user profiles with a similar data consumption ratio, we further found that the data consumption volume varies quite significantly. For instance, User A, who consumes 3 GB on Wi-Fi per month and another 1 GB on the 3G network, clearly has higher throughput needs than User B, who consumes only 300 MB and 100 MB on Wi-Fi and 3G respectively, despite both users falling into the same 3:1 ratio. Within this same context, the monthly data consumption volume for the same user also fluctuates considerably. Hence, it is impossible to use a one-size-fits-all model to predict consumer data usage across the board.
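The arithmetic behind this observation is simple but worth making explicit. The sketch below uses the two users from the example above to show that an identical 3:1 Wi-Fi-to-mobile ratio can hide an order-of-magnitude difference in absolute demand.

```python
# Monthly consumption figures taken from the example in the text.
users = {
    "User A": {"wifi_mb": 3000, "mobile_mb": 1000},  # 3 GB Wi-Fi + 1 GB 3G
    "User B": {"wifi_mb": 300,  "mobile_mb": 100},   # 300 MB Wi-Fi + 100 MB 3G
}

for name, usage in users.items():
    ratio = usage["wifi_mb"] / usage["mobile_mb"]
    total = usage["wifi_mb"] + usage["mobile_mb"]
    print(f"{name}: ratio {ratio:.0f}:1, total {total} MB/month")

# Both users print a 3:1 ratio, yet User A needs ten times the capacity of User B,
# which is why the ratio alone cannot drive a one-size-fits-all usage model.
```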
It is also important to note that day-to-day data patterns behave quite sporadically, especially in urban districts where dispersed population movements spread across complex, overlapping layers of wireless networks covering a vast area. Tracking every single network access point lying along the customer journey map requires massive corroborative effort, and imposing street-level tracking of user locations may raise intrusive privacy concerns. Even for mobile users who strictly follow a nine-to-five routine, a mixture of mobile connectivity patterns is observed. Within this subset of users, although we can technically decompose their routines into time-of-day, day-of-week and day-of-month cycles, and even compare weekday versus weekend cycles, there are still many user routines that cannot be explained using conventional calendar cycles. This data veracity problem introduces instability into our prediction model. We believe such 'irregular routine' is not an oxymoron: there could be many other contributing factors, such as user demographics and subscription profiles, which are not immediately available to Simplify, obscuring the process of deciphering individual mobile connectivity behaviour at a deeper level.
From a graph theory perspective, when multiple mobile devices contend for the same network access point, which can only cater for a limited number of devices, a constricted set phenomenon occurs. Although it is possible to address this using market-clearing techniques (Easley & Kleinberg, 2010), the matching algorithms for pairing each device with its preferred network to maximize quality-of-experience prove extremely complex. On the one hand, the optimum network load factor for each and every network has to be computed in real time to avoid overloading a network with a significant number of devices. On the other hand, the relationship between a network and Simplify on Cloud is asymmetric: while Simplify can instruct mobile devices to connect to a good network, when a network becomes congested there is no direct mechanism for it to report congestion and query for an alternative offloading route via the Simplify on Cloud server.
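A full market-clearing matching is beyond this chapter, but the following greedy sketch conveys the flavour of the assignment problem: each device ranks the networks it can see by reported quality and is assigned the best one that still has spare capacity. The capacities and scores are hypothetical, and a production scheme would need a proper matching or optimization algorithm rather than this greedy pass.

```python
# Hypothetical capacities (max concurrent devices) and per-device quality scores.
capacity = {"CampusWiFi": 2, "operator-lte": 3, "operator-3g": 5}

preferences = {  # device -> {network: quality score}, higher is better
    "dev1": {"CampusWiFi": 0.9, "operator-lte": 0.7, "operator-3g": 0.4},
    "dev2": {"CampusWiFi": 0.8, "operator-lte": 0.6},
    "dev3": {"CampusWiFi": 0.8, "operator-3g": 0.5},
    "dev4": {"CampusWiFi": 0.7, "operator-lte": 0.9},
}

def greedy_assign(preferences, capacity):
    """Assign each device to its best-scoring network that still has room."""
    remaining = dict(capacity)
    assignment = {}
    for device, scores in preferences.items():
        for network in sorted(scores, key=scores.get, reverse=True):
            if remaining.get(network, 0) > 0:
                assignment[device] = network
                remaining[network] -= 1
                break
        else:
            assignment[device] = None  # constricted set: no acceptable network left
    return assignment

print(greedy_assign(preferences, capacity))
```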
In order to address these challenges, data fusion techniques could be used to create multiple contexts around the data. For example, we could
collaborate with mobile operators to obtain subscription profiles and billing cycle to explain certain irregular data points such as the sudden
surge in mobile data consumption towards the end of monthly billing cycle. Another way to manage uncertainty is through advanced
mathematics that embraces it, such as mathematical modelling and optimization algorithms that can be applied to strengthen our prediction
models.
FUTURE RESEARCH DIRECTIONS
The Internet of Things
One key area in which to advance our research is the Internet of Things, where more than 50 billion devices are expected to be connected by 2020 (Ericsson, 2011). These go beyond smartphones and tablets to include household appliances, wearable devices, implantable sensors, connected vehicles, intelligent bots and machines. What is interesting to highlight is that every single device we use and wear will produce richer parameters for annotating our daily routines and behaviours than what we collect and analyse today.
With this new range of devices, Simplify would be required to handle much more complex use cases, such as associating multiple types of devices belonging to the same user and optimally allocating network bandwidth to the heterogeneous devices attached to the networks. Having a federated identity is also critical in learning and predicting user behaviours, i.e. what devices users wear and when and where they use them, so that personalized network profiles can be prescribed to these personal devices based on specific context. One important aspect of this is to employ a device-profiling technique, where the data consumption ratio for each device is individually measured, computed and predicted over time to better project device data demand. For example, a cyclist is highly unlikely to stream high-definition video during a workout, keeping only his wearable sensors active. Understanding these behaviours would require considerable sophistication to be built into our initial prediction model so as to reduce overall network resource consumption, with the vision of creating a sustainable model that preserves network neutrality principles.
Heuristic Device
To cater for more aggressive data demand, the 4G LTE standards, namely IP Flow Mobility (IFOM) and Multi-access Packet Data Network Connectivity (MAPCON), are evolving to enable future mobile devices to access multiple types of networks concurrently. Unlike conventional smartphones today, which can only alternate between mobile data and Wi-Fi, next-generation multi-radio devices are capable of connecting to multiple networks simultaneously. For instance, a mobile consumer may opt to keep an ongoing WhatsApp session on mobile data while at the same time streaming high-definition YouTube video over Wi-Fi. This provides greater bandwidth and faster Internet speed by maximizing the full capacity of both networks.
As realizing such a scenario would have significant impact on the networks, understanding each consumer application's behaviour is critical to achieving real-time network load-balancing and traffic shaping. Revisiting our use case above, under normal circumstances it would be logical to split the WhatsApp messaging and YouTube traffic onto two different networks for ultimate bandwidth performance. But what if one of these networks suffers quality degradation? Should the device divert the YouTube traffic to mobile data instead? Can the mobile network cater for the additional load incurred by the high-definition video streaming without compromising user experience? This poses a great dilemma for the logic embedded within the multi-radio device. Undeniably, without application profiling data, it is very difficult to achieve the desired result. Every mobile application possesses its own unique data behaviour and, by adding consumer usage behaviour into the equation, Simplify would need to profile the frequently used mobile applications installed on consumer devices to determine the data consumption ratio at the application, device and user levels. With this heuristic feature built in, Simplify can prescribe connectivity policies tailored to specific application usage requirements on different devices and networks. Ultimately, we aim to decentralize the decision-making point from the Simplify on Cloud server to individual devices, making them more intelligent with self-learning and network forecasting capabilities. Nevertheless, the optimum network assignment and the effectiveness of such a deployment require further investigation.
Social and Behavioural Sensing
Among other things, we also see huge potential in opening up Simplify as a social and behavioural sensing platform for academia and industry. In this highly connected society, the digital footprints of networked consumers are invaluable not only to advertisers and marketers but also to researchers in the study of computational social science, urban planning, behavioural economics, social psychology, network optimization algorithms and so forth. Unlike traditional research methodology, empowered with Big Data technologies the data collection and analysis can happen in parallel. In what MIT Professor Alex Pentland coined 'Reality Mining', as the platform continues to capture and analyse individual or group patterns of human connectivity behaviour, researchers may test their hypotheses against huge sets of live data. Despite this remarkable vision, the concern over personal data privacy has to be dealt with through absolute anonymity, sensitivity and transparency (Davenport, 2014). Perhaps the platform should evolve into a trust network, safeguarding user personal data by obtaining informed consent from users who voluntarily participate in such a living lab, with the power to possess, control, dispose of or re-distribute their information and decide the granularity of information to be shared (Pentland, 2014).
Furthermore, Simplify could advance into becoming a new type of geo-social network, helping us derive a more precise model for predicting the likelihood of friends co-occurring at the same place at the same time (Dong, Lepri & Pentland, 2011). With such a possibility, Simplify may adopt social network incentives to engineer co-operative behaviour and encourage more social interactions among its users to constructively address social problems. This can be applied not only at the community level but also at the city level, with active engagement from municipalities. For instance, city councils and authorities may direct their concerted effort towards solving social issues by providing multi-level incentives to citizens who willingly participate in reporting public infrastructure glitches. With this crowdsourcing capability, instead of lodging complaints via the call centre, urbanites can tag the precise locations of traffic woes, potholes, fires, burst water pipes and the like using Simplify, with an immediate response team deployed to rectify the issues on-site. Likewise, city councils can make announcements on local information via Simplify, which is arguably more personalized, non-intrusive, fast and effective.
In addition, neighbourhood watch could be another potential application area. Too often we find out about crimes in our own neighbourhood only days after they happened, either by word of mouth or from local newspapers. With Simplify, neighbours can post 'shoutouts' on thefts, robberies and other crimes in their area. Given that there are also many unreported cases, police may leverage Simplify's data to beef up patrols in high-crime areas. This is a whole new mode of social interaction, because when you join a social network like Facebook you typically expect to connect to the people you know, reconnect with the people you have known and discover the people you want to know. Simplify, however, reverses this process entirely. To view a 'shoutout', a user need not be a friend of the person who posted it. At a glance, a user may see a list of hot topics physically close to him, connecting and grouping people into multiple clusters of geographical networks collaborating to bring social good.
CONCLUSION
Mobile network congestion poses a series of great challenges along the mobile operator's value chain, spanning radio coverage, network capacity, security, quality-of-service, network optimization, standards compliance, application variety, device specifications, subscription packages, consumer connectivity behaviour, user experience and customer loyalty. The value of employing Big Data in understanding, reconciling and addressing each of these aspects is evident if mobile operators are to achieve operational excellence.
In our vision to liberate social connectivity, it is the interactive engagement between the crowd and the Simplify platform that makes Big Data mutually reinforcing. While Simplify users contribute split-second network information as they move, data scientists at the lab can collate, aggregate and propagate synthesized network information back to the community for an optimum network experience. Not only does this method open up a whole new way to collectively shape Mobile Internet traffic, the potential of leveraging crowd intelligence can go further, to solve more complex social issues in the future. The realization of Simplify's ecosystem presented in this chapter is consistent with Stibel's (2009) depiction of today's Internet as a complex network of brain neurons, wired together to store memories, associate patterns and manifest intelligence.
As pieces of Big Data originate from the crowd, much as in the natural water cycle where evaporating water turns into rain that showers back on the valley, we strongly believe that iteratively recycling the data back to society can nurture new perspectives, shaping collaborative and sensible behaviours that embrace the Eco Surf core values and bringing technology for good in cultivating a sustainable networked society.
This work was previously published in the Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence edited by
Noor Zaman, Mohamed Elhassan Seliaman, Mohd Fadzil Hassan, and Fausto Pedro Garcia Marquez, pages 6781, copyright year 2015 by
Information Science Reference (an imprint of IGI Global).
REFERENCES
Davenport, T. H. (2014). Big data at work: dispelling the myths, uncovering the opportunities . Boston: Harvard Business Review Press.
doi:10.15358/9783800648153
Dong, W., Lepri, B., & Pentland, A. (2011). Modeling the Co-evolution of Behaviors and Social Relationships Using Mobile Phone Data. In Proceedings of the 10th International Conference on Mobile and Ubiquitous Multimedia (pp. 134-143). doi:10.1145/2107596.2107613
Easley, D., & Kleinberg, J. (2010). Networks, Crowds and Markets: Reasoning About a Highly Connected World. New York: Cambridge University Press. doi:10.1017/CBO9780511761942
Ericsson. (2011). More Than 50 Billion Connected Devices.
Gonzalez, M. C., Hidalgo, A., & Barabási, A.-L. (2008). Understanding individual human mobility patterns. Nature Scientific Reports.
Liu, B. (2009). Web Data Mining: Exploring Hyperlinks, Contents and Usage Data . Springer.
Pentland, A. (2014). Social Physics: How Good Ideas Spread – The Lessons from A New Science . New York: The Penguin Press.
Schneider, C. M., Belik, V., Couronne, T., Smoreda, Z., & Gonzalez, M. C. (2013). Unraveling daily human mobility motifs. Journal of the Royal Society, 10(84), 1742–5662.
Stibel, J. M. (2009). Wired for Thought: How the Brain is Shaping the Future of Internet (pp. 105–115). Boston: Harvard Business Press.
Tay, Y. P. (2012). Getting Ready for Data Tsunami . Kuala Lumpur: Nextwave Technology.
Tay, Y. P. (2013). Nextwave Simplify Solution Description: Redefine the way we connect . Kuala Lumpur: Nextwave Technology.
Yang, Z., Yuan, N. J., Xie, X., Lian, D., Rui, Y., & Zhou, T. (2014). Indigenization of Urban Mobility. arXiv preprint arXiv:1405.7769.
KEY TERMS AND DEFINITIONS
3G: A set of third generation mobile network systems standardized by the Third Generation Partnership Project (3GPP), comprising Wideband
Code Division Multiple Access (W-CDMA), High Speed Packet Access (HSPA), Evolved High Speed Packet Access (HSPA+) and 4G Long Term
Evolution (LTE) technologies.
4G LTE: A fourth generation wireless communication standard defined by 3GPP for high-speed mobile data communication network evolved
from Global System for Mobile (GSM) communication.
ANDSF: Access Network Discovery and Selection Function (ANDSF) is a network element within the Evolved Packet Core (EPC) responsible for
propagating network policies containing access network discovery information and selection based on time, location, network preference and
device type.
EPC: Evolved Packet Core, the core network of 4G LTE systems introduced by 3GPP in its Release 8 standard.
Femtocell: A miniature cellular base station designed for use at homes and small businesses, with its backhaul connected to a fixed broadband network.
IEEE: The Institute of Electrical and Electronics Engineers is an organization that develops global standards in a broad range of industries including power and energy, biomedical and health care, information technology, telecommunication, nanotechnology, information assurance, etc.
IFOM: IP Flow Mobility is a mechanism that allows a mobile device to connect simultaneously to 3GPP cellular access and WLAN, enabling different IP flows belonging to the same or different applications to be moved seamlessly between 3GPP access and a WLAN belonging to the same network provider.
M. Anil Kumar
M. S. Ramaiah Institute of Technology, India
K. G. Srinivasa
M. S. Ramaiah Institute of Technology, India
G. M. Siddesh
M. S. Ramaiah Institute of Technology, India
ABSTRACT
In order to store these huge bulks of data, organizations have to buy servers and scale them according to their needs. One solution for storage is the cloud environment, as there is no need to scale up storage by aggregating more physical servers, or to install, update, or run backups; this also cuts down on system hardware and makes application integration easier. This chapter discusses key concepts of cloud storage as a solution for Big Data, different Big Data domains, and their applications, with case studies of how organizations are leveraging Big Data for better business opportunities.
1. INTRODUCTION
1.1 Big Data: A Revolutionary Technology
Big Data and Cloud Computing have brought about a huge transition in computing infrastructure: traditional computing techniques cannot be used for the storage and processing of very large quantities of digital information. Cisco has cited an estimate that, by the end of 2015, global Internet traffic will reach 4.8 zettabytes a year, that is 4.8 billion terabytes, which indicates both the Big Data challenge and the Big Data opportunities on the horizon.
1.2 Exploding Data Volumes
The amount of digital data being generated is growing exponentially for a number of related reasons. First, every organization is becoming increasingly aware of the value locked in massive amounts of data. E-commerce companies and retailers are starting to build up vast databases of recorded customer activity. Organizations working in many sectors, including healthcare, retail and social media, are generating additional value by capturing more and more data. New data collectors such as sensors, geo-data and Internet click tracking have created a world where everything around us, right from automobiles to mobile phones, is collecting massive amounts of data that may potentially be mined to generate valuable insights.
1.3 How to Avoid Data Exhaust?
The more an organization recognizes the lucrative role of Big Data, the more data it seeks to capture and utilize. However, because of Big Data's volume, velocity, and variety, organizations end up ignoring huge quantities of potentially valuable information. In effect, most of the data that surrounds organizations today is ignored: much of the data they gather goes unprocessed, with a significant quantity of useful information passing straight through them as "data exhaust".
Traditional large-scale computing solutions rely on expensive, highly fault-tolerant server hardware. Big Data platforms instead handle hardware failures and other system problems at the application level, so that a high level of service continuity can be delivered from clusters of server computers, each of which may be prone to failure. The most feasible premise, therefore, is to process vast quantities of data across large, low-cost distributed computing infrastructures.
Big Data requires new technological solutions. The need for more dependable, scalable, distributed computing systems capable of handling the Big Data deluge led to the use of more flexible technologies such as Hadoop, database virtualization, storage virtualization, and network virtualization, all of which avoid the single-device barrier that impedes scaling. To spread analytical processing across a bunch of commodity servers, Hadoop uses MapReduce, providing the elasticity and agility needed to scale and address Big Data.
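To make the MapReduce idea concrete, the following is a minimal, self-contained sketch of the pattern in Python: a map phase emits (word, 1) pairs and a reduce phase sums them per key. In a real Hadoop cluster the two phases would run on many commodity servers; here they run locally over an invented two-line sample, so the setup and the sample text are illustrative assumptions rather than anything from this chapter.

# Word count expressed as map and reduce phases; run locally for illustration.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for line in records:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle/sort by key, then reduce: sum the counts for each word.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data needs new tools", "cloud storage scales big data"]
    for word, count in reduce_phase(map_phase(sample)):
        print(word, count)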
Big Data is characterized by its volume and velocity, and convenient processing requires decentralization. Due to limiting factors such as resources and domain expertise, organizations are highly unlikely to implement their own solutions from scratch. Nonetheless, pioneers in the market like Amazon, NetApp, and Google allow organizations of all sizes to start benefiting from Big Data processing capabilities. Because data and processing needs will change, and that change may be rapid and disruptive, Big Data sets are best utilized on a more flexible platform such as cloud computing services, which makes them more manageable and agile (James Taylor, 2011). As new types of computer processing power become available, Big Data will progress in leaps and bounds.
2. CLOUD COMPUTING FOR BIG DATA
Cloud computing transforms the way Big Data has been dealt with over the years: it has emerged as a cost-effective and elastic computing paradigm that facilitates large-scale data storage and analysis. The cloud facilitates Big Data processing for a wide spectrum of enterprises of various sizes. Deriving business insights from Big Data is now prevalent, and the cloud is becoming the ideal choice for it, since cloud computing provides near-boundless capabilities on demand. With these capabilities available, any organization can work with unstructured data at huge scale (Gartner Report, 2013).
Relational database management systems and traditional data management tools can handle only limited amounts of data that are either structured or semi-structured; they do not have the capabilities to handle Big Data, which consists of massive data sets of varied types (John, 2007). A new variety of software environments, coupled with cloud infrastructure, is bracing for Big Data challenges such as acquisition, management, and analysis. Cloud computing already benefits from advances in computing and networking, and providers are now focusing on advancing and innovating in storage and management to tackle Big Data.
As shown in the figure below, data analytics is growing along with rapidly expanding data resources, which generate different kinds of data, structured and unstructured, for different purposes such as public or private access, each of which can be a potential cloud application. With easy-to-access web services, wireless devices, and the growth of mobile computing, more consumers, professionals, and organizations are creating and accessing data through cloud-based services (Janusz Wielki, 2013).
Figure 1. Big data cloud
2.1 What Cloud Has to Offer Big Data
• Previously, organizations were hesitant to experiment even with a few terabytes of data to see if they could get some insight that would be valuable to them, because doing so involved cost with a low or no probability of a valuable result. Now the cloud fosters the ability to innovate and experiment with existing data. Crunching data has become a lot easier than before, which has brought about a fundamental change in productivity and innovation.
• In traditional systems it takes far too long for people to get their applications running or compartmentalized to get things working. In the cloud, people do not have to worry about installing or updating software; the cloud vendor takes care of keeping it in place. All that organizations need to know is how to make the architectural decisions their application requires to scale, which then allows them to provision resources extremely quickly (a minimal provisioning sketch follows this list).
• A lot of enterprises are setting up private clouds, which bring complexities of their own such as data security and privacy. With this foreknowledge about the cloud, Big Data, and analytics, organizations can identify the specific roles each plays and what they need to do to better harness the benefits of these technologies.
• Companies are increasingly heading toward a step change, modernizing their technology suites for better infrastructure.
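As a hedged illustration of how quickly storage can be provisioned in a public cloud, the sketch below uses the AWS SDK for Python (boto3) to create an object-storage bucket and upload a raw data file. The bucket name, region, and file path are placeholders invented for this example, and AWS credentials are assumed to be configured in the environment.

# Provisioning object storage and uploading data with boto3.
# Bucket name, region and file path are illustrative placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Storage is provisioned in seconds, with no hardware to buy or maintain.
s3.create_bucket(Bucket="example-bigdata-landing-zone")

# Upload a raw data file into the landing zone for later analysis.
s3.upload_file("customer_activity.csv", "example-bigdata-landing-zone",
               "raw/customer_activity.csv")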
3. ESTABLISHED STRATEGIES TO PROCESS BIG DATA
In order to develop Big Data capabilities, companies either measure or experiment on their data. When measuring, they look for the values they can extract and know exactly what they are looking for; when the objective is to experiment, they use scientific methods to verify the validity of the experiment. To carry out these tasks, organizations compartmentalize and segregate the data being accumulated to make processing easier. More broadly, data can be analyzed by splitting it along one dimension; for example, based on data type it can be categorized into structured (transactional) data and unstructured (non-transactional) data. Keeping data type as a fundamental component, data can be strategically analyzed, which allows organizations to extract value from large volumes of data. Like any capability, this requires investment in technologies, processes, and governance.
3.1 Transactional or Structured Data
In organizations, the primary sources of transactional data, data with structure and schema, are their operations, and such data can easily be stored in relational database systems. To transform this raw data into meaningful information, techniques like Business Intelligence (BI), OLAP, SQL, predictive modelling, cluster analysis, and data mining are deployed to interactively analyse multidimensional data. Data that resides in a fixed field within a record or file is called structured data. Structured data has the advantage of being easily entered, stored, queried, and analyzed in traditional relational database systems (James Bean, 2011).
Structured data depends first on creating a data model: a model of the types of business data that will be recorded and how they will be stored, processed, and accessed. This includes defining which fields of data will be stored and how each will be stored, its data type (numeric, currency, alphabetic, name, date, address), along with any restrictions on the data input, such as the number of characters.
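As a minimal sketch of such a data model, the example below defines one structured table through Python's built-in sqlite3 module; the table, its fields, and the input restriction are illustrative placeholders rather than a schema taken from the chapter.

# Defining and querying a simple structured data model with sqlite3.
# Field names, types and constraints are illustrative placeholders.
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.execute("""
    CREATE TABLE sales_transaction (
        transaction_id INTEGER PRIMARY KEY,                       -- numeric identifier
        customer_name  TEXT    NOT NULL,                          -- name field
        amount         NUMERIC NOT NULL,                          -- currency value
        sale_date      TEXT    NOT NULL,                          -- date as ISO-8601 text
        postal_code    TEXT    CHECK(length(postal_code) <= 10)   -- input restriction
    )
""")
conn.execute("INSERT INTO sales_transaction VALUES (?, ?, ?, ?, ?)",
             (1, "A. Customer", 49.99, "2015-06-01", "560054"))

# A simple analytical query over the structured data.
total = conn.execute("SELECT SUM(amount) FROM sales_transaction").fetchone()[0]
print(total)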
Data Exploration and Performance Management are the fundamental strategies for processing transactional or structured data; they are discussed as follows.
3.1.1 Big Data Exploration
Data exploration is the deliberate activity of discovering new facts from existing data, and it is ascending to a new level of importance because of the crucial role it plays in analytics. Exploring data is a prerequisite to analysing it when the objective is to experiment. Data exploration helps organizations look into their past logs and present data consumption patterns to address organization-specific business issues and to understand the full potential of their existing and current data.
Data exploration typically begins with visualization: organizations look for the data available within and outside the enterprise, whether raw data, enterprise data, or external data, and work to understand it before further processing. They then establish the ability to access and use Big Data to support decision making and day-to-day operations; mine it for valuable information, using Big Data tools to sieve the most useful content from the rest; check the veracity of the data; and discover hidden insight in new and unstructured data while adding important context to raw data. Some of the key benefits of data exploration are that it reduces costs and improves operational performance for the business, improves decision making, brings greater efficiency to business processes, yields new insights by combining and analyzing data types in different ways, and helps develop new business models with increased market presence and revenue (TDWI, 2013).
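A hedged sketch of a first exploration pass over such data using the pandas library follows; the CSV file name and its columns (visit_date, segment, spend) are assumptions made for illustration, not data from the chapter.

# First-pass data exploration with pandas: load, profile and look for patterns.
import pandas as pd

df = pd.read_csv("customer_activity.csv", parse_dates=["visit_date"])

print(df.shape)            # how much data is there?
print(df.dtypes)           # what kinds of fields were captured?
print(df.isna().sum())     # veracity check: where are values missing?
print(df.describe())       # basic distribution of the numeric fields

# Simple pattern spotting: spend per customer segment, month by month.
summary = (df.groupby([df["visit_date"].dt.to_period("M"), "segment"])["spend"]
             .sum()
             .unstack(fill_value=0))
print(summary.head())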
3.1.2 Big Data Performance Management
Good performance management requires transparency into application transactions: the end user's interaction with the web server, the transactions with the database servers of all the different tiers, and how each of those individual components ties together to act as one application for the business and the end user. Application performance management is itself a Big Data challenge; organizations need to collect enough data to understand each transaction and put the different components together, and the key is to trace and track it end to end. Performance management for Big Data restores visibility and puts you back in control by providing deep insight into Big Data workloads and transactions, finding the primary cause of inefficiencies or problems in minutes instead of hours or days. It also improves collaboration between development and operations and brings new applications to market faster.
Here, the goal is to maintain high-quality data and accessibility for business intelligence and Big Data analytics applications. Effective Big Data management helps corporations, government agencies, and other organizations locate valuable information in large sets of unstructured and semi-structured data from a variety of sources, including call detail records, system logs, and social media sites (Wen-Chen & Naima, 2013).
3.2 Non-Transactional or Unstructured Data
Non-transactional, or unstructured, data plays a more complex role in the Big Data environment. While the web presents a number of very common examples of unstructured data, such as images, audio, video, and even text, they are only a subset of unstructured data, which is captured from sources far beyond this well-defined horizon. A typical enterprise also has other forms of unstructured data, such as emails, documents, charts, and graphs. Unstructured data presents various challenges: data size, bandwidth, varied formats and encodings, classification for discovery and tagging, and the lack of structure itself. Since such data is collected from a large, distributed community of people (as in social media), capturing and analyzing it gets intricate.
There are many key challenges and benefits involved in storing and processing unstructured data, and various technologies address them: textual analysis, with algorithms that analyze natural language and extract value from text along with its linkages, is often used in marketing and brand campaigns to understand customer behavior; network analysis on social media platforms can be used to create a social graph of followers and of connections among users. Among the various strategies for processing non-transactional or unstructured data, social media analytics and decision science have been the prime strategies, as they require coordination among various other components.
3.3 Social Media Analytics
Social media analytics combines traditional business data with social media data, people talking online, pictures, and video, in meaningful ways that lead to analytics and innovation. It is no longer just about social networks: most companies realize that they need some kind of engagement strategy, and that they need to know what the returns are on this investment of time, resources, and energy. A prerequisite to deploying social analytics is to have a goal for what to get from it and what can be done by knowing what people want, all of which can be pursued by tracking how people look at pages, fill out forms, request information, buy from e-commerce sites, and so on.
What metrics indicate success? Social media is propagated by social interaction. The web is no longer a passive medium; it is a place where people consume and create content to interact with one another, interactively generating data on forums, blogs, social networks, slide-sharing sites, product review sites, and many more. A lot of data is created and transmitted through social media because any user can express an opinion, share links, and share data, which means we can mine the data for the opinions of millions or billions of users to gain insights into human behavior, marketing analysis, product sentiment, and so on (Vishal Bhatnagar, 2013). The value proposition we get from this data is actionable intelligence: insight, reputation or brand management, marketing communications, and the product features that people like.
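A deliberately tiny, hedged sketch of lexicon-based opinion mining over social posts follows; the word lists and the sample posts are invented for illustration, and production sentiment systems rely on trained models and far richer language handling.

# Toy lexicon-based sentiment scoring over social media posts.
POSITIVE = {"love", "great", "fast", "recommend"}
NEGATIVE = {"hate", "slow", "broken", "refund"}

def score(post):
    words = {w.strip(".,!?").lower() for w in post.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

posts = [
    "Love the new app, checkout is fast!",
    "Delivery was slow and the item arrived broken. Refund please.",
]
for post in posts:
    s = score(post)
    label = "positive" if s > 0 else "negative" if s < 0 else "neutral"
    print(label, "|", post)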
With this development, as the world continues to become more and more social, competitive advantage will come to those who understand what is happening better than their peers and can directly connect it to their business outcomes and other useful pursuits. Social networks and enterprise social software have long been driven by two things: the connections between the people who use them and the information they share. To understand Big Data analytics in different domains, the following sections examine Big Data applications, issues and challenges, solutions, implementations, and technology considerations in domains such as healthcare, retail, and social media.
4. BIG DATA APPLICATION IN DIFFERENT DOMAIN
4.1 Big Data in Healthcare
The healthcare industry is sophisticated and unique. It has a particular mission: it deals with the most important asset of an individual, which is health. The Big Data revolution has a key role in health care. Healthcare organizations are aggregating data to realize its value for quality care, and are learning how to use it to improve the patient experience, how to share it, and how to secure it. Health care professionals have improved the standard of care, made more personalized decisions for individual patients, and identified risks at earlier stages; in these cases Big Data has eased the transition toward data-driven health care systems. To leverage Big Data effectively, it is very important for health care organizations to get the right tools, infrastructure, and techniques in place.
A healthcare data repository consists of life sciences data such as clinical trials and patient histories; clinical data such as EMRs, medical images, and scans; claims data covering utilization of care and cost estimates; and patient sentiment and behavior data from social networking. When integrated, all of this data leads to major opportunities for clinical decisions, intelligent service availability, and personalized medicine. The core components are to connect data, connect analysis, and connect people. The data connection is made by keeping efficient storage in place, with a distributed platform for storage optimization, security, and privacy. Analytics is then implemented on this stored data by compartmentalizing the processing and management of medical records and medical images, followed by analytics and visualization using SQL-like queries or machine learning. After gaining insight into this data, providers can offer patients more reliable health information services, personal health management, and personalized medicine.
Traditionally, physicians have used judgment when making treatment decisions and diagnoses, but gradually they have moved to evidence-based medicine, where treatment and other important decisions are made by systematically reviewing clinical data and using the best available information.
4.1.1 Key Benefits of Big Data in Healthcare
By effectively combining analytics and Big Data, healthcare organizations, networks, and groups of all sizes will gain significant benefits that can include:
• Information can be mined for effective outcomes across large populations using electronic health records (EHRs) coupled with new analytics tools.
• Researchers are actively studying trends using statistical methods and providing assessments of quality of care.
• Electronic sensors increasingly enable real-time analysis that can alert individuals and their providers to critical situations such as the early development of an infection or an allergic reaction.
4.1.2 Challenges of Big Data in Healthcare
Even though Big Data has a crucial role in healthcare, it faces certain implementation and structural issues.
• Along with this dynamic field, traditional medical management techniques will have to evolve.
• The data-driven revolution is a fundamental transformation that healthcare organizations will need to adjust to.
• Another major concern that will continue to persist is the privacy issue, which must be dealt with.
These are some of the common open challenges of Big Data in the healthcare domain. One example of success is Charité Universitätsmedizin, a hospital that has implemented a Big Data processing strategy using SAP HANA, which is discussed further as a case study.
CASE STUDY 1
Charité Universitätsmedizin Berlin
Charité Universitätsmedizin Berlin is one of the largest and oldest hospitals in Europe. It is trying to balance improvements in operational and budget efficiency, and it is also seeking next-generation business intelligence as one way of optimizing clinical outcomes, in particular by enabling quick, real-time analysis of the organization's very large data stores.
Challenge
To improve the methodologies, infrastructure, and technologies that transform raw data into meaningful and useful information, so as to drive both clinical and financial efficiency and to improve overall support for patient welfare, research pursuits, and cost concerns. Further, to handle huge amounts of unstructured data to help identify and develop new opportunities, and to implement an effective strategy for easy everyday operation of data centers serving clinical, research, administrative, and academic personnel. Primarily, the hospital needs ad hoc reporting from a data repository with multiple data sets, including structured and unstructured data, and real-time analysis in order to generate better patient outcomes and greater operational efficiency, which is not possible with its current system.
Implementation Goal
To have extensive parallelism, large memory capacity, quick response, and the ability to find relevant data in real time, each of which is delivered at a different tier of the infrastructure, including data sources, data storage, data organization, analysis, and actionable business intelligence (Intel case study, 2011).
Solution Strategy
Charité Universitätsmedizin collects data from online entry of patient information, Medicare enrollment, admission and outpatient logs, imaging efficiency measures, and patient surveys. The Annual Survey of Hospitals provides a comprehensive source of information on issues such as accommodation for patient care, resource utilization, revenue, Medicare utilization, and the types of hospital services. All of this data needs to be stored and hence requires large storage capacity.
Technology Consideration
In order to address the challenge of ad hoc reporting and to have a concrete real-time decision support system, Charité Universitätsmedizin replaced its traditional business intelligence with a BI system based on SAP HANA to support the needed capabilities. With traditional business intelligence at Charité Universitätsmedizin, data was read from disk, which is much slower than reading the same data from RAM, and performance suffered severely when analyzing large volumes of data. Traditional disk-based technologies here means relational database management systems such as SQL Server, MySQL, and Oracle, which are designed primarily for transactional processing. The structured query language is designed to efficiently fetch whole rows of data, while BI queries usually fetch only parts of rows and involve heavy calculations. SQL is a very powerful tool, but running complex analytical queries can take a very long time and often ends up slowing down transactional processing as well.
To avoid performance issues and provide faster query processing when dealing with large volumes of data, Charité needed advanced database techniques such as index creation for easy parsing, specialized data structures to support different data types, and efficient aggregation and joining of tables. As the complexity of its needs increased, traditional BI tools could not keep pace with the ever-growing business intelligence requirements and were unable to deliver real-time data to end users.
Reading data from disk before calculations and writing it back afterwards tends to hinder performance, since there is a lot of discontinuity associated with the operation. Charité overcame these limitations by integrating SAP HANA, which provides fast calculation on massive data sets, into its diverse system landscape.
As shown in the figure below, SAP HANA's key advantage is its ability to integrate smoothly into a diverse system architecture; its in-memory design keeps the primary copy of the data in memory instead of on disk. As the facility standardizes and consolidates its systems, HANA becomes the basis for centralized reporting that draws from diverse data sources such as medical imaging, laboratories, surgery, intensive care, and emergency services. By improving its capabilities for analysis and reporting on this information, Charité hopes to improve both efficiency and quality of care (SAP HANA, 2012).
Business Benefits
Charité Universitätsmedizin has implemented SAP HANA, an in-memory data platform for real-time analytics and applications, available as an appliance or via cloud deployment. At Charité Universitätsmedizin two functions are carried out. The first is real-time analytics, where HANA specializes in operational reporting and provides the ability to perform predictive and text analysis on large volumes of data in real time through the power of its in-database predictive algorithms and its R integration capability. The second is real-time applications, where HANA leverages in-memory technology through core process accelerators, planning, optimization, and real-time insight on Big Data. Because of its hybrid structure for processing transactional and analytical workloads fully in memory, it can also support traditional reporting structures, so that real-time, live transactions can be processed (John, 2007).
Case Study Conclusion
Speed, Affordability, and Reduced Dependence
At Charité Universitätsmedizin, the need for network access or disk I/O has been eliminated, as queries and the related data reside in the server's memory. This has significantly enhanced the performance and reliability of the data warehouses and back-end databases in which the required report data exists. With in-memory processing, the source database is queried only once instead of being accessed every time a query is run, thereby eliminating repetitive processing and reducing the load on database servers. To avoid traffic and congestion on database servers used for operational purposes during peak hours, the in-memory database can be scheduled to be populated overnight. In-memory analytics is now a less expensive and more feasible way to operate an enterprise business intelligence environment: in the past, storage and memory were expensive and 32-bit architectures offered limited processing power and addressable memory, but today the costs associated with memory continue to decline, while 64-bit computing delivers much greater memory capacity. It also reduces the burden of managing data for reporting and analysis purposes (SAP HANA, 2011); for instance, it can potentially eliminate the need to build, deploy, and maintain separate business intelligence solutions, and it can drastically cut down on data warehouse maintenance.
4.2 Big Data in Retail Industry
Retail has always been a dynamic, high-pressure environment. Retailers who best anticipate their customers' wants and needs, offering the right product, in the right place, at the right time, and for the right price, will prevail. All of this has been made possible by accumulating customer data and extensively analyzing it. Today, there is a vast and challenging new dimension to understanding the consumer: Big Data. Never before have customers generated such a deluge of information about their habits, behaviors, and desires. With every click, tap, or touch, every swipe, search, or share, consumer information is created.
Retailers need to tap all the information that has been accumulated to understand every step of the shopping experience, from a random web search to a social-network share, and to proactively gain insight into what a potential customer first thinks, or even what they might need or desire. Retailers should then gather these new torrents of data, identify patterns in them, combine them with more traditional sources of data, and establish a strategy for optimization. Once these information streams are coupled with fundamental meaning, the seemingly overwhelming data deluge can be evaluated, managed, and refined through Big Data analytics. But Big Data is not just a technology; it can drive sustainable competitive advantage, bringing about transformational strategic capabilities (Big Data deluge, 2012). So, instead of being overwhelmed and inundated by the torrent of data, retailers must learn to surf the deluge rather than let it wipe them out. Among the data sources that drive value for Big Data in retail, mobile, video, and social networking data promise enormous insight for retailers.
4.2.1 Key Benefits of Big Data in Retail
• Using Big Data, Retailers are able to predict what customers want.
• With Big Data, Retailers are able to provide customers the flexibility to research, decide, and buy.
• Retailers are working to enhance individual unique customer experiences over time.
• Big Data will enhance and solidify pre- and post-purchase relationships.
• Retailers have gained the capability to acquire and store data, analyze and interpret it, decide what is important, act on the basis of the insights gained, and measure and refine the actions taken from the new insights.
4.2.2 Driving Factors
Retailers need to consider adopting a holistic architecture for Big Data. This requires not only high-performance computing, but also the abilities
to ingest large volumes of high-velocity information from myriad sources; to analyze new data streams from video and rich media; and to take
action once that intelligence is filtered. This action will be greatly facilitated by new visualization and collaboration capabilities, and via
integration with workflow and business-process management. Cloud services are one of the leading ways to obtain Big Data capabilities: to gain the core capability of storing and processing high-volume, high-velocity data, retailers can simplify the process by employing a cloud service.
4.2.3 Advantages
Even though there are certain challenges that retailers need to overcome, there are exciting opportunities for retailers to begin reaping the benefits of Big Data. What keeps retailers working despite the intimidating volume of the data deluge is the good news that the actual insights they are looking for, once isolated, may not be overwhelming at all. Once captured, those insights intersect with the factors that have always defined retailing, where the key is speed and efficiency in finding them. It is impossible to ignore the opportunities that come with Big Data.
4.2.4 Disadvantages
Challenges exist in any sector. In retail, for instance, the massive amount of data is threatening because retailers are likely to get bogged down beyond their capacity for understanding. It is one huge challenge to compile and store terabytes of information on consumers, much of it unstructured data from sources like social media and video feeds. It is quite another to extract those crucial nuggets of golden insight that will enable retailers to understand, indeed predict beforehand, the behavior of consumers in real time.
CASE STUDY 2
Walmart
Wal-Mart Stores, Inc., branded as Walmart, is an American retail corporation that runs chains of large department and warehouse stores. The company is the world's second largest public corporation, is the biggest private employer with over two million employees, and is the largest retailer in the world.
Challenges
The company struggled for more than a decade in e-commerce, with a revolving door of e-commerce leadership and a string of questionable acquisitions. The company is said to hold petabytes of consumer data on more than 145 million Americans, more than 60 percent of U.S. adults. It had not been good at the technology part of this, and is now working on deploying its own search engine and on how it can market on the internet.
Implementation Goal
There should be an efficient way to gather information on consumers and implement analytics on these huge data sets. Walmart has to analyze consumer behavior online by tracking every page viewed by customers, identifiers for users and their devices, and location information, and it should be able to connect data from consumers' social media activity to transactions. Given the amount of information the organization holds, tasks such as acquisition, aggregation, and analysis become more complex (Walmart case study, 2012).
Solution Strategy
Data Acquisition and Data Storage
Walmart collects data from its consumers by tracking their transactions, every page viewed online, unique identifiers for users and their devices, system information such as device type, and location information (Srikanth, 2011). To acquire and store data, Walmart has infused a data ecosystem into its overall, companywide culture, and cross-functional approaches to data management have become part of its processes. Walmart has collected all incoming data and stored it in a Hadoop cluster.
Hadoop for Storage
Walmart uses HDFS (the Hadoop Distributed File System) for its data storage; Hadoop is a highly scalable analytics platform for processing large volumes of unstructured and structured data. The large-scale data collected at Walmart, multiple petabytes spread across hundreds or thousands of physical storage servers or nodes, is implemented on racks of commodity servers as a Hadoop cluster. The Hadoop framework is designed to scale out across such commodity servers rather than to scale up within a single server by using faster processors, more memory, and faster shared storage.
As shown in Figure 3, Hadoop has two primary components: the Hadoop Distributed File System (HDFS) and MapReduce.
• Hadoop Distributed File System: Popularly known as HDFS, a cost-effective, highly dependable, scalable, high-bandwidth data storage system that aids the easy management of data files across machines.
• MapReduce Engine: A high-performance parallel/distributed data processing implementation of the MapReduce algorithm. MapReduce was designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode. The "map" component distributes and handles the processing tasks across a large number of systems, is highly fault tolerant, and balances the load; the "reduce" component aggregates all the results back together to provide a single output.
Technological Innovation and Solution
As the regular Hadoop framework was not able to cope with the amount and speed of the incoming data, Walmart developed its own tool called Muppet. This tool processes data in real time across clusters and can perform several analyses at the same time, handling massive amounts of fast-moving data over large clusters of machines quickly and efficiently. Muppet was custom designed with a focus on the ability to manage and track data streams with billions of updates per day, dealing with both Big Data and Fast Data. The core functionality of Hadoop and the other open-source tools Walmart already used to analyze social media data from Twitter, Facebook, Foursquare, and other sources was retained and laid the groundwork. Muppet adopted Hadoop's primitives "Map" and "Reduce" but extended them, using "Map" and "Update" instead, and it has also been built to analyze Fast Data such as the full feed of all public tweets on Twitter (Walmart case study, 2012).
Data Analysis and Data Interpretation
Walmart implemented an enhanced "semantic" search engine, named Polaris, for its e-commerce channels. It combines advanced analytical technologies, information retrieval, machine learning, and text mining. Polaris is also used for mobile search and will expand to power the company's international e-commerce sites. The objective of Polaris is to deliver more meaningful results when shoppers enter keywords in the Walmart website's search field, and it does so by gathering the shopper's relevant interests and using a series of methods to intuit his or her intent in doing the search.
Searching for a shopping result is very different from conducting a general search. In 2012 Walmart announced its new search engine, named Polaris, built from the ground up by @WalmartLabs, and stated that "Polaris is a platform that connects people to places, events and
products giving Walmart a richer level of understanding about customers and products. The new search engine uses advanced algorithms
including query understanding and synonym mining to glean user intent in delivering results. Polaris focuses on engagement understanding,
which takes into account how a user is behaving with the site to surface the best results for them. It delivers a new and intuitive results page when
browsing for topics instead of giving a standard list of search results allowing shoppers to discover new items they may not have considered.”
Case Study Conclusion
The huge amount of data collected at Walmart is analyzed using technologies such as machine learning and data mining, which make predictions from data. Machine learning sits between statistics and data mining. At Walmart, machine learning is used to build a compressed representation of the existing data: terabytes of data need to be expressed in a compressed way, in terms of groups or subgroups, in order to make sense of the existing data and to make future predictions. This is done using three basic methods (a minimal clustering sketch follows the list):
• Classification;
• Clustering; and
• Regression.
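As a hedged sketch of the clustering method, the example below groups a tiny, invented customer table with scikit-learn's KMeans; the two features and the choice of three clusters are assumptions made for illustration only.

# Clustering as a compressed representation of customer data.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [annual_spend_usd, visits_per_month]
customers = np.array([
    [200, 1], [250, 2], [300, 1],        # occasional shoppers
    [1200, 8], [1100, 9], [1300, 7],     # frequent shoppers
    [5000, 4], [5200, 3],                # high-value shoppers
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(model.labels_)            # group membership for each customer
print(model.cluster_centers_)   # compressed summary of each group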
At Walmart, data exploration, or knowledge discovery, is implemented using data mining: efficient algorithms for mining relevant patterns and association rules and for discovering new frequent patterns and aberrations. Previously, data mining was coupled with traditional database systems and was linked to creating better ways of fetching data using SQL. Data mining has a strong focus on solving complex real-world industrial problems and finding practical solutions, and hence it deals mainly with two aspects of Big Data: data size, which is usually large, and data processing speed, which removes certain obstacles that were hindering analysis from reaching its full potential.
4.3 Big Data in Social Media
With over five billion people using mobile phones, two billion people using the internet, and over one billion Facebook users worldwide, social media has created a huge wave of data. Social media has become so popular that it cannot be ignored. Every single action taken on the internet produces added value: every user activity, be it a search query, a comment, a forum post, a social profile, a like, a shared tweet, a follow, or the upload or download of images, videos, or any other content, generates potential data that can produce great value when aggregated and capitalised on in market analysis and commercial advertising.
The dynamic nature of social networking has taken over from traditional methods of gathering people's opinions, such as polls, interviews, panels, and questionnaires. Now, the multi-network nature of social networking generates data that gives a clear idea of what users actually do and what they are talking about.
Social media analytics, like any Big Data strategy, includes the process of gathering data from a myriad of social media sources such as blogs, Twitter, wikis, and other public and private forums. Text analytics enables organizations to parse every phrase or word, along with its frequency and usage trends, which are then used for commercial and marketing purposes (Hinchcliffe, 2010). By keeping the central focus on gleaning the sentiment behind the information generated by users or customers about a product or an issue, organizations gather and analyze massive amounts of customer data to enhance the prospective customer experience. Social media has become centre stage for organizations that are looking to improve their businesses.
In analytics, data collection, segregation, and analysis form a continuous process that every organization must understand. A real-time, quick, and smart analytic strategy should be continuously enhanced and optimized after every pass over the data. Especially on social media, it is very important to check the veracity of the data, as it comes from known and unknown sources and streams in at lightning pace. Well-refined, simple, clear, and accurate data adds a lot of value and eases the remaining operations in organizations. Making effective use of customer data gives an organization an edge across the business, as social data and analytics offer greater benefits (Dyche, 2010). This is not easy to achieve, but the value it brings to an organization makes it imperative.
CASE STUDY 3
Netflix
Netflix is one of the pioneers that have leveraged the internet for their business. It is one of the world's leading companies providing movie and TV subscription services that were not possible previously, built on video on demand, a video streaming technology. Netflix maintains an archive of the viewing history of over 20 million users. Beyond video streaming and viewing history, it also maintains a log of the position at which each user stopped watching, so that users can easily resume playing from where they stopped. Over the past couple of years Netflix has offered more than 10,000 streaming videos.
Challenge
Netflix membership has increased exponentially over time, and the company realized the scale of data and storage that streaming media would require on top of its large DVD warehouses. With a single data centre it had a single point of failure, which could mean the loss of the entire database with no way to recover it. The amount of data kept growing along with the customer base (Netflix, 2011).
Implementation Goal
To maintain high availability of member information, stream quality video data in a more robust fashion, and create the flexibility and agility to stream video to multiple devices such as phones, tablets, iPads, and Wii consoles. The main challenge is to maintain a quality streaming experience, which is daunting as there are more customers every quarter.
Solution Strategy
Netflix adopted more reliable and fault-tolerant data storage: the open-source, scalable Apache Cassandra data platform, which lets Netflix quickly create and manage data clusters and minimize the chances of failure. With traditional databases, Netflix incurred added cost every time servers were scaled up or down. There was a single point of failure, since all the collected data was stored in one repository, and the central SQL databases inhibited the global exchange of data.
Technology Consideration
Amazon Web Services as Cloud Storage Solution
AWS provides a complete set of cloud computing services, accessed through the internet, to help build applications. Netflix adopted AWS, which provides IT resources such as computing, database services, business intelligence solutions, and data storage. Its highly durable storage services and low-latency database systems provide the elasticity to add capacity to meet customer demand and to scale back down again, hence enabling efficient performance management.
Netflix also developed its own tools to test and verify that its systems can survive any kind of failure. Scalability is critical for streaming services like Netflix; it is important to maintain thousands of servers and terabytes of storage, and reducing the complexity of running data across datacenter and cloud yields higher efficiency and higher availability. Netflix needed a database that could match the flexibility of cloud services. Eventually Netflix implemented Apache Cassandra, a solution that offers a globally distributed data model and the flexibility to create and manage data clusters quickly (Netflix, 2011). The process involved Netflix migrating its data from Oracle to Amazon's SimpleDB distributed database and eventually transitioning to Cassandra.
Cassandra is an open-source, distributed, high-performance, extremely scalable, fault-tolerant, non-relational database management system. It can serve both as a real-time data store for online transactional applications and as the database for business intelligence within an organization. It was designed with a focus on peer-to-peer, distributed systems; it automatically partitions data, and data duplication and replication can be customized by the user.
• With many different systems instead of one, failures occur at different times and result in losing small pieces of the system rather than the whole system at once.
• Cassandra's open-source model has given Netflix the flexibility to modify it, implementing its own backup and recovery system and its own replication strategy (Netflix, 2011). A minimal sketch of working with Cassandra follows this list.
Case Study Conclusion
For Netflix, Cassandra provides the flexibility to add capacity online and accommodate more customers and more data whenever needed. Product- or service-related issues are addressed with a wider window of opportunity for improvement by continuously monitoring social media. Leveraging social media analytics, Netflix tries to capture some major and common perspectives, such as the attitudes of customers and prospective customers in general or in the social sphere at large, brand-related activities, and the kind of impact these have on the organization, be it online or offline business. Big Data, once information has been mined, consolidated, and presented in a meaningful context, leads to better business decisions, which is important for the success of most businesses today.
5. BIG DATA BIOMETRICS
Data is being generated in various forms from diverse sources. One domain where Big Data finds application is biometrics, where the data gathered is biological data that is measured and analyzed for human body or behavioural characteristics. Biological and behavioral characteristics, captured through face recognition, fingerprint scanning, retina scanning, voice recognition, DNA, and so on, have become extremely important in public security systems, consumer electronics, and even corporate settings. Since these characteristics are firmly associated with identity, biometric-based person recognition provides a gateway for verifying identity.
Sensors of all types capture vital biometric data representing the identification traits needed for verification, and each trait can be measured along multiple dimensions. Typically, biometric systems address two distinct cases: identification, where a person is matched against many records, and verification, where a claimed identity is confirmed. In the latter case, the identity is checked against information already existing in the database in order to determine whether the identification is accepted or rejected.
With the ever-growing use of new sensor devices, the amount of data collected has increased exponentially. Innovative approaches that fuse sensors and the data collected using emerging technologies, together with the capability to take large-scale data inputs from multiple sources, are providing solutions for many identification- and verification-based person recognition tasks. Big Data tools are being deployed for data management and for biometric-intelligence-enabled data analysis, supporting predictive and risk analysis.
6. BIG DATA ANALYTICS INSIGHT TO VALUE
Big Data is all about being proactive and predictive. Proactive decisions require proactive analytics such as statistical analysis, predictive modeling, forecasting, text mining, and optimization. These techniques make it possible to determine trends, spot flaws, and identify the conditions for making decisions about the future, extracting only the relevant information from terabytes, petabytes, and exabytes of data and analyzing it to transform subsequent business decisions. How can problems be solved in advance, even before they happen? When might something fail, and for what reason? What is the predicted impact after a failure? These are the kinds of questions organizations have started to ask; a minimal predictive-modeling sketch follows.
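A hedged sketch of the predictive side of these questions follows: it fits a logistic regression to a handful of past readings and estimates the probability that a unit will fail. The features (temperature, recent error count), the tiny training set, and the labels are invented purely for illustration.

# Estimating the probability of failure from past readings with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [temperature_c, errors_last_hour]; label 1 means the unit failed soon after.
X = np.array([[60, 0], [65, 1], [70, 0], [85, 4], [90, 6], [95, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# The proactive question: how likely is this unit to fail?
current_reading = np.array([[88, 5]])
print(model.predict_proba(current_reading)[0, 1])   # estimated probability of failure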
Organizations are therefore embedding analytics to transform data into insight and then into operations. Every industry and every new technology is collecting more data than ever before, but these organizations are still looking for an effective way to obtain value from their data and compete in the market. They are constantly looking for sharper and more timely insights while keeping a focus on expenses, and many believe that their top priority is to acquire information and deploy advanced analytic approaches (Edd Dumbill, 2012).
Leading organizations in this domain are inventing their own strategies and approaches to get greater benefit and insight from information, keeping this as their central focus and building other implementation techniques around it: tools to handle massive data sets for a specific goal, timeframe, and architecture; assessment of the value at each step, using the results to improve the process; and sophisticated techniques to extract value from data and deliver business value in real time.
Big Data has the potential to provide service on a whole new level in various segments, whether to build new products or to improve analytical insight into existing information in ways that were not possible previously. Leading organizations such as Google, Amazon, and Facebook, by delivering highly personalized advertising, search results, and product recommendations, have already provided proof of how Big Data can help in diverse fields (Ping Li, 2009).
7. CONCLUSION
This chapter has discussed key Big Data concepts, the cloud and its use for Big Data, several established strategies for processing Big Data, and Big Data applications in different domains with different backgrounds and varying business requirements, analyzing each while considering the open problems in each area, with case studies as examples. Evolving technologies bring newer forms of data access to support faster response and decision times, along with the need to perform with speed and accuracy on an infrastructure that can handle high-velocity data reliably. Increasingly, the cloud, whether public, private, or hybrid, is becoming an important deployment model, and existing systems need to be able to integrate with it.
This work was previously published in the Handbook of Research on Securing Cloud-Based Databases with Biometric Applications edited by Ganesh Chandra Deka and Sambit Bakshi, pages 72-90, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Bhatnagar, V. (2013). Is your company primed for social media analytics? [Web log message]. Retrieved from http://searchbusinessanalytics.techtarget.com/answer/Is-your-company-primed-for-social-media-analytics
Gill, J. J. (2007). Shifting the BI paradigm with in-memory database technologies. Business Intelligence Journal, 12 (2), 58–62.
Hu, W-C., & Kaabouch, N. (2013). Big data management, technologies, and applications. In Essential big data management, technologies, and
applications. Hershey, PA: IGI Global.
Sathi, A. (2012). Big data analytics: Disruptive technologies for changing the game. MC Press.
ScaleDB. (n.d.). White paper: Big data and transactional databases: Exploding data volumes are creating new stress on traditional transactional databases. Retrieved from http://www.scaledb.com/pdfs/BigData.pdf
KEY TERMS AND DEFINITIONS
Analytics: Big Data analytics is the process of finding meaningful patterns in data and gaining insight from it, making extensive use of statistical and mathematical techniques.
Big Data Biometrics: Biometric data about a person, such as DNA, retina scans, and voice recordings, collected for biometric-based identification and verification.
Big Data: A term used to describe large volumes of data of both structured and unstructured types. Gartner defines it along the three dimensions of volume, velocity and variety.
Cloud Computing: For Big Data, a technological storage solution in which bulk data is stored, providing scalability and flexibility.
Data Exploration: The process by which gathered data is explored and analysed in order to find data relevant for statistical reporting, trend spotting and pattern spotting.
Performance Management: Big Data performance management maintains quality data and enables businesses to meet their goals through efficient data management.
Social Media Analytics: Customer data is captured to understand consumer sentiment or attitude in order to predict consumer behavior and provide recommendations for the next best action.
CHAPTER 38
Big Data and National Cyber Security Intelligence
A. G. Rekha
Indian Institute of Management Kozhikode, India
ABSTRACT
With the availability of large volumes of data and with the introduction of new tools and techniques for analysis, the security analytics landscape
has changed drastically. To face the challenges posed by cyber-terrorism, espionage, cyber fraud, and the like, government and law enforcement agencies need to enhance their security and intelligence analysis systems with Big Data technologies. Intelligence and security insight can be improved considerably by analyzing under-leveraged data such as data from social media, emails, and web logs. This chapter provides an overview of the opportunities presented by Big Data to provide timely and reliable intelligence for properly addressing terrorism, crime and other threats to public security. The chapter also discusses the threats posed by Big Data to public safety and the challenges faced in implementing Big Data security solutions. Finally, some of the existing initiatives by national governments using Big Data technologies to address major national challenges are discussed.
1. INTRODUCTION
With the availability of large volumes of structured and unstructured data and with the introduction of new tools and techniques for analysis, the security analytics landscape has changed drastically. To face the challenges posed by cyber-terrorism, espionage, cyber fraud, and the like, government and law enforcement agencies need to enhance their security and intelligence analysis systems with Big Data technologies. There are two different approaches regarding security in the Big Data context: one is leveraging Big Data to enhance national security systems, and the second is securing the national data itself. Intelligence and security insight can be improved considerably by analyzing under-leveraged data such as data from social media, emails, telecommunication systems, and web logs. By providing timely and reliable intelligence, Big Data analytics can help in properly addressing terrorism, crime and other threats to public security.
Nowadays a greater volume of significant data is available to security officials as well as to criminals, and hence Big Data presents both opportunities and threats for national security and critical infrastructures. Security systems need to be able to make use of information from a wide variety of sources, including humans and software applications, in order to exploit Big Data and get value out of it. At the same time, data ownership and privacy concerns also have to be taken into account. In light of the growing number of incidents of cyber terrorism and cases of threats to public security, in this chapter we will explore the national security implications of Big Data, including opportunities, threats and challenges. The rest of the chapter is organized as follows: Section 1 will discuss Big Data opportunities for national cyber intelligence, Section 2 will discuss threats of Big Data to the national security environment, Section 3 will cover challenges in exploiting Big Data for national security, Section 4 will discuss the role of Big Data Analytics for Critical Infrastructure Protection (CIP), Section 5 will discuss some of the tools and techniques available in the Big Data context, and Section 6 will discuss some of the major initiatives taken by governments for national security. Section 7 concludes this chapter.
2. BIG DATA OPPORTUNITIES FOR NATIONAL CYBER INTELLIGENCE
This section gives an overview of how Big Data technologies can complement cyber security solutions. Law enforcement agencies can utilize Big Data to ensure public safety by capturing and mining huge amounts of data from multiple sources. For example, there are systems that collect data related to travel, immigration, suspicious financial transactions, and so on. Linking previously unconnected datasets can remove the anonymity of individuals, and analyzing this data can reveal patterns of connections among persons, places or events. These patterns could then be used for proactive policy making to ensure public safety. Big Data tools can provide actionable security intelligence by reducing the time needed to correlate, consolidate, and contextualize information, and they can also correlate long-term historical data for forensic purposes. For instance, the WINE platform and Bot-Cloud allow the use of MapReduce to efficiently process data for security analysis (Ardenas, 2013). We will now discuss some of the opportunities of Big Data in the security landscape.
2.1. Efficient Resource Management to Set a Holistic Strategy
By integrating and analyzing huge amounts of structured and unstructured data from various sources, we can perform efficient security assessments and thereby support national security agencies in setting a holistic strategy for public safety. Leveraging all sources of available data presents great opportunities for more efficient resource management and thereby provides new insights and intelligence. Using Big Data, logs from multiple sources can be consolidated and analysed, which provides better security intelligence than analyzing them in isolation.
2.2. Crime Prediction and Mitigation
Discovering hidden relationships and detecting patterns in data gathered from sources such as the internet, mobile devices, transactions, email, and social media can reveal evidence of criminal activity. For example, by correlating real-time and historical user activity we can uncover abnormal user behavior and fraudulent transactions; a minimal sketch of this idea follows.
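A hedged sketch of that correlation: it compares a new transaction amount against a user's historical baseline with a simple z-score. The threshold and the sample amounts are illustrative; real fraud systems combine many features and far richer models.

# Flagging abnormal transactions against a user's historical baseline.
import statistics

historical_amounts = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0, 44.0]   # past activity
mean = statistics.mean(historical_amounts)
stdev = statistics.stdev(historical_amounts)

def is_suspicious(amount, threshold=3.0):
    # True when the amount deviates strongly from the user's own baseline.
    z = abs(amount - mean) / stdev
    return z > threshold

for amount in [49.0, 900.0]:          # new real-time transactions
    print(amount, "suspicious" if is_suspicious(amount) else "normal")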
2.3. Enhanced Security Intelligence
Analyzing massive amounts of data in real time can enhance security intelligence by providing insights into new associations. For example, analyzing network traffic in real time can detect attacks, and intercepting text and voice communications carried over the internet can provide intelligence about terrorist groups.
2.4. Intelligence Sharing
Big Data technologies can help different nations share intelligence more frequently and more intensively. The ability to analyze this data more effectively and quickly can produce high-quality intelligence. Sharing of data between various industry sectors and with the national security agencies is also needed.
The following case study presented by Zions Bancorporation shows how Hadoop and BI analytics can power better security intelligence.
2.4.1. Case Study: Zions Bancorporation (Source: Zions, 2015)
According to Preston Wood, CSO at Zions and the moderator of a panel of his Zions team members, the institution has been trying to move to a more data-driven approach to its security practice during the past several years, but it was continually running into the limitations of its traditional SIEM tools.
In order to drive deeper forensics and to train statistical machine-learning models, Zions found it needed months or even years of data before the data became functionally useful. This quantity of data and the frequency analysis of events were too much for the SIEM to handle alone.
“We [knew] we’d be bumping our heads against the ceiling with SIEM fairly early on,” Wood said. “The underlying data technology just couldn’t
handle it.”
What’s more, the analysis itself was watery. The team was swimming in data but had a hard time turning that into action.
“The SIEM is good for telling the data what to do,” Wood said. “But who is telling us what to do?”
The pivotal point came with Hadoop, which allowed the company to use data in a new, more effective way. Open-source Hadoop, coupled with the MapReduce programming model popularized by Google, has made life much different for Zions.
“The crux of the system is the distributed file system,” said Mike Fowkes, director of fraud prevention and analytics for Zions. The file system
makes it easy for administrators to run Java-based queries that will then run against data spread across multiple systems. This allows more
timely analysis of a greater sum of data than was before possible.
Zions’ results have been dramatic. In an environment where its security systems generate 3 terabytes of data a week, just loading the previous day’s logs into the system can be a challenge. It used to take a full day, Fowkes said.
“With MapReduce, HIVE, and Hadoop, we’re doing it in near-real-time fashion,” he said. “We’re pulling in data every five minutes, hourly, every
two minutes -- it just depends on the frequency of how fresh our data needs to be.”
And actual searches can be even more dramatically fast. Searching a month’s worth of logs used to take anywhere from 20 minutes to an hour, depending on how busy the server was, he said.
“In our environment within HIVE, it has been more like a minute to get the same deal,” Fowkes said.
Aside from a boost in data-mining firepower, Hadoop’s HDFS file system brings a robust level of availability to the data warehouse environment,
too.
“If you’re running a job and something fails on a system, it will dynamically readjust,” said Fowkes, explaining that a failure of a node or a hard
drive isn’t the show-stopper it used to be. Instead, the system is able to reapportion the data based on the number of remaining nodes.
With a fast and effective infrastructure set up and running, Zions uses the data for dozens of purposes. Database logs, firewall, antivirus and IDS logs, plus industry-specific logs like wire and ACH deposit applications and credit data, are all pulled together into a centralized syslog server.
While queries are written in Java, it takes more than an off-the-shelf Java programmer to put together meaningful queries and make sense of
what they return. That’s where Aaron Caldiero comes in. As senior data scientist at Zions, he plays the part of “part computer scientist, part
statistician, and part graphic designer,” he explains.
Caldiero's job is to collect and centralize the data, design methods of synthesizing it (ranging from basic logic to machine-learning algorithms),
and then present it in a coherent way.
His approach has achieved incredible results for his organization, but it may be foreign to security professionals.
“It’s a bottom-up process where you’re putting the data first,” Caldiero said.
Compiling huge amounts of data allows analysts to draw trends, patterns, or correlations that they might never have found had they put the
questions first and sorted through terabytes of data for the answers.
It’s an approach that has worked well for Zions, and Wood and his team believe it could be applied equally well elsewhere. Wood stressed that the power of big data analytics isn’t just for big companies, either.
“You can start with a single box in your environment,” he said, stressing that it is a technology well-suited for security, but the expectation needs
to be set that “big data strategy is a journey, not a destination. It’s not a product you’re going to buy; it’s not something you’re going to stand up
there and be done with.”
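To make the kind of batch log analysis described in this case study concrete, the following is a minimal Hadoop Streaming sketch in Python that counts denied connections per source IP across a syslog archive. It is purely illustrative: Zions' actual jobs were built with Java, Hive and MapReduce, and the log format, field positions and HDFS paths assumed here are hypothetical.

#!/usr/bin/env python3
# Illustrative Hadoop Streaming job (not the Zions implementation):
# count firewall DENY events per source IP from syslog-style lines such as
# "2015-03-01T12:00:00 firewall DENY 10.0.0.7 ...". Format is an assumption.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "DENY":
            print(f"{fields[3]}\t1")          # key = source IP, value = 1

def reducer(lines):
    # Hadoop Streaming delivers reducer input sorted by key, so groupby works.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for ip, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{ip}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    # Run as: hadoop jar hadoop-streaming.jar -mapper 'logjob.py map' \
    #         -reducer 'logjob.py reduce' -input /logs/... -output /out/...
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)

Because the mapper and reducer only read stdin and write stdout, the same script can be tested locally on a sample log file before being submitted to the cluster.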
3. NATIONAL SECURITY ENVIRONMENT: THREATS OF BIG DATA
With the increased availability of data to criminals, the number and scope of attacks have also increased. Attacks are typically performed to gain access to sensitive data, to steal funds, or to damage reputations. Identity theft, for instance, becomes simpler with the availability of information from multiple sources. After getting hold of sufficient information, criminals use social engineering to gather the rest of the data, even logon credentials. The availability of Big Data makes the detection of identity theft a harder task to accomplish.
In the next section we will discuss the challenges involved in exploiting Big Data for national security.
4. CHALLENGES IN EXPLOITING BIG DATA FOR NATIONAL SECURITY
The Cloud Security Alliance in their report has highlighted the top ten big data specific security and privacy challenges (CSA, 2012) as follows:
1. Secure computations in distributed programming frameworks;
2. Security best practices for non-relational data stores;
3. Secure data storage and transaction logs;
4. End-point input validation/filtering;
5. Real-time security/compliance monitoring;
6. Scalable and composable privacy-preserving data mining and analytics;
7. Cryptographically enforced data-centric security;
8. Granular access control;
9. Granular audits;
10. Data provenance.
In order to realize the true potential of Big Data for national security, there exist numerous other challenges which are specific to the national security context. These include the challenges involved in gathering and aggregating data from multiple sources, storing and managing the data thus collected for easy access and analysis, making sense of the raw data thus obtained, and utilizing it to avoid or mitigate the impact of a potential crime. Now we will briefly discuss some of the major challenges involved in the present system.
4.1. Siloed Data
Traditional government IT systems are standalone and proprietary, and there is a lack of an integrated information store. Collecting and analyzing this massive data from the application silos in the public sector in real time is beyond existing information handling techniques.
4.2. Rigidity of Conventional Systems
Conventional systems are rigid and are not designed to deal with the unstructured nature of Big Data. These systems are normally constrained to predefined schemas, and changing them to make them suitable for Big Data applications is difficult.
4.3. Noisy and Incomplete Data Sets
Existing databases can contain many missing data points and noisy features, making them infeasible for complex analytic applications. Proper pre-processing is required before using these datasets in applications, and that in itself is a rigorous task. Security-related applications demand real-time or near-real-time processing and hence cannot afford long pre-processing times.
4.4. Authenticity and Integrity of Data
Since the quality of results will depend on the data used, it is important that the sources from which data sets are obtained are reliable. The
authenticity and integrity of data should be verified before using it for security related applications.
4.5. Privacy Issues
Advances in Big Data analytics make privacy violations easier, and the issue is more complex when dealing with sensitive government data and data about individuals. The collected data could be of interest to many parties, including people from industry who can use it for marketing purposes or criminals who can use it for fraudulent activities. Hence, safeguards such as data masking should be put in place to prevent the abuse of big data. In this way we can make sure that the data is used only for the purpose for which it was collected, and privacy can be ensured.
4.6. Security of Big Data
We will need new systems and tools to deal with the unique security problems presented by the big data environment. Most traditional security solutions are not designed to take care of cluster environments, and hence new approaches that keep the distributed architecture in mind should be developed to meet the security requirements. The massive amounts of unstructured data being created daily, such as customer transaction details including credit card numbers, data on purchasing habits, and mobile communication data from cell towers, open new doors for cyber criminals, and hence Big Data requires a different approach to security.
4.7. Budgetary Restrictions
There is a need for a robust IT infrastructure for the efficient capture and analysis of large volumes of unstructured data. There is always a tradeoff between processing speed and accuracy. For national security applications, we need fast yet accurate analytics deployments, and this demands more provisioning in the budget.
4.8. Workforce
The availability of a talented workforce that can develop and use Big Data technologies is another challenge faced in the national security context.
5. BIG DATA ANALYTICS FOR CRITICAL INFRASTRUCTURE PROTECTION (CIP)
Critical infrastructures are vital assets like public health systems, financial networks, and air-traffic control systems. They are essential for the functioning of society and the economy. The volume of information related to critical infrastructures is increasing day by day and often crosses the petabyte threshold. Big Data analytics can play an instrumental role in protecting the critical infrastructure of a nation. Classifying critical assets and streamlining the deployment of security solutions capable of handling the distributed nature of Big Data are needed to ensure the protection of critical infrastructures. Leveraging machine learning on Big Data platforms can help in the early detection and prevention of attacks on critical infrastructure.
Now we will discuss use cases of security analytics in the big data context:
Use Case 3 (Law Enforcement Analytics): Law enforcement agencies deal with a tremendous amount of data on a day-to-day basis. These data include call records, video surveillance data and thousands of police reports. The Internet of Things (IoT), sensors and devices produce large volumes of data. By analyzing such data, long-term trends and hidden patterns can be revealed, which can help in predictive
policing. A white paper by CTOlabs.com describes that big data analysis has been highly effective in law enforcement and can make police
departments more effective, accountable, efficient, and proactive (CTOlabs.com, 2012). The author explains this using the following
example: “Take, for example, dividing a city into policing districts and beats, a process every department has to conduct regularly due to
crime trends and changing demographics. Patrol is the backbone of policing, and these divisions determine where patrol officers are
allocated, with officers typically patrolling, answering calls for, and staying within their beat throughout their shift. Departments that do not
use data analysis at all divide their city into equal portions, which is problematic as it assigns the same manpower to high crime and quiet
areas. Others use crime statistics to draw up boundaries, but given the many factors involved, patrol areas are rarely a great fit. For example,
distribution of officers can have an effect on crime, so a new beat map may change the very data a department is analyzing. Also, if size is
predicated only on crime, quiet beats may grow too large to effectively patrol. An officer’s ability to receive backup should also be considered,
as well as contingencies for reshuffling beats when officers have days off and get sick or injured. In a relatively large department, analysts
may look at millions of calls for service over several years to best plan patrols, but type of crime is also a factor. Fairly distributing resources
also means screening for biases and confounding variables, such as wealthier or more politically connected neighborhoods having a louder
voice and getting more attention than those really in need.
Hadoop provides a platform to solve all of those problems. It makes storing historical records, even phone calls and videos, cheap, as they can be
kept on commodity hardware. It lets you analyze them after the fact in any way you want, as Hadoop works with raw data and implements a
schema-on-read, adding whatever structure you need whenever you need it. It can also run similar analysis for other resources to allow an agency
to run more smoothly at a lower price, for example tracking gas consumption by cruisers, rates of ammunition used at the firing range, serial
numbers on stolen goods, and paper and form usage at stations.”
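As a toy illustration of the workload analysis described in the quoted example (and not taken from the white paper), the following Python sketch aggregates historical calls-for-service per beat and highlights night-time load; the record layout is an assumption.

# Toy sketch: summarise historical calls-for-service per beat so workload can
# be compared when drawing patrol boundaries. Records are assumed to be
# (beat_id, hour_of_day, priority) tuples; real data would hold far more fields.
from collections import Counter

calls = [
    ("B1", 23, "high"), ("B1", 22, "high"), ("B1", 2, "low"),
    ("B2", 14, "low"),  ("B2", 15, "low"),
    ("B3", 23, "high"), ("B3", 0, "high"), ("B3", 1, "high"),
]

calls_per_beat = Counter(beat for beat, _, _ in calls)
night_calls = Counter(beat for beat, hour, _ in calls if hour >= 22 or hour < 6)

for beat in sorted(calls_per_beat):
    print(beat, "total:", calls_per_beat[beat], "night:", night_calls[beat])
# A beat with a disproportionate night-time load is a candidate for a smaller
# area or for additional night-shift officers.

At departmental scale, the same aggregation would run as a distributed job over millions of historical calls, which is precisely the kind of workload Hadoop-style platforms are built for.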
6. TOOLS AND TECHNOLOGIES
Although traditional systems have been developed for analyzing event logs and network flows and for detecting intrusions and malicious
activities, they are not always adequate to handle large scale heterogeneous data. New techniques such as databases related to the Hadoop
ecosystem are emerging to analyze security data and improve security defenses. The various tools which can help in analyzing big data include
Hive (a query language), Pig (a platform for analyzing large datasets) and Mahout (for building data mining algorithms). Hive facilitates querying
and managing large datasets residing in distributed storage (Apache, Apache Hive TM, 2015). Pig consists of a high-level language for expressing
data analysis programs, coupled with infrastructure for evaluating these programs. Its structure is amenable to substantial parallelization, which in turn enables it to handle very large data sets (Apache, Welcome to Apache Pig!, 2015). Mahout provides free implementations of
distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and
classification. It also provides Java libraries for common mathematical operations (Sean Owen, 2011). Relatively new frameworks like Spark are
being developed to improve the efficiency of machine learning algorithms. Many of the implementations use the Apache Hadoop platform.
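As a small illustration of how a framework such as Spark can be applied to security data, the following PySpark sketch counts failed SSH logins per source host from plain-text authentication logs; the HDFS path, log format and token positions are assumptions, not part of the chapter.

# Illustrative PySpark sketch: count failed SSH logins per source host from
# plain-text auth logs in HDFS. Path and log format are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("failed-logins").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///security/logs/auth/*.log")
failed = (lines
          .filter(lambda line: "Failed password" in line)
          .map(lambda line: (line.split()[-4], 1))   # assume host is the 4th-from-last token
          .reduceByKey(lambda a, b: a + b))

for host, count in failed.top(20, key=lambda kv: kv[1]):
    print(host, count)

spark.stop()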
There are also several databases designed specifically for efficient storage and query of Big Data. These include Cassandra, CouchDB, Greenplum
Database, HBase, MongoDB, and Vertica. Apache Cassandra is an open source distributed database management system designed to handle
large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra has support for
clusters spanning multiple data centers, with asynchronous masterless replication allowing low latency operations for all clients
(planetcassandra, 2015). CouchDB is a database for the web which stores data as JSON documents and supports master-master setups with automatic conflict detection (J. Chris Anderson, 2010). Data access and querying can be performed from the web browser, and documents can be indexed, combined, and transformed using JavaScript. Greenplum Database utilizes a shared-nothing, massively parallel processing (MPP)
architecture with the ability to utilize the full local disk I/O bandwidth of each system (EMC, 2010). HBase is a non-relational database that runs
on top of HDFS. Its tight integration with Hadoop makes scalability with HBase easier (George, 2011). MongoDB is a document database that
provides high performance, high availability, and easy scalability. It stores data in JSON documents with many of the features of a traditional RDBMS, such as secondary indexes, dynamic queries, sorting, rich updates, upserts and easy aggregation (Chodorow, 2013). HP Vertica provides
advanced SQL analytics as a standards-based relational database with full support for SQL, JDBC, and ODBC. It also has numerous built-in
analytical functions, including geospatial, time series and pattern matching (Agrawal, 2014). Apart from these, there are scalable stream processing tools, such as IBM InfoSphere Streams, which are designed for stream computing. InfoSphere Streams is an advanced analytic
platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time
sources with the capacity to handle very high data throughput rates, up to millions of events or messages per second (IBM, 2015). TIBCO
StreamBase is a high-performance system for rapidly building applications that analyze and act on real-time streaming data. The goal of
StreamBase is to offer a product that supports developers in rapidly building real-time systems and deploying them easily (TIBCO, 2015).
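As a brief illustration of the document model mentioned above, the following pymongo sketch stores security alerts as schema-free documents and queries them through a secondary index; the database, collection and field names are assumptions made for the example.

# Illustrative pymongo sketch: store alert documents and query them through a
# secondary index. Database, collection and field names are assumptions.
from datetime import datetime
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
alerts = client.security.alerts

alerts.create_index([("src_ip", ASCENDING), ("ts", ASCENDING)])  # secondary index

alerts.insert_one({
    "src_ip": "203.0.113.9",
    "ts": datetime.utcnow(),
    "type": "port_scan",
    "ports": [22, 80, 443],          # schema-free: fields can vary per document
})

for doc in alerts.find({"src_ip": "203.0.113.9"}).sort("ts", -1).limit(10):
    print(doc["ts"], doc["type"])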
7. EXISTING BIG DATA APPROACHES FOR NATIONAL SECURITY
Now we will discuss some of the Big Data initiatives undertaken by leading countries for national security. [Source: (Gang-Hoon Kim, 2014)]
• U.S.: The U.S. government and IBM in 2002 collaborated to develop a massively scalable, clustered infrastructure resulting in IBM
InfoSphere Stream and IBM Big Data to achieve real-time analysis of high volume streaming data. They are platforms for discovery and
visualization of information from thousands of real-time sources, encompassing application development and systems management built on
Hadoop, stream computing, and data warehousing. Both are widely used by government agencies and business organizations. The U.S.
government has launched Data.gov as a step toward government transparency and accountability. It is a warehouse of datasets covering
transportation, economy, health care, education, and human services and a data source for multiple applications (U.S. Government.
Data.gov). The Internal Revenue Service has been integrating big data-analytic capabilities into its Return Review Program (RRP), which by
analyzing massive amounts of data allows it to detect, prevent, and resolve tax-evasion and fraud cases (National Information Society
Agency, 2012).
• U.K.: The U.K. government established the U.K. Horizon Scanning Centre (HSC) in 2004 to improve the government's ability to deal with cross-departmental and multi-disciplinary challenges.
• Asia: South Korea’s Big Data Initiative aims to establish pan-government big-data-network-and-analysis systems; promote data
convergence between the government and the private sectors; build a public data-diagnosis system; produce and train talented
professionals; guarantee privacy and security of personal information and improve relevant laws; develop big-data infrastructure
technologies; and develop big-data management and analytical technologies. To address national security, infectious diseases, and other
national concerns, the Singapore government launched the Risk Assessment and Horizon Scanning (RAHS) program. By collecting and analyzing large-scale datasets, it proactively manages national threats, including terrorist attacks, infectious diseases, and financial crises (Habegger, 2010). The Japanese government has initiated several programs to address the consequences of the Fukushima earthquake,
tsunami, and nuclear-power-plant disaster and the reconstruction and rehabilitation of affected areas, as well as relief of related social and
economic consequences with the help of Big Data.
7.1. Case Study: NSA PRISM and MUSCULAR
National Security Agency PRISM (Planning Tool for Resource Integration, Synchronization, and Management) is a top-secret data-mining
program aimed at terrorism detection and other pattern extraction authorized by federal judges working under the Foreign Intelligence
Surveillance Act (FISA). This project uses big data technologies and allows the U.S. intelligence community to gain access from nine Internet
companies to a wide range of digital information, including e-mails and stored data, on foreign targets operating outside the United States. As per
the Guardian, the National Security Agency has obtained direct access to the systems of Google, Facebook, Apple and other US internet giants.
The program facilitates extensive, in-depth surveillance on live communications and stored information. The law allows for the targeting of any
customers of participating firms who live outside the US, or those Americans whose communications include people outside the US. The breadth
of the data it is able to obtain include: email, video and voice chat, videos, photos, voice-over-IP (Skype, for example) chats, file transfers, social
networking details, and more. The court-approved program is focused on foreign communications traffic, which often flows through U.S. servers
even when sent from one overseas location to another. As per director of National Intelligence James R. Clapper: “information collected under
this program is among the most important and valuable foreign intelligence information we collect, and is used to protect our nation from a wide
variety of threats. The unauthorized disclosure of information about this important and entirely legal program is reprehensible and risks
important protections for the security of Americans.”
PRISM is an heir, in one sense, to a history of intelligence alliances with as many as 100 trusted U.S. companies since the 1970s. The NSA calls
these Special Source Operations, and PRISM falls under that rubric. Even when the system works just as advertised, with no American singled
out for targeting, the NSA routinely collects a great deal of American content. That is described as “incidental,” and it is inherent in contact
chaining, one of the basic tools of the trade. To collect on a suspected spy or foreign terrorist means, at minimum, that everyone in the suspect’s
inbox or outbox is swept in. Intelligence analysts are typically taught to chain through contacts two “hops” out from their target, which increases
“incidental collection” exponentially. (Sources: (theguardian, 2013), (washingtonpost, 2013), (baselinemag, 2013), (informationweek, 2013)).
As per the Washington Post revelation NSA has tapped into overseas links that Google and Yahoo use to communicate between their data centers.
This program, codenamed MUSCULAR, harvests vast amounts of data. A top-secret memo dated January 9, 2013 says that the NSA gathered
181,280,466 new records in the previous 30 days. Those records include both metadata and the actual content of communications: text, audio,
and video. Operating overseas gives the NSA more lax rules to follow than what governs its behavior stateside.
As per Baseline, PRISM uses something similar to Apache Hadoop: a massively distributed file system that can hold large volumes of
unstructured data and process it in a fast and parallel way. This platform must be self-healing, horizontally scalable and built with off-the-shelf
components. Like Hadoop, it most likely works by sending the program to the data rather than the more traditional approach of ingesting the
data into the program. According to InformationWeek, the centerpiece of the NSA's data-processing capability is Accumulo, a highly distributed,
massively parallel processing key/value store capable of analyzing structured and unstructured data. Accumulo is based on Google's BigTable
data model, but NSA came up with a cell-level security feature that makes it possible to set access controls on individual bits of data. Without that
capability, valuable information might remain out of reach to intelligence analysts who would otherwise have to wait for sanitized data sets
scrubbed of personally identifiable information. One of Accumulo's strengths is finding connections among seemingly unrelated information and
as per Dave Hurry, head of NSA's computer science research section: “By bringing data sets together, Accumulo allowed us to see things in the data that we didn't necessarily see from looking at the data from one point or another.” Accumulo gives the NSA the ability “to take data and to stretch it in new ways so that you can find out how to associate it with another piece of data and find those threats.” The power of this capability is finding patterns in seemingly innocuous public network data -- which is how one might describe the data accessed through the PRISM program --
yet those patterns might somehow correlate with, say, a database of known terrorists or data on known cyber warfare initiatives. Where prior
intelligence techniques have largely been based on knowing patterns and then alerting authorities when those patterns are detected, security and
intelligence analysts now rely on big data to provide more powerful capabilities than analytics alone.
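Accumulo itself is a Java system; the following Python toy only illustrates the idea of cell-level visibility reported above, where each key/value cell carries a label and a scan returns only the cells the caller's authorizations satisfy. It is a conceptual sketch, not the Accumulo API, and it simplifies real visibility expressions (which support boolean AND/OR) to a plain set of required labels.

# Conceptual toy of cell-level security (not the Accumulo API, which is Java):
# each cell carries a set of visibility labels, and a scan returns only cells
# whose labels are all held by the caller.
cells = [
    {"row": "person#1001", "col": "contact:email", "vis": {"FOUO"},         "val": "a@example.org"},
    {"row": "person#1001", "col": "intel:source",  "vis": {"SECRET", "SI"}, "val": "report-17"},
    {"row": "person#1002", "col": "contact:email", "vis": {"FOUO"},         "val": "b@example.org"},
]

def scan(cells, row_prefix, authorizations):
    """Yield cells under `row_prefix` whose visibility labels are all held."""
    for cell in cells:
        if cell["row"].startswith(row_prefix) and cell["vis"] <= authorizations:
            yield cell

# An analyst cleared only for FOUO sees the contact data but not the source.
for cell in scan(cells, "person#1001", {"FOUO"}):
    print(cell["col"], cell["val"])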
8. CONCLUSION AND SUMMARY
Big data technologies have a big role to play in the fight against terrorism and crime and in various other situations related to law enforcement and national security. By analyzing data from sources such as social media, emails, telecommunication systems and web logs, security insight can be improved considerably. In this chapter we have discussed the scope of Big Data in extending cyber intelligence and enhancing national security. We have presented the case study of Zions Bancorporation, which was able to increase its processing efficiency using Hadoop technologies. We have also discussed the threats posed by Big Data to public safety and the challenges faced in implementing Big Data security solutions. We then discussed Big Data analytics for Critical Infrastructure Protection. Three use cases, the first related to intrusion detection, the second to social network crime analytics and the third to law enforcement analytics, have also been discussed. The tools and technologies for big data have been discussed, along with the BotCloud research project and Airavat case studies. We have also discussed some of the existing initiatives by national governments (in the US, the UK and Asia) that use Big Data technologies to address major national challenges involving terrorism, natural disasters, healthcare and so on, and finally a case study of NSA PRISM aimed at terrorism detection has been presented.
This work was previously published in Managing Big Data Integration in the Public Sector edited by Anil Aggarwal, pages 231-244, copyright year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Cardenas, A. A. (2013). Big data analytics for security. IEEE Security and Privacy, 11(6), 74–76. doi:10.1109/MSP.2013.138
Cheon, J. C. T.-Y. (2013). Distributed processing of Snort alert log using Hadoop. International Journal of Engineering and Technology, 2685–2690.
Gang-Hoon Kim, S. T.-H. (2014). Big-Data Applications in the Government Sector. Communications of the ACM , 78–85.
George, L. (2011). HBase: The Definitive Guide Random Access to Your Planet-Size Data . O'Reilly Media.
Habegger, B. (2010). Strategic foresight in public policy: Reviewing the experiences of the U.K., Singapore, and the Netherlands. Futures , 42(1),
49–58. doi:10.1016/j.futures.2009.08.002
Jerome Francois, S. W. (2011). BotCloud: Detecting botnets using MapReduce. IEEE International Workshop on Information Forensics and Security.
Lee, Y. L. Y. (2013). Toward scalable internet traffic measurement and analysis with hadoop. ACM SIGCOMM Comput Commun Rev, 5-13.
Richard Zuech, T. M. (2015). Intrusion detection and Big Heterogeneous Data: A survey. Journal of Big Data, 2(1), 3. doi:10.1186/s40537-015-0013-4
Suthaharan, S. (2014). Big data classification: Problems and challenges in network intrusion prediction with machine learning. Performance Evaluation Review, 41(4), 70–73. doi:10.1145/2627534.2627557
Tene, O., & Polonetsky, J. (2012). Privacy in the age of big data: A time for big decisions . Stanford Law Review Online.
Tudor Dumitras, D. S. (2011). Toward a standard benchmark for computer security research: The Worldwide Intelligence Network Environment
(WINE) . Building Analysis Datasets and Gathering Experience Returns for Security. doi:10.1145/1978672.1978683
Ahmet Artu Yıldırım
Utah State University, USA
Dan Watson
Utah State University, USA
ABSTRACT
Major Internet services are required to process a tremendous amount of data in real time. As we put these services under the magnifying glass, it is seen that distributed object storage systems play an important role at the back-end in achieving this success. In this chapter, an overview of current state-of-the-art storage systems is given, covering the systems used for reliable, high-performance and scalable storage needs in data centers and the cloud. Then, an experimental distributed object storage system (CADOS) is introduced for retrieving large data, such as hundreds of megabytes, efficiently through HTML5-enabled web browsers over big data – terabytes of data – in cloud infrastructure. The objective of the system is to minimize latency and propose a scalable storage system on the cloud using a thin RESTful web service and modern HTML5 capabilities.
INTRODUCTION
With the advent of the Internet, we are faced with the need to manage, store, transmit and process big data in an efficient fashion in order to create value for all concerned. There have been attempts to alleviate the problems that emerge from the characteristics of big data in high-performance storage systems that have existed for years, such as: distributed file systems, e.g., NFS (Pawlowski et al., 2000), Ceph (Weil et al., 2006), XtreemFS (Hupfeld et al., 2008) and the Google File System (Ghemawat et al., 2003); grid file systems, e.g., GridFTP (Allcock et al., 2005); and, more recently, object-oriented approaches to storage systems (Factor et al., 2005).
As an emerging computing paradigm, cloud computing refers to the leasing of hardware resources, as well as applications, as services over the Internet in an on-demand fashion. Cloud computing offers relatively low operating costs, as the cloud user no longer needs to provision hardware according to the predicted peak load (Zhang et al., 2010), thanks to on-demand resource provisioning that comes with a pay-as-you-go business model. In the realization of this elasticity, virtualization is of significant importance: hypervisors run virtual machines (VMs) and share the hardware resources (e.g., CPU, storage, memory) between them on the host machine. This computing paradigm provides a secure, isolated environment in which operational errors or malicious activity occurring in one VM do not directly affect the execution of another VM on the same host. Virtualization technology also enables cloud providers to further cut spending through live migration of VMs to underutilized physical machines, without downtime and in a short time (Clark et al., 2005), thus maximizing resource utilization.
The notion of an object in the context of storage is a new paradigm introduced in (Gibson et al., 1997). An object is the smallest storage unit that contains data and attributes (user-level or system-level). Contrary to block-oriented operations that are performed at the block level, object storage provides the user with a higher-level abstraction layer to create, delete and manipulate objects (Factor et al., 2005). The backends of most object storage systems maximize throughput by means of caching and distributing the load over multiple storage servers, and ensure fault tolerance through file replication on data nodes. Thus, they share characteristics with most high-performance data management systems, such as fault tolerance and scalability.
Modern web browsers have started to come with contemporary APIs, introduced with the fifth revision of the HTML standard (HTML5), to enable complex web applications that provide a richer user experience. However, despite the need on the client side, web applications are still not taking advantage of HTML5 to deal with big data. On the server side, object storage systems are complex to build and their infrastructure is complex to manage.
We introduce an experimental distributed object storage system for retrieving relatively bigger data, such as hundreds of megabytes, efficiently
through HTML5-enabled web browsers over big data – terabytes of data – using an existing online cloud object storage system, Amazon S3, to
transcend some of the limitations of online storage systems for storing big data and to address further enhancements.
While existing systems exhibit the capability of managing high volumes of data, retrieving larger resources from a single storage server might cause inefficient I/O due to unparallelized data transfer at the client side and underutilized network bandwidth. The main objective of the implemented system is to minimize latency via data striping techniques and to propose a scalable object storage system on top of an existing cloud-based object storage system. For the client side, we implemented a JavaScript library that spawns a set of web workers – a feature introduced with HTML5 to create separate execution streams in web browsers – to retrieve the data chunks from the storage system in parallel. We aim to increase data read rates in the web browser by utilizing the full Internet bandwidth. Our approach is also capable of handling data loss by automatically backing up the data in a geographically distinct data center. The proposed distributed object storage system handles common errors gracefully: for example, if a disaster takes place in the data center and results in data inaccessibility, the implemented client detects this issue and starts retrieving the data from the secondary data center. We discuss the advantages and disadvantages of using the proposed model over existing paradigms in this chapter.
The remainder of this chapter is organized as follows. In Section 2, we discuss high performance storage systems; distributed file systems, grid
file systems, object storage systems and common storage models to handle big data. Then in Section 3, we introduce the cloud-aware distributed object storage system (CADOS) and present its architecture, the CADOS library and performance measurements. Finally, we present our conclusions in
Section 4.
HIGH PERFORMANCE STORAGE SYSTEMS
Distributed File Systems
Distributed file systems provide access to geographically distributed files over a POSIX-compliant interface or API. The advantages of distributed file systems come from their fault-tolerant, high-performance and highly scalable data retrieval, achieved through replication and load balancing techniques. A variety of distributed file systems have been used for decades, especially in data centers, high-performance computing centers and cloud computing facilities, as backend storage providers.
The Lustre file system (Wang et al., 2009) is an object-based file system that is composed of three software components: metadata servers (MDSs) that provide the metadata of the file system, such as directory structure, file names and access permissions; object storage servers (OSSs) that store file data objects and function as block devices; and clients that access the file system over a POSIX-compliant file system interface, with calls such as open(), read(), write() and stat().
Figure 1. Lustre file system
In Lustre, high performance is achieved by employing a striping technique in which the segments of a file are spread across multiple OSSs and the client then reads each data segment in parallel directly from the OSSs. The goal of this approach is to minimize the load on the disks and to achieve high utilization of the bandwidth of the underlying interconnect. The number of stripes for a file strongly affects the performance of data retrieval, but it also increases the possibility of data loss in case any object storage server fails. The other important factor is that the MDS might be considered a single point of failure, since all the data on the object storage servers becomes inaccessible when the MDS fails. To achieve high availability of metadata, multiple OSS nodes can be configured as failover nodes to serve as the metadata server. Moreover, OSS nodes can also function as a failover pair for high availability of block devices.
Ceph (Weil et al., 2006) is a distributed file system that provides high performance, reliability, and scalability. Ceph follows the same paradigm as the Lustre file system by separating data and metadata management. A special-purpose data distribution function called CRUSH is used to assign objects to heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). In order to prevent overloading of the metadata cluster, Ceph utilizes a novel metadata cluster architecture based on Dynamic Subtree Partitioning (Roselli et al., 2000) that adaptively and intelligently distributes responsibility for managing the file system directory hierarchy among multiple MDSs, while the client sees the object storage cluster as a single logical object store and namespace. Ceph follows a stochastic approach by randomly distributing the data over OSDs. For data distribution and data location, Ceph uses a hash function to map the objects of a file into placement groups (PGs); in turn, placement groups are assigned to OSDs using CRUSH. Thus, data placement does not rely on any block or object list metadata. This approach is depicted in Figure 2.
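The following Python toy mirrors the two-step mapping just described (object to placement group by hashing, placement group to a set of OSDs by a deterministic function), so that any client can compute an object's location without a central lookup table. It is a deliberate simplification, not the actual CRUSH algorithm, and the PG count, OSD names and replica count are assumptions.

# Simplified toy of Ceph-style placement (not the real CRUSH algorithm):
# step 1 hashes an object name to a placement group (PG); step 2 maps the PG
# deterministically to a set of OSDs, so no central lookup table is needed.
import hashlib

NUM_PGS = 128
OSDS = [f"osd.{i}" for i in range(12)]
REPLICAS = 3

def stable_hash(text: str) -> int:
    return int(hashlib.md5(text.encode()).hexdigest(), 16)

def object_to_pg(obj_name: str) -> int:
    return stable_hash(obj_name) % NUM_PGS

def pg_to_osds(pg: int) -> list:
    start = stable_hash(f"pg-{pg}") % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(REPLICAS)]

obj = "volumes/vm-42/disk.img"
pg = object_to_pg(obj)
print(obj, "->", pg, "->", pg_to_osds(pg))   # every client computes the same mapping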
With the advent of the Apache Hadoop library (White, 2009), a distributed file system tailored for map-reduce computations became a necessity for processing large data sets, since Hadoop applications mostly need a write-once-read-many access model for files. To address this model, the Hadoop Distributed File System (HDFS) (Shvachko et al., 2010) locates the data near the application that operates on it. This approach minimizes network congestion for extensive reading operations and increases the overall throughput of the system. An HDFS cluster consists of a single NameNode that manages the file system namespace and file accesses, and multiple DataNodes across which the data are spread in a striping fashion. HDFS replicates file blocks to provide fault tolerance.
Other distributed file systems can be found in (Hupfeld et al., 2008), (GlusterFS, 2014), (MooseFS, 2014) and (Ghemawat et al., 2003).
Grid File Systems
A Grid (Ferreira et al., 2003) is a federation of a number of heterogeneous, geographically dispersed systems that share various types of resources such as CPU, storage and software resources. Grid file systems create the illusion of a single file system by utilizing many storage systems, where reliability, high-speed data access and a secure sharing mechanism are among the important design factors.
GridFTP (Allcock et al., 2005) is the extension to the File Transfer Protocol (FTP) to be a general-purpose mechanism for secure, reliable, high-
performance data movement. Thus, apart from the features supported by FTP, it provides a framework with additional capabilities such as
multiple data channels for parallel data transfer, partial file transfer, failure detection, server side processing and striped data transfer for faster
data retrieval over the grid. For secure file operations, GridFTP supports GSS-API, for Grid Security Infrastructure (GSI) and Kerberos
authentication bindings.
The Gfarm Grid file system (Tatebe et al., 2010) is designed to share files across multiple geographically distributed storage centers. High I/O performance through parallelism, security, scalability across a large number of storage systems in a wide area, and reducing the cost of metadata management have been taken into account in designing such a file system. Gfarm consists of gfmd (a metadata server) for the namespace, replica catalog, host information and process information, and multiple gfsd (I/O servers) for file access. By design, gfmd caches all metadata in memory for performance, and it monitors all I/O servers. To prevent corrupted metadata in case of a client application crash, close() operations are carried out by the file system nodes.
Object Storage Systems
An object has a higher level of data abstraction than block data storage and consists of both data and attributes, including user- and system-defined attributes. As one of the distinctive properties of an object, metadata is stored and recoverable with the object’s data, and it provides a secure data sharing mechanism, mostly on distributed and scalable storage systems (Factor et al., 2005).
Figure 4 shows an illustration of the object storage system. Manager nodes provide the metadata and a key to client nodes, which in turn use this key to access the object storage devices (OSDs) directly for secure data access. The main advantage of using an object storage system, in terms of performance when compared to a block-based storage system, is that clients do not suffer from queuing delays at the server (Mesnier et al., 2003).
Figure 4. Object storage system
Online cloud-based object storage systems are gaining popularity due to the need for handling elastic storage demands, ready-to-use and
relatively easy APIs, durability and the ability to start a business with minimum start-up costs. Although it is not a standard practice, most online object storage systems leverage RESTful (Pautasso et al., 2008) web APIs because of their lightweight infrastructure and widespread adoption across programming languages and APIs. Although the backends of cloud-based storage systems might utilize high-performance distributed object storage systems, the major advantage of cloud storage comes from the elastic usage of object storage from the customer perspective. Thus, cloud-based object storage systems give the illusion of infinite storage capacity, where cloud users have the opportunity to adjust their allotted storage capacity as demand varies.
Amazon Simple Storage Service (S3) (AmazonS3, 2014) enables the user to store objects in a container, named bucket, which has a unique name
to access later using HTTP URLs via Amazon API and physically mapped to a geographic region. The service complies with the pay-as-you-go
charging model of utility computing. Today, many companies start their business over cloud-based object storage systems with minimum
investment for storage needs, including a popular personal cloud storage service, Dropbox, which is known to use Amazon S3 to store users’
personal data (Drago et al., 2012). That creates a different charging model for companies where online object storage systems charge a company
per HTTP request to store, access, delete and manipulate data over the cloud and per GB per month. However, its integration with science
applications is criticized and recommendations are given to reduce the usage costs (Palankar et al., 2008). Other cloud-based storage systems can
be found in (GoogleCloudStorage, 2014), (Pepple, 2011), (Wilder, 2012), (OracleNimbula, 2014) and (Nurmi et al., 2009).
Figure 5. Cloud-based online object storage system
RADOS (Weil et al., 2007) aims to provide a scalable and reliable object storage service, built as part of the Ceph (Weil et al., 2006) distributed file system. By design, each object corresponds to a file to be stored on an Object Storage Device (OSD). I/O operations, such as the reading and writing of an object comprising a globally unique object id, metadata with name/value pairs and binary data, are managed by Ceph OSD Daemons. To eliminate the single point of failure and for performance purposes, clients interact with Ceph OSD Daemons directly, locating objects using the CRUSH algorithm. CRUSH provides the location information of an object without the need for a central lookup table (Weil et al., 2006). Ceph OSD Daemons replicate the objects on other Ceph nodes for data safety and high availability.
Storage Systems for Big Data
The Internet age comes with vast amounts of data that require efficient storage and processing capabilities. To alleviate this issue, we discuss data storage systems which are tailored to store and process big data effectively. While general-purpose RDBMSs are still a viable option for handling and analyzing structured data, they suffer from a variety of problems, including performance and scalability issues, when it comes to big data. To increase the performance of a DBMS for big data storage needs, partitioning the data across several sites and paying large license fees for an enterprise SQL DBMS might be the two possible options (Stonebraker, 2010); however, neither is without disadvantages, such as inflexible data management, high system complexity, high operational and management costs and limited scalability.
NoSQL Databases
The complexity of analyzing data via SQL queries leads to severe performance degradation in cloud computing, especially for multi-table join queries (Han et al., 2011). Thus, NoSQL databases (e.g., (Anderson et al., 2010), (Chodorow, 2013) and (ApacheHBase, 2014)) change the data access model to a key-value format, where a value corresponds to a key (a minimal sketch of this access model follows the feature list below). This design leads to a simpler model that enables faster queries than relational databases, mass storage support, high concurrency and high scalability (Jing et al., 2011). The key features of NoSQL databases are listed by (Cattell, 2011):
1. Capability to scale horizontally by adding database nodes when necessary, because of its “shared nothing” nature
2. Support for data replication and distribution over many database nodes
3. Easier API and protocol to store, delete and manipulate data when compared to SQL bindings
4. To provide high performance in reading and writing data, weaker concurrency model than the ACID transactions of RDBMSs
5. Efficient use of distributed indexes and RAM for data storage
6. The ability to dynamically add new attributes to data records
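The following Python sketch captures the key-value access model shared by these systems using a simple in-memory store; real NoSQL databases put replication, persistence and sharding behind the same minimal interface, and the class and key names here are illustrative only.

# Minimal sketch of the key-value access model common to many NoSQL stores;
# real systems add replication, persistence and sharding behind this interface.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):          # create or overwrite
        self._data[key] = value

    def get(self, key, default=None):   # point lookup by key
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1001", {"name": "Alice", "logins": 42})   # value is schema-free
print(store.get("user:1001"))
store.delete("user:1001")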
ColumnOriented Databases
Most row-oriented relational databases are not optimized for reading large amounts of data or for performing aggregate operations (e.g., max, min, sum, average) that are widely used in generating reports. These operations might potentially result in performance problems on row-based databases in big data analysis because of irrelevant data loads that lead to high disk seek times and ineffective cache locality. Contrary to the row-based approach, column-oriented databases store the records in a column fashion, where the values of each column are located contiguously on disk, as illustrated in Figure 6. Thus, column-based databases perform better in locating a group of columns with minimum seek time, as stated in (Stonebraker et al., 2005), (Idreos et al., 2012), (Harizopoulos et al., 2006) and (Chang et al., 2008). Furthermore, because the same type of data is stored together, high data compression can be achieved using a variety of compression schemes (D. Abadi et al., 2006).
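A small Python sketch contrasts the two physical layouts for the same toy table: computing an aggregate over one attribute touches every record in the row layout, but only a single contiguous array in the columnar layout. The table contents are invented for illustration.

# Toy contrast of row-oriented vs column-oriented layouts for the same table.
rows = [                                   # row-oriented: one record per row
    {"id": 1, "region": "N", "amount": 10.0},
    {"id": 2, "region": "S", "amount": 25.5},
    {"id": 3, "region": "N", "amount": 7.25},
]

columns = {                                # column-oriented: one array per column
    "id":     [1, 2, 3],
    "region": ["N", "S", "N"],
    "amount": [10.0, 25.5, 7.25],
}

print(sum(r["amount"] for r in rows))      # scans whole records
print(sum(columns["amount"]))              # scans one contiguous column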
C-Store (Stonebraker et al., 2005) physically stores a collection of columns sorted on some attribute(s). Groups of columns sorted on the same attribute are referred to as “projections”. Thus, when multiple attributes are used, redundant copies of the columns may exist in multiple projections, which might lead to an explosion in storage space. However, this problem can be alleviated to some degree with the proper use of strong compression techniques. C-Store aims to provide a read-optimized database system toward ad-hoc querying of large amounts of data, contrary to
most relational DBMSs that are write-optimized. In design, C-Store follows a hybrid paradigm by combining a read-optimized column store (RS)
and an update/insert-oriented writeable store (WS), connected by a tuple mover that performs batch movement of records from WS to RS. Thus,
queries access data in both storage systems.
Bigtable (Chang et al., 2008) is a distributed column-based database where the data key is generated from three parameters: the row string, the column string, and the timestamp of the input. Given such a key, Bigtable returns the corresponding data, behaving much like a distributed hash table.
Bigtable maintains the data in lexicographic order by row string and the data is partitioned according to the row range, called a tablet. By
incorporating the timestamp value, Bigtable gains the ability to store multiple versions of the same data. For faster retrieval of the tablets,
Bigtable takes advantage of a three-level hierarchy analogous to B+-tree.
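The following Python toy sketches the (row, column, timestamp) data model just described, keeping multiple timestamped versions per cell; it is a conceptual illustration, not the Bigtable API, and the row and column names are placeholders.

# Toy sketch of the Bigtable data model (not the Bigtable API): a cell is
# addressed by (row key, column), and each cell keeps timestamped versions.
from collections import defaultdict

table = defaultdict(list)                      # (row, column) -> [(ts, value), ...]

def put(row, column, timestamp, value):
    table[(row, column)].append((timestamp, value))
    table[(row, column)].sort(reverse=True)    # newest version first

def get(row, column, num_versions=1):
    return table[(row, column)][:num_versions]

put("com.example/index.html", "contents:html", 100, "<html>v1</html>")
put("com.example/index.html", "contents:html", 200, "<html>v2</html>")
print(get("com.example/index.html", "contents:html"))                 # latest version
print(get("com.example/index.html", "contents:html", num_versions=2)) # version history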
However, the column-oriented model also has some drawbacks:
• This model might not be convenient if multiple columns are read in parallel, which might lead to increased disk seek time.
• At each record insertion, multiple distinct locations on disk have to be updated; this might lead to increased cost of I/O operation.
• CPU cost can be significant in the reconstruction of a tuple by grouping values from multiple columns into a row-store style tuple.
NewSQL Databases
In on-line transaction processing (OLTP), we mostly face repetitive and short-lived transactions whose performance depends on I/O performance (Kallman et al., 2008). In recent years, researchers have begun to seek efficient ways to outperform legacy database systems. As RAM capacities increase, the technique of storing partitions of data in the RAM of shared-nothing machines is more applicable than ever. NewSQL databases are designed by taking advantage of modern techniques such as data sharding, data replication and distributed in-memory databases, and they offer a scalable, high-performance alternative to disk-based legacy database systems. NewSQL databases provide an object-oriented database language that is considered easier to learn than the standard SQL language (Kumar et al., 2014).
H-Store (Kallman et al., 2008) divides database into partitions where each partition is replicated and resides in main memory. The H-Store
system relies on distributed machines that share no data to improve the overall performance of database operations.
VoltDB (Stonebraker et al., 2013) demonstrates a main-memory-based DBMS that horizontally partitions tables into shards and stores the shards in a cluster of nodes. The database is optimized based on the frequency of transaction types. As the paper states, the vast majority of transactions, such as the retrieval of the phone number of a particular person, can be performed efficiently using a single shard on a single node, independently of other nodes and shards. VoltDB achieves high availability via replication; a VoltDB cluster can continue to run until K nodes fail. When a node becomes operational again, it rejoins the cluster after loading the latest state of the system.
CASE STUDY: A CLOUDAWARE OBJECT STORAGE SYSTEM (CADOS)
Major Internet services are required to process a tremendous amount of data in real time. As we put these services under the magnifying glass, we
see that distributed object storage systems play an important role at back-end in achieving this success. Backends of most object storage systems
maximize throughput by means of caching and distributing the load over multiple storage servers, and ensuring fault-tolerance by file replication
at server-side. However, these systems are designed to retrieve small-sized data from large-scale data centers, such as photos on Facebook
(Beaver et al., 2010) and query suggestions in Twitter (Mishne et al., 2012), designed specifically to meet the high demand of Internet services.
We introduce an experimental distributed object storage system for retrieving relatively bigger data, such as hundreds of megabytes, efficiently
through HTML5-enabled web browsers over big data – terabytes of data – in cloud infrastructure. The objective of the system is to minimize
latency and propose a scalable object storage system. While existing systems exhibit the capability of managing high volumes of data, retrieving a larger resource from a single storage server might cause inefficient I/O due to unparallelized data transfer in web browsers that load data from one host only. Our approach is designed to alleviate this problem, unlike the existing paradigm, through the use of parallel data retrieval at the client side. Data are stored in the implemented distributed object storage system in a striped way across cloud storage servers, and common errors are handled gracefully.
With the advent of the fifth revision of the HTML standard, modern web browsers have started to come with contemporary APIs to enable
complex web applications that provide a richer user experience. Our approach takes advantage of modern HTML5 APIs in order to retrieve a large resource by transferring its data chunks from multiple storage servers in parallel, circumventing the same-origin policy of the web, and then merging the data chunks efficiently at the client side. In this way, the proposed approach can offload the data preparation phase from the storage servers and provide an efficient means of resource retrieval, bringing the web application closer to the modern distributed client using scalable web services and a scalable backend system.
The implemented cloud-aware distributed object storage system (CADOS) leverages an online cloud storage system, Amazon S3, but transcends its data size limitations and retrieves data in parallel via HTML5-enabled web browsers. Although the storage system uses Amazon S3 for storing data, it is not bound to that specific cloud environment by design. The benefits of the implemented storage system are listed below:
1. Data striping is a technique in which data are partitioned across multiple storage nodes so that clients can retrieve the stripes in parallel by communicating directly with the storage nodes. We used this technique to go beyond the file size limitation of the underlying cloud storage and to gain high performance.
2. Disaster recovery addresses the case of a disaster at the data center, such as an earthquake or nuclear accident. If data become inaccessible for some reason, the software system starts serving backup data to avoid interruption of the system and data loss. CADOS automatically backs up the objects to a data center which is geographically distant from the source.
3. Although the storage system can be used by various types of clients, we designed the system particularly for modern web applications that take advantage of HTML5 features and need to be capable of performing big data operations, including data analysis, retrieval and uploading, in the web browser.
Design Overview
CADOS consists of two software components: a RESTful web service (the CADOS server) that provides the URLs of data segments in JSON format, object metadata and, upon an upload request, dynamic data partitioning; and a JavaScript library (the CADOS client) running in the web browser to upload the selected file, to retrieve the data segments in parallel and to merge the segments into a file in RAM.
Figure 7 illustrates a typical data retrieval process from a web browser using a high-performance data storage system. (1) The web browser first requests the web page from a web server, and the web server then returns the HTML content (2). To speed up the loading of web pages, sets of files such as images are hosted on file servers, which prevents overloading of the web server. In the third step, the web application starts requesting the files, using their URL addresses, from the storage middleware. The storage architecture might simply be set up with one file server, but might also take advantage of a parallel file system for low latency. Finally, the storage middleware efficiently returns the data to the web browser. Additionally, to further reduce disk operations, data caching can be employed between the storage middleware and the web browser, where, given a file id as a key, the cached data is located via a distributed hash table (Beaver et al., 2010).
Figure 7. Typical high performance data retrieval on Internet
While this approach performs well, it is constrained by the same-origin policy, which restricts files or scripts from being loaded from different origins, for example when the web application tries to load a resource whose URL differs from the web site URL (e.g., in port, protocol or host). To work around this policy, we enabled cross-origin resource sharing (CORS) in Amazon S3, which allowed us to load file stripes from different domains.
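As an illustration of what enabling such a CORS rule can look like, the following boto3 sketch applies a rule to an S3 bucket so a browser served from the application's origin may fetch stripes directly; the bucket name and allowed origin are placeholders, not CADOS configuration.

# Sketch of enabling a CORS rule on an S3 bucket with boto3 so a browser
# served from the web application's origin may fetch file stripes directly.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_cors(
    Bucket="cados-stripes",                          # placeholder bucket name
    CORSConfiguration={
        "CORSRules": [{
            "AllowedOrigins": ["https://app.example.org"],
            "AllowedMethods": ["GET", "HEAD"],
            "AllowedHeaders": ["*"],
            "ExposeHeaders": ["ETag"],
            "MaxAgeSeconds": 3000,
        }]
    },
)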
Figure 8 illustrates the CADOS design. During the loading event of the web page, the CADOS JavaScript library spawns a number of web workers: a set of slave web workers that perform data retrieval, and a master web worker that orchestrates them. A web worker is a feature introduced with the HTML5 standard; analogous to threads in an operating system, it allows the developer to create concurrent execution streams by running JavaScript code in the background. After the web browser retrieves the content of the web page (1), instead of connecting directly to the storage system, it connects to the CADOS server (3), where the master web worker fetches the URL list of the file segments (4) and then distributes the URLs fairly among the slave web workers. This list contains the temporary URLs of the file stripes located in the cloud object storage system. Each slave web worker then starts retrieving file segments asynchronously from the source data center. If a URL is inaccessible, the slave web worker connects to the CADOS server to obtain the secondary URL of the file segment and then fetches the data segment from a backup data center that is physically distant from the source data center, for disaster recovery. As the retrieval of a file segment finishes, the slave worker sends a message to the master web worker, which stores the pointer to the data segment and merges the segments into a blob. To avoid copy operations between the master and slave web workers, the workers pass a pointer to the data, not the data itself.
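The following sketch illustrates this master/slave worker pattern. It is not the CADOS source, and it uses the fetch API in place of the raw AJAX calls; it only shows how stripe URLs could be fanned out to slave workers and how each retrieved ArrayBuffer is handed back as a transferable object so that no copy is made. File names and message fields are illustrative.

```javascript
// --- master side (runs in the page or in its own worker) ---
const NUM_SLAVES = 4;
const slaves = [];
const segments = [];                 // ArrayBuffers indexed by segment id
let remaining = 0;

for (let i = 0; i < NUM_SLAVES; i++) {
  const w = new Worker('slave-worker.js');        // hypothetical script name
  w.onmessage = (e) => {
    // The ArrayBuffer arrives as a transferred object: no copy is made.
    segments[e.data.index] = e.data.buffer;
    if (--remaining === 0) {
      const blob = new Blob(segments);            // merge stripes in RAM
      console.log('file assembled, size =', blob.size);
    }
  };
  slaves.push(w);
}

function distribute(urls) {          // urls: temporary stripe URLs from the CADOS server
  remaining = urls.length;
  urls.forEach((url, index) =>
    slaves[index % NUM_SLAVES].postMessage({ url, index }));
}

// --- slave side (slave-worker.js) ---
// onmessage = async (e) => {
//   const resp = await fetch(e.data.url);        // retrieve one stripe
//   const buffer = await resp.arrayBuffer();
//   // Transfer ownership of the buffer back to the master: zero-copy hand-off.
//   postMessage({ index: e.data.index, buffer }, [buffer]);
// };
```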
Data Indexing
Object storage systems are complex systems that require high-speed data management system to handle the vast amount of object attributes. In
CADOS, we take advantage of PostgreSQL (Stonebraker and Rowe, 1986) database to store the object and stripe
information. Namespace technique is widely used to prevent the name conflict of objects with the same name. Each object in CADOS is accessed
via well-defined namespace paths. The object path column is represented via a ltree structure (ltree, 2015) in order to support hierarchical tree-
like structure in an efficient way. This structure allows us to use regular-expression-like patterns in accessing the object attributes.
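A minimal sketch of such an ltree-backed metadata table is given below, written with the node-postgres (pg) client. The table and column names are hypothetical; the SQL assumes the ltree extension is installed, and it only illustrates how a regular-expression-like lquery pattern selects objects under a namespace path.

```javascript
const { Pool } = require('pg');
const pool = new Pool();             // connection settings taken from the environment

async function demo() {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS cados_objects (
      id           serial PRIMARY KEY,
      path         ltree NOT NULL,   -- namespace path, e.g. 'tenant1.datasets.run42'
      size         bigint,
      stripe_count integer
    )`);
  // A GiST index makes ltree pattern queries efficient.
  await pool.query(
    'CREATE INDEX IF NOT EXISTS cados_objects_path_idx ON cados_objects USING gist (path)');

  // lquery pattern: every object under the tenant1.datasets namespace
  const { rows } = await pool.query(
    "SELECT path, size FROM cados_objects WHERE path ~ 'tenant1.datasets.*'");
  console.log(rows);
}

demo().catch(console.error).finally(() => pool.end());
```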
Security
One of the distinctive properties of object storage systems is a security model that enables resources to be shared safely over the Internet. CADOS applies the 'signed URL' approach using the Amazon S3 feature whereby an object URL carries a fixed expiration time, after which it can no longer be used to retrieve the object from the cloud object storage system. Once an object URL is generated for temporary use, it is guaranteed never to be generated again. The benefit of this technique is that the URL can be shared among many clients.
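The sketch below shows how such a temporary URL could be generated with the AWS SDK for JavaScript (v2). The bucket, key, and lifetime are placeholders; CADOS would return URLs of this kind to the master web worker in its JSON response.

```javascript
// Sketch: generating a time-limited ("signed") URL for one stripe.
const AWS = require('aws-sdk');
const s3 = new AWS.S3({ signatureVersion: 'v4' });

function stripeUrl(bucket, key, lifetimeSeconds) {
  return s3.getSignedUrl('getObject', {
    Bucket: bucket,
    Key: key,
    Expires: lifetimeSeconds   // the URL stops working after this many seconds
  });
}

// e.g. a 15-minute URL for a hypothetical stripe of some object
console.log(stripeUrl('cados-stripes-example', 'tenant1/run42/stripe-0003', 900));
```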
CADOS Client
The CADOS library is a JavaScript library in which separate execution streams (web workers) retrieve file segments in parallel through HTML5 capabilities. The operation of the web workers is illustrated in Figure 9. The master web worker communicates with the CADOS server using AJAX, an asynchronous technique for sending data to and receiving data from a server (Garrett, 2005), and obtains the list of URLs of the file segments that reside on the cloud object storage (1). All communication between the master web worker and the slave web workers is carried out in a message-passing fashion. At (2), the master web worker distributes the URL list of the data segments across the slave web workers that were created during the loading event of the web page. onmessage is an event handler that is called when a message is posted to the corresponding web worker. In the onmessage handler of a slave web worker, the unique IDs of the URLs are received, and the slave web worker then starts retrieving data segments from the cloud object storage, again by means of AJAX (3). As a slave web worker finishes retrieving a data segment, it posts the data pointer and the corresponding index to the master web worker (4). The index of the data segment is used to locate the segment in the main cache created by the master web worker. Because the data is passed to the master web worker as a pointer, there is no data-copy overhead. Once all the slave workers finish their retrieval operations, the master web worker writes the cache out to the hard disk (5). The downside of this technique is that the total amount of retrieved data is limited by the RAM capacity of the user's machine; we anticipate that progressive writing to disk will be introduced in the future as part of the HTML standard with the File API: Writer draft (W3C, 2015). As a workaround, another solution, specific to Google Chrome browsers, is the chrome.fileSystem API (GoogleChrome, 2015).
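A minimal sketch of step (5) is given below: once every slave worker has posted its segment, the cached ArrayBuffers are merged, in index order, into a Blob and handed to the user as a download. The function name is illustrative, and the whole file resides in RAM, which is exactly the limitation discussed above.

```javascript
// Merge the per-segment ArrayBuffers and trigger a browser download.
function writeOutCache(segments, fileName) {
  const blob = new Blob(segments);           // ordered array of ArrayBuffers
  const url = URL.createObjectURL(blob);     // temporary in-memory URL

  const a = document.createElement('a');     // trigger a download from the page
  a.href = url;
  a.download = fileName;
  document.body.appendChild(a);
  a.click();
  a.remove();
  URL.revokeObjectURL(url);                  // release the blob reference
}
```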
Data Loading Benchmarks
We conducted experiments to measure read rates (MB/s) on a single machine for varying numbers of slave workers and stripe sizes. We purposely used a wireless router providing speeds of up to 60 Mbps, which is common in today's home and business networks. The datasets used in the experiments were synthetically generated, and their stripes were uploaded into Amazon S3 storage by CADOS upon a file-upload request from the web browser.
We see that the stripe size strongly affects throughput, so this parameter needs to be calibrated for the target computing environment and network bandwidth. Note, however, that there is also a trade-off: as the stripe size decreases, the number of HTTP requests per object increases, which might lead to higher cost on a public cloud; this is the case for Amazon S3 as of the writing of this chapter. Figure 10 (left) shows the read-rate performance with a 5 MB stripe size. Throughput is limited by the network bandwidth of the client (web browser). We obtain at most a 41% performance gain with 2 slave web workers relative to serial data retrieval; beyond 2 slave web workers the performance saturates, with slightly smaller gains. We then increased the stripe size to 20 MB, with the result shown in Figure 10 (right): the maximum throughput, a 3.38 MB/s read rate, is obtained with 3 slave web workers. On the other hand, read rates worsen as the stripe size grows beyond 20 MB.
CONCLUSION
In this chapter, we overviewed software storage systems for handling big data: distributed file systems, grid file systems, object storage systems, and relatively new storage paradigms such as NoSQL, NewSQL, and column-oriented databases. The objective of these storage systems is to deliver data securely and with low latency by employing data-sharding techniques such as data striping, which promise data reliability, availability, and system scalability.
With the advent of the Internet, highly available and high-performance storage systems are necessary to meet clients' demands, for example the high demand for user images on Facebook, the tremendous number of tweets per second on Twitter, or even the needs of an enterprise private cloud. Object storage systems differ from block-based storage systems in their ability to provide secure data sharing between clients and to add user- and system-defined attributes dynamically. While an object might simply be mapped to a file on a file system, object storage is considered a paradigm shift in storage design and a higher-level view than block-based storage.
Cloud computing promises elastic resource provisioning in all three layers of a cloud: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS). Cloud-based object storage systems, with their own security models, handling of elastic storage demands, ready-to-use and relatively simple APIs, and data durability, are gaining attention among public and private cloud users. We introduced a cloud-aware object storage system (CADOS) that leverages an existing online object storage system but delivers faster data retrieval for HTML5-enabled web applications by loading data segments in parallel via web workers, works around the object-size limitation of the underlying cloud storage system, and recovers data automatically from a backup data center in case of disaster. Read-rate experiments show that we obtain nearly 40% higher read rates on a single machine when the segments of a file are retrieved in parallel and merged on the client side in the web browser. In future research, we will study how to enhance the scalability of writing file segments to the cloud object storage system. Moreover, because replicas of the file segments are stored in a geographically distinct data center, the system will direct the user to the nearest data center for faster data retrieval. We believe that the implemented system conveys important I/O design ideas for achieving high throughput in data retrieval and upload for modern web applications.
This work was previously published in Managing Big Data in Cloud Computing Environments edited by Zongmin Ma, pages 2545, copyright
year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Abadi, D., Madden, S., & Ferreira, M. (2006). Integrating compression and execution in column-oriented database systems. Paper presented at the Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA. 10.1145/1142473.1142548
Allcock, W., Bresnahan, J., Kettimuthu, R., Link, M., Dumitrescu, C., Raicu, I., & Foster, I. (2005). The Globus striped GridFTP framework and server. Proceedings of the 2005 ACM/IEEE Conference on Supercomputing (p. 54). Washington, DC, USA: IEEE Computer Society. 10.1109/SC.2005.72
Anderson, J. C., Lehnardt, J., & Slater, N. (2010). CouchDB: The definitive guide. O'Reilly Media, Inc.
Beaver, D., Kumar, S., Li, H. C., Sobel, J., & Vajgel, P. (2010). Finding a needle in Haystack: Facebook's photo storage. Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (pp. 1-8). Vancouver, BC, Canada: USENIX Association.
Cattell, R. (2011). Scalable SQL and NoSQL data stores. SIGMOD Record , 39(4), 12–27. doi:10.1145/1978915.1978919
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., & Gruber, R. E. (2008). Bigtable: A Distributed Storage System for
Structured Data. ACM Transactions on Computer Systems , 26(2), 1–26. doi:10.1145/1365815.1365816
Clark, C., Fraser, K., Hand, S., Hansen, J. G., Jul, E., Limpach, C., . . . Warfield, A. (2005). Live migration of virtual machines. Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation (Vol. 2, pp. 273-286). Berkeley, CA, USA: USENIX Association.
Drago, I., Mellia, M., Munafo, M. M., Sperotto, A., Sadre, R., & Pras, A. (2012). Inside Dropbox: Understanding personal cloud storage services. Paper presented at the Proceedings of the 2012 ACM Conference on Internet Measurement, Boston, Massachusetts, USA. 10.1145/2398776.2398827
Factor, M., Meth, K., Naor, D., Rodeh, O., & Satran, J. (2005). Object storage: The future building block for storage systems. Local to Global Data Interoperability Challenges and Technologies, 2005 (pp. 119-123).
Ferreira, L., Berstis, V., Armstrong, J., Kendzierski, M., Neukoetter, A., Takagi, M., & Hernandez, O. (2003). Introduction to grid computing with
globus . IBM Corporation, International Technical Support Organization.
Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google File System. SIGOPS Oper. Syst. Rev. , 37(5), 29–43. doi:10.1145/1165389.945450
Gibson, G. A., Nagle, D. F., Amiri, K., Chang, F. W., Feinberg, E. M., Gobioff, H., & Zelenka, J. (1997). File server scaling with network-attached secure disks. Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (pp. 272-284). Seattle, Washington, USA: ACM. 10.1145/258612.258696
Han, J., Song, M., & Song, J. (2011, May 16-18). A novel solution of distributed memory NoSQL database for cloud computing. Paper presented at the 2011 IEEE/ACIS 10th International Conference on Computer and Information Science (ICIS).
Harizopoulos, S., Liang, V., Abadi, D. J., & Madden, S. (2006). Performance tradeoffs in read-optimized databases. Paper presented at the Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea.
Hupfeld, F., Cortes, T., Kolbeck, B., Stender, J., Focht, E., Hess, M., & Cesario, E. (2008). The XtreemFS architecture—a case for object-based file
systems in Grids. Concurrency and Computation ,20(17), 2049–2060.
Idreos, S., Groffen, F., Nes, N., Manegold, S., Mullender, S., & Kersten, M. (2012). MonetDB: Two decades of research in column-oriented
database architectures. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering , 35(1), 40–45.
Jing, H., Haihong, E., Guan, L., & Jian, D. (2011, October 26-28). Survey on NoSQL database. Paper presented at the 2011 6th International Conference on Pervasive Computing and Applications (ICPCA).
Kallman, R., Kimura, H., Natkins, J., Pavlo, A., Rasin, A., Zdonik, S., & Abadi, D. J. (2008). H-store: A high-performance, distributed main
memory transaction processing system.Proceedings of the VLDB Endowment , 1(2), 1496–1499. doi:10.14778/1454159.1454211
Kumar, R., Gupta, N., Maharwal, H., Charu, S., & Yadav, K. (2014). Critical Analysis of Database Management Using NewSQL. International
Journal of Computer Science and Mobile Computing, May, 434-438.
Mesnier, M., Ganger, G. R., & Riedel, E. (2003). Object-based storage. Communications Magazine, IEEE , 41(8), 84–90.
doi:10.1109/MCOM.2003.1222722
Mishne, G., Dalton, J., Li, Z., Sharma, A., & Lin, J. (2012). Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion
Architecture. CoRR, abs/1210.7350.
Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L., & Zagorodnov, D. (2009). The eucalyptus open-source cloud-
computing system. Paper presented at the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid CCGRID'09.
10.1109/CCGRID.2009.93
OracleNimbula. (2014). Retrieved from http://www.oracle.com/us/corporate/acquisitions/nimbula/index.html
Palankar, M. R., Iamnitchi, A., Ripeanu, M., & Garfinkel, S. (2008). Amazon S3 for science grids: A viable solution? Paper presented at the Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing, Boston, MA, USA. 10.1145/1383519.1383526
Pautasso, C., Zimmermann, O., & Leymann, F. (2008). Restful web services vs. "big" web services: Making the right architectural decision. Paper presented at the Proceedings of the 17th International Conference on World Wide Web, Beijing, China. 10.1145/1367497.1367606
Pawlowski B. Shepler S. Beame C. Callaghan B. Eisler M. Noveck D. Thurlow R. (2000). The NFS version 4 protocol. Proceedings of the 2nd
International System Administration and Networking Conference (SANE 2000).
Roselli, D., Lorch, J. R., & Anderson, T. E. (2000). A comparison of file system workloads. Paper presented at the Proceedings of the Annual Conference on USENIX Annual Technical Conference, San Diego, California.
Shvachko, K., Hairong, K., Radia, S., & Chansler, R. (2010, May 3-7). The Hadoop Distributed File System. Paper presented at the 2010 IEEE
26th Symposium on Mass Storage Systems and Technologies (MSST).
Stonebraker, M. (2010). SQL databases v. NoSQL databases.Communications of the ACM , 53(4), 10–11. doi:10.1145/1721654.1721659
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., . . . Zdonik, S. (2005). C-store: A column-oriented DBMS. Paper presented at the Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway.
Stonebraker, M., & Rowe, L. A. (1986). The design of Postgres. ACM SIGMOD Record, 15(2), 340–355.
Stonebraker, M., & Weisberg, A. (2013). The VoltDB Main Memory DBMS. IEEE Data Eng. Bull. , 36(2), 21–27.
Tatebe, O., Hiraga, K., & Soda, N. (2010). Gfarm grid file system. New Generation Computing, 28(3), 257–275. doi:10.1007/s00354-009-0089-5
Wang, F., Oral, S., Shipman, G., Drokin, O., Wang, T., & Huang, I. (2009). Understanding Lustre filesystem internals. Oak Ridge National Laboratory technical report, TM-2009(117).
Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., & Maltzahn, C. (2006). Ceph: A scalable, high-performance distributed file system. Proceedings of the 7th Symposium on Operating Systems Design and Implementation (pp. 307-320). Seattle, Washington: USENIX Association.
Weil, S. A., Brandt, S. A., Miller, E. L., & Maltzahn, C. (2006). CRUSH: Controlled, scalable, decentralized placement of replicated data. Paper presented at the Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, Tampa, Florida. 10.1109/SC.2006.19
Weil, S. A., Leung, A. W., Brandt, S. A., & Maltzahn, C. (2007). RADOS: a scalable, reliable storage service for petabyte-scale storage
clusters. Paper presented at the Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with
Supercomputing '07, Reno, Nevada. 10.1145/1374596.1374606
White, T. (2009). Hadoop: The definitive guide. O'Reilly Media, Inc.
Wilder, B. (2012). Cloud Architecture Patterns: Using Microsoft Azure . O'Reilly Media, Inc.
Zhang, Q., Cheng, L., & Boutaba, R. (2010). Cloud computing: State-of-the-art and research challenges. Journal of Internet Services and
Applications , 1(1), 7–18. doi:10.1007/s13174-010-0007-6
KEY TERMS AND DEFINITIONS
Big Data: Data that takes an excessive amount of time/space to store, transmit, and process using available resources.
Cloud Computing: A computing paradigm that refers to leasing of hardware resources as well as applications as services over the Internet in
an on-demand fashion.
Data Striping: A technique that applies data partitioning across multiple storage nodes that provides high throughput in data access.
Distributed File Systems: File systems that provide access to geographically distributed files over a POSIX-compliant interface or API.
Grid: A federation of a number of heterogeneous, geographically dispersed systems that share various types of resources such as CPU, storage
and software resources.
Object: A storage unit that consists of both data and attributes including user and system-defined attributes.
CHAPTER 40
Energy-Saving QoS Resource Management of Virtualized Networked Data Centers for
Big Data Stream Computing
Nicola Cordeschi
“Sapienza” University of Rome, Italy
Mohammad Shojafar
“Sapienza” University of Rome, Italy
Danilo Amendola
“Sapienza” University of Rome, Italy
Enzo Baccarelli
“Sapienza” University of Rome, Italy
ABSTRACT
In this chapter, the authors develop a scheduler that optimizes the energy-vs.-performance trade-off in Software-as-a-Service (SaaS) Virtualized Networked Data Centers (VNetDCs) that support real-time Big Data Stream Computing (BDSC) services. The objective is to minimize the communication-plus-computing energy wasted by processing streams of Big Data under hard real-time constraints on the per-job computing-plus-communication delays. In order to deal with the inherently nonconvex nature of the resulting resource management optimization problem, the authors develop a solving approach that leads to the lossless decomposition of the considered problem into a cascade of two simpler sub-problems. The resulting optimal scheduler is amenable to scalable and distributed adaptive implementation. The performance of a Xen-based prototype of the scheduler is tested under several Big Data workload traces and compared with that of some state-of-the-art static and sequential schedulers.
1. INTRODUCTION
Energy-saving computing through Virtualized Networked Data Centers (VNetDCs) is an emerging paradigm that aims at performing the adaptive
energy management of virtualized Software-as-a-Service (SaaS) computing platforms. The goal is to provide QoS Internet services to large
populations of clients, while minimizing the overall computing-plus-networking energy consumption (Cugola & Margara, 2012; Baliga, Ayre,
Hinton, & Tucker, 2011; Mishra, Jain, & Durresi, 2012). As recently pointed out in (Mishra et al. 2012; Azodomolky, Wieder, & Yahyapour, 2013;
Wang et al. 2014), the energy cost of communication gear for current data centers may represent a large fraction of the overall system cost and it
is induced primarily by switches, LAN infrastructures, routers and load balancers.
However, current virtualized data centers subsume the (usual) Map/Reduce-like batch processing paradigm and are not designed for supporting networking- and computing-intensive real-time services such as, for example, the emerging Big Data Stream Computing (BDSC) services (Cugola et al. 2012). In fact, BDSC services exhibit the following (somewhat novel and unique) characteristics (Cugola et al. 2012; Scheneider, Hirzel, & Gedik, 2013; Qian, He, Su, Wu, Zhu, & Zhang, 2013; Kumbhare, 2014):
1. The incoming data (i.e., the offered workload) arrive continuously at volumes that far exceed the storage capabilities of individual computing machines. Furthermore, all data must be processed in a timely manner but, typically, only a small fraction of the data needs to be stored. This means that the (usual) store-then-compute batch paradigm is no longer feasible;
2. Since BDSC services acquire data from massive collections of distributed clients in a stream form, the size of each job is typically unpredictable and its statistics may be quickly time-varying; and,
3. The offered workload is a real-time data stream, which needs real-time computing with latencies firmly limited to a few seconds (Qian et al. 2013; Kumbhare, 2014). Imposing hard limits on the overall per-job delay requires, in turn, that the overall VNetDC is capable of quickly adapting its resource allocation to the current (a priori unpredictable) size of the incoming big data.
In order to attain energy savings in such a harsh computing scenario, joint balanced provisioning and adaptive scaling of the networking-plus-computing resources is required. This is the focus of this work, whose main contributions may be summarized as follows. First, the contrasting objectives of low consumption of both networking and computing energy in delay- and bandwidth-constrained VNetDCs are cast in the form of a suitable constrained optimization problem, namely, the Computing and Communication Optimization Problem (CCOP). Second, due to the nonlinear behavior of the rate-vs.-power-vs.-delay relationship, the CCOP is not a convex optimization problem, and neither guaranteed-convergence adaptive algorithms nor closed-form formulas are, to date, available for its solution. Hence, in order to solve the CCOP exactly and in closed form, we prove that it admits a loss-free (i.e., optimality-preserving) decomposition into two simpler, loosely coupled sub-problems, namely, the CoMmunication Optimization Problem (CMOP) and the ComPuting Optimization Problem (CPOP). Third, we develop a fully adaptive version of the proposed resource scheduler that is capable of quickly adapting to the a priori unknown time variations of the workload offered by the supported Big Data Stream application and converges to the optimal resource allocation without needing to be restarted.
1.1 RELATED WORK
Updated surveys of the current technologies and open communication challenges about energy-efficient data centers have been recently
presented in (Mishra et al. 2012; Balter, 2013). Specifically, power management schemes that exploit Dynamic Voltage and Frequency Scaling
(DVFS) techniques for performing resource provisioning are the focus of (Chen & Kuo, 2005; Kim, Buyya, & Kim, 2007; Li, 2008). Although
these contributions consider hard deadline constraints, they do not consider, indeed, the performance penalty and the energy-vs.-delay tradeoff
stemming from the finite capacity of the utilized network infrastructures and do not deal with the issues arising from perfect/imperfect Virtual
Machines (VMs) isolation in VNetDCs.
Energy-saving dynamic provisioning of the computing resources in virtualized green data centers is the topic of (Liu, Zhao, Liu, & He, 2009;
Mathew, Sitaraman, & Rowstrom, 2012; Padala, You, Shin, Zhu, Uysal, Wang, Singhal, & Merchant, 2009; Kusic & Kandasamy, 2009; Govindan,
Choi, Urgaonkar, Sasubramanian, & Baldini, 2009; Zhou, Liu, Jin, Li, Li, & Jiang, 2013; Lin, Wierman, Andrew, & Thereska, 2011; Laszewski,
Wang, Young, & He, 2009). Specifically, (Padala et al. 2009) formulates the optimization problem as a feedback control problem that must
converge to an a priori known target performance level. While this approach is suitable for tracking problems, it cannot be employed for energy-
minimization problems, where the target values are a priori unknown. Roughly speaking, the common approach pursued by (Mathew et al. 2012; Kusic et al. 2009; Govindan et al. 2009) and (Lin et al. 2011) is to formulate the considered minimum-cost problems as sequential optimization problems and then solve them by using limited look-ahead control. Hence, the effectiveness of this approach relies on the ability to accurately predict the future workload, and it degrades when the workload exhibits almost unpredictable time fluctuations. In order to avoid predicting the future workload, (Zhou et al. 2013) resorts to Lyapunov-based techniques that dynamically optimize the provisioning of the computing resources by exploiting the available queue information. Although the pursued approach is of interest, it relies on an inherent delay-vs.-utility tradeoff that does not allow us to account for hard deadline constraints.
A suitable exploitation of some peculiar features of the network topology of current NetDCs is at the basis of the capacity planning approach
recently proposed in (Wang et al. 2014). For this purpose, a novel traffic engineering-based approach is developed, which aims at reducing the
number of active switches, while simultaneously balancing the resulting communication flows. Although the attained reduction of the energy
consumed by the networking infrastructures is, indeed, noticeable, the capacity planning approach in (Wang et al. 2014) does not consider, by design, the corresponding energy consumed by the computing servers and, what is more, it subsumes delay-tolerant application scenarios.
The joint analysis of the computing-plus-communication energy consumption in virtualized NetDCs which perform static resource allocation is,
indeed, the focus of (Baliga et al. 2011; Tamm, Hersmeyer, & Rush, 2010), where delay-tolerant Internet-based applications are considered.
Interestingly, the main lesson stemming from these contributions is that the energy consumption due to data communication may represent a
large part of the overall energy demand, especially when the utilized network is bandwidth-limited. Overall, these works numerically analyze and
test the energy performance of some state-of-the-art static schedulers, but do not attempt to optimize it through the dynamic joint scaling of the
available communication-plus-computing resources. Providing computing support to the emerging BDSC services is the target of some quite
recent virtualized management frameworks such as, S4 (Neumeyer, Robbins, & Kesari, 2010), D-Streams (Zaharia, Das, Li, Shenker, & Stoica,
2012) and Storm (Loesing, Hentschel, & Kraska, 2012). While these management systems are specifically designed for the distributed runtime support of large-scale BDSC applications, they still do not provide automation and dynamic adaptation to the time fluctuations of Big Data streams. Dynamic adaptation of the available resources to the time-varying rates of the Big Data streams is, indeed, provided by the (more recent) Time Stream (Qian et al. 2013) and PLAStiCC (Kumbhare, 2014) management frameworks. However, these frameworks manage only the computing resources and do not consider the simultaneous management of the networking resources.
The rest of this chapter is organized as follows. After modeling (in Section 2) the considered virtualized NetDC platform, in Section 3 we formally state the CCOP, solve it, and provide the analytical conditions for its feasibility. In Section 4, we present the main structural properties of the resulting optimal scheduler and analytically characterize the (possible) occurrence of hibernation phenomena of the instantiated VMs. Furthermore, we also develop an adaptive implementation of the optimal scheduler and prove its convergence to the optimum. In Section 5, we test the sensitivity of the average performance of the proposed scheduler to the Peak-to-Mean Ratio (PMR) and the time correlation of the (randomly time-varying) offered Big Data workload, and we then compare the obtained performance against that of some state-of-the-art static and sequential schedulers. After addressing some possible future developments and directions for future research in Section 6, the concluding Section 7 recaps the main results.
Regarding the adopted notation, [x]_a^b indicates min{max{x; a}; b}, [x]+ indicates max{x; 0}, ≜ means equality by definition, while 1[A] is the (binary-valued) indicator function of the event A (that is, 1[A] equals one when the event A happens and vanishes when it does not).
2. THE CONSIDERED VNetDC PLATFORM
A networked virtualized platform for real-time parallel computing is composed of multiple clustered virtualized processing units interconnected by a single-hop virtual network and managed by a central controller (see, for example, Figure 1 of (Azodomolky et al. 2013)). Each processing unit executes the currently assigned task by self-managing its own local virtualized storage/computing resources. When a request for a new job is submitted to the VNetDC, the central resource controller dynamically performs both admission control and allocation of the available virtual resources (Almeida, Almeida, Ardagna, Cunha, & Francalanci, 2010). Hence, according to the emerging communication-computing system architectures for the support of real-time BDSC services (see, for example, Figs. 1 of (Azodomolky et al. 2013) and (Ge, Feng, Feng, & Cameron, 2007)), a VNetDC is composed of multiple reconfigurable VMs that operate at the Middleware layer of the underlying protocol stack and are interconnected by a throughput-limited switched Virtual Local Area Network (VLAN). The topology of the VLAN is of star type and, in order to guarantee inter-VM communication, the Virtual Switch of Figure 1 acts as a gather/scatter central node. The operations of both the VMs and the VLAN are jointly managed by a Virtual Machine Manager (VMM), which performs task scheduling by dynamically allocating the available virtual computing-plus-communication resources to the VMs and Virtual Links (VLs) of Figure 1. A new job is initiated by the arrival of data of size Ltot (bit). Due to the aforementioned hard real-time nature of the supported BDSC services, full processing of each input job must be completed within an assigned and deterministic processing time of Tt seconds. A third attribute of the submitted job is its granularity, that is, the (integer-valued) maximum number MT of independent parallel tasks embedded into it.
Let MV be the maximum number of VMs which are available at the Middleware layer of Figure 1. In principle, each VM may be modeled as a virtual server that is capable of processing fc bits per second (Portnoy, 2012). Depending on the size L (bit) of the task currently processed by the VM, the corresponding processing rate fc may be adaptively scaled at runtime, and it may assume values ranging from zero up to the per-VM maximum allowed processing rate (bit/s). Furthermore, due to the real-time nature of the considered application scenario, the time allowed to the VM to fully process each submitted task is fixed in advance, regardless of the actual size L of the task currently assigned to the VM. In addition to the currently assigned task (of size L), the VM may also process a background workload of size Lb (bit), which accounts for the programs of the guest Operating System (Portnoy, 2012).
Hence, by definition, the utilization factor of the VM equals (Portnoy, 2012):
(1)
Then, as in (Kim et al. 2009; Almeida et al. 2010; Koller, Verma, & Neogi, 2010), let (Joule) be the overall energy consumed by the VM to process a single task over the allowed processing time at the processing rate fc, and let (Joule) be the corresponding maximum energy when the VM operates at the maximum processing rate. Hence, by definition, the (dimensionless) ratio of these two energies is the so-called Normalized Energy Consumption of the considered VM (Warneke & Kao, 2011). From an analytical point of view, it is a function of the actual value of the utilization factor of the VM. Its analytical behavior depends on the specific features of the resource provisioning policy actually implemented by the VMM of Figure 1 (Kim, Beloglazov, & Buyya, 2009; Zhu, Melhem, & Childers, 2003).
Specifically, the following three main conclusions are widely agreed upon by several recent studies (Kansal, Zhao, Liu, Kothari, & Bhattacharya, 2010; Ge et al. 2007; Cordeschi, Shojafar, & Baccarelli, 2013; Stoess, Lang, & Bellosa, 2007; Laszewski et al. 2009). First, under the assumption of isolation of the VMs, the energy consumed by the physical servers is the summation of the energies wasted by the hosted VMs. Second, in practice, the per-VM normalized energy consumption follows the c-powered behavior:
(2)
In (2), the exponent parameter is application-dependent and is empirically evaluated at runtime (e.g., c=1 in (Kansal et al. 2010) and c=2 in (Lin et al. 2011)), while the remaining parameter is the fraction of the per-VM maximum energy consumption which is wasted by the VM in the idle state. Third, from an application point of view, the aforementioned per-VM energy parameters and c may be estimated at runtime. In particular, the Joulemeter system described in section 5 of (Kansal et al. 2010) is a practical tool that estimates the per-VM energy parameters at runtime by performing suitable measurements atop the VMM of Figure 1. Finally, before proceeding, we anticipate that the validity of the analytical developments of Sections 3 and 4 is not limited to the energy model in (2); it extends, indeed, to the more general case in which the normalized energy consumption is an increasing and convex function of the utilization factor.
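Since the analytical expression of (2) is not reproduced here, the following sketch gives a plausible form that is consistent with the surrounding description (an idle-state fraction plus a c-powered term in the utilization factor); the symbols theta and b are placeholders rather than the authors' original notation.

```latex
% Hedged reconstruction of the per-VM normalized energy model in (2):
% \theta is the VM utilization factor, b \in [0,1) the idle-state fraction,
% and c > 0 the application-dependent exponent (c = 1 or c = 2 in the cited works).
\[
  \frac{\mathcal{E}(\theta)}{\mathcal{E}_{\max}} \;=\; b \;+\; (1-b)\,\theta^{c},
  \qquad 0 \le \theta \le 1 ,
\]
% so that the normalized consumption equals b when the VM is idle (\theta = 0),
% reaches 1 at full utilization (\theta = 1), and is increasing and convex in
% \theta for c \ge 1, as required by the text.
```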
In Figure 1, the Virtual LAN (VLAN) sits atop the Virtualization Layer. Black boxes indicate Virtual Network Interface Cards (VNICs) terminating point-to-point TCP-based connections. Physical Ethernet adapters connect the VLAN to the underlying physical network (Azodomolky et al. 2013). The reported architecture is instantiated at the Middleware layer.
Big Data Streaming Workload and Power-Limited Virtualized Communication Infrastructures
Let M = min{MV, MT} be the degree of concurrency of the submitted job, let Ltot be the overall size of the job currently submitted to the VNetDC, and let Li, i=1,…,M, be the size of the task that the Scheduler of Figure 1 assigns to the i-th VM, i.e., VM(i). Hence, the following constraint:
guarantees that the overall job Ltot is partitioned into (at most) M parallel tasks. In order to keep the transmission delays from (to) the Scheduler to (from) the connected VMs of Figure 1 at a minimum, as in (Baliga et al. 2011) and (Azodomolky et al. 2013), we assume that VM(i) communicates with the Scheduler through a dedicated (i.e., contention-free) reliable virtual link that operates at a transmission rate of Ri (bit/s), i=1,…,M, and is equipped with suitable Virtual Network Interface Cards (VNICs) (see Figure 1). The one-way transmission-plus-switching operation over the i-th virtual link drains a (variable) power (Watt), which is the sum of the power consumed by the transmit VNIC and switch and the power wasted by the receive VNIC (see Figure 1).
About the actual value of the drained power, we note that, in order to limit the implementation cost, current data centers utilize off-the-shelf rack-mount physical servers which are interconnected by commodity Fast/Gigabit Ethernet switches. Furthermore, they implement legacy TCP suites (mainly, TCP NewReno) for attaining end-to-end (typically, multi-hop) reliable communication (Mishra et al. 2012). In this regard, we note that the data-center-oriented versions of the legacy TCP NewReno suite proposed in (Vasudevan, Phanishayee, & Shah, 2009; Alizadeh, Greenberg, Maltz, & Padhye, 2010; Das & Sivalingam, 2013) allow the managed end-to-end transport connections to operate in the Congestion Avoidance state during 99.9% of the working time, while assuring the same end-to-end reliable throughput as the legacy TCP NewReno protocol. This means, in turn, that the average throughput Ri (bit/s) of the i-th virtual link of Figure 1 (i.e., the i-th end-to-end transport connection) equals (Kurose & Ross, 2013, section 3.7; Jin, Guo, Matta, & Bestavros, 2003):
(3)
As is well known, MSS (bit) in (3) is the maximum segment size, the second parameter is the number of per-ACK acknowledged segments, the third is the average round-trip time of the i-th end-to-end connection (e.g., less than 1 ms in typical data centers (Vasudevan et al. 2009)), and the last is the average segment loss probability experienced by the i-th connection (Kurose et al. 2013, section 3.7). In this regard, several studies point out that the loss probability scales down as the drained power increases (Liu, Zhou, & Giannakis, 2004; Cordeschi et al. 2012a; Baccarelli, Cordeschi, & Patriarca, 2012b; Baccarelli & Biagi, 2003; Baccarelli, Biagi, Pelizzoni, & Cordeschi, 2007):
(4)
where gi is the coding gain-to-receive noise power ratio of the i-th end-to-end connection, while the positive exponent d measures the diversity gain provided by the frequency-time interleavers implemented at the Physical layer. Explicit closed-form expressions for gi and d are reported, for example, in (Liu et al. 2004; Cordeschi, Patriarca, & Baccarelli, 2012a) and (Baccarelli et al. 2012b) for various operating settings.
Hence, after introducing (4) into (3), we obtain
(5)
where the two constants in (5) collect the parameters appearing in (3) and (4).
Hence, the corresponding one-way transmission delay follows from the task size and the transmission rate Ri, and the corresponding one-way communication energy is obtained by multiplying the drained power by this delay.
Before proceeding, we point out that the power-law (convex) formula in (5), describing the power-vs.-throughput relationship of the i-th virtual link of Figure 1, holds regardless of the actual (possibly, multi-hop) topology of the adopted physical network (e.g., Fat-tree, BCube, DCell (Wang et al. 2014)). Formally speaking, the validity of (5) relies on the (minimal) assumption that TCP-based transport connections working in the Congestion Avoidance state are used for implementing the virtual links of Figure 1.
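The expressions of (3)-(5) are not reproduced here. The following sketch gives the standard square-root TCP throughput formula, a loss model of the kind described, and the power-law power-vs.-throughput relationship that they imply; the symbols and constants are placeholders and not necessarily the authors' exact ones.

```latex
% (3): square-root formula for a TCP connection in Congestion Avoidance
\[
  R_i \;=\; \frac{MSS}{RTT_i}\,\sqrt{\frac{3}{2\,b\,P_i^{loss}}},
\]
% (4): the segment-loss probability scales down with the drained power
\[
  P_i^{loss} \;\simeq\; \bigl(g_i\,\mathcal{P}_i^{net}\bigr)^{-d},
\]
% so that, inserting (4) into (3), the power-vs.-throughput relationship (5)
% takes the convex power-law form
\[
  \mathcal{P}_i^{net}(R_i) \;=\; \Omega_i \, R_i^{\,2/d},
  \qquad
  \Omega_i \;\triangleq\; \frac{1}{g_i}\Bigl(\frac{2\,b}{3}\Bigr)^{1/d}
            \Bigl(\frac{RTT_i}{MSS}\Bigr)^{2/d}.
\]
```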
Reconfiguration Cost in VNetDCs
Under the per-job delay constraints imposed by Big Data stream services (Zhou et al. 2013), the VMM of Figure 1 must carry out two main operations at runtime, namely, Virtual Machine management and load balancing. Specifically, the goal of Virtual Machine management is to adaptively control the Virtualization Layer of Figure 1. In particular, the set of the (aforementioned) VM attributes:
(6)
is dictated by the Virtualization Layer and then passed to the VMM of Figure 1. It is the responsibility of the VMM to implement a suitable frequency-scaling policy, in order to allow the VMs to scale their processing rates fc up/down in real time at minimum cost (Warneke et al. 2011).
In this regard, we note that switching from the processing frequency f1 to the processing frequency f2 entails an energy cost (Joule) (Laszewski et al. 2009; Kim et al. 2007). Although the actual behavior of this cost function may depend on the adopted VMM and on the underlying DVFS technique and physical CPUs (Portnoy, 2012), any practical cost function typically retains the following general properties (Laszewski et al. 2009; Kim et al. 2007): i) it depends on the absolute frequency gap |f1 - f2|; ii) it vanishes at f1 = f2 and is non-decreasing in |f1 - f2|; and iii) it is jointly convex in |f1 - f2|. A quite common practical model which retains the aforementioned formal properties is the following one:
(7)
where ke (Joule/(Hz)^2) dictates the resulting per-VM reconfiguration cost measured at the Middleware layer (see Figure 1). For the sake of concreteness, in the analytical developments of the following section 3, we directly subsume the quadratic model in (7). The generalization to the case of cost functions that meet the aforementioned (more general) analytical properties is, indeed, direct.
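The quadratic model in (7) is not reproduced here; a reconstruction consistent with the stated properties and with the units Joule/(Hz)^2 of ke is sketched below.

```latex
% Hedged reconstruction of the quadratic reconfiguration-cost model in (7):
\[
  \varepsilon(f_1, f_2) \;=\; k_e \,\bigl(f_1 - f_2\bigr)^{2} \quad \text{(Joule)},
\]
% which depends only on the frequency gap, vanishes at f_1 = f_2, and is
% convex and non-decreasing in |f_1 - f_2|, as required.
```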
On the Virtual-to-Physical QoS Resource Mapping in VNetDCs
Due to the hard delay-sensitive nature of Big Data stream services, the Virtualization Layer of Figure 1 must guarantee that the demands for the computing (fi) and communication (Ri) resources made by the VLAN are mapped onto adequate (i.e., large enough) computing (e.g., CPU cycles) and communication (e.g., link bandwidth) physical supplies.
In our setting, efficient QoS mapping of the virtual demands fi for the computing resources may be implemented by equipping the Virtualization Layer of Figure 1 with a per-VM queue system that implements the (recently proposed) mClock scheduling discipline (Gulati, Merchant, & Varman, 2010, section 3). Interestingly enough, Table 1 of (Gulati et al. 2010) points out that the mClock scheduler works on a per-VM basis and provides:
Table 1. Main taxonomy of the paper (Symbol, Meaning/Role)
3. Hard (i.e., absolute) resource reservation, attained by adaptively managing the computing power of the underlying DVFS-enabled physical CPUs (see Algorithm 1 of (Gulati et al. 2010) for pseudocode of the mClock scheduler).
Regarding TCP-based networking virtualization, several (quite recent) contributions (Ballami, Costa, Karagiannis, & Rowstron, 2011; Greenberg et al. 2011; Guo et al. 2010; Xia, Cui, & Lange, 2012) point out that the most appealing property of emerging data centers for the support of delay-sensitive services is agility, i.e., the capability to assign any physical server to any service without experiencing performance degradation. To this end, it is recognized that the virtual network atop the Virtualization Layer should provide a flat networking abstraction (see Figure 1 of (Azodomolky et al. 2013)). The Middleware layer architecture of Figure 1 of the considered virtualized NetDC is aligned, indeed, with this requirement and is therefore general enough to allow the implementation of agile data centers.
Specifically, according to (Azodomolky et al. 2013), the VNetDC of Figure 1 may work in tandem with any Network Virtualization Layer that is capable of mapping the rate demands Ri onto bandwidth-guaranteed end-to-end (possibly, multi-hop) connections over the actually available underlying physical network. Just as examples of practical state-of-the-art network virtualization tools, Oktopus (Ballami et al. 2011, Figure 5) provides a contention-free switched LAN abstraction atop tree-shaped physical network topologies, while VL2 (Greenberg et al. 2011) works atop physical Clos networks. Furthermore, SecondNet (Guo et al. 2010, section 3) and VNET/P (Xia et al. 2012, section 4.2) provide bandwidth-guaranteed, virtualized, Ethernet-type contention-free LAN environments atop any TCP-based end-to-end connection. For this purpose, SecondNet implements Port-Switching based Source Routing (PSSR) (Guo et al. 2010, section 4), while VNET/P relies on suitable Layer 2 Tunneling Protocols (L2TPs) (Xia et al. 2012, section 4.3).
An updated survey and comparison of emerging contention-free virtual networking technologies is provided by Table 2 of (Azodomolky et al. 2013). Before proceeding, we anticipate that the solving approach developed in section 3.1 still holds verbatim when the summation in (7) is replaced by a single energy function which is jointly convex and non-decreasing in the variables {Li} (see (15.1)). This generalization could be employed for modeling flow-coupling effects that may (possibly) be induced by the imperfect isolation provided by the Networking Virtualization Layer (e.g., contentions among competing TCP flows) (Portnoy, 2012).
Table 2. Average energy reductions attained by the proposed and the sequential schedulers over the static one, at ke = 0.005 (Joule/(MHz)^2) and the stated aggregate rate (Mbit/s)
Remark 1: Discrete DVFS.
Actual VMs are instantiated atop physical CPUs which offer, indeed, only a finite set of Q discrete computing rates. Hence, in order to deal with both continuously and discretely reconfigurable physical computing infrastructures without introducing loss of optimality, we borrow the approach formerly developed, for example, in (Neely, Modiano, & Rohs, 2003; Li, 2008). Specifically, after indicating the discrete energy values which correspond to the frequency set, we build up a per-VM virtual energy consumption curve by resorting to a piecewise linear interpolation of the allowed Q operating points. Obviously, such a virtual curve retains the (aforementioned) formal properties, and we may therefore use it as the true energy consumption curve for virtual resource provisioning (Neely et al. 2003). Unfortunately, since the virtual curve is of continuous type, it is no longer guaranteed that the resulting optimally scheduled computing rates are still discrete valued. However, as also explicitly noted in (Neely et al. 2003; Li, 2008), any point on the virtual curve may actually be attained by time-averaging, on a per-job basis, the two surrounding vertex points. Due to the piecewise linear behavior of the virtual curve, as in (Neely et al. 2003; Li, 2008), it is guaranteed that the average energy cost of the discrete DVFS system equals that of the corresponding virtual one over each time interval of per-job duration.
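The time-averaging argument can be made concrete with the small sketch below, which computes the duty cycle needed to emulate a continuous target rate by switching between the two surrounding discrete DVFS rates over one job interval; the variable names are illustrative.

```javascript
// Time-share between the two discrete rates fLow <= f <= fHigh over a job
// interval of T seconds so that the average rate equals the target f.
function dutyCycle(f, fLow, fHigh, T) {
  const alpha = (fHigh - f) / (fHigh - fLow);  // fraction of T spent at fLow
  return {
    timeAtLow:  alpha * T,
    timeAtHigh: (1 - alpha) * T
  };
}

// Example: a 2.2 GHz target with 2.0 and 2.6 GHz available over a 1 s job
// gives 2/3 s at 2.0 GHz and 1/3 s at 2.6 GHz; the average rate is 2.2 GHz,
// and, by the piecewise-linear virtual curve, so is the average energy cost.
console.log(dutyCycle(2.2e9, 2.0e9, 2.6e9, 1.0));
```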
3. OPTIMAL ALLOCATION OF THE VIRTUAL RESOURCES
In this section, we deal with the second functionality of the VMM of Figure 1, namely, the dynamic load balancing and provisioning of the communication-plus-computing resources at the Middleware layer (see Figure 1). Specifically, this functionality aims at properly tuning the task sizes {Li}, the communication rates {Ri}, and the computing rates {fi} of the networked VMs of Figure 1. The goal is to minimize (on a per-slot basis) the overall resulting communication-plus-computing energy:
(8)
under the (aforementioned) hard constraint Tt on the allowed per-job execution time. The latter depends, in turn, on the (one-way) delays {D(i), i=1,…, M} introduced by the VLAN and on the allowed per-task processing time. Specifically, since the M virtual connections of Figure 1 are typically activated by the virtual switch of Figure 1 (i.e., the load dispatcher) in a parallel fashion (Scheneider et al. 2013), the overall two-way communication-plus-computing delay induced by the i-th connection of Figure 1 equals 2D(i) plus the per-task processing time, so that the hard constraint on the overall per-job execution time reads as:
(9)
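Because the bodies of (8) and (9) are not reproduced here, the following sketch records their structure as it can be inferred from the text (a sum of per-VM computing and communication energies, and a per-connection two-way deadline); the symbol Delta for the per-task processing time is a placeholder.

```latex
% Hedged structural sketch of the objective (8) and deadline constraint (9):
\[
  \mathcal{E}_{tot} \;=\; \sum_{i=1}^{M}
      \Bigl[\, \mathcal{E}_{c}(i) \;+\; \mathcal{E}_{net}(i) \,\Bigr]
  \qquad \text{(cf. (8))}
\]
\[
  2\,D(i) \;+\; \Delta \;\le\; T_t , \qquad i = 1,\dots, M
  \qquad \text{(cf. (9))}
\]
```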
(10.1)-(10.8): formulation of the CCOP (objective function and constraints, whose roles are described below).
Regarding the stated problem, the first two terms in the summation in (10.1) account for the computing-plus-reconfiguration energy Ec(i) consumed by VM(i), while the third term in (10.1) is the communication energy Enet(i) requested by the corresponding point-to-point virtual link for conveying Li bits at the transmission rate of Ri (bit/s). Furthermore, the two computing rates appearing in (10.1) represent the current (i.e., already computed and consolidated) rate and the target one, respectively. Formally speaking, the latter is the variable to be optimized, while the former describes the current state of VM(i) and therefore plays the role of a known constant, so that the corresponding term in (10.1) accounts for the resulting reconfiguration cost. The constraint in (10.2) guarantees that VM(i) executes the assigned task within the allowed per-task time, while the (global) constraint in (10.3) assures that the overall job is partitioned into M parallel tasks. According to (9), the set of constraints in (10.6) forces the VNetDC of Figure 1 to process the overall job within the assigned hard deadline Tt. Finally, the global constraint in (10.7) limits the aggregate transmission rate that may be sustained by the underlying VLAN of Figure 1 to at most Rt (bit/s), so that Rt is directly dictated by the actually considered VLAN standard (Azodomolky et al. 2013; Scheneider et al. 2013). Table 1 summarizes the main taxonomy used in this paper.
Due to the noticeable power drained by idle servers, turning idle servers OFF is commonly considered an energy-effective policy. However, nothing comes for free: although the above conclusion holds under delay-tolerant application scenarios, it must be carefully reconsidered when hard limits on the allowed per-job service time Tt are present (Kim et al. 2009; Almeida et al. 2010; Balter, 2013, Chap. 27; Koller et al. 2010; Gandhi, Balter, & Adam, 2010; Loesing et al. 2012). In fact, the setup time I for transitioning a server from the OFF state to the ON one is currently larger than 200 seconds and, during the overall setup time, the server typically wastes power (Balter, 2013, Chap. 27; Loesing et al. 2012).
Hence, under real-time constraints, there are (at least) two main reasons to refrain from turning idle servers OFF. First, the analysis recently reported in (Balter, 2013, Chap. 27; Loesing et al. 2012) points out that turning idle servers OFF actually increases the resulting average energy consumption when the corresponding setup time I is larger than 2Tt, and this conclusion holds even when the long-run (i.e., not instantaneous) average utilization factor of the servers is low, of the order of about 0.3 (see Table 27.1 of (Balter, 2013)). Second, in order not to induce outage events, the setup time I must be limited to a (negligible) fraction of the per-job service time Tt (see the hard constraint in (10.6)). Hence, since the tolerated per-job service times of communication-oriented real-time applications are typically limited to a few seconds (see, for example, Tables 1 and 2 of (Zaharia et al. 2012)), the results of the aforementioned performance analysis suggest refraining from turning idle servers OFF, at least when the setup time I is two orders of magnitude larger than the corresponding service times. However, as a counterbalancing aspect, the performance results reported in the (quite recent) contributions (Kim et al. 2009; Almeida et al. 2010, section 3.1; Koller et al. 2010) unveil that, under real-time constraints, there is still large room for attaining energy savings by adaptively setting the utilization factors in (2) of the available VMs, so as to properly track the instantaneous size Ltot of the offered workload. This is, indeed, the energy-management approach we pursue in the following sections, where we focus on the (still almost unexplored) topic of the energy-saving adaptive configuration of the VNetDC of Figure 1.
First, several reports point out that the energy consumption of non-IT equipment (e.g., cooling equipment) is roughly proportional to that in (8) through a constant factor, the PUE, which represents the (measured) ratio of the total energy wasted by the data center to the energy consumed by the corresponding computing-plus-networking equipment (Greenberg, Hamilton, & Maltz, 2009).
Second, the size of the workload output by VM(i) at the end of the computing phase may differ from the corresponding size Li received in input at the beginning of the computing phase. Hence, after introducing an i-th inflating/deflating constant coefficient, the basic CCOP formulation may be generalized by simply scaling the term 2Li in (10.1) and (10.6) accordingly.
Third, the summation in (10.1) of the computing and reconfiguration energies may be replaced by any two energy functions H(f1,…,fM) and V(f1,…,fM) which are jointly convex in {fi, i=1,…,M}. Just as an application example, as in (Zhou et al. 2013), let us assume that all the VMs are instantiated onto the same physical server and, due to imperfect isolation, directly compete for CPU cycles. In this (limit) case, the sum form in (8) for the computing energy falls short, and the total energy ECPU wasted by the physical host server may be modeled as in (Zhou et al. 2013), in terms of the maximum energy consumed by the physical server, the corresponding energy consumed by the physical server in the idle state, and an inner summation giving the aggregate utilization of the server by all hosted VMs. Since this expression is jointly convex in {fi, i=1,…,M} under the stated conditions, the solving approach of the next section (section 3.1) still applies verbatim.
3.1 Solving Approach and Optimal Provisioning of the Virtual Resources
The CCOP in (10) is not a convex optimization problem. This is due to the fact that, in general, each term in (10.1) is not jointly convex in Li and Ri, even in the simplest case when the power function reduces to an assigned constant that does not depend on Ri. Therefore, neither guaranteed-convergence iterative algorithms nor closed-form expressions are, to date, available to compute the optimal solution of the CCOP.
However, we observe that the computing energy in (10.1) depends on the variables {fi}, while the network energy depends on the variables {Ri}. Furthermore, since the variables {Li} are simultaneously present in the sets of constraints in (10.2) and (10.6), they affect both the computing and the network energy consumption. Formally speaking, this implies that, for any assigned set of values of {Li}, the minimization of the computing and network energies reduces to two uncoupled optimization sub-problems over the variables {fi} and {Ri}, respectively. These sub-problems are loosely coupled (in the sense of (Chiang, Low, Calderbank, & Doyle, 2007)), and (10.2), (10.6) are the coupling constraints.
Therefore, in order to compute the optimal solution of the CCOP, in the following we formally develop a solving approach that is based on the lossless decomposition of the CCOP into the (aforementioned) CMOP and CPOP.
Formally speaking, for any assigned nonnegative vector of task sizes, the CMOP is a (generally nonconvex) optimization problem in the communication rate variables {Ri, i=1,…,M}, formulated as follows:
(11.1)
s.t:
(12)
be the region of the nonnegative M-dimensional Euclidean space constituted by all vectors that meet the constraints in (10.6)-(10.8).
Thus, after introducing the dummy function:
s.t:
Let be the (possibly, empty) set of the solutions of the cascade of the CMOP and CPOP. The following Proposition 1 proves that the
cascade of these sub-problems is equivalent to the CCOP.
Proposition 1
The CCOP in (10.1)-(10.8) admits the same feasibility region and the same solution of the cascade of the CMOP and CPOP, that is,
Proof
By definition, the region S in (12) fully accounts for the set of the CCOP constraints in (10.6)-(10.8), so that the membership constraint is equivalent to the set of constraints in (10.6)-(10.8). Since these constraints are also accounted for by the CMOP, this implies that S fully characterizes the coupling induced by the variables {Li, i=1,…,M} between the two sets of constraints in (10.2)-(10.5) and (10.6)-(10.8). Therefore, the claim of Proposition 1 directly arises by noting that, by definition, the result of the constrained optimization of the objective function in (10.1) over the variables {Ri, i=1,…,M} is part of the objective function of the resulting CPOP in (13.1).
About the feasibility of the CMOP, the following result holds (see (Cordeschi, Amendola, Shojafar, & Baccarelli, 2014) for the proof).
Proposition 2
Let each function in (11.1) be continuous and not decreasing for . Hence, for any assigned vector , the following two properties hold:
1.
2.
(15)
Since the condition in (14) is necessary and sufficient for the feasibility of the CMOP, it fully characterizes the feasible region S in (12). This property allows us to recast the CPOP in the following equivalent form:
(16.1)
s.t.:
(17)
Since the constraint in (14) involves only the offered workload, it may be managed as a feasibility condition and this observation leads to the
following formal feasibility result (proved in the final Appendix A of (Cordeschi et al. 2013)).
Proposition 3
The CCOP in (10) is feasible if and only if the CPOP in (16) is feasible. Furthermore, the following set of (M+1) conditions:
(18.1)
(18.2)
is necessary and sufficient for the feasibility of the CPOP.
About the solution of the CPOP, after denoting by the following dummy function:
≜ (19)
(20)
Hence, after indicating by the final inverse function of , the following Proposition 4 analytically characterizes the optimal
scheduler (see the Appendix A for the proof).
Proposition 4
Let the feasibility conditions in (18) be met. Let the function in (17) be strictly jointly convex and let each function: be
continuous and increasing for . Hence, the global optimal solution of the CPOP is unique, and it is analytically computable as in:
(21.1)
(21.2)
(22)
Finally, in (21.2) is the unique nonnegative root of the following algebraic equation:
(23)
where it is given by the r.h.s. of (21.2), with the optimal value replaced by the dummy variable.
4. ADAPTIVE IMPLEMENTATION OF THE OPTIMAL SCHEDULER
Concerning the main structural properties and implementation aspects of the optimal scheduler, the following considerations may be of interest.
4.1 Hibernation Effects
Formally speaking, the i-th VM is hibernated when no exogenous workload is assigned to VM(i) and the corresponding processing rate is strictly larger than the minimum one requested for processing the background workload (see (10.2)). In principle, we expect that the hibernation of VM(i) may lead to energy savings when ke and the ratios are large, while the offered workload Ltot is small. As proved in the Appendix B, this is, indeed, the behavior exhibited by the optimal scheduler, which hibernates VM(i) at the processing frequency in (21.1) when the following hibernation condition is met:
(24)
4.2 Adaptive Implementation of the Optimal Scheduler and Convergence Properties
From an application point of view, remarkable features of the optimal scheduler of Proposition 4 are that:
1. It leads to distributed and parallel computation (with respect to the i-index) of the 3M variables ; and,
2. Its implementation complexity is fully independent of the (possibly, very large) size Ltot of the offered workload.
Moreover, in Big Data streaming environments characterized by (possibly abrupt and unpredictable) time-fluctuations of the offered
workload Ltot (see section 5), the per-job evaluation and adaptive tracking of the Lagrange multiplier in (23) may be performed by resorting to
suitable primal-dual iterates. Specifically, due to the (strict) convexity of the CPOP, Theorem 6.2.6 of (Bazaraa, Sherali, & Shetty, 2006, pp. 272-273)
guarantees that the Karush-Kuhn-Tucker (KKT) point of the CPOP is unique and coincides with the saddle point of the corresponding Lagrangian
function in (A.1) of Appendix B. Furthermore, the saddle point is the equilibrium point of the system of equations obtained by equating to zero
the gradients of the Lagrangian function in (A.1) with respect to the primal and dual variables (Bazaraa et al. 2006, Th. 6.2.6).
Hence, as in (Srikant, 2004, sections 3.4, 3.8; Zaharia et al. 2012), we apply the primal-dual iterative algorithm for computing the saddle-point of
the Lagrangian function in (A.1)2. Since the gradient of the Lagrangian function in (A.1) with respect to the dual variable is:
and the closed-form KKT relationships in (21.1), (21.2), (22) hold, the primal-dual algorithm reduces to the following quasi-Newton iterates:
(25)
with . In (25), n is an integer-valued iteration index and a (suitable) positive step-size sequence is employed, while the following dummy
iterates in the n-index also hold (see (21.1), (21.2) and (22)):
(26.1)
(26.2)
(26.3)
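For illustration only, the following Python sketch mimics the structure of the adaptive solver described above on a toy workload-balancing problem (quadratic per-VM energies and a single coupling constraint): the dual variable is updated by a gradient step on the Lagrangian, as in (25), while the primal variables are recovered in closed form at every iteration, in the spirit of (26.1)-(26.3). The cost coefficients, the workload value and the step-size rule are assumptions made for the sketch and are not taken from the chapter.

```python
"""Primal-dual sketch for a toy problem:  min sum_i 0.5*a_i*L_i**2  s.t.  sum_i L_i = L_tot,
L_i >= 0.  The dual variable mu plays the role of the tracked Lagrange multiplier; the
primal variables are recovered in closed form at every step."""
import numpy as np

a = np.array([1.0, 2.0, 4.0])   # hypothetical per-VM energy coefficients
L_tot = 8.0                     # assumed offered workload (Mbit)

def primal_update(mu):
    # Closed-form minimizer of 0.5*a_i*L**2 - mu*L over L >= 0.
    return np.maximum(mu / a, 0.0)

mu = 0.0                        # arbitrary starting point (cf. Proposition 5)
for n in range(1, 500):
    L = primal_update(mu)
    grad = L_tot - L.sum()      # gradient of the Lagrangian w.r.t. the dual variable
    gamma = 1.0 / n             # positive, vanishing step-size sequence
    mu = max(mu + gamma * grad, 0.0)

print("dual variable:", round(mu, 3), "per-VM loads:", np.round(primal_update(mu), 3))
```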
Regarding the asymptotic global convergence to the optimum of the primal-dual iterates in (25), (26), the following formal result holds (see the
final Appendix B for the proof).
Proposition 5
Let the feasibility condition in (18) be met and let in (25) be positive and vanishing for , i.e., . Hence, the primal-dual
iterates in (25), (26) converge to the global optimum for , regardless of the starting point .
Formally speaking, Proposition 5 points out that the adaptive version in (25), (26) of the proposed scheduler attains global convergence to the
solving point of the considered optimization problem given in Proposition 4. The numerical plots of section 5.2 confirm the actual global convergence
of the iterates in (25), (26). In principle, the actual choice of in (25) also impacts the rate of convergence and the tracking capability of
the iterates. In this regard, we note that an effective choice for coping with the unpredictable time-variations of the workload offered by Big Data
streaming applications is provided by the gradient-descent algorithm in (Kushner et al. 1995) for the adaptive updating of the step-size in
(25). In our framework, this updating reads as in (Kushner et al. 1995, Equation (2.4)):
(27)
where and are positive constants, while is updated as in (Kushner et al. 1995, Equation (2.5)):
(28)
with .
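Since the exact updating rules (27)-(28) are reported only by reference to (Kushner et al. 1995), the following is a simplified, correlation-driven step-size adaptation in the same spirit, not a transcription of those equations: the step size grows when successive gradient estimates are positively correlated and shrinks otherwise, and it is clipped to a prescribed interval. All constants are illustrative assumptions.

```python
"""Hedged sketch of an adaptive step-size rule in the spirit of (Kushner et al. 1995)."""

def adapt_step(gamma, grad, prev_grad, mu=0.05, gamma_min=1e-3, gamma_max=1e-1):
    # Grow gamma while the gradient keeps its sign (still far from the optimum), shrink it
    # otherwise, and project the result onto the allowed interval [gamma_min, gamma_max].
    gamma = gamma + mu * grad * prev_grad
    return min(max(gamma, gamma_min), gamma_max)

# Usage inside the primal-dual loop sketched above:
#   gamma = adapt_step(gamma, grad, prev_grad)
#   prev_grad = grad
```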
In practice, the iteration index n must run faster than the slot-time Tt. Although the actual duration of each n-indexed iteration may depend on
the considered application, it should be small enough to allow the iterates in (25), (26) to converge to the global optimum within a limited
fraction of the slot-time Tt. In fact, at the beginning of each slot, the iterates in (25), (26) are performed by starting from the optimal resource
allocation computed at the previous slot. Then, after attaining the convergence, the iterates stop and the corresponding resource reconfiguration
takes place. Hence, since a single reconfiguration action is performed during each slot-time Tt, the resulting reconfiguration cost of the adaptive
implementation of the scheduler does not depend, indeed, on the convergence times and/or trajectories of the iterates (25), (26) (see Figure 3).
Figure 3. Time evolution (in the n-index) of in (25) for the
application scenario of Section 5.2
5. PROTOTYPE AND PERFORMANCE EVALUATION
In order to evaluate the per-job average communication-plus-computing energy consumed by the proposed scheduler, we have implemented a
prototype of the adaptive scheduler of section 4.2, with paravirtualized Xen 3.3 as VMM and Linux 2.6.18 as guest OS kernel (see Figure 1). The
adaptive scheduler has been implemented at the driver domain (i.e., Dom0) of the legacy Xen 3.3. Interestingly enough, out of approximately
1100 lines of SW code needed for implementing the proposed scheduler, 45% is directly reused from existing Xen/Linux code. The reused code
includes part of Linux's TCP Reno congestion control suite and Xen's I/O buffer management.
The implemented experimental setup comprises four Dell PowerEdge servers, with 3.06 GHz Intel Xeon CPU and 4GB of RAM. All servers are
connected through commodity Fast Ethernet NICs. In all the carried-out tests, we configure the VMs with 512MB of memory and utilize the
TCP NewReno suite for implementing the needed VM-to-VM transport connections.
5.1 Test Workload Patterns
In order to stress the effects of the reconfiguration costs and time-fluctuations of the Big Data workload, as in (Zhou et al. 2013), we begin by
modeling the workload size as an independent identically distributed (i.i.d.) random sequence , whose samples are Tt-spaced apart
r.v.'s uniformly distributed over the interval , , with = 8 (Mbit). By setting the spread
parameter a to 0 (Mbit), 2 (Mbit), 4 (Mbit), 6 (Mbit) and 8 (Mbit), we obtain Peak-to-Mean Ratios (PMRs) of 1 (i.e., the offered workload is of
constant size), 1.25, 1.5, 1.75 and 2.0, respectively. Regarding the dynamic setting of in (10.1), at the first round of each batch of the carried out
tests, all the frequencies 's are reset. Afterwards, at the m-th round, each is set to the corresponding optimal value computed at the
previous (m-1)-th round.
Since each round spans an overall slot-time Tt, all the reported test results properly account for the reconfiguration cost in (7). Furthermore, all
the reported test results have been obtained by implementing the adaptive version in (25)-(28) of the optimal scheduler, with the duration of
each n-indexed iterate set to (Tt/30) s. Finally, the performed tests subsume the power-rate function in (5) at , together with the
computing energy function in (2) with c=2.0 and b=0.5.
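Under the stated assumptions (a uniform distribution centered at a mean job size of 8 Mbit with half-width a, so that the PMR equals (mean + a)/mean), the synthetic workload generator used in these tests can be sketched in a few lines of Python; the name l_bar below stands in for the mean-size parameter of the chapter, whose notation is not reproduced here.

```python
"""Sketch of the i.i.d. synthetic workload model of Section 5.1 (assumed form)."""
import numpy as np

def synthetic_workload(num_slots, a, l_bar=8.0, seed=0):
    rng = np.random.default_rng(seed)
    # Tt-spaced i.i.d. samples, uniformly distributed over [l_bar - a, l_bar + a] (Mbit).
    return rng.uniform(l_bar - a, l_bar + a, size=num_slots)

for a in (0.0, 2.0, 4.0, 6.0, 8.0):
    trace = synthetic_workload(100_000, a)
    print(f"a = {a} Mbit -> empirical PMR ~ {trace.max() / trace.mean():.2f}")
```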
5.2 Impact of the VLAN Setup and Tracking Capability of the Scheduler
The goal of a first set of tests is to evaluate the effects induced on the per-job average energy consumed by the optimal scheduler by the size M of
the VNetDC and by the setting of the TCP-based VLAN. For this purpose, we pose:
Tt=5 (s), Rt=100(Mb/s),
PMR=1.25, ke=0.05 (J/(MHz)2),
(Mbit/s),
(Joule), (s),
, and
Lb(i)=0.
Afterwards, since the quality of the virtual links (e.g., end-to-end TCP connections) of Figure 1 is captured by the corresponding coefficients
(see (5)), we have evaluated the average consumed energy under the following settings:
1. ;
2. ;
3. ; and,
4. .
The obtained numerical plots are drawn in Figure 2. As expected, larger 's penalize the overall energy performance of the emulated
VNetDC. Interestingly, since is, by definition, the minimum energy when up to M VMs may be turned on, at fixed 's, decreases for
increasing M and then approaches a minimum value that does not vary when M is further increased (see the flat segments of the two
uppermost plots of Figure 2).
Figure 2. Effects of the link quality on for the application
scenario of Section 5.2 at PMR= 1.25
Finally, in order to appreciate the sensitivity of the implemented scheduler to the parameters of its adaptive version in (27), (28), Figure 3
reports the measured time-behavior of in (25) when the workload offered by the supported Big Data stream application abruptly passes
from Ltot=8 (Mbit/s) to Ltot=10 (Mbit/s) at n=30 and then falls to Ltot=6 (Mbit/s) at n=60. The application scenario already described at
the beginning of this sub-section has been tested at M=10 and (W). The solid piece-wise linear plot of Figure 3 marks the steady-state
optimal values of . These optimal values have been obtained by directly solving equation (23) through offline centralized numerical methods.
Overall, an examination of the plots of Figure 3 supports two main conclusions. First, the implemented adaptive version of the scheduler quickly
reacts to abrupt time-variations of the offered workload, and it is capable of converging to the corresponding steady-state optimum within
about 10-15 iterations. Second, virtually indistinguishable trajectories for are obtained for ranging over the interval [0.1, 0.6], so that in
Figure 3 we report the time-evolutions of at and .
As already noted in (Kushner et al. 1995), also in our framework the sensitivity of the adaptive version of the optimal scheduler on is
negligible, at least for values of in (27) ranging over the intervals [10^-3, 10^-1] and [0.1, 0.6], respectively.
5.3 Computing vs. Communication Tradeoff
Intuitively, we expect that small ‘s values give rise to high per-VM computing frequencies (see (10.2)), while too large ‘s induce high end-to-end
communication rates (see (10.6)). However, we also expect that, due to the adaptive power-rate control provided by the implemented scheduler
(see (21.2), (22)), there exists a broad range of ‘s values that attains an optimized tradeoff. The plots of Figure 4 confirm, indeed, these
expectations. They refer to the application scenario of section 5.2 at M=2, 10, (Mbit), a=1,3 (Mbit) and i=1,…, M (W). An
examination of these plots supports two main conclusions. First, the energy consumption of the scheduler attains the minimum for values of the
ratio falling into the (quite broad) interval [0.1, 0.8]. Second, the effects of the ratio on the energy performance of the scheduler are negligible
when the considered VNetDC operates far from the boundary of the feasibility region dictated by (18.1)-(18.2) (see the two lowermost curves of
Figure 4). Interestingly enough, the increasing behavior of the curves of Figure 4 gives practical evidence that the computing energy dominates
the overall energy consumption at vanishing (see (10.2)), while the network energy becomes substantial at (see (10.6)).
Figure 4. Impact on of the computing vs. communication
delay tradeoff
5.4 Performance Impact of Discrete Computing Rates
The goal of the tests of this sub-section is to evaluate:
1. The average energy reduction stemming from the exploitation of multi-frequency techniques; and,
2. The energy penalty induced by the frequency-switching over a finite number Q of allowed per-VM processing rates (see (7)).
For this purpose, the same operating scenario of the above section 5.2 has been considered at
(Joule/(MHz)2)
and
2. Q=6 (i.e., discrete computing rates with six allowed computing frequencies evenly spaced over ); and,
As a consequence, the plots of Figure 5 support the following three main conclusions. First, at Q=2, the activation of only two VMs (if feasible)
stands out as the most energy-saving solution. Second, the relative gap in Figure 5 between the uppermost curve (at M=2) and the lowermost one
(at M=9) is very large. Third, the relative gap between the two lowermost curves of Figure 5 is limited up to 15%.
5.5 Performance Comparisons under Synthetic Time-Uncorrelated Workload Traces
Testing the sensitivity of the performance of the implemented scheduler on the PMR of the offered workload is the goal of the tests of this sub-
section. Specifically, they aim at unveiling the impact of the PMR of the offered workload on the average energy consumption of the proposed
scheduler and comparing it against the corresponding ones of two state-of-the-art schedulers, namely, the STAtic Scheduler (STAS) and the
SEquential Scheduler (SES). Intuitively, we expect that the energy savings attained by dynamic schedulers increase when reconfigurable VMs are
used, especially at large PMR values. However, we also expect that non-negligible reconfiguration costs may reduce the attained energy savings
and that the experienced reductions tend to increase for large PMRs. In order to validate these expectations, we have implemented the
communication-computing platform of section 5.2 at
and
ke=0.005 (Joule/(MHz)2).
Regarding the simulated STAS, we note that current virtualized data centers usually rely on static resource provisioning (Baliga et al. 2011; Tamm et
al. 2010), where, by design, a fixed number Ms of VMs constantly runs at the maximum processing rate . The goal is to constantly provide the
exact computing capacity needed for satisfying the peak workload (Mb). Although the resulting static scheduler does not experience
reconfiguration costs, it induces overbooking of the computing resources. Hence, the per-job average communication-plus-computing energy
consumption (Joule) of the STAS gives a benchmark for numerically evaluating the energy savings actually attained by dynamic schedulers.
Regarding the simulated SES, it exploits (by design) perfect future workload information over a time-window of size (measured in multiples of the
slot period Tt), in order to perform off-line resource provisioning at the minimum reconfiguration cost. Formally speaking, the SES implements
the solution of the following sequential minimization problem shown in Box 1.
Box 1.
(29)
The equation shown in Box 1 is subject to the constraints in (10.2)-(10.8). We have evaluated the solution of the above sequential problem by
resorting to numerical computing tools.
Table 2 reports the average energy savings (in percent) provided by the proposed scheduler and the sequential scheduler over the static one.
Furthermore, in order to test the performance sensitivity of these schedulers on the allowed computing delay, the cases of =0.05 (s), 0.1 (s) have
been considered.
In order to guarantee that the static, sequential and proposed schedulers retain the same energy performance at PMR = 1 (i.e., under constant
offered workload), the numerical results of Table 2 have been evaluated by forcing the sequential and the proposed schedulers to utilize
the same number of VMs activated by the static scheduler.
Although this operating condition strongly penalizes the resulting performance of the sequential and proposed schedulers, an examination of the
numerical results reported in Table 2 leads to four main conclusions.
First, the average energy savings of the proposed scheduler over the static one approaches 65%, even when the VMs are equipped with a limited
number Q=4 of discrete processing rates and the reconfiguration energy overhead is accounted for. Second, the performance loss suffered by the
proposed (adaptive) scheduler with respect to the sequential one tends to increase for growing PMRs, but it remains limited up to 3%-7%. Third,
the performance sensitivity of the proposed and sequential schedulers on the allowed computing delay is generally not critical, at least for values
of corresponding to the flat segments of the curves of Figure 4. Finally, we have observed that, when the proposed scheduler is also free to
optimize the number of utilized VMs, the resulting average energy saving over the static scheduler approaches 90%-95% (Cordeschi et al. 2014).
5.6 Performance Comparisons under Time-Correlated Real-World Workload Traces
These conclusions are confirmed by the numerical results of this subsection, which refer to the real-world (i.e., not synthetic) workload trace of
Figure 6. This is the same real-world workload trace considered in Figure 14.a of (Urgaonkar, Pacifici, Shenoy, Spreitzer, & Tantawi, 2007). The
numerical tests of this subsection refer to the communication-computing infrastructure of Section 5.5 at ke=0.5 (Joule/(MHz)2) and =1.2 (s).
Furthermore, in order to keep the peak workload fixed at 16 (Mbit/slot), we assume that each arrival of Figure 6 carries a workload
of 0.533 (Mbit).
Since the (numerically evaluated) PMR of the workload trace of Figure 6 is limited up to 1.526, and the corresponding time-covariance coefficient
is large and approaches 0.966, the workload trace of Figure 6 is smoother (i.e., it exhibits fewer time-variations) than those previously
considered in Section 5.5. Hence, we expect that the corresponding performance gaps of the proposed and sequential schedulers over the static
one are somewhat smaller than those reported in Section 5.5 for the case of time-uncorrelated workload.
However, we have verified that, even under the (strongly) time-correlated workload trace of Figure 6, the average energy reduction of the proposed
scheduler over the static one still approaches 40%, while the corresponding average energy saving of the sequential scheduler over the proposed
one remains limited to 5%-5.5%.
6. ONGOING DEVELOPMENTS AND HINTS FOR FUTURE RESEARCH
The focus of this chapter is on the adaptive minimum-energy reconfiguration of data center resources under hard latency constraints. However,
when the focus shifts to delay-tolerant Internet-assisted mobile applications (e.g., mobile Big Data applications), the current work may be
extended along three main directions of potential interest.
First, since the VNetDC considered here is tailored to hard real-time applications, it assumes that a single job accesses the VNetDC during each
slot, so as to avoid random queueing effects altogether. However, under soft latency constraints, the energy efficiency of the VNetDC could, in
principle, be improved by allowing multiple jobs to be temporarily queued at the Middleware layer. Hence, guaranteeing optimized energy-vs.-queue-delay
trade-offs under soft latency constraints is, indeed, a first research topic.
Second, due to the considered hard real-time constraints, in our framework the size Ltot of the incoming job is measured at the beginning of the
corresponding slot and then remains constant over the slot duration Tt. As a consequence, the scheduling policy considered here is of
clairvoyant type, which implies, in turn, that migrations of VMs need not be considered. However, under soft delay constraints, intra-slot job
arrivals may take place. Hence, the optimal resource provisioning policy could no longer be of clairvoyant type, so that live migrations and VM
replacement could become effective means to further reduce the energy costs. The development of adaptive mechanisms for planning at runtime
minimum-energy live migrations of VMs is a second research topic of potential interest.
Third, emerging mobile BDSC applications require that the information processed by data centers is timely delivered to the requiring clients
through TCP/IP mobile connections (Cugola et al. 2012). In this application scenario, the energy-efficient adaptive management of the delay-vs.-
throughput trade-off of TCP/IP mobile connections becomes an additional topic for further research.
Lastly, the final practical goal of these research lines is to implement in SW a revised VMM kernel, that:
7. CONCLUSION
In this chapter, we developed the optimal scheduler for the joint adaptive load balancing and provisioning of the computing rates,
communication rates and communication powers in energy-efficient virtualized NetDCs which support real-time Big Data streaming services.
Although the resulting optimization problem is inherently nonconvex, we unveiled and exploited its loosely coupled structure to attain the
analytical characterization of the optimal solution. The performance comparisons and sensitivity tests carried out highlight that the average
energy savings provided by our implemented scheduler over the state-of-the-art static one may be larger than 60%, even when the PMR of the
offered workload is limited up to 2 and the number Q of different processing rates equipping each VM is limited up to 4-5. Interestingly, the
corresponding average energy loss of our scheduler with respect to the sequential one is limited up to 4%-6%, especially when the
offered workload exhibits non-negligible time-correlation.
This work was previously published in Emerging Research in Cloud Distributed Computing Systems edited by Susmit Bagchi, pages 122-155,
copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Alizadeh M. Greenberg A. Maltz D. A. Padhye J. (2010). Data center TCP (DCTCP). In Proceedings of ACM SIGCOMM. ACM.
Almeida, J., Almeida, V., Ardagna, D., Cunha, I., Francalanci, C., & Trubian, M. (2010). Joint admission control and resource allocation in
virtualized servers. Journal of Parallel and Distributed Computing , 70(4), 344–362. doi:10.1016/j.jpdc.2009.08.009
Azodomolky, S., Wieder, P., & Yahyapour, R. (2013). Cloud computing networking: Challenges and opportunities for innovations. IEEE Comm.
Magazine, 54-62.
Baccarelli, E., & Biagi, M. (2003). Optimized power allocation and signal shaping for interference-limited multi-antenna ‘ad hoc’
networks. Personal Wireless Communications, 138-152.
Baccarelli, E., Biagi, M., Pelizzoni, C., & Cordeschi, N. (2007). Optimized power allocation for multiantenna systems impaired by multiple access
interference and imperfect channel estimation. IEEE Transactions on Vehicular Technology, 56(5), 3089–3105. doi:10.1109/TVT.2007.900514
Baccarelli, E., Cordeschi, N., & Patriarca, T. (2012). QoS stochastic traffic engineering for the wireless support of real-time streaming
applications. Computer Networks , 56(1), 287–302. doi:10.1016/j.comnet.2011.09.010
Baliga, J., Ayre, R. W. A., Hinton, K., & Tucker, R. S. (2011). Green Cloud Computing: Balancing Energy in Processing, Storage and
Transport. Proceedings of the IEEE , 99(1), 149–167. doi:10.1109/JPROC.2010.2060451
Ballami, H., Costa, P., Karagiannis, T., & Rowstron, A. (2011). Towards predicable datacenter networks. In Proceedings of SIGCOMM ‘11. ACM.
Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (2006). Nonlinear programming (3rd ed.). Wiley. doi:10.1002/0471787779
Chen, J. J., & Kuo, T. W. (2005). Multiprocessor energy-efficient Scheduling for real-time tasks with different power characteristics.ICCP , 05,
13–20.
Chiang, M., Low, S. H., Calderbank, A. R., & Doyle, J. C. (2007). Layering as optimization decomposition: A mathematical theory of network
architectures. Proceedings of the IEEE , 95(1), 255–312. doi:10.1109/JPROC.2006.887322
Cordeschi, N., Amendola, D., Shojafar, M., & Baccarelli, E. (2014). Performance evaluation of primary-secondary reliable resource-management
in vehicular networks. In Proceedings of IEEE PIMRC 2014. Washington, DC: IEEE.
Cordeschi, N., Patriarca, T., & Baccarelli, E. (2012). Stochastic traffic engineering for real-time applications over wireless networks. Journal of
Network and Computer Applications , 35(2), 681–694. doi:10.1016/j.jnca.2011.11.001
Cordeschi, N., Shojafar, M., & Baccarelli, E. (2013). Energy-saving self-configuring networked data centers. Computer Networks, 57(17), 3479–
3491. doi:10.1016/j.comnet.2013.08.002
Cugola, G., & Margara, A. (2012). Processing flows of information: From data stream to complex event processing. ACM Computing
Surveys , 44(3), 1–62. doi:10.1145/2187671.2187677
Das, T., & Sivalingam, K. M. (2013). TCP improvements for data center networks. Communication Systems and Networks (COMSNETS), 1-10.
Gandhi, A., Balter, M. H., & Adam, I. (2010). Server farms with setup costs. Performance Evaluation , 11(67), 1123–1138.
doi:10.1016/j.peva.2010.07.004
Ge, R., Feng, X., Feng, W., & Cameron, K. W. (2007). CPU miser: A performance-directed, run-time system for power-aware clusters.
In Proceedings of IEEE ICPP07. IEEE.
Govindan, S., Choi, J., Urgaonkar, B., Sasubramanian, A., & Baldini, A., (2009). Statistical profiling-based techniques for effective power
provisioning in data centers. In Proc of EuroSys. Academic Press.
Greenberg, A., Hamilton, J., Maltz, D. A., & Patel, P. (2009). The cost of a cloud: Research problems in data center networks. ACM
SIGCOMM , 39(1), 68–73. doi:10.1145/1496091.1496103
Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., & Sengupta, S. (2011). VL2: A scalable and flexible data center
Network. Communications of the ACM , 54(3), 95–104. doi:10.1145/1897852.1897877
Gulati, A., Merchant, A., & Varman, P.J. (2010). mClock: Handling throughput variability for hypervisor IO scheduling. In Proceedings of
OSDI'10. Academic Press.
Guo, C. (2010). SecondNet: A data center network virtualization architecture with bandwidth guarantees. In Proceedings of ACM CoNEXT.
ACM; doi:10.1145/1921168.1921188
Jin, S., Guo, L., Matta, I., & Bestavros, A. (2003). A spectrum of TCP-friendly window-based congestion control algorithms. IEEE/ACM
Transactions on Networking, 11(3), 341–355. doi:10.1109/TNET.2003.813046
Kansal, A., Zhao, F., Liu, J., Kothari, N., & Bhattacharya, A. (2010). Virtual machine power metering and provisioning. SoCC ,10, 39–50.
Kim K. H. Beloglazov A. Buyya R. (2009). Power-aware provisioning of cloud resources for real-time services. In Proc. of ACM MGC’09. ACM.
Kim, K. H., Buyya, R., & Kim, J. (2007). Power aware scheduling of bag-of-tasks applications with deadline constraints on DVS-enabled clusters.
In Proceedings of IEEE International Symposium of CCGRID (pp. 541-548). IEEE. 10.1109/CCGRID.2007.85
Koller, R., Verma, A., & Neogi, A. (2010). WattApp: An application aware power meter for shared data centers. In Proceedings of ICAC’10.
IEEE; doi:10.1145/1809049.1809055
Kumbhare, A. (2014). PLAstiCC: Predictive look-ahead scheduling for continuous data flows on clouds. IEEE.
Kurose, J. F., & Ross, K. W. (2013). Computer networking - A top-down approach featuring the internet (6th ed.). Addison Wesley.
Kushner, H. J., & Yang, J. (1995). Analysis of adaptive step-size SA algorithms for parameter tracking. IEEE Transactions on Automatic
Control , 40(8), 1403–1410. doi:10.1109/9.402231
Kusic, D., & Kandasamy, N. (2009). Power and performance management of virtualized computing environments via look-ahead control.
In Proc. of ICAC. Academic Press.
Laszewski G. Wang L. Young A. J. He X. (2009). Power-aware scheduling of virtual machines in DVFS-enabled clusters. In Proceeding of
CLUSTER’09. IEEE.10.1109/CLUSTR.2009.5289182
Li, K. (2008). Performance analysis of power-aware task scheduling algorithms on multiprocessor computers with dynamic voltage and
speed. IEEE Tr. On Par. Distr. Systems , 19(11), 1484–1497. doi:10.1109/TPDS.2008.122
Lin M. Wierman A. Andrew L. Thereska E. (2011). Dynamic right-sizing for power-proportional data centers. In Proceedings of INFOCOM. IEEE.
Liu, J., Zhao, F., Liu, X., & He, W. (2009). Challenges towards elastic power management in internet data centers. In Proc. on IEEE Conf. Distr.
Comput. Syst. Workshops. Los Alamitos, CA: IEEE. 10.1109/ICDCSW.2009.44
Liu, Q., Zhou, S., & Giannakis, G. B. (2004). Cross-layer combining of adaptive modulation and coding with truncated ARQ over wireless
links. IEEE Transactions on Wireless Communications , 3(5), 1746–1755. doi:10.1109/TWC.2004.833474
Loesing, S., Hentschel, M., Kraska, T., & Kossmann, D. (2012). Storm: An elastic and highly available streaming service in the cloud. EDBT-
ICDT , 12, 55–60. doi:10.1145/2320765.2320789
Lu, T., Chen, M., & Andrew, L. L. H. (2012). Simple and effective dynamic provisioning for power-proportional data centers. IEEE Transactions
on Parallel and Distributed Systems , 24(6), 1161–1171. doi:10.1109/TPDS.2012.241
Mathew, V., Sitaraman, R., & Rowstrom, A. (2012). Energy-aware load balancing in content delivery networks. In Proceedings of INFOCOM (pp.
954-962). IEEE.
Mishra, A., Jain, R., & Durresi, A. (2012). Cloud computing: Networking and communication challenges. IEEE Comm. Magazine, 24-25.
Neely, M.J., Modiano, E., & Rohs, C.E. (2003). Power allocation and routing in multi beam satellites with time-varying channels.IEEE/ACM Tr.
on Networking, 19(1), 138-152.
Neumeyer, L., Robbins, B., & Kesari, A. (2010). S4: Distributed stream computing platform. In Proceedings of Intl. Workshop on Knowledge
Discovery Using Cloud and Distributed Computing Platforms. IEEE.
Padala, P., You, K.Y., Shin, K.G., Zhu, X., Uysal, M., Wang, Z., … Merchant, M. (2009). Automatic control of multiple virtualized resources.
In Proc. of EuroSys. Academic Press.
Qian, Z., He, Y., Su, C., Wu, Z., Zhu, H., Zhang, T. (2013). TimeStream: Reliable stream computation in the cloud. InProceedings of EuroSys (pp.
1-14). Academic Press.
Scheneider, S., Hirzel, M., & Gedik, B. (2013). Tutorial: Stream processing optimizations. In Proceedings of ACM DEBS (pp. 249-258). ACM.
Stoess, J., Lang, C., & Bellosa, F. (2007). Energy management for hypervisor-based virtual machines. USENIX Annual Technical, 1-14.
Tamm, O., Hersmeyer, C., & Rush, A. M. (2010). Eco-sustainable system and network architectures for future transport networks. Bell Labs
Technical Journal, 14(4), 311–327. doi:10.1002/bltj.20418
Urgaonkar, B., Pacifici, G., Shenoy, P., Spreitzer, M., & Tantawi, A. (2007). Analytic modeling of multitier Internet applications. ACM Tr. on the
Web, 1(1).
Vasudevan, V., Phanishayee, A., & Shah, H. (2009). Safe and effective fine-grained TCP stream retransmissions for datacenter communication.
In Proceedings of ACM SIGCOMM (pp. 303-314). ACM.
Wang, L., Zhang, F., Aroca, J.A., Vasilakos, A.V., Zheng, K., Hou, C., …, Liu, Z. (2014). Green DCN: A general framework for achieving energy
efficiency in data center networks. IEEE JSAC, 32(1).
Warneke, D., & Kao, O. (2011). Exploiting dynamic resource allocation for efficient parallel data processing in the cloud. IEEE Tr. on Paral. and
Distr. Systems , 22(6), 985–997.
Xia, L., Cui, Z., & Lange, J. (2012). VNET/P: Bridging the cloud and high performance computing through fast overlay networking. IEEE.
doi:10.1145/2287076.2287116
Zaharia, M., Das, T., Li, H., Shenker, S., & Stoica, I. (2012). Discretized streams: An efficient and fault-tolerant model for stream processing on
large clusters. HotCloud.
Zhou Z. Liu F. Jin H. Li B. Jiang H. (2013). On arbitrating the power-performance tradeoff in SaaS clouds. In Proceedings of INFOCOM (pp.
872–880). IEEE.
Zhu, D., Melhem, R., & Childers, B. R. (2003). Scheduling with dynamic voltage/rate adjustment using slack reclamation in multiprocessor real-
time systems. IEEE Transactions on Parallel and Distributed Systems , 14(7), 686–700. doi:10.1109/TPDS.2003.1214320
KEY TERMS AND DEFINITIONS
Cloud Brokers: They accept risk associated with reserving dynamically priced resources in return for charging higher but stable prices.
Data Center: A data center is a facility used to house computer systems and associated components, such as telecommunications and storage
systems.
Dynamic Voltage and Frequency Scaling (DVFS): DVFS is applied in most of the modern computing units, such as cluster computing and
supercomputing, to reduce power consumption and achieve high reliability and availability.
Load Balancing: Algorithms seek to distribute workloads across a number of servers in such a manner that the average time taken to complete
execution of those workloads is minimized.
Network Virtualization: Network virtualization is categorized as either external virtualization, combining many networks or parts of
networks into a virtual unit, or internal virtualization, providing network-like functionality to software containers on a single network server.
Reconfiguration Cost: The reconfiguration cost is measured at the Middleware layer of the VNetDC and refers to the cost of changing the frequency of
each VM when it is loaded with a new workload, compared to its frequency for the previous workload.
Stream Computing: A high-performance computer system that analyzes multiple data streams from many sources live. The word stream in
stream computing is used to mean pulling in streams of data, processing the data, and streaming it back out as a single flow.
Virtual Machine Manager (VMM): The Virtual Machine Manager allows users to create, edit, start and stop VMs, view and control each VM's
console, and see performance and utilization statistics for each VM.
ENDNOTES
1 Since Ltot is expressed in (bit), we express (bit/s). However, all the presented developments and formal properties still hold verbatim
when Ltot is measured in Jobs and, then, fc is measured in (Jobs/cycle). Depending on the considered application scenario, a Job may be a bit,
frame, datagram, segment, or an overall file.
2 Formally speaking, the primal-dual algorithm is an iterative procedure for solving convex optimization problems, which applies quasi-Newton
methods for updating the primal-dual variables simultaneously and moving towards the saddle-point (Bazaraa et al. 2006, pp. 407-408).
3 Proposition 4 proves that, for any assigned , the relationship in (26.3) gives the corresponding optimal . This implies, in turn, that
in (B.1) vanishes if and only if the global optimum is attained, that is, at , for any i=1,…,M.
4 About this point, the formal assumption of section 2 guarantees that: 1. in (26.2) is continuous and strictly increasing in over the
feasible set ; 2. in (26.3) is continuous and strictly increasing in ; and, 3. in (26.1) is continuous for and strictly
increasing in for . Hence, the condition: guarantees that:∆, for any i=1,…,M.
APPENDIX A: DERIVATIONS OF EQUATIONS (21.1)-(23)
Since the constraint in (10.7) is already accounted for by the feasibility condition in (18.2), without loss of optimality we may directly focus on the
resolution of the optimization problem in (16) under the constraints in (10.2)-(10.5). Because this problem is strictly convex and all its constraints are
linear, Slater's qualification conditions hold (Bazaraa et al. 2006, Chap. 5), so that the KKT conditions (Bazaraa et al. 2006, Chap. 5) are
both necessary and sufficient for analytically characterizing the corresponding unique global optimal solution. Before applying these conditions,
we observe that each power-rate function in (17) is increasing for , so that, without loss of optimality, we may replace the equality constraint in
(10.3) by the following equivalent one: . In doing so, the Lagrangian function of the afforded problem reads as in
(A.1)
where indicates the objective function in (16.1), ‘s and are nonnegative Lagrange multipliers and the box constraints in (10.4), (10.5) are
managed as implicit ones. The partial derivatives of with respect to are given by
(A.2)
(A.3)
while the complementary conditions (Bazaraa et al. 2006, Chap.4) associated to the constraints present in (A.1) read as in
(A.4)
Hence, by equating (A.2) to zero, we directly arrive at (21.1), which also accounts for the box constraint: through the corresponding
projector operator. Moreover, a direct exploitation of the last complementary condition in (A.4) allows us to compute the optimal by solving the
algebraic equation in (23). In order to obtain the analytical expressions for and , we proceed to consider the two cases of and .
Specifically, when is positive, the i-th constraint in (10.2) is binding (see (A.4)), so that we have:
(A.5)
Hence, after equating (A.3) to zero, we obtain the following expression for the corresponding optimal :
(A.6)
Since must fall into the closed interval for feasible CPOPs (see (10.2), (10.5)), at , we must have: or . Specifically,
we observe that, by definition, vanishing is optimal when . Therefore, by imposing that the derivative in (A.3) is nonnegative at ,
we obtain the following condition for the resulting optimal :
(A.7)
Passing to consider the case of and , we observe that the corresponding KKT condition is unique; it is necessary and sufficient
for the optimality and requires that (A.3) vanishes (Bazaraa et al. 2006, Chap.4). Hence, the application of this condition leads to the following
expression for the optimal (see (A.3)):
(A.8)
Equation (A.8) vanishes at (see (A.7)), and this proves that the function: vanishes and is continuous at . Therefore, since
(A.7) already assures that vanishing is optimal at and , we conclude that the expression in (A.8) for the optimal must hold when
and . This structural property of the optimal scheduler allows us to merge (A.7), (A.8) into the following equivalent expression:
(A.9)
so that Equation (21.2) directly arises from (A.5), (A.9). Finally, after observing that cannot be negative by definition, from (A.6) we obtain (22),
where the projector operator accounts for the nonnegative value of . This completes the proof of Proposition 4.
APPENDIX B: PROOF OF PROPOSITION 5
The reported proof exploits arguments based on the Lyapunov Theory which are similar, indeed, to those already used, for example, in sections
3.4, 8.2 in (Srikant, 2004). Specifically, after noting that the feasibility and strict convexity of the CPOP in (16) guarantee the existence and
uniqueness of the equilibrium point of the iterates in (25) and (26), we note that Proposition 4 assures that, for any assigned and ,
Equations (26.1)-(26.3) give the corresponding (unique) optimal values of the primal and dual variables . Hence, it suffices to prove
the global asymptotic convergence of the iteration in (25).
(B.1)
we observe that for and at the optimum, i.e., for (see Endnote 3). Hence, since in (B.1) is also radially unbounded
(that is, as ), we conclude that (B.1) is an admissible Lyapunov function for the iterate in (25). Hence, after posing ,
according to Lyapunov's Theorem (Srikant, 2004, section 3.10), we must prove that the following (sufficient) condition for the asymptotic
global stability of (25) is met:
(B.2)
(B.3)
Hence, since is positive, we have (see (25)): , which, in turn, leads to (see (26.3)): , for any i=1,…,M (see Endnote 4). Therefore, in order to
prove (B.2), it suffices to prove that the following inequality holds for large n:
(B.4)
As a consequence, the difference: may be made vanishing (i.e., as small as desired) as . Hence, after noting that the
functions in (25), (26) are continuous by assumption (see Endnote 4), a direct application of the Sign Permanence Theorem guarantees that
(B.4) holds when the difference in (B.3) is positive.
By duality, it is straightforward to prove that (B.2) is also met when the difference in (B.3) is negative. This completes the proof of Proposition 5. From an
application point of view, we have numerically ascertained that, in our setting, the step-size sequence in (27) leads to global convergence in
the steady state and allows fast tracking in the transient state.
CHAPTER 41
Essentiality of Machine Learning Algorithms for Big Data Computation
Manjunath Thimmasandra Narayanapppa
BMS Institute of Technology, India
T. P. Puneeth Kumar
Acharya Institute of Technology, India
Ravindra S. Hegadi
Solapur University, India
ABSTRACT
Recent technological advancements have led to the generation of huge volumes of data from distinct domains (scientific sensors, health care, user-
generated data, financial companies, the Internet, and supply chain systems) over the past decade. The term big data was coined to capture the
meaning of this emerging trend. In addition to its huge volume, big data also exhibits several unique characteristics as compared with traditional
data. For instance, big data is generally unstructured and requires more real-time analysis. This development calls for new system platforms for
data acquisition, storage, transmission, and large-scale data processing mechanisms. In recent years, the interest of the analytics industry has been
expanding towards big data analytics to uncover the potential concealed in big data, such as hidden patterns or unknown correlations. The main goal of this
chapter is to explore the importance of machine learning algorithms and the computational environment, including hardware and software, that is
required to perform analytics on big data.
INTRODUCTION
Every day, 2.5 quintillion bytes of data are created, and 90 percent of the data in the world today was produced within the past two years (IBM,
2012). Our capability for data generation has never been as powerful and enormous as it has been since the invention of information technology. As
another example, in 2012 the first presidential debate between President Barack Obama and Governor Mitt Romney generated more than 10
million tweets in 2 hours (Twitter, 2012). Among all these tweets, the specific moments that generated the most discussion revealed the public
interests, such as the discussions about vouchers and Medicare. Such online discussions provide a new means to sense the public interests and
generate feedback in real time, and are more appealing than standard media, such as TV broadcasting, newspapers or radio. Another
example is Flickr, a picture-sharing site, which receives on average 1.83 million photos per day (Michel, 2015). Assuming the size of each photo is 2
megabytes (MB), this requires 3.6 terabytes (TB) of disk storage every single day. In fact, as the old saying states, "a picture is worth a thousand
words," and the billions of pictures collected by Flickr are a treasure trove for us to explore human society, public affairs, social events, disasters,
and so on, but only if we have the powerful technology needed to harness this enormous amount of data. The above examples show the rise of Big Data
applications, where data collection has grown tremendously and is beyond the ability of commonly used software tools to acquire, manage, and
process within an "acceptable elapsed time." An essential challenge faced by applications of Big Data is to explore the large volumes of data and
extract useful information or knowledge for future actions (Rajaraman & Ullman, 2011).
Machine learning is a branch of artificial intelligence that allows us to make our applications intelligent without their being explicitly programmed.
Machine learning concepts are used to enable applications to take decisions based on the available datasets. A combination of machine learning and
data mining can be used to develop various applications such as spam mail detectors, self-driving cars, face recognition, speech recognition, and
online transactional fraud-activity detection. Many popular organizations use machine-learning algorithms to make their
services or products understand the needs of their users and provide services according to their behavior: Google has its intelligent web search engine,
spam classification in Google Mail, and news labeling in Google News, while Amazon uses recommender systems.
There are many open source frameworks available for developing these types of applications, such as R, Python, Apache Mahout,
and Weka (Han Hu, Wen, Chua, Xuelong Li, n.d).
In the basic computational model of a CPU and memory, the algorithm runs on the CPU and accesses data that resides in memory; the data first
needs to be brought from disk into memory, but once it is there, the algorithm operates entirely on in-memory data. This is the familiar model used
to implement all kinds of algorithms, such as machine learning and statistics. Problems arise when the data is so big that it cannot fit into memory
all at once, and that is where data mining comes in. In traditional data mining algorithms, since the data is big, only a portion of the data is brought
into memory at a time, the data is processed in batches, and the results are finally written back to disk. But sometimes even this is not sufficient.
Consider ten billion web pages, each of 20 KB: the total dataset size is 200 TB. Assume that, using the traditional computational and data mining
model, all this data is stored on a single disk and has to be read and processed inside a single CPU. The fundamental limitation here is the data
bandwidth between the disk and the CPU: the data has to be read from the disk into the CPU, and the read bandwidth of most modern SATA disks
is around 50 MB per second. At 50 MB per second, reading 200 TB takes 4 million seconds, which is more than 46 days, and doing something useful
with the data takes even longer. Such a long time is unacceptable, so we need a better solution for reading and processing Big Data. One solution is
to split the data into chunks spread over multiple disks and CPUs, read the chunks from the multiple disks, and process them in parallel on multiple
CPUs. That cuts down the read and processing time by a large factor: for example, with 1,000 disks and CPUs, the 4 million seconds come down to
4,000 seconds. This is the fundamental idea behind cluster computing.
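The figures quoted above can be checked with a few lines of arithmetic; the short Python sketch below reproduces the single-disk reading time and the speed-up obtained by spreading the data over 1,000 disks read in parallel (decimal units are assumed for KB and MB).

```python
"""Back-of-the-envelope check of the disk-bandwidth argument in the text."""

pages = 10_000_000_000            # ten billion web pages
page_size = 20_000                # 20 KB per page (decimal units assumed)
dataset = pages * page_size       # ~200 TB in bytes

read_bw = 50_000_000              # 50 MB/s single-disk read bandwidth

single_disk_s = dataset / read_bw
print(f"single disk : {single_disk_s:,.0f} s (~{single_disk_s / 86_400:.0f} days)")

n_disks = 1_000                   # chunks spread over 1,000 disks/CPUs read in parallel
print(f"{n_disks} disks : {single_disk_s / n_disks:,.0f} s")
```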
The architecture that has emerged for cluster computing is shown in Figure 1. In the figure, the racks consist of a number of
commodity Linux nodes, which are used because they are very cheap. Each rack has 16 to 64 of these commodity Linux
nodes, connected by a gigabit switch, so there is 1 Gbps of bandwidth between any pair of nodes in a rack. Of course, 16 to 64
nodes are not sufficient, so multiple racks are lined up and connected by backbone switches; each backbone is a higher-bandwidth switch that can
transfer two to ten gigabits per second between racks, and these groups of racks form a data center. This is the classical architecture that has emerged
over the last few years for storing and mining massive datasets.
Traditional machine learning algorithms, which are designed for a traditional computing environment consisting of a single memory and a single
disk, are not capable of mining big data in a distributed environment like Hadoop; however, the scalable machine learning algorithms provided by
the Mahout library can run in a distributed environment such as Hadoop HDFS.
LITERATURE SURVEY
Because of the massive, heterogeneous, multi-source and dynamic characteristics of the application data involved in a distributed environment, one of
the most important challenges of Big Data is to carry out computing on petabyte- or even exabyte-level data with a complex computing
process involving scalable machine learning algorithms and complex computing platforms. Therefore, utilizing a parallel computing
infrastructure such as a Hadoop cluster, together with its corresponding programming language support, to analyze and mine the distributed data is a critical
goal for Big Data analytics.
Designing and developing large-scale machine learning algorithms has attracted a significant amount of research attention, and many such algorithms
have been proposed in the past decades. However, designing machine learning algorithms to work on massive datasets residing on distributed
platforms is a really challenging task.
Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding (2014) presented a HACE theorem that suggests key characteristics of big data and
proposed a Big Data processing model for data mining; the model involves demand-driven aggregation of information sources, mining and
analysis, user interest modeling, and security and privacy considerations. The work also discusses issues in the data-driven model and in the Big
Data revolution.
Jainendra Singh (2014) analysed the challenging issues in the data-driven model and discussed the Machine Learning (ML) approach to the
security problems encountered in big data applications, technologies and theories.
A. N. Nandakumar and Nandita Yambem (2014) explored the current activities and challenges in migrating existing data mining algorithms
onto the Hadoop platform for increased parallel processing efficiency. They also identified the current gaps and open research areas in the migration process.
Jiby Joseph, Omar Sharif and Ajit Kumar (2014) proposed a solution for predictive maintenance in the automotive industry using machine learning
on a big data platform, and demonstrated how machine learning can enable accurate prediction of failure events in the production lines of
automotive industries. They also discussed the applications of machine learning in different industries.
Jimmy Lin and Chris Dyer (2010) give a very detailed explanation of applying EM algorithms to text processing and of fitting those algorithms into
the MapReduce programming model. EM fits naturally into the MapReduce programming model by making each EM iteration one
MapReduce job: mappers map over independent instances and compute summary statistics, while reducers sum together the required
training statistics and solve the M-step optimization problems. In this work, it was observed that when global data is needed for synchronization
of Hadoop tasks, this is difficult with the current support from the Hadoop platform.
Kang and Christos Faloutsos (n.d) applied Hadoop to graph mining of social networking data. One of the main observations is that some of
the graph mining algorithms cannot be parallelized, so approximate solutions are needed.
Anjan K. Koundinya (2012) implemented the Apriori algorithm on the Apache Hadoop platform. Contrary to the belief that parallel processing
takes less time to obtain frequent itemsets, their experimental observations showed that multi-node Hadoop with differential system configuration
(FHDSC) took more time. The reason lay in the way the data had been partitioned across the nodes.
Gong-Qing Wu (2009) implemented the C4.5 decision tree classification algorithm on Apache Hadoop. In this work, while constructing the
bagging-ensemble-based reduction used to build the final classifier, many duplicates were found. These duplicates could have been avoided if a
proper data partitioning method had been applied.
BIG DATA ANALYTICS
Big data analytics (Han Hu, Yonggang Wen, Tat-Seng Chua, Xuelong Li, n.d) is the process of using analysis algorithms running on powerful
supporting platforms to uncover the potential hidden in big data, such as hidden patterns or unknown relationships. According to the processing
time requirement, big data analytics can be categorized into two different paradigms:
Streaming Processing
The starting point of the streaming processing model is the assumption that the potential value of data depends on its freshness. Thus, the
streaming processing paradigm analyzes data as soon as it arrives. In this paradigm, data arrives continuously in a stream; because the stream is
fast and carries an enormous volume, only a small share of the stream is stored in limited memory, and one or a few passes over the
stream are made to find approximate results. Streaming processing technology has been studied for decades. Open source systems such as
Storm, S4 and Kafka support stream processing. The streaming processing paradigm is mainly used in online applications.
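As a concrete illustration of this one-pass, bounded-memory style of computation, the generic reservoir-sampling technique below keeps a fixed-size uniform random sample of an unbounded stream, so that approximate statistics can be reported at any time without storing the whole stream; this is a standard textbook algorithm and not a feature of Storm, S4 or Kafka specifically.

```python
"""Reservoir sampling: a one-pass, fixed-memory approximation over a data stream."""
import random

def reservoir_sample(stream, k, seed=42):
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # keep the new item with probability k/n
            if j < k:
                sample[j] = item
    return sample

# Example: approximate the mean of a long stream from a 1,000-element sample.
stream = (x * x % 97 for x in range(5_000_000))
sample = reservoir_sample(stream, k=1_000)
print("approximate mean:", sum(sample) / len(sample))
```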
Batch Processing
In the batch-processing paradigm, data are stored first and then analyzed. MapReduce has become the leading batch-processing model. The basic
idea of MapReduce is that data are divided into small chunks. These chunks are then processed in parallel on a distributed platform to generate
intermediate results. Finally, the result is derived by aggregating all the intermediate results. This model schedules computation resources very
close to the data location, which avoids the communication overhead of data transmission. The MapReduce model is simple and widely applied in
several fields such as bioinformatics, web mining, and machine learning.
In general, the streaming processing paradigm is appropriate for applications in which data are generated in the form of a stream and speedy
processing is required to obtain approximate results. Therefore, the streaming processing application domains are relatively narrow. Recently,
most applications have adopted the batch-processing paradigm; even some real-time processing applications use the batch-processing paradigm
to realize a faster response. Because the batch-processing paradigm is widely adopted, in this chapter we consider batch-processing-based big
data platforms.
Previously the focus was on building technologies to overcome the various challenges of Big Data; today's focus is on enabling advanced analytics
on Big Data. The Apache open-source community is carrying out many development activities in this regard, and there are also a number of start-ups
booming with products for performing advanced analytics, such as supervised and unsupervised learning, predictive modelling, and regression, on Big
Data in Hadoop.
Hadoop
Apache Hadoop, the Hadoop open source project from Apache, supports intensive processing of large datasets across distributed systems. It is
designed to feature high performance and scalability for data-intensive applications, whereby data systems can scale up from a single server to
hundreds or thousands of computing nodes, each offering parallel computation and distributed storage. For further information, refer to the
Apache Hadoop project documentation.
HBase
The HBase open source distributed database system is part of the Apache Hadoop project. It is a NoSQL, versioned, column-oriented data storage
system that provides random real-time read/write access to big data tables and runs on top of the Hadoop Distributed File System (HDFS).
Hive
The Hive open source data warehouse system is also part of the Apache Hadoop project. It provides data summarization, queries, and analysis of
large datasets. Likewise, it incorporates a mechanism for ad-hoc queries via a general-purpose SQL-like language called HiveQL, while
maintaining traditional MapReduce operations for those situations where complex logic cannot adequately be expressed using HiveQL.
COMPUTATIONAL FRAMEWORK
Hadoop is specially designed around two core concepts: HDFS and MapReduce (Vignesh Prajapati, 2013). Both are related to distributed
computation. MapReduce is regarded as the heart of Hadoop, performing parallel processing over distributed data. Hadoop has several
distributions and tools that are compatible with its distributed file system, such as Hive, Pig scripts and HBase.
Hadoop Distributed File System (HDFS)
HDFS is the rack-aware filesystem of Hadoop, a UNIX-based data storage layer of Hadoop. HDFS is derived from concepts of the Google
File System. Hadoop partitions the data and computation across several (thousands of) nodes of the Hadoop cluster and executes the application
computations in parallel, close to their data. On HDFS, data files are replicated as sequences of blocks across the cluster. A Hadoop cluster is capable of
scaling its storage capacity, computation capacity and I/O bandwidth by simply adding commodity servers. HDFS can be accessed from applications
in various different ways; natively, HDFS provides a Java API for applications to use.
The Hadoop clusters at Yahoo! span around 40,000 servers and store 40 petabytes of data. Also, worldwide around one hundred other
organizations are known to use Hadoop.
Characteristics of HDFS:
• Fault tolerant
MapReduce
MapReduce is a programming model for processing huge datasets scattered over a large cluster, and it is the core of Hadoop. The MapReduce
programming model allows massive data processing to be performed across thousands of nodes configured in Hadoop clusters. The MapReduce
paradigm is derived from Google MapReduce.
Hadoop MapReduce is a software framework for easily writing applications that process huge amounts of data in parallel on large clusters
(thousands of nodes) of commodity hardware in a fault-tolerant and reliable manner.
The MapReduce paradigm is separated into two phases, Map and Reduce, that mainly deal with key-value pairs of data. The Map and Reduce
tasks run sequentially in a Hadoop cluster to compute the final result; the output of the Map phase becomes the input of the Reduce phase.
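The key-value flow of the two phases can be illustrated with a minimal word-count example; the mapper and reducer below are written as ordinary Python functions and the shuffle/sort step of the framework is simulated locally, so this is a sketch of the programming model rather than a runnable Hadoop job.

```python
"""Word count in the MapReduce style: map emits (word, 1) pairs, reduce sums per key."""
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) key-value pair for every word of the input line.
    for word in line.strip().lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all the counts emitted for the same key.
    return word, sum(counts)

def run_job(lines):
    # Shuffle/sort step normally performed by the framework between the two phases.
    intermediate = sorted(kv for line in lines for kv in mapper(line))
    for word, group in groupby(intermediate, key=itemgetter(0)):
        yield reducer(word, (count for _, count in group))

docs = ["big data needs big clusters", "map and reduce deal with key value pairs"]
for word, count in run_job(docs):
    print(word, count)
```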
Large Scale Machine Learning
The most commonly accepted definition of "data mining" (A. Rajaraman and J. Ullman, 2011) is the discovery of "models" for data. Model
discovery can follow one of several approaches, such as statistical modeling, machine learning, computational approaches and summarization.
Algorithms called "machine learning" not only summarize the data but also learn a model or classifier from the data, and thus discover
something about data that will be seen in the future.
Data mining is going through a significant shift with the volume, variety, value and velocity of data increasing significantly each year. In the past,
traditional data mining software was implemented by loading data into memory and running a single thread of execution over the data. The
process was constrained by the amount of memory available and the speed of the processor. If the data could not fit entirely into memory, the process would fail. The single thread of execution also failed to take advantage of multicore servers unless multiple users were on the system at the same time. The solution is either to scale up the machine configuration or to parallelize the work across commodity hardware.
There are three different types of machine-learning algorithms for intelligent system development:
• Supervised machine learning
• Unsupervised machine learning
• Recommender systems
Supervised Machine Learning Algorithms
• Linear regression
• Logistic regression
Linear Regression
Linear regression is mainly used for predicting and forecasting values based on historical information. Regression is a supervised machine-learning technique for identifying the linear relationship between a target variable and explanatory variables. In other words, it is used to predict the value of the target variable in numeric form.
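As a brief illustration, the sketch below fits a linear regression to a few invented points and forecasts an unseen value; the spend/sales data and the use of scikit-learn are assumptions for the example, not tools prescribed by the chapter.

```python
# Linear regression sketch: learn the linear relationship between an
# explanatory variable and a numeric target, then forecast a new value.
import numpy as np
from sklearn.linear_model import LinearRegression

spend = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])   # explanatory variable
sales = np.array([25.0, 45.0, 62.0, 84.0, 105.0])            # numeric target

model = LinearRegression().fit(spend, sales)
print(model.coef_[0], model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[60.0]]))            # forecast for an unseen value
```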
Logistic Regression
In statistics, logistic regression or logit regression is a type of probabilistic classification model. Logistic regression is used extensively in
numerous disciplines, including the medical and social science fields. It can be binomial or multinomial. Binary logistic regression deals with
situations in which the outcome for a dependent variable can have two possible types. Multinomial logistic regression deals with situations where
the outcome can have three or more possible types.
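A minimal sketch of binomial logistic regression follows; the pass/fail data is invented and scikit-learn is an assumed tool for the example.

```python
# Binomial logistic regression sketch: the dependent variable has two
# possible outcome types (0 = fail, 1 = pass).
import numpy as np
from sklearn.linear_model import LogisticRegression

hours_studied = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
passed_exam = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(hours_studied, passed_exam)
print(clf.predict([[2.5]]))        # predicted outcome class
print(clf.predict_proba([[2.5]]))  # probability of each outcome class
```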
Unsupervised Machine Learning Algorithm
In machine learning, unsupervised learning is used to find hidden structure in unlabeled datasets. Since the examples are not labeled, there is no error signal with which to evaluate a potential solution.
Unsupervised machine learning includes several algorithms, some of which are as follows:
• Clustering
• Vector quantization
Clustering
Clustering is the task of grouping a set of objects in such a way that objects with similar characteristics are placed in the same category, while dissimilar objects are placed in other categories. In clustering, the input datasets are not labeled; groups are formed based on the similarity of the data.
Whereas classification maps data to predefined categories with the help of a labeled training dataset, clustering (or cluster analysis) groups data into categories based on some measure of inherent similarity, for example the distance between data points. Clustering is used in applications such as market segmentation, social network analysis, organizing computer networks, and astronomical data analysis.
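As an illustration, the sketch below clusters a handful of invented two-dimensional points with K-means; both the data and the use of scikit-learn are assumptions made for the example.

```python
# K-means clustering sketch: group unlabeled points into categories based on
# their distance from each other.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one natural group
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])   # another group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assigned to each point
print(kmeans.cluster_centers_)  # centroid of each cluster
```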
Recommendation Algorithms
Recommendation is a machine-learning technique to predict what new items a user would like based on associations with the user's previous
items. Recommendations are widely used in e-commerce applications. Through these flexible, data- and behavior-driven algorithms, businesses can increase conversions by helping to ensure that relevant choices are automatically suggested to the right customers at the right time through cross-selling or up-selling.
For example, when a customer is looking for a Samsung Galaxy S IV/S4 mobile phone on Amazon, the store will also suggest other mobile phones similar to this one, presented in the “Customers Who Bought This Item Also Bought” window.
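The sketch below shows the basic idea with a simple item-based approach: items are recommended when they are frequently bought by the same customers, measured by cosine similarity over a user-item matrix. The matrix, the item names, and the cosine-similarity choice are all assumptions made for illustration, not the chapter's method.

```python
# Item-based recommendation sketch: suggest items similar to one a user
# already bought, using cosine similarity between item purchase columns.
import numpy as np

items = ["phone A", "phone B", "case", "charger"]
# Rows = users, columns = items; 1 means the user bought the item.
purchases = np.array([[1, 1, 0, 1],
                      [1, 0, 1, 1],
                      [0, 1, 1, 0],
                      [1, 1, 1, 0]], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similar_items(item_index, top_n=2):
    target = purchases[:, item_index]
    scores = [(cosine(target, purchases[:, j]), items[j])
              for j in range(len(items)) if j != item_index]
    return sorted(scores, reverse=True)[:top_n]

# Items most often bought by the same customers who bought "phone A".
print(similar_items(items.index("phone A")))
```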
APPLICATIONS OF MACHINE LEARNING IN DIFFERENT INDUSTRIES
Machine Learning can be applied to high volumes of data in order to gain deeper insights and to improve decision making. Table 1 depicts some
emerging applications of Machine Learning.
Table 1. Machine learning applications across industries
Some of the software tools that make machine learning possible on Hadoop are discussed below.
Mahout
The Apache Software Foundation developed Apache Mahout to facilitate machine learning on Big Data. Mahout provides machine-learning libraries that enable various scalable machine-learning algorithms to run on Hadoop in a distributed manner using the MapReduce paradigm. Currently, Mahout supports only clustering, classification, and recommendation mining.
R is a statistical tool consisting of packages designed specifically for executing machine-learning algorithms on structured, semi-structured, and unstructured data. R alone, however, cannot support the design of large-scale machine-learning algorithms; that is where RHadoop comes in. RHadoop is a collection of five R packages developed by Revolution Analytics that allow users to manage and analyze data with Hadoop using the MapReduce programming model. The packages are compatible with open source Hadoop and with other Hadoop distributions such as Cloudera, Hortonworks, and MapR. RHadoop provides a way for an analyst to apply machine-learning algorithms to large datasets using the MapReduce paradigm. Similarly, other software such as RHIPE, ORCH, and Hadoop Streaming also makes the integration of R and Hadoop possible.
A BIG DATA FRAMEWORK FOR CATEGORIZING TECHNICAL SUPPORT REQUESTS USING LARGE SCALE MACHINE
LEARNING
According to Arantxa Duque Barrachina and Aisling O’Driscoll (2014), technical support call centers regularly receive several thousand customer queries on a daily basis. Usually, organisations discard the data related to customer enquiries within a relatively short period of time due to limited storage capacity. However, the use of a big data platform and machine learning for big data analytics enables call centers to store, manage, and analyse this data, identify customer patterns, improve first-call resolution, and maximise daily closure rates. This chapter provides an overview of a proof-of-concept (PoC) end-to-end solution that makes use of the Hadoop programming model, HBase, HiveQL, and the Mahout big data analytics library for categorizing similar support calls in large technical support data sets.
The PoC provides an end-to-end solution for conducting large-scale analysis of technical support datasets using the open source Hadoop platform, Hadoop subprojects such as HBase and Hive, and distributed clustering algorithms from the Mahout library. Figure 2 illustrates the architecture of the PoC end-to-end solution.
Figure 2. PoC end to end Solution for analyzing large technical
support data sets
Before Mahout can process the technical support data, the data must first be uploaded to HDFS; a Hadoop MapReduce job is then run to convert the technical support data, exported in CSV format, into Hadoop's SequenceFile format. A Hadoop SequenceFile is a flat file consisting of binary key/value pairs.
The goal of identifying related customer support calls based on their problem descriptions is met by using the distributed clustering machine-learning algorithms provided by the Mahout library to analyse the data set, so that support calls with similar problem descriptions are found and grouped into separate clusters.
Most importantly, Hadoop alone does not provide real-time data access, since it is designed for batch processing. Thus, once the data analysis phase is completed using Mahout's distributed machine-learning algorithms, a Hadoop MapReduce job stores the clustering results in a non-relational database, HBase (another Apache Hadoop subproject), so that technical support engineers can query the information in real time from HBase using the SQL-like Hive query language. The information stored includes the support call number, the cluster identifier, and the probability that the support call belongs to a given cluster. When an engineer queries the support calls related to a particular case, the cluster to which that case should belong is identified in HBase, along with all support calls within that cluster sorted by their cluster membership probability. The ordered list returned to the technical support engineer contains each support call identifier and its associated cluster membership probability. The support calls displayed at the top of the ordered list are more likely to describe problems similar to the specified technical support case and, as a result, are more likely to have the same solution. The end-to-end framework presented above integrates the Hadoop platform and the Mahout scalable machine-learning library into a well-designed solution that addresses a real-world problem; the same framework can be applied to many real-world big data problems across different industries.
CONCLUSION
This chapter explores the need for Hadoop data clusters, the Hadoop file system, the MapReduce programming framework, and Hadoop subprojects such as Hive and HBase for storing, managing, and analysing big data in real time, and it also demonstrates the importance of large-scale machine learning in big data analytics through a PoC. Its main focus is big data analysis using large-scale machine-learning algorithms. The chapter also compares the big data platform with traditional data processing platforms, surveys the software tools currently available for applying machine learning on the Hadoop platform and the various efforts in designing scalable machine-learning algorithms, and covers batch and stream processing for big data analytics, listing the various applications of machine learning in solving the big data problems of different industries.
This work was previously published in Managing and Processing Big Data in Cloud Computing edited by Rajkumar Kannan, Raihan Ur
Rasool, Hai Jin, and S.R. Balasundaram, pages 156-167, copyright year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Anjan, Srinath, Sharma, Kumar, Madhu, & Shanbag. (2012). MapReduce Design and Implementation of Apriori Algorithm for Handling Voluminous Data-sets. Advanced Computing: An International Journal, 3(6).
Barrachina & O’Driscoll. (2014). A big data methodology for categorizing technical support requests using Hadoop and Mahout. Journal of Big
Data.
Hu, Wen, Chua, & Li. (n.d.). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. IEEE Access.
Joseph, J., Sharif, O., & Kumar, A. (2014). Using Big Data for Machine Learning Analytics in Manufacturing. Tata Consultancy Services
Limited White Paper.
Kang & Faloutsos. (n.d.). Big Graph Mining: Algorithms and Discoveries. SIGKDD Explorations, 14(2).
Nandakumar, & Yambem. (2014). A Survey on Data Mining Algorithms on Apache Hadoop Platform. International Journal of Emerging
Technology and Advanced Engineering, 4(1).
Rajaraman, A., & Ullman, J. (2011). Mining of Massive Data Sets . Cambridge Univ. Press. doi:10.1017/CBO9781139058452
Singh. (2014). Big Data Analytic and Mining with Machine Learning Algorithm. International Journal of Information and Computation
Technology, 4.
Wu, X., Zhu, X., Wu, G.-Q., & Ding, W. (2014). Data Mining with Big Data. IEEE Transactions on Knowledge and Data Engineering, 26(1).
CHAPTER 42
Web Usage Mining and the Challenge of Big Data:
A Review of Emerging Tools and Techniques
Abubakr Gafar Abdalla
University of Khartoum, Sudan
Tarig Mohamed Ahmed
University of Khartoum, Sudan
Mohamed Elhassan Seliaman
King Faisal University, Saudi Arabia
ABSTRACT
The web is a rich data mining source which is dynamic and fast growing, providing great opportunities which are often not exploited. Web data
represent a real challenge to traditional data mining techniques due to their huge volume and unstructured nature. Web logs contain
information about the interactions between visitors and the website. Analyzing these logs provides insights into visitors’ behavior, usage patterns,
and trends. Web usage mining, also known as web log mining, is the process of applying data mining techniques to discover useful information
hidden in web server's logs. Web logs are primarily used by Web administrators to know how much traffic they get and to detect broken links and
other types of errors. Web usage mining extracts useful information that can be beneficial to a number of application areas such as: web
personalization, website restructuring, system performance improvement, and business intelligence. The Web usage mining process involves
three main phases: pre-processing, pattern discovery, and pattern analysis. Various preprocessing techniques have been proposed to extract
information from log files and group primitive data items into meaningful higher-level abstractions that are suitable for mining, usually in the form
of visitors' sessions. Major data mining techniques in web usage mining pattern discovery are: clustering, association analysis, classification, and
sequential patterns discovery. This chapter discusses the process of web usage mining, its procedure, methods, and patterns discovery
techniques. The chapter also presents a practical example using real web log data.
INTRODUCTION
The explosive growth of the internet and the substantial amount of information being generated daily have turned the web into a huge information store. The relationships between the data available online are often not exploited. Web mining analyzes web data to help create a more useful environment in which users and organizations manage information in more intelligent ways (Srivastava, Cooley, Desphande, & Tan, 2000).
The internet has become an important medium to conduct business transactions. Therefore the application of data mining techniques in the web
has become increasingly important to organizations to extract useful knowledge that can be utilized in many ways such as improving the web
system performance, restructuring website design, providing personalized web pages, and deriving business intelligence. Web data mining
methods have strong practical applications in E-Systems and form the basis for marketing and e-commerce activities. They can be used to provide fast and efficient services to customers as well as to build intelligent web sites for businesses. Data mining in e-business is considered to be a very
promising research area.
Web data mining deals with a different type of data, which is semi-structured or even unstructured, called web data. Web data can be divided into three categories: content data, structure data, and usage data. This type of data is what differentiates web mining from data mining.
Web data represent a new challenge to traditional data mining algorithms that work with structured data. Because web data are less structured and the amount of information generated daily is growing rapidly, it has become necessary for users to utilize automated tools in order to find the required information. There are several commercial web analysis tools, but most of them provide explicit statistics without real knowledge. These tools are also considered slow and inflexible, and they provide only limited features. Some tools that use data mining techniques are being developed, but the research is still in its early stages and faces real challenges such as large storage requirements and scalability problems (Rana, 2012).
2. To identify the main web usage mining challenges due to the Big Data phenomenon;
4. To evaluate the different emerging methodologies and implementation tools for Big Data web usage mining.
This chapter discusses the web usage mining process, also known as web log mining, which is a three-phase process: pre-processing, pattern discovery, and pattern analysis. There are many data sources for web usage mining; among them, the web server’s log file is the most widely used source of information. This chapter will also cover the following major techniques for web usage mining pattern discovery in relation to Big Data:
Association Rules
It is the process of finding associations within data in the log file. This technique can be used to identify pages that are most often accessed
together. Association rules can be useful for many mining purposes, such as predicting the next page a user will visit so that it can be preloaded from the remote server to speed up browsing.
Clustering
In web usage mining, there are two kinds of clusters: user clusters and page clusters. User clustering can be exploited to perform market
segmentation in e-commerce web sites or to provide personalized web pages. Page clustering identifies pages with related content and can be exploited by
search engines and web recommendation systems (Srivastava, Cooley, Desphande, & Tan, 2000).
Classification
Classification is the process of assigning a class to each data item based on a set of predefined classes. In web mining, classification can be used,
for example, to develop a profile of users belonging to a particular class or to classify HTTP request as normal or abnormal (Srivastava, Cooley,
Desphande, & Tan, 2000).
Sequential Patterns
Sequential patterns allow web-based organizations to predict user visit patterns. This can help in developing strategies for advertisement targeting
groups of users based on these patterns. The aim of this technique is to predict future URLs to be visited based on a time-ordered sequence of
URLs followed in the past.
BACKGROUND
With the emergence of Web 2.0, user-generated content, social media, and the widespread availability of mobile internet access, the scale and volume of web usage data to be mined have outpaced traditional data mining processes. Unlike traditional data mining, web usage mining deals with extremely large data sets. In addition, traditional data mining extracts patterns from databases, whereas web usage mining extracts patterns from web data of different formats. This makes web usage mining a clear Big Data problem.
Web usage mining is one of three forms of web mining. Web mining is the use of data mining techniques to automatically extract useful knowledge from web documents and services (Rana, 2012). Web mining is not equivalent to data mining. A definition of web mining which clearly differentiates between data mining and web mining, by putting an emphasis on the type of data, can be given as follows: web mining is the application of data mining techniques to extract knowledge from web data, where at least one component of the web data (structure or usage data) is used in the mining process (Srivastava, Desikan, & Kumar).
Types of Web Data
Web data can be classified into four categories: Content data, structure data, usage data and user profile. (Shukla, Silakari, & Chande, 2013)
(Srivastava, Cooley, Desphande, & Tan, 2000).
Content Data
Content data is intended for the end user; it is the data that the web page is designed to convey to users, and it constitutes the display of the web page in the web browser. Content data includes text, images, videos, audio, and structured information retrieved from databases.
Structure Data
Structure data represents how page contents are organized internally and how pages are interconnected via hyperlinks. Intra-page structure information describes how HTML or XML tags are organized within a web page. A tree structure can be used to represent the organization of a web page, in
which the HTML tag becomes the root of the tree. Inter-page structure information is represented by models of hyperlinks that connect web
pages with each other.
Usage Data
Usage data is generated by users’ interactions with the website. It includes the visitor’s IP address, the date and time of the access, the path to the requested web resource, and other attributes. This information reveals how the website is being used and can reveal users’ behaviors and access patterns.
User Profile
The fourth type of web data is user profile data. Profile data collects demographic information about the website’s users through registration forms.
Classification of Web Mining
Web mining can be broadly classified into three categories: web structure mining, web content mining, and web usage mining (Costa Júnior &
Gong, 2005).
Web Structure Mining
In web structure mining, the hyperlink structures between web pages are analyzed so that correlations between the web pages can be discovered.
The primary goal of structure mining is the discovery and retrieval of information from any type of data in the web. Web structure mining can be
divided into two categories: hyperlinks structure analysis and web documents structure analysis. In the first category, hyperlinks are analyzed to
construct models of hyperlinks structures in the web. In the second category, the structure of individual web pages is analyzed to describe the
organization of different HTML or XML tags (Bhatia, 2011) (S, K, & J, 2011).
Web Content Mining
Web content mining refers to the process of extracting useful information from the content of web documents. Web content mining uses
technologies such as Natural Language Processing (NLP) and Information Retrieval (IR) to extract useful knowledge from the web.
Content mining is different from data mining because it deals with less structured data, but similar to text mining because most of web contents
are basically texts. Content data includes all sorts of text documents and multimedia documents such as images, audios, and videos (Costa Júnior
& Gong, 2005) (Bhatia, 2011).
Web Usage Mining
Web usage mining is the process of discovering new information from web servers’ logs. Web usage mining involves three steps: preprocessing,
patterns discovery, and patterns analysis (Rana, 2012).
Data can be collected from three different locations: web server, proxy server, and client browser (Srivastava, Cooley, Desphande, & Tan, 2000).
Usage data can be collected from the web server’s access logs, cookies, clients’ logs, or proxy servers’ log files. Client-side data collection is more reliable than server-side collection; the disadvantage is that collecting usage data from users’ browsers requires users’ cooperation. In web usage mining research, the most widely used data source is the web log.
It is necessary to convert the data in the logs into the data abstractions needed for pattern discovery. Log files are large, semi-structured, and contain many irrelevant records. The main tasks in preprocessing are data cleaning, user identification, and session identification (Srivastava,
Cooley, Desphande, & Tan, 2000) (Gupta & Gupta).
Techniques from several fields such as statistics, data mining, machine learning and pattern recognition can be used to extract knowledge from
usage data (Srivastava, Cooley, Desphande, & Tan, 2000). Major techniques used in pattern discovery are: clustering, classification, association,
and sequential patterns (Shukla, Silakari, & Chande, 2013).
Patterns are further analyzed to refine the discovered knowledge. Data can be analyzed using a query language such as SQL or can be loaded into a data cube in order to perform different OLAP operations. Visualization techniques can also be used to help understand patterns and highlight trends in the data (Srivastava, Cooley, Desphande, & Tan, 2000) (Shukla, Silakari, & Chande, 2013).
Applications of Web Usage Mining
Different web analytics tools are available which are capable of generating statistics such as number of page views, number of hits, top referrer
sites and so on. However, these tools are not designed to provide knowledge about individual users and their access patterns. Web usage mining
goes several steps beyond traditional web analytics by incorporating powerful data mining techniques.
Web data mining methods have strong practical applications in E-Systems and form the basis for marketing in e-commerce. They can be used to provide fast and efficient services to customers as well as to build intelligent web sites for businesses (Mehtaa, Parekh, Modi, & Solamki, 2012).
Data mining in e-business is considered to be a very promising research area. Major application areas are: website modification, site performance
improvement, web personalization, and business intelligence (Srivastava, Cooley, Desphande, & Tan, 2000).
Personalization
With the massive amount of information available online, users may not have the knowledge or experience to find relevant information.
Personalization tries to meet users’ expectations and needs by offering customized services and content during the interaction between the users
and the website. Customers prefer to visit those websites which understand their needs and provide them with customized services and fast
access to relevant information. Personalization is a rapidly growing and challenging area of web content delivery which is very attractive to e-
commerce applications. The goal of a personalized system is to provide users with the information they want or need without their having to ask for it explicitly (Srivastava, Cooley, Desphande, & Tan, 2000) (Mehtaa, Parekh, Modi, & Solamki, 2012).
Web personalization and recommendation systems utilize data which is collected directly or indirectly from web users. This data is preprocessed and analyzed, and the results are used to implement personalization. Approaches to achieving personalization include content-based filtering, rating, collaborative filtering, rule-based filtering, and web usage mining. Web usage mining is an excellent approach to web personalization.
1. Memorization: This is the most widespread form of personalization. User information such as the name and browsing history is stored (e.g., using cookies) so that when users come back this information is used to greet them and display their browsing history.
3. Recommendation: Recommendation systems try to automatically recommend hyperlinks to users in order to facilitate fast and easy access to needed information. These systems utilize data that reflect users’ interests and usage behavior. These data may be captured implicitly from web server logs or explicitly through registration forms or questionnaires.
Web Site Improvement
Analysis of usage data by web usage mining provides insights into traffic behavior and how the website is being used. This knowledge can be exploited in many forms. It can be used to develop policies for web caching, network transmission, load balancing, or data distribution. The results of usage mining can also be used to enhance security by providing usage patterns that can be useful for detecting intrusion, fraud, attempted break-ins, and so on. These system improvements are crucial to user satisfaction with the services offered by the web system (Srivastava, Cooley,
Desphande, & Tan, 2000).
Website Modification
The attractiveness of the website design is a critical aspect of the success of web applications. This includes how the content is structured and laid out. Web usage mining gives web designers great opportunities to understand the actual navigation behavior of web users so that better decisions can be made about redesigning the website. For example, if a pattern reveals that two pages are often accessed together, the designer can consider linking them directly with a hyperlink.
Another trend is to automatically change the structure of the website based on the usage patterns discovered from server access logs. A website
that changes its design during interaction with the users is called adaptive website (Anand & Hilal, 2012) (Srivastava, Cooley, Desphande, & Tan,
2000).
Business Intelligence
Web usage mining is an excellent tool to generate business intelligence from web data. Understanding of how customers are using the website is
critical for marketing and customer relationship management in e-commerce. Clustering of online customers can be utilized to perform market segmentation in order to develop marketing strategies. Knowledge of user interests and behavior also helps individualize marketing by targeting
advertisements to interested users (Chitraa & Davamani, 2010).
Usage Characterization
Data available in the web server logs can be mined to extract interesting usage patterns. Various data mining techniques can be used for this
purpose. Web users can be divided into groups based on their navigational behaviors using clustering techniques. Classification techniques can
be used to find correlation between web pages or web users. In addition to clustering and classification, other techniques can be used to find
interesting usage patterns (Anand & Hilal, 2012).
THE WEB USAGE MINING PROCESS (MAIN FOCUS)
Web usage mining is the process of performing data mining on web log file data. This process goes through three main phases: preprocessing,
pattern discovery, and pattern analysis. Figure 1 shows the overall picture of the web usage mining process. Raw logs are preprocessed with the help of
site files (site topology). Then data mining techniques are applied to obtain patterns and rules. Finally, results are filtered to get “interesting”
patterns and rules (Srivastava, Cooley, Desphande, & Tan, 2000).
Figure 1. The Web Usage Mining Process
Data Collection
Usage data can be collected from various sources such as: web servers, clients, intermediary sources between clients and servers (Srivastava,
Cooley, Desphande, & Tan, 2000) (Shukla, Silakari, & Chande, 2013).
Web Server Data Collection
Server logs are considered the primary source of information for web usage mining. For each request made to the web server, an entry is recorded in the web log containing information about the request, including the IP address, the requested URL, a timestamp, the user agent, and so on. The first problem with server logs is that they are not completely reliable, because some browsing information may not be registered in the log, such as clients accessing a cached copy of a page from the client or proxy cache. The second problem arises when there is a proxy server between the clients and the web server. In this case, clients share the same IP address, their requests are registered under the same IP address in the server’s logs, and it becomes difficult to identify each client uniquely. This problem is called IP misinterpretation. Another source of data capable of storing usage data is cookies. Cookies are pieces of information generated by the web server for individual clients and stored on the client machine. The next time that client accesses the web server, it sends the cookie along with the request, so the web server can identify each user from the cookie data, which contains a unique user ID. Cookies suffer from two major issues. First, they are small in size, usually less than 4 KB, which limits the possible benefits, and there will be user misinterpretation if the same browser is used by multiple users. The second problem with cookies is that users must accept them, which means cooperation from users is needed for this method to succeed. The third source of data from the web server is explicit user input through registration forms. This approach requires additional work from users and may discourage them from visiting the website. Information collected through this method cannot be fully trusted, because users tend to provide limited information due to privacy concerns. The last source of data from the web server is usage data obtained from external sources (third parties). This approach may not be appropriate due to security and privacy issues (Srivastava, Cooley, Desphande, & Tan, 2000) (Shukla, Silakari, & Chande, 2013).
Client Data Collection
The second major source of usage data is clients’ machines. The client-side method is considered more reliable than the server-side method because it can overcome two issues associated with server-side data collection: caching and user session identification. This method can detect browsing activities on the client side such as back and forward navigation, copying a web page, and adding the page to favorites. The most common technique for collecting data from the client is to dispatch a remote agent. These agents are embedded within the web page and are usually implemented in Java or JavaScript. Another approach is to use a browser (such as Mozilla) that has been modified to collect usage information, and to ask users to use this browser to navigate the website. In either case, the cooperation of users is required. There are also security and privacy concerns regarding this approach. Since this approach may be misused to compromise users’ security and privacy, it is difficult to convince users to allow foreign code to run on their machines (Srivastava, Cooley, Desphande, & Tan, 2000) (Shukla, Silakari, & Chande, 2013).
Intermediary Data Collection
The third major approach is intermediary data collection, in which usage data is collected somewhere between the web server and the clients. The existence of proxy servers provides an opportunity to collect and study data related to a group of users. A proxy server is an intermediate server between the clients and the web server and is used to manage internet activities within an organization. This server maintains access logs similar in format to those of a web server and is considered a valuable source of information regarding the group of users accessing a web server through a common proxy. However, like web server logs, data collected from a proxy server suffers from the web caching and IP misinterpretation issues.
Another approach is to install a packet sniffer in the network to monitor network activities. Packet sniffers can be either software or hardware capable of collecting and analyzing usage data in real time. This approach has some advantages over log files: the complete web page is captured along with the request data, and cancelled requests can be captured as well. There are also some disadvantages. Since packet sniffers operate in real time and the data is not logged, data may be lost forever. Furthermore, TCP/IP packets may be encrypted or may not arrive in the same order in which they were sent, which greatly limits the useful information that can be extracted. Finally, packet sniffers represent a serious threat to network security and may compromise users’ security and privacy (Srivastava, Cooley, Desphande, & Tan, 2000) (Shukla, Silakari, & Chande, 2013).
Web Server Logs
Web server logs are plain text (ASCII) files maintained as a normal function of the web servers to log the details of HTTP traffic. An entry is
written to the logs for each HTTP request that is made to the server. Web logs take different formats depending on the server configuration (see
Figure 2) (Gupta & Gupta, 2012).
Types of Web Server Logs
Each web server may run different software with a different configuration, but generally there are four types of server logs: the transfer log, agent log, error log, and referrer log. The transfer and agent logs are the standard logs. The referrer and agent logs may not be turned on, or they may be added to the transfer log to create what is called the extended log format (Jose & Lal, 2012).
Transfer Log
The server access log records all requests processed by the server. Transfer (access) logs provide most of the server information, such as the IP address of the client, the requested web resource, and the date and time of the access. Analysing the transfer log enables the web administrator to discover usage patterns such as the number of visitors accessing the website from a specific domain type (.com, .edu, etc.), the number of unique visitors, the most popular page on the website, and the number of accesses during specific hours and days of the week.
Agent Log
The agent log provides information about the user’s browser version and the operating system of the user’s machine. This is valuable information for the design and development of the website, because the operating system and the browser define the capabilities of the client and what the client can do with the website.
Error Log
It is normal for a web user to encounter errors while browsing, such as Error 404 Page Not Found. Another type of error occurs when a user presses the stop button while downloading a large file. When a user encounters an error, an entry is written to the error log. The error log also provides information about missing files and broken links.
Referrer Log
This log identifies the other sites that link to the website. When a user follows such a link, an entry is generated in the referrer log.
Format of Access Logs
Each entry consists of a sequence of fields separated by whitespace and terminated by a CR or a CRLF. The access log can be configured to
different formats that show different fields. Typical formats are Common Log Format and Combined Log Format (Sumathi, PadmajaValli, &
Santhanam, 2011) (Grace, Maheswari, & Nagamalai, 2011).
Common Log Format
A typical entry of common log format might look as follows:
Host Logproprietor Username Date:Time Request Statuscode Bytes
Example:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] “GET /product.html HTTP/1.0” 200 2326
1. Host: This is the IP address of the client which made the request to the server. If there is a proxy server between the client and the server
this will be the address of the proxy not the address of the client. Ex: 127.0.0.1.
2. Log Proprietor: The name of the owner making the HTTP request. This information is usually not exposed, for security reasons; in that case a hyphen (“-”) appears in this field.
3. Username: Is the HTTP authentication user ID of the user requesting the page. In the example above the username is frank.
4. Date:Time: This is the time that the server finished processing the request. Times are specified in GMT. The format is
[day/month/year:hour:minute:second zone] where: Day = 2 digits. Month = 3 letters. Year = 4 digits. Hour = 2 digits. Minute = 2 digits.
Second = 2 digits. Zone = (+/-) 4 digits. Ex: [10/Oct/2000:13:55:36 -0700].
5. Request: This string contains the details of the request made by the client. It consists of, first, the method used by the client (GET); second, the requested resource (/product.html); and third, the protocol used by the client (HTTP/1.0). Ex: “GET /product.html HTTP/1.0”.
6. Status code: This is the status code returned by the server. The status code shows whether the request resulted in a successful response
or not. Generally, codes for successful requests begin with 2, redirection codes begin with 3, codes for errors caused by the client begin with 4, and codes for errors in the server begin with 5. In the example above the status code is 200.
7. Bytes: The last field indicates the size of the object returned to the client, in bytes. In the example above the size is 2326 bytes.
Combined Log Format
The combined log format is a commonly used log format similar to the common log format, with the addition of two more fields, namely the referrer and the user agent. The format can be shown as follows:
Host Logproprietor Username Date:Time Request Statuscode Bytes Referrer User_Agent
Example:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] “GET /product.html HTTP/1.0” 200 2326 “http://www.example.com/start.html” “Mozilla/4.08 [en] (Win98; I ;Nav)”
1. Referrer: Shows the page that the client was browsing before making the request. Ex: “http://www.example.com/start.html”.
2. UserAgent: This field shows information about the client's browser and the operating system. Ex: “Mozilla/4.08 [en] (Win98; I ;Nav)”.
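As a small illustration of how such an entry can be broken into its fields, the sketch below parses the combined-format line shown above with a regular expression; the pattern and field names are assumptions for the example rather than part of the chapter.

```python
# Parse one Combined Log Format entry into named fields with a regular
# expression (a sketch; real logs may need a more tolerant pattern).
import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logproprietor>\S+) (?P<username>\S+) '
    r'\[(?P<datetime>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<statuscode>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

entry = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
         '"GET /product.html HTTP/1.0" 200 2326 '
         '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

fields = LOG_PATTERN.match(entry).groupdict()
print(fields["host"], fields["statuscode"], fields["referrer"])
```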
Preprocessing Web Logs
Data preprocessing is the most important and time-consuming task in web usage mining; it takes about 80% of the total time of the mining process. The goal of the preprocessing step is to clean the data by removing irrelevant records and selecting essential features. The data is then transformed into a sessions file (Chitraa & Davamani).
Preprocessing consists of data cleaning, user identification, session identification, and path completion. Figure 3 shows a complete preprocessing
method for web usage mining (Sheetal & Shailendra, 2012).
Figure 3. Web log preprocessing
Data Cleaning
When a user requests a web page, a log record containing the URL of the page is written to the log, along with a log record for each file (e.g.,
image file, css) that constitutes the web page. Data cleaning includes the removal of records that might not be useful for mining. Following are the
log records that are unnecessary and should be removed (Chitraa & Davamani):
1. Records of multimedia files and format files that have file extensions like GIF, JPEG, MPEG, and CSS and so on. These files can be
identified by checking the URI field of the log entry.
2. Failed HTTP requests should be removed. Log records with status codes above 299 or below 200 are removed. In most
preprocessing methods only requests with a GET value in the method field are kept. Other records that have HEAD or POST values are
removed from the log.
3. Records of search engine robots (also known as bots or spiders) are removed. These are software programs that visit the website and automatically crawl the website's content.
4. At the end of the cleaning step only relevant records are kept which only show the page accesses of web users. The size of the log file
should now be reduced significantly.
User Identification
In this step, individual users should be identified. A user is a single individual accessing web pages through a browser. In reality, however, the identification of users is not an easy task due to the stateless nature of the HTTP protocol. Different users may have the same IP address in the log due to the use of proxy servers, and the same user may access the website from different machines using different browsers. This makes the identification of users the most difficult part of log preprocessing (Chitraa & Davamani).
The most commonly used method for identifying users can be described as follows:
2. For the next record, if the IP address is the same as the IP address identified in the previous records but the agent field shows a different browser or operating system, the IP address represents a different user.
3. If the IP address and the agent are the same, a path is constructed using the referrer field and the site topology. If the requested page is not directly reachable (via a hyperlink) from any previously visited page, a different user with the same IP address is identified.
Session Identification
The main task in this step is to divide the set of pages accessed by each user into individual sessions. A method for identifying user sessions can
be adopted as follows (Chitraa & Davamani):
3. If the referrer is null (it appears as “-” in the log), then a new session is started.
4. If the referrer site is a search engine or any other website, then it can be assumed that a new session is started.
Path Completion
Some page requests may be answered from the local cache or a proxy cache. These page accesses will not be recorded in the web server logs. Path
completion attempts to fill in these missing page references. The referrer field and site topology are used to accomplish this task. If the requested
page is not directly linked to the last page requested by the user, the referrer field can be checked to determine which page the request came from.
It is assumed that the user has viewed a cached version of the page if the referrer page is in the users' recent history. Site topology can be
employed instead of the referrer field to do the same task (Chitraa & Davamani).
After completing these tasks, the output file contains the log records along with two additional fields for user ID and session ID. The sessions file
now contains meaningful information and is ready for the pattern discovery process.
Patterns Discovery Techniques
This section discusses the techniques that are most widely used to extract knowledge from a website's log files. The following subsections
classify them according to the technique used.
Association Rules
It is the process of finding associations within data items in the log file. This technique can be used to identify pages that are most often accessed
together in a single user session. Association rules can be utilized, for example, as a way to guess the next page to visit so that the page can be
preloaded from a remote server. These associated pages may not be linked directly via a hyperlink so this information can help web designers
figure out related pages so that hyperlinks can be added between them (Srivastava, Cooley, Desphande, & Tan, 2000).
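To make the support/confidence idea concrete, the sketch below mines simple page-to-page rules from a few invented sessions; the sessions, thresholds, and the pairwise-only simplification are assumptions for illustration (a full Apriori-style miner would also consider larger itemsets).

```python
# Pairwise association-rule sketch over user sessions: for each ordered pair
# of pages, compute support (fraction of sessions containing both) and
# confidence (fraction of sessions with the first page that also contain the
# second), and keep the rules above the chosen thresholds.
from itertools import permutations
from collections import Counter

sessions = [
    {"/index.php", "/products.php", "/cart.php"},
    {"/index.php", "/products.php"},
    {"/index.php", "/contact.php"},
    {"/products.php", "/cart.php"},
]

page_count = Counter()
pair_count = Counter()
for session in sessions:
    page_count.update(session)
    pair_count.update(permutations(session, 2))

min_support, min_confidence = 0.5, 0.6
n = len(sessions)
for (a, b), together in pair_count.items():
    support = together / n
    confidence = together / page_count[a]
    if support >= min_support and confidence >= min_confidence:
        print(f"{a} -> {b}  (support={support:.2f}, confidence={confidence:.2f})")
```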
The authors in (Shastri, Patil, & Wadhai, 2010) discussed a constraint-based association rule mining approach. The goal is to make the generation of association rules more efficient and effective by removing useless rules and focusing only on the rules of interest instead of the complete set of rules, thereby eliminating unwanted rules. In addition to support and confidence, data constraints and rule constraints are employed. Data constraints use SQL-like queries to focus on a subset of the transactions; for example, find items sold together on weekend days. Rule constraints specify the properties of the rules to be mined, such as the maximum rule length and item constraints; for example, an association rule should have a length of at most 3, with a specific item appearing in every mined rule.
Maja and Tanja discussed in (Dimitrijević & Krunić, 2013) how the confidence of association rules changes over time. They compared confidence levels between two time periods. They showed that interestingness measures, namely confidence and lift, along with rule pruning techniques, can
be used to produce association rules that are easier to understand and allow the Webmaster to make informed decisions about improving the
structure of the website. The changes in confidence levels over the two time periods have brought new information about the association rules in
these two periods. If the confidence increased then the previous decision about adding a link between the pages involved in the rule has been
confirmed. On the other hand, if the confidence dropped, the administrator may consider removing the link between the pages involved in the
rule.
Preeti and Sanjay (Sharma & Kumar, 2011) discuss the preprocessing activities that are necessary before applying a data mining algorithm. The Apriori association rule mining algorithm is used to find the most frequently associated pages. The paper discusses how association rules can be beneficial in
some application areas such as business intelligence, site modification, and website restructuring.
Clustering
In web usage mining, there are two kinds of clusters: user clusters and page clusters. User clustering can be exploited to perform market
segmentation in e-commerce web sites or to provide personalized web pages. Page clustering identifies pages with related content and can be useful for
search engines and web recommendation systems (Srivastava, Cooley, Desphande, & Tan, 2000).
The authors in (Langhnoja, Barot, & Mehta, 2013) described a method for performing clustering on a web log file to divide web visitors into groups based on their common behaviors. The clustering technique, which is based on the DBSCAN algorithm, helped to efficiently find groups of users with similar interests and behavior.
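In the same spirit, the sketch below groups visitors with DBSCAN over two invented behavioral features; the features, values, and parameter choices are illustrative assumptions rather than the cited authors' setup.

```python
# DBSCAN sketch: cluster visitors by (pages per session, average seconds per
# page); points that fit no dense group are labeled -1 (noise/outliers).
import numpy as np
from sklearn.cluster import DBSCAN

visitors = np.array([[ 3,  20], [ 4,  25], [ 3,  22],     # casual browsers
                     [15, 120], [14, 110], [16, 130],     # heavy readers
                     [40,   2]])                          # likely a crawler

labels = DBSCAN(eps=15.0, min_samples=2).fit_predict(visitors)
print(labels)   # e.g. two dense groups plus one noise point
```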
The paper (M & Dixit, 2010) proposed a system for analyzing user sessions to find groups of users with similar activities. The system provides two
types of clustering: clustering based on time and clustering based on IP addresses. The system allows for selection of relevant attributes and
shows clustering based on either IP addresses or timestamp.
Neetu Anand and Saba Hilal (Anand & Hilal, 2012) discussed a web usage mining approach for identifying user access patterns hidden in web log
data. The goal is to discover unknown relationships in the data. The clustering analysis technique is applied to group similar page requests. The
WEKA software is used to perform k-means clustering of URLs.
Four types of clustering approaches have been investigated in (Sujatha & Punithavalli, 2011). These represent different methods for performing
clustering on web log data in order to discover hidden information about users' navigation behavior. The authors discussed ant-based clustering, fuzzy clustering, graph partitioning, and a page clustering algorithm. They also discussed the process of web usage mining and the activities
necessary for performing patterns discovery tasks.
Classification
Classification is the process of assigning a class to each data item based on a set of predefined classes. In web mining, classification can be used,
for example, to develop a profile of users belonging to a particular class or to classify HTTP request as normal or abnormal (Srivastava, Cooley,
Desphande, & Tan, 2000).
Priyanka and Dharmaraj (Patil & Patil, 2012) proposed an intrusion detection system based on web usage mining. Two different techniques have
been used to detect intrusions: misuse detection and anomaly detection. Misuse detection employs a set of attack descriptions, or “signatures”, and matches the actual usage data against these descriptions; if a match is found, a misuse has been detected. Misuse detection works well in detecting all known attacks; however, it cannot detect new attacks. The second mode is anomaly detection, which has the advantage of detecting unknown as well as known attacks. This is done by comparing usage data against previously constructed profiles in order to detect unexpected events or anomalous patterns of activity.
The Apriori algorithm is used to learn association rules from URI lists. Classification is then used to label each log record as either normal or a specific kind of attack. To detect anomalies, login and page-frequency thresholds are employed. A model of normal behavior is constructed during the learning phase; this model is used in the detection phase to compare against the usage data, and if abnormal behavior is found, an intrusion has been detected. Experiments were carried out showing that the proposed system had a detection rate equal to or better than other similar intrusion detection systems.
The paper (Santra & Jayasudha, 2012) discussed the classification of web users into “interested” or “not interested” users based on the Naïve Bayes classification algorithm. The goal is to focus on the behavior of interested users instead of all users. An interested user is identified by checking the time spent on each page and the number of pages visited. Records of uninterested users are eliminated from the log file. The authors conducted a comparative study and evaluation of the Naïve Bayes algorithm and an enhanced C4.5 decision tree, and concluded that the Naïve Bayes algorithm has a more efficient implementation and better performance than the C4.5 decision tree.
Sequential Patterns
The aim of this technique is to predict future URLs to be visited based on a time-ordered sequence of URLs followed in the past. This technique
finds temporal relationships among data items over ordered click-stream sequences. The sequential patterns technique can be thought of as an extension of association mining, which is not capable of finding temporal patterns. The difference between association mining and sequential patterns is that association mining focuses on searching for intra-sequence patterns, whereas sequential patterns find inter-sequence patterns
(Valera & Rathod, 2013) (Sharma & Kumar, 2011).
Examples of sequential patterns can be given as follows: if a user visits page X followed by page Y, he or she will visit page Z with c% confidence; 60% of clients who placed an online order on /computer/products/webminer.html also placed an online order on /computer/products/iis.html within 10 days.
Sequential patterns allow web-based organization to predict user visit patterns. This can help in developing strategies for advertisement targeting
groups of users based on these patterns.
Patterns Analysis
The results of the pattern discovery phase need to be analyzed to obtain more useful patterns. Each cluster obtained by the clustering technique needs to be analyzed to find a general description of it. The association rule algorithm may generate a large number of rules, so the results of the association rules technique need to be explored and analyzed in order to focus on the interesting rules. Generally, patterns are further analyzed to refine the discovered knowledge. Patterns can be analyzed using a query language such as SQL or can be loaded into a data cube in order to perform different OLAP operations. Visualization techniques can also be used to help understand patterns and highlight trends in the data (Srivastava, Cooley, Desphande, & Tan, 2000) (Shukla, Silakari, & Chande, 2013).
Implementation Example
This section discusses the implementation of the web usage mining process. A real web log file will be utilized. Clustering and association mining
will be performed to find usage patterns of the website’s users.
Software Tool
Orange is a powerful and easy to use data visualization and analysis tool that supports data mining through visual programming or Python
scripting. Orange can be freely downloaded from http://orange.biolab.si/ .
The Data Sets
A log file was collected from the web domain of an organization. The data was collected over the period 2 Sep 2013 to 22 Oct 2013, a total of 51 days. There were 750,381 records in the log. Each record is a sequence of information about one HTTP request. A description of each field in the log record is presented in Table 1.
Table 1. Data description
Field Description
The following preprocessing tasks will be performed: data cleaning, user identification, and session identification. The data should be formatted and a suitable feature set should be chosen. Following is a description of each task.
Data Cleaning
Irrelevant records are removed. These records do not add any new information because they are not explicitly requested by visitors; all records are removed except those that show the URLs requested by the visitors. Data will be extracted from the log file (a text file) and stored in a relational table so that further processing is more efficient. Following are the steps required for cleaning the log data (a sketch of these rules follows the list):
1. Remove the records of multimedia and format files, i.e., files that have extensions such as .jpg, .mpeg, .gif, .js, and .css.
2. Remove the records with status code less than 200 or greater than 299 (failed requests).
3. Remove the requests of search engines by checking the agent field for patterns like “bot”, “spider”.
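A minimal sketch of the three rules applied to already-parsed records follows; the dictionary field names are assumptions carried over from a parsing step, not a prescribed schema.

```python
# Apply the three cleaning rules to parsed log records (each record is
# assumed to be a dict with 'url', 'status', and 'agent' keys).
IRRELEVANT_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".mpeg", ".js", ".css")
BOT_PATTERNS = ("bot", "spider")

def is_relevant(record):
    url = record["url"].lower()
    agent = record["agent"].lower()
    if url.endswith(IRRELEVANT_EXTENSIONS):           # rule 1: multimedia/format files
        return False
    if not (200 <= record["status"] <= 299):          # rule 2: failed requests
        return False
    if any(p in agent for p in BOT_PATTERNS):         # rule 3: search engine robots
        return False
    return True

records = [
    {"url": "/ar/index.php", "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/style.css",    "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/view_law.php", "status": 404, "agent": "Mozilla/5.0"},
    {"url": "/ar/index.php", "status": 200, "agent": "Googlebot/2.1"},
]
cleaned = [r for r in records if is_relevant(r)]
print(cleaned)   # only the first record survives the three rules
```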
User Identification
Since we are interested in modeling users' behaviors, it is important to identify each user uniquely. A method based on the combination of the IP
address and the agent will be employed for this purpose.
Session Identification
A session is a group of pages requested by a single user during one visit or during a specified period of time. To find users' browsing patterns, it is important to group the pages requested by individual users into sessions. For this purpose, a navigation-oriented session identification method based on the referrer field has been adopted, as follows (a sketch follows the list):
1. First, sort the records of each user according to time.
4. If the referrer is a search engine or any other site, we assume that the user is starting a new session.
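A hedged sketch of this referrer-based rule applied to one user's time-ordered records follows; the record structure and the site host are illustrative assumptions, and only the rules stated above are implemented.

```python
# Split one user's time-ordered records into sessions: a "-" referrer or a
# referrer outside the analyzed site starts a new session.
SITE_HOST = "www.example.org"   # hypothetical host of the analyzed website

def assign_sessions(user_records):
    sessions, current = [], []
    for record in user_records:
        referrer = record["referrer"]
        external = referrer != "-" and SITE_HOST not in referrer
        if current and (referrer == "-" or external):
            sessions.append(current)
            current = []
        current.append(record)
    if current:
        sessions.append(current)
    return sessions

records = [
    {"url": "/ar/index.php", "referrer": "-"},
    {"url": "/view_law.php", "referrer": "http://www.example.org/ar/index.php"},
    {"url": "/view_law.php", "referrer": "https://www.google.com/search"},
]
print(len(assign_sessions(records)))   # 2 sessions for this user
```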
Table 2 shows the results of data cleaning, user identification, and user session identification.
Table 2. Preprocessing results
Further processing of the data is performed to extract as much useful information as possible from the log data. Following are some tasks to
improve the data set.
1. Convert IP numbers to the geographical location (country) using a reverse DNS service.
Feature Set (First 3 Clicks, Last 2 Clicks)
Some user sessions contain only one or two pages, while others contain more than 10 pages. Due to these differences in session length, the first three pages and the last two pages of each session will be chosen. If a session has fewer than five pages, there will be missing values; the missing pages will be filled in with a hyphen. This scheme is also known as First3-Last2, one of several feature set methods presented in (Ciesielski & Lalani, 2003). The feature set that will be used in the experiments includes the user ID, the session ID, the geographic location, the first three pages and the last two pages of the session, the referrer site, and the browser name.
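The sketch below shows one reasonable reading of the First3-Last2 construction for a single session; exactly how overlaps and padding were handled in the original study is an assumption here.

```python
# Build a First3-Last2 feature vector: the first three and the last two pages
# of a session, padding short sessions with a hyphen (assumed interpretation).
def first3_last2(pages):
    padded = (pages + ["-"] * (5 - len(pages))) if len(pages) < 5 else pages
    return padded[:3] + padded[-2:]

print(first3_last2(["/ar/index.php", "/view_law.php"]))
# ['/ar/index.php', '/view_law.php', '-', '-', '-']
print(first3_last2(["/a", "/b", "/c", "/d", "/e", "/f"]))
# ['/a', '/b', '/c', '/e', '/f']
```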
Pattern Discovery
In this section we will apply a clustering technique to the preprocessed data. The goal is to divide the user sessions into groups that share similar characteristics. The K-means clustering method will be used with different settings. The number of clusters (K) will be optimized by the software tool over the range 2 to 5; the tool optimizes the clustering and selects the K value that corresponds to the best clustering.
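The same idea can be expressed outside Orange; the sketch below (an assumption, using scikit-learn and synthetic data rather than the chapter's session features) fits K-means for k = 2 to 5 and keeps the k with the highest silhouette score.

```python
# Choose k by fitting K-means for k = 2..5 and keeping the best silhouette.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)   # other metrics, e.g. "manhattan", are possible
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))   # expect k = 2 for this synthetic data
```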
Evaluation Framework
There are different measures to assess the quality of clustering by measuring how well clusters are separated from each other, or how compact
the clusters are. Following are the measures available in the Orange software:
1. Clustering quality: the cluster scoring methods explored include Silhouette (heuristic), Between Cluster Distance, and Distance to Centroids.
2. Distance measures: used to measure the distances between examples (user sessions in our case) in the dataset; the Euclidean, Manhattan, Maximal, and Hamming distances will be used.
Results
For each combination of scoring method and distance measure, the best score and the resulting number of clusters (best k) are shown in Table 3.
Table 3. Cluster results
Silhouette (heuristic): Maximal 0.0813 (best k = 5); Hamming 0.355 (best k = 2)
Between cluster distance: Maximal 4966.0 (best k = 5); Hamming 33982.0 (best k = 2)
Distance to centroids: Maximal 4562.4 (best k = 5); Hamming 19255.0 (best k = 5)
The best clustering using the Silhouette (heuristic) method was achieved with the Manhattan distance measure, which gives the highest score
(0.45) and resulted in two clusters. Similarly, the between-cluster-distance method achieved its best clustering with the Hamming distance
measure (score = 33982.0), and this clustering also resulted in two clusters. Finally, the distance-to-centroids method achieved its best clustering
with the Maximal measure, which gives the lowest score (4562.4) and resulted in five clusters.
It should be noted that the Euclidean distance measure was never able to produce the best clustering (according to the score), regardless of the
scoring method used. Also, the between-cluster-distance method tends to generate two clusters, except for the Maximal distance measure, which
gives five clusters. Similarly, the distance-to-centroids method always generates five clusters regardless of the distance measure used.
It is difficult to compare the different quality methods because of differences in how they calculate the score. For example, for the
between-cluster-distance method a larger score is better, while for the distance-to-centroids method a smaller score is better.
Pattern Analysis
Silhouette Combined with Manhattan Distance Measure
The best clustering using the k-means algorithm with Silhouette as the scoring method was obtained with the Manhattan distance measure. With
these settings, two clusters were formed.
The first cluster consists of 3517 user sessions, which constitute 70.8% of the data set. 40.2% of the sessions in this cluster belong to users from
the United States; the next most frequent country is Sudan (17.2%). Users in this cluster tend to start their sessions from the homepage
/ar/index.php (42.7% of the sessions). Interestingly, in 93.7% of the sessions there is no referrer site, which means that users started browsing
from the website itself and did not come from a search engine or any other site; they typed the URL or had set the website as the homepage in
their browsers. The top browser used was the iPhone browser (26.6%).
The second cluster consists of 1449 user sessions, which constitute 29.2% of the total sessions. 46.7% of the sessions in this cluster belong to users
from Sudan; the next most frequent country is Norway (15.5%). Users in this cluster tend to start their sessions from the page view_law.php
(57.6% of the sessions). 60.5% of the sessions started with a search at Google and then arrived at the website, usually not through the homepage
but through the page view_law.php. The top browser used was Opera Mini (22.7%).
Between-Cluster-Distance Combined with Hamming
The best clustering using the k-means algorithm with between-cluster distance as the scoring method was obtained with the Hamming distance
measure. With these settings, two clusters were formed.
It should be noted that the result of this experiment is very similar to that of the first experiment above: the clusters formed are almost identical,
with only minor differences.
Distance-to-Centroids Method Combined with Maximal
The best clustering using the k-means algorithm with distance to centroids as the scoring method was obtained with the Maximal distance
measure. This experiment resulted in five clusters.
The first and largest cluster contains 90.8% of the total sessions. The top countries are the United States, Sudan, and India. The most common
entry pages were /ar/index.php and view_law.php. Most of the sessions had no referring site. The top browser was the iPhone browser.
The second cluster consists of 3.7% of the total sessions. All users were from Sudan, started browsing from /ar/index.php, and had no referring
website. All users in this cluster used the Firefox browser.
The third cluster represents 0.1% of the total sessions. All users were from Sweden, started browsing from the page dstor2005.php, and had no
referring website. All users in this cluster used the Firefox browser.
The fourth cluster represents 0.6% of the total sessions. All users were from Finland, started browsing from /ar/index.php, and came from the
same website. All users in this cluster used the Nokia browser.
The fifth cluster represents 4.8% of the total sessions. All users were from France, started browsing from /ar/index.php, and had no referring
website. All users in this cluster used the Safari browser.
General Thoughts about Clusters
The between-cluster-distance and Silhouette methods produced almost the same results, with minor differences. These two methods assess how
well the clusters are separated from each other.
The third method, distance to centroids, gave very different results. The clustering consists of one large cluster with 90.8% of the total sessions,
while the other four clusters share the remaining 9.2%. Each of these four clusters is pure, meaning that its sessions are very similar: the same
pages visited, the same browser, the same location, and so on. This raises the question of whether this clustering is really a good one. After further
investigation, it was found that one of these clusters contains the sessions of a single user, and another consists of the sessions of only two
different users. Although the quality of this clustering is high, and therefore interesting according to the objective measures, it might not be
interesting according to a subjective measure because it might not provide useful information.
Association Rules
The goal is to generate association rules from user sessions. Association rules can reveal users' behavior and navigation patterns. Minimum
support and minimum confidence thresholds of 14 and 60, respectively, will be used for the entire data set, and 12 and 60 for the per-country
experiments. First we will mine association rules from the whole data set; then we will find association rules for each of the top 10 countries in
the data set.
Evaluation Framework
Whether or not a rule is interesting can be assessed either subjectively or objectively. Only the user can judge whether a rule is interesting, and
this judgment is subjective; it may differ from one user to another. In this project we used the objective interestingness measures support and
confidence. Some rules can have high support and confidence values but still be misleading. To filter out strong but misleading rules, we enhance
the evaluation framework with the lift, a correlation measure based on the statistics of the data.
1. Support: The support of the rule A => B is the proportion of transactions that contain both A and B.
2. Confidence: The confidence of the rule A => B is the conditional probability of the occurrence of B given A.
3. Lift: A simple correlation measure used to filter out misleading association rules of the form A => B. The lift value is used to identify
whether A and B are positively correlated, negatively correlated, or independent.
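These three measures can be computed directly from the session table. The sketch below assumes that each session has been encoded as a set of attribute=value items, which is an assumption about the representation rather than the exact encoding used in the study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(transactions, set(antecedent) | set(consequent)) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """> 1: positively correlated, < 1: negatively correlated, = 1: independent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Tiny illustration with the rule location=United States => ref_site=no referrer
sessions = [
    {'location=United States', 'ref_site=no referrer', 'browser=iPhone browser'},
    {'location=Sudan', 'ref_site=Google', 'browser=Opera Mini'},
    {'location=United States', 'ref_site=no referrer', 'browser=Safari'},
]
print(support(sessions, {'location=United States', 'ref_site=no referrer'}))       # 0.666...
print(confidence(sessions, {'location=United States'}, {'ref_site=no referrer'}))  # 1.0
print(lift(sessions, {'location=United States'}, {'ref_site=no referrer'}))        # 1.5
```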
Mining Rules from the Entire Data Set
In this experiment we selected (click1, click2, location, ref_site, and browser) from the feature set described above. The reason we did not select
the other three pages of each session (click3, click4, click5) is that, most of the time, users did not visit more than two pages in a single session.
The missing values were filled in with a '-'. Because these placeholders create problems for association rule mining by generating more
meaningless rules, we remove the last three pages from each session.
After applying the association mining technique with a minimum support of 14 and a minimum confidence of 60, we found 45 rules. Some rules
may not be meaningful, but they are still strong rules and represent correlations within the data set. It should be noted that even if a rule is
interesting enough according to the support and confidence measures, we need to check the lift column to see whether the antecedent and
consequent of the rule are statistically correlated; this helps filter out strong but misleading rules. If the lift is less than 1, the antecedent and
consequent of the rule are negatively correlated; if the lift is greater than 1, they are positively correlated; and if the lift equals 1, they are
independent. We are interested in rules that are positively correlated. Table 4 shows the most interesting rules.
Table 4. Most interesting rules from the data set
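The chapter does not name the library used for this step; a hedged sketch with mlxtend's Apriori implementation (one possible choice, and its API differs slightly between versions) could look like the following, with the thresholds expressed as fractions of the session count.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

def mine_rules(sessions, min_support=0.14, min_confidence=0.60):
    """sessions: list of item lists such as ['click1=/ar/index.php', 'ref_site=Google', ...]."""
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(sessions).transform(sessions), columns=te.columns_)
    frequent = apriori(onehot, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent, metric='confidence', min_threshold=min_confidence)
    # keep only positively correlated (and therefore potentially interesting) rules
    return rules[rules['lift'] > 1.0].sort_values('lift', ascending=False)
```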
Mining Association Rules by Country
Sometimes it is desirable to divide users into groups and derive high-level conclusions about these groups. In this experiment we consider each
geographical location a separate group and search for association rules within these groups (countries), so we limit the search to the sessions of
visitors from each location. These rules cannot be discovered from the entire data set because they may not meet the minimum support threshold
there. We consider the top countries, as they represent the majority of visitors.
In this experiment we select the sessions belonging to each top country and see whether we can find any interesting rules specific to that location.
We used a minimum support of 12 and a minimum confidence of 60. Table 5 shows some of the most interesting rules.
Table 5. Most interesting rules in top countries
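Reusing the sketch above, the per-country experiments simply restrict the transactions to one location before mining; mine_rules comes from the previous sketch and row_to_items is a hypothetical helper that turns a feature row into attribute=value items.

```python
def mine_rules_by_country(feature_rows, top_countries, min_support=0.12, min_confidence=0.60):
    """Run the same miner once per country, on that country's sessions only."""
    rules_per_country = {}
    for country in top_countries:
        subset = [row_to_items(r) for r in feature_rows if r['location'] == country]
        if subset:
            rules_per_country[country] = mine_rules(subset, min_support, min_confidence)
    return rules_per_country
```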
We applied the association rule technique with minimum support and minimum confidence values of 14 and 60, respectively, for the entire data
set. We will only discuss some of the interesting rules from the entire data set and from each country that appears in the top 10 countries.
Association Rules from Entire Data Set
With a minimum support of 14 and a minimum confidence of 60, we found 45 association rules. The results are shown in Table 4. When we check
the lift column, we can see that all rules represent positive correlations except for a single rule, which is a negative association rule.
Following are some of the interesting rules. We will use the following structure for the rules: Antecedent => Consequent, (support, confidence, lift).
1. If the user is from the United States, then there is no referrer. This rule means that users from the United States start their sessions directly
from the website itself and did not arrive from a search engine or any other website. It also means these users are familiar with the web site:
they entered the URL directly or they have set the website as the homepage in their browser. The rule is: location=United States => ref_site=
no referrer, (0.277, 0.884, 1.188).
2. The second rule states that if a user accesses the homepage (/ar/index.php), then the next page is '-', meaning it is empty. In other words,
the session contains a single page. The rule is: click1=/ar/index.php => click2=-, (0.234, 0.677, 1.071).
3. If a user arrives from Google, then the second page accessed is the homepage (/ar/index.php). The rule is: ref_site=Google =>
click2=/ar/index.php, (0.154, 0.853, 2.664).
4. The next rule is: click1=/ar/view_law.php => click2=/ar/index.php, (0.155, 0.610, 2.129). This rule means that the two pages view_law.php
and index.php are accessed together in user sessions.
5. The next interesting rule is: location=United States => ref_site=no referrer, browser=iPhone browser, (0.191, 0.609, 3.046). This rule
means that users from the United States open the website directly and are mobile users.
6. If users access the website directly by typing its web address (i.e., they did not arrive from other websites via hyperlinks), then they visit a
single page in their session. The rule is: ref_site=no referrer => click2=-, (0.585, 0.786, 1.243). This rule has high support, meaning that more
than 50% of the sessions follow it.
We noticed that most of the rules relate to users from the United States. This can be explained by the fact that most of the visitors are from the
United States, and it is why we chose to search for association rules in subsets of the data set, so that other rules can be found.
Association Rules by Country
In this section, we will discuss the rules found that correspond to each country. The experiments were described in section 4.3.2 in the previous
chapter. Following are some of the interesting rules:
1. United States: The rule click1=/ar/view_law.php, click2=- => ref_site=no referrer, (0.153, 0.968, 1.095) means that users from the United
States visit the page view_law.php directly, without going through the homepage, and they visit only that page. This means that these users
are familiar not only with the website but also with a specific page in it. The rule is consistent with the results shown in section 5.3.1 above.
2. United States: The second interesting rule is click1=/ar/news.php, click2=- => ref_site=no referrer, (0.159, 1.000, 1.131). It means that
users visited only the page news.php, directly and without any referrer website. This rule is new and cannot be discovered from the entire
data set.
3. Sudan: An interesting rule is that users from Sudan start by searching at Google and then reach the page view_law.php, which is not the
homepage. The rule is: click1=/ar/view_law.php => ref_site=Google, (0.237, 0.768, 2.307).
4. Sudan: Another interesting rule is that the page view_law.php and the homepage /ar/index.php are often accessed together in a single
session, which is consistent with the results in section 5.3.1 above. The rule is: click1=/ar/activity/view_law.php => click2=/ar/index.php,
(0.259, 0.838, 1.563).
5. India: One interesting rule was found; the other rules basically mean the same thing. The rule is: click1=/ar/members.php => click2=-,
ref_site=no referrer, (0.939, 0.951, 1.011). It means that users from India visited a single page in their session, members.php, and they did
not arrive from a search engine or any other site. It should be noted that this rule has very high support and confidence: almost all users from
India follow this rule and show very similar behavior, which is a surprising result. It raises the question of why visitors from this location have
such high interest in this single page. We found that these sessions actually belong to a single user.
6. China: The new and most interesting rule is that if the first page in the session is /en/news.php and there is no referrer site, then users do
not visit more pages. The rule is: click1=/en/news.php, ref_site=no referrer => click2=-, (0.157, 1.000, 1.245). The rule is new and cannot be
discovered from the entire dataset.
7. Ukraine: The first interesting rule is that when users arrive from parliament.gov.sd, their first page is the homepage. The rule is:
ref_site=parliament.gov.sd => click1=/ar/index.php, (0.946, 0.959, 1.014).
8. Ukraine: The second interesting rule states that users revisit the same page. This is a strong rule and a surprising result, which may be
attributed to the website design. The rule is: click1=/ar/index.php => click2=/ar/index.php, (0.946, 1.000, 1.057). After investigating the
data, we found that the sessions actually belong to different users; different users share the same strange behavior of revisiting the homepage
multiple times in their sessions. We suspect the homepage design might be the reason.
General Thoughts about Association Rules
The general behavior of visitors revealed by the association rules can be described as follows: visitors are either regular visitors who are familiar
with the website, or they arrive after searching at Google. Visitors who arrive from Google usually tend to visit the page view_law.php; this
specific page is very popular and visitors have an increased interest in its content.
After partitioning the data set and searching the sessions according to geographical location, we were able to discover new association rules that
could not be discovered from the entire data set because they might not meet the minimum support threshold. Some of these new rules are quite
interesting and reveal strange behaviors in user navigation.
CHALLENGES AND RESEARCH ISSUES (FUTURE DIRECTIONS)
Challenges
Web data represent a new challenge to traditional data mining algorithms that work with structured data. Some web analysis tools are being
developed that offer an implementation of some data mining techniques, but research in this area is still in its infancy. Applying data mining to
the web faces real challenges such as large storage requirements and scalability problems (Rana, 2012).
In the pre-processing phase, the log file needs to be cleaned of irrelevant records; records of graphics and failed HTTP requests are removed. For
some domains this information should be removed, while for other domains it should not be eliminated. Pre-processing is therefore highly
dependent on the domain, and there is no standard technique that can be applied to all domains (Srivastava, Cooley, Desphande, & Tan, 2000).
In practice, it is difficult to identify web users uniquely due to the stateless nature of the HTTP protocol and the presence of proxy servers. A
single user may access the website from different machines using different browsers, and the same machine or browser may be used by several
users. In web usage mining research it is usually assumed that the same IP address and the same user agent mean the same user; however, the
same user may come from different IP addresses or use different browsers. This represents a real challenge when identifying the users accessing
a website.
For the identification of user sessions, we assumed in our example that if a user opens the website directly by typing the web address in the
browser, the user is starting a new session. Furthermore, if the user arrives from a search engine or another website via a hyperlink, we also
assumed that the user is starting a new session. We need to make such assumptions when identifying users' sessions, and these assumptions may
affect the accuracy of the results of the usage mining.
Web caching represents another problem: accesses to cached pages are not recorded in the server logs, leaving us without information about
these accesses. The existence of a proxy server represents another challenge. Clients accessing a web server through a proxy server will be
assigned the same IP address and registered as such in the server log, so it becomes difficult to identify each user uniquely from the web server
logs (Shukla, Silakari, & Chande, 2013).
The interpretation of the knowledge extracted by data mining techniques may be limited by the limitations of the data set. Web logs are basically
used for debugging purposes and for knowing how much traffic the website is getting; they are not intended for studying users' behaviors and
finding their interests. To improve the quality of the discovered patterns, it is important to extract as much useful information as possible from
the available data in the log file.
FUTURE RESEARCH DIRECTION
Future research may explore new and different feature sets. This chapter discussed the first-three-clicks and last-two-clicks feature set, which
means that the first three pages and the last two pages are included for each user session. Such a feature set has limitations and might not be
suitable for other data mining goals. In addition to developing new feature sets, it might be helpful to extract more information from the raw log
file, similar to extracting location information from raw IP numbers, so that the quality of the data set is improved and, as a result, more
interpretable and actionable patterns can be mined.
Log file data go through various kinds of transformations during the preprocessing phase. Such transformations may affect the quality of the data
set. To handle this issue, future research may focus on better preprocessing techniques.
This chapter discussed association rules, clustering, classification, and sequential pattern analysis. Other data mining techniques that work with
structured data can be incorporated into the web usage mining process. In addition to exploring different data mining techniques, more data can
be combined with the log file's data in many ways, and the results of one kind of analysis can be used as input to other data mining techniques.
For instance, the results of text analytics, i.e., information extracted from web page content, can be combined with web log data. This
combination provides a rich data source for pattern discovery and leads to significant knowledge being mined.
Privacy Issues
It is very important to maintain users' privacy while performing web usage mining. Users may hesitate to reveal their personal information
implicitly or explicitly, and some users may hesitate to visit websites that make use of cookies or agents. Most users want to maintain strict
anonymity on the web, and in such an environment it becomes difficult to perform usage mining (Srivastava, Cooley, Desphande, & Tan, 2000)
(Mehtaa, Parekh, Modi, & Solamki, 2012). On the other hand, site administrators are interested in finding out users' demographic information
and their usage patterns. This information is extremely important for the improvement of the site. For personalization and for improving the
browsing experience, each user needs to be identified uniquely every time they visit the website.
The main challenge is to develop rules such that administrators can perform analysis on usage data without compromising the identity of
individual users. Furthermore, there should be strict regulations that prevent usage data from being exchanged or sold to other sites. Every site
should inform its users about the privacy policies it follows, so that users are aware of them when deciding whether to reveal personal information.
To meet the need to maintain users' privacy while browsing the website, the W3C presented the Platform for Privacy Preferences (P3P)
(Srivastava, Cooley, Desphande, & Tan, 2000). P3P allows the privacy policies followed by the website to be published in a machine-readable
format. When the user visits the site for the first time, the browser reads the privacy policies and compares them with the security settings in the
browser. If the policies are satisfactory, the user can continue browsing the website; otherwise, a negotiation protocol is used to arrive at a setting
which is acceptable to the user. Another aim of P3P is to provide guidelines so that independent organizations can ensure that sites comply with
the policy statements they publish (Langhnoja, Barot, & Mehta, 2013).
CONCLUSION
As a result of the internet becoming an important medium for business transactions and the recent interest in e-commerce, the study of web
usage mining has attracted much attention from the research community. Web usage mining is the application of data mining techniques to
usage data, which is available primarily in server logs. This chapter has discussed pattern discovery from web server log files to gain insight into
visitors' behavior and usage patterns.
The web usage mining process involves three main phases: preprocessing, pattern discovery, and pattern analysis. Usage data can be collected
from three main locations: the web server, the clients, and intermediary locations such as proxy servers. Web server log files are considered the
primary source of data for usage mining. However, server logs are not entirely reliable due to various kinds of client caches and proxy caches.
Ambiguities in IP addresses and in user identification are examples of challenges that affect the accuracy of user identification, which is a crucial
step in web usage mining. Data in server logs have to be converted into a structured format by preprocessing algorithms before any data mining
technique is applied.
This chapter discussed the major pattern discovery techniques in web usage mining: association rules, clustering, classification, and sequential
patterns. These techniques are basically data mining techniques adapted to the web domain.
The chapter also presented a practical example that implements an approach to web usage mining. The raw log file contained 750381 records;
after preprocessing, there were 12848 different users and 4966 user sessions.
In the knowledge discovery phase, clustering models using the k-means algorithm and association rule mining were used to find usage patterns.
The study found that, with the k-means algorithm, among all combinations of quality methods and distance measures, Silhouette works best
combined with the Manhattan distance measure, between-cluster distance works best combined with the Hamming distance measure, and
distance to centroids works best combined with the Maximal distance measure. Regarding association rules, we found that mining rules from the
entire data set resulted in rules dominated by visitors from the United States. After dividing the data set into geographical groups, so that
high-level conclusions about these groups could be derived, we were able to find some new and interesting associations. We found that most
visitors are from the United States and Sudan, and the most popular pages in the web site were the homepage index.php and view_law.php;
these two pages were most often accessed together in user sessions. A significant number of users visited a single page, mostly the homepage.
Many users started their sessions by searching at Google, then arrived at the page view_law.php, and after that moved to the homepage.
Web usage mining combines two active research areas: the World Wide Web and data mining. Applications of usage mining show promising
results in different areas such as web personalization, business intelligence, web system improvement, and security enhancement. The study of
web usage mining is expected to attract more attention as the internet becomes an increasingly important medium for business, education, news,
government, and so on.
This work was previously published in a Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence edited by
Noor Zaman, Mohamed Elhassan Seliaman, Mohd Fadzil Hassan, and Fausto Pedro Garcia Marquez, pages 418-447, copyright year 2015 by
Information Science Reference (an imprint of IGI Global).
REFERENCES
Anand, N., & Hilal, S. (2012). Identifying the User Access Pattern in Web Log Data. International Journal of Computer Science and Information
Technologies , 3(2), 3536–3539.
Bhatia, T. (2011). Link Analysis Algorithms For Web Mining. International Journal of Computer Science and Technology, 2(2), 243–246.
Chitraa, V., & Davamani, A. S. (2010). A Survey on Preprocessing Methods for Web usage Data. International Journal of Computer Science and
Information Security , 7(3), 78–83.
Chitraa, V., & Davamani, A. S. (2010). An Efficient Path Completion Technique for web log mining. IEEE International Conference on
Computational Intelligence and Computing Research.
Ciesielski, V., & Lalani, A. (2003). Data Mining of Web Access Logs From an Academic Web Site. Proceedings of the Third International
Conference on Hybrid Intelligent Systems (HIS’03): Design and Application of Hybrid Intelligent Systems (pp. 1034-1043). IOS Press.
Costa Júnior, M. G., & Gong, G. d. (2005). Web Structure Mining: An Introduction. Proceedings of the 2005 IEEE International Conference on
Information Acquisition (p. 6). Hong Kong: IEEE. 10.1109/ICIA.2005.1635156
Dimitrijević, M., & Krunić, T. (2013). Association rules for improving website effectiveness: Case analysis. Online Journal of Applied Knowledge
Management , 1(2), 56–63.
Grace, L. J., Maheswari, V., & Nagamalai, D. (2011). Analysis of Web Logs and Web User in Web Mining. International Journal of Network
Security & Its Applications , 3(1), 99–110. doi:10.5121/ijnsa.2011.3107
Gupta, R., & Gupta, P. (2012). Application specific web log pre-processing. Int. J. Computer Technology & Applications, 3(1), 160–162.
Gupta, R., & Gupta, P. Fast Processing of Web Usage Mining with Customized Web Log Pre-processing and modified Frequent Pattern
Tree. International Journal of Computer Science & Communication Networks, 1(3), 277-279.
Jose, J., & Lal, P. S. (2012). An Indiscernibility Approach for preprocessing of Web Log Files. International journal of Internet Computing, 1 (3),
58-61.
Langhnoja, S. G., Barot, M. P., & Mehta, D. B. (2013). Web Usage Mining to Discover Visitor Group with Common Behavior Using DBSCAN
Clustering Algorithm. International Journal of Engineering and Innovative Technology , 2(7), 169–173.
M, K., & Dixit, D. (2010). Mining Access Patterns Using Clustering. International Journal of Computer Applications, 4(11), 22-26.
Mehtaa, P., Parekh, B. P., Modi, K., & Solamki, P. (2012). Web Personalization Using Web Mining: Concept and Research Issue. International
Journal of Information and Education Technology, 2(5), 510-512.
Patil, P. V., & Patil, D. (2012). Preprocessing Web Logs for Web Intrusion Detection. International Journal of Applied Information Systems, 11-
15.
Rana, C. (2012). A Study of Web Usage Mining Research Tools. Int. J. Advanced Networking and Applications, 3(6), 1422–1429.
S, Y., K, A., & J, S. (2011). Analysis of Web Mining Applications and Beneficial Areas. IIUM Engineering Journal, 12(2), 185-195.
Santra, A. K., & Jayasudha, S. (2012). Classification of Web Log Data to Identify Interested Users Using Naïve Bayesian Classification.
International Journal of Computer Science Issues, 9(1), 381–387.
Sharma, P., & Kumar, S. (2011). An Approach for Customer Behavior Analysis Using Web Mining. International Journal of Internet
Computing , 1(2), 1–6.
Shastri, A., Patil, D., & Wadhai, V. (2010). Constraint-based Web Log Mining for Analyzing Customers’ Behaviour. International Journal of
Computers and Applications, 11(10), 7–11.
Sheetal, & Shailendra. (2012). Efficient Preprocessing technique using Web log mining. International Journal of Advancements in Research &
Technology, 1 (6), 59-63.
Shukla, R., Silakari, S., & Chande, P. K. (2013). Web Personalization Systems and Web Usage Mining: A Review. International Journal of
Computers and Applications, 72(21), 6–13. doi:10.5120/10468-5189
Srivastava, J., Cooley, R., Desphande, M., & Tan, P.-N. (2000). Web Usage Mining: Discovery and Applications of Usage Patterns from Web
Data. ACM SIGKDD Explorations Newsletter , 1(2), 12–23. doi:10.1145/846183.846188
Srivastava, J., Desikan, P., & Kumar, V. (2002). Web Mining Accomplishments & Future Directions. National Science Foundation Workshop on
Next Generation Data Mining (NGDM'02), (pp. 51-56).
Sujatha, M., & Punithavalli. (2011). A Study of Web Navigation Pattern Using Clustering Algorithm in Web Log Files. International Journal of
Scientific & Engineering Research, 2(9), 1–5.
Sumathi, C. P., PadmajaValli, R., & Santhanam, T. (2011). An Overview of preprocessing of Web Log Files. Journal of Theoretical and Applied
Information Technology , 34(2), 178–185.
Valera, M., & Rathod, K. (2013). A novel approach of Mining Frequent Sequential Patterns from Customized Web Log
Preprocessing. International Journal of Engineering Research and Applications , 3(1), 369–380.
KEY TERMS AND DEFINITIONS
Big Data: Any data set that is too difficult to be handled by traditional database systems due to its size and complexity.
Privacy: Refers to users’ concerns about sharing personal information and revealing their browsing habits. Users want the information they share
with a website not to be shared with third parties.
Session: A time period that starts when the user accesses the website and ends when they leave. It also refers to the list of web pages visited
during this period.
Unstructured Data: Information which is not stored in a database. It often includes text and multimedia content such as e-mail messages, photos,
videos, and so on.
Web Analytics: The analysis of web data to understand web usage and to improve the effectiveness of a website. It is also a tool for deriving
business intelligence.
Web Cache: A mechanism for storing web pages temporarily to improve browsing, save bandwidth, and reduce the load on the server.
Web Log File: A text file that contains information about all requests made to the server.
Web Personalization: Providing users with customized services or features, without expecting them to ask for it explicitly, by adapting the
presentation of the website.
Asad I. Khan
Monash University, Australia
Heinz W. Schmidt
RMIT University, Australia
ABSTRACT
One of the main challenges for large-scale computer clouds dealing with massive real-time data is in coping with the rate at which unprocessed
data is being accumulated. Transforming big data into valuable information requires a fundamental re-think of the way in which future data
management models will need to be developed on the Internet. Unlike the existing relational schemes, pattern-matching approaches can analyze
data in similar ways to which our brain links information. Such interactions when implemented in voluminous data clouds can assist in finding
overarching relations in complex and highly distributed data sets. In this chapter, a different perspective on data recognition is considered: rather
than looking at conventional approaches, such as statistical computations and deterministic learning schemes, this chapter focuses on a
distributed processing approach for scalable data recognition and processing.
INTRODUCTION
Recent advancements in computing technology and data analysis have brought forward the ability to generate enormous volumes of highly-
complex data, which have called for a paradigm shift in computing architecture and large-scale data processing approaches. Jim Gray, a
distinguished database researcher and manager of Microsoft Research's eScience group, called the shift a “fourth paradigm”; the first three
paradigms were defined as experimental, theoretical and, more recently, computational science (Hey, Tansly & Tolle, 2009). Gray argued that the
only solution to this outgrowth of big data, commonly known as the data deluge, is to develop a new set of computing tools to process and analyze
the data flood, as existing computer architectures are becoming increasingly incapable of dealing with data-intensive tasks because of the
constantly growing latency gap between multi-core CPUs and mechanical hard disks (Gray, Bell & Szalay, 2006). In fact, with an emerging
interest in leveraging the massive amounts of data available in open sources such as the Web for solving long-standing information retrieval
problems, the question remains how to effectively incorporate and efficiently exploit immense data sets. This question brings to the forefront a
crucial need for high levels of scalability in the world of big data, and it reinforces Moore’s law of exponential increases in computing power and
solid-state memory (Moore, 2000), which states that:
The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this
rate can be expected to continue, if not to increase (pg. 57).
Although this initially referred to transistor counts within a processor, the effect of this law seems to be applicable in almost all areas of
computing, including data generation and analysis. The implications of Moore's law are quite profound, as it is one of the few stable rulers we
have today; in other words, it is a sort of technological barometer (Malone, 1996):
It very clearly tells you that if you take the information processing power you have today and multiply by two, that will be what your
competition will be doing 18 months from now. And that is where you too will have to be (pg. 6).
This outgrowth of big data has significant implications regarding the existing developments of computing applications. According to Anderson
(2011), the chief editor of Wired magazine:
Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first
search engine crawlers made it a single database. Now Google and likeminded companies are sifting through the most measured age in
history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age. The Petabyte Age is
different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk
arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to
the library analogy to well, at petabytes we ran out of organizational analogies (pg. 769).
As human beings, our brains could be viewed as large-scale distributed and interconnected networks of sensory systems and memories.
Observing, recognizing and recalling what we have seen contribute to a significant portion of the activities conducted within these large-scale
networks. Provided that an optimal solution is found for the scalability problem, the internet could provide the levels of interconnectivity and
complexity that bear a resemblance to the human brain. Harnessing the massive potential embodied within these distributed networks of
interconnected high-performance machines may provide recognition and processing capabilities for large-scale and highly-complex data.
LARGE SCALE AND BIG DATA RECOGNITION
Transforming big data into valuable information is an issue that real-world systems must grapple with. In fact, more data translates into more
effective algorithms, so it makes sense to take advantage of the enormous amounts of data that exist. In this regard, the development of
powerful high-resolution data-capture instruments and sensors, in areas such as satellite and biomedical imaging, has resulted in a massive
production of voluminous and complex data. In satellite imaging applications, including the geographical information system (GIS) and the
global positioning system (GPS), depending on the in-depth resolution of images required, the amount of data produced can be quite excessive.
These huge data sets need to be properly processed before they can be used in relevant applications.
In biomedical imaging, intelligent processing approaches are commonly employed to extract critical information from high-dimensional images
obtained through sophisticated imaging schemes, such as Magnetic Resonance Imaging (MRI) to help medical experts with their diagnosis. With
the advent of high-resolution imaging techniques along with the advancements in high speed networking and storage technologies, medical
experts are capable of conducting a collaborative diagnosis by collecting data from various sensory and imaging instruments over large networks
then storing and accessing these data within distributed repositories. With all of these capabilities at hand, the volume of data generated and
processed can be at the scale of the Internet. Furthermore, rapid advancements in large-scale scientific analysis activities have inspired the
development of sophisticated and state-of-the-art technologies. One example is the advent of next-generation DNA sequencing technology which
has resulted in a deluge of sequenced data. This enormous amount of data needs to be efficiently stored, indexed, organized and delivered to
scientists for further analysis. Considering the fact that in modern genetics, genotypes explain phenotypes, the implications of this advanced
technology are nothing less than transformative (Elaine, 2008).
The European Bioinformatics Institute (EBI), which hosts a central repository of sequence data called EMBL-bank, is a prime example of growth
in data storage. EBI increased their storage capacity from 2.5 petabytes in 2008 to 18 petabytes in 2013. Medical experts are claiming that, in the
near future, sequencing an individual’s genome will be no more complex than getting a simple blood test today – resulting in a new era of
personalized medicine, where prescriptions can be specifically targeted at an individual. Another example, mentioned in the work of Fox et al.
(2005), is the development of sophisticated data-capture instruments and sensors: the Large Hadron Collider in high energy physics and the
Interferometric Synthetic Aperture Radar (InSAR) have resulted in the consistent generation of large volumes of highly complex multi-dimensional data.
In fact, today’s rapid generation of highly complex and large-scale data sets is the result of significant advancements in the building and
deployment of state-of-the-art technologies for data capturing and processing. Clearly, petabyte datasets are rapidly becoming the norm, and the
trends are obvious: the ability to store data is fast overwhelming the ability to process what is stored. In this regard, the need for highly
sophisticated computational schemes is evident, as the volumes of generated data make it absolutely impractical for data analysts to do any form
of data processing without the right tools at hand. However, existing data mining schemes mostly suffer from various shortcomings, including the
algorithmic complexity of the deployed methods. For example, depending on the form of pruning applied, the order of complexity of a decision
tree classification tool can range from O(n log n) to O(n²) or even worse (Kamath and Musick, 1998), which in turn makes such tools practically
infeasible for use in large-scale data processing approaches.
Moreover, the rapid expansion of the integration of various computational devices and sensor networks with the Internet has created a pervasive
computational framework which we refer to as the Internet of Things (IoT) (Kopetz, 2011). This development builds a bridge between the
physical and information domains and creates a smart space where a large number of high-performance computational devices can interact in
real time, providing various services, a model analogous to the human biological nervous system. The problem arises when there is an enormous
amount of data captured from various sensing and computer systems and an urgent need to process this data load more or less in real time.
Cloud Computing
Cloud computing presents a pay-per-use paradigm for providing services over the Internet in a scalable manner. Supporting data intensive
applications is an essential requirement in the cloud. Due to changes in the data access patterns of applications and necessity to use thousands of
compute nodes, major cloud computing companies have started to integrate frameworks for parallel data processing in their product portfolio.
Such integration makes it easier for customers to access these services and to deploy the applications. This integration can create efficiencies
through wide-spread use of multi-core CPUs, cost reduction for commodity hardware, enhanced performance and higher reliability in use
derived from an architectural paradigm which favors a massively distributed data processing framework running on a large number of
inexpensive compute nodes. Large data operations such as processing crawled documents or regenerating a web index are split into several
independent subtasks, distributed among the available nodes and computed in parallel within the network.
To simplify the development of distributed applications on top of such highly distributed architectures, customized data processing frameworks
are developed and deployed. Well-known examples are Google’s MapReduce (Dean & Ghemawat, 2004), Microsoft’s Dryad (Isard, Budiu, Yu,
Birrell & Fetterly, 2007) and Yahoo!’s Map-Reduce-Merge (Yang, Dasdan, Hsiao & Parker, 2007). Although these schemes differ in structure,
their design concepts share similar objectives, namely hiding the complexity of parallel programming, fault tolerance and execution optimization
issues from the developer. In fact, developers can typically write sequential programs, and the processing framework then takes care of
distributing the program among the available compute nodes and executing each instance of the program on the appropriate fragment of the data
set. Hence, the emergence of successful cloud computing projects can be mainly attributed to commoditizing parallelism for solving the data
management problem.
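To make the programming model concrete, a toy word count in the MapReduce style is sketched below; it only imitates in a single process what frameworks such as MapReduce or Dryad distribute across many nodes, and it is not the API of any of the systems named above.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Emit (key, value) pairs; here: one ('word', 1) pair per word."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values of one key; here: sum the counts."""
    return key, sum(values)

documents = ["big data needs big tools", "data in the cloud"]
mapped = chain.from_iterable(map_phase(d) for d in documents)   # map tasks run independently
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)   # {'big': 2, 'data': 2, 'needs': 1, 'tools': 1, 'in': 1, 'the': 1, 'cloud': 1}
```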
However, the dynamic and distributed nature of the cloud computing environment makes data management processes very complicated,
especially in the case of real-time data processing/database updating (Szalay, Bunn, Gray, Foster & Raicu, 2006). To cope with today’s intensive
data workloads, scalable Database Management Systems (DBMS) are a critical component of the cloud infrastructure and play a paramount role
in ensuring the smooth transition of applications from the traditional enterprise frameworks to the next generation of cloud computing services.
Distributed data management has been the vision of the database research community for a long period of time; however, much of this research
has been focused on designing scalable schemes for intensive workloads in traditional large-scale data processing settings with a lesser impetus
on re-designing the processing architecture to keep up with big data. While the opportunities for parallelization and distribution of data in clouds
have brought some efficiency, in particular with existing relational and object-oriented data models, storage and retrieval processes have
increased in complexity, especially for massively parallel real-time data. Chaiken et al. (2008) observed that the challenge of processing
voluminous data sets in a scalable and cost-efficient manner has rendered traditional database solutions prohibitively expensive. At the other end
of the spectrum, high-performance computing (HPC) has advanced rapidly but has generally focused on computational complexity and
performance improvements. Virtual HPC in the cloud has significant limitations especially when big data is involved. According to Shiers (2009),
“it is hard to understand how data intensive applications, such as those that exploit today’s production grid infrastructures, could achieve
adequate performance through the very high-level interfaces that are exposed in clouds” (pg. 3). The efficiency of the cloud system in dealing with
data intensive applications through parallel processing essentially lies in how data is partitioned and processing divided among the nodes. As a
result, data access schemes are sought to be able to efficiently handle this partitioning automatically and support the collaboration of nodes in a
reliable manner. Moreover, as noted earlier, major cloud computing companies have responded to changing data access patterns and the need for
thousands of compute nodes by integrating parallel data processing frameworks into their product portfolios, deriving efficiency from the
wide-spread use of multi-core CPUs, inexpensive commodity hardware, and massively distributed processing of independent subtasks. Currently,
there is no cloud approach capable of optimally managing large amounts of widely
distributed data over a heterogeneous infrastructure. Microsoft’s Dryad and Google's MapReduce have achieved greater scalability than parallel
databases at the cost of avoiding complex transaction support but still require customization of the analysis code. Moreover, real-time reliability
guarantees remain elusive.
In addition, existing large-scale data processing schemes such as MapReduce require the basic operations within an application to be isolated so
that data can be distributed and partitioned. This limits their applicability to many applications with complex data dependency considerations.
When used with complex data requirements, MapReduce models generally entail additional difficult and error-prone application-level
customizations. Hence, in practice, MapReduce cannot automatically scale up for many applications and data sets. The MapReduce model does
not explicitly provide support for processing multiple related heterogeneous datasets. While processing data in relational models is a common
requirement, this restriction limits its functionality when dealing with complex and unstructured data such as images. Relational databases use a
separate, uniquely-structured table for each different type of data for specific applications; programmers must know the precise structure of
every table and the meaning of every column a priori. To overcome this, we explore possibilities to evolve a novel virtualization scheme that can
efficiently partition and distribute data for clouds. For this matter, loosely-coupled associative techniques, not considered so far, can be pivotal to
effectively partition and distribute data in the cloud. Our associative model will use a universal structure for all data types. Information about the
logical structure of the data – metadata – and the rules that govern it may be stored alongside data. This allows programmers to work at a higher
level of abstraction without having to know the structural details of every data item. Hence, our approach to cloud-based data processing is
unique. It elevates the MapReduce key-value scheme to a higher level of functionality by replacing the purely quantitative key-value pairs with
higher order data structures that will improve parallel processing of data with complex associations (or dependencies). By having an associative
key/value framework, we can deal with data in any form and in any representation simply by using a pattern matching model (including
fuzziness), which treats data records as patterns and provides a distributed data access scheme that enables balanced data storage and retrieval
by association. We believe that the performance of MapReduce parallelism as a scalable scheme for data processing in clouds may be significantly
improved by transforming the data processing operations into one-shot distributed pattern matching sub-tasks, which in distributed
computations are performed in-network, enabling data storage and retrieval by association (instead of pre-set referential data access
mechanisms).
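The associative model itself is not specified in code in this chapter. Purely as an illustration of retrieval by association over key/value data, the toy sketch below returns the stored record whose key pattern is closest (by Hamming distance) to a possibly noisy query, rather than requiring an exact key match; it is not the Graph Neuron implementation discussed later.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))

class AssociativeStore:
    """Toy key/value store where lookup is by nearest pattern, not exact key."""
    def __init__(self):
        self.items = {}                       # pattern (tuple) -> value

    def put(self, pattern, value):
        self.items[tuple(pattern)] = value

    def get(self, query, tolerance=1):
        """Return the value whose key best matches the query within the tolerance."""
        best = min(self.items, key=lambda p: hamming(p, query), default=None)
        if best is not None and hamming(best, query) <= tolerance:
            return self.items[best]
        return None

store = AssociativeStore()
store.put("10110", "record A")
store.put("01001", "record B")
print(store.get("10010"))   # 'record A', despite one differing bit
```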
Feature Extraction
A practical solution to the challenge of voluminous data sets can be implemented through the use of pattern recognition/matching models, where
patterns represent a collection of captured data over a specific period of time. To obtain useful information from captured data, some sort of
feature extraction needs to be implemented in a very efficient manner. Feature extraction can be expressed as a mapping from a typically
high-dimensional data space to a reduced-dimension space, while maintaining some key properties of the data. This approach to feature/pattern
extraction is commonly referred to as data mining; it involves the process of uncovering patterns, determining associations between data objects,
detecting anomalies, and even predicting future data trends. In this regard, pattern recognition is a common processing tool used in a wide range
of applications, including medical diagnosis, environment and condition monitoring, decision-making, and various sorts of scientific exploration.
However, when it comes to processing enormous amounts of data, common pattern recognition schemes that operate within a CPU-centric
environment may not scale well to big data in the range of gigabytes or petabytes. Hence, a paradigm shift in data processing approaches is
essential to handle recognition and feature extraction at the Internet scale.
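As a small illustration of such a dimension-reducing mapping (not a method proposed in this chapter), principal component analysis projects high-dimensional records onto a few directions that preserve most of the variance:

```python
import numpy as np

def pca_reduce(X, k=2):
    """Project the rows of X onto the k principal directions of largest variance."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # principal directions via SVD of the centered data (numerically stable)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:k].T              # n_samples x k reduced representation

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 50))             # 100 records with 50 features each
print(pca_reduce(data, k=3).shape)            # (100, 3)
```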
PATTERN RECOGNITION
In very simple terms, a pattern may be expressed through a common denominator among multiple instances of an entity. In this regard, pattern
recognition schemes aim to make the process of detecting these common characteristics explicit in such a way that they can be employed in
computational devices to facilitate data processing by learning and adapting to the data's characteristics (Figure 1).
In recent years, interest in pattern recognition has been dramatically renewed, mainly due to the data explosion phenomenon. This data deluge,
along with rapid advancements in data capture technologies such as sensor networks, has called for a paradigm shift in recognition schemes and
analytical approaches. Current recognition schemes must be reconsidered from a larger perspective to scale with the growth of data to the
Internet scale. Scalability is one of the most important criteria when deploying an efficient pattern recognition model: to meet the requirements
of existing Internet-scale data, the capability of pattern recognition schemes should continue to grow and scale to minimize the risk of becoming
obsolete. In this regard, Pal and Mitra (2004) have restated the question of scalability as follows:
Can the pattern recognition algorithm process large data sets efficiently, while building from them the best possible models? (pg. 17).
This new surge in interest in scalable pattern recognition schemes is accompanied by the exponential growth of the data sizes generated by digital
media (images/audio/video), web authoring, scientific instruments and physical simulations. Thus the question of how to effectively process such
immense data sets is becoming increasingly important. Nevertheless, most existing models suffer from excessive computational complexity when
dealing with highly complex data sets. To achieve an adequate level of efficiency, a number of barriers should be overcome when implementing
pattern recognition. These include but are not limited to:
• Large Data: As the size of data increases over time, pattern recognition schemes should be able to cope with this outgrowth of data in the
most efficient and effective way. This in turn requires taking into account all relevant data considerations from the storage and transport
perspectives.
• High-Dimensional Data: In many application domains the data to be extracted from the environment is of considerably higher
dimensionality and is not basically spatial (e.g., biological data measuring gene features) within current data capture technologies. In this
context, pattern modeling schemes should be able to incorporate higher dimensionalities of data in processing and implementation.
• Algorithmic Complexity: To measure the performance of existing pattern recognition models, two aspects of algorithm performance, time
and space, must be considered. There is a need to know how fast the algorithm performs and what affects its runtime, and to know what sort
of data structure can or should be used to maximize performance. Although existing pattern recognition models are very powerful and
capable of providing efficient solutions, most suffer from excessive complexity due to their iterative nature and complex mathematical
foundations. Many have exponential complexity and are hence infeasible for implementation in large-scale data scenarios. Moreover, the
high cost of implementation in terms of time and space makes them too expensive for large-scale data.
Hence, any scheme for processing big data should be capable of addressing the increasing size and dimensionality of the data while minimizing
implementation complexity. In this regard, there are a few major techniques for scaling up pattern recognition when dealing with big data:
• Data Approach: In this model the captured data is pre-processed and modified in preparation for the recognition process. A number of
techniques have been proposed in the literature for this purpose, including Data Reduction (Chow and Huang, 2008), Dimensionality
Reduction (Rueda and Herrera, 2008) and Data Partitioning (Kbir, Maalmi, Benslimane & Benkirane, 2000). The ultimate goal is to reduce
the size and dimension of the data for faster and more efficient recognition, but the approach is liable to overlook the importance of data
integrity by shrinking the data domain.
• Learning Approach: A learning mechanism is a common component of pattern recognition schemes. Many attempts have been made by
researchers to reduce the computational complexity of the learning phase in favor of achieving scalable models with faster recognition speed;
a few examples are Active Learning (Cheng and Wang, 2007) and Incremental Learning (Schlimmer and Granger, 1986). The risk associated
with this approach is achieving faster recognition at the cost of recognition accuracy.
• Distributed Processing Approach: In this model the recognition workload is divided into sub-tasks that are processed in parallel across
multiple compute nodes, so that learning and recognition scale with the number of available nodes.
Among the above three computing approaches, distributed processing looks the most promising for scaling up with today's outgrowth of data.
Major advancements in parallel computing technology, from simple multithreading computational models to multi-core and graphical processing
unit (GPU) forms of distributed computing, have enabled large-scale processing to be performed in much more elegant and efficient ways;
however, some existing models are extremely complex and highly cumbersome to parallelize. Moreover, the scalability of deployed methods for
processing voluminous data is still an open problem, and existing data management schemes do not work well when data is partitioned
dynamically among numerous available nodes. Thus the question of how to effectively process large-scale data sets is becoming increasingly
important. Approaches towards parallel data processing in the cloud, which offer greater portability, manageability and compatibility of
applications and data, are yet to be fully explored. In this regard, an active area of research is the use of bio-inspired mechanisms to reduce data
analytic complexity.
Neural network approaches, which have so far not been considered, can provide the breakthrough needed for cloud-based data management.
Neural networks can, in simple terms, be defined as interconnected parallel-computing networks of massive numbers of processing nodes, known as
neurons. One of the main benefits of using neural network techniques for data processing is that they let the system learn from data
and adapt to the nature of that data. This adaptive feature offers a promising tool for scalable large-scale recognition. However, a number of
issues need to be overcome in relation to their implementation and deployment. One of the main problems with
artificial neural networks (ANN) is that the computational complexity increases substantially with the problem size; these
algorithms often fail to scale up for large and complex datasets. Furthermore, there is no clear solution for optimally segmenting multidimensional
datasets. Addressing these shortcomings for large-scale data analysis will transform the way big data processing is done at present and will
create a new path for fast data dimensionality reduction and classification. The Graph Neuron (GN) scheme, on the other hand, has proved to be an
optimal solution for efficient distributed in-network data processing (Khan, 2002). GN has been tested in pattern recognition applications within
different types of distributed environments (Muhammad Amin and Khan, 2008). GN makes use of a graph-based model for pattern learning and
recognition. One of the strengths of this technique is its use of parallel in-network processing to address scalability issues effectively, a
primary concern in distributed approaches. Some of the techniques currently being researched by us aim to utilize human brain-like constructs to
make correct associations within data sets. The Hierarchical Graph Neuron (Nasution and Khan, 2008) was developed bio-mimetically for
real-time analysis of sensory data and has been successfully used by Basirat and Khan to search for complex patterns in very large data sets in
cloud computing (Basirat, Khan & Srinivasan, 2013).
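As a deliberately simplified illustration of association-based recall (this is not the GN or HGN algorithm, only a toy content-addressable store; the pattern width and bit values are hypothetical), the sketch below retrieves the stored bit pattern closest to a noisy query rather than looking a record up by key:

```cpp
#include <bitset>
#include <iostream>
#include <vector>

constexpr std::size_t WIDTH = 16;            // hypothetical fixed pattern width
using Pattern = std::bitset<WIDTH>;

// Recall the stored pattern with the smallest Hamming distance to the query.
// In a distributed setting each node would hold a slice of 'store' and the
// minimum would be combined across nodes; here everything is local.
Pattern recall(const std::vector<Pattern>& store, const Pattern& query) {
    Pattern best;
    std::size_t best_dist = WIDTH + 1;
    for (const auto& p : store) {
        std::size_t dist = (p ^ query).count();   // Hamming distance
        if (dist < best_dist) { best_dist = dist; best = p; }
    }
    return best;
}

int main() {
    std::vector<Pattern> store = {
        Pattern("1010101010101010"),
        Pattern("1111000011110000"),
        Pattern("0000111100001111"),
    };
    Pattern noisy_query("1111000011010000");      // second stored pattern with one bit flipped
    std::cout << "recalled: " << recall(store, noisy_query) << '\n';
}
```

In a distributed deployment, each node would hold only a slice of the stored patterns and report its local best match, so only small candidate results need to move across the network.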
CONCLUSION
The dynamic and distributed nature of cloud computing environments, and not least their exponential growth, makes real-time data management
complicated and storage, updates and analytics costly. We hypothesize that fundamental changes and improvements in data access and
movement are possible and beneficial for cloud-based processing. In other words, transforming big data into valuable information requires a
fundamental re-think of the way in which future data management models will need to be developed on the Internet. As previously discussed,
distributed pattern recognition approaches can be investigated as an optimal solution for large-scale data processing. Nevertheless, some major
obstacles should be overcome before they become efficient and suitable for cloud environments. In fact, existing distributed pattern recognition
models have mainly been formed along a top-down approach, in which relatively CPU-centric (or sequential) algorithms are instrumented
and enhanced to function in a distributed manner. In addition to this limitation, most current approaches implement distribution only partially, i.e.
in the context of training and validation (e.g. feed-forward neural networks and self-organizing maps). This chapter included a discussion of
redesigning data management architectures from a scalable distributed computing perspective, creating database-like functionality that treats
data records as patterns and that processes the database and handles the dynamic load using a distributed pattern recognition approach rather than
a referential data access mechanism. In this scheme, the principle of associative memory based learning will be implemented through
the use of hierarchically connected layers, with local feature learning at the lowest layer and upper layers combining features into higher
representations. This approach entails a two-fold benefit: applications based on associative computing models will efficiently utilize the
underlying hardware, which scales system resources up and down dynamically and automatically, and they will control data distribution and the allocation of
computational resources in the cloud. In order to achieve these objectives, an initial step would be to develop a distributed data
access scheme that enables record storage and retrieval by association, and thereby circumvents the partitioning issue experienced with
referential data access mechanisms. In our model, data records are treated as patterns. As a result, data storage and retrieval can be performed
using a distributed pattern recognition approach that is implemented through the integration of loosely-coupled computational networks,
followed by a divide-and-distribute approach that facilitates distribution of these networks within the cloud dynamically. Our online-learning
associative memory scheme is conceived on the principle that “moving computation is much cheaper than moving data”. Hence, it will provide
methods for automatic aggregation and partitioning of associated data in the cloud for widely used data sets.
This work was previously published in Strategic Data-Based Wisdom in the Big Data Era edited by John Girard, Deanna Klein, and Kristi
Berg, pages 198-208, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Anderson, C. (2011). The end of theory: The data deluge makes the scientific method obsolete . Discovery Magazine.
Basirat, A. H., Khan, A. I., & Srinivasan, B. (2013). Highly distributable associative memory based computational framework for parallel data
processing in cloud. In Proceedings of the 10th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and
Services. Tokyo, Japan: Academic Press.
Chaiken, R., Jenkins, B., Larson, P.A., Ramsey, B., Shakib, D., Weaver, S., & Zhou, J., (2008). SCOPE: Easy and efficient parallel processing of
massive data sets. Proceedings of Very Large Database Systems, 1(2), 1265 – 1276.
Cheng, J., & Wang, K. (2007). Active learning for image retrieval with Co-SVM . Pattern Recognition , 40(1), 330–334.
doi:10.1016/j.patcog.2006.06.005
Chih Yang, H., Dasdan, A., Hsiao, R. L., & Parker, D. S. (2007). Map-reduce-merge: Simplified relational data processing on large clusters. In
Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (pp. 1029 – 1040), New York: ACM.
10.1145/1247480.1247602
Chow, T. W. S., & Huang, D. (2008). Data reduction for pattern recognition and data analysis . Springer. doi:10.1007/978-3-540-78293-3_2
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Operating
Systems Design & Implementation. Berkeley, CA: Academic Press.
Elaine, R. M. (2008). The impact of next-generation sequencing technology on genetics . Trends in Genetics , 24(3), 133–141.
doi:10.1016/j.tig.2007.12.007
Fox, G. C., Aktas, M. S., Aydin, G., Donnellan, A., Gadgil, H., Granat, R., & Scharber, M. (2005). Building sensor filter grids: Information architecture for
the data deluge. In Proceedings of the 1st International Conference on Semantics, Knowledge and Grid. Washington, DC: Academic Press.
10.1109/SKG.2005.48
Gray, J., Bell, G., & Szalay, A. (2006). Petascale computational systems . IEEE Computer , 39(1), 110–112. doi:10.1109/MC.2006.29
Hey, T., Tansly, S., & Tolle, K. (2009). The fourth paradigm: Data-intensive scientific discovery . Microsoft Research.
Isard, M., Budiu, M., Yu, Y., Birrell, A., & Fetterly, D. (2007). Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings
of the 2nd ACM European Conference on Computer Systems (pp. 59 – 72). New York: ACM. 10.1145/1272996.1273005
Kbir, M. A., Maalmi, K., Benslimane, R., & Benkirane, H. (2000). Hierarchical fuzzy partition for pattern classification with fuzzy if-then rules
. Pattern Recognition Letters , 21(6-7), 503–509. doi:10.1016/S0167-8655(00)00015-5
Khan, A. I. (2002). A peer-to-peer associative memory network for intelligent information systems. In Proceedings of the 13th Australasian
Conference on Information Systems (Vol. 1, pp. 22-29). Academic Press.
Moore, G. E. (2000). Cramming more components onto integrated circuits. Academic Press.
Muhamad Amin, A. H., & Khan, A. I. (2008). Commodity-grid based distributed pattern recognition framework. In Proceedings
of the 6th Australasian Symposium on Grid Computing and eResearch. Wollongong, Australia: Academic Press.
Nasution, B. B., & Khan, A. I. (2008). A hierarchical graph neuron scheme for real-time pattern recognition . IEEE Transactions on Neural
Networks , 19(2), 212–229. doi:10.1109/TNN.2007.905857
Pal, S. K., & Mitra, P. (2004). Pattern recognition algorithms for data mining, scalability, knowledge discovery, and soft granular computing.
London, UK: Chapman & Hall, Ltd. doi:10.1201/9780203998076
Rueda, L., & Herrera, M. (2008). Linear dimensionality reduction by maximizing the Chernoff distance in the transformed space. Pattern
Recognition, 41(10), 3138–3152. doi:10.1016/j.patcog.2008.01.016
Schlimmer, J. C., & Granger, R. H. Jr. (1986). Incremental learning from noisy data . Machine Learning , 1(3), 317–354. doi:10.1007/BF00116895
Shiers, J. (2009). Grid today, clouds on the horizon . Computer Physics Communications , 180(4), 559–563. doi:10.1016/j.cpc.2008.11.027
Szalay, A., Bunn, A., Gray, J., Foster, I., & Raicu, I. (2006). The importance of data locality in distributed computing applications. In Proceedings
of the NSF Workflow Workshop. Academic Press.
KEY TERMS AND DEFINITIONS
Cloud Computing: Presents a pay-per-use paradigm for providing services over the Internet in a scalable manner.
Distributed Computing: Performing computations within the body of a network, exploiting the resource-sharing capabilities of distributed systems
to cope with incremental growth in resource demands.
Feature Extraction: A mapping from a typically high-dimensional data space to a reduced dimension space, while maintaining some key
properties of data.
Hierarchical Graph Neuron: A lightweight in-network processing algorithm which does not require expensive computations; hence, it is very
suitable for real-time applications with low memory requirements.
Pattern: Is expressed through the use of a common denominator among multiple instances of an entity.
Pattern Recognition: The process of observing and detecting common characteristics in such a way that they can be employed in
computational devices to facilitate data processing by learning and adapting to its characteristics.
Parallel Processing: Simultaneous use of more than one processing unit to perform computational tasks.
CHAPTER 44
Expressing Data, Space, and Time with Tableau Public™:
Harnessing Open Data to Enhance Visual Learning through Interactive Maps and Dashboards
Shalin Hai-Jew
Kansas State University, USA
ABSTRACT
Virtually every subject area depicted in a learning object could conceivably involve a space-time element. Theoretically, every event may be
mapped geospatially, and in time, these spatialized event maps may be overlaid with combined data (locations of particular natural and human-
made objects, demographics, and other phenomena) to enable the identification and analysis of time-space patterns and interrelationships. These
maps enable hypothesis formation, hunches, and the asking and answering of important research questions. The ability to integrate time-space
insights into research work is enhanced by the wide availability of multiple new sources of free geospatial data: open data from governments and
organizations (as part of Gov 2.0), locative information from social media platforms (as part of Web 2.0), and self-created geospatial datasets
from multiple sources. The resulting maps and data visualizations, imbued with a time context and the potential sequencing of maps over time,
enable fresh insights and increased understandings. In addition to the wide availability of validated geospatial data, Tableau Public is a free and
open cloud-based tool that enables the mapping of various data sets for visualizations that are pushed out onto a public gallery for public
consumption. The interactive dashboard enables users to explore the data and discover insights and patterns. Tableau Public is a tool that
enables enhanced visual- and interaction-based knowing through interactive Web-friendly maps, panel charts, and data dashboards. With
virtually zero computational or hosting costs (for the user), the integration of geospatial mapping and analysis that Tableau Public enables stands to
benefit research work, data exploration, discovery, analysis, and learning.
INTRODUCTION
In the age of “big data” and “open data,” the broad public has access to more information than they have ever had historically. Some of this data
has been released to the public domain through open government endeavors (Gov 2.0). Others have been shared as part of the “digital exhaust”
(or “data exhaust”) of Web 2.0, or the social age of the Web (with APIs enabling access to a range of social media platforms, social networking
sites, microblogging sites, wikis, blogs, and other platforms). Another source of datasets comes from academia, with a range of sites that host
downloadable datasets as part of the formal publication process. Beyond these open-access and/or open-source datasets, there are also
proprietary ones released by for-profit companies as part of their public service and public relations outreach. Much data are collected by sensor
networks in physical spaces and robots on the Internet and Web. The popularity of mobile devices, navigation systems, and software applications
that track location data means that there are publicly available datasets of geo-spatial or locative information. Much of this data, though, cannot
be understood coherently without running them through data analysis and visualization tools—to identify patterns and anomalies, as well as
create data-based maps, graphs, and charts. If “big data” is going to have direct relevance to the general public, they have to be “big data”-literate:
they have to be able to understand and query big data. Concomitant with the “democratization of data” are some tools that enable data processing
and visualization. One leading free online tool, Tableau Public, the free public version of a professional enterprise suite (Tableau Professional),
serves as a gateway to such data analysis and visualization. This chapter provides a light overview of the software tool and its possible use in
multimedia presentations to enhance discovery learning.
The dynamically generated visualizations themselves are almost invariably multivariate and multi-dimensional, with labeled data in a variety of
accessible visualizations; these may include multiple pages of related visualizations. While these visualizations may be complex, there are others
that may be created for other purposes than research and discovery; some data visualizations are whimsical and attention-getting (to capture
attention and encourage awareness of the data).
The public edition of Tableau Public provides a gigabyte of storage for each registered user. There is no “save” for the visualizations except
through publishing out the data visualizations on the Tableau Public Gallery, which makes all visualizations publicly viewable in an infographics
and data visualization gallery, along with the downloads of related datasets. The panel charts are zoomable and pan-able; they enable data
filtering (with responsive dynamic revisualizations). This gallery is of particular use if public awareness is part of the desired outcomes. Such maps and
data visualizations have been used in an emerging class of digital narratives, used by journalists and other storytellers (Segel & Heer, 2010) in
computational journalism, or journalism which relies on in-depth data processing to tell “data stories” or “narrative visualizations”. Such
presentations meld data, statistics, design, information technology, and storytelling, for a broad audience. As such, these visualizations appear
also as parts of commercial sites based around business, real estate, and others. They are used as online conversation starters in a variety of
contexts.
Contents may be authored in Windows machines only, but the dashboards and data visualizations are viewable on Windows and Mac machines
without any plug-ins required (just browsers with JavaScript enabled). The makers of this tool use the tagline, “Data In. Brilliance Out,” to
express their objectives for the tool. As a tool for multimedia presentations, Tableau Public enables rich and interactive data visualizations to
broaden the perceptual (visual) and cognitive (symbolic reasoning, textual, and kinesthetic) learning channels to understand data. Tableau Public
is a free tool (albeit a cloud-hosted solution) that enables the uploading of complex data (in Excel and text formats) for intuitive presentations
on the Web. Such interactive depictions offer accessible ways of understanding interrelationships and potential trends over time, and provide
some initial predictive analytics.
REVIEW OF THE LITERATURE
The phenomena of Web 2.0 (or the Social Web) and Gov 2.0 (social e-government) have meant that there is wide availability of datasets linked to
social media platforms and open government geospatial datasets, among others. Social media platform data tend to be self-organizing and
dynamic data based on people’s lifestyles and activities; these involve the mapping of social networks based on people’s electronic
communications through social networking sites, microblogging sites, blog sites, wiki sites, and so on. The latter involve a mix of more static (less
dynamically changeable) data about citizen demographics, business records, economics, national security, law enforcement, nature management,
weather, and other aspects. As with various types of data, such sets may be combined to highlight particular issues of interest as long as there is a
column of unique identifiers that can help researchers match records (such as records based on physical space).
Where the finesse of researchers comes in is in knowing what may be asserted from the mapped data and relating that to their research. Indeed,
geospatial mapping in academic works may not require specialist training in some cases; often it may involve common-sense mapping informed
by researchers’ specialist topical focus. The assumptions of mapping such data are simple: that physical location may be relevant and interesting
to represent certain phenomena. [The physical world is mapped, and all that is needed in most datasets to tie records to physical locations is a
column with location data (whether latitude and longitude, ZIP codes, cities, counties, states, provinces, and countries, or other indicators). The
quality of the location data enables various levels of specificity in terms of three-dimensional physical space. Some data are so precise that it can
locate a smart mobile device to within inches of its actual location. With phenomena and data related to locations, other knowledge of those
locations may be brought into play.] Map visualizations are often used for sharing the results of the research for specialist audiences as well as the
broader public. Visualizations may make geo-spatial understandings more broadly accessible. They may reduce complexity. Variables themselves
may be visually depicted (often as multi-faceted glyphs or semantic icons) to represent some of the differences in data. By convention,
information about variables is communicated through “position, size, shape and, more recently, movement” to convey interrelationships and
patterns (Manovich, 2011, p. 36).
Error and Data
Researchers in most fields have intensive training on the numerous ways that error may be introduced into data. While data visualizations enable
the asking of certain types of data questions, and the unpacking and exploration of complex data, they may also be misleading. This risk of errors
is a critical issue throughout the data collection and analysis pipeline. Figure 1 shows that error may be introduced at virtually every phase of the
process.
Figure 1, “Critical Junctures for the Introduction of Error in Data Collection and Analysis,” provides a general data collection and analysis
sequence with the following semi-linear (occasionally recursive) steps:
1. Theorizing/mental modeling,
2. Research design,
3. Data collection methods/data collection tools (technologies and instrumentation) / data sourcing and provenance,
4. Data labeling and application of metadata,
5. Data storage and transfer,
6. Data cleaning and formatting/anonymization and de-identification / creation of shadow datasets, extrapolations,
7. Data visualization,
8. Data analysis,
9. Presentation/write-up.
Errors are any distortions that may affect the accuracy of understandings; they may be intentional or unintentional. In theorizing or mental
modeling, the analyst may apply an inaccurate conceptualization over the information, which may then be viewed inaccurately given cognitive
dissonance and biases. The research design may introduce error into the data by introducing systematic (measurement) or sampling (non-
random) errors. The way those data are collected, the technologies and instrumentation used in their collection, and the selected data sourcing
and provenance may introduce error: the methods and tools have to align with the research context and what the researchers need to know, the
data sources have to be solid, and all tools should be precisely created, tested, calibrated, and applied. The labeling of the data and the application
of metadata may introduce error by allowing imprecision and inaccuracy. How the data are stored and transferred may introduce error if it is not
done securely (to ensure data validation and reliability). The work of data cleaning and formatting may introduce error through mishandling and
mislabeling; the anonymization and de-identification of data for attaining research standards may involve critical lossiness of information. Data
visualization may introduce error with data reduction or simplification, which reduces the ability to discriminate between the finer points or
nuances of given data. Data visualizations are summaries of the underlying data; they encapsulate complexity within their own simplicity. Those
who consume visualizations themselves without understandings of the underlying data and their provenance and treatment can be misled and
may experience false confidence about the knowability of the data and their level of knowledge of that data. (To elaborate, the data from a
theoretical model, a thought experiment, or a computational experiment should have a different resonance than real-world empirical data or
scientific research.) Data analysis involves some of the classic errors of analysts—data insufficiency, premature interpretive commitment,
confusing noise (non-information or “static”) for signal (a false positive, or Type I error: incorrectly rejecting a true null hypothesis), or
insensitivity to signal (a false negative, or Type II error: failing to reject a false null hypothesis). Certainly, there are many other potential
errors beyond the general ones mentioned here. There are unique challenges with accurate data for the respective research types and fields /
domains. With complexity and high-dimensional data, there are still other challenges. Clearly, it makes the best sense to get the error rates down
in the first place, before the work is actually done, but in some cases corrections may be applied afterwards. The discussion of that is beyond the
purview of this chapter, though.
While these steps are identified as junctures when error may be introduced, these are also the same junctures at which errors may be corrected
for and headed off. Any time that information is handled, it may be handled in a way that is thorough, ethical, accurate, and constructive.
TABLEAU PUBLIC: THE TOOL
The underlying software for Tableau Public was created as part of a doctoral project out of Stanford University known as the Polaris project
(“Polaris interactive database visualization”) by Chris Stolte in 2002. The tool would enable basic users to create visualizations from data even
without database experience. One researcher explains:
Starting out in 2003 as an output of a PhD project called Polaris, “an interface for the exploration of multidimensional databases that extends
the Pivot Table interface to directly generate a rich, expressive set of graphical displays” (Stolte, et al. 2008), it became commercialized as
Tableau later that year. In 2010, the free version, Tableau Public, was released. Tableau Public requires a client to be downloaded and
installed, and also requires an internet connection to function. Rather than accepting a specific data structure for a specific plot type, Tableau
accepts an entire database, and allows the user to explore the variables in the data via a variety of potential plots. (Oh, 2013, p. 5)
The commercial product was released in 2003. The public version, Tableau Public, launched in February 2010, requires users to register with
Tableau Software, Inc. and then download the Tableau Public desktop client. With this client, users are able to upload various types of datasets
into the tool for manipulation and data visualizations. A range of data types from various heterogeneous sources may also be integrated,
including from Google Analytics, Cloudera Hadoop, Google BigQuery, Microsoft SQL server, Oracle, and Teradata. There are also ways to extract
data from various data mart servers. The data may be files, datacubes (three- or higher dimensional data expressed through an array of values),
databases, data marts, and others (Morton, Balazinska, Grossman, Kosara, Mackinlay, & Halevy, 2012, p. 1). Tableau Public enables limited “data
blending”—when a primary data source is combined with a secondary data source through “join keys,” and duplicate records are excised from the
visualization. Tableau Public has an intuitive graphical user interface (GUI) for data ingestion (importation or upload). In the drop-down menu
for ways to connect to data, the file types enabled included Tableau data extract files, Microsoft Access, Microsoft Excel, and various text files
(such as in comma or vertical bar-separated formats).
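The data-blending behaviour described above can be pictured with a small, hedged sketch (the field names and values are hypothetical, and this is not Tableau's internal logic): rows of a secondary table are attached to the primary table wherever the join key matches, and repeated secondary rows on the same key are excised.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct PrimaryRow   { std::string state; int complaints; };
struct SecondaryRow { std::string state; double population_millions; };

int main() {
    // Hypothetical primary and secondary sources sharing the join key "state".
    std::vector<PrimaryRow> primary = {{"NY", 1200}, {"KS", 90}, {"CA", 2100}};
    std::vector<SecondaryRow> secondary = {{"NY", 19.6}, {"NY", 19.6}, {"CA", 38.3}};

    // Keep only the first secondary row per join key (duplicates excised).
    std::unordered_map<std::string, double> lookup;
    for (const auto& s : secondary) lookup.emplace(s.state, s.population_millions);

    // Blend: primary rows drive the output; unmatched keys simply show no value.
    for (const auto& p : primary) {
        auto it = lookup.find(p.state);
        std::cout << p.state << " complaints=" << p.complaints << " population(millions)=";
        if (it != lookup.end()) std::cout << it->second << '\n';
        else                    std::cout << "n/a" << '\n';
    }
}
```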
A variety of data visualizations may be created from the data. To use the tool’s nomenclature (which draws in part from common terms), users
may draw text tables, heat maps, highlight tables, symbol maps, filled maps, pie charts, horizontal bars, stacked bars, side-by-side bars,
treemaps, circle views, side-by-side circles, continuous lines, dual lines, area charts (continuous and discrete), dual combination, scatterplots,
histograms, box-and-whisker plots, Gantt charts, bullet graphs, and packed bubbles. Tableau Public has a built-in wizard that automatically
detects data types and suggests appropriate data visualizations and chart types based on the selected variables. If additional data are needed for a
full visualization, the “Show Me” tool will suggest what to add to the mix. Selected textual labels may be applied to each of the data points,
individual records, or nodes. Gradations of data may be indicated by color, size, locations, and other indicators.
The end user license agreement (“EULA”) for Tableau Public’s software is described on their site. It delimits the uses of the “media elements” and
visualizations created from the tool and disallows for-profit use. It reads: “For the avoidance of doubt, you may not sell, license or distribute
copies of the Media Elements by themselves or as part of any collection or product. All Media Elements are provided ‘AS IS’, and Tableau makes
no warranties, express or implied of any kind with respect to such Media Elements.”
Users may publish out their interactive worksheets, dashboards, and panel charts (with the related dataset offered as a downloadable file) on the
public Web gallery (located at https://www.tableausoftware.com/public/gallery) for anyone with any of the major Web browsers to view and
interact with the data. Users may also use links or embed code to share their information through websites, blogs, wikis, and emails. In terms of
interactivity, users may filter contents, visually explore the data, attain details of individual records, acquire data from different data sources, and
even download the originating dataset(s). Users may also interact in the Community around Tableau Public (located at
https://www.tableausoftware.com/public/community). The hosts of the site select a “Viz of the Day” to highlight notable data visualizations. The
tool itself is the free and public version of Tableau Software’s Desktop Business Intelligence / Business Analysis tool. The tool is JavaScript-based.
At present, the tool enables up to a million records, but big data is expected to present large challenges for data visualization (Morton, Balazinska,
Grossman, & Mackinlay, 2014). If the host machine lacks sufficient processing power, that could also limit the capabilities of the Tableau Public
desktop client (at which point a pop-up window indicates the limitation or the system hangs and crashes).
CREATING DATA VISUALIZATIONS WITH TABLEAU PUBLIC
To provide a sense of this tool, Tableau Public 8.1 was used to create some visualizations from real-world data. Figure 2, “Tableau Public
Graphical User Interface for the Desktop Client” provides a sense of the simplicity of the desktop client interface.
Once the visualizations are finalized, they may then be published out in a static format (such as through screenshots) or in a dynamic format
(such as through a website).
An overview of this closer-in process is conceptualized in Figure 4, “The Work Pipeline in Tableau Public’s Public Edition (One
Conceptualization).” This process begins at the point of verifying the provenance of the data and acquiring select data, whether from open-source
repositories or proprietary or self-created sources. It helps to know if the data provider has credibility in the field. It also helps to read the fine
print about the data. For example, one dataset explored (but not depicted) as part of the work for this chapter included interpolated data (the
construction of new data points from known data points); in other words, the data was not exact empirical data but a processed approximation
based on other empirical data. (It is possible to start the pipeline earlier with a research question or need to discover particular information.)
Once the data are selected, they may be integrated for mixed data sets or cleaned for more effective analysis. The data are then ingested into
Tableau Public and processed for various types of visualizations (maps and graphs) and dashboards. Additional labeling and annotation may be
done. Finally, the visualizations and datasets may be shared out on the public web gallery (through an enforced sharing based on the free tool), or
screenshots may be taken for static mapping. Clearly, the data analysis could also lead to further datasets and additional analysis.
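Since interpolated data appears in the example above, a one-formula sketch may help: a value between two known points (x0, y0) and (x1, y1) is estimated as y = y0 + (y1 - y0) * (x - x0) / (x1 - x0). The years and readings below are hypothetical.

```cpp
#include <iostream>

// Linear interpolation: estimate y at x from two known data points.
double lerp(double x0, double y0, double x1, double y1, double x) {
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
}

int main() {
    // Hypothetical: a reading of 10.0 in 2010 and 14.0 in 2014; the value
    // reported for 2012 is constructed, not measured.
    std::cout << lerp(2010, 10.0, 2014, 14.0, 2012) << '\n';   // prints 12
}
```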
SOME REAL-WORLD EXAMPLES
Understanding a tool requires putting it through the paces of the actual work. To this end, five real-world examples have been created. Example 1,
“Visualizing Consumer Complaints about Financial Products and Services” uses a dataset with over 200,000 records. Example 2, “Visualizing
Hospital Acquired Infections (HAIs) in Hospitals in New York State” focuses on issues of nosocomial infections. Example 3, “Popular Baby
Names: Beginning 2007” offers county-level information of popular baby names in the United States. Example 4, “Aviation Accidents Data from
the NTSB (from 1982 - 2014)” offers insights about when flights are most at risk. All four of the prior examples involve data from open-
source datasets created by various government agencies (with access through the Data.gov site). Finally, Example 5, “President Barack
Obama’s Tweets and Political Issues” uses an original dataset extracted from the Twitter microblogging site (using NCapture of NVivo) to show
integration of an original researcher-created dataset.
Example 1: Visualizing Consumer Complaints about Financial Products and Services
To provide a sense of how this process might work, a dataset of 208,474 records of consumer complaints about financial products and services
was accessed from the Data.gov site (at https://catalog.data.gov/dataset/consumer-complaint-database). The data was collected by the
Consumer Financial Protection Bureau. A screenshot of the raw dataset is available in Figure 5, “The Data Structure of the Consumer Complaints
Records of the Consumer Financial Protection Bureau.” As may be seen, the first row of data (Row 1) has the names of each of the columns of
data which follow. Below, in each of the cells are various data. The first column (the one on the furthest left) contains unique identifiers for the
respective record. Each row following the first one contains one record in the dataset. This is the basic structure of many common worksheets
and data tables. Any data ingested into Tableau Public needs to be in this format in order for the machine to know how to “read” the data. The
data in the worksheets or tables are read as either dimension data (descriptive data) or as measures (quantitative data). Location data may fall
into the dimension category (such as a country, state, province, county, ZIP code, etc.), or in the measures one (quantitative expression), such as
in terms of (generated) latitude and longitude. (Where this data structure knowledge is especially important is in realizing that various open-data
datasets may download as a zipped folder full of additional folders with data tables, PDF figures, data declarations and overviews files, topics and
resources, and even various other data file types).
Figure 5. The data structure of the consumer complaints
records of the Consumer Financial Protection Bureau
More particularly, for this consumer complaints dataset, the names of the fields in the first row read as follows: 1A = Complaint ID, 1B = Product,
1C = Sub-product, and so forth. This dataset also includes by-state information in columns F and G (State and then ZIP code). This
dataset shows some of the complaints still in progress...so this is not finalized data. In Tableau Public, the user begins through trial-and-error
learning by moving the elements from the dimensions or measures spaces onto the main visualization pane into the Columns or Rows text
windows and elsewhere in the workspace to see what visualizations may be created. It helps to start simple and not overload the visualization
with too many variables.
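A hedged sketch of how such a worksheet is typically “read” (this is not Tableau's code; the field names and the classification rule are illustrative assumptions): the first row supplies the column names, and a column whose every value parses as a number can be treated as a measure, otherwise as a dimension.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line on commas (no quoted-field handling, for brevity).
std::vector<std::string> split(const std::string& line) {
    std::vector<std::string> out;
    std::stringstream ss(line);
    std::string cell;
    while (std::getline(ss, cell, ',')) out.push_back(cell);
    return out;
}

// True if the whole string parses as a number.
bool is_numeric(const std::string& s) {
    if (s.empty()) return false;
    std::size_t pos = 0;
    try { std::stod(s, &pos); } catch (...) { return false; }
    return pos == s.size();
}

int main() {
    // Hypothetical worksheet: header row first, then one record per row.
    std::vector<std::string> rows = {
        "Complaint ID,Product,State,ZIP code",
        "468882,Mortgage,NY,10001",
        "468889,Credit card,KS,66502",
    };
    std::vector<std::string> header = split(rows[0]);
    for (std::size_t col = 0; col < header.size(); ++col) {
        bool numeric = true;
        for (std::size_t r = 1; r < rows.size(); ++r)
            if (!is_numeric(split(rows[r])[col])) numeric = false;
        std::cout << header[col] << " -> " << (numeric ? "measure" : "dimension") << '\n';
    }
}
```

Note that this naive rule would flag Complaint ID and ZIP code as measures even though they are usually treated as dimensions; spotting and correcting exactly that kind of mismatch is part of the trial-and-error work described above.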
The workspace simulates what users may experience as they mouseover (place the cursor over) certain parts of the data visualization. The
mouseover action brings up the selected “Detail” of each of the records represented by the particular node. (The screen capture tool did not
enable the capture of the dynamic pop-up of the detail window). In Figure 6, the mouseover-triggered pop-up window showed the name of the
Company, the Complaint ID, the Issue, the Product, the Sub-Issue, whether there was a Timely Response, and the Number of Records related to
that issue. The “packed bubble” visualization looks more like a “tree trunk” diagram. Here, the “rings” of the virtual tree trunk contain the
alphabetized names of the various financial institutions beginning with A’s in the middle and Z’s on the outside.
Another view of the data may involve a geospatial map to give a sense of proximity between these various types of financial complaints and to see
if there is a deep clustering of such cases (such as on the two coasts where such financial firms may be clustered). A Filled Map (a choropleth
map that would show frequency of complaints by state through intensity of color) may have been made with the same information. (Such
mapping of numerical data to space may make the information more accessible to some—who may more readily understand data plotted on a map
than tables of quantitative data). The tool itself is not drawing any maps (digital cartography) but is rather placing the data on pre-existing map
templates; additional overlays of spatialized data may be applied to the map visualizations.
Not all the variables have to be used, and the user drawing the graphs may select to portray only some of the available information. So
how well did the various financial institutions deal with resolving consumer complaints? Figure 7, “A Back-end Worksheet of Resolution
Measures for Financial Complaints (created in Tableau Public)” shows a table with data pullouts to show the speed of the response and the type
of resolution. Multiple visualizations were then created from one dataset, and certainly, many dozens more may be extracted to answer particular
data questions. Data visualizations are often created to answer targeted questions made of the data.
Example 2: Visualizing Hospital Acquired Infections (HAIs) in Hospitals in New York State
This second example involves information from the New York State Department of Health (NYSDOH), with 7,600 records of hospital-acquired
(nosocomial) infections from 2008 - 2012. A description on the dataset reads: “This includes central line-associated blood stream infections in
intensive care units; surgical site infections following colon, hip replacement/revision, and coronary artery bypass graft; and Clostridium difficile
infections.” This dataset, like many others, is in process—with a post-release audit and a revised file forthcoming a year after the original dataset
was made public. (Strong analyses often work backwards and forwards in time. There is not a fixity of understanding or a closing of a case if there
is any potential for new understandings.) This set was downloaded at https://catalog.data.gov/dataset/hospital-acquired-infections-beginning-
2008.
A note with the dataset reads: “Because of the complicated nature of the risk-adjustment methodology used to produce the HAI rates, the advice
of a statistician is recommended before attempting to manipulate the data. Hospital-specific risk-adjusted rates cannot simply be combined.” It
seems advisable to consult with both statisticians and professionals in the field before making assertions about any data. (The general
assumption in this chapter is that researchers are themselves experts in their respective fields and so know how to expertly engage the given
information.)
Figure 9 shows a screenshot of the dataset. Generally, it is a good idea to keep the dataset open for observations while interacting with Tableau
Public—so there are clear understandings of what types of information each of the columns contain and what the various variables mean, given
the wide variance in data labeling, their brevity, and original nomenclature. Figure 9, “Hospital-acquired infections from New York Hospitals
dataset (2008 - 2012)” shows a data visualization from the dataset that highlights the types of procedures that most commonly involved HAI by
year. Below the first visualization is another view of the same data albeit with one particular hospital highlighted to see what its main HAIs were
in each of the covered years.
Another visualization from the HAI dataset of the New York Department of Health gives an overview of the trendline data on which
procedures were the highest-risk across the dataset. Trendline data refer to time-varying (temporal) data that show the frequency of occurrences
over time—generally without data smoothing (no averaging of the adjacent data points). As such, these data may show tendencies and changes
over time; they show sequentiality. From the visualization, it’s clear that some paths start later because of a lack of information for some prior
years…or the possibility of non-existence of the issue in prior years.
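The trendline aggregation described above amounts to counting occurrences per time bucket with no smoothing; a minimal sketch with hypothetical records (not the NYSDOH data):

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Record { int year; std::string procedure; };

int main() {
    // Hypothetical HAI-style records: one row per reported infection.
    std::vector<Record> records = {
        {2008, "Colon"}, {2008, "Hip"}, {2009, "Colon"},
        {2010, "Colon"}, {2010, "Colon"}, {2010, "CABG"},
    };

    // Frequency of occurrences per year, in order, with no averaging of
    // adjacent points -- exactly what a raw trendline plots.
    std::map<int, int> counts;
    for (const auto& r : records) ++counts[r.year];
    for (const auto& [year, n] : counts) std::cout << year << ": " << n << '\n';
}
```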
Example 3: Popular Baby Names: Beginning 2007
This third example involves a dataset from the U.S. Department of Health and Human Services. This set involves the most popular baby names in
various counties or boroughs (based on birth certificates)…beginning in 2007 and running through 2012. An explanation came with the dataset:
“The frequency of the Baby Name is listed, if there are: 5 or more of the same baby name in a county outside of NYC; or 10 or more of the same
baby name in a NYC borough.” The data was downloaded from https://catalog.data.gov/dataset/baby-names-beginning-2007.
Figure 13. The most popular baby names in New York state (by
county or borough) from 2007 – 2012 (Trendline
Visualization)
A dashboard is a mix of worksheets, with the visualizations, legends, filters, text labels, and other elements. When deployed on the Web (through
the Tableau Public gallery), these are interactive and often informative. Figure 14, “An Interactive Dashboard with both Space (County) and Time
Represented in First-Name Popularity in Birth Certificates in New York State” provides a back-end view of just such a dashboard, this one
including both a sense of space and time.
Example 4: Aviation Accidents Data from the NTSB (from 1982 - 2014)
The fourth example comes from the National Transportation Safety Board (NTSB) and its aviation accident database. This extract from their
records lists accidents since 1982 in a comma-separated values (CSV) or text format. A descriptor that came with the data read: “The NTSB
aviation accident database contains information about civil aviation accidents and selected incidents within the United States, its territories and
possessions, and in international waters.” The dataset was downloaded from http://catalog.data.gov/dataset/ntsb-aviation-accident-database-
extract-of-aviation-accident-records-since-1982-.
In terms of civil aviation, one question of interest for investigators is in which broad phases of an airplane’s flight do most accidents occur? This
data visualization follows the taxi, standing, takeoff, climb, cruise, go-around, descent, approach, landing, maneuvering, and other
phases of a flight (in no particular order listed here).
Another view of this data may be the distribution of accidents and incidents across the world by location, as is shown in Figure 16, “Back-end
view of geographical spread of Civil aviation Accidents and Incidents (from the NTSB aviation accident database) (1982 – 2014)”.
Those interacting with this information may filter out particular data in order to enhance the focus. Are there certain locales where more of one
type of incident happens than another? If so, why would that possibly be?
Example 5: President Barack Obama’s Tweets and Political Issues
Finally, the last example involved a self-created dataset of U.S. President Barack Obama’s Tweets from his @BarackObama Twitter account
(https://twitter.com/BarackObama). The capture of this dataset was achieved with NCapture of NVivo 10. The target site mentioned 11,500
Tweets (a rounded-up number), with 652,000 accounts being followed by the @BarackObama account, and 42.3 million followers.
The capture of the Tweet stream dataset included re-tweets, with 3,229 microblogging messages captured.
Figure 20, “A Packed Bubbles Chart of Tweets (orange) and Subset of Re-Tweets with Hashtags Available on Mouseover” shows a predominance
of fresh Tweets with a smaller set of retweets.
Figure 20. A packed bubbles chart of Tweets (orange) and
subset of Re-Tweets with hashtags Available on Mouseover
To gain a sense of some of the topics being Tweeted about, Figure 21, “A Packed Bubbles Chart of Topics (based on Hashtags) @BarackObama on
Twitter” shows dominant issues as recurrent terms. The size of the bubble here is based on the number of Tweets (represented as individual
records). While Tableau Public could not create a geographical dimension, given the sparseness of such geo data (and a lack of consistent method
for indicating geospatial information on Twitter), NVivo could. Figure 22, “A Geographical Map of Tweets @BarackObama from the NVivo Tool”
represents this NVivo-created map extrapolated from the @BarackObama tweets and re-tweets.
Figure 21. A packed bubbles chart of topics (based on
hashtags) @BarackObama on Twitter
Figure 22. A geographical map of Tweets @BarackObama
from the NVivo Tool
Some data extraction tools that are used to extract data from social media platforms may have built-in data structures that are not necessarily
conducive to data visualizations in Tableau Public. As an example, the open-source and freeware tool NodeXL (Network Overview, Discovery,
and Exploration for Excel) enables extracting social network data from various social media platforms and their visualization based on a number
of popular layout algorithms, but the datasets extracted are not easily readable by Tableau Public (without some human intervention).
Discussion
The uses of data visualizations in a multimedia presentation may enhance user understanding of particular concepts or relationships or
phenomena. The foregrounding of some information will necessarily “background” other information. While some data are brought to the fore,
others are obscured. There are almost always trade-offs.
The users of the data visualizations will also approach that information with varying skill levels and sophistication; they will have different
purposes in using that data. Depending on how they use their attentional resources, those referring to a particular visualization may have varying
degrees of understanding of the provenance of the data and of how those data are transformed into a diagram, chart, map, or dashboard. Some
data consumers will just view a visualization in its resting state while others will interact with the data and explore the informational depths and
implications. To meet the needs of a variety of potential users, those who would design such Tableau Public datasets, worksheets, and dashboards
would do well to gain a lot of experience in the work before going live. Every visualization should be as accurate as possible and as accessible and
machine-readable as possible. The data visualizations should be conducive to use by both those with the more popular “naïve” common-sense
geography (and understanding of data) and those with higher-level knowledge of the topic. The level of uncertainty inherent in data should be
defined and communicated. (After all, data is an “isolate” and abstraction from the real world. It should never be seen to map to the world with
full fidelity, for example.) All data and their visualizations should be explained and contextualized.
Those who are uninterested in going public with their data visualizations may stop short of going live. They may create the visualizations for
internal use only. They may take screenshots of the findings. They may maintain their own datasets…and also directions for how to achieve the
visualization for analysis…but not output any “save-able” finalized file from Tableau Public. Screenshots, as static portrayals of the dashboards,
may be taken for other purposes as well—such as report-writing, presentations, or publications.
Delimitations of the Software Tool
Tableau Public, as the free version, is limited by the types of data sources it may access. It is limited by the computational limits of the host
machine where the desktop client is downloaded. While there are some claims that the software itself is relatively easy to use (and it does have
plenty of wizard supports and help documentation, and it offers plenty of drag-and-drop features and does not require command line work), the
complexity in using this tool comes from striving to create coherence with the datasets. Another limit is the inability to edit hard-baked features
in the tool. Also, the two-dimensional visualizations are fairly standard; as such, a wide range of other possible ways of representing the data are
not included here, including 3D, fractal, word clouds, and other methods. Currently, there are no included data animations in Tableau Public (to
indicate changes over time).
FUTURE RESEARCH DIRECTIONS
The potential directions for further research are many. Certainly, the research literature would benefit from research on different applications of
the Tableau Public tool (and especially of the professional version). As the tool’s functionalities evolve, unique applications of a range of
visualization types and analytical approaches would benefit the work. More may be written about the development and evolution of this software
tool. Research on how to contextualize data visualizations with the proper lead-up and lead-away information would be beneficial—particularly in
regards to creating the proper explanatory depth.
CONCLUSION
This chapter has offered a simple introduction to Tableau Public and some of its functionalities with the visualizations of real-world datasets from
open sources and social media platforms. It has addressed the potential for error in the data collection and analysis work stream. It has shown
how Tableau Public may introduce the broad public to analyzing and visualizing data at minimal computational expense. This work suggests that the
broader public may access this free tool to build their sense of data literacy and geospatial understandings (spatial cognition). Specialists and
experts may use this tool to mine data, enhance analysis, improve decision-making, and communicate data visually in multimedia presentations
and websites.
While Tableau Public may serve as a powerful data visualization resource, it is a strong “gateway” tool to its own professional version and also to
other types of data visualization tools: geospatial data analysis and visualization in ArcGIS; large-scale data animations through Gapminder
World, Google Motion Charts; network depictions in NodeXL or UCINET; and others. Suffice it to say that there is expected to be ever-more open
data released to the public and evolving ways of analyzing and visualizing data. Tableau Public offers a fine point-of-entry to start experimenting
with data sets and making them coherent and understandable through interactive data visualizations in Web-friendly panel charts.
This work was previously published in Enhancing Qualitative and Mixed Methods Research with Technology edited by Shalin Hai-Jew, pages
556-582, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Morton, K., Balazinska, M., Grossman, D., Kosara, R., Mackinlay, J., & Halevy, A. (2012). A measurement study of two web-based collaborative
visual analytics systems (Technical Report UW-CSE-12-08-01). Univ. of Washington. Retrieved Mar. 10, 2014, at
ftp://ftp.cs.washington.edu/tr/2012/08/UW-CSE-12-08-01.PDF
KEY TERMS AND DEFINITIONS
Dashboard: A data panel of data visualizations and other information, a panel chart.
Data Blending: Connecting and mixing data from different sources for a new mixed dataset.
Data Mining: The exploration of data to exploit for informational value (through pattern and relationship identification).
Interpolation: A mathematical method of creating data points within a range of discrete and defined data points.
Legend (Map): An explanatory listing of the meanings behind the symbols and distances on a map.
Open-Source Data: Raw data that is released under free licensure (to promote universal access).
Self-Organization: The internally driven evolution and growth of an electronic community through local-level decisions by its members
(without external influences); an emergent social organizational phenomenon.
Wizard: A feature within software tools that guides users through the use of the tool (to simplify the experience).
Anirban Mitra
VITAM, India
ABSTRACT
One of the fundamental tasks in structured data mining is the discovery of frequent sub-structures. These discovered patterns can be used for
characterizing structured datasets, classifying and clustering complex structures, building graph indices, and performing similarity search in large
graph databases. In this chapter, the authors discuss the use of graph techniques to identify communities and sub-communities and to
derive a community structure for social network analysis, information extraction, and knowledge management. This chapter contributes to
graph mining and its application in social networks using community-based graphs. The initial section covers related literature and the definition of a community
graph and its usage in social contexts. Detecting the common community sub-graph between two community graphs comes under information
extraction using graph mining techniques. Examples from a movie database to village administration are considered here. C++ programming is
used, and outputs have been included to enhance the reader's interest.
INTRODUCTION
A social network can be defined as the set of relationships between individuals, where each individual is a social entity. It represents both the
collection of ties between people as well as the strength of those ties (Mitra, Satpathy, Paul, 2013). In a general way, a social network is used as a
measure of social “connectedness”, for observing and calculating the quality and quantity of information flow between
individuals and within groups. Hence, from the authors' point of view, a social network can be defined as a structure comprising social
actors (either groups of individuals or organizations) and the connectivity among them. Further, such a structured network can also be
called a social structure.
A network comprising social entities becomes active when relationships get established in the course of regular interaction in the process of daily
life and living, and through cultural activities such as marriages, thread ceremonies, yearly community celebrations, engagements, and so on.
Among many examples, a regular interaction may be a household requesting another for help, support or advice, the creation of a new friendship, or the
choice of individuals to spend leisure time together. Sometimes a relationship can be negative, i.e. hostility or alienation as against alliance,
mutuality or integration, with even the security aspect being an important factor (Tripathy & Mitra, 2012).
To extract information or patterns of interaction between two or more social entities, or between two or more social groups, one needs to look deep
into the properties of the social network and its interactions. To analyse this process, the authors have put the network into a mathematical
model using the concepts and properties of graph theory. Regarding one property of interaction, social networks show strong community
relationships, such that interactions may be limited to a specific group or community or may extend beyond the
virtual boundary of the community or group. Relationships between communities can be analyzed using the basic algebraic concept
of transitivity. Considering a simple example of three actors (say A1, A2 and A3), there is a high possibility that if A1 and A2 are friends and A2 and
A3 are friends, then most likely A1 and A3 are also friends (however, the degree of their friendship may differ and can easily be demonstrated using a
weighted graph). Further, the property of transitivity can be used to measure clustering coefficients.
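As a small illustration in the same C++ spirit as the chapter's examples (the adjacency matrix below is invented), the local clustering coefficient turns transitivity into a number: the fraction of pairs of an actor's friends who are themselves friends.

```cpp
#include <iostream>
#include <vector>

// Local clustering coefficient of vertex v in an undirected graph given as
// an adjacency matrix of 0s and 1s: closed pairs of neighbours of v divided
// by the number of possible pairs of v's neighbours.
double clustering_coefficient(const std::vector<std::vector<int>>& adj, int v) {
    std::vector<int> nbrs;
    for (int u = 0; u < static_cast<int>(adj.size()); ++u)
        if (adj[v][u]) nbrs.push_back(u);
    int k = static_cast<int>(nbrs.size());
    if (k < 2) return 0.0;
    int closed = 0;
    for (int i = 0; i < k; ++i)
        for (int j = i + 1; j < k; ++j)
            if (adj[nbrs[i]][nbrs[j]]) ++closed;   // friends of v who are also friends
    return 2.0 * closed / (k * (k - 1));
}

int main() {
    // Actors A1..A4: A1-A2, A2-A3, A1-A3 form a triangle; A4 connects only to A1.
    std::vector<std::vector<int>> adj = {
        {0, 1, 1, 1},
        {1, 0, 1, 0},
        {1, 1, 0, 0},
        {1, 0, 0, 0},
    };
    std::cout << "C(A1) = " << clustering_coefficient(adj, 0) << '\n';   // 1/3
}
```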
Due to the strong community effect, the actors or social entities in a network form groups which are closely connected. These
groups are termed modules, clusters, communities, or sub-groups. The authors have observed that individuals interact more
frequently within a group. Detecting such groups within a social network, also known as community detection, is a major
challenge in analyzing the social network. Extracting such communities helps in solving further tasks associated with the
analysis of social networks. Several works, in terms of definitions and approaches, are available in the area of community detection.
One of the most important processes in structured data mining is the discovery of frequent sub-structures. The authors begin with a discussion of various available techniques. Once at least one common vertex is found between two sub-structures, it is easy to merge both sub-structures into a larger structure. For this purpose, the authors follow simple graph techniques to identify at least one common community between the community sub-structures of two villages and merge both to produce a larger community structure. The authors propose a new algorithm which merges two community sub-graphs in an efficient, easy, and fast way. Each community sub-graph is represented as an adjacency matrix containing only 0s and 1s, so it can be stored as a bit matrix, which occupies substantially less space for a larger community graph; a small sketch of such a bit-matrix representation is given below. In later sections the authors discuss the algorithm, memory management, and examples which show how the proposed algorithm merges two community sub-graphs into one community graph, provided there is at least one common community between them.
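The following is a minimal sketch, assuming std::bitset rows of a fixed maximum width, of how a 0/1 community adjacency matrix could be stored compactly and how two such matrices might be combined. The constant MAXC and the edges are invented for illustration; this is not the authors' actual implementation.

#include <bitset>
#include <iostream>
#include <vector>

// Each row of the adjacency matrix is kept as a fixed-width bit string, so a
// 0/1 community sub-graph on up to MAXC communities needs MAXC bits per row
// instead of MAXC integers.
constexpr std::size_t MAXC = 64;
using BitRow = std::bitset<MAXC>;

int main() {
    std::vector<BitRow> g1(MAXC), g2(MAXC), merged(MAXC);

    // Hypothetical edges: communities 0-1 and 1-2 in the first sub-graph,
    // communities 1-2 and 2-3 in the second one (undirected, so set both ways).
    g1[0].set(1); g1[1].set(0); g1[1].set(2); g1[2].set(1);
    g2[1].set(2); g2[2].set(1); g2[2].set(3); g2[3].set(2);

    // Merging two sub-graphs that share at least one community is a row-wise OR.
    for (std::size_t i = 0; i < MAXC; ++i)
        merged[i] = g1[i] | g2[i];

    std::cout << "Edge 0-1 in merged graph? " << merged[0].test(1) << "\n";   // 1
    std::cout << "Edge 2-3 in merged graph? " << merged[2].test(3) << "\n";   // 1
    return 0;
}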
Given a set of graphs, frequent sub-structures can be considered graph patterns. Such patterns play an important role in various applications such as characterizing graph sets, analyzing differences among groups of graphs, classifying and clustering graphs, and building graph indices. The two fundamental steps in discovering frequent sub-structures are, first, to generate frequent sub-structure candidates and, second, to check the frequency of each candidate. Recent work by other researchers focuses on methods for discovering frequent sub-structures. The frequency-checking step requires sub-graph isomorphism testing, which is an NP-complete problem.
AGM, considered the first frequent sub-structure mining algorithm, was proposed by Inokuchi et al. (Inokuchi, Washio, and Motoda, 2000) and shares characteristics with Apriori-based item set mining (Agrawal and Srikant, 1994). Other algorithms are FSG (Maniam, 2004) and the path-join algorithm (Vanetik, Gudes, and Shimony, 2002). These algorithms use a join operation to merge two (or more) frequent sub-structures into one larger sub-structure, and they are distinguished by whether they join on vertices, edges, or edge-disjoint paths. In frequent sub-structure mining, Apriori-based algorithms carry two kinds of overhead: (i) joining two size-k frequent graphs (Vanetik, Gudes, and Shimony, 2002) to generate size-(k + 1) graph candidates, and (ii) checking the frequency of each candidate separately. These overheads are the main drawbacks of Apriori-based algorithms.
The Apriori-based approach follows a breadth-first search (BFS) strategy because of its level-wise candidate generation. To determine whether a size-(k + 1) graph is frequent, it has to check all of its corresponding size-k sub-graphs, so before mining any size-(k + 1) sub-graph, the Apriori-based approach must finish mining the size-k sub-graphs. The BFS strategy is therefore essential to the Apriori-like approach. The pattern growth approach is more flexible in its search method: in graph search, both breadth-first search (BFS) and depth-first search (DFS) can be used.
The present world can be called a digital world, with information highways and high-speed connectivity. In this digitized world, new methods have emerged for creating and storing huge amounts of structured and unstructured data (Infosys, 2013). The size of big data datasets is beyond the ability of common database software tools to capture, store, manage, and analyse; big data techniques apply where information cannot be processed or analyzed using traditional processes or tools. There is no single definition of big data, so we may define it as data that "is too big, moves too fast, or does not fit the structures of existing database architectures" (Infosys, 2013).
The organization of this chapter is as follows. The authors begin with a general overview, followed by related definitions and notations on graph theory and big data with a few of its characteristics. The next section discusses aspects of graph mining, which the authors explain using examples of a movie database and a representation of the World Wide Web, followed by an analysis of the proposed approach. Graph representation is one of the important steps in the process of knowledge extraction from graphs, so various graph representation techniques, such as sequential representation and linked list representation, are discussed. Sub-structure and graph matching for various kinds of graphs are covered in due course. The next section focuses on Apriori-based and non-Apriori-based techniques and the pattern growth approach, followed by implementations of those techniques as algorithms. Besides graph representation, graph grouping and community (or group) detection also have their own importance in knowledge extraction; various community detection techniques, their implementation, and their analysis are discussed in the section that follows. The last section gives an overview of the application of graph-theoretic concepts in big data analysis, followed by the conclusion.
DEFINITIONS AND NOTATIONS ON GRAPH THEORY
A social network, its actors, and the relationships between them can be represented using vertices and edges (Cook and Holder, 2007). The most basic parameters of a network (here, a digraph) are the number of vertices and the number of arcs; the authors denote the number of vertices by n and the number of arcs by m. An arc formed from two vertices u and v is denoted by uv, where u is the initial vertex and v is the terminal vertex of the arc uv.
• Converse of Digraph: The converse of a digraph G = (V, A) is the digraph H with the same vertex set V, where uv is an arc in H if and only if vu is an arc of G. Note that the adjacency matrix of the converse of G is the transpose A^T of the adjacency matrix A of G.
• Null Graph and Complete Digraph: A digraph G is said to be null if no two vertices of G are adjacent. G is said to be complete if, for any two distinct vertices u and v, at least one of uv and vu is an arc. Clearly a null graph on n vertices has no arcs, and any complete digraph on n vertices has at least C(n, 2) = n(n-1)/2 arcs.
• Symmetry in a Digraph: A digraph G is said to be symmetric if vu is an arc whenever uv is an arc. G is symmetric if and only if its
adjacency matrix is a symmetric matrix. G is said to be asymmetric (or anti-symmetric) if vu is not an arc whenever uv is an arc.
• Out-Degrees and In-Degrees: The arcs in a digraph G may not be evenly distributed over its vertices, so one considers the out-degree d+(u) of a vertex u in G, defined as the number of vertices v such that uv is an arc; it is the number of vertices that u is joined to. In a social network, d+(u) usually indicates the expansiveness of u. The out-degree sequence of a digraph with vertex set V = {v1, v2, ……., vn} is {d1, d2, ……., dn}, where di = d+(vi) for all i. The in-degree d-(u) of a vertex u in G is defined as the number of vertices w such that wu is an arc. In a social network, d-(u) usually indicates the popularity or power of u. The in-degree sequence of a digraph with vertex set V = {v1, v2, ……., vn} is {e1, e2, ……., en}, where ei = d-(vi) for all i. A small sketch of computing these degrees from an adjacency matrix follows this list.
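A small sketch of computing the out-degree and in-degree sequences from a digraph's adjacency matrix is shown below; the 4-vertex digraph is illustrative only.

#include <iostream>
#include <vector>

// Out-degree d+(u) is the number of 1s in row u; in-degree d-(u) is the number
// of 1s in column u, for a digraph stored as an n x n 0/1 adjacency matrix.
int main() {
    std::vector<std::vector<int>> A = {
        {0, 1, 1, 0},   // v1 -> v2, v1 -> v3
        {0, 0, 1, 0},   // v2 -> v3
        {0, 0, 0, 1},   // v3 -> v4
        {0, 0, 0, 0}};
    std::size_t n = A.size();
    for (std::size_t u = 0; u < n; ++u) {
        int outDeg = 0, inDeg = 0;
        for (std::size_t v = 0; v < n; ++v) {
            outDeg += A[u][v];
            inDeg  += A[v][u];
        }
        std::cout << "v" << u + 1 << ": out-degree " << outDeg
                  << ", in-degree " << inDeg << "\n";
    }
    return 0;
}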
ON BIG DATA
Definition: According to Dumbill (Dumbill, 2012), big data is data that exceeds the processing capacity of conventional database systems. Hence, the authors conclude that such data is too big, moves too fast, and does not fit into the structures of existing database architectures.
Moreover, IEEE further explains (ieee, 2014) that big data is a collection of data sets so large and complex that processing them using conventional database management tools or traditional data processing applications is a great challenge (ieee, 2014).
According to Manyika (Manyika, 2014), data whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze is known as big data (Manyika, 2014).
Characteristics of Big Data: The characteristics of big data are often explained using four simple points, the 4 Vs: Volume, Velocity, Variety, and Veracity (ieee, 2014).
In "Big Data", "big" refers to Volume, although volume is a relative or fuzzy notion. For smaller organisations, data of gigabytes or terabytes may be called voluminous, whereas for bigger organizations data sizes may range from petabytes to exabytes. For many organisations, dataset sizes are currently limited to the terabyte range but are expected to reach petabytes or exabytes in the future.
The velocity of data refers to the frequency of data generation and delivery. In the big data domain, Velocity is about how fast data arrives, how efficiently it is stored, and how quickly it can be retrieved within the required response time. Velocity can also be expressed as the speed of data flow inside the domain. Advances in information processing and streaming and the increase in networked sensors have allowed data to flow at a constant or specific pace.
Data can arrive from a variety of sources (internal as well as external) and in different types and structures. Due to advances in techniques, and with proper hardware support, the arriving data can be segregated and retrieved as structured traditional relational data as well as semi-structured and unstructured data.
• Structured Data: Data that can be organized into a schema (a finite set of rows and columns in a database), i.e. a relational scheme.
• Semi Structured Data: It is a kind of structured data that is not limited to an explicit and fixed schema or scheme. The data is inherently
self-describing and contains tags or other markers to enforce hierarchies of records and fields within the data. Examples include weblogs and
social media feeds.
• Unstructured Data: Data consists of formats which cannot easily be indexed into relational tables for analysis or querying. Examples
include images, audio and video files.
• Veracity: Veracity refers to the biases, noise, and abnormality in data. In data analysis, veracity is the biggest challenge when compared with volume and velocity. The quality of the captured data can vary rapidly, so the accuracy of any analysis depends entirely on the veracity of the source data.
Figure 1 and Figure 2 give an overview of big data and the relations among its logical components (Rao, Mitra, ichpca, 2014), (Batagelj and Pajek, 2003).
ASPECTS OF GRAPH MINING
A graph is a collection of nodes and links between nodes, and this representation supports all aspects of the relational data mining process. A graph easily represents entities, their attributes, and their relationships to other entities, since one entity can be arbitrarily related to other entities, as in a relational database; graph representations typically store each entity's relations together with the entity. Relational database and logic representations do not support direct visualization of data and knowledge, but relational information stored in this way can easily be converted to graph form for visualization. Using a graph to represent the data and the mined knowledge supports direct visualization and increases the comprehensibility of the knowledge. We can therefore say that mining graph data is one of the most promising approaches to extracting knowledge from relational databases.
Representation of Movie Database
Three common domains for mining graph data are the Internet Movie Database, the Mutagenesis dataset, and the World Wide Web. Several graph representations for the data in these domains have been proposed, and these databases serve as a benchmark set of problems for comparing and contrasting different graph-based data mining methods (Corneil and Gotlieb, 1970).
To represent movie information as a graph, relationships among movies, people, and attributes can be captured and included in the analysis. Figure 3 shows one possible representation of the information related to a single movie: each movie is a vertex, with links to attributes describing the movie. Similar graphs could be constructed for each person as well. With this representation, one can pose the following query: what commonalities can be found among movies in the database? The answer to this query yields the required knowledge; a small sketch of such a query is given below.
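The following C++ sketch illustrates one way such a commonality query might be answered over a toy movie graph, with each movie vertex linked to attribute vertices. The titles, attribute names, and the Movie structure are invented for illustration and are not the chapter's Figure 3 data.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// A movie vertex linked to attribute vertices, as in the representation
// discussed above; an attribute is a (name, value) pair hanging off the movie.
struct Movie {
    std::string title;
    std::map<std::string, std::string> attributes;   // attribute name -> value
};

int main() {
    std::vector<Movie> db = {
        {"Movie-1", {{"studio", "Studio-X"}, {"producer", "P. Rao"}}},
        {"Movie-2", {{"studio", "Studio-X"}, {"producer", "P. Rao"}}},
        {"Movie-3", {{"studio", "Studio-Y"}, {"producer", "A. Mitra"}}}};

    // Count how often each (attribute, value) vertex is linked to a movie; any
    // count greater than one is a commonality shared by several movies.
    std::map<std::pair<std::string, std::string>, int> count;
    for (const Movie& m : db)
        for (const auto& av : m.attributes)
            ++count[{av.first, av.second}];

    for (const auto& c : count)
        if (c.second > 1)
            std::cout << c.first.first << " = " << c.first.second
                      << " is shared by " << c.second << " movies\n";
    return 0;
}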
The authors can thus easily analyse the common relationships between objects in the movie database. For the movie graph above, a discovery algorithm may find the pattern that movies made by the same studio frequently also have the same producer. Jensen and Neville (Jensen and Neville, 2002) mention another type of discovery from a connected movie graph: a successful film star may be characterized in the graph by a sequence of successful movies in which he or she stars, as well as by winning one or more awards.
Ravasz and Barabasi (Ravasz and Barabasi, 2003) analyzed a movie graph constructed by linking actors appearing in the same movie and found that the graph has a hierarchical topology. Movie graphs can also be used for classification. Patterns can be mined from structural information that is explicitly provided; however, missing structure can also be inferred from the data, and mining algorithms can be used to infer missing links in the movie graph. For example, given some of the actors who starred together in a movie, link completion (Goldenberg and Moore, 2004), (Kubica, Goldenberg, Komarek, and Moore, 2003) can be used to determine who the remaining actors in the same movie are. The same link completion algorithms can also be used to determine whether one movie is a remake of another (Jensen and Neville, 2002).
Representation of World Wide Web
The World Wide Web is a valuable information resource that is complex, dynamic in content, and rich in structure. Mining the Web is a research area that is almost as old as the Web itself. According to Etzioni (Etzioni, 1996), Web mining refers to extracting information from Web-based documents and Web-based services. The types of information that can be extracted vary in nature, so Web mining has been refined into three classes of mining tasks: Web content mining, Web structure mining, and Web usage mining (Kolari and Joshi, 2004).
Web content mining algorithms attempt to mine patterns from the content of Web pages. The most common approach is to mine the content found within each page on the Web; this content generally consists of text, occasionally supplemented with HTML tags (Velez and Sheldon, 1996), (Chakrabarti, 2000). Using text mining techniques, the discovered patterns support the classification of Web pages and Web querying (Mendelzon, Michaila and Milo, 1996), (Zaiane and Han, 1995), (Berners-Lee, Hendler and Lassila, 2001).
When structure is added to Web data in the form of hyperlinks, analysts can perform Web structure mining. In a Web graph, vertices represent Web pages and edges represent links between the Web pages. The vertices can be labelled by the domain name (Cook and Holder, 2007), and edges are unlabeled or labelled with a uniform tag. Additional vertices, labelled with keywords or other textual information found in the Web page content, can be attached to the Web page nodes. Figure 7 shows a graph representation for a collection of three Web pages. With the inclusion of this hypertext information, Web page classification can be performed based on structure alone (Cook and Holder, 2007) or together with Web content information; algorithms (Gonzalez, Holder and Cook, 2002) that analyze Web pages based on more than textual content can also potentially learn more complex patterns.
Figure 7. Graph representation of web text and structure data
Other researchers focus on the structural information alone. Chakrabarti and Faloutsos (Cook and Holder, 2007) and others (Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, and Tomkins, 2000), (Kleinberg and Lawrence, 2001) have studied the unique attributes of graphs created from Web hyperlink information. Such hyperlink graphs can also be used to answer the question of finding patterns in the Web structure; again, the answer is an extraction of knowledge.
The authors have observed how frequently occurring sub-graphs can be discovered in such a Web topology graph (Cook and Holder, 2007), and how new or emerging communities of Web pages can be identified from it (Cook and Holder, 2007). Analysis of this graph leads to the identification of topic hubs (overview sites with links to strong authority pages) and authorities (highly ranked pages on a given topic) (Kleinberg, 1999). The PageRank program (Kleinberg, 1999) pre-computes page ranks based on the number of links to a page from other sites, together with the probability that a Web surfer will visit the page directly, without going through intermediary sites; a minimal sketch of this style of link-based ranking is given below. Desikan and Srivastava (Desikan and Srivastava, 2004) proposed a method for finding patterns in dynamically evolving graphs, which can provide insights on trends as well as potential intrusions.
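As a hedged illustration of this style of link-based ranking, the sketch below runs a simple power iteration with a damping factor over a tiny made-up link graph. It is not the actual program referred to above; the damping factor 0.85, the iteration count, and the link structure are assumptions for illustration only.

#include <algorithm>
#include <iostream>
#include <vector>

// One possible power-iteration sketch of a PageRank-style score: each page's
// rank mixes a "random jump" term with rank received through incoming links.
int main() {
    // links[u] lists the pages that page u links to (a tiny made-up Web graph).
    std::vector<std::vector<int>> links = {{1, 2}, {2}, {0}, {2}};
    const std::size_t n = links.size();
    const double d = 0.85;                       // damping factor (assumed)
    std::vector<double> rank(n, 1.0 / n), next(n);

    for (int iter = 0; iter < 50; ++iter) {
        std::fill(next.begin(), next.end(), (1.0 - d) / n);
        for (std::size_t u = 0; u < n; ++u)
            for (int v : links[u])
                next[v] += d * rank[u] / links[u].size();
        rank = next;
    }
    for (std::size_t u = 0; u < n; ++u)
        std::cout << "page " << u << ": " << rank[u] << "\n";
    return 0;
}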
Web usage mining is used to find commonalities in Web navigation patterns. Mining click-stream data on the client side has been investigated (Maniam, 2004), and data is also easily collected and mined from Web servers (Srivastava, Cooley, Deshpande and Tan, 2000). Didimo and Liotta (Cook and Holder, 2007) provide some graph representations and visualizations of navigation patterns. According to Berendt (Berendt, 2005), a graph representation of navigation allows the construction of an individual's view of a website. From the graph one can determine which pages act as starting points for the site, which collections of pages are navigated sequentially, and how easily pages within the site are accessed. Navigation graphs can be used to categorize Web surfers and can assist in organizing websites and ranking Web pages (Berendt, 2005), (McEneaney, 2001), (Zaki, 2002), (Meo, Lanzi, Matera and Esposito, 2004).
ANALYSIS ON THE PROPOSED APPROACH
Given the importance of the problem and the increased research activity, graph mining is needed to extract knowledge from the social community network, which is modelled as a complex graph in which each unit is an individual, a village, a household, a country, etc. The social community graph may thus be defined as a set of villages V = {V1, V2, ….., Vn} together with a set of links (edges) between villages E = {E1, E2, ……., En}. Figure 8 describes this community graph. Each village has a set of communities, and the connectivity among communities forms a community sub-graph (Rao and Mitra, Oct-2014). From the literature survey, analysis, and discussion, the authors provide solutions to the queries listed below. Here the authors propose a social community network, shown as a graph, analogous to the Movie and World Wide Web graphs.
Figure 8. Community Graph of village
Applying graph data management algorithms to the proposed social community network (Figure 9), one can mine the following knowledge (Rao and Mitra, Oct-2014):
Figure 9. Connected graph by dotted lines
• On applying the Indexing and Query Processing technique, the authors can mine similar communities and isolated communities.
• On applying Reachability Queries, the authors can find paths between villages using BFS or DFS techniques.
• On applying Keyword Search, the authors can find the community name that forms a community detection sub-graph, as shown in the earlier figure. In the authors' example, a community detection sub-graph for C5 is shown in Figure 10.
• On applying Synopsis Construction of Massive Graphs (Brandes, Kenis and Wagner, 2003), (Brandes and Erlebach, 2005), (Batagelj and Pajek, 2003), the authors can answer queries with sufficient information while maintaining the graph in a smaller space: the graph can be represented as a square matrix consisting of 0s and 1s, which occupies substantially less space.
Figure 10. Community detection graph for C5
GRAPH REPRESENTATION TECHNIQUES
A graph can be represented in two different ways (Seymour, 1999), (Mitra, Satpathy and Paul, 2013), (Rao and Mitra, Oct-2014), (Rao and Mitra,
ichpca-2014), (Rao and Mitra, iccic-2014). These two representations are explained in detail as follows.
SEQUENTIAL REPRESENTATION
The sequential representation of graphs in memory takes two forms.
Adjacency Matrix: Let G be a graph with n nodes or vertices V1, V2, ...., Vn, with one row and one column for each node or vertex. Then the adjacency matrix A = [aij] of the graph G is the n x n square matrix defined as:
aij = 1 if there is an edge from vertex Vi to vertex Vj,
aij = 0 otherwise.
A matrix of this kind, containing only 0s and 1s, is called a bit matrix or Boolean matrix. For an undirected graph, the adjacency matrix is symmetric. For example, the digraph G in (i) of Figure 11 has vertices V = {A, B, C, D, E} and the set of edges E = {(A, B), (A, C), (B, D), (A, E)}; the adjacency matrix of G is shown in (ii) of the same figure.
Path Matrix: The path matrix P = [pij] of the graph G is the n x n square matrix defined as:
pij = 1 if there is a path from vertex Vi to vertex Vj,
pij = 0 otherwise.
The path matrix only shows the presence or absence of a path between a pair of vertices, and the presence or absence of a cycle at a vertex; it cannot count the total number of paths in a graph. Let us consider a graph G = {A, B, C, D, E}. Its adjacency matrix and the final path matrix P are shown in Figure 12. The entry for A to D in (iii) of the same figure indicates the presence of a path in the graph even though there is no direct edge from A to D; the sketch below computes such a path matrix.
Figure 12. With Path matrix
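A path matrix of this kind can be obtained from the adjacency matrix with Warshall's algorithm; the sketch below assumes the same example digraph with edges (A, B), (A, C), (B, D), and (A, E).

#include <iostream>
#include <vector>

// Warshall's algorithm: turn the adjacency matrix A into the path matrix P,
// where P[i][j] = 1 exactly when some path leads from vertex i to vertex j.
int main() {
    std::vector<std::vector<int>> P = {
        {0, 1, 1, 0, 1},   // A -> B, A -> C, A -> E
        {0, 0, 0, 1, 0},   // B -> D
        {0, 0, 0, 0, 0},
        {0, 0, 0, 0, 0},
        {0, 0, 0, 0, 0}};
    std::size_t n = P.size();
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (P[i][k] == 1 && P[k][j] == 1) P[i][j] = 1;

    // A reaches D through B even though the edge A-D is absent in the original matrix.
    std::cout << "Path from A to D? " << P[0][3] << "\n";   // prints 1
    return 0;
}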
Linked List Representation
In this representation two types of lists are used: a node list and an edge list. A node of the node list is of a doubly linked kind and consists of three parts: Info, Next, and Adj. Info is the information part of a node or vertex, Next is a pointer which holds the address of the next node of the node list, and Adj is a pointer which holds the address of the node of the edge list where the actual adjacency is recorded. A node of the edge list is of a singly linked kind and consists of two parts, Node and Edge. Node is a pointer which holds the address of the node of the node list where the adjacent vertex is present, and Edge is a pointer which holds the address of the next node of the edge list.
Let us consider a graph G = {A, B, C, D}. The adjacency matrix and adjacency list for the graph G are shown in Figure 13. Using the adjacency list in (iii) of the same figure, one can draw its equivalent linked representation with the help of the edge list and node list; a small sketch of these structures follows.
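The sketch below mirrors these node-list and edge-list structures in C++. For brevity the node list is shown singly linked rather than doubly linked, and the graph and labels are illustrative.

#include <iostream>

// Node-list / edge-list (adjacency-list) representation sketched above.
// A node-list entry keeps the vertex label, the next vertex, and the head of
// its edge list; an edge-list entry points back to the adjacent vertex.
struct EdgeNode;

struct VertexNode {
    char info;            // Info: label of the vertex
    VertexNode* next;     // Next: next vertex in the node list
    EdgeNode* adj;        // Adj: first entry of this vertex's edge list
};

struct EdgeNode {
    VertexNode* node;     // Node: the adjacent vertex in the node list
    EdgeNode* edge;       // Edge: next entry of the edge list
};

int main() {
    // Graph G = {A, B, C, D} with edges A-B and A-C (illustrative only).
    VertexNode d{'D', nullptr, nullptr};
    VertexNode c{'C', &d, nullptr};
    VertexNode b{'B', &c, nullptr};
    VertexNode a{'A', &b, nullptr};

    EdgeNode ac{&c, nullptr};
    EdgeNode ab{&b, &ac};
    a.adj = &ab;

    for (EdgeNode* e = a.adj; e != nullptr; e = e->edge)
        std::cout << a.info << " is adjacent to " << e->node->info << "\n";
    return 0;
}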
Incidence Matrix
There are two types of incidence matrix: (i) the unoriented incidence matrix and (ii) the oriented incidence matrix. They are discussed in the next subsections.
Unoriented Incidence Matrix
The incidence matrix of an undirected graph is called an unoriented incidence matrix. The incidence matrix of an undirected graph G is a v x e matrix (MATij), where v and e are the numbers of vertices and edges respectively, such that MATij = 1 if vertex vi and edge ej are incident, and MATij = 0 if vi and ej are not incident.
Let us consider a graph G = {V1, V2, V3, V4} whose unoriented incidence matrix is shown in (i) of Figure 14. The order of the unoriented incidence matrix in (ii) of the same figure is 4 (number of rows) X 4 (number of columns).
Oriented Incidence Matrix
The incidence matrix of a directed graph is called an oriented incidence matrix. The incidence matrix of a directed graph G is a v x e matrix (MATij), where v and e are the numbers of vertices and edges respectively, such that MATij = 1 if the edge ej points away from vertex vi, MATij = -1 if the edge ej points to the vertex vi, and MATij = 0 if there is no incidence at all.
Let us consider a graph G = {V1, V2, V3, V4} whose oriented incidence matrix is shown in (i) of Figure 15. The order of the oriented incidence matrix in (ii) of the same figure is 4 (number of rows) X 4 (number of columns). A sketch constructing both kinds of incidence matrix is given below the figure.
Figure 15. Incidence matrix
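The following sketch builds both the unoriented and the oriented incidence matrices for a small made-up directed graph on four vertices; the edge list is an assumption chosen only for illustration.

#include <iostream>
#include <utility>
#include <vector>

// Build the unoriented and oriented incidence matrices of a small graph on
// vertices V1..V4 from an edge list.
int main() {
    const int v = 4;
    // Directed edges as (from, to) pairs, 0-based: V1->V2, V2->V3, V3->V4, V4->V1.
    std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};
    const int e = static_cast<int>(edges.size());

    std::vector<std::vector<int>> unoriented(v, std::vector<int>(e, 0));
    std::vector<std::vector<int>> oriented(v, std::vector<int>(e, 0));
    for (int j = 0; j < e; ++j) {
        unoriented[edges[j].first][j] = 1;    // both endpoints are incident
        unoriented[edges[j].second][j] = 1;
        oriented[edges[j].first][j] = 1;      // edge points away from this vertex
        oriented[edges[j].second][j] = -1;    // edge points to this vertex
    }

    std::cout << "Oriented incidence matrix:\n";
    for (int i = 0; i < v; ++i) {
        for (int j = 0; j < e; ++j) std::cout << oriented[i][j] << '\t';
        std::cout << '\n';
    }
    return 0;
}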
APRIORI AND NON-APRIORI BASED ALGORITHMS
The frequent sub-structure mining algorithm proposed by Inokuchi et al. (Inokuchi, Washio and Motoda, 2000) and the Apriori-based frequent itemset mining algorithm developed by Agrawal and Srikant (Agrawal and Srikant, 1994) have similar characteristics: the search proceeds from the bottom up, generating candidates with an extra vertex, edge, or path. The different kinds of candidate generation strategies are:
1. AGM (Inokuchi, Washio and Motoda, 2000) is a vertex-based candidate generation method that increases the sub-structure size by one vertex in each iteration. Two size-k frequent graphs are joined only when the two graphs have the same size-(k − 1) sub-graph. Here size is the number of vertices in a graph. The newly formed candidate includes the common size-(k − 1) sub-graph and the two additional vertices from the two size-k patterns.
2. The edge-based method adopted by FSG, proposed by Kuramochi and Karypis (Kuramochi and Karypis, 2001), increases the sub-structure size by one edge in each call. Two size-k patterns are merged together if they share the same sub-graph having k − 1 edges; this shared sub-graph is called the core. Here size means the number of edges in the graph. The newly formed candidate includes the core and the two additional edges from the size-k patterns.
3. The disjoint-path method, an Apriori-based approach proposed by Vanetik et al. (Vanetik, Gudes and Shimony, 2002), uses a more complicated candidate generation procedure: a sub-structure pattern with k + 1 disjoint paths is generated by joining sub-structures with k disjoint paths.
Non-Apriori-based algorithms arose because of the considerable overhead of joining two size-k frequent sub-structures to generate size-(k + 1) graph candidates. Most non-Apriori-based algorithms adopt the pattern growth methodology (Han, Pei and Yin, 2000), which extends patterns from a single pattern directly. A pattern-growth-based method discovers a frequent sub-structure, recursively discovers frequent sub-graphs embedded in it, and finally produces a final sub-graph when no more frequent sub-graphs are found. It may, however, discover the same graph more than once, and this detection of duplicate graphs adds workload to the algorithm. To avoid duplicate discovery of graphs, other algorithms have evolved: gSpan (Yan, Zhou and Han, 2005), MoFa (Borgelt and Berthold, 2002), FFSM (Huan, Wang and Prins, 2003), SPIN (Prins, Yang, Huan and Wang, 2004), and Gaston (Nijssen and Kok, 2004). Among these, the gSpan algorithm is the most efficient and adopts DFS traversal.
Closed Frequent Sub-Structure
According to the Apriori property, all the sub-graphs of a frequent sub-structure must be frequent, so a large graph pattern may generate an exponential number of frequent sub-graphs. A frequent pattern is closed if no super-pattern of it has the same support, and it is maximal if it has no frequent super-pattern.
Approximate Sub-Structure
To reduce the number of patterns, one can mine approximate frequent sub-structures that allow minor structural variations, so that several frequent sub-structures with slight differences are represented by one approximate sub-structure. For mining approximate frequent sub-structures, Holder et al. (Holder, Cook and Djoko, 1994) proposed a method called SUBDUE, which adopts the principle of minimum description length (MDL).
Contrast Sub-Structure
For a predefined pair of graph sets, contrast patterns are sub-structures that are frequent in one set but infrequent in the other. The method uses two parameters: a minimum support for the sub-structure in the positive set and a maximum support for the sub-structure in the negative set. An algorithm of this kind is the MoFa algorithm (Borgelt and Berthold, 2002).
Dense Sub-Structure
A relational graph is a special kind of graph structure in which each node label is used only once. Such structures are widely used in modelling and analyzing massive networks. For this setting, Yan et al. (Yan, Zhou and Han, 2002, 2005) proposed two algorithms, CloseCut and Splat, to discover exact dense frequent sub-structures in a set of relational graphs.
Graph Matching
Graph matching is a one-to-one correspondence between the nodes of two graphs. This correspondence is based on one or more of the following characteristics: (i) the labels on corresponding nodes in the two graphs should be the same; (ii) the existence of edges between corresponding nodes in the two graphs should match; (iii) the labels on corresponding edges in the two graphs should match.
Such problems arise in database applications such as schema matching, query matching, and vector space embedding; a detailed treatment can be found in (Riesen, Jiang and Bunke, 2010). Exact graph matching determines a one-to-one correspondence between two graphs such that whenever an edge exists between a pair of nodes in one graph, the same edge must also exist between the corresponding pair in the other graph. Inexact graph matching tolerates the natural errors that occur during the matching process; a proper method is then required to quantify these errors and the closeness between different graphs. A function called graph edit distance is used for this: it determines the distance between two graphs by measuring the cost of the edits (node or edge insertions, deletions, or substitutions) required to transform one graph into the other. The cost of the corresponding edits between two graphs judges the quality of the matching. The concept of graph edit distance is closely related to finding a maximum common sub-graph (Bunke, 1997).
Frequent Graph
Given a labelled graph dataset D = {G1, G2, . . ., Gn}, the support of a sub-graph g, support(g), is the number (or fraction) of graphs in D that contain g. A sub-graph g is frequent if its support is not less than a minimum support threshold.
EXAMPLE:
Let us consider the three graphs shown in Figure 16, which form the dataset D = {G1, G2, G3}. The authors depict three frequent sub-graphs of D in Figure 17: the frequent sub-graphs are {g1, g2, g3} and their frequencies are {2, 2, 3}, as shown in the same figure.
Figure 16. Graph dataset G1, G2, G3
APRIORI-BASED APPROACH
Apriori-based frequent item set mining algorithms were developed by Agrawal and Srikant (Agrawal and Srikant, 1994). The search for frequent graphs of larger sizes proceeds in a bottom-up manner by generating candidates having an extra vertex, edge, or path.
In the algorithm, Sk is the frequent sub-structure set of size k. The algorithm follows a level-wise mining technique: the size of the newly discovered frequent sub-structures increases by one in each iteration. New sub-structures are generated by joining two similar but slightly different frequent sub-graphs discovered in the previous iteration, so Line 4 of the algorithm is the candidate generation step. The frequency of the newly formed graphs is then checked, and the detected frequent sub-graphs are merged to generate larger candidates in the next round. The algorithm is listed below.
Let us consider two frequent item sets of size 3, (abc) and (bcd). Joining them generates a candidate frequent item set of size 4, (abcd); both (abc) and (bcd) are contained in the candidate (abcd), whose frequency is then checked. The candidate generation problem in frequent sub-structure mining, however, is much harder than in frequent item set mining, since there are many different ways of joining two sub-structures. The small sketch below illustrates the item set join.
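A minimal sketch of the item set join step is given below; the function joinItemsets and the single-character items are illustrative assumptions. The check on the size of the union plays the role of requiring the two size-k sets to share exactly k − 1 items.

#include <iostream>
#include <set>

// Joining two size-k frequent item sets that share k-1 items produces one
// size-(k+1) candidate, e.g. (abc) join (bcd) -> (abcd).
std::set<char> joinItemsets(const std::set<char>& x, const std::set<char>& y) {
    std::set<char> candidate = x;
    candidate.insert(y.begin(), y.end());
    return candidate;          // valid only if the union has exactly k+1 items
}

int main() {
    std::set<char> abc = {'a', 'b', 'c'};
    std::set<char> bcd = {'b', 'c', 'd'};
    std::set<char> candidate = joinItemsets(abc, bcd);
    if (candidate.size() == abc.size() + 1) {
        std::cout << "candidate: ";
        for (char item : candidate) std::cout << item;   // prints abcd
        std::cout << "\n";
    }
    return 0;
}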
Another candidate generation strategy, AGM (Inokuchi, Washio and Motoda, 2000), is a vertex-based method that increases the sub-structure size by one vertex in each iteration; the complete algorithm is listed below. Two size-k frequent graphs are joined only when the two graphs have the same size-(k − 1) sub-graph. Here size means the number of vertices in a graph. The newly formed candidate includes the common size-(k − 1) sub-graph and the two additional vertices from the two size-k patterns; because it is not determined whether an edge connects the two additional vertices, two candidates can be formed. FSG, proposed by Kuramochi and Karypis (Kuramochi and Karypis, 2001), adopts an edge-based method that increases the sub-structure size by one edge in each call and follows the algorithm listed below. Two size-k patterns are merged if they share the same sub-graph that has k − 1 edges, which is called the core. Here size means the number of edges in a graph. The newly formed candidate includes the core and the two additional edges from the size-k patterns.
Other Apriori-based methods, such as the disjoint-path method proposed by Vanetik et al. (Vanetik, Gudes and Shimony, 2002), use more complicated candidate generation procedures: a sub-structure pattern with k + 1 disjoint paths is generated by joining sub-structures with k disjoint paths. Apriori-based algorithms incur considerable overhead when joining two size-k frequent sub-structures to generate size-(k + 1) graph candidates; to avoid such overhead, the authors turn to non-Apriori-based algorithms.
PATTERN GROWTH APPROACH
A graph can be extended by adding a new edge. The edge may or may not introduce a new vertex into the newly formed graph, and the extension may be made in a forward or backward direction. The algorithm is listed below.
This simple algorithm is not efficient: during extension it discovers the same graph again and again, and this repeated discovery consumes extra time and space and has to be avoided. For example, there may exist n different (n−1)-edge graphs that can all be extended to the same n-edge graph. Line 1 of the algorithm creates duplicate graphs, whereas Line 2 detects them, so the generation and detection of duplicate graphs both require additional work. To reduce the generation of duplicate graphs, each frequent graph should be extended as conservatively as possible. This principle leads to the design of several newer algorithms such as gSpan (Yan, Zhou and Han, 2005), MoFa (Borgelt and Berthold, 2002), FFSM and SPIN (Prins, Yang, Huan and Wang, 2004), and Gaston (Nijssen and Kok, 2004). A sketch of duplicate-free pattern growth is given below.
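The following is a hedged sketch, not the listed algorithm itself, of pattern growth with duplicate suppression: patterns (edge sets) of one small graph are grown by one adjacent edge at a time, and a set of canonical encodings prevents the same pattern from being expanded twice. The graph, the encoding scheme, and the function names are assumptions made for illustration.

#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// A simple canonical encoding: edges printed in the sorted order of std::set.
std::string encode(const std::set<Edge>& pattern) {
    std::string s;
    for (const Edge& e : pattern)
        s += std::to_string(e.first) + "-" + std::to_string(e.second) + ";";
    return s;
}

// Grow the current pattern by one connected edge at a time; the 'seen' set of
// encodings prunes duplicate discoveries of the same pattern.
void grow(const std::vector<Edge>& graph, std::set<Edge> pattern,
          std::set<int> vertices, std::set<std::string>& seen) {
    std::string code = encode(pattern);
    if (!seen.insert(code).second) return;       // already discovered: prune
    std::cout << "pattern: " << code << "\n";
    for (const Edge& e : graph) {                // try every one-edge extension
        if (pattern.count(e)) continue;
        if (vertices.count(e.first) || vertices.count(e.second)) {
            std::set<Edge> p2 = pattern;
            std::set<int> v2 = vertices;
            p2.insert(e);
            v2.insert(e.first);
            v2.insert(e.second);
            grow(graph, p2, v2, seen);
        }
    }
}

int main() {
    // Illustrative triangle plus a pendant edge: 1-2, 2-3, 1-3, 3-4.
    std::vector<Edge> graph = {{1, 2}, {2, 3}, {1, 3}, {3, 4}};
    std::set<std::string> seen;
    for (const Edge& e : graph)                  // start growth from every edge
        grow(graph, {e}, {e.first, e.second}, seen);
    return 0;
}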
AUTHORS' PROPOSED APPROACH
The authors have studied the scenario of a social graph consisting of the various villages in a panchayat (a panchayat is an Indian unit for the administration of villages). In a village, different types of communities live together and have connectivity. With this scenario in mind, one can compare two community graphs to find a similar sub-graph. For this case the authors have proposed an algorithm for detecting the frequent sub-graph between two community graphs: a simple technique using graph theory is employed to detect the frequent sub-community graph between the community graphs of two villages. The proposed algorithm is given below.
Let us consider the community graph for villages V1, V2, V3, V4, and V5 shown in the earlier figure (Rao and Mitra, ichpca-2014). For village V1 the communities are VC1 = {C1, C2, C3, C4}; for village V2, VC2 = {C1, C2, C3, C5}; for village V3, VC3 = {C1, C2, C3, C4, C5}; for village V4, VC4 = {C1, C3, C4, C8}; and for village V5, VC5 = {C1, C2, C3, C5, C9, C10}.
Now the authors detect the frequent sub-community graph by considering two community graphs at a time. In this example the authors consider the community graphs of villages V3 and V5, which are shown in (i) and (ii) of Figure 18.
Figure 18. Adjacency matrix and village sub graph
Their adjacency matrices are shown in (iii) and (iv) of the same figure. The algorithm is passed four parameters: V3, 5, V5, and 6. The authors then take the union of the communities of the two villages V3 and V5 and obtain the order of the resultant adjacency matrix from the total number of unique communities; in this example the order is 7, and the corresponding adjacency matrix is shown in (iii) of Figure 19. Based on the algorithm, the final resultant matrix is formed and shown in (i) of Figure 20. Using the resultant matrix, the authors can draw the frequent sub-community graph, which in this example is shown in (ii) of the same figure. A minimal sketch of this matrix-based detection is given after the figure captions.
Figure 19. Common community sub graph between villages
Figure 20. Common community sub graph between V3 and V5
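A minimal sketch of the matrix view of this detection step follows: once the two villages' communities are aligned on one index set, an edge belongs to the common sub-community graph only when it appears in both adjacency matrices (a logical AND). The matrices below are illustrative and are not the chapter's Figure 18 data.

#include <iostream>
#include <vector>

int main() {
    const int n = 5;                      // communities C1..C5 on a shared index
    std::vector<std::vector<int>> v3 = {  // first village's community graph (assumed)
        {0, 1, 1, 0, 0},
        {1, 0, 1, 0, 0},
        {1, 1, 0, 1, 1},
        {0, 0, 1, 0, 0},
        {0, 0, 1, 0, 0}};
    std::vector<std::vector<int>> v5 = {  // second village's community graph (assumed)
        {0, 1, 1, 0, 0},
        {1, 0, 0, 0, 1},
        {1, 0, 0, 0, 1},
        {0, 0, 0, 0, 0},
        {0, 1, 1, 0, 0}};

    // The common sub-community graph keeps only edges present in both matrices.
    std::vector<std::vector<int>> common(n, std::vector<int>(n, 0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            common[i][j] = v3[i][j] & v5[i][j];

    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (common[i][j])
                std::cout << "common edge C" << i + 1 << "-C" << j + 1 << "\n";
    return 0;
}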
OUTPUT FOR THE EXAMPLE
The screenshots of the output are given in sequence in the following Figures 21, 22, 23, 24, and 25.
MERGING OF GRAPHS
Merging takes two sub-graphs at a time and forms a larger graph, provided there is at least one common vertex or node between those sub-graphs.
AUTHORS' PROPOSED APPROACH
The authors propose a method for merging two community graphs into a larger community graph, provided there is at least one common community between the two community graphs (Rao and Mitra, iccic 2014). The authors follow a graph matching technique, matching one-to-one corresponding communities of the two community graphs to be merged. Here a community graph (an undirected graph) is represented in memory as an adjacency matrix. Finally, the authors propose three algorithms for merging two community graphs.
Algorithm I finds the order of the merged communities of the two villages and makes available the initial form of the merged community matrix. Algorithm II creates the adjacency matrix of a village's community graph; these adjacency matrices of the villages' community graphs are used in Algorithm III. Algorithm III creates the merged community adjacency matrix, from which the authors can finally construct the merged community graph of the two villages. A compact C++ sketch in the same spirit follows the three algorithm listings.
Proposed Algorithms
(I) Algorithm for finding order of merged communities of two villages.
Algorithm Order_of_Merged_Community_Matrix (CV1, N1, CV2, N2)
CV1 [1:N1]: Village-1’s communities 1, 2, 3, ……., N1.
CV2 [1:N2]: Village-2’s communities 1, 2, 3, ……., N2.
CMV [1:N1+N2]: Global array to hold the merged community arrays CV1 and CV2.
MV [1:size, 1:size]: Global 2D-Array for representation of initial form of merged community matrix.
size: The size of merged array.
1. Merge CV1 and CV2 and store in CMV.
2. Arrange CMV in ascending order.
3. Set size:= N1 + N2.
4. [Remove the repeated community from CMV[ ] ]
Repeat for I = 1, 2, …….,size-1:
(i) If CMV [I] = CMV [I+1],
Then
[Shift one step left all communities]
Repeat for J = I+1, I+2,…….,size:
Set CMV [J-1]:= CMV [J].
End for
Set size:= size-1.
Set I:= I – 1.
End if
End for
5. [Initial representation of merged community matrix of order sizeXsize]
Repeat for I = 1, 2, ……,size:
Repeat for J = 1, 2, …..,size:
Set MV [I][J]:= 0.
End For
End For
6. Return
(II) Algorithm for Adjacency matrix creation for village’s community graph
Algorithm Adjacency_Village_Community_Matrix (CA, Size, CMatrix)
CA [1:Size]: Community array of a village of dimension Size.
CMatrix [1:Size+1, 1:Size+1]: Community matrix of village V of order (Size+1) X (Size+1).
1. [Community number assignments at row and column places in community matrix CMatrix[][]]
Repeat for I = 1, 2, ……,Size:
a) Set CMatrix [1][I+1]:= CA [I].
b) Set CMatrix [I+1][1]:= CA [I].
End For
2. [Adjacency Matrix Creation through community matrix CMatrix[][]]
Repeat For I = 2, 3, …..,Size+1:
Repeat For J = 2, 3, ……,Size+1:
If Edge from CMatrix [I][1] to CMatrix[1][J] = True,
Then
Set CMatrix [I][J]:= 1.
Else
Set CMatrix [I][J]:= 0.
End If
End For
End For
3. Return
Note i. To create the adjacency matrix V1[1:N1+1, 1:N1+1] of the community graph CV1[1:N1], use the following statement:
Call Adjacency_Village_Community_Matrix (CV1, N1, V1)
Note ii. Here CV1 is village-1's community array, holding communities 1, 2, 3, …….., N1, and V1 is the adjacency matrix of the community graph of village-1, of order (N1+1) X (N1+1).
(III) Algorithm for merging two villages' community adjacency matrices and representing the result in the merged community adjacency matrix MV of order size X size.
Note iii. The initial state of the matrix MV [1:size, 1:size] has been created in Algorithm I.
Algorithm Merge_Community_Graphs (V1, N1, V2, N2)
V1 [1:N1+1, 1:N1+1]: Adjacency community matrix of order (N1+1) X (N1+1).
V2 [1:N2+1, 1:N2+1]: Adjacency community matrix of order (N2+1) X (N2+1).
(Above both matrices formed from Algorithm-II)
MV [1:size, 1:size]: Global 2D-Array (formed from Algorithm-I) for representation of merged community matrices V1 and V2, of order
sizeXsize.
1. [Adding adjacency matrix V1[][] into the merged community matrix MV[][]]
Repeat for i = 2, 3, ……, N1+1:
Repeat for j = 2, 3, ……, N1+1:
Repeat for m = 2, 3, ……..,size:
Repeat for n = 2, 3, ……,size:
If MV[m][1]=V1[i][1] And MV[1][n]=V1[1][j],
Then
i) Set MV[m][n]:= MV[m][n] OR V1[i][j]. // logical OR operation
ii) Break.
End If
End For
End For
End For
End For
2. [Adding adjacency matrix V2[][] into the merged community matrix MV[][]]
Repeat for i = 2, 3, ……, N2+1:
Repeat for j = 2, 3, ……, N2+1:
Repeat for m = 2, 3, ……..,size:
Repeat for n = 2, 3, ……,size:
If MV[m][1]=V2[i][1] And MV[1][n]=V2[1][j],
Then
i) Set MV[m][n]:= MV[m][n] OR V2[i][j]. // logical OR operation
ii) Break.
End If
End For
End For
End For
End For
3. The matrix MV [][] is the resultant merged community matrix. From it we can draw the merged community graph.
4. Exit.
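The following compact C++ sketch follows the spirit of Algorithms I-III: it forms the sorted, duplicate-free union of the two villages' community numbers and fills the merged adjacency matrix by OR-ing the corresponding entries. The CommunityGraph structure, the function name, and the sample villages are assumptions for illustration, not the authors' published code.

#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

struct CommunityGraph {
    std::vector<int> communities;            // community numbers of the village
    std::vector<std::vector<int>> adj;       // 0/1 adjacency among them
};

CommunityGraph mergeCommunityGraphs(const CommunityGraph& a, const CommunityGraph& b) {
    // Algorithm I in spirit: merged, duplicate-free, ascending community list.
    std::vector<int> merged = a.communities;
    merged.insert(merged.end(), b.communities.begin(), b.communities.end());
    std::sort(merged.begin(), merged.end());
    merged.erase(std::unique(merged.begin(), merged.end()), merged.end());

    std::map<int, int> pos;                  // community number -> merged index
    for (std::size_t i = 0; i < merged.size(); ++i) pos[merged[i]] = static_cast<int>(i);

    CommunityGraph out{merged, std::vector<std::vector<int>>(
                                   merged.size(), std::vector<int>(merged.size(), 0))};

    // Algorithm III in spirit: copy each village's edges into the merged matrix with OR.
    const CommunityGraph* parts[2] = {&a, &b};
    for (const CommunityGraph* g : parts)
        for (std::size_t i = 0; i < g->communities.size(); ++i)
            for (std::size_t j = 0; j < g->communities.size(); ++j) {
                int m = pos[g->communities[i]], n = pos[g->communities[j]];
                out.adj[m][n] = out.adj[m][n] | g->adj[i][j];
            }
    return out;
}

int main() {
    // Illustrative villages: V1 has C1-C2 and C2-C3; V2 has C2-C3 and C2-C5.
    CommunityGraph v1{{1, 2, 3}, {{0, 1, 0}, {1, 0, 1}, {0, 1, 0}}};
    CommunityGraph v2{{2, 3, 5}, {{0, 1, 1}, {1, 0, 0}, {1, 0, 0}}};
    CommunityGraph m = mergeCommunityGraphs(v1, v2);
    for (std::size_t i = 0; i < m.communities.size(); ++i)
        for (std::size_t j = i + 1; j < m.communities.size(); ++j)
            if (m.adj[i][j])
                std::cout << "C" << m.communities[i] << "-C" << m.communities[j] << "\n";
    return 0;
}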
EXAMPLE AND ANALYSIS FOR THE AUTHORS' PROPOSED APPROACH
Let us consider an example consisting of villages V = {V1, V2, V3, V4, V5}. Each village has a community network, known as its sub-community graph. Each village's list of unique communities is given in the seed table of villages in Table 1; the total number of communities in a village is termed its seed. When the community graphs of two villages are to be merged into a larger community graph, there must be at least one common community between those community graphs. With the help of the seed table, the authors can find the common communities between the villages whose community graphs are to be merged. Algorithm I (Rao and Mitra, iccic 2014) determines the common communities to be used as the order of the merged community adjacency matrix (the MV[ ][ ] matrix) and creates the initial form of the merged community adjacency matrix, with all entries assigned the value 0 (zero).
Table 1. Seed table of villages
Algorithm II (Rao and Mitra, iccic 2014) creates the adjacency matrix of a village's community graph, and Algorithm III (Rao and Mitra, iccic 2014) merges the community adjacency matrices of two villages. The data of Example 1 and Example 2 is used as sample data for the algorithm, to confirm that the algorithm works properly. Example 1 is shown in Figures 26 and 27, whereas Example 2 is shown in Figures 28 and 29; the sample outputs are listed in Results 1 and Results 2. The algorithm is implemented in the C++ programming language.
Figure 26. Community graph and adjacency matrix
Results in the form of output screens for Example-1 are shown in Figures 30 and 31.
Figure 30. First screenshot for output of program
Figure 31. Second screenshot for output of the program
Results in the form of output screens for Example-2 are shown in Figures 32 and 33.
Figure 32. First screenshot for output of program
Figure 33. Second screenshot for output of the program
CRITERIA FOR GROUPING
The criteria for groups can be classified into node-centric, group-centric, network-centric, and hierarchy-centric categories. Some methods available for grouping are listed below.
NODE-CENTRIC COMMUNITY DETECTION
To detect a community using node-centric criteria, each node in a group is required to satisfy certain properties, which include mutuality, reachability, or degrees. Some grouping-based methods are briefly explained below.
• Groups Based on Complete Mutuality: A sub-graph in which three or more nodes are all adjacent to each other is termed a clique. The detection of complete bipartite sub-graphs in a community graph derived from a directed graph is given in (Kumar, Raghavan, Rajagopalan and Tomkins, 1999).
• Groups Based on Reachability: This considers the reachability between two actors or nodes in a community. Two nodes can be considered part of one community if there is a path between them, so a connected component is said to be a community (Kumar, Novak and Tomkins, 2006).
• Groups Based on Nodal Degrees: This checks whether actors within a group are adjacent to a relatively large number of group members. Two commonly studied sub-structures are the k-plex and the k-core.
• Groups Based on Within-Outside Ties: This detects and selects nodes that have more connections to nodes within the group than to nodes outside the group.
GROUP-CENTRIC COMMUNITY DETECTION
This focuses on the connections of nodes only inside a particular group; one example is density-based groups. There is no guarantee of reachability for each node in a group. In (Abello, Resende, and Sudarsky, 2002), maximal dense quasi-cliques are discussed.
NETWORK-CENTRIC COMMUNITY DETECTION
This considers the connections of the whole network and partitions the actors into a number of small disjoint sets; it never defines a group independently.
HIERARCHY-CENTRIC COMMUNITY DETECTION
Based on the network topology, this builds a hierarchical structure of communities. There are basically three types of hierarchical clustering: divisive, agglomerative, and structure search.
• Divisive Hierarchical Clustering: Divisive clustering first partitions the actors into several disjoint sets; each set is then further divided into smaller ones containing only a small number of actors. A divisive clustering based on edge betweenness is proposed in (Newman and Girvan, 2004).
• Agglomerative Hierarchical Clustering: The clustering starts with each node as a separate community and merges them successively into larger communities; this hierarchical clustering uses modularity as its merging criterion (Newman and Moore, 2004).
• Structure Search: This starts from a particular hierarchy and searches for similar hierarchies that could generate the network (Vismara, Battista, Garg, Liotta, Tamassia and Vargiu, 2000). A random graph model for hierarchies is defined in (Moore and Newman, 2008).
GROUP DETECTION
Group detection refers to the discovery of an underlying organizational structure, consisting of selected individuals who are related to each other, from a large structure.
Best Friends Group Detection Algorithm
This algorithm is passed user-defined parameters which form an initial group; the node at which each group begins is called a "seed" node. Instructions for running Best Friends Group Detection, either interactively or through TMODS batch scripts, can be found in (Moy, 2005).
Terrorist Modus Operandi Detection System
TMODS searches for and analyzes instances of particular threatening activity patterns; it is a distributed Java software application (Moy, 2005). Graph matching is sometimes called sub-graph isomorphism (Paugh and Rivest, 1978), (Diestel, 2000). Graph matching finds subsets of a large input graph that match a pattern and returns a true value if such a subset is found.
TMODS EXTENSIONS TO STANDARD GRAPH MATCHING
TMODS extends the standard sub-graph isomorphism problem to provide additional capabilities and is able to detect a sub-graph within a large graph.
• Inexact Matching: TMODS is able to find and highlight activity that exactly matches a pattern, as well as activity that is close to the pattern but not an exact match under sub-graph isomorphism.
• Multiple Choices and Abstractions: TMODS supports different ways in which a pattern can be instantiated, defining various alternative graphs for each pattern, called choices.
• Hierarchical Patterns: TMODS allows defining patterns that are built from other patterns. Rather than describing the entire pattern,
TMODS modularizes the patterns.
• Constraints: TMODS allows defining constraints between attribute values on nodes and edges. Constraints restrict the relationships
between attributes of actors, events, or relationships.
ALGORITHMS
The exhaustive algorithms listed in the earlier sections will not, in practice, be able to solve the problem in reasonable time for particularly large graphs (Messmer, 1995), (Corneil and Gotlieb, 1970), (Cordella, Foggia, Sansone and Vento, 2011). Non-exhaustive techniques are therefore used in practical implementations to achieve results faster. TMODS follows two major algorithms, the merging matches algorithm and the fast genetic search.
• Merging Matches: This builds up a list of potential matches. The initial entries of the list match a node from the input pattern to one node in the pattern graph, and these partial matches are then merged step by step.
• Genetic and Distributed Genetic Search: When the input patterns are large, the merging matches algorithm is not feasible for the search task. TMODS adopts a genetic search algorithm to solve this problem efficiently, and allows the genetic search to be distributed over several processes to increase the speed and completeness of the search; each process is assigned a limited search domain and may run on a different computer system.
AUTHORS' ANALYSIS AND WORK
The authors have studied the scenario of a social graph consisting of the various villages in a panchayat (a panchayat is an Indian unit for the administration of villages). In a village, different types of communities live together and have connectivity. The authors have proposed the community graph given in the figure. How does a community share its social values, feelings, and activities with the same community living in other villages of the same panchayat or a different panchayat? With this scenario in mind, one can find the community match graph (Rao and Mitra, Dec-2014). For this case the authors have proposed an algorithm to detect the same community and find the graph which underlies the original large community graph; it is a simple technique using graph theory to detect a particular community across various villages. The newly proposed algorithms (Rao and Mitra, Dec-2014) for creation and detection are as follows:
rcno[1:S]: Global dynamic array which holds the merged community numbers, where S = nc[1] + nc[2] + ……. + nc[n]; the array is assumed to be arranged in ascending order and to have had duplicate community numbers removed.
CIMatrix [1:n+1, 1:S+1]: Global matrix of order 'n' villages X 'S' unique communities.
1. [Initial form of community incidence matrix]
Repeat For I = 2, 3, 4,………., n + 1:
[Village number assignment]
i) Set CIMatrix [I][1]:= vno [I-1].
Repeat For J = 2, 3, 4,…….., S + 1:
[Community number assignment]
ii) If I = 2, Then: Set CIMatrix [1][J]:= rcno [J-1].
[Assign 0 to all villages and their respective communities]
iii) Set CIMatrix [I][J]:= 0.
End For
End For
2. [Final form of community incidence matrix]
Repeat For I = 1, 2, 3, …….., n:
Repeat For J = 1, 2, 3,…….., nc [I]:
Repeat For K = 2, 3, 4,…….., S+1:
If cno [I][J] = CIMatrix [1][K],
Then
a) Set CIMatrix [I+1][K]:= 1.
b) Break.
End If
End For
End For
End For
3. Exit
Algorithm Community Detection (Cno)
(Algorithm conventions (Seymour, 1997))
Cno: Community Number, vno[1: M]: Global array which holds ‘M’ village numbers in ascending order.
Cno_Array[1: M]: Global Array for assignment of true(1) value for detected community number Cno; otherwise false (0) value for M villages.
CIMatrix[1: M, 1: N]: Global Matrix which shows the existence of communities for M villages and N number of unique communities.
1. Repeat For I = 1, 2, 3, …….., M:
[Initializes 0 to M villages in Cno_Array.]
Set Cno_Array[I]:= 0.
End For
2. Repeat For I = 1, 2, 3, …….., N:
[To find community number Cno in matrix CIMatrix[][]; when found assign 1 to the corresponding village in the array Cno_Array[ ] ]
3. If CIMatrix[1, I] = Cno, Then:
4. Repeat For J = 1, 2, 3, ……, M:
If CIMatrix[J, I] = 1, Then: Set Cno_Array[J]:= 1.
End For
End If
End For
5. Repeat For I = 1, 2, 3, …….., M:
[Output the array vno[] for community matched villages]
If Cno_Array[I] = 1, Then: Write vno[I].
End For
6. Exit.
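A minimal C++ sketch of this detection step over a community incidence matrix is given below, assuming villages in the rows and communities in the columns; the matrix contents mirror the C2 example but are otherwise illustrative.

#include <iostream>
#include <vector>

// Given a village x community incidence matrix (1 = the community is present
// in that village), list every village that hosts the queried community.
int main() {
    const int villages = 5, communities = 10;
    // CIMatrix[i][j] = 1 when village V(i+1) contains community C(j+1).
    std::vector<std::vector<int>> CIMatrix(villages, std::vector<int>(communities, 0));
    CIMatrix[0][1] = 1;   // C2 present in V1
    CIMatrix[2][1] = 1;   // C2 present in V3
    CIMatrix[3][1] = 1;   // C2 present in V4
    CIMatrix[4][1] = 1;   // C2 present in V5

    int cno = 2;                               // community to detect, e.g. C2
    std::vector<int> cnoArray(villages, 0);    // 1 = community detected in that village
    for (int i = 0; i < villages; ++i)
        if (CIMatrix[i][cno - 1] == 1) cnoArray[i] = 1;

    for (int i = 0; i < villages; ++i)
        if (cnoArray[i] == 1) std::cout << "V" << i + 1 << "\n";   // V1 V3 V4 V5
    return 0;
}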
EXAMPLE:
Let us consider the community graph for villages V1, V2, V3, V4, and V5 shown in the figure (Rao and Mitra, ichpca-2014). For village V1 the communities are C1, C2, and C3, where C1 indicates the community head. Village V1's extracted view is given in Figure 34; in this case the black filled circles are the community heads, and each village is represented as one node or vertex in the figure, labelled V1, V2, and so on. So the seed number of village V1 is 3. The villages' seed numbers and communities are given in Figure 35. The unique communities across villages V1 to V5 are C1, C2, …., C10.
Figure 34. Displaying Community Heads
Figure 35. Seed table of villages
To represent the community graph of the figure as an incidence matrix, the order would be ('number of villages' X 'number of distinct communities'); thus the order for the above community graph is 5 X 10, and its incidence matrix representation is shown in Figure 36. The final incidence matrix holds only the Boolean values 1 and 0. This matrix is called the community incidence matrix because it shows the presence (1) or absence (0) of a community in a particular village.
Figure 36. Community incidence matrix
Now the detection of community number Cno is carried out on the community incidence matrix CIMatrix[ ][ ]. If Cno is found, 1 is assigned to the corresponding village in the community detection array Cno_Array[ ]. For the above example the community number is C2, and the resulting Cno_Array[ ] is shown in Figure 37; the value 1 against a village indicates that the community, C2 in this example, was detected there.
Figure 37. Community detected array
Using Cno_Array[ ], the authors can draw the community detection graph shown in Figure 38. The dotted path is the underlying graph for community C2, which is the community detection graph for the nodes (villages) V1, V3, V5, and V4. From this figure, the community matched graph can be drawn as a digraph, which is shown in Figure 39. Finally, it is possible to detect the isolated communities in the community incidence matrix.
In the authors' example, the isolated communities are C5 and C6 in village V2 and C9 and C10 in village V5; the isolated communities are shown as black filled circles in Figure 40.
Figure 40. Displaying isolated communities
The authors present the results as output screens, in sequence, in Figures 41, 42, 43, and 44. The output screens for community detection for C2 are shown in Figure 45, and the output screens for community detection for C8 are shown in Figure 46.
Figure 41. First screenshot of output
SOME ISSUES ON GRAPH ANALYTICS
• Single Path Analysis: The goal is to find a path through the graph, starting with a specific node. All the links and the corresponding
vertices that can be reached immediately from the starting node are first evaluated. From the identified vertices, one is selected, based on a
certain set of criteria and the first hop is made. After that, the process continues. The result will be a path consisting of a number of vertices
and edges.
• Optimal Path Analysis: This analysis finds the ‘best’ path between two vertices. The best path could be the shortest path, the cheapest
path or the fastest path, depending on the properties of the vertices and the edges.
• Vertex Centrality Analysis: This analysis identifies the centrality of a vertex based on several centrality assessment properties:
• Degree Centrality: This measure indicates how many edges a vertex has. The more edges there are, the higher the degree centrality.
• Closeness Centrality: This measure identifies the vertex that needs the smallest number of hops to reach the other vertices. The closeness centrality of a vertex refers to its proximity to the other vertices: the higher the closeness centrality, the shorter the paths from that vertex to the other vertices. A small sketch computing degree and closeness centrality follows this list.
• Eigenvector Centrality: This measure indicates the importance of a vertex in a graph. Scores are assigned to vertices, based on the
principle that connections to high-scoring vertices contribute more to the score than equal connections to low-scoring vertices.
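As a small illustration of these analyses, the sketch below builds a toy undirected graph and computes an optimal (shortest) path plus the three centrality measures just described. The graph and the use of the Python networkx library are assumptions made for the example, not part of the chapter's own implementation.

```python
# Toy graph analytics: optimal path and vertex centrality measures with networkx.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"),
])

# Optimal path analysis: the shortest path between two vertices.
print(nx.shortest_path(G, "A", "F"))        # ['A', 'C', 'D', 'E', 'F']

# Degree centrality: more edges -> higher score.
print(nx.degree_centrality(G))

# Closeness centrality: shorter average distance to others -> higher score.
print(nx.closeness_centrality(G))

# Eigenvector centrality: links to high-scoring vertices count for more.
print(nx.eigenvector_centrality(G))
```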
APPLICATION OF GRAPH ANALYTICS
In the finance sector, graph analytics is useful for understanding the money transfer pathways. A money transfer between bank accounts may
require several intermediate bank accounts and graph analytics can be applied to determine the different relationships between different account
holders. Running the graph analytics algorithm on the huge financial transaction data sets will help to alert banks to possible cases of fraudulent
transactions or money laundering.
The use of graph analytics in the logistics sector is not new. Optimal path analysis is the obvious form of graph analytics that can be used in
logistics distribution and shipment environments. There are many examples of using graph analytics in this area and they include “the shortest
route to deliver goods to various addresses” and the “most cost effective routes for goods delivery”.
One of the most contemporary use cases of graph analytics is in the area of social media. It can be used not just to identify relationships in the
social network, but to understand them. One outcome from using graph analytics on social media is to identify the “influential” figures from each
social graph. Businesses can then spend more effort in engaging this specific group of people in their marketing campaigns or customer
relationship management efforts.
AN EXAMPLE ON USE OF GRAPH ANALYTICS
For analytical purposes, a social network is visualized as a digraph(in a graph if the relationship has no direction) (Bandyopadhyay, Sinha and
Rao, 2006). So in a digraph, one unit may be an individual, a family, a household, a village, an organization in a village is called a node or vertex.
A tie between two nodes indicates the presence of a relationship. No tie between two nodes is the absence of a relationship. A tie with a direction
is called an arc and tie without direction is called an edge. The weight of a tie is the value or volume of flow. If the arc or edge is labeled with any
weight then the graph is termed as weighted graph. In social networking we concentrate only the presence (1) or absence (0) of the relationship.
We also assume that ties have directions.
Let G denote a digraph. The set of vertices of G is denoted by V(G) and the set of arcs by A(G). If uv is an arc, it is drawn as an arrow from vertex u to vertex v; if both uv and vu are arcs, the two are sometimes represented together by a line without arrowheads joining vertex u and vertex v.
Figure 47 shows a digraph G with vertex set V = {v1, v2, …, v21}. Among its arcs are v1v2, v2v1, v3v1, v4v1, and v4v5, whereas v1v3 and v2v3 are not arcs of G.
Figure 47. A digraph
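A digraph of this kind can be stored as an adjacency matrix. The sketch below uses only the five arcs explicitly named in the text for Figure 47 (the remaining vertices and arcs of the 21-vertex graph are omitted), so it is an illustrative fragment rather than a reproduction of the figure.

```python
# Sketch: representing a digraph by its adjacency matrix, restricted to the arcs
# named in the text (v1v2, v2v1, v3v1, v4v1, v4v5).
n = 5
vertices = ["v" + str(i) for i in range(1, n + 1)]
arcs = [("v1", "v2"), ("v2", "v1"), ("v3", "v1"), ("v4", "v1"), ("v4", "v5")]

index = {v: i for i, v in enumerate(vertices)}
A = [[0] * n for _ in range(n)]
for u, v in arcs:
    A[index[u]][index[v]] = 1        # arc uv: arrow from vertex u to vertex v

print(A[index["v1"]][index["v2"]], A[index["v2"]][index["v1"]])   # 1 1 (both uv and vu are arcs)
print(A[index["v1"]][index["v3"]], A[index["v2"]][index["v3"]])   # 0 0 (v1v3 and v2v3 are not arcs)
```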
Representing Social Network Using Properties of Graph Theory and Matrices
The first network illustrates that everybody goes to everybody else. In the second network the ties are reciprocated, but the network is highly fragmented. The third network is connected but highly centralized, with the concentration of power lying in a single node. The fourth network is connected and cyclic, that is, everybody can reach everybody else through a number of intermediaries. The fifth network illustrates a strong hierarchy in which things flow in only one direction. The number of vertices and the number of arcs constitute the most basic data in a social network.
Consider, as an example of the fourth network in Figure 48, six households in a neighborhood connected in a circular way: each household goes to exactly one of the remaining five, and exactly one of the remaining five comes to it. If we impose this condition, three other patterns are possible besides the fourth network of the same figure. These patterns are:
Figure 48. Five different networks
From Figure 49, we understand that a few parameters alone cannot completely determine the structure of a social network.
AUTHORS' OBSERVATIONS AND PREDICTIONS
Consider each state as one large community graph. The authors combine India's 29 states to form a very large community graph (Rao and Mitra, Oct-2014, Dec-2014).
From such a large community graph (Rao and Mitra, 2014), how can big data be mined so that the standard of living of a particular community can be analyzed? Distributed mining techniques can be applied to this very big data, characterized by the 4 Vs, from which the authors try to select the desired item-sets so that predictions can be made on the data.
A Brief Case Study on Application of Big Data on Social Media
Social media usage is growing at an exponential pace, resulting in a huge amount of data created every minute. Users are no longer restricted to accessing the Internet from desktop computers; smartphones, location-based apps, and other Internet of Things devices have led to data being generated at a much faster pace. Looking at basic statistics on social data growth, more than 250 million tweets are generated per day and the number is increasing rapidly, while on average 30 billion pieces of content are shared on Facebook per month. It is predicted that data will grow by over 800% in the next five years and that 80% of this data will be unstructured (Infosys, 2013).
Handling such huge volumes of data is a big challenge, but it also provides numerous opportunities and a competitive edge for enterprises that acquire, store, and manage the data for information and knowledge extraction. Thus a robust platform is a necessity, one that can efficiently handle the hardware challenges that are common with Big Data.
A high-performance, scalable architecture is an essential component of the infrastructure for processing big data. Segregating and storing unstructured data from different sources while keeping query response times low and results accurate remains a software challenge for such an environment. Providing security and safety, and then extracting patterns, meaning, or sense from the huge stored data, is the ultimate challenge for the big data environment.
The analysis of Big Data will help combine social media data streams with enterprise applications in a powerful way to derive meaningful insights from social conversations. Progress can also be made in analyzing the opinions and sentiments of employees based on their interactions with social media.
Further, the analysis agent can be built so that it processes huge volumes of data at high speed and identifies sentiment, buzzwords, predictive signals, correlations, and influencer data.
The knowledge extracted from the analysis of social media big data has many important advantages. An organization can detect and respond to a social outburst before negative sentiment goes viral. Decisions can be taken after analyzing customers' sentiments and purchase patterns. Identifying and engaging the key influencers who drive increases or decreases in sales also becomes easier, as decisions are based on information extracted from the data.
CONCLUSION
The initial sections of this chapter give an overview of modeling the basic representation of a social network using the concepts of graph theory, especially digraphs. The foundations of graph mining have been discussed thoroughly. Two existing examples from a social network background have been considered, and the proposed social community networks have been represented using graph-theoretic concepts to illustrate the ideas discussed. The process and steps for extracting knowledge using graph mining techniques have been briefly outlined. A community network can be analyzed using graph techniques, especially the incidence matrix of an undirected graph. Further, the chapter gives an overview of various techniques for representing graphs.
Breadth-first search (BFS) is used in the Apriori-based approach for level-wise candidate generation. The authors have studied earlier algorithms and proposed the present technique, which is also Apriori-based, and have observed that it is somewhat more efficient than the earlier algorithms. The derived properties of the proposed technique work well in the specific cases considered, and the related sample output justifies this. An undirected social community graph is considered as an example for better understanding. Further, the literature on community detection, followed by a technique for detecting a community with an example from a social network background, has been discussed using graph-theoretic concepts. Thus the extraction of knowledge, that is, the detection of a particular community as well as of isolated communities, is achieved using graph mining techniques.
Sub-structures are helpful for the analysis and extraction of information. The related sections discussed merging sub-structures to form a larger structure that can be used for information extraction. Further, the authors have proposed three algorithms that merge two community sub-graphs in an efficient and simple manner: the first finds the order of the merged communities and produces the initial form of the merged community matrix, the second creates the adjacency matrix of a community graph, and the third uses the adjacency matrices of the community graphs to create the merged community adjacency matrix. A stepwise implementation of the proposed algorithms is presented, and the results obtained are satisfactory and acceptable.
Fundamental concepts of big data and its characteristics have been discussed as a case study. This part of the chapter is conclusive in nature and focuses on the issues of graph analytics and its applications in Big Data.
This work was previously published in Product Innovation through Knowledge Management and Social Media Strategies edited by Alok Kumar Goel and Puja Singhal, pages 94-146, copyright year 2016 by Business Science Reference (an imprint of IGI Global).
REFERENCES
Abello, J., Resende, M. G. C., & Sudarsky, S. (2002). Massive quasi-clique detection . doi:10.1007/3-540-45995-2_51
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of 1994 International Conference Very Large
Data Bases (VLDB’94).
Batagelj, V., & Mrvar, A. P. (2003). Analysis and visualization of large networks. In M. Junger & P. Mutzel (Eds.), Graph Drawing Software.
Springer Verlag.
Berendt, B. (2005). The semantics of frequent subgraphs: Mining and navigation pattern analysis. In Proceedings of WebKDD. Chicago:
Academic Press.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic Web. Scientific American , 279(5), 34–43. doi:10.1038/scientificamerican0501-
34
Borgelt, C., & Berthold, M. R. (2002). Mining molecular fragments: Finding relevant substructures of molecules. In Proceedings of 2002 International Conference on Data Mining (ICDM’02). Academic Press.
Brandes, U., & Erlebach, T. (Eds.). (2005). Network Analysis: Methodological Foundations. Lecture Notes in Computer Science, 3418.
Brandes, U., Kenis, P., & Wagner, D. (2003). Communicating centrality in policy network drawings. IEEE Transactions on Visualization and
Computer Graphics , 9(2), 241–253. doi:10.1109/TVCG.2003.1196010
Brandes, U., & Wagner, D. (2003). Visone: Analysis and visualization of social networks . In Junger, M., & Mutzel, P. (Eds.), Graph Drawing Software (pp. 321–340). Berlin: Springer Verlag.
Broder A. Kumar R. Maghoul F. Raghavan P. Rajagopalan S. Stata R. Tomkins A. (2000). Graph Structure in the Web: Experiments and models. In Proceedings of the World Wide Web Conference. Amsterdam, The Netherlands: Academic Press. 10.1016/S1389-1286(00)00083-9
Bunke, H. (1997). On a Relation between Graph Edit Distance and Maximum Common Subgraph. Pattern Recognition Letters ,18(8), 689–694.
doi:10.1016/S0167-8655(97)00060-3
Chakrabarti, S. (2000). Data mining for hypertext: A tutorial survey. SIGKDD Explorations , 1(2), 1–11. doi:10.1145/846183.846187
Clauset, A., Newman, M., & Moore, C. (2004). Finding community structure in very large networks. Arxiv preprint cond-mat/0408187.
Clauset, A., Moore, C., & Newman, M. E. J. (2008). Hierarchical structure and the prediction of missing links in networks. Nature ,453(7191),
98–101. doi:10.1038/nature06830
Cordella L. P. Foggia P. Sansone C. Vento M. (2001). An improved algorithm for matching large graphs. In Proceedings of the 3rd IAPR-TC-15 International Workshop on Graph-based Representations.
Corneil, D. G., & Gotlieb, C. C. (1970). An efficient algorithm for graph isomorphism. Journal of the ACM , 17(1), 51–64.
doi:10.1145/321556.321562
Desikan, P., & Srivastava, J. (2004). Mining Temporally Evolving Graphs. In Proceedings of WebKDD.
Etzioni, O. (1996). The World Wide Web: Quagmire or Gold Mine? Communications of the ACM , 39(11), 65–68. doi:10.1145/240455.240473
Goldenberg A. Moore A. (2004). Tractable learning of large Bayes net structures from sparse data. In Proceedings of the 6th International Conference on Machine Learning. 10.1145/1015330.1015406
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD’00). Dallas, TX: ACM.
Holder, L. B., Cook, D. J., & Djoko, S. (1994). Substructure discovery in the subdue system. In Proceedings of AAAI’94 Workshop Knowledge
Discovery in Databases (KDD’94). Seattle, WA: AAAI.
Huan, J., Wang, W., & Prins, J. (2003). Efficient Mining of Frequent Subgraph in the Presence of Isomorphism. In Proceedings of 2003 International Conference on Data Mining (ICDM’03). 10.1109/ICDM.2003.1250974
IEEE. (2014). ieee.bigdata.tutorial.1.1slides.pdf. IEEE.
Inokuchi, A., Washio, T., & Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data. In Proceedings of 2000 European Symposium Principle of Data Mining and Knowledge Discovery (PKDD’00). 10.1007/3-540-45372-5_2
Jensen, D., & Neville, J. (2002). Data Mining in Social Networks. Paper presented at the Workshop on Dynamic Social Network Modeling and
Analysis, Washington, DC.
Kleinberg, J., & Lawrence, S. (2001). The Structure of the Web. Science , 294–322.
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM , 46(5), 604–632. doi:10.1145/324133.324140
Kolari, P., & Joshi, A. (2004). Web mining: Research and Practice.IEEE Computational Science & Engineering , 6(4), 49–53.
doi:10.1109/MCSE.2004.23
Kubica J. Goldenberg A. Komarek P. Moore A. (2003). A comparison of statistical and machine learning algorithms on the task of link
completion. In Proceedings of the KDD Workshop on Link Analysis for Detecting Complex Behavior.
Kumar R. Novak J. Tomkins A. (2006). Structure and evolution of online social networks. In KDD ’06: Proceedings of the 12th ACM SIGKDD
international conference on Knowledge discovery and data mining. New York: ACM. 10.1145/1150402.1150476
Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling the web for emerging cyber-communities. Computer Networks , 31(11-
16), 1481–1493. doi:10.1016/S1389-1286(99)00040-7
McEneaney, J. E. (2001). Graphic and Numerical Methods to assess Navigation in Hypertext. International Journal of Human-Computer
Studies , 55(5), 761–786. doi:10.1006/ijhc.2001.0505
Mendelzon, A., Michaila, G., & Milo, T. (1996). Querying the WWW. In Proceedings of the International Conference on Parallel and Distributed
Information Systems, (pp. 80–91). Academic Press.
Meo, R., Lanzi, P. L., Matera, M., & Esposito, R. (2004). Integrating web conceptual modelling and web usage mining. InProceedings of
WebKDD.
Newman, M., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics , 69, 026113.
Nijssen, S., & Kok, J. A. (2004). Quickstart in frequent structure mining can make a difference. In Proceedings of 2004 ACM SIGKDD
International Conference on Knowledge Discovery in Databases (KDD’04). Seattle, WA: ACM.
Prins, J., Yang, J., Huan, J., & Wang, W. (2004). Spin: Mining maximal frequent subgraphs from graph databases. InProceedings of 2004 ACM
SIGKDD International Conference on Knowledge Discovery in Databases (KDD’04). Seattle, WA: ACM.
Rao, B., & Mitra, A. (2014b). A new approach for detection of common communities in a social network using graph mining techniques. In High Performance Computing and Applications (ICHPCA), 2014 International Conference on. doi:10.1109/ICHPCA.2014.7045335
Rao, B., & Mitra, A. (2014c). An Approach to Merging of two Community Sub-Graphs to form a Community Graph using Graph Mining
Techniques. 2014 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC-2014). IEEE.
Ravasz, E., & Barabasi, A.-L. (2003). Hierarchical Organization In Complex Networks. Physical Review , 67.
Riesen, K., Jiang, X., & Bunke, H. (2010). Exact and Inexact Graph Matching: Methodology and Applications. In Managing and Mining Graph
Data. Academic Press.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining: Discovery and applications of usage patterns from web
data. SIGKDD Explorations , 1(2), 1–12. doi:10.1145/846183.846188
Tripathy, B. K., & Mitra, A. (2012). An algorithm to achieve k-anonymity and l-diversity anonymisation in social networks. IEEE – CASoN, Brazil.
doi:10.1109/CASoN.2012.6412390
Vanetik, N., Gudes, E., & Shimony, S. E. (2002). Computing Frequent Graph Patterns from Semistructured Data. In Proceedings of 2002 International Conference on Data Mining (ICDM’02). 10.1109/ICDM.2002.1183988
Vismara, L., Di Battista, G., Garg, A., Liotta, G., Tamassia, R., & Vargiu, F. (2000). Experimental studies on graph drawing algorithms. Software,
Practice & Experience , 30(11), 1235–1284.
Weiss R. Velez B. Sheldon M. (1996). HyPursuit: A hierarchical network search engine that exploits context-link hypertext clustering. In
Proceedings of the Conference on Hypertext and Hypermedia. 10.1145/234828.234846
Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Proceedings of 2002 International Conference on Data
Mining (ICDM’02).
Yan, X., Zhou, X. J., & Han, J. (2005). Mining closed relational graphs with connectivity constraints. In Proceedings of 2005 ACM SIGKDD International Conference on Knowledge Discovery in Databases (KDD’05). 10.1145/1081870.1081908
Zaiane O. R. Han J. (1995). Resource and knowledge discovery in global information systems: A preliminary design and experiment. In
Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining.
Zaki M. J. (2002). Efficiently Mining Frequent Trees in a Forest. In Proceedings of the 7th International Conference on Knowledge Discovery and
Data Mining. 10.1145/775047.775058
Section 4
M. Baby Nirmala
Holy Cross College, India
ABSTRACT
In this emerging era of analytics 3.0, where big data is at the heart of the conversation in all sectors, many vendors achieve and extract the full potential of this vast data through their new-generation analytical processing systems. This chapter gives a brief introduction to the categories of analytical processing systems, followed by some prominent analytical platforms, appliances, frameworks, engines, fabrics, solutions, tools, and products of the big data vendors. Finally, it deals with big data analytics in the network, its security, WAN optimization tools, and techniques for cloud-based big data analytics.
INTRODUCTION
In this technological era of big data, the important issue is how such huge amounts of data, whether semi-structured, unstructured, machine-generated or sensor data, mobile data, or other large-scale data, can be stored and processed. It is fair to say that we are now entering an era of analytics 3.0 in which analytics will be considered a “table-stakes” capability for most organizations seeking insights from this enormous data. Big data platforms can be categorized at a high level according to how they store and process data in a scalable, fault-tolerant, and efficient manner. Data growth, particularly of unstructured data, poses a special challenge as the volume and diversity of data types surpass the capabilities of older technologies such as relational databases, and organizations are therefore investigating next-generation technologies for data analytics. In this increasingly digital world, achieving the full transformative potential of big data requires not only new data analysis algorithms but also a new generation of systems and distributed computing environments to handle the spectacular growth in data volume, the lack of structure in much of it, and the increasing computational needs of massive-scale analytics.
This chapter covers technical details on the categories of analytical processing systems and on how to effectively analyze data with them, drawing primarily on white papers from companies and organizations identified in IDC, Forrester, and Gartner surveys. Big data technology allows bulk data to be stored, meaningful data to be searched and visualized, and predictive analysis to be enabled, thereby informing business processes through valuable insights from big data. Taking all of this into consideration, this chapter strives to provide a glimpse of the various platforms, frameworks, appliances, products, and solutions offered by the leading big data analytics providers, though many of them are not covered here because of space constraints.
CATEGORIES OF ANALYTICAL PROCESSING SYSTEMS
In his blog, Eckerson (2013) explained that, at a high level, there are four categories of analytical processing systems available in this era of Big Data:
• Transactional RDBM Systems.
• Hadoop Distributions.
• NoSQL Databases.
• Analytic Platforms.
Other than these categories, there are analytical engines, frameworks, fabrics, etc., which also play a prominent role in big data analytics.
TRANSACTIONAL RELATIONAL DATABASE MANAGEMENT SYSTEMS
To make Transactional RDBM systems, more pleasant to analytical processing, most of them have been retrofitted with various types of indexes;
join paths, and custom SQL bolt-ons, although they were originally designed to support transaction processing applications. There are two types
of transactional RDBM systems- Enterprise and Departmental.
HADOOP DISTRIBUTIONS
Hadoop is an open source software project run within the Apache Software Foundation for processing data-intensive applications in a distributed environment with built-in parallelism and failover. The most important parts of Hadoop are the Hadoop Distributed File System (HDFS), which stores data in files on a cluster of servers, and MapReduce, a programming framework for building parallel applications that run on data stored in HDFS.
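To make the division of labor concrete, the sketch below imitates the MapReduce programming model in plain Python: a map step that emits (word, 1) pairs, a grouping step that stands in for the framework's shuffle/sort, and a reduce step that sums the counts. It is a conceptual illustration only, not an actual Hadoop job or API.

```python
# Conceptual sketch of MapReduce (word count) in plain Python, not a Hadoop job.
from collections import defaultdict

def map_step(line):
    # Map: emit one (key, value) pair per word.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_step(word, counts):
    # Reduce: aggregate all values seen for one key.
    return (word, sum(counts))

lines = ["big data needs parallel processing",
         "hadoop stores big data on a cluster"]

# Shuffle/sort: group intermediate pairs by key (done by the framework in Hadoop).
groups = defaultdict(list)
for line in lines:
    for word, count in map_step(line):
        groups[word].append(count)

results = [reduce_step(word, counts) for word, counts in groups.items()]
print(sorted(results))
```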
NOSQL DATABASES
NoSQL is the name given to a broad set of databases whose only common thread is that they don't require SQL to process data, although some support both SQL and non-SQL forms of data processing. There are many types of NoSQL databases, and the list grows longer every month. These specialized systems are built using proprietary components, open source components, or a mix of both. In most cases, they are designed to overcome the limitations of traditional RDBM systems in handling unstructured and semi-structured data. A partial listing of NoSQL systems includes key-value pair databases, document stores, SQL MapReduce, graph systems, unified information access systems, and others.
ANALYTIC PLATFORMS
Analytic platforms represent the first wave of big data systems. They are purpose-built, SQL-based systems designed to offer superior price-performance for analytical workloads compared to transactional RDBM systems. There are different types of analytic platforms, and most are being used as data warehouse replacements or stand-alone analytical systems.
• Analytical Appliances: These arrive as an integrated hardware-software blend, tuned for analytical workloads, and come in many shapes, sizes, and configurations. IBM Netezza, EMC Greenplum, and Oracle Exadata, which are more general-purpose analytical machines that can serve as replacements for most data warehouses, are popular with many. Others, such as those from Teradata, are geared to specific analytical workloads, such as delivering extremely fast performance or managing very large data volumes.
• In-Memory Systems: In-memory systems, which hold all data in memory, are ideal where raw performance is needed. SAP, which is staking its business on HANA, an in-memory database for transactional and analytical processing, is evangelizing the need for in-memory systems. Another contender in this space is Kognitio.
• Columnar: Because of the way these systems store and compress data by columns instead of rows, columnar databases offer fast performance for many types of queries (a small sketch of the idea follows this list). SAP's Sybase IQ, Hewlett-Packard's Vertica, ParAccel, Infobright, Exasol, Calpont, and SAND are a few of them, and column storage and processing is fast becoming an RDBM system feature rather than a distinctive subcategory of products. (see Figure 1)
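As an illustration of why column orientation favors analytic queries, the hypothetical sketch below stores the same three-row table in both layouts and computes an aggregate over a single column; in the columnar layout only that one column has to be scanned, which is also what makes per-column compression effective. The table and field names are invented for the example and are not tied to any vendor's product.

```python
# Illustrative sketch: the same table stored row-wise and column-wise.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 45.5},
]

columns = {
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "amount":   [120.0, 80.0, 45.5],
}

# Row-oriented scan: every row object is visited to read one attribute.
total_row_store = sum(r["amount"] for r in rows)

# Column-oriented scan: only the 'amount' column is read.
total_column_store = sum(columns["amount"])

print(total_row_store, total_column_store)   # 245.5 245.5
```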
BIG DATA PLATFORM
A big data platform cannot just be a platform for processing data; it has to be a platform for analyzing that data to extract insight from its immense volume, variety, velocity, value, and veracity. The main components of a big data platform provide:
• Deep Analytics: a fully parallel, extensive and extensible toolbox full of advanced and novel statistical and data mining capabilities
• High Agility: the ability to create temporary analytics environments in an end-user driven, yet secure and scalable environment to deliver
new and novel insights to the operational business
• Massive Scalability: the ability to scale analytics and sandboxes to previously unknown scales while leveraging previously untapped data
potential
• Low Latency: the ability to instantly act based on these advanced analytics in the operational, production environments. (see Figure 2)
Figure 2. Big data platform
SURVEY ON BIG DATA PLATFORMS
A survey was made of some existing big data platforms for large-scale data analysis. There are many types of vendor products to be considered for big data analytics, and more recently vendors have brought out analytic platforms based on MapReduce, distributed file systems, and NoSQL indexing. The ParAccel Analytic Database (PADB) is promoted as the world's fastest, most cost-effective platform for empowering analytics-driven businesses. When combined with the WebFOCUS BI platform, ParAccel enables organizations to tackle the most complex analytic challenges and glean ultra-fast, deep insights from vast volumes of data (Gualtieri, Powers & Brown, 2013).
The SAND Analytic Platform is a columnar analytic database platform that achieves linear data scalability through massively parallel processing
(MPP), breaking the constraints of shared-nothing architectures with fully distributed processing and dynamic allocation of resources (Pavlo,
Paulson, Rasin, Abadi, DeWitt, Madden, & Stonebraker, 2009).
The HP Vertica Analytics Platform offers a robust and ever-growing set of advanced in-database analytics functionality. It is a high-speed, relational SQL database management system (DBMS) purpose-built for analytics and business intelligence, with a shared-nothing, massively parallel processing (MPP), column-oriented architecture. 1010data offers a data and analytics platform that it positions as a complete approach to performing deep analysis and getting maximum insight directly from raw data, at a fraction of the cost and time of other solutions (Sagynov, 2012).
Netezza, a leading developer of combined server, storage, and database appliances designed to support the analysis of terabytes of data, provides companies with a powerful analytics foundation that delivers maximum speed, reliability, and scalability.
IBM BIG DATA PLATFORM
IBM's Big Data platform moves the analytics closer to the data, giving organizations a solution designed with the requirements of the enterprise in mind. It integrates and manages the full variety, velocity, and volume of data; applies advanced analytics to information in its native form; visualizes all available data for ad hoc analysis; provides a development environment for building new analytic applications; and handles workload optimization and scheduling.
• Hadoop-based analytics.
• Stream computing.
• Data warehousing.
• InfoSphere Streams: Enables continuous analysis of massive volumes of streaming data with sub-millisecond response times.
• InfoSphere BigInsights: An enterprise-ready, Apache Hadoop-based solution for managing and analyzing massive volumes of
structured and unstructured data.
• IBM Smart Analytics System: Provides a comprehensive portfolio of data management, hardware, software, & services capabilities that
modularly deliver a wide assortment of business changing analytics.
• InfoSphere Information Server: Understands, cleanses, transforms and delivers trusted information to your critical business
initiatives, integrating big data into the rest of the IT system.
• IBM PureSystems: PureSystems combine the flexibility of a general-purpose system, the elasticity of cloud, and the simplicity of an appliance. They are integrated by design and come with built-in expertise gained from decades of experience to deliver a simplified IT experience (deRoos, Eaton, Lapis, Zikopoulos, & Deutsch, 2012). IBM offers a platform for big data, including IBM InfoSphere BigInsights and IBM InfoSphere Streams. IBM InfoSphere BigInsights represents a fast, robust, and easy-to-use platform for analytics on big data at rest, while IBM InfoSphere Streams is a powerful analytic computing platform for analyzing data in real time with micro-latency (Zikopoulos, deRoos, Parasuraman, Deutsch, Corrigan, & Giles, 2013). To the best of our knowledge, there is no other vendor that can deliver analytics for big data in motion (InfoSphere Streams) and big data at rest (BigInsights) together (Ferguson, 2012).
IBM NETEZZA ANALYTICS: HIGH PERFORMANCE ANALYTIC PLATFORM
IBM Netezza Analytics is an embedded, purpose-built, advanced analytics platform, delivered with every IBM Netezza appliance, that empowers analytic enterprises to meet and exceed their business demands.
GREENPLUM UNIFIED ANALYTICS PLATFORM: THE ANSWER TO AGILE (EMC PERSPECTIVE PURSUING THE AGILE
ENTERPRISE, 2012; & A WHITE PAPER FROM EMC: BIG DATA AS A SERVICE, 2013)
Greenplum is driving the future of big data analytics with the industry's first Unified Analytics Platform (UAP). Greenplum UAP is a single, unified data analytics platform that combines the co-processing of structured and unstructured data with a productivity engine that empowers collaboration among data science teams. UAP is uniquely able to facilitate the discovery and sharing of the insights that lead to greater business value.
These components are delivered as hardware, cloud infrastructure, or on the Greenplum Data Computing Appliance (DCA). The modular DCA is the best first step to agile analytics with Greenplum UAP. Purpose-built for data co-processing on commodity hardware, the DCA delivers the fastest time to value in a platform that allows organizations to combine a shared-nothing MPP relational database with enterprise-class Apache Hadoop and expand it gracefully as needed in a single, unified appliance. The DCA delivers rich facilities for redundancy, availability, fault detection, and alerting, enabling organizations to avoid the integration and maintenance tasks that typically reduce the productivity of data science teams. EMC's Data Computing Division is expanding on Greenplum's deep support for in-database analytics with partners including SAS and MapR (Henschen, 2011). (see Figure 4)
Figure 4. Greenplum unified analytical platform
HP VERTICA PLATFORM WITH ‘R’
The HP Vertica Analytics Platform is a proven, high-performance data analytics software platform. R is one of the most popular open-source data mining and statistics software offerings in the market today, and the combination of the two technologies helps turn big data into big value. The integration of R, a no-charge offering, into the HP Vertica Analytics Platform lets the enterprise sift through data quickly to find anomalies using advanced data mining algorithms provided by R. No complex import, export, or extract/transform/load (ETL) jobs are required. By integrating data mining into its processes, an organization is poised to make better business decisions, in less time, based on data mining results. (Innovative Technology for Big Data Analytics, 2012, October)
Key Features and Benefits
The HP Vertica Analytics Platform is designed for speed, scalability, and simplicity. Among other features and benefits, the platform uses a SQL database for data mining analytics, runs on industry-standard x86 hardware, is based on a massively parallel processing (MPP) columnar architecture that scales to petabytes, reduces footprint via advanced data compression, provides extensible analytics capabilities, is easy to set up and use, provides elasticity to grow and shrink as needed, and offers an extensive ecosystem of analytic tools. The HP Vertica Analytics Platform features a high-performance MPP, cluster-based architecture that implements a “divide and conquer” strategy applied to SQL queries. The same strategy can now also be applied to some advanced data mining algorithms using R. Some data mining algorithms lend themselves to a division of work similar to how the HP Vertica User Defined Extension (UDX) framework transform functions work, using the standard SQL-99 syntax called “windowing.” In addition, the R programming model is vector-based, which in concept is very similar to the column-based architecture of the HP Vertica Analytics Platform. (“R” You Ready? Turning Big Data into Big Value with the HP Vertica Analytics Platform and R, 2012, October) (see Figure 5)
SAS PLATFORMS FOR HIGH PERFORMANCE
According to P. Carter (2011), Associate Vice President of IDC Asia Pacific, “Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and/or analysis.”
SAS In-Memory Analytics: In-Memory Systems
“Resolve complex problems in near-real time for highly accurate insights.” With SAS In-Memory Analytics solutions, organizations can tackle previously unsolvable problems using big data and sophisticated analytics in an unfettered and rapid manner.
Solutions and Capabilities:
• SAS High-Performance Analytics Server: Get near-real-time insights with appliance-ready analytics software designed to tackle big data and complex problems.
• High-Performance Risk: Faster, better risk management decisions based on the most up-to-date views of the overall risk exposure.
• High-Performance Risk Management: Take quick, decisive actions to secure adequate funding, especially in times of volatility.
• High-Performance Stress Testing: Make faster, more precise decisions to protect the health of the firm.
• Visual Analytics: Explore big data using in-memory capabilities to better understand all of the data, discover new patterns and publish
reports to the Web and iPad.
• SAS Visual Analytics: SAS Visual Analytics is a high-performance, in-memory solution for exploring massive amounts of data very
quickly. It enables one to spot patterns, identify opportunities for further analysis and convey visual results via Web reports, the iPad or an
Android Tablet.
• SAS Social Media Analytics: A solution that integrates, archives, analyzes and enables organizations to act on intelligence gleaned from
online conversations on professional and consumer-generated media sites. Most social media analytics solutions are simply listening
platforms, but that’s not enough. SAS provides context to the conversations one’s customers are having by better aligning what one listens to
with the lens through which one views one’s business
• SAS High-Performance Analytics Server: An in-memory solution that allows one to develop analytical models using complete data, not just a subset, to produce more accurate and timely insights. One can now run frequent modeling iterations and use sophisticated analytics to get answers to questions one never thought of, or never had time, to ask. SAS High-Performance Analytics includes domain-specific offerings
for statistics, data mining, text mining, forecasting, optimization, and econometrics – all available for execution in a highly scalable,
distributed in-memory processing architecture.
SAP HANA: IN-MEMORY COMPUTING PLATFORM
SAP HANA is an innovative in-memory data platform that is deployable on-premise as an appliance, in the cloud, or as a hybrid of the two. It is a
columnar, Massively Parallel Processing (MPP) platform. SAP HANA combines database, data processing, and application platform capabilities
in-memory. The platform provides libraries for predictive, planning, text processing, spatial, and business analytics. This new architecture enables converged OLTP and OLAP data processing within a single in-memory column-based data store with ACID compliance, while eliminating data redundancy and latency (Sagnou, 2012). (see Figure 6)
By providing advanced capabilities, such as predictive and text analytics, spatial processing, and data virtualization, on the same architecture, it further simplifies application development and processing across big data sources and structures. This makes SAP HANA the most suitable real-time
platform for building and deploying next-generation, real-time applications and operational analytics with extreme speed. SAP HANA is unique
in its ability to converge database and application logic within the in-memory engine to transform transactions, analytics, text analysis, predictive
and spatial processing.
Working in conjunction with SAP, Accenture is offering the opportunity to pilot HANA live and learn more (Experience SAP HANA with Accenture and SAP, 2011, August).
1010DATA OFFERS BIG DATA AS A SERVICE: CLOUD-BASED BIG DATA ANALYTICS PLATFORM
1010data offers a cloud-based big data analytics platform. Many database platform vendors offer cloud-based sandbox test-and-development environments, but 1010data's managed database service is aimed at moving the entire workload into the cloud (Bloor, 2012). The service supports a “rich and sophisticated array of built-in analytical functions,” including predictive analytics. A key selling point is that the service includes data modeling and design, information integration, and data transformation. Customers include hedge funds, global banks, securities exchanges, retailers, and packaged goods companies, and 1010data claims “higher performance at a fraction of the cost of other data management approaches” (Henschen, 2011). (see Figure 7)
INTELLICUS: POWER TO UNDERSTAND YOUR BUSINESS (BUSINESS INSIGHTS PLATFORM)
Intellicus is the first tool in the industry to introduce a state-of-the-art merger of traditional reporting data sources such as SQL and OLAP with promising data storage and processing tools such as Hadoop, columnar databases, and MPP systems, to offer an intuitive self-service analytics tool. Intellicus brings comprehensive business intelligence features to build an enterprise reporting and business insights platform. It provides reporting, both ad hoc and traditional pixel-perfect, along with dashboards, an OLAP server, advanced visualization, scheduled delivery, a business-user meta layer, and collaboration, with industry-standard security features. Intellicus, which pioneered browser-based ad hoc reporting, now offers flexible ad hoc analytics on mobile devices. Dashboards with advanced visualization tools enable users to track business metrics derived from correlating multiple enterprise data sources. Intellicus provides a rich platform for creating a business meta layer beneath which database complexities are hidden, letting end users work with reporting on their own. These self-service reports are based on Web 2.0 technologies with advanced visualization, including GIS maps (Henschen, 2014). (see Figure 8)
INFORMATICA: A PLATFORM POWERED BY THE VIBE VDM
The Power Driving the Informatica Platform: The Vibe VDM
The present need is an information platform that supports all styles of analytics, including the latest flavors of big data technologies, and that is powered by the Vibe virtual data machine (VDM), which understands the differences between the underlying analytical computing platforms and languages. That platform is the Informatica Platform. With Informatica, one can tap into all types of data, rapidly discover insights, and innovate faster in the age of Big Data. The Informatica Vibe virtual data machine is a data management engine that knows how to ingest data and then very efficiently transform, cleanse, manage, or combine it with other data. It is the core engine that drives the Informatica Platform. The Vibe VDM works by receiving a set of instructions that describe the data source(s) from which it will extract data; the rules and flow by which that data will be transformed, analyzed, masked, archived, matched, or cleansed; and ultimately where that data will be loaded when the processing is finished. Vibe consists of a number of fundamental components.
Bridging the Gap Between Traditional Data and Big Data
Although Informatica is typically best known for its data integration and data governance capabilities, the 9.5 release adds new support for
Hadoop, natural language processing, social networking data which Informatica supports with its Social Master Data Management (MDM)
offering, and drag-and-drop data mapping to increase the usability of big data and traditional data (Informatica white paper: Informatica and the Vibe Virtual Data Machine (VDM), 2013; Nucleus Research Note: Informatica Vibe for social, mobile and cloud-based data, Document N89, 2013).
TERADATA ASTER’S BIG ANALYTICS APPLIANCE
This appliance from Teradata makes big data problems more manageable. It is the first big analytics and discovery appliance in which the Aster database, SQL-MapReduce, and Apache Hadoop are brought together, helping executives and analysts leverage breakthrough business insights to:
▪ Minimize risk, maximize ROI, and accelerate time to value with enterprise-ready big data solutions.
Teradata Moves from EDWs to Extensive Analytic Family
Once traditionalist preachers of the enterprise data warehouse (EDW) approach, Teradata has loosened up in recent years and come out with an
extended family of offerings built around the Teradata database. The company's high-performance and high-capacity products have been widely recognized, as have many of the company's workload management features, including virtualized OLAP (cube-style) analysis.
Though Teradata did not have a footing in blended analysis of structured, semi-structured, and largely unstructured data, it has been pushing the envelope on in-database analytics. That's why it purchased Aster Data, which offers a SQL-MapReduce framework. Because MapReduce processing is useful in crunching massive quantities of Internet clickstream data, sensor data, and social-media content, Teradata recently announced plans for an Aster Data MapReduce appliance to be built on the same hardware as the Teradata appliance. It also added two-way integration between the Teradata and Aster Data databases. By buying Aster Data, Teradata has broadened what is widely regarded as the
broadest, deepest, and most scalable family of products available in the data warehousing industry (Henschen, 2011). (see Figure 9)
ORACLE BIG DATA APPLIANCE
Oracle is uniquely qualified to combine everything needed to meet the big data challenge – including software and hardware – into one
engineered system. The Oracle Big data Appliance is an engineered system that combines optimized hardware with the most comprehensive
software stack featuring specialized solutions developed by Oracle to deliver a complete, easy-to-deploy solution for acquiring, organizing and
loading big data into Oracle Database 11g. It is designed to deliver extreme analytics on all data types, with enterprise-class performance,
availability, supportability and security. With Big data Connectors, the solution is tightly integrated with Oracle Exadata and Oracle Database, so
one can analyze all the data together with extreme performance (Dijcks, 2013). (see Figure 10)
Oracle Big Data Appliance includes a combination of open source software and specialized software developed by Oracle to address enterprise big data requirements. The Oracle Big Data Appliance integrated software includes:
• Open source distribution of the statistical package R for analysis of unfiltered data on Oracle Big Data Appliance.
• Oracle Enterprise Linux operating system and Oracle Java VM. (see Figure 11)
Figure 11. Usage model for big data appliance and exadata
By using the Oracle Big data Appliance and Oracle Big data Connectors in conjunction with Oracle Exadata, enterprises can acquire, organize and
analyze all their enterprise data – including structured and unstructured – to make the most informed decisions.
KOGNITIO OFFERS THREE APPLIANCE SPEEDS AND VIRTUAL CUBES
Kognitio is a database vendor known for in-memory database management. It doesn't have its own hardware but, in response to customer interest in quick deployment, offers Lakes, Rivers, and Rapids appliances with its WX2 database preinstalled on HP or IBM hardware. The Lakes configuration delivers high-capacity storage at low cost, with 10 terabytes of storage and 48 compute cores per module. This appliance is aimed at financial firms doing algorithmic trading or with other high-performance demands. This year Kognitio added a virtual-OLAP-style “Pablo” analysis
engine that offers flexible, what-if analysis by business users. This optional extension to WX2 builds virtualized cubes on the fly.
Thus, any dimension of data in a WX2 database can be used for rapid-fire analysis from a cube held entirely in memory. The front-end interface
for this analysis is Microsoft Excel by way of A La Carte, a Pablo feature that lets users of this familiar spreadsheet interface tap into the data in WX2 (Henschen, 2011).
MICROSOFT APPLIANCE SCALES OUT SQL SERVER WITH PDW
The Microsoft SQL Server R2 Parallel Data Warehouse (PDW) was released in early 2011 to enable customers to scale up into deployments
analyzing hundreds of terabytes. The appliance is offered on hardware from partners including Hewlett-Packard. At launch, PDW pricing was
just over $13,000 per terabyte of user-accessible data, including hardware, though Microsoft shops can expect discounting. It remains to be seen
how deep street-price discounts will go.
PDW, like many products, uses massively parallel processing to support high scalability, but Microsoft was late to the market and lags behind
market leaders on in-database analytics and in-memory analysis. Microsoft is counting on the appeal of its total database platform as a
differentiator. That means everything from its data lineage and budding master data management capabilities to its widely used Information
Integration, Analysis and Reporting services, all of which are built-in components of the SQL Server database.
The Azure service will debut by the end of 2011 while the on-premises software is expected in the first half of 2012. No word on whether Microsoft
will work with hardware partners on a related big data appliance (Henschen, 2011).
SAND TECHNOLOGY – COLUMNAR SYSTEMS: THE WORLD'S HIGHEST-PERFORMING ENTERPRISE ANALYTIC DATABASE PLATFORM
SAND delivers the world’s highest performing Enterprise Analytic Database Platform. SAND Analytic Platform is a patented column database
management system (CDBMS), delivering optimal performance for every user through Infinite Optimization. Generation based concurrency
control (GBCC) technology supports thousands of concurrent users with massive and constantly growing data. SAND delivers on the promise of
sharing data throughout the Enterprise, providing instant access, driving decision-making, and ensuring that the best information is in the hands
of the right people at the right time.
The SAND Analytic Platform is a columnar analytic database platform that achieves linear data scalability through massively parallel processing
(MPP), breaking the constraints of shared-nothing architectures with fully distributed processing and dynamic allocation of resources. SAND
supports thousands of concurrent users with mixed workloads, infinite query optimization (requiring no tuning once data is loaded), in-memory
analytics, full text search, and SAND boxing for immediate data testing. The SAND Analytic Platform focuses on complex analytics tasks,
including customer loyalty marketing, churn analytics, and financial analytics.
INFOBRIGHT CUTS DBA LABOR AND QUERY TIMES
Infobright is a column-store database aimed at the analysis of moderate data volumes, ranging from hundreds of gigabytes up to tens of terabytes (Henschen, 2011). This is also the core market for Oracle and Microsoft SQL Server, but Infobright says its alternative database, which is built on MySQL and designed for analytic applications, delivers higher performance at lower cost with much less database administrative work. The column-store database creates indexes automatically, requires no data partitioning, and needs minimal ongoing DBA tuning. The company claims customers do 90% less work than required for conventional databases while incurring half the cost in terms of database licensing and storage, thanks to high data compression.
Infobright's recent 4.0 releases added a Domain Expert feature that lets companies ignore repeating patterns of data that don't change, such as
email addresses, URLs, and IP addresses. Companies add their own patterns as well, whether it's related to call data records, financial trading, or
geospatial information. The Knowledge Grid query engine then has the brains to ignore this static data and explore only changing data. That
saves query time because irrelevant data doesn't have to be decompressed and interrogated.
PARACCEL COMBINES COLUMN-STORE, MPP, AND IN-DATABASE ANALYTICS
ParAccel is the developer of the ParAccel Analytic Database (PADB), a database that combines the fast, selective-querying and compression
advantages of a column-store database with the scale-out capabilities of massively parallel processing. The vendor says its platform supports a
range of analyses, from reporting to complex advanced-analytics workloads. Built-in analytics enable analysts to perform advanced
mathematical, statistical, and data-mining functions, and an open API extends in-database processing capabilities to third-party analytic
applications. Table functions are used to feed and receive results to and from third-party and custom algorithms written in languages such as C
and C++. ParAccel has partnered with Fuzzy Logix, a vendor that offers an extensive library of descriptive statistics, Monte Carlo simulations and
pattern-recognition functions. The table function approach also supports MapReduce techniques and more than 700 analyses commonly used by
financial services. (Henschen, 2011).
ALPINE DATA LAB: PREDICTIVE ANALYTICS BUILT FOR BIG DATA
The Greenplum Unified Analytics Platform (UAP) combined with Alpine Data Labs delivers deep insights and models from all the data in a
simple web-based application that combines the power of big data processing with the sophistication of predictive analytics. This solution moves
beyond traditional business intelligence by delivering in-database analytics that allow one to unlock the full potential of the data. More
importantly, the platform’s accessible and easy-to-use interface allows everyone on the team to participate in the iterative discovery process.
CONNECTING TO ORACLE
Performance Acceleration for Oracle Data Warehousing
The analytics race is on and every company is choosing a partner. The goal is to create an agile, analytic environment – one that enables high
analytic productivity without requiring a lot of effort to set up and manage. That’s why an Oracle data warehouse is no longer enough. To address
the challenges that lie ahead, one will need an enterprise data warehouse (EDW) platform to quickly deliver analytic capacity, and to be able to
respond to unforeseen circumstances.
Why are Oracle Data Warehouses No Longer Enough?
Because most of them are overwhelmed by the increasing volume, variety, and velocity of data being created by today’s companies. But storing
information in smaller data marts is only an interim fix that introduces additional complexity into data management while reducing query
validity and performance.
The total cost of ownership (TCO) for an analytic platform based on Oracle Exadata is extremely high – as much as five times that of a purpose-
built analytic platform such as EMC Greenplum.
▪ Expensive and proprietary Exadata hardware required for all environments (including Development, QA, Testing, and Production).
For all of these reasons, enterprises are turning to the EMC Greenplum Data Computing Appliance (DCA) as the best solution for their EDW
platform needs.
SAS AND IBM ARE UNSHAKEABLE LEADERS, WHILE NEWCOMER SAP PERFORMS WELL
SAS’s Enterprise Miner tool is easy to learn and can run analysis in-database or on distributed clusters to handle big data. IBM’s Smarter Planet
movement and acquisitions of SPSS, Netezza, and Vivisimo represent its commitment to big data predictive analytics. IBM’s complementary
solutions, such as InfoSphere Streams and Decision Management, strengthen the appeal for firms that wish to integrate predictive analytics
throughout their organization. SAP is a newcomer to big data predictive analytics but is a Leader due to a strong architecture and strategy. SAP
also differentiates by putting its SAP HANA in-memory appliance at the center of its offering, including an in-database predictive analytics
library (PAL), and offering a modeling tool that looks a lot like SAS Enterprise Miner and IBM SPSS Modeler.
ALTERYX DESKTOP-TO-CLOUD SOLUTIONS: THE NEW APPROACH TO A STRATEGIC ANALYTICS SOLUTION
Alteryx Strategic Analytics is a desktop-to-cloud solution that combines business data with the market insight and spatial processing that today’s
strategic planners need. Now, data artisans can pull, overlay, and analyze any combination of enterprise data, industry content, and location
intelligence all in a single picture.
Alteryx offers three components to give you an unmatched strategic analysis software experience:
• Designer’s desktop used by data artisans to manage and analyze data. This is then embedded into analytic applications.
• Analytic Applications used by business decision makers. These are simple to use and focused on specific business problems with
embedded analytics, reporting, and visualization.
• Cloud services that offer the ability to publish to private or public cloud environments allowing the critical broad sharing of analytic IP to
business users, internal or external. (see Figure 12)
TOOLS/PRODUCTS
Actuate (Products like Quiterian, Actuate One)
Actuate acquired Quiterian to deliver big data analytics and visual data mining for business users. Business and non-technical users will reap the benefits of access to data mining and analytics within an intuitive, easy-to-use user interface. ActuateOne includes tools for developers and end users and makes them accessible from a single user interface. The ability to provide predictive analytics in the Actuate product pushes it beyond what other business intelligence (BI) providers offer. Quiterian technology will be fully integrated into ActuateOne to form BIRT Analytics, enabling users and IT to add the functionality required to access big data, further enhancing the flexibility and capabilities already present in ActuateOne.
GREENPLUM IS NOW PIVOTAL: A NEW PLATFORM FOR THE NEW ERA
Powered by new data fabrics, Pivotal One is a complete, next-generation enterprise Platform-as-a-Service that makes it possible, for the first time, for the employees of an enterprise to rapidly create consumer-grade applications: to create powerful experiences that serve consumers in the context of who they are, where they are, and what they are doing in the moment; to store, manage, and deliver value from fast, massive data sets; and to build, deploy, and scale at an unprecedented pace. (see Figure 13)
IBM and Brocade have teamed up to provide cost-effective and energy-efficient solutions to handle big data management’s biggest problems.
IBM and Brocade have, in concert, developed a system with an underlying infrastructure that can handle the biggest conceivable big data problems. This system, the Brocade VDX 8770, is based on the VCS architecture and extends its scale. The fabric architecture will scale up to 8,000 ports, easily accommodating the largest big data installations one can now conceive of. Every port can handle either 10 or 40 gigabits per second (1.25 or 5 gigabytes per second), so simple multiplication (8,000 ports at 5 GB/s is roughly 40 TB/s in aggregate) shows that the system can operate easily in the terabyte-per-second range. Through their relationship, IBM and Brocade hope to bridge the gap between slower predictive analytics and, by definition, fast real-time analytics.
IMPETUS ECOSYSTEM
Impetus offers a quick start program, architecture advisory, proof of concept, and implementation services. To extract valuable intelligence and insights from this voluminous information, analytics and business intelligence (BI) over structured and semi-structured big data are needed. For the BI strategy, one should look at factors such as ease and cost of implementation, real-time vs. batch analysis, and ad-hoc analytics. Impetus offers services for the extraction of business intelligence from big data. It has development proficiency in intercepting and extracting big data, converting it into a standardized, consumable mass of information, applying specialized analytics algorithms to extract patterns and rules, and presenting those patterns using advanced data visualization tools and techniques. It has expertise with solutions such as Pentaho and Intellicus for serving BI needs. Both of these suites offer batch as well as ad-hoc reporting, using Hive over Hadoop. Impetus has also partnered with Greenplum and has expertise with solutions such as Aster and Vertica to serve massively parallel processing (MPP) database needs. Built to support the next generation of big data warehousing and analytics, such an MPP database is capable of storing and analyzing petabytes of data.
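To make the idea of ad-hoc reporting with Hive over Hadoop concrete, the following is a minimal sketch, assuming a reachable HiveServer2 endpoint and the open-source PyHive client; the host, table, and column names are hypothetical and are not part of the Impetus, Pentaho, or Intellicus offerings.

```python
# Minimal sketch of an ad-hoc BI query over Hive/Hadoop.
# Assumes a HiveServer2 endpoint and the PyHive client; all names are placeholders.
from pyhive import hive

def top_products_by_revenue(host: str, limit: int = 10):
    conn = hive.Connection(host=host, port=10000, database="sales")
    cursor = conn.cursor()
    # Hive compiles this SQL-like query into distributed jobs over HDFS,
    # which is what makes ad-hoc reporting over very large datasets feasible.
    cursor.execute(
        "SELECT product_id, SUM(amount) AS revenue "
        "FROM transactions GROUP BY product_id "
        "ORDER BY revenue DESC LIMIT %d" % limit
    )
    return cursor.fetchall()

if __name__ == "__main__":
    for product_id, revenue in top_products_by_revenue("hive.example.com"):
        print(product_id, revenue)
```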
GREENPLUM CHORUS: PRODUCTIVITY ENGINE
Greenplum Chorus is a productivity engine for data science teams, empowering people within an enterprise to collaborate more easily and derive insight from their data, whether it is big, small, structured, or unstructured.
ACTIAN DATARUSH: ANALYTICS ENGINE FOR PARALLEL DATA PROCESSING (ANALYTICS ENGINE FOR PARALLEL
DATA PROCESSING: ACTIAN DATARUSH, 2013)
Actian DataRush is a patented application framework and analytics engine for high speed parallel data processing. DataRush is used for risk
analysis applications, fraud detection, healthcare claims management, cyber security, network optimization, telecom call detail record analysis
for optimizing customer service; the organization is also doing some pioneering work with utility companies on smart grid optimization.
SPLUNK ENGINE
Splunk is a general-purpose search, analysis and reporting engine for time-series text data, typically machine data. It provides an approach to
machine data processing on a large scale, based on the MapReduce model (Pavlo, Paulson, Rasin, Abadi, DeWitt, Madden, & Stonebraker, 2009).
Machine data is a valuable resource. It contains a definitive record of all user transactions, customer behavior, machine behavior, security
threats, fraudulent activity and more. It’s also dynamic, unstructured, non-standard and makes up the majority of the data in the organization.
The Splunk search language is simple enough that almost anyone can explore data with a little training, yet powerful enough to support a complicated data processing pipeline. Figure 14 describes the Splunk architecture.
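As an illustration of the MapReduce model cited above, the following is a minimal, self-contained sketch that counts events per host in raw machine data; it is a conceptual example of the programming model only, not Splunk's implementation or search language.

```python
# Conceptual MapReduce-style aggregation over machine data:
# count events per host from raw log lines (illustration only).
from collections import defaultdict
from typing import Iterable, Tuple

def map_phase(log_lines: Iterable[str]) -> Iterable[Tuple[str, int]]:
    # Emit (host, 1) for every line containing a "host=" token.
    for line in log_lines:
        for token in line.split():
            if token.startswith("host="):
                yield token[len("host="):], 1

def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> dict:
    counts = defaultdict(int)
    for host, n in pairs:
        counts[host] += n  # shuffle and reduce collapsed into one step here
    return dict(counts)

logs = [
    "2013-05-01T12:00:00 host=web01 status=500",
    "2013-05-01T12:00:01 host=web02 status=200",
    "2013-05-01T12:00:02 host=web01 status=404",
]
print(reduce_phase(map_phase(logs)))  # {'web01': 2, 'web02': 1}
```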
BIG DATA ANALYTICS IN NETWORK
In this era of big data, large Internet businesses, cloud computing suppliers, media and entertainment organizations, and high-frequency trading environments run clusters larger than those used in High Performance Computing (HPC), currently the most advanced technology in the cloud environment (Fernander, 2012).
The types of networks, together with the programming models and platforms used, differentiate High Performance Computing (HPC) from cloud computing environments. In the scientific/academic sector, it is typical to use proprietary solutions to achieve the best performance in
terms of latency and bandwidth, while sacrificing aspects of standardization that simplify support, manageability and closer integration with IT
infrastructure.
Within the enterprise the use of standards is paramount, and that means heavy reliance upon Ethernet, which will not satisfy the current need. The need of the hour is a new approach: a new “maverick fabric.”
MAVERICK FABRIC
Such a fabric should have a way to eliminate network congestion within a multi-switch Ethernet framework to free up available bandwidth in the
network fabric. It should also significantly improve performance by negotiating load-balancing flows between switches with no performance hit, and use a “fairness” algorithm that prioritizes packets in the network and ensures that broadcast data or other large-frame traffic, such as that from localized storage sub-systems, will not unfairly consume bandwidth.
Adaptive Routing and Lossless Switching
A fundamental problem with legacy Ethernet architecture is congestion, a byproduct of the very nature of conventional large-scale Ethernet switch architectures and of the Ethernet standards themselves. Managing congestion within multi-tiered, standards-based networks is a key requirement to ensure high utilization of computational and storage capability. The inability to cope with typical network congestion causes a range of problems.
But the latency of proprietary server adaptors and standard Ethernet is only one hindrance to achieving the performance necessary for a wider
exploitation of Ethernet in HPC environments.
JUNIPER NETWORKS
Juniper Networks delivers a big data analytics solution to provide better network intelligence and drive informed decisions (Hamel, 2013).
Junos Network Analytics Suite
The Junos Network Analytics suite helps service providers reduce costs and grow revenue with the power of a scalable big data solution. Juniper, an industry leader in network innovation, developed the suite with Guavus, a leading provider of big data analytics solutions; it is a family of next-generation big data analytics and network intelligence solutions, of which BizReflex and NetReflex are the first products. BizReflex and NetReflex leverage an innovative “analyze first” architecture that delivers valuable insights from IP and Multiprotocol Label Switching (MPLS) network traffic patterns to better understand network performance. These products give Service Providers (SPs) a significant tool to optimize their routing network assets, increase revenue opportunities, and attract and retain customers.
With the onslaught of dynamic cloud applications, the explosion of mobile device use, and the enormous amounts of data traversing networks, it is now more important than ever for service providers to obtain insights by extracting network data from their routing infrastructure in order to make business-critical decisions. However, for most service providers, capturing and analyzing the data within their router networks is a complex and
laborious process that is not built to scale. With the Junos Network Analytics suite, customers will be able to extract more productivity from the
network, query data easily and adapt more quickly to changing business needs.
The first two products in the Junos Network Analytics suite combine a powerful analytics engine with state-of-the-art visual dashboards that present network insights as customizable graphics, statistics, and drill-downs.
• BizReflex: This is a network analytics engine and dashboard for business decision makers. It allows them to gather critical intelligence on how customers, peers, and prospects interact with the network. The tool extracts and analyzes information from edge and core routers to allow operators to segment enterprise customers according to their respective value and to cost services accordingly, improving margins and customer retention. It also allows service providers to identify high-value prospects and acquire new customers more efficiently. These valuable insights can boost revenue opportunities and improve service differentiation.
• NetReflex: This solution provides network architects and operations personnel with detailed traffic trends and analysis for IP and MPLS networks. It gives operators more insight than previously possible into traffic patterns on the network, enabling network service providers to reduce costs through better-informed decisions and to improve the efficiency of their networks.
BIG DATA ANALYTICS CAN BOOST NETWORK SECURITY
While big data analytics will probably never replace existing network security measures such as IPS and firewalls, it can help to disclose violations that might otherwise have gone undetected. Beyond fraud detection, big data analysis has many uses, and one that is filtering down from government circles into the enterprise is detecting anomalous network behavior that is indicative of a security violation.
Analyzing for Anomalies
Just as banks do to detect credit card fraud, the purpose of accumulating this big data is anomaly detection. A number of diverse vendors already analyze big data for security purposes, some from a big data analysis background and others from a log management background; these include Splunk, IBM (with its Security Intelligence with Big Data offering), TIBCO LogLogic, and LogRhythm.
Once anomalous behavior on the network is noticed, the next stage is to establish whether there really is a threat, or whether the alert has been thrown up because of an unusual but harmless event, perhaps a user just doing something odd. This is where KEYW's system promises to differ from those of Splunk and others.
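As a conceptual sketch of the anomaly detection idea described above, the following compares today's event count for a user or host against its own recent history; the threshold and the field being counted are illustrative assumptions, and commercial systems such as those from Splunk, IBM, or KEYW use far richer models.

```python
# Minimal statistical anomaly-detection sketch: flag a count that deviates
# strongly from its own history. Thresholds and data are illustrative only.
import statistics

def is_anomalous(history: list, today: int, z_threshold: float = 3.0) -> bool:
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0   # avoid division by zero
    return abs(today - mean) / stdev > z_threshold

baseline_logins = [42, 39, 51, 45, 40, 47, 44]  # last week's login counts for one user
print(is_anomalous(baseline_logins, 46))        # False: an ordinary day
print(is_anomalous(baseline_logins, 240))       # True: worth investigating
```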
On a Smaller Scale
Big data analysis solutions tend to be expensive; KEYW's costs “a six-figure sum” for a typical deployment, for example. But smaller companies can get some security benefit from big data using a solution such as Sourcefire's FireAMP, which takes data from endpoints and analyzes it in the cloud. Sourcefire analyzes large numbers of known good and known bad files, among other techniques, and runs machine learning algorithms over them to come up with rules that recognize malicious files, which it can share with its customers. Big data analytics is unlikely ever to be a
replacement for existing security measures like IPS and firewalls -- not least because something has to generate the big data before it can be
analyzed. But its value lies in the fact that it can reveal breaches that might otherwise have gone undetected. And in a world where network
compromise is more a question of when than if, this can be very valuable information indeed. (Rubens, 2013)
WAN OPTIMIZATION FOR BIG DATA AND BIG DATA ANALYTICS
As advancements in WAN optimization technology happen constantly, there is fierce competition in the market to produce WAN optimizers that deliver tangible benefits in terms of performance, scalability, and integrity. Users look for optimization, security, scalability, and mobility in WAN optimization devices and related appliances, and these customer demands form the basis for further development. Introducing innovative technologies, products, solutions, appliances, and devices for WAN optimization of big data platforms is therefore very much necessary. The following are a few of the vendors who play a major role in WAN optimization for big data and analytics.
Vendors such as Riverbed, Blue Coat Systems, Cisco, Citrix, Juniper, F5 Networks, Packeteer, Expand Networks, FatPipe Networks, and Silver Peak Systems are known for their unique suites of WAN optimization solutions and technologies.
In 2007, Riverbed and Juniper led the WAN optimization appliance market. The key techniques these two vendors offered to users were advanced compression, caching, and data de-duplication.
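The following is a minimal sketch of the data de-duplication technique, assuming fixed-size chunks and a shared fingerprint index; production WAN optimizers typically use content-defined chunking and protocol-level optimizations well beyond this illustration.

```python
# Conceptual data de-duplication sketch: split a byte stream into fixed-size
# chunks, fingerprint each chunk, and ship only chunks not already cached
# remotely. Chunk size and scheme are illustrative assumptions.
import hashlib

CHUNK_SIZE = 4096

def dedup_stream(data: bytes, remote_index: set):
    """Yield (fingerprint, payload_or_None); None means 'already cached remotely'."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint in remote_index:
            yield fingerprint, None        # send a reference only: saves WAN bandwidth
        else:
            remote_index.add(fingerprint)
            yield fingerprint, chunk       # first occurrence: ship the bytes

cache = set()
payload = b"A" * 8192 + b"B" * 4096        # two identical chunks plus one unique chunk
sent = sum(len(c) for _, c in dedup_stream(payload, cache) if c is not None)
print(sent)                                # 8192: the duplicate chunk is never resent
```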
While Silver Peak Systems provides scalable solutions and is known for a software-based approach that replaces physical devices, Blue Coat and Cisco offer suites of solutions and technologies that can be plugged into their router and proxy products.
• Ability to monitor application and network performance for bandwidth, latency, and loss
WAN optimization tools are becoming more flexible, agile, and virtualization-friendly to accommodate all of these key trends.
Several vendors also provide WAN optimization solutions for cloud-based big data analytics.
VENDORS WHO ARE LESS KNOWN BUT DO A GREAT JOB IN BIG DATA ANALYTICS
Finding vendors with a truly innovative, successful approach to big data can be as daunting as digging through the data itself. Although not a
comprehensive list of the best (or biggest self-promoters) in the space, this roundup consists of Big data vendors making waves or showing
consistent promise. These vendors are making a name in the big data space and helping businesses get more value from their unstructured and
voluminous data sets.
• Hadapt: Hadapt specializes in integrating SQL with Apache Hadoop, or making the “elephant and pig fly,” as the analytic platform
provider colloquially promises.
• Platfora: After two years of engineering and six months of beta testing, San Mateo-based Platfora launched its namesake platform to bring business intelligence capabilities to big data pools in Hadoop.
• SiSense: Behind some of big data’s most notable early wins at Target and Merck is SiSense’s Prism product. The vendor also continues to
drum up data intrigue at SMBs, along with awards for functionality and speed.
• Kapow Software: While not new to the some-600 users of its products, Kapow is making some big value-ads in terms of integration and
automation capabilities, according to Ventana’s Mark Smith. The bicoastal vendor also boasts an executive team with varied backgrounds
in networking and visualization startups, including CTO (and frequent blogger) Stefan Andreasen.
• ZettaSet: ZettaSet got a huge nod as a support system for Intel's Hadoop launch. With $10 million in recent funding and another deal inked last year with IBM, there is probably more attention ahead for the Mountain View maker of the Orchestrator management platform.
• ClearStory Data: ClearStory Data offers a scalable application for data discovery and analysis across sources, and the straightforward
presentation of business value to go with it.
CONCLUSION
We have looked into various analytical processing systems and how to analyze data effectively through them. We briefly reviewed the organizations that provide platforms, solutions, products, and appliances for big data analytics, and the prominent role of big data analytics in networking and network optimization. It is clear that big data will eventually serve its role
when the processing technology and the capability of people/organizations that use the technology are well combined.
This work was previously published in the Handbook of Research on Cloud Infrastructures for Big Data Analytics edited by Pethuru Raj and
Ganesh Chandra Deka, pages 392-418, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
deRoos, D., Eaton, C., Lapis, G., Zikopoulos, C. P., & Deutsch, T. (2012). An ebook from IBM: Understanding big data – Analytics for enterprise
class hadoop and streaming data. Retrieved from http://public.dhe.ibm.com/common/ssi/ecm/en/iml14296usen/IML14296USEN.PDF
Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., & Stonebraker, M. (2009). A comparison of approaches to large-scale
data analysis. In Proceedings of the 35th SIGMOD International Conference on Management of Data (pp. 165-178). New York: ACM. Retrieved
from http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
Zikopoulos, C. P., deRoos, D., Parasuraman, K., Deutsch, T., Corrigan, D., & Giles, J. (2013). An ebook from IBM: Harness the power of big data: The IBM big data platform. Retrieved from http://public.dhe.ibm.com/common/ssi/ecm/en/imm14100usen/IMM14100USEN.PDF
KEY TERMS AND DEFINITIONS
Analytical Appliance: An integrated hardware-software blend tuned for analytical workloads; appliances come in many shapes, sizes, and configurations.
Analytical Platforms: Purpose-built, SQL-based systems used as data warehouse replacements or as stand-alone analytical systems.
Big Data Platform: A platform that cannot be just a platform for processing data; it has to be a platform for analyzing that data to extract insight from its immense volume, variety, velocity, value, and veracity.
Columnar Systems: Because these systems store and compress data by columns instead of rows, columnar databases offer fast performance for many types of queries.
In-Memory Systems: Systems that put all data into memory, ideal where raw performance is needed.
CHAPTER 47
Aesthetics in Data Visualization:
Case Studies and Design Issues
Heekyoung Jung
University of Cincinnati, USA
Tanyoung Kim
Georgia Institute of Technology, USA
Yang Yang
Dublin City University, Ireland
Luis Carli
University of São Paulo, Brazil
Marco Carnesecchi
Università della Valle d’Aosta & Università di Siena, Italy
Antonio Rizzo
Università di Siena, Italy
Cathal Gurrin
Dublin City University, Ireland
ABSTRACT
Data visualization has been one of the major interests among interaction designers thanks to the recent advances of visualization authoring tools.
Using such tools, including programming languages with graphics APIs, websites with chart topologies, and open-source libraries and component models, interaction designers can more effectively create data visualizations, harnessing their prototyping skills and aesthetic sensibility. However, technical and methodological challenges still exist for interaction designers jumping into the scene. In this article, the authors introduce
five case studies of data visualization that highlight different design aspects and issues of the visualization process. The authors also discuss the
new roles of designers in this interdisciplinary field and the ways of utilizing, as well as enhancing, visualization tools for the better support of
designers.
INTRODUCTION: DATA VISUALIZATION DESIGN
Nowadays we are flooded with data of diverse kinds due to increasing computational capability and accessibility. Specifically, in addition to public data available on the Internet (e.g., census, demographics, environmental data), data pertaining to personal daily activities are now more easily collected, for example, through mobile devices that can log people's running distances and times, or through manual records of nutrition consumption. Due to such expanded sources of data, new applications have appeared that involve data collection, visualization, exploration, and distribution in daily contexts. These applications not only display static information but also let users navigate the data in the form of interactive visualizations. This emerging trend has brought both opportunities and challenges to interaction designers, who must develop new approaches to designing data-based applications.
Conveying information has been one of the main functions of graphic and communication design since the analog printing era. The focus of information design is the communicative and aesthetic presentation of structured data, as in the example of a subway route map. In treating the increasing volumes of unprocessed data accessible either from public media or personal devices, the approaches of information design are now more diverse under the influence of other disciplines (Pousman & Stasko, 2007). Specifically, unlike information design in its traditional and confined manner, data visualization starts with data that has not been structured and often exists in large volumes and complicated formats (Manovich, 2010). Thus, data visualization requires designers to acquire diverse knowledge and skill sets in addition to their visual aesthetic senses. The new requirements include human visual perception and cognition, statistics, and computational data mining. Moreover, as data visualization has been more broadly applied to end-user services, interaction and experiential qualities need to be considered more critically. Those qualities of data applications rely not only on the usability of data perception or task-based navigation but also build on aesthetics that afford engaging and exploratory data navigation. The latter has remained a less investigated area than the former due to the strong disciplinary tradition of data visualization in computer science and cognitive science.
Furthermore, visualizations are presented, used, and shared in diverse contexts from science labs to online journalism sites, to personal mobile
devices. This means that data visualization has become a truly interdisciplinary field of research and practice by weaving informatics,
programming, graphic design, and even media art (Vande Moere & Purchase, 2011).
In what follows, we overview current design approaches and tools for data visualization, then introduce five case studies for further discussion of
design issues, process and tools in regard to aesthetics in data visualization.
CURRENT TOOLS AND PROBLEMS IN AESTHETICS OF DATA VISUALIZATION
In recent years visualization scientists, mostly from computer science, have proposed many visualization-authoring tools in the hope of expanding both the range of people who create visualizations and the contexts of their use. These tools can be largely categorized into three kinds: 1) standalone programming languages and their Integrated Development Environments (IDEs), such as Adobe Flash ActionScript (Adobe) and Processing (Processing 2); 2) online or installation-required programs that provide visualizations of given chart topologies, such as ManyEyes (IBM) and Tableau Public (Tableau, 2013); and 3) libraries, toolkits, or component model architectures integrated with existing programming languages, such as d3.js (Bostock, 2012) and gRaphaël.js (Baranovsky) for web documents.
These tools certainly open new spaces in which designers can apply visualization techniques with less effort and exert their aesthetic expression. However, throughout the entire process from data acquisition to visualization, challenging aspects remain for the specific goal (the aesthetic and interactive qualities of visualization) and for the specific group of authors (interaction designers).
Online tools and visualization applications provide a set of chart templates as a means of turning complex data into perceivable information. However, the subtle variations of graphic and interactive design attributes are not fully considered in the existing visualization tools, which excludes designers who want full freedom of aesthetics and expressiveness. Thus, aesthetic consideration is limited to selecting colors or symbols at a later phase of visualization. Visualization libraries, which are provided and shared by altruistic and enthusiastic visualization experts, have expanded interaction designers' possibilities. However, the initial learning curve involves programming experience to some extent. Not to mention old-school graphic designers, interaction designers who do not have extensive computer science knowledge may find it difficult when first facing the libraries and toolkits. The lack of fundamental knowledge and tactical coding tips in the underlying programming language and the library may result in the abandonment of aesthetic expression.
In sum, the tools have limitations in aesthetic expression due either to the lack of expressive freedom or to the requirements of computer science and programming knowledge. The more significant issue is that the aesthetic concern pertains to the mere surface of visualization, in other words, “making a pretty appearance” with a previously processed and well-formatted dataset. We acknowledge that a visually pleasing appearance is a critical aspect of data visualization, especially when it comes to the job of graphic and interaction designers. However, we argue that the aesthetic consideration of data visualization goes far beyond that, covering the wider process of obtaining and organizing data, composing and narrating meaningful messages from the given data, and distributing the visualizations for the audience's access (Segel & Heer, 2010). In this sense, when we discuss aesthetics of data visualization in this article, it is not only about the look and feel of data representation. Instead, aesthetics should be approached from this holistic perspective. Considering the situations and aesthetics of use is a significant part of the emerging research and practice agendas in interaction and experience design. However, we argue that in the field of data visualization, the concept of aesthetics is still treated as merely the look of visualization techniques.
Here we introduce the issues of designing visualizations focusing on the aesthetic values throughout the entire process. Unlike general interface
design projects, data visualization may not be directly simulated in wireframe forms without a broader picture of how data are actually collected
and organized. For example, multiple datasets can be layered in one frame through different modes. In this case, interactive and temporal
attributes such as animation and transition become critical aspects that should help guide users' data navigation more easily. However, current prototyping approaches are limited in fully supporting such dynamic navigation. On one hand, designers have
used print-based mock-ups of static visualization images; they can produce these with Microsoft Excel, which is limited in demonstrating
interactive aspects of data visualization. On the other hand, a programming-based approach can build high-fidelity prototypes with interactivity by
loading actual data. However, it demands too much time and effort from designers to learn a new coding skillset. Even when designers are willing to learn programming, they often get exhausted choosing appropriate languages and tools for their particular project; they are initially expected to understand the strengths, shortcomings, and compatibilities of all available tools. Due to these constraints of existing tools and processes, interaction designers have not been fully involved in data visualization research and practice, although they have much potential to
contribute with their storytelling ability, aesthetic sensibility, and logical thinking ability.
CASE STUDIES OF AESTHETICS IN DATA VISUALIZATION
Motivated by the problems and constraints with the current design tools and approaches for data visualization, we propose three issues for
designing aesthetic data visualization: data gathering, data representation, and data navigation. Through the discussion of these issues, we hope
to explore better ways to utilize and improve visualization tools for designers when they exert their tacit knowledge and aesthetic sensibility. In
what follows, we introduce five case studies of design projects in which data visualization plays a core part. Each project is described and
analyzed according to the three issues:
• Data Collection: How did the designer gather and organize the data for the visualization?
• Data Representation: Why did the designer choose the functional forms and the aesthetic styles to represent the collected data?
• Data Navigation: How did the designer make the visualizations interactive to navigate the data?
In addition to these issues, we also discuss the process of each case and the tools used in it.
• Design Process: In which order did the designer consider and execute the three issues (data collection, representation, and navigation)? How did the steps of the design process influence one another?
• Design Tools: What visualization tools (libraries or languages) did the designer use? How did the tools influence each step of the design process, in both positive and negative ways?
Finally, we summarize all case studies by reflecting on challenges and accomplishments from them.
Case Study 1: Visual Representations of Online Banking Transactions
This project supports online banking customers' financial management of their income and expenditure through interactive data visualization, as an alternative to the typical monthly tabular report. Web-based tools (i.e., JavaScript libraries, Adobe Flash) enable various types of interactive data visualization. This project especially aims to design prototypes of different visual representations, including different layouts, navigation structures, and interfaces, according to different customer motivations. The designs of the visualizations were evaluated on the basis of real-life user scenarios and iteratively refined to a final prototype.
Data Collection and Representation
We first analysed the current visual representations of transaction data offered to customers on the online banking service. It is currently
displayed in a table of four columns: 1) date of payment, 2) description of income or expenditure item, 3) amount value and 4) value date. Each
row represents a single transaction, and it is possible to sort the rows by clicking on the header of each column. The current online representation allows customers only a little more interaction than a paper-based report. In our redesign we focused on exploring more interactive data navigation and manipulation afforded by web technology. In Tufte's words, this means letting users understand the data by having it represented instead of reading its analysis (Tufte, 2001). To explore possible visual representations we collected income and expenditure reports over the period of a year from around 100 anonymous users of the service who had consented to share their data. Specific research questions include: Which type of visualization is most appropriate to different customer profiles? How can customers be supported in reviewing payment history, comparing expenditures in different categories, and eventually managing their income and expenditure based on their previous reports?
The first prototype was a paper-based sketch (Figure 1) with only a few changes from the existing table report: 1) income and expenditure are
presented in separate columns, 2) the time interval can be specified in input sections and calendar pickers, 3) multiple accounts marked with
different colours can be switched by different tabs. The significant change in this version from the tabular report is the separation between
income and expenditure and a colour scale to represent different accounts. Figure 2 shows another representation of the transaction data: there
is a spatial and chromatic separation between income and expenditure items. The two sliced segments of the inner circle represent the total
amount of income (green) and expenditure (red). Users can retrieve details of a selected transaction by clicking each of the outer arcs. The angle of
each arc is proportional to the amount of money. This visual form allows comparative representation of single transactions and the overall
balance at the same time.
Data Navigation (Embedding Prototypes in Real Life Scenario)
We mapped the sample balance data onto the initial prototypes above, but it was difficult to evaluate actual use in terms of interactive navigation
without an activity scenario. Therefore we redesigned the online banking service page with several different visual representations of the sample data (Figure 3). Then we came up with two tasks: 1) to retrieve a particular payment transaction to a fictional travel agency from the report, and 2) to make a new payment to the same recipient. The steps for each task consist of accessing the payments archive by choosing one of the visual
representations on the main widget. Once the users access the visual report they have to find the particular transaction either by scrolling
through the whole history of the month or by querying the name of the agency that received the payment in the search box.
Figure 3. The mock-up of the web page of the banking service
Design Process and Tools
The prototypes presented above were designed using Balsamiq (Balsamiq Studios, 2013) mock-ups for the static visual representations and
Adobe Flash for the interactive visualization. The animations for the scenario were created using ActionScript. The choice of a graphic-oriented tool instead of data visualization libraries had an important impact on the project, giving more freedom of choice in terms of encoding techniques and faster iterations of various visual representations. However, forms and styles of data representation are not only determined by the structure of the data or the design tools but are also closely related to user goals for effective financial management, which called for a scenario-based iterative process. By using a scenario-based process, design concepts and navigation steps were specified through quick and easy prototyping tests. Our initial sketches (Figure 1) were barely distinguishable from a typical tabular report except for a spatial and colour separation between income and expenditure items. However, the following steps were to design prototypes that are drastically different from a table, such as the examples in Figures 2, 3, and 4. At first these new visual representations put users in a situation where they had to explore the visualization and interpret what the visual elements mean. They then expressed their judgment on the aesthetics of the representation while trying to retrieve information and perform tasks according to the given scenarios. After both the most and least similar prototypes to the original table were tested on users, we implemented them into a web service in order to test the visual representations as part of the service ecosystem where individual transaction data is collected and retrieved. According to Distributed Cognition theories, users can benefit from visual elements since external representations allow people to perceive data much faster and to process it cognitively for longer than they can internally in their minds (Kirsh, 2010). Users' performance in visual search can be more accurate with aesthetically aligned layouts, which is coherent with recent studies (Salimun, Purchase, Simmons, & Brewster, 2010). These assumptions were confirmed in the tests we ran on the web service UI. In fact, users preferred a rich and aesthetically attractive visualization (Figure 5) to one that they considered familiar (Figure 1) when the former was more functional than the latter for completing the task at hand.
Case Study 2: Visualization of Weather Changes Over Years in Multiple Cities
This project is the visualization of the weather records over five years in nine cities. To support time-based comparison we came up with a unified
viewing mode, which overlays macro patterns of temperature changes from multiple years and still allows access to details of daily data from
the overview (Figure 6).
Figure 6. The overview of the weather visualization
Data Collection: Daily Data of All Years and Hourly Data of the Selected Few Days
We collected the weather records from Weather Underground (2013), which provides the past five years' hourly temperature data for major capital cities. A simple URL returns the data, which can be saved in CSV file format (Weather Underground, 2007). We then wrote a custom JavaScript program to automate the process of accessing, downloading, and aggregating the data over the targeted duration and places. Unfortunately, the site blocked these repetitive URL requests, so the process could not be performed completely automatically. Due to this constraint, we had to reconsider the volume of the dataset; finally we decided to collect day-by-day information from 2007 to 2011. Additionally, we collected hour-by-hour information only for six days in each year.
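For illustration, the following sketch shows the kind of download-and-aggregate step described above; it is written in Python rather than the JavaScript the authors used, and the URL pattern and CSV column names are hypothetical stand-ins for the Weather Underground endpoints of the time.

```python
# Illustrative sketch (not the authors' script): fetch one CSV of daily
# observations per year from a weather API and aggregate daily min/max
# temperatures. URL pattern and column names are hypothetical.
import csv
import io
import urllib.request

def fetch_daily_extremes(city: str, year: int):
    url = f"https://weather.example.com/history/{city}/{year}?format=csv"  # hypothetical endpoint
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append((record["date"], float(record["min_temp"]), float(record["max_temp"])))
    return rows

# Aggregate 2007-2011 for one city into a single structure for the overlay view.
dataset = {year: fetch_daily_extremes("london", year) for year in range(2007, 2012)}
```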
Data Representation: Timeline for Yearly View and Radials for Daily Data
The view of the visualization is divided largely into two parts. One part is the yearly view, where the datasets of daily lowest and highest temperatures from each year are plotted on a single one-year timeline. The five datasets, each of which represents a single year, are overlaid on the timeline. The other part is the daily view, where the datasets of the six selected days from each year are plotted in a radial form. Each closed curve around the circle represents the hourly temperature change of a day. As in the overlaid timeline view, each of the five closed curves represents one day's hourly temperature change.
In the yearly-view timeline, the area between the highest and lowest temperature is filled with a gradient set of colors to provide a sense of scale for the differences in temperature within one city's dataset, as well as across the datasets of all cities. The strokes of the radial day view are also painted with the same gradient scheme. Vertical grids that create the divisions between the months are also used to make connections between the yearly view and the daily view above, by showing the position of the highlighted date on the yearly timeline. On the Y axis, we set a temperature range and filled it with white, labeled with the two bounding temperatures. This area works as a “base zone” of temperature, which helps users investigate the bounds of temperature change. This zone is also displayed as a white circle in each daily view above the year view; the inner bound is set at the minimum temperature, the outer bound at the maximum temperature. In general, our design requirements follow the idea of a good “data-ink ratio” in the visualization (Tufte, 2011).
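A minimal sketch of the yearly-view idea follows, assuming synthetic data and the matplotlib library; it plots the band between daily lows and highs and a white “base zone,” whereas the published design overlays five years and uses a colour gradient.

```python
# Minimal sketch of the yearly view: a filled band between daily low and high
# temperatures for one synthetic year, plus a white "base zone" band.
import math
import matplotlib.pyplot as plt

days = list(range(365))
lows = [5 + 10 * math.sin(2 * math.pi * d / 365) for d in days]   # synthetic daily lows
highs = [low + 8 for low in lows]                                 # synthetic daily highs

fig, ax = plt.subplots(figsize=(10, 3))
ax.fill_between(days, lows, highs, color="orange", alpha=0.4)     # temperature band
ax.axhspan(10, 15, color="white", zorder=3)                       # the "base zone" of interest
ax.set_xlabel("day of year")
ax.set_ylabel("temperature (°C)")
plt.show()
```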
Data Navigation: Switching Datasets and Details on Demand
Our goal was to add interaction in cases where displaying visual elements as data representations or labels would make the visualization somewhat illegible. We included four interactive features in the visualization: 1) a conventional HTML select menu to change the city; 2) mouse hovering on the year view to display detailed information as text (the mean of the maximum temperatures of the last five years, and the corresponding mean of the minimum temperatures); 3) mouse hovering on the five circles of the day view to display the time of day at the mouse cursor, and the average temperature at that time over the last five years; and 4) dragging the white base zone in the year view, which changes the radius and thickness of the associated circles in the day views.
Design Process and Tools
We started by sketching several possible techniques of data representation with pen and paper. Sketching allowed us to envision the visualization forms and the layout. However, hand-drawn sketches are not adequate for imagining how the visualization will look when the data are actually populated. Thus, based on the initial forms from the sketches, we started coding and plotted a small portion of the real data. We developed this visualization using SVG, JavaScript, and the d3.js library (Bostock, Ogievetsky, & Heer, 2011). The methods and abstractions that d3.js provides are easy to use for encoding data into graphic elements. We first coded the viewing modules separately, which made testing different sizes and positions of the modules easier when we arranged them.
To seek the optimal forms and visually pleasing styles for the visualization, we kept iterating the code with small parts of the real data. For cases where we wanted to test various forms, we made the code more parametric by creating and connecting variables through mathematical operations (Figures 7 & 8). For developing the interaction we followed a process very similar to the one used for defining the forms and styles of the visualization: sketching, coding, and testing.
Figure 7. Some variations developed during the iteration of the
year view
Case Study 3: Visualization of Daily Nutrition Consumption
This project supports browsing and recording one's daily nutrition consumption through an interactive data visualization application. With a particular focus on mobile health management, the project explored design solutions for small-screen browsing and recording of the nutritional facts of food items. Mobile devices can support convenient recording and monitoring of nutrient intake at any time, which can be critical in health management (Andersson, Rosenqvist, & Sahrawi, 2007). However, there are still challenges in visualizing nutrition entries within a small-screen user interface, such as displaying a large amount of data and interactively switching between modes of contextual and local data, among others. At the same time, every food consists of various nutrients such as carbohydrates, proteins, and fats, which are hard to understand, particularly when presented using only numbers. In consideration of these issues, we focused on finding metaphors for a cognitive and embodied visual form in order to facilitate information visualization and navigation on a small touch screen.
Data Visualization
We first surveyed existing mobile and web nutrition management applications to understand specific design issues and functional requirements in terms of menu structure and interaction modes. Nutrition facts of different foods are hard to read and keep track of, particularly when presented as numbers in a table. There is a need to explore different ways of displaying a large amount of complex information based on an understanding of a person's cognitive processes. From the survey of existing applications, we learned that “time” is a pivotal element that people use to record and navigate their daily nutrition consumption. We then sketched various types of wireframes to visualize nutrition intake over a timeline, for example, in line graphs, bar graphs, bubble charts, and pie charts. A set of design requirements is specified below in terms of the menu structure and navigation of the application:
• Support tracking all food items taken in a day and comparison of nutrition intake against recommended levels.
• Use food item icons instead of text for quick review and intuitive understanding.
• Provide rich information using preattentive visual elements such as colors and symbols for quick overview.
• Display two modes of information—overall food entries and specific nutrition components for each food, and support dynamic navigation
between the modes.
Data Navigation
By analyzing selected applications, we were aware of the lack of motivation to record all food consumed and the cumbersomeness of browsing the
data. This means that ease of use or efficiency of use is critical, but at the same time such design criteria are not enough. Some extra values are
required to motivate and constantly engage people with a new type of application. Based on the objectives and requirements discussed above, we
designed Food Watch, a mobile application for browsing and recording nutritional facts of food items. Food Watch visualizes nutritional
composition of different food items (e.g. carbohydrates, proteins, fats) in pictorial elements and pie charts for intuitive perception and navigation
of information in a small screen interface. The circular shape in the center of the display was selected with two different metaphors in mind: a
wristwatch and a dinner plate (Figure 9).
Specifically, a plate metaphor is considered appropriate not only in terms of its everyday use for serving food, but also in that its image can be
applied as a display object (Ware, 2000) to embed pie charts of nutrition facts over its round shape. At the same time, the wristwatch metaphor
provides a conceptual relation for recording and browsing daily nutrition intake as time-related information with its round shape. In this way, a
set of food items can be browsed by turning a graphic plate and food can be dragged onto the plate for selection. Then, daily nutrition intakes can
be recorded and browsed over the timeline of a graphic wristwatch in a different mode of interaction.
The interaction of the application consists of the three modes as specified below (Figure 10):
Figure 10. Food Watch Interaction: 1) Set Up Profile (top left),
2) Browse Food Items (top right), 3) Select A Food Item
(bottom left), and 4) Review Saved Food Items (bottom right)
• Set up a user’s profile by calculating personal daily nutrition requirements according to one’s body factors.
• Browse different foods comparing their nutrition facts to the daily nutrition requirements calculated previously.
• Review all the foods added to the review list and their nutrition facts accumulated in a day.
Data Collection
The data was collected after we had made initial decisions on the overall design direction in terms of visualization form, interaction, and layout.
The food database is built as a local XML file based on nutrition information collected from the USDA Nutrition Data Laboratory (USDA, 2011). By using a local database we focused more on experimenting with graphic and interactive design attributes in data visualization rather than on a real-time database connection. In addition, personal daily nutrition requirements are calculated according to one's body factors, based on the formulation provided by WeightLossForAll.com. They are used to simulate a use case of the application by comparing the nutrition facts of a selected food item to one's daily nutrition requirements.
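The following sketch illustrates the two data steps just described: loading a small local XML food database and estimating a daily energy requirement from body factors. The XML schema is hypothetical, and the Mifflin-St Jeor equation is used as a commonly cited stand-in for the WeightLossForAll.com formulation, which is not reproduced in the chapter.

```python
# Sketch: parse a local XML food database and estimate a daily kcal requirement.
# The XML schema is hypothetical; the formula is the Mifflin-St Jeor estimate,
# used here only as a stand-in for the prototype's original formulation.
import xml.etree.ElementTree as ET

FOOD_XML = """
<foods>
  <food name="apple" carbs_g="25" protein_g="0.5" fat_g="0.3" kcal="95"/>
  <food name="egg"   carbs_g="0.6" protein_g="6.3" fat_g="5.3" kcal="78"/>
</foods>
"""

def load_foods(xml_text: str) -> dict:
    return {
        f.get("name"): {k: float(f.get(k)) for k in ("carbs_g", "protein_g", "fat_g", "kcal")}
        for f in ET.fromstring(xml_text).iter("food")
    }

def daily_kcal_requirement(weight_kg: float, height_cm: float, age: int, male: bool,
                           activity_factor: float = 1.4) -> float:
    base = 10 * weight_kg + 6.25 * height_cm - 5 * age + (5 if male else -161)
    return base * activity_factor

foods = load_foods(FOOD_XML)
target = daily_kcal_requirement(weight_kg=70, height_cm=175, age=30, male=True)
eaten = foods["apple"]["kcal"] + foods["egg"]["kcal"]
print(f"{eaten:.0f} of {target:.0f} kcal consumed")
```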
Design Process and Tool
This study emphasizes the process of developing a form in order to illustrate how design intention comes into selecting specific shapes for
particular aesthetic and functional purposes. The aim of using metaphors of a dinner plate and wristwatch for browsing and selecting food items
was to simplify the visualization and navigation of nutrient data within a small touch screen interface. The shapes and functions of two existing
physical objects (a dinner plate and a wristwatch), incorporated into a digital form, offer a visual and behavioral analogy for display and
navigation of information. In this way a large amount of data is displayed in simple pictorial forms for quick overview of nutrition compositions
of different food items while keeping details of data as well.
We used Flash Lite 3.1 (ActionScript 2.0) to simulate the interactive visualization. Flash was a good choice for experimenting with various graphic shapes and symbolic icons to represent data in more familiar and engaging visual forms. However, dynamic data navigation in connection with the local XML database was not intuitive to code with ActionScript, especially without expert programming knowledge. Specifically, the challenges can be summarized as 1) browsing the database by rotating a graphic object, 2) visualizing the nutrition facts of a focused food item, 3) selecting food items by drag and drop, and 4) storing their nutrition facts for reviewing total daily consumption. We created many small functions that had not been planned at the beginning. The overall design process was quite linear, starting from visual sketches, moving to the specification of interaction concepts, and then building a database. However, the implementation was complicated and did not proceed step by step. We went through multiple sketches of the interaction sequence and flowcharts in order to make sense of the relations between the functions and variables used to store values from the database and to draw graphic shapes from them.
Case Study 4: Personal Lifelog Visualization
A variety of life-logging devices with sensing technologies have been created and their applications provide us with the opportunity to track our
lives accurately and automatically. In this context, there is a growing interest called the “Quantified Self Movement” driven by technologies that
sense numerous aspects of an individual’s life in detail. The idea of “quantified self” starts with tracking our daily activities such as location,
mood, health factors, sleep patterns, photos, phone calls, and so on. Many technologies support the capturing of lifelogs. For example, ubiquitous
smartphones allow us to record our life activities in previously unimaginable detail. These captured logs can be further used to infer interesting
and useful insights about people, their communication patterns with others, and their interaction with environments.
However, current tools for managing such rich data archives are still in need of improvement in terms of storage and organization. Since the captured material is voluminous and mixed in data type, it can be overwhelming and impractical to manually scan the full contents of these lifelogs, which eventually results in lots of “worn memories”: we write once but never read again. Besides, the raw data do not give much insight to users without additional semantic enrichment. Thus, a promising approach is to pre-process the raw data to extract and aggregate useful information, and then apply visualization to the large-scale data archives as a means of delivering them to target users. We argue that lifelog visualization is capable of displaying the sheer quantity of mixed multimedia content in a meaningful way.
Data Collection: Smartphones with Various Sensors
We developed a lifelogging platform working on Android smartphones. A typical smartphone is equipped with many sensors to capture various
sources of data (Table 1). The gathered data is first analyzed in the phone and then uploaded to the central server. At the server we perform
additional semantic analysis and enrich the data into semantically rich “life streams”; by applying machine learning techniques, we extract
semantics from raw sensor log streams; by grouping related sensed logs together, we reconstruct the data and design several use case scenarios.
For example, by aggregating the logs from the speaker, GPS, and Bluetooth sensors, we can detect whether a test subject is at work or engaged in social activity. In this way, we can extract various semantic contexts that are meaningful to target users. We also run more processor-intensive operations on the data, such as face detection. Finally, the data were processed into a variety of formats such as JSON, XML, and CSV prior to
the representation phase.
Table 1. Available sensors equipped on a user's smartphone
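As an illustration of the log-fusion step described above, the following rule-based sketch labels one time window from combined GPS, Bluetooth, and microphone-derived readings; the authors used machine-learning models on the server, so the thresholds and field names here are assumptions made for the example.

```python
# Illustrative rule-based sketch: combine readings from several sensors for one
# time window and label the context. Field names and thresholds are assumptions.
import json

def label_context(window: dict) -> str:
    at_office = window["gps_place"] == "office"
    many_devices = window["bluetooth_device_count"] >= 3
    talking = window["speech_detected"]
    if at_office and not talking:
        return "working"
    if many_devices and talking:
        return "social activity"
    return "other"

window = {
    "start": "2012-03-14T19:30:00",
    "gps_place": "restaurant",
    "bluetooth_device_count": 5,
    "speech_detected": True,
}
# Enriched records would then be exported as JSON/XML/CSV for the visualization layer.
print(json.dumps({**window, "context": label_context(window)}))
```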
Data Representation and Navigation: Three Types of Visual Logs
Sellen and Whittaker (2010) summarized five functions of memory that lifelogs could potentially support, referred to as the 5 Rs: recollecting
(recalling a specific piece of information or an experience), reminiscing (reliving past experiences for emotional or sentimental reasons),
remembering (supporting memory or tasks such as showing up for appointments), reflecting (facilitating the reflections and reviews on past
experiences) and retrieving (revisiting previously encountered digital items or information such as documents, email or web pages). According to
these functions of lifelogs, our primary design goal is to support users’ self-reflection, sharing, evoking thoughts, and reminiscence. We believe
that the form and styles of visualization should be determined by the purpose of uses and context.
Since the data generated from multiple sensor sources are complex, we need a more systematic approach to exploring the match between user
needs and data visualization. We first generate three scenarios of different user behaviors and contexts: visual diary, social interaction, and activities review. Then we characterize UI patterns based on these three contexts, and create wireframes and interface prototypes. The visual diary use case scenario helps users with recollecting, remembering, and reminiscing about their past experiences. Social interaction and activities review support more abstract representations of personal lifelogs to facilitate the reflecting and retrieving functions.
1) Visual Diary
Sensing devices automatically capture thousands of photos, and many times more sensor readings, per day (Kalnikaite, Steve, and Whittaker, 2012). Hence, grouping the sequences of related images into “events” is necessary in order to reduce complexity (Doherty & Smeaton, 2008). The visual design of the visualization is inspired by the squarified treemap pattern (Bruls, 1999). Our visualization, called “Visual Diary,” provides a summary of a user's daily log as photographs, with emphasis on important events (Figure 11). Each grid cell represents an event, and the size of the cell provides an immediate visual cue to the event's importance level. The position of each cell depicts the time sequence. At the same time, users are allowed to drill down (to the full photo stream and sensor log) inside each event.
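A minimal sketch of the Visual Diary layout idea follows, allocating screen area to each event in proportion to an importance score; it assumes the open-source squarify package (a generic implementation of the squarified treemap algorithm) and made-up scores, and is not the authors' implementation.

```python
# Minimal squarified-treemap sketch: tile area is proportional to each event's
# importance score. Assumes the open-source `squarify` package; scores are made up.
import matplotlib.pyplot as plt
import squarify

events = {"commute": 1.0, "team meeting": 3.5, "lunch with friends": 2.0,
          "conference talk": 5.0, "evening run": 1.5}

fig, ax = plt.subplots(figsize=(6, 4))
squarify.plot(sizes=list(events.values()), label=list(events.keys()),
              alpha=0.8, ax=ax)   # larger tiles correspond to more important events
ax.axis("off")                    # in the real design each tile shows the event's key photo
plt.show()
```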
2) Social Interaction Radar
For the visualization of social contexts, we utilize the data from three embedded sensors (i.e., Bluetooth, Wi-Fi, and GPS) (Figure 12). With these
datasets, we identify the social context of individual users over the course of a year. To support the different features of the multi-dimensional
data, we adopted a Coxcomb visualization technique, which helps users to understand the whole and its individual parts simultaneously. This
opens up new possibilities for rapid communication of complex constructs. Each concentric circle represents one type of the sensor data. This
radar graph has three dimensions: 1) type of sensor, 2) social activity value, and 3) time. Scrolling around the circle enables rapid exploration and comparison of sensor values between different months.
3) Activity Calendar
Activity view allows users to gain a detailed understanding of their physical activities (Figure 13). We calculate the level of physical activity with
the data from accelerometer and GPS sensors. We visualize the data in an annual calendar layout, with color-coding to present the activity
intensity. A darker color indicates more activity on a given day. By investigating the activity pattern over the course of a full year, it is possible to detect a user's extreme days (i.e., the most active or the most quiet days).
Figure 13. Yearly activity summary with a calendar view
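The following sketch illustrates the colour-coding step for the activity calendar: a day's accelerometer and GPS readings are reduced to one intensity value and mapped to a shade, darker meaning more active. The weighting and thresholds are assumptions for the example, not the authors' formula.

```python
# Sketch: reduce one day's accelerometer and GPS readings to an intensity in [0, 1]
# and map it to a calendar shade (darker = more active). Weights are assumptions.
def daily_intensity(accel_magnitudes: list, gps_distance_km: float,
                    max_distance_km: float = 20.0) -> float:
    movement = sum(1 for a in accel_magnitudes if a > 1.2) / max(len(accel_magnitudes), 1)
    travel = min(gps_distance_km / max_distance_km, 1.0)
    return 0.6 * movement + 0.4 * travel

def intensity_to_shade(intensity: float) -> str:
    level = int(round((1.0 - intensity) * 200))   # 0 = darkest, 200 = lightest
    return f"#{level:02x}{level:02x}ff"           # bluish scale for the calendar cells

samples = [0.9, 1.5, 1.8, 1.1, 2.3, 0.8]          # accelerometer magnitudes (g)
print(intensity_to_shade(daily_intensity(samples, gps_distance_km=6.5)))
```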
Design Tool
When choosing the tools for visualization, we had to consider the objectives as well as the time constraints for creating the interfaces. Our goal was to analyze the large data archive and to create dynamic interactive visualizations that are later exhibited in users' web browsers. Thus, accessibility without plug-ins, compatibility with web standards, and cross-browser support were our main concerns when choosing visualization tools. After a review of available tools, we chose the open-source JavaScript-based toolkits Protovis (Stanford Visual Group, 2010) and d3 (Bostock, 2012). These toolkits bind arbitrary data to the DOM and bring the data to life using HTML, CSS, and SVG, which appear almost identically on different browsers and platforms. These independent toolkits make it possible to plot data in novel structures with rich user interaction. We also chose the jQuery library for more dynamic interactive features. It is lightweight compared to other JavaScript frameworks and uses familiar CSS syntax that designers can learn and use easily. This combination of tools provides designers with powerful approaches to the aesthetic look and interactivity, and also makes data manipulation customizable. Ultimately they empower designers with the freedom to focus on the aesthetics.
Case Study 5: Visualizations of Mobile Communication Data
This project uses data visualization as a tool for exploratory data analysis (Shneiderman, 2002) to quickly discover insights for
research/design opportunities. Mobile phones can help collect extensive data ranging from personal usage of the phone to inter-personal
communication. In addition, due to their pervasive daily use, their various embedded sensors, and ubiquitous wireless technology infrastructure,
mobile phones have received increased use as a new kind of research tool. Researchers in this field have primarily used data mining, machine learning, and other quantitative modeling methods, in which interaction designers are not typically trained. This skillset mismatch keeps designers from being actively involved in data-driven research. In this challenging context, we suggest a new role for interaction designers in
which they can apply and enhance their existing skills: designing information visualizations of the available data in a timely manner and
deploying them to other researchers during the early phases of data-driven research. Using visualizations as a tool for exploratory data
analysis (Shneiderman, 2002), researchers can find insights quickly to ground the next phase of research.
Data Collection
Starting in late 2009 and lasting more than one year, over 160 volunteers near Lake Geneva, Switzerland, participated in a data collection campaign using mobile
phones. We gave study volunteers smart phones equipped with special software that ran in the background and gathered data from embedded
sensors continuously when the phone was turned on. The logged data were stored in the mobile phone and automatically uploaded to a database
server on a daily basis when a known WLAN access point was detected. The data was subsequently made available to the public through Nokia
Mobile Data Challenge (Nokia Research Center, 2012).
• Physical Proximity Data: The Bluetooth IDs of mobile phones scanned within a physically close distance
• Location Data: GPS, WLAN access points, and cell network information (in the order of precision)
• Media creation and Usage Data: Logs of photo taking, video shooting, music play, web browsing, alarm setting, etc.
Design Process and Tools (Requirements for Exploratory Data Analysis)
The data set emerging from the mobile phone study was very large and varied. Performing exploratory analysis on this data would require a high
degree of flexibility in supporting different subsets of the data extracted from the entire collection. We did not feel that any existing general-
purpose visualization systems could provide all the different perspectives we desired on the data. Also, building a custom visualization system for
the entire set appeared daunting and could possibly take a long time. Thus, we designed multiple, different interactive visualizations instead of a
monolithic visual analytic system. Each visualization focuses on a different aspect of the data collection and is designed to best portray a
particular aspect of the dataset. We assumed that the end-users of such visualizations would be researchers who want to explore massive
datasets, especially related to mobile communication data, before they begin investigation with other sophisticated data analysis methods.
In designing these visualizations, our priority was to generate flexible datasets that are directly related to researchers’ questions and easily
modified in response to their ongoing requests. Initially, we extracted relevant data subsets from a database using simple SQL queries. We stored these data
subsets in separate comma-separated-value (CSV) files to be linked and used in visualizations. To implement the visualizations, we
primarily used Java-based Processing. The simple structure of the visualization source files allowed us to easily manage and modify datasets and
visual design: we were able to quickly convert data formats and generate new dimensions or datasets from the linked datasets, and by simply editing
several lines of source code in a Processing file it was not difficult to change the size or color of the drawn visual elements. We also reused
portions of visualization source code throughout multiple visualizations and modified them as necessary. This highly customized design process
would hopefully make the visualizations versatile enough to support analysts’ incremental demands during the analysis.
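As a rough illustration of this extraction step, the following Python sketch pulls one subset into a CSV file; the SQLite database file, the gps_log table, and its column names are hypothetical, since the study’s actual schema is not described here.

```python
import csv
import sqlite3

# Hypothetical local copy of the study database; table and column names are assumptions.
conn = sqlite3.connect("mobile_data.db")
query = """
    SELECT user_id, timestamp, latitude, longitude
    FROM gps_log
    WHERE user_id = ?
    ORDER BY timestamp
"""

# Write the subset for one participant to a CSV file that a visualization can link to.
with open("user_042_gps.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "timestamp", "latitude", "longitude"])
    for row in conn.execute(query, (42,)):
        writer.writerow(row)

conn.close()
```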
Data Visualization (An Individual’s Daily Life)
Due to space limitations, we select one example that illustrates the micro-level traits of a single participant. This visualization focuses on data about
individual participants in order to understand different mobile phone usage patterns depending on different temporal and social contexts. Better
understanding the dissimilar lifestyles of individual participants would ground the design of personalized mobile services and applications. For
the visualization design, we applied timeline-based visualization because our goal was to best support effective analysis with a minimal learning
curve, not to invent new visualization techniques. That said, we believe these visualizations provide innovative designs for the visual analysis of
communication and location data.
1) Datasets: Phone Usage, Location, and Bluetooth Readings
Thanks to the wealth of data modalities tracked, the analysis of the data about a single user can yield extensive insights about his or her life.
This visualization was created with multiple datasets for an individual user (Figure 14).
Figure 14. Integrated visualization of an individual user’s
phone usage logs (colored squared below x-axis), GPS-based
moving status (gray-scale background), and Bluetooth
encounters with other people (gray-scale circles above x-axis)
• Location data from the high fidelity GPS logs table containing latitude and longitude coordinates: The number of GPS entries is roughly
equivalent to the frequency of physical movement.
• Bluetooth readings: Bluetooth detection data show the number of people appearing in a proximate physical distance and the frequency of
appearance.
• Mobile phone usage data whose entries were parsed from five different log tables in the server: These tables include voice calls, text
messages, web browsing, music play, and photo-shooting.
2) Timeline-Based Visualization with Multiple Elements
We processed the GPS logs in ten-minute intervals and represented them as the background of the timeline. The presence of more GPS entries
results in a darker background. The same time interval is applied to the Bluetooth data, which are represented by the grey-scale dots above the
horizontal 24-hour line. The number of dots represents nearby people detected through Bluetooth, and the darkness of each dot is proportional
to the frequency of the corresponding person’s presence. The squares below the bar are color-coded to represent the five different kinds of phone
usage data. We added simple menus to enable analysts to select other participants and to select/de-select the categories of mobile phone usage.
We also included a time selector, so they could adjust the time range of the data to any number of days within the sixteen weeks retrieved.
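The following Python sketch illustrates the ten-minute binning described above; the log structures and the example timestamps are invented for the example, and only the counting logic follows the description in the text.

```python
from collections import Counter
from datetime import datetime

BIN_MINUTES = 10

def ten_minute_bin(ts: datetime) -> int:
    """Index of the 10-minute interval within a 24-hour day (0..143)."""
    return (ts.hour * 60 + ts.minute) // BIN_MINUTES

# Hypothetical per-day logs: timestamps of GPS fixes and (timestamp, device_id) Bluetooth hits.
gps_log = [datetime(2010, 5, 3, 8, 4), datetime(2010, 5, 3, 8, 7), datetime(2010, 5, 3, 12, 15)]
bt_log = [(datetime(2010, 5, 3, 12, 12), "aa:bb"), (datetime(2010, 5, 3, 12, 14), "cc:dd")]

# Background darkness: number of GPS entries per interval (more entries -> darker background).
gps_density = Counter(ten_minute_bin(ts) for ts in gps_log)

# Dots above the timeline: distinct devices per interval; each device's total frequency over
# the day drives the darkness of its dot.
devices_per_bin = {}
for ts, dev in bt_log:
    devices_per_bin.setdefault(ten_minute_bin(ts), set()).add(dev)
device_frequency = Counter(dev for _, dev in bt_log)

print(gps_density, {k: len(v) for k, v in devices_per_bin.items()}, device_frequency)
```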
The participant shown in Figure 14 has many more Bluetooth devices detected around him during typical working hours, which might suggest that he is present in an office
environment with business colleagues. He also exhibits a rough commuting time range between 6:30 AM and 8 AM. Between 12:10 PM and 12:30 PM,
consistent Bluetooth detection is evident from the two darker dots. Based upon the consistency of occurrence, we infer that the other participants might be regular
lunch friend(s).
Additionally, we examined visualizations of participants who were college students. We observed that students tended to have different patterns
in terms of movement and Bluetooth detection (Figure 15). Neither of the two students shown exhibited a fixed, short commuting period;
instead, they seemed to move around more randomly. One student had two to three mobile phones appear nearby during night time (roughly
between 11 PM and 7 AM) (Figure 15-top). One of them might be a roommate, represented as the darkest dot, whereas the other dots might represent
occasional visitors. The other student did not have a regular peak time in terms of the number of people nearby (Figure 15-bottom). She
encountered fewer people, but was around them more frequently than the office worker (i.e., fewer dots overall, but relatively more dark
dots). The mobile usage pattern is also distinguishable; for instance, the students used the web browser and SMS more than voice calls.
Summary and Reflection
In this chapter we introduced five case studies of data visualization projects. The first project is about visual representation of online bank
transactions, intended to improve the user experience of managing multiple bank accounts and to make users more aware of their financial status.
Traditionally, design approaches to scientific visualization have been rather data-centric, focusing on visual analysis of the data structure.
However, as end-user visualization applications increase, a task-centric design approach becomes more crucial in selecting
forms of data representation and navigation interfaces that can best afford action-based user goals. The importance of a scenario lies in enabling
a clear grasp of the contextual elements and in providing particular interaction paths (Rizzo & Bacigalupo, 2004). In this vein, this study
illustrates that paper-based sketches with particular user scenarios can serve as an efficient design medium in the early phases of design for quick
iterations of initial ideas.
Moreover, as data visualization has been more broadly applied to end-user services, interaction and experiential qualities have become more
critical design issues, not only in terms of usability of data perception and task-based navigation but also in terms of attractive and engaging
representation and navigation of data. The second and the third case studies are more directly related to aesthetic forms and interface elements
of data representation and navigation. In the weather visualization project, specific graph shapes, colors, and interfaces were iteratively tested
and specified by coding (using d3.js). In the nutrition data application project, the main visual and interaction design directions were determined
by metaphors from physical objects, and then simulated in Flash ActionScript. In both cases, programming played a significant role in
stimulating design concepts and testing technical constraints in connection with the database and interactive navigation. Although programming is still
a barrier for many designers, it can provide more logical and consistent visual and interactive styles across multiple parts of a design system over an
iterative design process. Some common visualization tools and libraries (e.g., Flash ActionScript or d3.js) are efficient for demonstrating new
interaction and interface concepts. However, one of the big challenges is to manage the overall layout and the whole sequence of interaction, as well as
the data collection and manipulation, of the entire system of visualization applications. It would be beneficial to create new design processes
and tools that better support designers’ systematic thinking and simulation of the data ecosystem.
The last two case studies are about exploring new applications of data visualization. Due to increasing data capturing and processing power, we
have more challenges as well as opportunities in making sense of and using such voluminous and complex data. The mobile lifelogs project explores
different ways of representing personal data for the purposes of raising insights into personal activities and supporting self-reflection. The mobile
communication data project focuses on extracting research insights from data by using visualization as a rapid discovery tool for generating
insights from large-scale data at an early phase of research. This visualization-based analysis of user-generated data can serve as a new research
methodology, which is time-efficient and still people-centric by discovering traits about the people in the data. These approaches envision a new
role for visualization with its strength in transforming ideas into visual artifacts in data-driven user research.
CONCLUSION
As discussed in the case studies, this design process is not always linear; when other requests arise, from either users or the designers themselves,
iterations become necessary. Regardless of the different design processes and tools applied in each case study, they share general issues to be
discussed in terms of 1) data collection, 2) data representation, and 3) data navigation.
Data collection is about the whole system in which data is gathered and linked to the visualization application. This issue is closely related to both user
scenarios of data visualization (in terms of how data is provided, shared, and distributed) and technical challenges (in terms of how to retrieve
and link data in proper formats).
Data representation is related to the graphic forms through which the overall structure of data is represented along with its details. There are many standard
forms of graphs, such as the bar chart, line chart, and pie chart. Beyond those forms, more exploratory kinds of graphs have been enabled and
experimented with thanks to advanced computing technology, including network diagrams, tag clouds, bubble charts, and direct visualization, which shows
data as it is (as in a photo archive). In addition, geographical maps are frequently used to place data onto familiar spatial coordinates
for simple and efficient data perception. Given the increasing number of forms of data representation, the criterion for selecting an
efficient yet engaging form is always a critical design consideration. The selection of an overall representation form leads on to traditional
aesthetic concerns such as the selection of colors, sizes, and layout of graphic shapes, which can be iteratively polished afterwards.
Data navigation is a less investigated aspect than data representation. There are a few
options for standard interactivity, including pulling out details-on-demand or browsing data by scrolling, panning, and zooming. While various
forms of graphs and interfaces are used to represent data, interactive features to enter and navigate data are relatively limited in
diversity, with similar interfaces appearing in different applications. There is a great opportunity to explore new types of interactivity for navigating data, and at the
same time a huge challenge in simulating new interaction ideas with current design tools.
These three design issues—data collection, representation, and navigation—can be considered significant building blocks that constitute an
overall design process of data visualization, not necessarily arranged in a linear sequence. We expect that identifying these three design
blocks can support interaction designers in flexibly planning out a design process, with iterative simulation and refinement of the
connections among the three design issues.
This work was previously published in Innovative Approaches of Data Visualization and Visual Analytics edited by Mao Lin Huang and
Weidong Huang, pages 124, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Adobe. (n.d.). ActionScript technology center. Adobe Developer Connection. Retrieved from http://www.adobe.com/devnet/actionscript.html.
Andersson, P., Rosenqvist, C., & Sahrawi, O. (2007). Mobile innovations in healthcare: customer involvement and the co-creation of
value. International Journal of Mobile Communications , 5(4), 371–388. doi:10.1504/IJMC.2007.012786
Bostock, M., Ogievetsky, V., & Heer, J. (2011). D3 data-driven documents. IEEE Transactions on Visualization and Computer Graphics , 17(12),
2301–2309. doi:10.1109/TVCG.2011.185
Cross, N. (2011). Design thinking: Understanding how designers think and work . Oxford, UK: Berg Publishers.
Doherty, A. R., & Smeaton, A. F. (2008). Automatically segmenting lifelog data into events. In Proceedings of WIAMIS. New Brunswick, NJ:
IEEE Press.
Goodman, E., Stolterman, E., & Wakkary, R. (2011). Understanding interaction design practice. In Proceedings of Conference on Human Factors
in Computing Systems, 1061-1070. New York: ACM Press.
Kalnikaite, V., & Whittaker, S. (2012). Recollection: How to design lifelogging tools that help locate the right information. In Human-Computer
Interaction: The Agency Perspective (Studies in Computational Intelligence), 329-348. Berlin: Springer. doi:10.1007/978-3-642-25691-2_14
Kirsh, D. (2010). Thinking with external representations. AI & Society , 25(4), 441–454. doi:10.1007/s00146-010-0272-8
Löwgren, J., & Stolterman, E. (2004). Thoughtful Interaction Design: A Design Perspective on Information Technology . Cambridge, MA: MIT
Press.
Manovich, L. (2008). Introduction to infoaesthetics. Retrieved from http://goo.gl/NFLvy.
Pousman, Z., & Stasko, J. T. (2007). Data in everyday life. IEEE Transactions on Visualization and Computer Graphics , 13(6), 1145–1152.
doi:10.1109/TVCG.2007.70541
Rizzo, A., & Bacigalupo, M. (2004) Scenarios: heuristics for actions. In Proceedings of XII European Conference on Cognitive Ergonomics. York,
UK: EACE.
Salimun, C., Purchase, H. C., Simmons, D. R., & Brewster, S. (2010). The effect of aesthetically pleasing composition on visual search
performance. In Proceedings of the 6th Nordic Conference on Human-Computer Interaction: Extending Boundaries, 422–431. New York: ACM
Press.
Segel, E., & Heer, J. (2010). Narrative visualization: telling stories with data. IEEE Transactions on Visualization and Computer Graphics , 16(6),
1139–1148. doi:10.1109/TVCG.2010.179
Sellen, A. J., & Whittaker, S. (2010). Beyond total capture: A constructive critique of lifelogging. Communications of the ACM, 53(5), 70–77.
doi:10.1145/1735223.1735243
Shneiderman, B. (2002). Inventing discovery tools: Combining information visualization with data mining. Information Visualization , 1(1), 5–
12.
Stanford Visual Group. (2010). Protovis-A graphical approach to visualization. Protovis. Retrieved from http://mbostock.github.com/protovis/.
USDA. (2011). Welcome to the USDA national nutrient database for standard reference. National Agriculture Library. Retrieved from
http://ndb.nal.usda.gov/.
Vande Moere, A., & Purchase, H. (2011). On the role of design in information visualization. Information Visualization , 10(4), 356–371.
doi:10.1177/1473871611415996
Ware, C. (2000). Information visualization: Perception for design (interactive technologies) . New York: Morgan Kaufmann.
Shaojian Zhuo
Shanghai University, China
ABSTRACT
Text on the web has become a valuable source for mining and analyzing user opinions on any topic. Non-native English speakers heavily contribute to
the growing use of network media, especially in Chinese. Many sentiment analysis studies have shown that a polarity lexicon can effectively
improve classification results. Social media, where users spontaneously generate content, has become an important source for
tracking people’s opinions and sentiments. Meanwhile, mathematical models of fuzzy semantics have provided a formal explanation for the
fuzzy nature of human language processing. This paper investigates the limitations of traditional sentiment analysis approaches and proposes an
effective Chinese sentiment analysis approach based on an emotion degree lexicon. Inspired by various social cognitive theories, a basic emotion value
lexicon and a social evidence lexicon are combined to improve sentiment analysis results. By using the composite lexicon and a fuzzy
semantic model, the new sentiment analysis approach obtains significant improvement on Chinese text.
INTRODUCTION
Text sentiment analysis technology has been applied in many fields. For example, Pulse, a business intelligence system developed by Microsoft,
can extract user views from comment text by using text clustering technology. The Opinion Observer system can analyze the subjective
content of customer reviews on the Internet and extract product features and consumer opinions. Many current sentiment analysis
approaches focus mainly on emotion tendency analysis. Texts are usually classified into three categories: positive, negative, and
neutral. The technology of text sentiment analysis generally covers subjectivity classification, emotion polarity, semantic orientation,
opinion mining, opinion extraction, emotion analysis, and emotion summarization.
Text sentiment analysis is applied to find user opinions and emotion polarity in text. User reviews can help users make decisions or provide
product feedback, and sentiment analysis can also support predictions for political elections and other major events. Moreover, sentiment
analysis contributes to research in other natural language processing fields. In sentiment analysis, there are two widely used
approaches: combining rules with an emotion dictionary, and machine learning. In the rule-based approach, texts are classified by
using positive and negative emotional words. Machine learning approaches often use Naive Bayes, Maximum Entropy, or Support Vector
Machines to classify texts. Most sentiment analysis research concentrates on English texts. Recent studies have shown that non-native
English speakers heavily contribute to the growing use of network media, and Chinese text on the Internet has grown quickly in recent years, but an effective
approach to Chinese text sentiment analysis is still lacking.
Fuzzy semantics comprehension plays a crucial role in thinking, perception and problem solving. The fuzzy nature of linguistic semantics stems
from inherent semantic ambiguity, context variability, and individual perceptions. Almost all problems in natural language processing and
semantic analyses are constrained by these fundamental issues. The mathematical structure of fuzzy concepts and fuzzy semantics enables
cognitive machines and fuzzy systems to mimic the human fuzzy inference mechanisms in cognitive linguistics, fuzzy systems, cognitive
computing, and computational intelligence. Yingxu Wang demonstrated that fuzzy semantic comprehension is a deductive process, where complex
fuzzy semantics can be formally expressed by algebraic operations on elementary ones with fuzzy modifiers.
Natural language processing research, in fact, primarily depends on the availability of resources like lexicons and corpora. These resources are still
very limited for Chinese text sentiment analysis. Cambria et al. developed a Chinese common and common-sense knowledge base for sentiment
analysis by blending the largest existing taxonomy of English common knowledge. By using machine translation techniques, they could effectively
translate its content into Chinese. However, English grammar differs from Chinese, and Chinese grammar is very complex, so the approach
proposed by Cambria et al. cannot effectively classify Chinese text. Moreover, the polarity of sentiment words that are not in the lexicon cannot be calculated
and classified effectively by lexicon-based classifiers. Miao et al. proposed the EM-SO algorithm, based on an expectation maximization model, for
constructing and updating a sentiment lexicon. Experiments showed that the EM-SO algorithm and its designed components outperform SO-CAL in
computing the polarity and strength of sentiment words on review sets. Another model, the Chinese sentiment expression
model, was proposed by Liang Y. and can effectively improve the accuracy of emotion classification.
An emotion degree lexicon is composed of emotional words and corresponding metric values. Each word in the lexicon works as a basic semantic unit of
linguistics. Existing emotion degree lexicons are usually derived from manual annotation, which is hard to extend and of low reliability. Social cognition
theories provide a strong theoretical basis for constructing an emotion lexicon.
CHINESE FUZZY SEMANTIC MODEL
A fuzzy semantic model is a data model that uses the basic concepts of semantic modeling together with the imprecision of the real world at the attribute, entity, and
class levels. The semantics of an entity in natural language is usually vaguely represented by a noun or noun phrase. There are mathematical
models of fuzzy concepts and of fuzzy semantics qualified by fuzzy modifiers. A composite fuzzy semantics is defined as a 5-tuple
encompassing the following fuzzy sets:
(1)
where is a fuzzy concept; is a fuzzy set of attributes; is a fuzzy set of objects as the extension of the concept . and respectively are internal
relations and external relations between the fuzzy sets of objects and attributes .
A concrete fuzzy relation in a specific fuzzy concept will be an instantiation of the general relations tailored by a given characteristic matrix on the
Cartesian products.
A Chinese fuzzy semantic model was proposed on the basis of the fuzzy semantic model above. A composite Chinese fuzzy semantics of a fuzzy
concept is qualified by a composite fuzzy modifier, which yields a complex semantics of the concept, and the concept is qualified by a certain weight of the
composite modifier. The Chinese fuzzy semantics of a fuzzy concept is defined as a 4-tuple of fuzzy sets:
(2)
(3)
is a fuzzy set of stop-words, which may be noun phrases or nouns that function like adjectives:
(4)
is a fuzzy set of internal relations between the fuzzy sets of and :
(5)
where expresses the Cartesian product of a series of repeated cross operations between and .
is a fuzzy set of modifiers that modifies the concept as an external relations:
(6)
where obtains qualified weights when the fuzzy concept is modified by an adjective or adjective phrase. In addition, when the modifier has a
complex structure, the qualified weights are multiplied.
Chinese fuzzy semantic analysis can be formally described as a deductive process based on the models above. The Chinese fuzzy semantics of an
emotional tendency, denoted by , can be formally described as follows:
(7)
A complex Chinese fuzzy concept is modeled as a composite Chinese fuzzy concept, and modifiers are applied to qualify the fuzzy concept. With the
improvement of the modifiers, this Chinese fuzzy semantic model can improve the results of Chinese text analysis.
BUILD EMOTION DEGREE LEXICON BASED ON SOCIAL COGNITIVE THEORIES
Sentiment analysis with an emotion degree lexicon is an interdisciplinary task in that it attempts to capture people’s social ideas. In this sentiment analysis
technology, the lexicon plays a very important part and affects analysis accuracy to a large extent. Therefore, how to build an excellent lexicon is
the key problem for sentiment analysis based on an emotion degree lexicon. We explore a different direction for improving the emotion degree
lexicon by applying social cognitive theories and taking user behavior into account.
Basic Emotion Word Usage Rate
Based on social cognitive theories, users assume consistency in traits and behavior, such that observations about current word-using behavior
lead to causal attributions regarding past and future behaviors. A given word’s emotional tendency is consistently associated with a particular
sentiment across most users. For the vast majority of words, the lower the usage frequency, the closer the emotion value lies to the extremes. We assume
that the probability distribution of emotion words conforms to the standardized normal curve, with probability density:
f(δ) = (1/√(2π)) exp(−δ²/2)   (8)
where δ ranges from -8 to 8; a positive value indicates a positive word, while a negative value indicates a negative word.
The concrete distribution is shown in Figure 1. The peak of the normal curve is located at the center, at the mean position; the left and right sides are
symmetrical, and the tails never intersect the horizontal axis. From the mean at δ = 0, the curve declines gradually and uniformly toward both sides, which means that
emotion words are concentrated in a small numerical range. Any δ smaller than -8 or larger than 8 is revised to -8 or 8, respectively.
Figure 1. Emotion value distribution
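A minimal Python sketch of this step, assuming the standardized normal density mentioned above and the stated clipping to the interval [-8, 8]:

```python
import math

def emotion_density(delta: float) -> float:
    """Standard normal density assumed for the emotion value distribution."""
    return math.exp(-delta * delta / 2.0) / math.sqrt(2.0 * math.pi)

def clip_emotion(delta: float) -> float:
    """Revise any value outside [-8, 8] back to the boundary, as described in the text."""
    return max(-8.0, min(8.0, delta))

print(emotion_density(0.0), emotion_density(clip_emotion(9.3)))
```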
Assess Neologism Emotion
The social balance theory aims to analyze the interpersonal network among social agents and see how a social group evolves to a possible balance
state. Neologism balance states are shown in Figure 2, where user A shows a “negative” attitude toward the new word “W” and user B also shows “negative”. The relationship
between A and B is “similar,” and they reach a balance state. How individuals interpret the results of their performance attainments informs and alters
their environments and their self-beliefs, which in turn inform and alter their subsequent performances.
Figure 2. Social balance theory: Balance states among peoples
and neologism
SUB(x) returns the subject (theme) of user x, and OPN(x, w) returns user x’s opinion about word w. In the emotion lexicon definition, OPN(x,
w) has two possible results: P means positive and N means negative; P(w) denotes the polarity of word w. Therefore, Definition 1 can be formally
described as shown in Listing 1.
Listing 1.
For each unique A-B pair in the training data, we count the frequency of emotion tendency.
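Listing 1 itself is not reproduced here; the following Python sketch shows one way Definition 1 could be expressed, under the assumption that SUB(x) is represented as a set of keywords and OPN(x, w) as a 'P'/'N' label.

```python
def similar(subject_a: set, subject_b: set, threshold: int = 1) -> bool:
    """SUB(x) is modeled here as a keyword set; users are 'similar' if they share enough keywords."""
    return len(subject_a & subject_b) >= threshold

def is_balance_state(sub_a: set, sub_b: set, opn_a: str, opn_b: str) -> bool:
    """Balance state: similar users A and B holding the same opinion ('P' or 'N') about word w."""
    return similar(sub_a, sub_b) and opn_a == opn_b

# Illustrative usage (keywords and opinions are invented for the example).
print(is_balance_state({"hotel", "price"}, {"hotel", "service"}, "N", "N"))  # True
```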
In order to find similar user pairs, we propose an approach to identify whether users share a common theme. We use the Chinese lexical analyzer ICTCLAS for
text segmentation and traverse the training data to find user pairs with similar keywords. If two users have similar keywords and the same emotion
tendency toward the neologism, the pair is recorded as a balance state. All such balance states for the neologism are then extracted from the training data.
It is important to also consider balance states that involve more than two users. As shown in Figure 3, a circle represents a user who shows positive
emotion toward the target neologism (a triangle represents negative), and A, B, C, and D form a four-user balance state. A group’s opinion is far more credible than
two or three pairs, so we propose that a group is counted with:
(9)
Figure 3. Social balance theory: Multi-Balance states among
peoples and neologism
where N denotes the final count and n denotes the group size, ranging from 2 to ∞. Definition 2 can be formally described as shown in Listing 2.
Listing 2.
In combination with the standardized normal curve probability density, a formula is used to calculate the emotion value of the neologism:
(10)
where p equals the probability of positive balance states across all the training data, and x is the final result, which is used as the emotion value of the
neologism.
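Since Formulas (9) and (10) are not reproduced in the text, the sketch below only illustrates the described procedure: weight balance groups by their size, compute the share p of positive balance states, and map p onto the standardized-normal emotion scale. The group weight and the inverse-CDF mapping are assumptions for illustration, not the paper’s exact formulas.

```python
from statistics import NormalDist

def group_weight(n: int) -> float:
    """Placeholder for Formula (9): larger groups count more than a single pair."""
    return float(n - 1)  # assumption: a pair counts once, each extra member adds one

def neologism_emotion(balance_groups, clip: float = 8.0) -> float:
    """Sketch of the Definition 2 / Formula (10) step.
    balance_groups: iterable of (group_size, polarity) with polarity 'P' or 'N'."""
    pos = sum(group_weight(n) for n, pol in balance_groups if pol == "P")
    total = sum(group_weight(n) for n, _ in balance_groups)
    if total == 0:
        return 0.0
    p = min(max(pos / total, 1e-6), 1 - 1e-6)  # avoid infinities at exactly 0 or 1
    x = NormalDist().inv_cdf(p)                # one plausible mapping onto the normal scale
    return max(-clip, min(clip, x))            # clip to [-8, 8] as described in the text

print(neologism_emotion([(2, "P"), (4, "P"), (3, "N")]))
```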
CHINESE TEXT SENTIMENT ANALYSIS
Chinese and English differ in many respects, so Chinese text sentiment analysis is not the same as English sentiment analysis. Chinese sentiment analysis
resources are very limited, and identifying the sentiment polarity of Chinese reviews is a challenging task. Most text sentiment analysis approaches
are based on machine learning technology: the computer must abstract the text and build a mathematical model before understanding it. For the
complex Chinese language, traditional modeling based on machine learning cannot reach satisfactory results.
We propose a sentiment analysis approach for Chinese text based on the emotion degree lexicon and the Chinese fuzzy semantic model. This new solution focuses
on classifying existing comment text to discover users’ evaluations of a product. Text sentiments are divided into three
categories: positive, negative, and neutral. Each category is marked with an emotion value that indicates the emotion
degree. The main analysis process is shown in Figure 4.
Figure 4. Chinese sentiment analysis technological process
The first step is text preprocessing. Most review texts contain a lot of punctuation and emoticons, and most of these emoticons follow
words. Due to the complexity of the context and the irregularity of emoticons, they are not suitable to be treated as part of the text. Overlapping of multiple
symbols also occurs in the text. Since users have different writing habits, the text needs to be normalized in a unified way. For
Chinese text, a segmentation method based on a Stop-List is proposed. The Stop-List is a word set that collects many neutral words, some of
which are non-neutral emotion words. These words are found in the text and removed so that they do not affect the sentiment analysis.
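A minimal Python sketch of this preprocessing step; the stop list, the emoticon pattern, and the character-level fallback segmentation are placeholders (the actual system uses a much larger Stop-List and ICTCLAS segmentation).

```python
import re

# Illustrative stop list; the actual Stop-List described in the text is much larger.
STOP_LIST = {"的", "了", "和", "是"}

EMOTICON_PATTERN = re.compile(r"[:;]['\-]?[()DPp]|\[[^\]]{1,4}\]")  # rough emoticon heuristic
REPEAT_PUNCT = re.compile(r"([!?。！？，,])\1+")                      # collapse repeated symbols

def preprocess(text: str, tokens=None):
    """Normalize a review: strip emoticons, collapse repeated punctuation,
    then drop stop-list tokens from an already-segmented token list."""
    text = EMOTICON_PATTERN.sub("", text)
    text = REPEAT_PUNCT.sub(r"\1", text)
    if tokens is None:
        tokens = list(text)  # placeholder; real segmentation would use ICTCLAS or similar
    return [t for t in tokens if t not in STOP_LIST and not t.isspace()]

print(preprocess("房间很好!!! :)"))
```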
Syntactic Analysis
On the basis of the Chinese fuzzy semantic model, sentences need to be partitioned and structured as fuzzy concept instances. Syntactic analysis
mainly aims to understand the structure of a sentence and to use the most appropriate method to express that structure.
First of all, identifying the sentence type is necessary before syntactic analysis. Taking special keywords and the end of the sentence as conditions, we
examine the sentence word by word and confirm its category. There are many kinds of Chinese sentences, but only two kinds need
special treatment: question sentences and exclamatory sentences. If the text matches one of these sentence patterns, it is weighted with a specified
factor. As shown in Table 1, all Chinese sentence types have specific weights. Sentence structure analysis, an important part of syntactic
analysis, identifies sentence structure through the Chinese grammar patterns “adverb-adjective structure” and “adjective-adjective structure,” formally
described as follows:
(11)
Table 1. Chinese sentence type weight
When the word matching has finished, there will be many candidate results, and an optimal match may need to be chosen before the emotion calculation. Among
all of these matches, the most complex sentence structure is chosen as the final structure. The complexity of the sentence is assessed by:
(12)
where L(x) denotes the length of x and C(y) denotes the count of y.
Sentiment analysis accuracy depends highly on the performance of the syntactic analysis; choosing the match with the highest CPT yields the most accurate result.
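Because Formula (12) is not reproduced here, the sketch below scores a candidate match simply by total matched length plus the number of matched structures, which is one plausible reading of L(x) and C(y); it only illustrates the idea of keeping the most complex match.

```python
def complexity_score(match):
    """Assumed stand-in for Formula (12): total matched word length plus the
    number of matched adverb/adjective structures in the candidate parse."""
    total_length = sum(len(word) for word, _ in match)
    structure_count = len(match)
    return total_length + structure_count

def best_match(candidates):
    """Among all candidate structure matches, keep the most complex one (highest score)."""
    return max(candidates, key=complexity_score)

# Each candidate is a list of (word, role) pairs produced by the structure matching step.
candidates = [
    [("非常", "adverb"), ("好", "adjective")],
    [("好", "adjective")],
]
print(best_match(candidates))
```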
Emotion Words Match and Calculating
Emotional words and their corresponding emotion values must be extracted before calculating the emotion value of the text. For Chinese text,
the most appropriate method is to match the longest word found in the emotion lexicon. Each adverb and adjective is looked up in the emotion degree lexicon; if
the lookup succeeds, the word is replaced with the corresponding emotion value, and all words that are not found are replaced with zero.
When the emotion word matching is finished, the results above are put together:
(13)
where STW denotes the sentence type weight. Unlike other emotion polarity analysis methods, the approach proposed in this paper provides a
numerical value indicating the degree of emotion in the text, which is more meaningful than simple polarity analysis.
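A small Python sketch of the longest-match lookup and the combination with the sentence type weight; the toy lexicon is invented, and the weighted sum stands in for Formula (13), which is not reproduced in the text.

```python
# Illustrative emotion degree lexicon (word -> emotion value); real lexicons are far larger.
EMOTION_LEXICON = {"好": 3.0, "非常好": 5.0, "差": -3.0, "不": -1.0}
MAX_WORD_LEN = max(len(w) for w in EMOTION_LEXICON)

def longest_match_values(text: str):
    """Scan the text and, at each position, match the longest lexicon entry;
    unmatched characters contribute zero, as described in the text."""
    values, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + length]
            if word in EMOTION_LEXICON:
                values.append(EMOTION_LEXICON[word])
                i += length
                break
        else:
            values.append(0.0)
            i += 1
    return values

def sentence_emotion(text: str, sentence_type_weight: float = 1.0) -> float:
    """Combine matched emotion values with the sentence type weight (STW).
    The exact combination of Formula (13) is not reproduced; a weighted sum is assumed."""
    return sentence_type_weight * sum(longest_match_values(text))

print(sentence_emotion("非常好", sentence_type_weight=1.5))
```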
Emotion Value Correction
The length of the text has a great influence on the emotion value, so the emotion value of the full text needs to be standardized by a factor:
(14)
We estimate the emotion tendency from the emotion value; typical tendencies are positive, neutral, and negative. In general, an emotion value
greater than 2 is classified as positive, a value smaller than -2 as negative, and the rest as neutral.
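A sketch of the correction and classification step; the length factor is an assumption (Formula (14) is not reproduced), while the ±2 thresholds are taken from the text.

```python
import math

def correct_emotion_value(raw_value: float, text_length: int) -> float:
    """Assumed length correction: normalize by a factor that grows slowly with text length.
    This is a placeholder for Formula (14), which the text does not reproduce."""
    factor = math.log(text_length + 1) or 1.0  # guard against zero-length texts
    return raw_value / factor

def emotion_tendency(value: float) -> str:
    """Thresholds taken from the text: > 2 positive, < -2 negative, otherwise neutral."""
    if value > 2:
        return "positive"
    if value < -2:
        return "negative"
    return "neutral"

print(emotion_tendency(correct_emotion_value(9.0, 20)))
```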
EXPERIMENTS
In the previous sections, we showed how Chinese text sentiment analysis based on the emotion degree lexicon and fuzzy semantic model works. The
proposed approach aims to improve polarity classification for informal, creative, and dynamically changing Chinese text data.
The experimental data was collected from Ctrip online by Songbo Tan. The corpus sizes are 2000, 4000, and 6000 reviews, and the corpus is balanced:
half of the reviews are positive and the other half negative. Chinese customer reviews are short and colloquial. Users often do not think deeply when writing, so the reviews have a
strongly colloquial style, and users often express their thoughts directly. Their reviews contain many network terms and acronyms.
Users may also express their emotions and attitudes in an indirect way.
The National Taiwan University Sentiment Dictionary (NTUSD) is an emotion lexicon released by National Taiwan University. It has two versions,
Traditional Chinese and Simplified Chinese, each composed of 2810 positive words and 8276 negative words.
The test texts were manually labeled, so we performed sentiment analysis with both the proposed approach and a traditional machine learning
classification approach. The final results are shown in Table 2. Sentiment analysis of neologisms and their emotion values achieves satisfactory accuracy. The
proposed approach can more accurately analyze the real emotional tendency of the text. Unlike traditional polarity-only text sentiment
analysis, we provide an emotion value to describe the emotional degree. Through the emotion value, a machine can extract valuable information and
predict the development of current affairs.
Table 2. Sentiment analysis results of data set
The proposed approach based on the emotion degree lexicon and fuzzy semantic model has effectively enhanced the performance of text sentiment
analysis, but some challenges remain. In Chinese, there is a lot of punctuation, especially in network text, and most punctuation is
ignored during preprocessing. Users also often use emoticons to express their emotions, but punctuation habits vary widely, which makes it
difficult to identify all emoticons accurately. Emoticons do have a certain effect on sentiment analysis, and handling them is a challenge when
building a lexicon based on social cognitive theories.
Non-standard Chinese expressions are also common in network terms. In today’s global communication environment, people often use
complex sentences that mix two or more languages, such as English, Traditional Chinese, and Simplified Chinese, and some
people use multiple consecutive punctuation symbols to express emotion degree. These big challenges in lexicon building form a very
promising research field.
CONCLUSION
In this paper, a Chinese fuzzy semantic model and lexicon-building rules based on social cognitive theories were proposed to improve existing
classification approaches. With the help of the fuzzy semantic model, we proposed a Chinese fuzzy semantic model and gave a formal definition to explain
how it works. It has been affirmed that the emotion tendency and users’ opinions need to be clearly defined before the emotion value calculation is
performed. We have investigated the limitations of approaches based on traditional machine learning technology and proposed an
efficient lexicon-building approach based on social cognitive theories. The Chinese text analysis techniques include text preprocessing, syntactic
analysis, emotion word matching and calculation, and emotion value correction. This paper has also shown how to use social balance theory to
improve the results of lexicon building. The experiments evaluated our new approach using the NTUSD lexicon and Ctrip review test sets, and
showed that this simple approach leads to good results.
This work was previously published in the International Journal of Software Science and Computational Intelligence (IJSSCI), 6(4); edited by
Yingxu Wang, pages 2032, copyright year 2014 by IGI Publishing (an imprint of IGI Global).
ACKNOWLEDGMENT
Project 61303094 supported by National Natural Science Foundation of China, by Specialized Research Fund for the Doctoral Program of Higher
Education (20123108120027), by the Science and Technology Commission of Shanghai Municipality (14511107100), by Innovation Program of
Shanghai Municipal Education Commission (14YZ024) and by Shanghai Leading Academic Discipline Project (No. J50103).
REFERENCES
Abel, J., & Teahan, W. (2005). Universal text preprocessing for data compression. IEEE Transactions on Computers, 54(5), 497–507.
Cambria, E., Hussain, A., Durrani, T., & Zhang, J. (2012). Towards a chinese common and common sense knowledge base for sentiment analysis .
In Advanced Research in Applied Artificial Intelligence (pp. 437–446). Springer Berlin Heidelberg. doi:10.1007/978-3-642-31087-4_46
Feldman, R. (2013). Techniques and applications for sentiment analysis. Communications of the ACM , 56(4), 82–89.
doi:10.1145/2436256.2436274
Gamon, M., Aue, A., Corston-Oliver, S., & Ringger, E. (2005). Pulse: Mining customer opinions from free text. In Advances in Intelligent Data
Analysis VI (pp. 121-132). Springer Berlin Heidelberg.
Guerra P. C. Meira W. Jr Cardie C. (2014, February). Sentiment analysis on evolving social streams: How self-report imbalances can help. In
Proceedings of the 7th ACM international conference on Web search and data mining (pp. 443-452). ACM.10.1145/2556195.2556261
Hamilton, D. L., & Sherman, S. J. (1996). Perceiving persons and groups. Psychological Review , 103(2), 336–355. doi:10.1037/0033-
295X.103.2.336
Hao, Z., Cheng, J., Cai, R., Wen, W., & Wang, L. (2013). Chinese Sentiment Classification Based on the Sentiment Drop Point. In Emerging
Intelligent Computing Technology and Applications (pp. 55-60). Springer Berlin Heidelberg. doi:10.1007/978-3-642-39678-6_10
Heider, F. (1946). Attitudes and cognitive organization. The Journal of Psychology , 21(1), 107–112. doi:10.1080/00223980.1946.9917275
Khanafiah, D., & Situngkir, H. (2004). Social balance theory: revisiting Heider’s balance theory for many agents.
Li, R., Shi, S., Huang, H., Su, C., & Wang, T. (2014). A Method of Polarity Computation of Chinese Sentiment Words Based on Gaussian
Distribution . In Computational Linguistics and Intelligent Text Processing (pp. 53–61). Springer Berlin Heidelberg.
Liu B. Hu M. Cheng J. (2005, May). Opinion observer: analyzing and comparing opinions on the web. In Proceedings of the 14th international
conference on World Wide Web (pp. 342-351). ACM.10.1145/1060745.1060797
Liu, L., Lei, M., & Wang, H. (2013). Combining domain-specific sentiment lexicon with hownet for chinese sentiment analysis.Journal of
Computers , 8(4), 878–883. doi:10.4304/jcp.8.4.878-883
Miao, Y., Su, J., Liu, S., & Wu, K. (2013). SO-CAL Based Method for Chinese Sentiment Analysis . In Informatics and Management Science
IV (pp. 345–351). Springer London. doi:10.1007/978-1-4471-4793-0_42
Pang B. Lee L. (2004, July). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In
Proceedings of the 42nd annual meeting on Association for Computational Linguistics (p. 271). Association for Computational
Linguistics.10.3115/1218955.1218990
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2), 1-135.
Ruspini, E. H. (2013). Possibility as similarity: The semantics of fuzzy logic. arXiv preprint arXiv:1304.1115.
Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-based methods for sentiment analysis. Computational
Linguistics , 37(2), 267–307. doi:10.1162/COLI_a_00049
Tawari, A., & Trivedi, M. M. (2010). Speech emotion analysis: Exploring the role of context. IEEE Transactions on Multimedia, 12(6), 502–509.
Wang, H., Liu, L., Song, W., & Lu, J. (2014). Feature-based Sentiment Analysis Approach for Product Reviews. Journal of Software , 9(2), 274–
279. doi:10.4304/jsw.9.2.274-279
Wang Y. (2014). Fuzzy Semantic Models of Fuzzy Concepts in Fuzzy Systems. In Proceedings of 2014 International Conference on Neural
Networks and Fuzzy Systems (ICNN-FS’14) (pp. 19-24).
Wu, H. H., Tsai, A. C. R., Tsai, R. T. H., & Hsu, J. Y. J. (2013). Building a Graded Chinese Sentiment Dictionary Based on Commonsense
Knowledge for Sentiment Analysis of Song Lyrics.J. Inf. Sci. Eng. , 29(4), 647–662.
Xianghua, F., Guo, L., Yanyan, G., & Zhiqiang, W. (2013). Multi-aspect sentiment analysis for Chinese online social reviews based on topic
modeling and HowNet lexicon. Knowledge-Based Systems, 37, 186–195. doi:10.1016/j.knosys.2012.08.003
Yang, H. L., & Chao, A. F. (2014). Sentiment analysis for Chinese reviews of movies in multi-genre based on morpheme-based features and
collocations. Information Systems Frontiers , 1–18.
Yang, L., Lin, H., & Lin, Y. (2012). Sentiment analysis based on Chinese thinking modes . In Natural Language Processing and Chinese
Computing (pp. 46–57). Springer Berlin Heidelberg. doi:10.1007/978-3-642-34456-5_5
CHAPTER 49
A Distributed and Scalable Solution for Applying Semantic Techniques to Big Data
Alba Amato
Second University of Naples, Aversa, Italy
Salvatore Venticinque
Second University of Naples, Aversa, Italy
Beniamino Di Martino
Second University of Naples, Aversa, Italy
ABSTRACT
The digital revolution changes the way culture and places can be experienced. It allows users to interact with the environment, creating an immense
availability of data, which can be used to better understand the behavior of visitors, as well as to learn which aspects of the visit
create excitement or disappointment. In this context, Big Data becomes immensely important, making it possible to turn this amount of data into
information, knowledge, and, ultimately, wisdom. This paper aims at modeling and designing a scalable solution that integrates semantic
techniques with Cloud and Big Data technologies to deliver context-aware services in the application domain of cultural heritage. The authors
started from a baseline framework that was not originally conceived to scale when huge workloads, related to big data, must be processed. They
provide an original formulation of the problem and an original software architecture that fulfills both functional and non-functional
requirements. The authors present the technological stack and the implementation of a proof of concept.
INTRODUCTION
The digital revolution changes the way culture and places can be experienced. It allows users to interact with the environment, creating an immense
availability of data, which can be used to better understand the behavior of visitors, as well as to learn which aspects of the visit
create excitement or disappointment. Supporting the visit of an archaeological site with handheld devices allows a lot of data to be collected, for
example about the movements of those who visited the exhibition, which artifacts they focused on, which ones they avoided seeing, the
searches performed, the feedback submitted, etc. Additional information can be collected from various sources such as social networks, data
warehouses, web applications, networked machines, virtual machines, sensors over the network, etc. It is necessary to think about how and where
to process these data, and a scalable, distributed storage system and a set of flexible data models are needed to allow an effective utilization of the
available technologies and computational resources.
The need to store, manage, and process ever-increasing amounts of data is increasingly felt. The effort spent in redesigning and
optimizing data storage for analysis requests can still result in poor performance: current databases and management tools are inadequate to
handle the complexity, scale, dynamism, heterogeneity, and growth of such systems. Big Data technologies can address the problems related to the
collection of data streams of higher velocity and higher variety.
Big Data are an important and valuable resource for innovation, competition and productivity if properly managed. Gartner defines Big Data as
”high volume, velocity and/or variety information assets that demand cost-effective, innovative forms of information processing that enable
enhanced insight, decision making, and process automation” (Gartner, 2012). These data sets are enormous; their size is beyond the ability of
typical database systems to capture, integrate, manage, and analyze them. But huge size is not the only property of Big Data: only if the
information has the characteristics of Volume, Velocity, and/or Variety can we talk about Big Data (P. Zikopoulos, and C. Eaton, 2011). Volume
refers to the fact that we are dealing with ever-growing data expanding beyond terabytes into petabytes, and even exabytes (1 million terabytes).
Variety refers to the fact that Big Data is characterized by data that often come from heterogeneous sources such as machines, sensors, and
unrefined ones, making the management much more complex. The third characteristic is velocity, which, according to Gartner
(Gartner, 2011), “means both how fast data is being produced and how fast the data must be processed to meet demand”; in a very short
time the data can become obsolete. IBM (M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, and P. Tufano, 2012) proposes the inclusion of
veracity as the fourth Big Data attribute, to emphasize the importance of addressing and managing the uncertainty of some types of data. With the
amount of data being produced every day, there is the need to unlock the unnamed fifth V of big data: VALUE. According to analysts with
Forrester (Forrester, 2014), most organizations today use less than 5% of the data that is available to them. As our capability to collect data has
increased, our ability to store, sort, and analyze it has diminished. In this context, Big Data becomes immensely important, making it possible to
turn this amount of data into information, knowledge, and, ultimately, wisdom. The requirements of many applications are changing and
require the adoption of these technologies. NoSQL databases ensure better performance than RDBMS systems in various use cases, most notably
those involving big data. But the choice of the one that best fits the application requirements is a challenge for programmers who decide to
develop a scalable application: there are many differences among the available products and also in their levels of maturity. From a
solution point of view, a clear analysis of the application context is necessary. In particular, we focus on technologies that operate in pervasive
environments, which can benefit from the huge information available but need to be rethought to extract knowledge and improve context
awareness in order to customize the services.
Even if a storage solution is conceived ad hoc to provide good performance and to scale, exploiting the elasticity of new computing paradigms like
the Cloud, the design and programming of the processing functions must follow an effective methodology. In particular, the utilization of semantic
techniques for the management and processing of Big Data requires the design and development of advanced technological solutions that guarantee
performance and scalability. The exploitation of the value that can be obtained by processing big data provides opportunities to integrate
advanced functionalities into existing services. However, the re-engineering of an existing framework is not straightforward. This paper aims at
modeling and designing a scalable solution that integrates semantic techniques with Cloud and Big Data technologies to deliver context-aware
services in the application domain of cultural heritage. We started from a baseline framework that was not originally conceived to scale
when huge workloads related to big data must be processed. The utilization of a graph database can improve performance and operations on the
storage, but a solution is needed that allows semantic techniques to be applied to Big Data and that can scale with the velocity, heterogeneity, and
size of dynamically changing workloads.
In this paper we provide an original formulation of the problem and an original software architecture that fulfills both functional and non-functional
requirements. We present the technological stack and the implementation of a proof-of-concept prototype referring to a real use case in the
field of cultural heritage.
RELATED WORK
In the last years, context awareness has widely demonstrated its crucial role to achieve optimized management of resources, systems, and
services in many application domains, from mobile and pervasive computing to dynamically adaptive service provisioning. The pervasiveness of
devices provides applications and services with the possibility of using peripherals and sensors as their own extensions to collect information about the user and
the environment, but also to improve service delivery. Furthermore, the explosion of devices that have automated and perhaps improved the lives
of all of us has generated a huge mass of information that will continue to grow exponentially. Context awareness in pervasive environments
represents an interesting application field for big data technologies. In fact, according to IBM (P. Zikopoulos, and C. Eaton 2011): Big Data
solutions are ideal for analyzing not only raw structured data, but semi-structured and unstructured data from a wide variety of sources. Big Data
solutions are ideal when all, or most, of the data needs to be analyzed versus a sample of the data; or a sampling of data is not nearly as effective
as a larger set of data from which to derive analysis. Big Data solutions are ideal for iterative and exploratory analysis when measures on data are
not predetermined.
Big data technologies can address the problems related to the collection of data streams of higher velocity and higher variety. They allow for
building an infrastructure that delivers low, predictable latency in both capturing data and in executing short, simple queries; that is able to
handle very high transaction volumes, often in a distributed environment; and that supports flexible, dynamic data structures (Oracle, 2013). With
such a high volume of information, the possibility of organizing data at its original storage location becomes relevant, saving both time and money
by not moving large volumes of data around. The infrastructures required for organizing big data are able to process and manipulate data in the
original storage location. This capability provides very high throughput (often in batch), which is necessary to deal with large data processing
steps and to handle a large variety of data formats, from unstructured to structured (Oracle, 2013). The analysis may also be done in a distributed
environment, where some data stays where it was originally stored and is transparently accessed. The required analytics, such as statistical
analysis and data mining, must operate on a wider variety of data types stored in diverse systems, scale to extreme data volumes, deliver faster response times
driven by changes in behavior, and automate decisions based on analytical models. Most importantly, the infrastructure must be able to integrate
analysis on the combination of big data and traditional enterprise data. New insight comes not just from analyzing new data, but from analyzing
it within the context of the old to provide new perspectives on old problems (Oracle, 2013). Context-aware Big Data solutions could focus only on
relevant information by keeping high probability of hit for all application-relevant events, with manifest advantages in terms of cost reduction
and complexity decrease (Nessi, 2012).
THE M.A.R.A. FRAMEWORK
MARA is a multidisciplinary project, with both cultural and technological aims, whose purpose is the design and implementation of a context-
aware platform to assist users in visiting archaeological sites (A. Amato, B. Di Martino, and S. Venticinque, 2011). The key component of this
context aware recommendation system is data, often extremely diverse in format, frequency, amount and provenance. This heterogeneous data
will serve as the basis for recommendations obtaining using algorithms to find similarities and affinities and to build suggestions for specific user
profiles. By modeling an archaeological site as a pervasive environment we are able to improve its exploitation by the visitors. In a pervasive
computing environment, the intelligence of context-aware systems will be limited if the systems are unable to represent and reason about
context.
Moreover, in domains like tourism, the notion of preferences varies among users and strongly depends on factors like users’ personalities,
parameters related to the context, like location, time, season, and weather, and other elements like user feedback, so it is necessary to provide users
with many other kinds of personalized experiences, based on data of many kinds. The growth of visitors and of interactive archaeological sites in
recent years poses some key challenges for recommender systems. It is necessary to introduce recommender system technologies that can
quickly produce high-quality recommendations, even for very large-scale problems, so that users can benefit from context awareness in service
exploitation and mobile services can become really useful and profitable. Besides, it is necessary to address the problem of the variety of data coming
from sensors, RFID, devices, annotation tools, GIS (geospatial, GPS), and the web. The problem is therefore both to capture data quickly and to store it
quickly in structured form. The structure of the data then allows a pattern-based strategy to be identified for the extraction of consistent, comparable,
and up-to-date information. These techniques can be quite intensive in terms of computational requirements, so the needed resources can
exploit a distributed infrastructure. Part of the computation is performed locally on the smartphone, and expensive tasks are off-loaded
remotely. In addition, the limited energy and data storage of the user’s device can affect the agent’s capabilities; remote services allow the agent to
move more complex reasoning, over a wider knowledge base, to the remote side. In order to augment the user’s knowledge and capability to interact with the
environment, services have to choose, according to their context awareness: i) what content and applications to deliver; ii) when to
present the content; and iii) how this should be done.
In Figure 1 the architectural solution of the MARA framework is shown. The framework is composed of different tools and applications (A.
Amato, B. Di Martino, and S. Venticinque, 2012a). Tools for experts in the domain of the Cultural Heritage are used to augment the
archaeological site with a set of multimedia contents. They include a map editor, a semantic annotator and a content manager. A set of context
aware services are available for intelligent multimedia discovery and delivery. A tourist guide supports the user in visiting an archaeological site,
detects and suggests points of interest, provides context information to remote services, allows for their utilization and plays multimedia contents. On the left side of Figure 1, the user's device hosts a light agent that perceives information from the field through pervasive sensors. The agent executes autonomously and proactively to support the user's activity within the environment in which he is moving. It discovers surrounding objects, uses them to update the representation of the user's knowledge, and reacts using the local knowledge to organize and propose the available contents and facilities through an interactive interface. When connectivity is available, the device can access remote services, which draw on a wider knowledge base and have greater reasoning capabilities to look for additional contents and applications.
Experts of the application domain define the ontology for the specific case study described in (G. Renda, S. Gigli, A. Amato, S. Venticinque, B. D.
Martino, and F. R. Cappa, 2012). They use or design a map to represent the environment. They add POIs to the map to geo-refer multimedia
contents and can link them to a concept of the ontology.
Figure 1. The MARA framework
Furthermore they select relevant contents and annotate them using concepts and individuals of the ontology. Remote applications implement
context aware services. They use personal devices to collect perceptions and for content delivery. An ontology implements the representation of
the global knowledge that is necessary to share a common dictionary and to describe the relationships among the entities/objects, which are part
of the model. Concepts of the ontology are used on the client side to describe a representation of reality as it is perceived by the user. On the back-end, the ontology is used to annotate digital resources such as points of interest, contents and applications, and to support reasoning. User behaviors, information from pervasive devices or from other users, device properties and external events are heterogeneous data perceived by the device and used to build a dynamically changing representation of the user's knowledge about the reality within which he is moving. The applications are knowledge driven. The user's knowledge can be used by the application running on the device to adapt its logic locally, and is updated remotely to improve the awareness of services on the server side.
Semantic techniques are used for intelligent content and application discovery and delivery. An ontology has been designed to describe the sites
of interest and to annotate the related media. A general part includes the concepts common to all the classes of applications that can be modeled according to the proposed approach. Among others, the Time class and its properties (CurrentTime, AvailableTime, ElapsedTime, ExploitationTime) allow the visit to be organized and assisted by taking time information into account and handling time constraints. The Position class and its properties allow the user, and the objects around him, to be localized. An application-specific part of the ontology includes the concepts that belong to the domain of cultural heritage and additional classes and individuals specific to the case studies introduced in the previous section. The ontology is also used for annotating the multimedia contents. To annotate texts, images and any other kind of content we chose the AktiveMedia tool.
The output produced by the annotator is an RDF file with concepts and properties of the AktiveMedia ontology and of the domain ontology. The
Fedora repository is used to store digital objects and supports their retrieval. In the Fedora repository, a digital object is composed of a set of files: object metadata, used by the client application to understand how to deliver the content; binary streams, which are images, video, text or any other kind of raw information to be delivered; RDF annotations that describe the semantics of the object according to the ontology; and dissemination filters that can be used to adapt the object to the target client. We loaded the AktiveMedia ontology and the domain ontology into the Fedora repository in order to exploit its embedded SPARQL engine, which is used to select the optimal set of individuals (i.e. contents). Multimedia contents are automatically stored into the repository after the annotation phase. The RDF output is automatically processed using an XSL transformation to make it compliant with the model used by the Fedora repository. The semantic discovery service of MARA returns a set of digital objects related to POIs in the pervasive environment. Each content is annotated with concepts from the
ontology and can be discovered by a SPARQL query to the content repository. The result of the query is a set of N instances of digital objects
whose relevance to the user context is calculated as described in (A. Amato, B. Di Martino, and S. Venticinque, 2012).
PROBLEM STATEMENT
In (A. Amato, S. Venticinque, 2013) we discussed the limitations of the current implementation of the MARA framework and the opportunities offered by Big Data support. During the analysis, the major constraints were identified and guidelines for re-engineering the original solution were drawn. The performance figures discussed in (A. Amato, B. Di Martino, M. Scialdone, and S. Venticinque, 2013) demonstrated the feasibility of the proposed solution implemented with a classical RDBMS; however, a number of limiting assumptions were made. First of all, the amount of data is limited to a proprietary knowledge base covering a small number of archaeological sites. If we aim at handling the entire national, or even worldwide, cultural heritage, that volume of data would not be supported. Besides, the coverage of an increasing number of sites, possibly larger ones, will increase the amount of geographical information and the number of connected mobile users. Data continuously received from thousands of devices scattered in the environment, handling queries and providing perceptions, will increase the volume, velocity and variety of the information. Finally, user feedback could be exploited to improve the expertise of the system, building a social network of visitors and enriching the knowledge base with semantic annotations inferred from visitors who become authors and editors themselves. The new vision of M.A.R.A. could not be implemented without considering Big Data requirements and solutions, which are ideal (P. Zikopoulos, and C. Eaton 2011) for dealing with raw data from a wide variety of sources that must be analyzed in toto. At this point the challenge is to understand the best choice for re-designing and developing the framework to satisfy the new requirements. Understanding which NoSQL data models and technologies would be more effective among the available alternatives needs further insight. In particular, we identified the opportunity to increase the number of users and to allow them to become editors of the media content delivered by MARA services, providing feedback and increasing the expertise and context awareness of the system. We identified a document database as the best solution in terms of scalability and flexibility for storing media content, while a graph database has the features needed to best support data management and data processing at the storage level for our application. Regarding the drawbacks of such a choice, we have to consider that NoSQL distributed data stores usually do not support ACID transactions, and that they lack native support for the SPARQL interface used for reasoning and semantic retrieval of relevant information, which is currently employed by the MARA application layer. In fact, MARA uses ontologies defined in RDFS/OWL, which can be naturally represented as a labeled, directed multigraph. The adoption of a graph database does not provide scalability and performance for free; we outline two main issues which affect the performance and scalability of a naive solution.
First of all, the partitioning of a graph database is an open issue (A. Amato, B. Di Martino, and S. Venticinque, 2014), above all if we do not want to define a database schema a priori and cannot foresee how the graph will grow. This problem has been addressed in (A. Amato, B. Di Martino, and S. Venticinque, 2014), where we proposed exploiting Cloud elasticity, and in particular the MapReduce programming paradigm, for a periodic partitioning of the graph that balances the data and the queries over a distributed system. The second issue, which we focus on here, is the design of a software architecture that can automatically scale over a distributed infrastructure through the integrated utilization of Big Data and Cloud technologies.
It is necessary to consider that the system will be used by two kinds of users, who produce different workloads: editors and final users. Editors, who update G1, represent a reduced number of users who interact with low frequency, but each update they provide can affect many users. In fact, when a new media content is added or a new semantic annotation is provided, it needs to be matched against those profiles in G2 that could be affected by the changes in the graph. Their interactions are synchronous.
The number of final users is very high and can grow quickly. The number of events they generate is also very high. However, we must distinguish between the synchronous requests triggered by interactive usage of the applications and the message events generated continuously by services running in the background. The latter detect and notify the user's position, nearby objects, the user's feedback and behavior, and all the information relevant to improving context awareness.
PROBLEM FORMULATION
The knowledge base (KB) of the system is represented as a graph G. A graph G = (V, E) consists of two finite sets V and E. The elements of V are called the vertexes and the elements of E the edges of G. Each edge is a pair of vertexes. The ontology describing our application domain is a sub-graph O = (Vo, Eo) ⊂ G. The vertexes of O are concepts and individuals; the edges between them are semantic relationships. The sub-graph G1 = (V1 ∪ Vo, E1) ⊂ G represents all the media contents of our KB: its edges are the existing relationships between the contents and the concepts of the application domain. An expert of the application domain can add new concepts and new relationships in O ⊂ G. An editor, annotating media contents, adds new contents vi ∈ V1 and new relationships ej ∈ E1 between contents and concepts.
The sub-graph G2 = (V2, E2) ⊂ G represents the user profiles: its edges are the relationships between the user profiles and the concepts of the application domain. Information coming from the users can modify and add vertexes, and edges connecting them to the application domain. Hence G1 ∩ G2 = O.
We define the following operations on the graph, which can occur after an annotation, an ontology update or a user's profile update:
• Adding one or more vertexes vk and the edges (vk-1, vk), (vk, vk+1), so that the graph becomes G = (V, E) with V = {v1, v2, …, vi, vk, …, vn} and E = {(v1, v2), …, (vk-1, vk), (vk, vk+1), …, (vn-1, vn)};
• Deleting one or more vertexes vi and the edges (vi-1, vi), (vi, vi+1), so that the graph becomes G = (V, E) with V = {v1, v2, …, vn} and E = {(v1, v2), …, (vn-1, vn)}.
The recommendations after an ontology/annotation update consist of those paths from a content to any profile which have been evaluated as relevant to the users. Given a path, that is, a sequence of vertexes
v1 → v2 → … → vn
such that
(vi-1, vi) or (vi, vi+1) ∈ E,
the system will recommend a content vj ∈ V1 to a user whose profile is vi ∈ V2 if m(i, j) > φ, where m(i, j) is a measure of the relevance of the content vj to the user i, φ is a threshold, and m : V2 × V1 → R.
In order to search for all the recommendations, we need to search for all the paths that connect any user's profile to any content through the ontology O ⊂ G.
An example of a metric and the related experimental results has been presented in (A. Amato, B. Di Martino, M. Scialdone and S. Venticinque, 2014). The operations involved can be summarized as follows:
• Graph update: U = {u : (V, E) → (V, E)};
• Path search: S = {s : V1 × V2 → V × E^n};
• Path evaluation: M = {m : V × E^n → R}.
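To make the formulation concrete, the following minimal Java sketch models the three operation families and the relevance threshold φ. All type and method names (Graph, Path, GraphUpdate, PathSearch, PathEvaluation, Recommender) are illustrative assumptions and are not taken from the MARA code base.

import java.util.List;

// Placeholder types standing in for the graph structures of the formulation;
// they are illustrative only and not taken from the MARA code base.
final class Graph { /* vertexes V and edges E */ }
final class Vertex { /* a content (V1), a concept (Vo) or a profile (V2) */ }
final class Path { /* a sequence v1 -> v2 -> ... -> vn through the ontology O */ }

interface GraphUpdate {                      // u in U: (V,E) -> (V,E)
    Graph apply(Graph g);
}

interface PathSearch {                       // s in S: V1 x V2 -> V x E^n
    List<Path> findPaths(Vertex content, Vertex profile, Graph g);
}

interface PathEvaluation {                   // m in M: V x E^n -> R
    double score(Path path);
}

final class Recommender {
    private final double threshold;          // the threshold phi of the formulation

    Recommender(double threshold) { this.threshold = threshold; }

    // A content v_j is recommended to the user with profile v_i only if m(i,j) > phi.
    boolean shouldRecommend(Path path, PathEvaluation m) {
        return m.score(path) > threshold;
    }
}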
Figure 2. Front-end architecture
SOFTWARE ARCHITECTURE
Different technical solutions have been designed here to distribute and process the dynamically changing and heterogeneous workload over a cloud infrastructure (A. Amato, B. Di Martino, and S. Venticinque, 2014). The user front-end is shown in Figure 2. It is composed of activities and services. Activities provide controls and views to invoke synchronous services and to play the returned contents. There are many solutions that allow for the scalability of synchronous services: for example, many legacy applications allow the deployment of stateless services which share the session using memcache technologies or scalable NoSQL solutions such as Bigtable-like or key-value stores. For this reason we designed a persistent cache that is filled asynchronously at the back-end. Services run in the background and use two different queues. The output queue is used to continuously send information perceived by on-board peripherals and feedback about the user's behavior. The input queue is used to receive recommendations asynchronously. We observe that the graph is accessed only by asynchronous tasks. A lower priority level is assigned to services running in the background so as not to affect the interactivity of the application.
In Figure 3 the back-end architecture is shown. The orange queue collects events from different users and is used to distribute the processing of incoming requests. A stateless task, running on any computing resource, gets a message from the queue when it is idle. The event is processed by the task, which updates the profile Pi ⊂ G2 of the corresponding user.
A process A2 implements a sequence of functions u ∈ U and s ∈ S to update the graph and to search for the new paths generated by that update. Each new or updated path that could be relevant to the user is sent as a new message to the green queue. The consumers A3 of the paths queue are in charge of evaluating whether these paths are relevant recommendations, by implementing a function m ∈ M, and of deciding whether the recommendations need to be notified. Each new recommendation is notified asynchronously to the corresponding user through the Async-out queue. Annotations made by the editors are published into the blue queue. The A1 processes implement sequences of functions u ∈ U and s ∈ S themselves. Even if annotations occur with a lower frequency, the path search can still be computationally expensive. Moreover, in the future we could allow users to annotate contents themselves, or annotate automatically on their behalf. These processes are again asynchronous and stateless, and they can be deployed dynamically over an elastic infrastructure to be reconfigured according to the frequency of G1 updates. Moreover, each change of G1 can generate a number of paths to many user profiles. All these paths are published in the green queue, like the paths originated by changes of G2. Hence three queues are used to distribute the workload among three sets of asynchronous tasks. We observe that we have three different degrees of flexibility to let the back-end scale.
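As an illustration of the stateless consumer pattern just described, the following Java sketch shows an A2-style worker that drains the orange queue, updates the profile sub-graph and publishes candidate paths to the green queue. It uses in-process java.util.concurrent queues as stand-ins for the distributed queue service; GraphStore, UserEvent and CandidatePath are hypothetical helper types, not part of the MARA prototype.

import java.util.List;
import java.util.concurrent.BlockingQueue;

// Stand-ins for the messages flowing through the queues.
final class UserEvent { String userId; /* position, feedback, nearby objects ... */ }
final class CandidatePath { String userId; /* v1 -> ... -> vn through the ontology */ }

// Hypothetical facade over the graph database.
interface GraphStore {
    void updateProfile(UserEvent event);              // u in U: update Pi of G2
    List<CandidatePath> searchNewPaths(String userId); // s in S: new paths after the update
}

// Stateless A2-style worker: pulls one event at a time from the orange queue,
// updates the profile sub-graph and publishes every new path to the green queue.
final class A2Worker implements Runnable {
    private final BlockingQueue<UserEvent> orangeQueue;
    private final BlockingQueue<CandidatePath> greenQueue;
    private final GraphStore graph;

    A2Worker(BlockingQueue<UserEvent> orange, BlockingQueue<CandidatePath> green, GraphStore graph) {
        this.orangeQueue = orange;
        this.greenQueue = green;
        this.graph = graph;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                UserEvent event = orangeQueue.take();          // idle workers block here
                graph.updateProfile(event);
                for (CandidatePath p : graph.searchNewPaths(event.userId)) {
                    greenQueue.put(p);                         // hand over to the A3 consumers
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}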
The length of the three queues is a measure of the workload and is used to control the population of consumers for load balancing: the more a queue length grows, the more consumers need to be deployed for that queue. When the utilization of a computing resource rises above a threshold, a new computing resource is instantiated to run stateless tasks. The average time a path spends in a queue is a performance indicator of the throughput of the system.
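A possible shape of this elasticity rule is sketched below in Java; the thresholds, the Infrastructure interface and its methods are illustrative assumptions rather than the actual monitoring API of the prototype.

// Hypothetical view of the monitoring and provisioning facilities of the cloud.
interface Infrastructure {
    long queueLength(String queueName);
    int runningConsumers(String queueName);
    void startConsumers(String queueName, int count);
    void stopConsumers(String queueName, int count);
    double averageCpuUtilization();
    void provisionNode();
}

// The backlog of each queue drives the number of consumers deployed for it.
final class QueueAutoscaler {
    private final int messagesPerConsumer;   // target backlog each consumer should absorb
    private final double cpuThreshold;       // above this utilization a new node is started

    QueueAutoscaler(int messagesPerConsumer, double cpuThreshold) {
        this.messagesPerConsumer = messagesPerConsumer;
        this.cpuThreshold = cpuThreshold;
    }

    // Desired consumer population grows with the queue length (ceiling division).
    int desiredConsumers(long queueLength) {
        return (int) Math.max(1, (queueLength + messagesPerConsumer - 1) / messagesPerConsumer);
    }

    void rebalance(Infrastructure infra, String queueName) {
        int wanted = desiredConsumers(infra.queueLength(queueName));
        int running = infra.runningConsumers(queueName);
        if (wanted > running) {
            infra.startConsumers(queueName, wanted - running);
        } else if (wanted < running) {
            infra.stopConsumers(queueName, running - wanted);
        }
        // When the hosts are saturated, add a computing resource for the stateless tasks.
        if (infra.averageCpuUtilization() > cpuThreshold) {
            infra.provisionNode();
        }
    }
}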
TECHNOLOGICAL STACK
To move from the theoretical formulation to a real implementation we need to identify the most effective technology for the representation and processing of the available information, which includes the ontology, the annotations and the user profiles.
At the semantic level, information is represented using ontology languages such as OWL and RDF. This choice is convenient because it facilitates the ontology definition, the semantic annotation and the definition of semantic queries. Queries are defined in the SPARQL language.
At the logic level, according to the formulated problem, information is represented as a graph and the processing functions as operations over the graph. Working at this level makes the implementation of the solution easy.
Finally, at the data level, the utilization of a NoSQL database, in particular a graph database, allows for a direct mapping between the logic level and the storage, also from a technological point of view, improving performance. In fact, the translation from the semantic and graph domains to the relational domain is exactly the overhead we want to avoid. In particular, the input and output functions of our system will be in the semantic domain and will handle RDF data. The internal representation is a graph, and the processing functions will be implemented in this domain. Data will be stored using the representation model of the Neo4j graph database, and a driver for reading and writing data will translate the information from the internal model to the stored one.
To develop a technological solution of the formulated problem we must be able to manage the available information across these domains and to implement the needed processing functions within the most convenient one. Our solution is based on the TinkerPop 2.6 software stack, whose technological foundation is Blueprints. Blueprints is a collection of interfaces and implementations for the property graph data model. Blueprints is analogous to JDBC, but for graph databases: it provides a common set of interfaces that allow developers to plug-and-play their graph database backend, and software written atop Blueprints works over all Blueprints-enabled graph databases. Blueprints includes a collection of libraries comprising a data flow framework (Pipes), a graph traversal language (Gremlin), an object-to-graph mapper (Frames), a graph algorithms package (Furnace) and a graph server (Rexster). In particular, we used the SAIL (Storage and Inference Layer) implementation of Blueprints for the mapping of RDF to the internal property graph representation. The Sail interface has been developed by OpenRDF. The Storage and Inference Layer (Sail) API is a low-level system API (SPI) for RDF stores and inferencers. Its purpose is to abstract from the storage and inference details, allowing various types of storage and inference to be used. We used the Neo4j (Neo Technology, 2012) implementation of Blueprints to store the application data. Neo4j natively supports the property graph data model.
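The sketch below illustrates how the layers fit together, assuming the Blueprints 2.x GraphSail module and the OpenRDF Sesame repository API: RDF files are imported through the Sail layer, stored in Neo4j as a property graph, and queried with SPARQL. File paths, namespaces and the query are placeholders, not the actual MARA resources.

import java.io.File;

import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.rio.RDFFormat;

import com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph;
import com.tinkerpop.blueprints.oupls.sail.GraphSail;

public class SailNeo4jExample {
    public static void main(String[] args) throws Exception {
        // Neo4j accessed through the Blueprints property graph interface.
        Neo4jGraph graph = new Neo4jGraph("/tmp/mara-graph");            // hypothetical path
        SailRepository repository = new SailRepository(new GraphSail(graph));
        repository.initialize();

        RepositoryConnection connection = repository.getConnection();
        try {
            // Import an RDF/XML file (e.g. the domain ontology or an annotation export).
            connection.add(new File("mara-ontology.rdf"), "http://example.org/prist#", RDFFormat.RDFXML);

            // The SPARQL queries used by the application layer keep working on top of Sail.
            TupleQuery query = connection.prepareTupleQuery(QueryLanguage.SPARQL,
                    "SELECT ?rel ?obj WHERE { <http://example.org/prist#Main_City_Gate> ?rel ?obj . }");
            TupleQueryResult result = query.evaluate();
            while (result.hasNext()) {
                System.out.println(result.next());
            }
            result.close();
        } finally {
            connection.close();
            repository.shutDown();
            graph.shutdown();
        }
    }
}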
PROOF OF CONCEPT
Here we provide implementation details about our proof of concept by describing the main components of the software architecture and the
information they exchange.
The MARA Ontology
The MARA ontology has been developed by experts of the cultural heritage domain using the Protégé tool and is exported as an RDF file. We show an excerpt of the ontology in Box 1. The ontology uses Italian words for concepts and relationships; here we translate some terms to improve readability. The individual defined in the excerpt is the main city gate of the Roman Norba. The excerpt states that this main gate is a defensive wall, in particular it belongs to the city wall of Norba; it is made of limestone using a building technique called polygonal work; it is a specific type of gate called sceo; and it dates back to the Hellenistic age (see Box 1).
Box 1.
<NamedIndividual rdf:about="&prist;Main_City_Gate">
  <rdf:type rdf:resource="&prist;Defensive_Wall"/>
  <prist:isMadeOf rdf:resource="&prist;Limestone"/>
  <prist:isPartOf rdf:resource="&prist;Norba"/>
  <prist:isPartOf rdf:resource="&prist;CityWall"/>
  <prist:hasBuildingTechnique rdf:resource="&prist;polygonal_work"/>
  <prist:hasTypology rdf:resource="&prist;sceo_gate"/>
  <prist:hasHistoricalAge rdf:resource="&prist;Ellenistic(323_a.C._III_a.C.)"/>
</NamedIndividual>
The whole ontology has been imported into Neo4j through the SAIL library. In particular, in Figure 4 this excerpt is visualized by the Neo4j web interface as a property graph. The legend explains the correspondence between node id and concept, that is, the value of each node. Both concepts and relationships are identified by their URIs.
We can compare a SPARQL query in the semantic domain with a query in Cypher, the native query language of Neo4j, both retrieving all the concepts and relationships shown in Figure 4:
• SPARQL query: SELECT ?rel ?obj WHERE { prist:Main_City_Gate ?rel ?obj . }
• Cypher query: MATCH (a)-[r]->(b) WHERE a.value = "prist:Main_City_Gate" RETURN r, b
We will use the Cypher language in the following examples, rather than Java code or SPARQL, because it makes more explicit how the technological solution implements the problem formulation.
Media Contents Annotation
To produce semantic annotations of media contents, the editors use an extended version of AktiveMedia. They load the MARA ontology and link parts of the text or parts of the image to concepts or individuals of the ontology. In Figure 5 the GUI of the annotation tool is shown; in particular, a picture of the main gate of Norba is displayed. The editor can define a title for the annotated picture and, for each annotated part, a text, a content description and a comment. The annotation is exported as an RDF file. Let us suppose that the user annotates two different areas of the image with two different concepts, Main_City_Gate and Defensive_Wall. For each annotated area the RDF file specifies shape, position, dimension, color, comment, text and other information. An excerpt of the RDF file is shown in Box 2.
Box 2.
<j.0:hasAnnotation>
  <rdf:Description rdf:about="http://www.dcs.shef.ac.uk/~ajay/image/annotation#0">
    <j.1:annotationWidth>90</j.1:annotationWidth>
    <j.1:hasConcept>prist:Main_City_Gate</j.1:hasConcept>
    <j.1:hasConcept>prist:Defensive_Wall</j.1:hasConcept>
    <j.1:annotationHeight>234</j.1:annotationHeight>
    […]
  </rdf:Description>
</j.0:hasAnnotation>
Such information is imported into Neo4j and extends the ontology graph. In the same way, after any annotation an agent A2 will search for the available paths from the annotated content to any profiles that include the new concepts: (6441, 6444, 262), (6441, 6444, 121). A graphical representation is shown in Figure 6.
Updating Users’ Profile
When the user enjoys a media content, his profile is enriched with new information. Of course, we will also add other data, such as his position when he moves, information about his device, and so on. For the sake of simplicity, let us suppose that the user profile is just an array of concepts from the ontology and represents his cultural interests. Here we consider a very simple example of a user who enjoyed a picture of the Furba Gate of Norba, or was detected near that ruin. On the occurrence of this event, an agent A2 updates the user's profile with a possible interest in the Furba Gate. The section of the profile shown in Box 3 expresses this interest in a very simple way.
Box 3.
<rdf:RDF>
  <NamedIndividual rdf:about="prist:Profile1">
    <prist:hasInterest rdf:resource="&prist;Furba_Gate"/>
    […]
  </NamedIndividual>
</rdf:RDF>
After the profile has been updated in the graph database, the agent A2 searches for the available paths from his profile to any available content that includes that new interest.
Looking for New Recommendations
Referring to the RDF excerpts presented above, we show here how agents search for new paths which could be candidates for new recommendations. After an annotation, agent A1 searches, for any user, all paths starting from the newly annotated Image0, following the hasConcept relationship and reaching any user profile through the hasInterest relationship.
Figure 7 shows the results, for the profile "Profile1", of the following queries asking for what we defined:
1. MATCH (a)-[:hasAnnotation]-(b)-[:hasConcept]-(c) WHERE a.value = "Image0" MATCH (k)-[:hasInterest]-()-[r]-(c) RETURN k, r, c;
2. MATCH (a)-[:hasAnnotation]-(b)-[:hasConcept]-(c) WHERE a.value = "Image0" MATCH (k)-[:hasInterest]-(c1)-[]-(c2)-[]-(c) RETURN k, c1, c2, c.
The first query finds only one path to Profile1, which starts with (19326, 261, 121): in fact Profile1 has an interest in the Furba Gate, which is a Defensive_Wall. The second query also matches an alternative path to the image through the Main_City_Gate concept, which is itself a Defensive_Wall. Different algorithms and pattern matches can be used to select the paths of interest, limiting their length, defining weights (A. Amato, B. Di Martino, M. Scialdone and S. Venticinque 2014), or imposing other constraints. Agents A3 evaluate the selected paths in order to filter out the ones that connect to contents which have already been recommended and to compare their scores with a specific threshold. When an update of the user's profile occurs, agents A2 obtain the same results with similar queries, for all the relevant contents.
EXPERIMENTAL RESULTS
Preliminary experiments have been carried out to evaluate the bottleneck of a centralized solution and to assess the opportunity to adopt the proposed approach. In particular, we show that a centralized solution is not feasible, but that the asynchronous processing of the queries, which are routed through the distributed queues, makes the utilization of the defined software stack feasible: in this case the graph database is no longer the bottleneck.
The testbed environment is a server equipped with a 64-bit Intel Core 2 Quad Processor Q9300 (6M cache, 2.50 GHz, 1333 MHz FSB) and 4GB of RAM. Oracle Java 7 is the runtime environment. Neo4j, community version 2.0, is the chosen graph database technology.
We used Gatling as the load testing tool. It is an open source framework with excellent support for the HTTP protocol, used here for load testing the REST interface of Neo4j.
The test scenario is composed of a number of concurrent users who submit the same query within a time window of 30 seconds. A random pause, from 0 to 5 milliseconds, is introduced between two requests by each user. We executed the scenario with 1, 5, 25, 50 and 100 users.
Box 4.
MATCH (a)-[:hasAnnotation]-(b)-[:hasConcept]-(c)
WHERE a.value = "image"
MATCH (k)-[:hasInterest]-(r)-[]-(d)-[]-(c)
RETURN k, r, c
It matches all profiles which are interested in a concept that is connected, along two edges of the graph, to any other concept that has been used to annotate some images. Table 1 shows, for a varying number of concurrent users, the number of requests the client was able to submit and the corresponding rates.
Table 1. Total number of requests sent within 30 seconds and average rate, for 1, 5, 25, 50 and 100 concurrent users
All requests have been received and processed. Table 2 reports the response time for each execution; in particular, we can see how the response time increases with the number of concurrent users. Figure 8 plots the minimum, mean and maximum response times. It is straightforward to observe that a centralized solution with a synchronous interaction model does not work when the number of concurrent requests increases.
Table 2. Response time for different scenarios (1, 5, 25, 50 and 100 concurrent users)
In a second experiment we executed the same scenario using the Blueprints software stack, adopting an asynchronous model to execute the queries. This means that, according to the proposed approach, the queries are routed through the queue to a process that collects them and submits batches of requests.
In this case the upper layer of the software stack receives many replicas of the same SPARQL query, which is equivalent to the Cypher query defined before (see Box 5).
Box 5.
SELECT ?d ?user ?con
WHERE {
prist:image <hasAnnotation> ?obj .
?obj <hasConcept> ?con .
?d ?any2 ?con .
?user <hasInterest> ?d . }
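The batching consumer can be sketched in Java as follows; the draining policy, the batch size and the QueryExecutor interface are illustrative assumptions and not the exact implementation used in the experiment.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical facade that submits a batch of queries to the graph store in one round trip.
interface QueryExecutor {
    void executeBatch(List<String> queries);
}

// Replicas of the same query are drained from the queue and submitted as one batch.
final class BatchingConsumer implements Runnable {
    private final BlockingQueue<String> queryQueue;
    private final QueryExecutor executor;
    private final int maxBatchSize;

    BatchingConsumer(BlockingQueue<String> queryQueue, QueryExecutor executor, int maxBatchSize) {
        this.queryQueue = queryQueue;
        this.executor = executor;
        this.maxBatchSize = maxBatchSize;
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>();
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Wait for the first query, then drain whatever else is already queued.
                String first = queryQueue.poll(1, TimeUnit.SECONDS);
                if (first == null) {
                    continue;
                }
                batch.add(first);
                queryQueue.drainTo(batch, maxBatchSize - 1);
                executor.executeBatch(batch);
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}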
In Table 3 we observe that the latency of all batch executions is independent of the number of queries. Moreover, even the batch execution of 10,000 queries takes less than 3 seconds. This means that the adoption of an architectural pattern that implements a synchronous interface and asynchronous computation is a promising approach.
Table 3. Average execution time and latency of batch executions
CONCLUSION AND FUTURE WORK
The utilization of semantic techniques for the management and processing of Big Data requires the design and development of advanced technological solutions that grant performance and scalability. The availability of technologies for the design and deployment of distributed applications must be considered from the design phase in order to exploit the elasticity of the Cloud paradigm. Here we presented a software architecture that allows semantic techniques to be used for Big Data and that can scale with the velocity, heterogeneity and dimension of dynamically changing workloads. Queue services are used both to distribute the workload and to monitor the current performance. The programming of asynchronous and stateless agents allows different parts of the architecture to scale independently, in an elastic way, according to the specific bottleneck. We presented the methodology we adopted, from the problem formulation to the implementation of a prototype, referring to a real use case in the field of cultural heritage.
This work was previously published in the International Journal of Mobile Computing and Multimedia Communications (IJMCMC), 6(2);
edited by Ismail Khalil, Edgar Weippl, and Agustinus Waluyo, pages 50-67, copyright year 2014 by IGI Publishing (an imprint of IGI Global).
ACKNOWLEDGMENT
This work has been supported by PRIST 2009, Fruizione assistita e context aware di siti archeologici complessi mediante terminali mobili, funded by the Second University of Naples.
REFERENCES
Amato, A., Di Martino, B., Scialdone, M., & Venticinque, S. (2013). Personalized recommendation of semantically annotated media contents. In Intelligent Distributed Computing VII (Vol. 511, pp. 261–270). Springer International Publishing Switzerland.
Amato, A., Di Martino, B., Scialdone, M., & Venticinque, S. (2014).Personalized recommendation of semantically annotated media contents.
Chapter in Intelligent Distributed Computing VII, Studies in Computational Intelligence (Vol. 511). Springer International Publishing
Switzerland.
Amato, A., Di Martino, B., & Venticinque, S. (2011).“Bdi intelligent agents for augmented exploitation of pervasive environments,” in WOA (pp.
81–88).
Amato, A., Di Martino, B., & Venticinque, S. (2012). “Semantically augmented exploitation of pervasive environments by intelligent agents,” in
ISPA (pp. 807–814). doi:10.1109/ISPA.2012.118
Amato, A., Di Martino, B., & Venticinque, S. (2012a). A semantic framework for delivery of context-aware ubiquitous services in pervasive environments. In Proceedings of the 4th International Conference on Intelligent Networking and Collaborative Systems (pp. 412–419). doi:10.1109/iNCoS.2012.111
Amato, A., Di Martino, B., & Venticinque, S. (2014). “Big Data Processing for Pervasive Environment in Cloud Computing”, in INCOS (pp.598-
603) doi:10.1109/INCoS.2014.23
Amato, A., Di Martino, B., & Venticinque, S. (2014). Big Data Processing for Pervasive Environment in Cloud Computing. In Proceedings of the International Conference on Intelligent Networking and Collaborative Systems 2014, Salerno, Italy (ISBN 978-1-4799-6387-4). doi:10.1109/INCoS.2014.23
Amato, A., & Venticinque, S. (2013). Big Data and Internet of Things: A Roadmap for Smart Environments . In Studies in Computational
IntelligenceVolume 546, 2014 (pp. 67–89). Springer International Publishing Switzerland.
Forrester. (2014). The Forrester wave: Big data Hadoop solutions (Tech. Rep.). Forrester Research. http://www.forrester.com/Big-Data
Gartner. (2011). Pattern-based strategy: Getting value from big data (Tech. Rep.). Gartner. http://www.gartner.com/
Gartner. (2012). Hype cycle for big data, 2012 (Tech. Rep.). Gartner. http://www.gartner.com/
Schroeck, M., Shockley, R., Smart, J., Romero-Morales, D., & Tufano, P. (2012). Analytics: The real-world use of big data. IBM Institute for Business Value – Executive Report.
Neo Technology. (2012). Neo4j, the world's leading graph database. [Online]. Available: http://www.neo4j.org/
Nessi (2012) “Nessi white paper on big data” Nessi Europe, Tech. Rep.
Oracle (2013). “Big data for the enterprise,” Oracle, Tech. Rep.
Renda, G., Gigli, S., Amato, A., Venticinque, S., Martino, B. D., & Cappa, F. R. (2012). Mobile devices for the visit of "Anfiteatro Campano" in Santa Maria Capua Vetere. In EuroMed (pp. 281–290).
Zikopoulos, P., & Eaton, C. (2011). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data (1st ed.). McGraw-Hill
Osborne Media.
CHAPTER 50
Evaluating NoSQL Databases for Big Data Processing within the Brazilian Ministry of
Planning, Budget, and Management
Ruben C. Huacarpuma
University of Brasília, Brazil
Daniel da C. Rodrigues
University of Brasília, Brazil
Antonio M. Rubio Serrano
University of Brasília, Brazil
João Paulo C. Lustosa da Costa
University of Brasília, Brazil
Rafael T. de Sousa Júnior
University of Brasília, Brazil
Lizane Leite
University of Brasilia, Brazil
Edward Ribeiro
University of Brasilia, Brazil
Maristela Holanda
University of Brasilia, Brazil
Aleteia P. F. Araujo
University of Brasilia, Brazil
ABSTRACT
The Brazilian Ministry of Planning, Budget, and Management (MP) manages enormous amounts of data that is generated on a daily basis.
Processing all of this data more efficiently can reduce operating costs, thereby making better use of public resources. In this chapter, the authors
construct a Big Data framework to deal with data loading and querying problems in distributed data processing. They evaluate the proposed Big
Data processes by comparing them with the current centralized process used by MP in its Integrated System for Human Resources Management
(in Portuguese: Sistema Integrado de Administração de Pessoal – SIAPE). This study focuses primarily on a NoSQL solution using HBase and
Cassandra, which is compared to the relational PostgreSQL implementation used as a baseline. The inclusion of Big Data technologies in the
proposed solution noticeably increases the performance of loading and querying time.
INTRODUCTION
Over the past years, Big Data storage and management have become challenging tasks. According to Russom (2011), when the volume of data
started to grow exponentially in the early 2000s, storage and processing technologies were overwhelmed managing hundreds of terabytes of data.
In addition, the heterogeneous nature of the data presents challenges that must be taken into consideration. Such characteristics can be observed
in different domains such as social network operations, gene sequencing or cellular protein concentration measurement (Andrew, 2012).
Moreover, improved Internet connections and new technologies, such as smartphones or tablets, require faster data storage and querying. For
these reasons, organizations and enterprises are becoming more and more interested in Big Data technologies (Collet, 2011).
Along with well-known IT companies such as Google and Facebook, governments are also interested in Big Data technologies in order to process
information related to education, health, energy, urban planning, financial risks and security. Efficient processing of all this data reduces
operating costs, thereby economizing the investment of public resources in a more rational way (Office of Science and Technology Policy of The
United States, 2012). In the same way as other governments, Brazil is starting to employ Big Data technologies in its IT systems.
In this chapter, we propose the use of Big Data technology to overcome the limitations observed in SIAPE database processing. Notably, the SIAPE system controls payroll information for all federal public sector employees in Brazil. Given its growth rate of 16GB per month, the SIAPE database can be characterized as a relevant Big Data case. Specifically, we focus on the Extract, Transform and Load (ETL) (Jun, 2009) modules of this system, which in our proposal are modified to operate with NoSQL data storage, using HBase and Cassandra. Thus, the SIAPE database is used as a case study in this chapter in order to validate our proposal.
The remainder of this chapter is structured as follows: Section 2 presents basic concepts related to Big Data; in Section 3, we describe the use case
including our proposed solution; Section 4 discusses our implementation results. Finally, Section 5 presents our conclusions.
BIG DATA
The current data we manage is very diverse and complex. This is a consequence of social network interactions, blog posts, tweets, photos and
other shared content. Devices continuously send messages about what they or their users are doing. Scientists are generating detailed
measurements of the world around us with sensors installed within devices such as mobile telephones, tablets, watches, cars, computers, etc. and
finally the internet is the ultimate source of data with colossal dimensions (Marz, 2013).
Big Data exceeds the capacities of conventional database systems: the data is too big, moves too fast or does not fit into existing database architectures (Dumbill, 2012). Although the literature usually defines Big Data based on the size of the data, from the point of view of this work Big Data is not defined by size alone; following Russom (2011), we also take into account the so-called 3Vs factors, i.e. Volume, Variety and Velocity.
Volume
Data volume is the primary attribute of Big Data and the greatest challenge to conventional IT structures. With that in mind, Big Data can be quantified by counting records, transactions, tables, or files; however, most people define Big Data in terabytes (TB) and sometimes petabytes (PB) (Russom, 2011). Many companies, such as Google, Yahoo and Facebook, already hold large amounts of archived data. When the volume of data is larger than conventional relational database infrastructures can cope with, the processing options become limited to some form of massively parallel processing. For instance, Hadoop-based solutions use a distributed file system (the Hadoop Distributed File System, HDFS), which makes data available to multiple computing nodes.
Variety
Data can be produced and stored in multiple formats. For example, it could be text from social networks, image data, and raw feed directly from a
sensor source. None of these are initially ready for integration into an application. A common use of Big Data processing is to take unstructured
data and extract ordered meaning, so that it can be used as a structured input to an application or as information for decision making. In this
context, specific data types suit certain classes of databases better (Dumbill, 2012). For instance, documents encoded as XML are most versatile
when stored in a dedicated XML store. Social network relations are basically graphs, and in this sense this data type fits better in graph
databases.
Velocity
Another defining attribute of Big Data is its velocity, that is, the speed at which it is generated; it can also be described through the frequency of data generation or of data delivery. In this context, the growth in data rates related to social media usage has changed how people consume data. On social media, users are often solely interested in their recent messages (tweets, status updates, new comments, etc.): they frequently discard old messages and pay attention only to recent updates. Data movement is now almost real time, and the update window has been reduced to fractions of a second. In the same sense, the smartphone era contributes to increasing the rate of data inflow by providing a fast streaming data source (Dumbill, 2012).
There are many reasons to consider streaming data processing. First, when the input data arrives in real time and cannot be stored in its entirety, some level of analysis must occur as the data streams in, in order to keep storage requirements practical. Secondly, analyzing the input stream is needed when the application mandates an immediate response to the data; due to the rise of mobile applications and online gaming, this is an increasingly common situation.
BIG DATA TECHNOLOGIES
The astonishing growth in data has profoundly affected traditional database systems, such as relational databases; the traditional systems and the data management techniques associated with them are no longer suitable for managing it. In this context, a new set of technologies has emerged to deal with the Big Data challenges.
Hadoop (Apache Hadoop, 2014) and MapReduce (Dean, 2008) have long been mainstays of the big data movement. They emerged as newer and faster ways to extract business value from massive datasets. The Hadoop project, supported by the Apache Foundation, has become one of the most popular implementations of MapReduce: Hadoop MapReduce is a Java implementation of the MapReduce framework originally created by Google. Hadoop was initially developed at Yahoo to manage and analyze Web-scale datasets and has quickly been adopted by other technology companies and industries.
The Hadoop project aims to develop open source software for distributed, scalable and reliable computing. To achieve this, the Hadoop
framework runs applications on a large cluster built of commodity hardware. Hadoop implements MapReduce, a strategy where the application
is divided into many small tasks, each of which may be executed or re-executed at any node in the cluster. Data is stored on the nodes using the distributed file system HDFS, the open source counterpart of the Google File System (GFS). While HDFS provides high aggregate bandwidth across the cluster, both MapReduce and HDFS are designed so that node failures are automatically handled by the framework (Apache Hadoop, 2014).
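To make the programming model concrete, the classic word-count example below shows the map and reduce tasks as they would be written against the Hadoop MapReduce Java API; it is a generic illustration, not code from the case study.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map tasks run in parallel on HDFS blocks and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The framework groups intermediate pairs by key; reduce tasks aggregate them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));   // total occurrences of the word
        }
    }
}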
Besides the Hadoop framework, other technologies can be grouped under the term Not-only SQL (NoSQL). These systems have the main
characteristic of scaling to tackle a large number of data sets and effectively use new techniques to manage Big Data (Marz, 2013). As the use of
Big Data technologies makes it possible to extract value from very large volumes of a wide variety of data by enabling high-velocity capture,
discovery, and analysis, these new technologies are spreading out and, consequently, the storage, manipulation and analysis of enormous
datasets are becoming cheaper and faster than ever before. In this context, database storage systems are preeminent, a fact that motivates the
development of new database technologies such as NoSQL databases which as would be expected are designed mainly for storing and retrieving
great quantities of data.
NoSQL Databases
NoSQL database systems emerged at the beginning of the 21st century as an alternative to traditional Relational Database Management Systems (RDBMS). They are distributed databases built to meet the demands of high scalability and fault tolerance in the management and analysis of massive amounts of data, the so-called Big Data paradigm. Over the last 10 years, various NoSQL technologies have emerged, each one with its own set of peculiarities, advantages and disadvantages. Thus, we devote the first part of this section to contextualizing the most representative examples of NoSQL systems currently available and their main characteristics.
In general, NoSQL comprises a class of data storage systems that do not adhere to the relational model and that are useful when storing and retrieving great quantities of data is more important than managing the relationships between the elements. NoSQL systems run on distributed systems that offer scalability by taking advantage of multiple node processors. Unlike traditional relational models, NoSQL does not provide strict consistency for the data, as defined by the Atomicity, Consistency, Isolation and Durability (ACID) principles. Indeed, this definition of consistency may be too strict and not necessary in some cases, especially if we want to work in a distributed environment. Instead of the strict ACID model, NoSQL systems are based on the Consistency, Availability and Partition tolerance (CAP) theorem (Ramakrishnan, 2012), which applies to all database storage systems. Consistency means that "all clients have the same view of data". Availability means that "each client can always read and write". Partition tolerance means that "the system works well despite physical network partitions". Any database storage system can present only two of those three characteristics (Ramakrishnan, 2012).
A modern relational database management system (RDBMS) provides consistency and availability, but it cannot be extended to partition tolerance. On the other hand, a NoSQL database provides partition tolerance and either consistency (strong or weak) or availability, but not both. Besides that, several NoSQL databases have loosened the requirement on consistency in order to achieve better availability and partitioning; the resulting systems are known as Basically Available, Soft-state and Eventually consistent (BASE) systems (Prichett, 2008).
There are three main categories of NoSQL databases: key-value stores, column family databases and document-based stores (Padhy, 2011). These NoSQL databases are subject to the CAP theorem and are scalable, typically operating with an in-memory dataset and on-disk storage (data is designed to live in memory for speed and can be persisted to disk) (Padhy, 2011). Each category is represented by one or more NoSQL databases written in various programming languages (Java, Erlang, C, etc.), and generally available as open source software. Within the scope of this chapter, we use column family and key-value databases.
Column Family Database
A Column Family (CF) database supports the concept of tables with rows and columns. These tables need to be created at the time of defining the
database schema, but despite their similarities with the relational database model, these stores follow a different architecture because column
family stores do not support ACID transactions, do not follow the relational model, and do not comprise a high-level query language like SQL.
The atomicity unit of a CF database is a row, i.e., every read or write of one row is atomic, making it impossible for two competing operations to change the same row simultaneously. However, transactions are usually not supported.
Technically, each table in a CF database is defined as a sparse, multi-dimensional, distributed map. The word sparse comes from the fact that the columns created in a database do not waste space when most cells are empty. The tables are multidimensional because they can be configured to store the last n versions (usually n = 3) of each cell, and the map is distributed because the database runs on a cluster.
Key-Value Database
The key-value database has the simplest data model among the NoSQL databases. The data is scattered over a cluster, where each node is responsible for a portion of the data. In other words, the data is automatically partitioned (i.e., sharded), evenly distributed and replicated among the nodes. This strategy brings the following direct advantages: higher throughput, small read and write latencies, high availability and fault tolerance. These databases favor architectural principles like availability, fault tolerance, operational simplicity and predictable scalability. A key-value store typically exposes three basic operations:
• Put (Key, Value): Given a sequence of bytes as an identifier (key), and another sequence as the data (value), the database stores the pair
(key, value) in one of the cluster machines. Additionally, it replicates this same data on yet another machine to increase the availability and
fault tolerance in the event of hardware and software failure of the original machine that received the data. These databases use a hash
function to evenly distribute the data among the machines.
• Get (Key): Given the same key used in the put operation, this operation returns the correspondent data, so that it retrieves the data closest
to the client machine.
• Delete (Key): Also, given the key, the data is removed from the node that was storing it, along with its replicas.
Thus, these databases implement a Distributed Hash Table (DHT) across the nodes of the cluster. In a traditional hash table, a hash function provides an even spread of data between the slots of an in-memory table. In a DHT, this same hash function is used to determine which node in the cluster will be responsible for storing the data. Through the use of a good hash function, such as MD5 or SHA1, the key-value database spreads data evenly across the cluster, thus providing load balancing between machines and increasing the throughput of the system, because the database is able to serve several requests simultaneously, each directed to a separate machine. Each machine becomes the owner of a range of data, represented by part of the key space, in what is called Consistent Hashing (Karger, 1997). Also, the nodes in the cluster form a peer-to-peer (p2p) network in which each node performs the same function, i.e., there are no special nodes (managers or masters) that could represent a potential single point of failure. Thus, even if one or more nodes are down, the cluster keeps operating, suffering only a graceful degradation of performance.
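The following Java sketch illustrates the consistent-hashing idea in its simplest form: node names are placed on a ring, and a key is owned by the first node found clockwise from its hash. Node names and the folding of the MD5 digest are illustrative; production systems add virtual nodes and replication.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.SortedMap;
import java.util.TreeMap;

final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String nodeName) {
        ring.put(hash(nodeName), nodeName);
    }

    // The owner of a key is the first node clockwise from the key's hash position.
    String ownerOf(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {        // fold the first 8 bytes into a long
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        for (String node : Arrays.asList("node-1", "node-2", "node-3")) {
            ring.addNode(node);
        }
        System.out.println(ring.ownerOf("employee:12345"));  // prints the responsible node
    }
}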
HBase NoSQL Database
HBase (Vora, 2011) is one of the most widely used NoSQL databases. It is an Apache open source project and aims to provide a storage system
similar to Bigtable in the Hadoop distributed computing environment. HBase implements the same column-oriented model as Bigtable. Bigtable was one of the first NoSQL systems, along with Amazon's Dynamo, but both are closed source systems not available outside their
proprietary institutions. However, the articles published on the Dynamo (DeCandia, 2007) and BigTable (Chang, 2008) systems motivated the
implementation of open source clones that achieved high popularity and production use.
HBase stores data in tables described as sparse multidimensional sorted maps, which are structurally different from the relations found in
conventional relational databases. An HBase table stores data rows based on the row key. Thus, each row has a unique row key and an arbitrary
number of columns. Several columns can be grouped in a column family. A column of a given row, which we denote as table cell, can store a list of
timestamp-value pairs, where the timestamps are unique and the values may contain duplicates (Franke, 2013).
To access a particular cell in HBase it is necessary to pass the following tuple: row key, column family, column, timestamp. We must specify the
timestamp because HBase can store multiple versions of the same multidimensional cell so we can use this resource as a sort of content
“versioning”. It is noteworthy that the content of each cell is opaque to HBase. In other words, it can be any sequence of bytes. The row key is a
sequence of bytes that uniquely identifies an HBase table row; in practical terms, it is the row's primary key. The row key cannot be updated, and each HBase table keeps all row keys lexicographically ordered. The Column Family groups a set of related columns, and every column must be part of one, and only one, column family. To retrieve a given column one should specify which column family it belongs to, as column family:column. The column family acts as a namespace (scope) that separates, not only logically but physically, a set of related columns, because
all columns belonging to a given column-family are stored contiguously on the same file on a file system. The column family is represented by a
sequence of printable characters and should be declared at the time of database schema creation, but the columns, which can be any byte
sequence, can be created or deleted dynamically while the database is running. A given HBase table can contain hundreds of column families and
thousands of columns.
Each HBase table is partitioned into one or more adjacent sets of rows, called regions. Through replication and the partitioning of regions, HBase
provides fault tolerance and high availability. An HBase cluster is divided into two types of servers: HMaster and HRegionServer. The HMaster
manages the metadata of HBase tables while the HRegionServer stores and manages access to the Regions. In general, a single HMaster server
and several HRegionServer machines are used as an HBase instance. The replication and high availability of data is delegated to an HDFS.
Each table in HBase is horizontally partitioned among multiple machines according to the row key intervals and the region size. HBase does not have a query language (DML) or a metadata manipulation language (DDL), and it does not support secondary indexes. All those features were left out to allow for a high-performance database that can process many thousands of operations per second with high availability and fault tolerance.
The architecture of HBase (and Cassandra) is based on the concept of the Log Structured Merge tree (LSM-tree). HBase keeps a table in memory (the Memtable) where operations like insertion, search and data changes are performed. Periodically, or after the Memtable reaches a certain size (threshold), the Memtable is flushed to disk and a clean Memtable is created. Each Memtable flush results in an immutable indexed file called an SSTable (Sorted String Table). One feature that explains the high performance of HBase is that it performs this flush asynchronously, in the background. Before returning the acknowledgment to the user, each write request is first recorded in a command log on disk, and after that the data is inserted or changed in the Memtable in RAM. The write command log provides fault tolerance: if the machine fails before the SSTable is generated then, after the machine restarts, HBase identifies the checkpoints that were not finalized and replays the command log to re-insert the unflushed data into the Memtable. Obviously, a failure or corruption of the log file would make the data recovery impossible, so administrators of HBase clusters recommend storing this log on a separate fault-tolerant disk, with RAID 10 for example.
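A highly simplified sketch of this write path is shown below; the in-memory commit log, the flush threshold and the absence of compaction are deliberate simplifications for illustration, not the actual HBase implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class LsmStore {
    private final List<String> commitLog = new ArrayList<>();         // stands in for the on-disk log
    private final TreeMap<String, String> memtable = new TreeMap<>(); // sorted in-memory table
    private final List<TreeMap<String, String>> sstables = new ArrayList<>();
    private final int flushThreshold;

    LsmStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        commitLog.add(key + "=" + value);      // 1. record the mutation for crash recovery
        memtable.put(key, value);              // 2. apply it in memory
        if (memtable.size() >= flushThreshold) {
            flush();                           // 3. spill a full Memtable to an SSTable
        }
    }

    String get(String key) {
        String value = memtable.get(key);      // newest data first ...
        if (value != null) {
            return value;
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {   // ... then the SSTables, newest backwards
            value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    private void flush() {
        sstables.add(new TreeMap<>(memtable)); // immutable, sorted snapshot
        memtable.clear();
        commitLog.clear();                     // the flushed data no longer needs replay
    }
}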
In a production system, HBase should run on HDFS, so that it can rely on resources such as replication and fault tolerance. Moreover, as high availability is a requirement, the use of a coordination module (ZooKeeper) is recommended to prevent the system from becoming unavailable, given that HMaster failures can make HBase unavailable.
Basically, there are two data retrieval operations in HBase: table scan and get. The first scans a range of rows and retrieves all of them, while get retrieves a single specific row (or a column-family/column subset of it).
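For illustration, the snippet below performs a get and a scan with the HBase Java client (1.x-style API); the table, column family and column names are hypothetical and do not refer to the SIAPE schema.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employees"))) {

            // get: one row, addressed by row key + column family:column.
            Get get = new Get(Bytes.toBytes("row-0001"));
            get.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            Result row = table.get(get);
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));

            // scan: a range of rows, lexicographically ordered by row key.
            Scan scan = new Scan(Bytes.toBytes("row-0001"), Bytes.toBytes("row-0100"));
            scan.addFamily(Bytes.toBytes("financial"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}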
Cassandra NoSQL Database
Cassandra (Lakshman & Malik, 2009) is an open source NoSQL database created by engineers at Facebook, specifically to implement the Inbox
Search feature, which allows users of the social network to search their inboxes. After releasing the system as free software in 2008, engineers at
Facebook backtracked on their strategy of using Cassandra internally, but the Apache community adopted the project and has added features to this system at a constant pace, making it a remarkably active project regarding new features and bug fixes. Cassandra presents high scalability
regarding data volume as well as concurrent users, high availability under failure scenarios, and high performance, mainly for writing operations.
Unlike HBase, Cassandra's latest versions have the Cassandra Query Language (CQL) which derives directly from a subset of SQL.
Cassandra's architecture draws heavily on Amazon's Dynamo, a key-value database, and on Google's Bigtable column family database.
Accordingly, Cassandra incorporates the concepts of tunable consistency and consistent hashing, while following the Bigtable Column Family
model (also used in HBase). An instance of Cassandra is a cluster composed of machines. A keyspace is the namespace for a particular database,
typically one per application. Data in a Keyspace is inserted as a row and addressed by a primary key (a byte array), called Row Key. Each row
contains one or more Column Families (CFs) that are a collection of columns grouped by logical affinity. Columns can be created and removed
dynamically and have an associated timestamp to deal with concurrent updates.
CASE STUDY
The case study presented in this work consists of the implementation of NoSQL storage systems using Hadoop, HBase and Cassandra
technologies. The performance of the proposed solutions is compared to the PostgreSQL (PostgreSQL, 2014) implementation that is currently
used on the SIAPE database. SIAPE is a national system to manage the payroll of Brazilian federal employees. This system manages every paying
unit at the national level. In addition, it makes employee data available on the siapenet.gov.br portal (SIAPE, 2014) and produces a mirror file for auditing purposes, as well as for business intelligence data warehousing and open data publishing, as part of an anti-corruption policy.
In this context, we are interested in analyzing the loading time of the SIAPE mirror file into HBase and Cassandra compared to the same operation into a PostgreSQL database. Besides that, we are interested in comparing the query time performance of auditing operations over the SIAPE mirror database aimed at finding payroll anomalies. In order to audit this database, a set of data mining and filtering software modules, called audit trails, was developed in another correlated work (Campos, 2012).
For the present work, the hardware setup used for the relational database system consists of a Dell Inc. PowerEdge R610 with a 2xCPU Xeon
2.80GHz, 32GB RAM and a 500GB HD. The Database Management System (DBMS) used in this case is the PostgreSQL v8.4 optimized for this
hardware. The HBase implementation comprises a single system that has been configured as a Master Server, while three systems are used as Region Servers, each master and region server being configured with 2xCPU Xeon 2.80GHz, 8GB RAM and a 500GB HD. Finally, for the Cassandra implementation, the cluster comprises three nodes, each one configured with 2xCPU Xeon 2.80GHz, 8GB RAM and a 500GB HD.
In the remainder of this section we explain, initially, how the mirror file is formatted before beginning with the data loading. Then the process of
loading data is described and, finally, we discuss some aspects related to the resulting data storage.
SIAPE File/Database
On a monthly basis, the SIAPE system generates a mirror file, which contains a sequential copy of the SIAPE database including personal,
functional and financial data of federal public workers in Brazil (Serrano, 2014). This file has information about two and a half million workers,
including active, inactive and retired people. The actual size of each SIAPE file per month amounts to nearly 16GB, and is growing every month.
Moreover, this file contains 36 fields of personal data, 153 fields of functional data and 32 fields of financial data, totaling 221 fields. The file uses fixed-size fields: for each employee, the paying unit lists the personal data in the first line and the functional data in the next line. After the functional data, there are several lines of financial data, whose number depends on the number of rubrics that the employee receives.
There are about 28,696,200 lines of personal and functional data and about 167,077,000 lines of financial data. This information covers, exclusively, the 2012 fiscal year, so in both the short and the long term this database is expected to grow significantly with respect to storage and query operations.
Modeling
For the purposes of this work, the first challenge was to adapt the relational model of the SIAPE database (see Figure 1) to the NoSQL model, both for
HBase and Cassandra. The relational model of the SIAPE database is composed mainly of the tables “servidor_historico” (employee records),
“servidor_dado_financeiro” (employee financial data), as well as auxiliary tables and the related indexes for the main queries. The
servidor_historico table contains 192 fields of personal and financial information of employees and servidor_dado_financeiro table has 20 fields.
Figure 1 shows a part of the relational data model. The tables rubrica (rubric), rubricas_pontos_controle (rubric control points), rubrica_pc_rubricas_incompativeis (incompatible rubrics) and ponto_controle (control point) are auxiliary tables. They describe rubrics that must not be paid simultaneously to the same employee in a given month. A rubric is a classification for payment components that an employee can receive; for example, a component may be a regular salary payment or a sporadic vacation payment. A control point is a classification for every object that can be audited. For the purposes of this work, the main interest is rubric incompatibility. Figure 2 shows the part of the data model related to rubrics.
HBase Data Model
For the HBase model, the “Worker” table was defined with a row key composed of the fields “Year/Month”, “MatSiape”, “CodOrgao”, “Upag” and “SeqSdf” to guarantee a unique identifier for every employee (see Figure 3). This row key was designed to take advantage of the HBase data structure and to match the most common questions that the queries need to answer.
Figure 3. HBase structure for the SIAPE Database
In the HBase implementation, two column families were defined: “Personal Data” and “Financial Data”. The “Personal Data” column family groups all columns related to the employee’s personal data, while the “Financial Data” column family stores the employee’s financial data. The last attribute of the row key (SeqSdf) holds a sequence value that identifies each financial record of a specific employee. Moreover, the use of column families in HBase allows for disk usage optimization. The data structure designed for storing SIAPE data in the HBase system is shown in Figure 3.
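As an illustration of this design, the sketch below (again using the Python happybase client rather than the authors' PDI flow) shows how a composed row key and the two column families could be populated; the column qualifiers, sample values, simplified family names and the use of '|' as a key separator are assumptions made for the example.

import happybase

def make_row_key(year_month, mat_siape, cod_orgao, upag, seq_sdf):
    # Concatenate the key components in query-relevant order.
    return f"{year_month}|{mat_siape}|{cod_orgao}|{upag}|{seq_sdf}".encode()

connection = happybase.Connection('hbase-master.example.org')  # hypothetical host
worker = connection.table('Worker')

row_key = make_row_key('201201', '1234567', '40000', '000001', '0001')
worker.put(row_key, {
    b'PersonalData:nome': b'JOAO DA SILVA',     # personal/functional columns
    b'PersonalData:situacao': b'ACTIVE',
    b'FinancialData:cod_rubrica': b'00001',     # one financial record, identified by SeqSdf
    b'FinancialData:valor': b'3500.00',
})
connection.close()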
Cassandra Data Model
The Cassandra data model is composed of column families. The servidor_dado_financeiro column family has a composite row key comprising the columns cod_rubrica (rubric code), ano (year), mês (month), cod_orgao, upag and mat_siape. This row key is similar to the one in the HBase model, though with one additional field, cod_rubrica. In addition, the Cassandra model uses fewer fields of personal and functional information, because only these few fields are used when executing the payroll queries. Another column family, called “rubrica” (rubric), has a composite row key comprising ponto_controle, cod_rubrica and cod_rubrica_incompativel. Figure 4 shows the data model implemented in Cassandra.
HBase Implementation
First, in the HBase implementation, the Cloudera Hadoop Distribution v4.0 (CDH4) was used, given that it is user friendly and freely distributed; this led to the configuration of CDH4, ZooKeeper (only for the Master Server), HDFS and HBase. Secondly, we defined two steps for loading the SIAPE file: formatting the SIAPE file into CSV, and loading the CSV files using the ETL provided by Pentaho Data Integration (PDI) (Pulvirenti, 2011). To implement the formatting process for HBase, we used a shell script to separate the personal and functional data into a CSV file called servidor (employee), as well as another CSV file containing the financial data of employees, called servidor_dados_financeiros. The servidor file has 192 fields and, after formatting, contains almost 2.4 million rows (i.e., employees with their personal and functional data). The servidor_dados_financeiros file has 20 fields and, after formatting, contains almost 20 million rows. In the case of Cassandra, the formatting process is the same as for HBase.
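The following is a hedged Python equivalent of such a formatting script. The real SIAPE record layout is not reproduced here: the sketch assumes a hypothetical one-character record-type indicator at the start of each fixed-width line ('1' personal, '2' functional, '3' financial) and writes whole payloads rather than the individual 192 and 20 fields.

import csv

def split_mirror_file(mirror_path):
    with open(mirror_path, encoding='latin-1') as mirror, \
         open('servidor.csv', 'w', newline='') as emp_out, \
         open('servidor_dados_financeiros.csv', 'w', newline='') as fin_out:
        emp_writer, fin_writer = csv.writer(emp_out), csv.writer(fin_out)
        pending_personal = None
        for line in mirror:
            record_type, payload = line[0], line[1:].rstrip('\n')
            if record_type == '1':        # personal data: buffer until functional line arrives
                pending_personal = payload
            elif record_type == '2':      # functional data: emit the joined employee row
                emp_writer.writerow([pending_personal, payload])
            elif record_type == '3':      # financial data: one row per rubric line
                fin_writer.writerow([payload])

split_mirror_file('siape_mirror_201201.txt')  # hypothetical file name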
To implement the loading process, the PDI software is used to load the “servidor” and “servidor_dados_financeiros” data into the “Worker” HBase table. Figures 5 and 6 show the flow model of the loading process for the “Worker” table. In the first step, we process the “servidor” CSV file, concatenating the year and month fields. In the second step, the row key is generated for every employee by composing the fields Year/Month, MatSiape, CodOrgao and Upag and generating the SeqSdf. In the third step, unnecessary or temporary fields are removed. Finally, we insert the output into the “Worker” HBase table. For the financial data, the process follows the same sequence, as shown in Figure 6.
Cassandra Implementation
As stated above, the data model was implemented in a cluster of 3 nodes using Cassandra DataStax Community Edition 2.0.7. In addition, CQLSH is used, making it possible to execute the Cassandra Query Language (CQL), which provides a query syntax similar to relational SQL. By means of CQLSH, the keyspace “siape” is created, as well as the column families. Subsequently, the following steps are performed: the loading process and the querying process for rubric incompatibilities.
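A sketch of this schema-creation step is shown below, using the Python cassandra-driver instead of CQLSH; the column types, the extra valor column and the exact choice of partition and clustering keys are assumptions that merely mirror the composite row keys described above.

from cassandra.cluster import Cluster

cluster = Cluster(['node1.example.org'])   # hypothetical contact point
session = cluster.connect()

# Keyspace for the case study (replication settings are illustrative).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS siape
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
# Financial data, partitioned by rubric and month.
session.execute("""
    CREATE TABLE IF NOT EXISTS siape.servidor_dado_financeiro (
        cod_rubrica text, ano int, mes int,
        cod_orgao text, upag text, mat_siape text,
        valor decimal,
        PRIMARY KEY ((cod_rubrica, ano, mes), cod_orgao, upag, mat_siape)
    )
""")
# Incompatible rubric pairs per control point.
session.execute("""
    CREATE TABLE IF NOT EXISTS siape.rubrica (
        ponto_controle text, cod_rubrica text, cod_rubrica_incompativel text,
        PRIMARY KEY (ponto_controle, cod_rubrica, cod_rubrica_incompativel)
    )
""")
cluster.shutdown()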
Cassandra Loading Process
The process used for loading data into the Cassandra database is very similar to the one used for the HBase database. First, the SIAPE file is processed by a shell script, resulting in two formatted files: one containing personal and functional data and the other containing only financial data. Secondly, Pentaho Data Integration (PDI) is used to filter constraints over personal, functional and financial information; in this way, it performs the pre-processing for the rubric incompatibility analysis. The result is a CSV file with 6 columns and approximately 6,500,000 lines. The reduction in the number of fields is possible because some of the fields used to filter personal, functional and financial information are not needed in the following steps. Consequently, the subsequent queries use only the employee identifier and the received rubrics.
The Cassandra loading process was tested with two different methods. The first one uses the JDBC API to populate the column families. JDBC is an API to connect a Java application to a database, in this case the Cassandra database. The JDBC driver for Cassandra was chosen because it performs better when inserting data into the Cassandra database compared with the other tested methods. The second method uses the Cassandra Bulk Loader, which requires SSTables built by a Java application that uses a special API called SSTableSimpleUnsortedWriter. After building the SSTables, another process is necessary to load them into the Cassandra database.
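For illustration, a straightforward loading loop with the Python cassandra-driver and a prepared statement could look as follows; it is a sketch of the general idea rather than the authors' JDBC application or Bulk Loader, and the CSV file name and the assumption of six columns in the order of the composite key are illustrative.

import csv
from cassandra.cluster import Cluster

cluster = Cluster(['node1.example.org'])   # hypothetical contact point
session = cluster.connect('siape')

# Prepared once, executed for every CSV row.
insert = session.prepare("""
    INSERT INTO servidor_dado_financeiro
        (cod_rubrica, ano, mes, cod_orgao, upag, mat_siape)
    VALUES (?, ?, ?, ?, ?, ?)
""")

with open('rubricas_filtradas.csv', newline='') as src:   # hypothetical file name
    for cod_rubrica, ano, mes, cod_orgao, upag, mat_siape in csv.reader(src):
        session.execute(insert, (cod_rubrica, int(ano), int(mes),
                                 cod_orgao, upag, mat_siape))
cluster.shutdown()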
Cassandra Querying Process
In order to query the Cassandra implementation, a Java application was developed using the Hector API (Hector, 2014), which is used to query the Cassandra database by providing functions to access, read, write, update and delete data. The application searches, in the column family “rubrica”, every rubric xi associated with the “ponto_controle” id related to rubric incompatibilities. For each rubric xi, a thread is created that is responsible for the subsequent queries. After that, still in the column family “rubrica”, a search is done for all rubrics incompatible with the xi rubric. For each incompatible rubric yij, the employees that receive the rubric yij in the column family “servidor_dado_financeiro” are searched for a previously defined year and month. For every resulting employee, a new search is done in the column family “servidor_dado_financeiro” to verify whether he/she also receives the xi rubric. Finally, if the employee receives both the xi and yij rubrics, this employee is flagged as irregular because he/she is receiving two incompatible payments. The final result is the set of employees that received at least one pair of incompatible rubrics.
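The sketch below expresses the same incompatibility check with the Python cassandra-driver instead of the Hector-based Java application, and it computes the irregular employees as a set intersection rather than per-employee look-ups; the control point id, year and month values, and the single-threaded structure are simplifications.

from cassandra.cluster import Cluster

def find_irregular_employees(session, ponto_controle, ano, mes):
    irregular = set()
    # All (x_i, y_ij) incompatible rubric pairs for this control point.
    pairs = session.execute(
        "SELECT cod_rubrica, cod_rubrica_incompativel FROM rubrica "
        "WHERE ponto_controle = %s", (ponto_controle,))
    for pair in pairs:
        receives_x = {row.mat_siape for row in session.execute(
            "SELECT mat_siape FROM servidor_dado_financeiro "
            "WHERE cod_rubrica = %s AND ano = %s AND mes = %s",
            (pair.cod_rubrica, ano, mes))}
        receives_y = {row.mat_siape for row in session.execute(
            "SELECT mat_siape FROM servidor_dado_financeiro "
            "WHERE cod_rubrica = %s AND ano = %s AND mes = %s",
            (pair.cod_rubrica_incompativel, ano, mes))}
        # Employees paid both incompatible rubrics in the same month.
        irregular |= receives_x & receives_y
    return irregular

cluster = Cluster(['node1.example.org'])   # hypothetical contact point
session = cluster.connect('siape')
print(find_irregular_employees(session, 'PC-01', 2012, 1))
cluster.shutdown()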
RESULTS AND DISCUSSION
This section presents the performance results of loading data with the PostgreSQL, HBase and Cassandra databases as well as the response times
for querying rubric incompatibilities by means of these three databases respectively. In order to compare Cassandra with PostgreSQL and HBase,
we refer to a previous work that compared the PostgreSQL and HBase databases using SIAPE data (Huacarpuma, 2013).
Loading Data Results
Prior to testing the loading process, the source mirror file was processed once, resulting in the input CSV file. The loading process was then repeatedly tested using this CSV file.
The loading times for the relational and the NoSQL databases are shown in Table 1 and Figure 7. Comparing these results, the HBase solution is clearly faster than PostgreSQL and Cassandra, for several reasons. One reason is that HBase does not enforce the ACID properties as PostgreSQL does. While PostgreSQL spends time and resources satisfying the ACID properties, HBase and Cassandra focus on consistency and partition tolerance in terms of the CAP theorem, which is less expensive in time and resources. Another aspect that may contribute to reducing the loading time for HBase is the use of HDFS, given that we store the CSV file in HDFS with the objective of reducing I/O during the loading process. It is important to highlight that the Cassandra loading process based on the BulkLoader resulted in the worst overall loading time: the bulk-load step itself is faster than any other method tested, but the preceding step that builds the SSTables is very slow. This can be explained by the inefficiency of the Java application that transforms the CSV file into SSTables.
Table 1. Loading data results
Database | Loading time
PostgreSQL | 30.00
Pentaho-HBase | 5.18
Query Data Results
A brief comparison of some query processing times is shown in Table 2 for three different queries, as well as in Figure 8 for one complex query (C). The query response time tests also showed better results for the NoSQL databases, with performance gains (efficiency) compared to the relational solution, seemingly because they are column-oriented databases. However, it is not clear which NoSQL solution performs better, as each one’s performance depends on the type of query executed: while showing almost the same result for item A, their results are inverted for items B and C.
Table 2. Querying data results
These results are the consequence of several factors. First, the HBase and Cassandra models use the key to retrieve an entire row (value), so the way the key is built impacts query speed. Secondly, the HBase table is ordered by key, so if the key is built from the most frequently queried fields, query responses will be faster. This happens because rows are stored in key order, which makes it very likely that adjacent rows reside on the same HDFS DataNode. As a consequence, the more frequently queried fields the key contains, the faster the query will be, unless the number of fields becomes too large, which may compromise query performance. Thirdly, one reason for Cassandra’s performance is its replication strategy; besides this, data is partitioned using consistent hashing, so each data item is assigned to a node in the cluster. Finally, the HBase columns are lexicographically ordered, which allows “scan” operations over a specific range to be very fast without the need for secondary indexes. By contrast, the Cassandra database uses a secondary index on each column in the column family.
HBase works differently from a relational database, since it is a column-oriented database in which the columns are lexicographically ordered. This allows scan operations within a specific range to run in a more direct and rapid way, without the need for secondary indexes. On the other hand, Cassandra is not a relational database, but it has some characteristics typical of relational databases, such as the CQL language for querying and for creating tables and secondary indexes.
Additionally, it is worth pointing out that the NoSQL database results can be improved by tuning the data store, for instance the table scan in HBase. Currently the scan is done sequentially in the HBase database; this operation can instead be done in parallel on every DataNode, distributing the data processing across the DataNodes and thus improving the table scan. Furthermore, it is worth considering that HBase scalability may help manage the data growth expected in the coming years.
FUTURE RESEARCH DIRECTIONS
There are still many research possibilities to continue this study. This chapter focused mainly on two types of databases: relational databases and column-oriented databases. Nonetheless, a large number of NoSQL databases exist, each with different architectures and characteristics. Further study of other databases would also allow the identification of new design patterns specific to each database. In this context, as future work we intend to test loading data in a larger cluster that includes desktop computers with less processing and storage capacity. A comparison with other NoSQL technologies can also be done, contrasting the performance of column-oriented databases with other data stores such as Riak and MongoDB. Another interesting future work will be to test NoSQL databases not only on querying the audit trail concerning the incompatibility of rubrics but also on other audit trails, using other NoSQL models.
Moreover, it would also be useful to study the capabilities of combining Cassandra with a Hadoop cluster, since Hadoop MapReduce jobs can be used to perform complex computations on the data stored in Cassandra; Cassandra may hold advantages here, since HBase already relies on HDFS as its file system. This case study considered a dataset from the year 2012 as an example, but the approach could be applied to a larger data volume (data from 2011, 2012 and the current year).
CONCLUSION
In this work, a case study was carried out to analyze Brazilian federal payroll data using NoSQL databases (HBase and Cassandra) and a relational database (PostgreSQL). In order to compare the performance of these technologies, the load and query times were analyzed. The resulting comparison may be relevant when deciding which database to use. In this context, we show how performance is improved by using Big Data technologies. We used the combination of HBase-Pentaho and Cassandra-JDBC-Hector to test the load and query processes. The main query analyzed was the incompatibility of rubrics in payrolls. The NoSQL technologies showed better performance in loading and querying data compared to the relational approach.
It seems undeniable that Big Data will gain importance in many fields. Hadoop technology offers a powerful means of distributing storage and computation among commodity servers; combined with NoSQL databases, it is able to manage Big Data processing. In this context, this work examines a relatively big database (the SIAPE database), which is currently facing storage, processing and query problems. These problems may put the current system at risk in both the short and the long term. Hence, this work aims to provide valuable experience in selecting a NoSQL database and using it efficiently.
As a final observation, it is important to note that NoSQL modeling is query-driven, which has limits when compared to relational modeling, which is data-driven. Relational models are designed to answer a variety of queries and not only specific ones, as NoSQL models are. Moreover, we observed that good data modeling helps to achieve better query latency.
This work was previously published in Artificial Intelligence Technologies and the Evolution of Web 3.0 edited by Tomayess Issa and Pedro
Isaías, pages 230-247, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
The authors thank the Brazilian Ministry of Planning, Budget and Management for its support to this work, as well as the Brazilian Innovation
Agency FINEP (Grant RENASIC/PROTO 01.12.0555.00).
REFERENCES
Andrew, C., Huy, L., & Aditya, P. (2012). Big data. XRDS The ACM Magazine for Students , 19(1), 7–8. doi:10.1145/2331042.2331045
Campos, S. R., Fernandes, A. A., de Sousa, R. T., Jr., de Freitas, E. P., da Costa, J. P. C. L., Serrano, A. M. R., … Rodrigues, C. T. (2012). Ontologic audit trails mapping for detection of irregularities in payrolls. In Proceedings of the Fourth International Conference on Computational Aspects of Social Networks (CASoN). São Carlos, Brazil: IEEE Press.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., & Gruber, R. E. (2008). Bigtable: A distributed storage system for
structured data. ACM Transactions on Computer Systems , 26(2), 1–26. doi:10.1145/1365815.1365816
Collett, S. (2011). Why Big data is a big deal. Computerworld ,45(20), 1–6. Retrieved from
http://www.computerworld.com/article/2550147/business-intelligence/why-big-data-is-a-big-deal.html
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM , 51(1), 107–113.
doi:10.1145/1327452.1327492
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., & Vogels, W. (2007). Dynamo: Amazon's highly available
key-value store. ACM SIGOPS Operating Systems Review ACM , 41(6), 205–220. doi:10.1145/1323293.1294281
Dumbill, E. (2012). Planning for big data . Sebastopol, CA: O'Reilly Media, Inc.
Franke, C., Morin, S., Chebotko, A., Abraham, J., & Brazier, P. (2013). Efficient processing of semantic web queries in HBase and mySQL
cluster. IT Professional , 15(3), 36–43. doi:10.1109/MITP.2012.42
Huacarpuma, R. C., Rodrigues, D. D. C., Serrano, A. M. R., da Costa, J. P. C. L., de Sousa, R. T., Jr., Holanda, M. T., & Araujo, A. P. F. (2013). Big data: A case study on data from the Brazilian ministry of planning, budgeting and management. In Proceedings of the IADIS International Conference Applied Computing 2013. Porto, Portugal: IADIS Press.
Jun, T., Kai, C., Yu, F., & Gang, T. (2009). The research & application of ETL tool in business intelligence project. In Proceedings of the Information Technology and Applications (IFITA'09). Chengdu, China: IEEE Press. 10.1109/IFITA.2009.48
Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., & Lewin, D. (1997). Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. ACM Press. 10.1145/258533.258660
Lakshman, A., & Malik, P. (2009). Cassandra: Structured storage system on a P2P network. In Proceedings of the 28th ACM Symposium on
Principles of Distributed Computing (PODC '09). New York, NY: ACM Press.
Marz, N., & Warren, J. (2013). Big data: Principles and best practices of scalable realtime data systems . Manning Publications.
Office of Science and Technology Policy of the United States. (2012). Fact sheet: Big data across the federal government. Retrieved September,
2014, from www.WhiteHouse.gov/OSTP
Padhy, R. P., Patra, M. R., & Satapathy, S. C. (2011). RDBMS to NoSQL: Reviewing some next-generation non-relational databases. International
Journal of Advanced Engineering Science and Technologies , 11(1), 15–30.
Pulvirenti, A. S., & Roldan, M. C. (2011). Pentaho data integration 4 cookbook . Olton Birmingham, UK: Packt Publishing Ltd.
Serrano, A. M. R., Rodrigues, P. H. B., Huacarpuma, R. C., da Costa, J. P. C. L., de Freitas, E. P., Assis, V. L., . . . Pilon, B. H. A. (2014). Improved
business intelligence solution with reimbursement tracking system for the Brazilian ministry of planning, budget and management.
In Proceedings of the 6th International Conference on Knowledge Management and Information Sharing (KMIS'14). Rome, Italy:
SCITEPRESS. 10.5220/0005169104340440
Vora, M. N. (2011). Hadoop-HBase for large-scale data. In Proceedings of the International Conference on Computer Science and Network Technology (ICCSNT). Harbin, China: IEEE Press.
KEY TERMS AND DEFINITIONS
Cassandra noSQL: Cassandra noSQL database presents high scalability regarding data volume as well as concurrent users, high availability
under failure scenarios, and high performance, mainly for writing operations.
Column Family noSQL: noSQL database which supports the concept of tables with rows and columns.
Hadoop: Hadoop project aims to develop open source software for distributed, scalable and reliable computing.
HBase noSQL: It is an Apache open source project and aims to provide a storage system. HBase database implements the column-oriented
model.
HDFS: Hadoop Distributed file system, which makes data available to multiple computing nodes.
MapReduce: MapReduce is a programming model where the application is divided into many small tasks, each of which may be executed or re-
executed at any node in the cluster.
noSQL Database: Distributed database systems built to meet the demands of high scalability and fault tolerance in the management and
analysis of massive amounts of data, an alternative to traditional Relational Database Management Systems (RDBMS).
CHAPTER 51
Analysis of Genomic Data in a Cloud Computing Environment
Philip Groth
Bayer Pharma AG, Germany
Gerhard Reuter
Bayer Business Services GmbH, Germany
Sebastian Thieme
Humboldt University of Berlin, Germany
ABSTRACT
A new trend for data analysis in the life sciences is Cloud computing, enabling the analysis of large datasets in a short time. This chapter introduces the Big Data challenges of the genomic era and shows how Cloud computing can be one feasible approach for solving them. Technical and security issues are discussed, and a case study is presented in which Clouds are successfully applied to resolve computational bottlenecks in the analysis of genomic data. An intentional outcome of this chapter is to show that Cloud computing is not essential for analyzing Big Data. Rather, it is argued that, for the optimized utilization of IT, the best architecture must be chosen for each use case, whether driven by security requirements, financial goals, optimized runtime through parallelization, or the ability to collaborate and share data more easily with business partners on shared resources.
INTRODUCTION
Big Data Challenges
In 2009, the total global amount of stored data was estimated to have reached 800 Exabytes (EB) (Association, 2010) and increased by approximately 13 EB throughout the following year (Agrawal et al., 2012). It was recently estimated that the amount of new data generated in 2013 alone reached 900 EB, implying that the vast majority of data stored today have been generated in just the past two years (IBM). Of this, the global amount of healthcare data was estimated to have exceeded 150 EB in 2011 (IBM). There is a simple explanation for this strong increase: data nowadays are generated anywhere and anytime in a mainly automated manner, and storing them is relatively cheap (e.g. commercial data storage is offered for less than USD 0.01 per Gigabyte per month) (AWS). The notion of ‘Big Data’, describing the phenomenon that large amounts of data are generated within a specific domain or of a specific class, was already described in the mid-nineties of the last century, when the term itself was first coined by John Mashey of Silicon Graphics (SGI), and has since been widely adopted (Lohr, 2013; Mashey, 1998).
Most types of Big Data have many characteristics in common, e.g. a typical life cycle. They are generated, copied and moved, processed and analyzed, versioned, archived and sometimes deleted. Each step brings up systematic issues, all of them involving IT. Handling of Big Data starts with the process of generation. To generate data and keep them usable, it is important to document their existence: by whom, how, when and under what circumstances were they created? The lack of such attributes will reduce or even destroy the usability of the data. Already, this is not trivial, as such annotations should be stored in a searchable manner. A popular tool to handle Big Data, especially for data annotation, tracing and versioning, is ‘iRODS’ (www.irods.org), which is employed, for example, by CERN (for data from high-energy physics experiments) and the Wellcome Trust Sanger Institute (WTSI) (for DNA sequencing data).
Assuming the data have been adequately annotated, they are oftentimes placed within the Internet for immediate global availability, creating the
challenge to interested users of acquiring a local copy. Classical transfer methods based on FTP or HTTP were not designed to transfer large files
(i.e. in the range of gigabytes or more). Tools like AsperaTM (using a proprietary protocol named FASP) or torrent-seeding based methods
(cghub.ucsc.edu) remedy this issue to some extent but are not yet widely spread.
Finally, due to their size, Big Data are quite often stored on distributed architectures, which brings up another issue. Software meant to process Big Data must take into account partial failures of the underlying hardware (e.g. a single disk error) and communication latencies. Many popular software products are in the process of being redesigned to adapt to this change. To speed up the analysis of Big Data, they are processed in a parallel manner in such distributed environments. This can be done with tools like HadoopTM (Apache, 2012), which uses its own distributed file system to spread data across the network and a so-called ‘master-slave approach’ to assign sub-tasks to interlinked compute nodes. HadoopTM is a framework developed by Apache and used, amongst others, by FacebookTM and YahooTM.
Genomic Data is Big Data
With the decoding of the human genome and the associated substantial progress in the development of laboratory and bioinformatics methods
(Chen, Wang, & Shi, 2011; Kearney & Horsley, 2005; Wang, Gerstein, & Snyder, 2009) the data of known biological interrelationships has also
increased dramatically. Such data comprise, for example, the complete information on an individual’s genome, such as nucleotide variations,
chromosomal aberrations or other structural changes within the genome, more generically known as mutations. The smallest mutation within a
genome is the exchange of a single nucleotide within the DNA, the so-called ‘building block of life’ (see (Alberts et al., 2007; Strachan & Read,
2005) for more information). If such a mutation is shared by at least 1% of a defined population and not disease-causing per se it is called ‘single
nucleotide polymorphism’ (SNP) (Barreiro, Laval, Quach, Patin, & Quintana-Murci, 2008; Risch, 2000). In 2005, it was estimated that there are
approximately 10 million SNPs to account for variation in the human population (Botstein & Risch, 2003). But data from the 1,000 Genomes
Project (Abecasis et al., 2010) have revealed that there are many more SNPs within the human genome. By 2011, more than 40 million SNPs had
been identified (Eberle et al., 2011). Each SNP specifies a genotype, describing differences within the genomic sequence between individuals (de
Paula Careta & Paneto, 2012). This enormous amount of variability on an individual level has created a major challenge towards data analysis
and organization in order to generate new knowledge of scientific or commercial value.
Due to this increasing amount of available data the term ‘Big Data’ is now also commonly applied in the genomics field. In this context, Big Data
refers to large sets of genomic data from patients, originating from different sources and having been generated by a plethora of high-throughput
technologies, the most prominent of which are microarrays and sequencing. One of the most widely adopted uses of microarray technology today is the simultaneous measurement of the abundance of mRNA, representing the expression of genes in a sample (see (Brown et al., 2000; Kerr, Martin, & Churchill, 2000; Lashkari et al., 1997; Müller & Röder, 2004) for more details on the use of microarrays in gene expression analysis).
Another application is the DNA microarray, an example being the array-based comparative genome hybridization (aCGH) (Pinkel et al., 1998).
The aCGH is a type of microarray used to detect genomic aberrations like copy number variations (CNVs) (Pinkel et al., 1998), denoting a change
in the copy number (CN) of a genomic segment caused by evolutionary events like deletions, amplifications, or translocations of chromosomal
segments (Graux et al., 2004; Greenman et al., 2010; Theisen, 2008).
SNPs can be exploited for genotyping, as well as calculating CNVs by way of applying the Affymetrix Genome-Wide Human SNP Array 6.0
technology (SNP6 array) (McCarroll et al., 2008). This is a high-density microarray, containing as probes 900,000 known SNP loci in the Human
Genome. In addition, 950,000 non-polymorphic CN probes of different length (up to 1,000 bp) are fixed on the chip. Due to the high resolution
of this array, it is suitable to create a more detailed picture of the complex karyotypes occurring for example in cancer (Edgren et al., 2011).
The rise of DNA sequencing (DNA-seq) began with the idea of interrupting DNA synthesis by introducing a so-called dideoxy nucleotide (ddNTP) at a specific nucleotide. This led to the term ‘chain-termination method’ and is the basic idea of Sanger sequencing (Sanger & Coulson, 1975). Sanger sequencing played a key role in the description of the Human Genome and therefore in the development of the sequencing methods we know today. But with an output of 400,000 bases per machine per day, Sanger sequencing is limited in throughput (Liu et al., 2012; Wang et al., 2009).
To overcome this limitation, a new generation of DNA sequencing methods was developed, enabling a massive parallelization of sequencing and
giving rise to the era of ‘next-generation sequencing’ (NGS). The first methods were based on Sanger sequencing technology and therefore
expensive and not precise enough for unambiguous mapping of sequences or distinguishing between isoforms (Wang et al., 2009) (further
reading (Brenner et al., 2000; Reinartz et al., 2002; Velculescu, Zhang, Vogelstein, & Kinzler, 1995)). Still, their development initiated the
development of further high-throughput sequencing methods, e.g. by Illumina (illumina), Roche NimbleGen (NimbleGen), Complete Genomics
(Genomics), Applied Biosystems (LTC-AB) and many others. Besides DNA, RNA can also be sequenced. Sequencing of RNA has been made
possible by adapting DNA-seq methods. The RNA-sequencing (RNA-seq) method enables quantification of the total transcript of a cell
(transcriptome) under specific conditions or at a specific stage of development (Chen et al., 2011; Kearney & Horsley, 2005; Wang et al., 2009).
The advancement of methods continues, e.g. with nanopore sequencing (Clarke et al., 2009; Kasianowicz, Brandin, Branton, & Deamer, 1996),
where only a pore with a diameter of around 1 nm is used to sequence DNA. The idea is that a voltage is applied across the nanopore and the flux
of the ions through the nanopore is measured. Moving a DNA molecule through this nanopore changes the ion flux for each nucleotide in a
characteristic manner. Tracking the changes in the flux for an entire DNA strand enables determination of the order of the respective nucleotides
in the sequence (Clarke et al., 2009). This latest advancement shows that novel technologies and methods for sequencing and microarray
techniques improving speed, precision and resolution emerge constantly, with the additional result that the amount of output data is increased
further. This is also the starting point for large sequencing projects like the 1,000 Genomes Project, which uses Illumina sequencers. With this
project, yet another consortium adds Big Data to the public domain, having increased its sequencing data repositories from 50 Gigabyte (GB) in
2005 to 2.5 TB in 2008. In March 2013, a repository size of 464 TB was reached (Community). The advent of new technologies with the
capability to produce even higher data density (see Figure 1) and further, even bigger sequencing projects is near. For example, the ‘100,000 Genomes Project’ run by Genomics England was funded by the UK government in 2013 with the equivalent of approximately 150 million USD in order to sequence 100,000 patients with rare diseases using whole-genome sequencing (WGS). The size of a typical output file of raw sequencing reads (a so-called BAM file) is around 300 to
500 GB per sample, initiating yet another discussion of efficient compression and storage, which is beyond the scope of this chapter.
Figure 1. Improvements in the rate of DNA sequencing over the past 30 years and into the future. From slab gels to capillary sequencing and second-generation sequencing technologies, there has been a more than million-fold improvement in the rate of sequence generation over this time scale. (Reproduced from M. Stratton (Stratton, Campbell, & Futreal, 2009))
The raw output of sequencers is, of course, data to be processed by computational pipelines in order to extract biologically meaningful outcomes, e.g. mapping the resulting sequences to the correct position in the genome or analyzing the mapped genomic sequences for aberrations against a
reference genome. Such sequencing analysis pipelines (see Figure 2 for an example) typically comprise a number of computationally intensive
steps to transform raw data from a sequencer into human-readable data. Similarly, data from microarray experiments are processed in
comparable pipelines, e.g. to compare gene expression differences between a diseased tissue sample and a healthy reference in a patient
population or to predict disease-causing chromosomal aberrations, such as fusion genes. All of these pipelines have in common their highly
modular (i.e. several algorithms work in sequence and data get passed from one step to the next in different formats) and scientific (i.e. each
algorithm is published and will either be improved by a community effort or surpassed by another one) nature. Finally, these pipelines almost
always are meant to process data of many individuals.
Figure 2. This typical WES pipeline consists of 11 steps. For each input file, a quality control is performed (1). FASTQ files are aligned to the reference genome (2) and converted into standard text files containing sequence and alignment data (3). Text files are converted to BAM files (4), sorted (5) and indexed (6). In the variant processing step, each nucleotide position is considered in terms of frequency and quality (7). In the variant calling step, the actual somatic/germline mutation calling is done; this step requires pileup files from both cancer and healthy samples, generating VCF files (8). In the variant filtering step, false positive indels are filtered with the help of UCSC repeat masker information, and substitutions are filtered with regard to nearly ‘non-properly-alignable’ genes such as olfactory receptor genes (9). Processing of VCF files through the Ensembl Variant Effect Predictor generates rich variant effect annotation (10). The resulting files can be uploaded for further analysis (11).
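As a hedged illustration of how the early steps of such a pipeline are chained, the sketch below drives the standard command-line tools bwa and samtools from Python; the reference path and sample file names are placeholders, and real pipelines add the quality control, variant calling, filtering and annotation steps listed in the figure.

import subprocess

REFERENCE = 'reference/hg19.fa'   # hypothetical reference genome path
SAMPLE = 'sample01'               # hypothetical sample prefix

def run(cmd, stdout=None):
    print('running:', ' '.join(cmd))
    subprocess.run(cmd, stdout=stdout, check=True)

# (2) align paired-end reads against the reference genome (SAM to stdout)
with open(f'{SAMPLE}.sam', 'w') as sam_out:
    run(['bwa', 'mem', REFERENCE,
         f'{SAMPLE}_R1.fastq.gz', f'{SAMPLE}_R2.fastq.gz'], stdout=sam_out)

# (4) convert the alignment to BAM, (5) sort it and (6) index it
run(['samtools', 'view', '-b', '-o', f'{SAMPLE}.bam', f'{SAMPLE}.sam'])
run(['samtools', 'sort', '-o', f'{SAMPLE}.sorted.bam', f'{SAMPLE}.bam'])
run(['samtools', 'index', f'{SAMPLE}.sorted.bam'])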
Genomic data files are too large in terms of size and complexity to process them with ‘traditional’ applications, such as word processing or text
editors. Therefore, new methods and technologies to share, store, search through, analyze and visualize these kinds of data are needed. This
obvious need for analyzing Big Data in large and complex algorithmic pipelines thus poses a challenge to the IT infrastructure. Even multi-core high-performance compute clusters and in-memory technologies are challenged by the fact that many pipelines still take several thousand CPU hours per sample to complete. At the same time, such high-performance infrastructures oftentimes are not in full use at all
times. This can for example be attributed to the sequencing process, where one ‘run’ takes several days or weeks to complete by which time a
large amount of data is ready for further processing. It is therefore feasible to take a look at alternative architectures, in which an infrastructure
that is challenged by peak loads can be supplemented by offsite computing power, e.g. as offered by Cloud computing service providers.
BACKGROUND
Calculating Genotypes and CNVs
Structural or quantitative changes within the genome are so-called genomic aberrations; specifically, if these changes occur within a
chromosome, they are called chromosomal aberrations. These changes can e.g. be detected by examining CNVs across the genome (Johansson,
Mertens, & Mitelman, 1996). Such kinds of genomic aberrations also occur frequently during cancer development and are often unbalanced,
meaning that there is a quantitative change in genetic material caused by a mutation event, e.g. a deletion or amplification (Johansson et al.,
1996). If the total amount of genetic material does not change during a mutation event, the mutation is considered balanced (Mitelman,
Johansson, & Mertens, 2007). The two genotypes for a given SNP position, which are represented by the two alleles A and B can be measured e.g.
by the representative probes on a SNP6 array consisting of six technical replicates, three replicates for allele A and three replicates for allele B.
The fluorophore-signal emitted from the probe-target hybridization describing one SNP allele is directly proportional to the allelic CN and
sensitive enough to show differences within the allelic CN. The resulting intensity values for each allele over a set of normal samples can be
clustered according to the three wild-type genotype classes AA, AB, and BB (Greenman et al., 2010).
The amount of DNA within different cells is typically very similar, especially across healthy tissues where we expect a diploid genome (Greenman
et al., 2010). In contrast to normal cells, cancer cells exhibit aneuploidy (i.e. a triploid or quadraploid genome) much more often than their
healthy counterparts (Rajagopalan & Lengauer, 2004).
In contrast to other tools detecting CNVs, the tool 'Predicting Integral Copy Number In Cancer' (PICNIC) (Greenman et al., 2010) calculates the
copy number (CN) from SNP6 arrays to overcome the challenges deriving from aneuploidy in cancer samples. PICNIC consists of two main steps.
In the first step the entire data set is corrected across all arrays of the experiment and then the probe intensity values are normalized by using
Bayesian statistics. In the second step, the data will be segmented and genotyped to identify the most likely copy number segmentation by using a
Hidden-Markov Model (Greenman et al., 2010). The transition region of segments with CNVs is defined as a change in the CN of two neighboring
genomic segments within a chromosome. It is often referred to as a copy number breakpoint or in this context simply as a breakpoint (Ritz, Paris,
Ittmann, Collins, & Raphael, 2011).
Copy Number Variation and Fusion Genes
Genomic aberrations can be more complex than CNVs. Consider, as an example of this complexity, two genes at different breakpoints, originally coding for two different proteins, which can merge into a fusion gene or chimeric gene. The resulting merged sequence contains the entire or partial information of both genes, which may lead to a protein with novel functions (Long, 2000). A fusion changes the physical genomic position of one gene, which therefore loses its regulatory elements; this might change the expression of the involved gene (Huang, Shah, & Li, 2012). The fusion gene TMPRSS2-ERG is a well-described prostate cancer marker gene, deriving from a 3 million bp deletion. Due to the fusion event, the ERG gene is regulated by the TMPRSS2 promoter, leading to overexpression of ERG (Tomlins et al., 2005; Yu et al., 2010). However, most of the occurring fusion genes are non-functional (loss of function of one or both genes), but may still have an influence on cell behavior.
The detection of such fusion genes has been identified as a worthwhile effort, especially in cancer genomics approaches (Thieme & Groth, 2013).
Fusion gene prediction is a computationally intensive effort requiring the knowledge of breakpoint positions throughout the genome. Ritz et al.
(2011) (Ritz et al., 2011) developed the algorithm ‘Neighborhood Breakpoint Conservation’ (NBC), which uses the CNV information from aCGH
(Suzuki, Tenjin, Shibuya, & Tanaka, 1997) data to calculate common breakpoints or pairs of common breakpoints in a given sample on the basis
of Bayesian statistics. The NBC combines the calculated probabilities of pairs of common breakpoints to detect fusion genes with positional
variability (Figure 3). The advantage of predicting fusion genes on the basis of breakpoints calculated by this method is the ability to find both
functional and silent fusion genes.
Figure 3. The workflow of ‘Neighborhood Breakpoint Conservation’ (NBC) consists of three steps. First, all possible copy number profiles are computed, from which the breakpoints can be derived. Next, the probability of a breakpoint between two probes is calculated, for which all possible copy number profiles of a single individual are considered. In the last step, the calculated breakpoint probabilities of each individual are combined to find recurrent breakpoints over all individuals. The outputs are common breakpoints or pairs of common breakpoints over different samples. (Reproduced from Ritz et al. (Ritz et al., 2011))
We have recently published a related method, the Genomic Fusion Detection (GFD) algorithm (Thieme & Groth, 2013). Our algorithm consists of three main steps and one preprocessing step, shown in Figure 4. In the preprocessing step, the SNP6 data are processed with PICNIC to obtain the copy number segmentation of the data. The algorithm takes as input the segment information calculated by PICNIC (Figure 4A), and this segment information is used to predict possible fusion genes. The input data are processed in three steps. In the first step, breakpoints are determined within the predicted segmentations, artifacts are deleted and genes that are close to a breakpoint are identified (Figure 4B). Secondly, gene pairs fulfilling certain required constraints (Figure 4C) are detected. In the last step, each result for a sample is compared to the results of all processed samples to find possible common fusion events and reduce false-positive predictions (Figure 4D). The GFD algorithm is based on the ideas of Ritz et al. (Ritz et al., 2011), extending them, however, by several key features.
Figure 4. (A) Preprocessing step: apply PICNIC for normalization and segmentation of the data. (B) Determine breakpoints within the predicted segmentation, delete artefacts and find genes that are close to a breakpoint. (C) Find gene combinations that fulfil certain constraints. (D) Compare the results of all processed samples to find common fusion events.
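A minimal sketch of the breakpoint-detection idea behind step (B) is given below; the segment and gene tuples are simplified stand-ins for PICNIC output and a gene annotation (with approximate hg19 coordinates for the TMPRSS2-ERG example), the 100 kb window is an arbitrary choice, and the code is not the published GFD implementation.

# Each segment: (chromosome, start, end, copy_number); each gene: (name, chromosome, start, end).

def find_breakpoints(segments):
    """Return positions where the copy number changes between adjacent segments."""
    by_chrom = {}
    for chrom, start, end, cn in sorted(segments):
        by_chrom.setdefault(chrom, []).append((start, end, cn))
    breakpoints = []
    for chrom, segs in by_chrom.items():
        for (s1, e1, cn1), (_s2, _e2, cn2) in zip(segs, segs[1:]):
            if cn1 != cn2:               # a CN transition marks a breakpoint
                breakpoints.append((chrom, e1))
    return breakpoints

def genes_near_breakpoints(breakpoints, genes, window=100_000):
    """Collect candidate fusion partners lying within `window` bp of a breakpoint."""
    hits = []
    for chrom, pos in breakpoints:
        for name, g_chrom, g_start, g_end in genes:
            if g_chrom == chrom and g_start - window <= pos <= g_end + window:
                hits.append((name, chrom, pos))
    return hits

# Toy example: a deletion on chr21 between ERG and TMPRSS2 (approximate coordinates).
segments = [('chr21', 1, 39_800_000, 2), ('chr21', 39_800_001, 42_900_000, 1),
            ('chr21', 42_900_001, 46_700_000, 2)]
genes = [('TMPRSS2', 'chr21', 42_836_000, 42_903_000),
         ('ERG', 'chr21', 39_751_000, 40_033_000)]
print(genes_near_breakpoints(find_breakpoints(segments), genes))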
In the early 1980s, Thomas and Christoph Cremer have observed and validated the existence of so-called ‘chromosome territories’ in healthy
mammalian cells, i.e. that chromosomes seem to have a tissue-specific location within the nucleus. Roix et al. (Roix, McQueen, Munson, Parada,
& Misteli, 2003) have shown that translocation-prone genes are located in close spatial proximity: The fusion of the MYC gene (located on
chromosome 8) with the IGH gene (located on chromosome 14) causing Burkitt Lymphoma becomes clearer, considering that these two
chromosomes are direct neighbors in the nucleus in B-cells of the human immune system. Thus, there seems to be a correlation between the
probability for inter-chromosomal fusion genes and the distance of the involved partners. Similar observations have been reported by others
(Meaburn, Misteli, & Soutoglou, 2007). With the Hi-C method (van Berkum et al., 2010), an extension of the 3C method (Chromosome Conformation Capture), the proximities between chromosomes can be measured for each cell type. We propose to enhance the
GFD algorithm by taking into account that chromosomes are not randomly positioned in the nucleus. Adding this insight to our algorithm will
improve the prediction probability of fusion genes.
MAIN FOCUS OF THE CHAPTER
Cloud Computing
Attempting to find an existing definition of Cloud computing will return many results of varying usefulness, some of which are summarized in Figure 5. The overlap between all these answers is too small to be useful as a universal definition.
Figure 5. Variety of definitions for cloud computing
Therefore, we have compiled some success stories (i.e. cases where the use of Cloud computing has added value to the task at hand) in order to extract some common principles of Cloud computing. Although these will not lead to any better or more universal definition of Cloud computing, they may help create an understanding of some of the emerging central ideas (by way of simplification, we will refer to ‘the Cloud’ as a synonym for ‘Cloud computing’ of any type in this chapter):
• The New York TimesTM used 100 Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances to translate 11 million articles into
pdf-files within 24 hours to make them available in their online-archive (Gottfrid, 2007).
• DreamWorksTM Studios bought about 20% of the CPU hours that were needed to calculate animations in the movie Kung Fu Panda 2 from
AWS (Wittmann, 2012).
Such Cloud-based models are known as ‘Infrastructure as a Service’ (IaaS), whereas the next two examples shall illustrate another service model
within the Cloud, which is called ‘Software as a service’ (SaaS):
• Hundreds of millions of internet users share their files by using portals like YouTubeTM or DropboxTM. The users’ files become available globally by transfer to the Cloud.
• Instead of installing and updating software and data locally, Cloud products like Google EarthTM, Google MapsTM or Microsoft OfficeTM offer a single point of entry to the latest software and data versions for all users, regardless of their location or hardware.
These examples clearly illustrate two of the main benefits of the Cloud for genomic data:
1. If there is a need for a large amount of computational power on short notice or for a short period of time, utilizing the Cloud is a good idea.
The alternative would be to invest in a compute infrastructure, which might take a lot of time and effort to set up and may no longer be needed once the actual one-time computation has finished. In many business cases, processing genomic data is a spontaneous and short-term task.
2. Cloud computing offers a simple way to make software and data available anywhere, anytime on nearly any device. Collaboration on
genomic data among different companies becomes easier than storing data in an internal corporate network.
In addition, Cloud providers offer further service models besides SaaS and IaaS, i.e. customers can buy a platform (PaaS) or a network (NaaS) as a service. Taking into account that any mixture of these services is also possible, it becomes obvious that the Cloud can represent nearly any business case, and why finding a universal definition is not trivial. In summary, the Cloud has versatile characteristics but a very simple basic idea: it is a one-stop shop for globally available computational resources of any kind (hardware, network, software, storage, etc.). From the customer’s perspective this creates some interesting opportunities:
• Instead of adding new hardware for a short term project, it may be rented from the Cloud. The Cloud provider takes care of basic services
like secure data centers, buildings, physical access-control, power supply, cooling, backup locations, etc. This is especially helpful for small
and medium enterprises (SMEs) or startup companies, which can avoid the hurdle of building up an internal IT at enormous upfront costs
by utilizing Cloud services.
• Instead of buying software and licenses, anyone can rent exactly what is currently needed and will only pay for the actual use.
• Data stored in the Cloud can be made available globally as well as to a limited, exactly defined group of users.
Reflecting on these principles in more detail, Table 1 summarizes some of the pros and cons of Clouds, which are worth considering before utilizing the Cloud as a computational resource.
Table 1. Compilation of pros and cons while utilizing the cloud for genomic data
Pros Cons
After the initial decision has been made to use the Cloud for some purpose, finding a suitable Cloud provider is an immediate next step. The
checklist below is intended to support this decision, by highlighting some of the aspects that are most likely of importance:
• Easy to get started: Creating an account and providing a means of payment should suffice
• Wizard-based allocation and management of resources, such that no advanced knowledge of programming or technologies are required
• Integrated security features (e. g. access lists), which are documented in detail, easy to understand and to maintain
• Variety of pre-configured operating systems, application software and public data sets of general interest
• Global coverage (Laws may force the customer to keep data in a dedicated region)
• Tiered storage architecture with fast access storage and inexpensive long term archiving
Another very helpful tool that can be utilized to support the decision for a specific Cloud computing provider is, for example, the ‘Magic Quadrant’ analysis conducted annually by an IT consulting company for infrastructure-as-a-service providers (Leong, Toombs, Gill, Petri, & Haynes, 2013). In their 2012 survey of the top 15 providers of IaaS, they focus on the two evaluation criteria ‘ability to execute’ and ‘completeness of vision’, distinguishing between four types of technology providers: Leaders (strong vision and strong position to execute in their market), Visionaries (strong vision, weaker execution), Niche Players (focus on a segment, or are unfocussed and outperformed), and Challengers (execute well or dominate a market segment, but do not have a clear vision) (Cohen, 2012). According to their analysis, by far the most advanced, and almost the only, IaaS ‘Leader’ is Amazon Web Services (AWS). As for other large companies commonly associated with advanced IT technologies, Microsoft, for example, is seen as a strong ‘Visionary’, whereas IBM and HP are rather perceived as ‘Niche Players’ in that market. AWS services are recommended, among others, for cloud-native applications, batch computing, e-business hosting, general business applications, and test and development. But probably the strongest reason why AWS is such an advanced market share leader is this: AWS alone has more than five times the compute capacity in use compared to the aggregate total of the other fourteen providers (as of March 2014)!
The final step in moving into the Cloud involves choosing a suitable connection model, that is, to employ a private Cloud as described above, a public Cloud without the detour through a VPN, or a ‘Hybrid Cloud’, i.e. a combination of private and public Clouds. This decision depends on where the clients are located. If the clients consist only of internal users and no access from third parties (outside of the company) is needed, a private Cloud offers the best look-and-feel to the users, because the Cloud can be set up identically to internal servers. On the other hand, this requires integrating these servers transparently into the company’s common management tools (e.g. host name resolution, security scanning, patching, monitoring). In contrast, a public Cloud would probably offer more benefits and flexibility for a shared approach in which both internal and external users are involved.
Whatever the final choice may be, a staged solution is proposed here as a useful setup (Figure 6). For utilizing a public Cloud it is recommended
to install a single entry server as a stepping stone into a larger cluster of Cloud instances. For more convenience, this might be a server with a graphical user interface (GUI) and pre-installed software that can be used to connect to the other Cloud servers (e.g. via PuTTY). The main advantage of a single point-of-entry server is that it reduces the number of public IP addresses needed for the Cloud servers, as these are scarce and expensive. Furthermore, security is increased because such a single connection component can be easily monitored. Figure 6 shows a design proposal for handling Big Data within such a setup. The Windows™-based entry server has a public IP address allowing access from the Internet, while all other Cloud instances use a private IP addressing scheme. Communication from the Internet is restricted to the RDP protocol, which has built-in encryption but is additionally wrapped in a VPN tunnel to enable secure data transfer. Secure access lists only allow access from
dedicated sources, so that the Cloud itself is protected against any access from untrusted locations. The Network File System (NFS) server is the
central repository for storing data and any number of compute instances (here: ‘R-servers’) can be started or stopped on demand using APIs
provided by the Cloud provider or custom-made programs. All internal communication is done by a secure shell (SSH) connection fitted with a
private access key to ensure encryption. Additionally, some Cloud providers offer low-cost storage which can be an attractive extension of this
model to store large amounts of data for an extended period of time.
Figure 6. Private cloud design for genomic data
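To make the access restrictions of this design more tangible, a minimal sketch in R is given below, calling the AWS command line interface via system(). The security-group IDs and the trusted source address are hypothetical placeholders and not part of the original design; an installed and configured AWS CLI is assumed.

# Hedged sketch (R + AWS CLI): restrict access to the staged setup.
# All IDs and the office IP are placeholders (assumptions), not values from this chapter.
office_ip   <- "203.0.113.10/32"   # trusted on-premise source (assumed)
sg_entry    <- "sg-entry000000"    # security group of the entry server (assumed)
sg_internal <- "sg-internal00000"  # security group of NFS server and R-servers (assumed)

# Allow RDP (TCP 3389) to the entry server only from the trusted source
system(paste("aws ec2 authorize-security-group-ingress",
             "--group-id", sg_entry,
             "--protocol tcp --port 3389 --cidr", office_ip))

# Allow SSH (TCP 22) to the internal instances only from the entry server's group
system(paste("aws ec2 authorize-security-group-ingress",
             "--group-id", sg_internal,
             "--protocol tcp --port 22 --source-group", sg_entry))

In this way the single point of entry is the only instance reachable from outside, while the NFS server and the compute nodes accept connections exclusively from within the Cloud setup.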
Regardless of this proposed productive setup, some companies prefer to start exploring the usefulness of Clouds by first moving servers for prototype development or test environments. Another possibility to gain first-hand experience without many risks is to try out new software products (depending on the licensing model, which is not always suitable for Cloud use) or to test a migration scenario to a newer software release. The opportunities in the realm of Cloud computing are growing rapidly. As an example, so-called 'Community Clouds' are now emerging and can be used, e.g., to share infrastructure requirements among organizations with similar needs. The variety of available services is steadily increasing in order to cover as many business cases as possible and to become even more attractive to future customers. At the same time, as possibilities generate demand, new opportunities for Cloud use are arising. One such opportunity is to match the high computational and storage demands of genomic data with the capabilities of the Cloud. In an environment where in-house IT capabilities to deal with genomic data are already available, such a move to the Cloud needs to outperform those capabilities in at least one of the main key performance indicators (KPIs) of IT services (availability, security, cost) and should not perform worse on any of them. An in-house ramp-up towards such capabilities should be evaluated by the same criteria in comparison to Cloud services. Of course, this involves legal considerations, as moving genomic data into the Cloud may not be allowed in some cases. This will be elaborated further in the next section.
Challenges with Genomic Data in the Cloud
Capability of Infrastructure
As elaborated above, genomic data poses great challenges to the IT infrastructure. Processing requires many fast CPUs with large allocated memory. These demands are closely related to the need for complex statistical methods and to the large amount of data to be analyzed. Therefore, it is important that the Cloud provider offers high-end services with a sufficient amount of memory. Certainly, such a service generates costs while it is in use. It therefore makes sense to familiarize oneself with self-made scripts or, in the case of AWS, the provided API for starting and stopping servers according to current need and degree of utilization.
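As an illustration, a minimal R sketch of such a start/stop script is shown below, again calling the AWS command line interface via system(). The instance IDs are hypothetical placeholders; an installed and configured AWS CLI is assumed.

# Hedged sketch (R + AWS CLI): start compute instances only while needed, stop them afterwards.
r_servers <- c("i-0aaa111122223333", "i-0bbb444455556666")  # assumed instance IDs

start_nodes <- function(ids) {
  system(paste("aws ec2 start-instances --instance-ids", paste(ids, collapse = " ")))
}
stop_nodes <- function(ids) {
  system(paste("aws ec2 stop-instances --instance-ids", paste(ids, collapse = " ")))
}

start_nodes(r_servers)   # before submitting a batch of samples
# ... run the analysis ...
stop_nodes(r_servers)    # as soon as the batch has finished

Wrapping such calls into the analysis pipeline ensures that compute capacity, and therefore cost, only accrues while samples are actually being processed.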
Additionally, it is important to have cost-effective storage because of the large size of a typical genomic data set. AWS has a multi-layered concept with fast but expensive disk drives, slower and cheaper S3 storage, and finally Glacier™, an economical solution for long-term storage. Glacier™ is a low-cost 'offline' storage solution, which is feasible for storing files for which a retrieval time of several hours is acceptable. The price for storing data with Glacier™ is approximately 90% less than for the same amount of S3 storage. S3 and Glacier™ can be accessed through third-party tools like CloudBerry Explorer™ (www.cloudberrylab.com), but also by using the native API provided by AWS.
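The following minimal R sketch illustrates this tiered model: results are copied to S3 and a lifecycle rule archives them to Glacier after a waiting period. The bucket name, prefix and the 30-day threshold are hypothetical; an installed and configured AWS CLI is assumed.

# Hedged sketch (R + AWS CLI): upload data to S3 and archive it to Glacier via a lifecycle rule.
bucket <- "my-genomics-archive"   # assumed bucket name

# Upload a results directory to S3
system(paste("aws s3 cp ./results s3://", bucket, "/results --recursive", sep = ""))

# Lifecycle rule: transition objects under 'results/' to Glacier after 30 days (assumed policy)
lifecycle <- '{"Rules":[{"ID":"archive-results","Status":"Enabled",
  "Filter":{"Prefix":"results/"},
  "Transitions":[{"Days":30,"StorageClass":"GLACIER"}]}]}'
writeLines(lifecycle, "lifecycle.json")
system(paste("aws s3api put-bucket-lifecycle-configuration --bucket", bucket,
             "--lifecycle-configuration file://lifecycle.json"))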
Availability of Bandwidth
Currently, one of the most limiting factors for efficient Cloud usage is the bandwidth between the local network and the Cloud. In order to transfer genomic data without delays, it helps to procure a dedicated link (with a respective SLA) or to invest in a higher-bandwidth network; another important aspect, however, is the transfer tool. In this regard, one of the major benefits of using the leading AWS Cloud service is the availability of adapted solutions from other vendors fitting exactly these needs. Aspera™ (www.asperasoft.com), for example, extends the AWS network with solutions to exchange Big Data across networks in an encrypted manner and at high speed. Traditional protocols like FTP, SFTP or FTPS were not designed to meet the challenges of transporting Big Data over WAN links. Technically, the use of the User Datagram Protocol (UDP, with add-on session handling) performs much better than the classical Transmission Control Protocol (TCP), which incurs a lot of overhead in handshaking and session-reliability measures. Another exciting solution is the Fast Data Transfer (FDT) tool used for Big Data transfer at CERN (monalisa.cern.ch/FDT). The beauty of this approach lies in its capability of reading and writing at disk speed over wide area networks using TCP, and in its platform-agnostic design. Some repositories of genomic data enforce their own protocols for data transfer. The University of California, Santa Cruz (UCSC) uses a torrent-based method for distributing files among the user community. In recent years this method has proven to be a very efficient way of transferring files quickly, also due to the growing use of Clouds for storing and processing genomic data.
Data Protection
In the context of Cloud computing, IT security and data protection are topics oftentimes accompanied by the thought-terminating cliché: 'If my data are no longer on my local campus, they are not secure, and therefore a Cloud solution is out of the question'. Such a train of thought is unfortunately still used as a killer argument to stop any further discussion. There are most certainly more intelligent approaches to analyzing the security challenges of Clouds. One widespread approach is to classify data and run a risk analysis. When dealing with genomic data, which typically derive from patients, legal aspects regarding national and international data privacy legislation need to be taken into account as well.
Every piece of data has an owner. The owner is the person responsible for dealing with the topics of data classification and risk assessment. The latter typically requires support from IT experts, but data owners nevertheless must be aware of the potential risks and of the amount of risk they are willing to take, as there is no absolutely risk-free storage of data. Table 2 gives insight into typical risk classes and associated questions. Clearly, the individual assessment is closely related to the type of data and the data owner's risk perception.
Table 2. Compilation of typical data risk classes and questions
All Cloud providers are aware that data protection is one of the key decision drivers for or against Cloud usage. Therefore, they are keen on
improving their services, and many offer a variety of security features. It is obvious that economies of scale do not just apply to the computational resources but also to security. Large Cloud providers can attract and employ security specialists for their environments, which may be difficult for smaller companies. Also, such Cloud providers are capable of offering disaster recovery, because they have many distributed sites. Further Cloud-specific, risk-related advantages and disadvantages are summarized in Table 3.
Table 3. Special risk-related, cloud-specific advantages and disadvantages
Advantages | Disadvantages
In light of all of these security considerations, there is an understandable desire of Cloud customers to quantify each applicable risk, including a
probability-of-incidence measure where possible. Such requests are put forward, for example, as audit or 'social hacking' requests, which are typically frowned upon by Cloud providers. Here, a customer's need to validate the Cloud environment conflicts with the need of other customers to have their data secured from anyone's access, including 'social' hackers. Leading Cloud providers solve this challenge by undergoing regular
inspections and (ISO-) certifications by independent third parties.
Apart from assessing the level of certification, Cloud customers can also proactively protect their data. Encryption is one such crucial and commonly discussed measure, especially in the context of Cloud computing. TrueCrypt™ (www.truecrypt.org) is an encryption tool widely accepted among IT security experts. The code is open source, allowing users to review the code before usage. But TrueCrypt™ (like any other tool) is neither secure nor insecure by default. It must be used correctly to maximize security. Losing the encryption key, for example, or using 'weak' passwords may enable successful data theft. But even a brute-force attack against an excellent password may be successful given enough time or a lucky strike. Therefore, additional measures like firewalls or access control lists are useful for further reducing the risk of data theft, but certain elements of risk always remain. The risk can only be reduced to a level matching the risk class of the data.
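As a complementary measure, data can be encrypted client-side before they ever leave the local network. The following minimal R sketch uses GnuPG for this purpose; the file name is a hypothetical placeholder, gpg must be installed, and key and passphrase management are deliberately left out of this illustration.

# Hedged sketch (R + GnuPG): encrypt a data file locally before transfer to the Cloud.
infile  <- "sample_001.cel"        # assumed raw data file
outfile <- paste0(infile, ".gpg")

# Symmetric AES-256 encryption; gpg prompts for the passphrase interactively
system(paste("gpg --symmetric --cipher-algo AES256 --output", outfile, infile))

# Only the encrypted file is then transferred to the Cloud storage
# (e.g. via 'aws s3 cp sample_001.cel.gpg s3://...').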
Storage of data almost always raises legal challenges for the data owner. National or international data protection laws or comparable rules (e.g. the 'Safe Harbor' principles) deal with the storage and protection of personal data. Among others, such regulations may impose restrictions on the physical location and the transfer of these types of data. In all but a few cases, genomic data must be treated as personal data in need of special protection under these regulations. Oftentimes, these data derive from patients (e.g. in the case of cancer studies) who need to be transparently informed, e.g., about the physical location of their data and the intent and purpose of its use. In clinical practice, this is achieved by so-called 'Informed Consent' forms, detailing this information to patients in a clear and understandable manner. If this has been achieved, such personal data can be stored and analyzed in the Cloud, provided the Cloud contractor is capable of adhering to the highest data protection standards. Such standards commonly include, but are not limited to: data security certification according to ISO 27001, ISO 27003 or ISO 27005 (and/or adherence to 'Safe Harbor' principles where applicable), signature of standard contractual clauses on data privacy that (in the European Union) comply with Directive 95/46/EC, and the ability to name the compute centers where the data are stored and processed. Especially in the case of genomic
data, it is strongly advised here that any contract with a Cloud provider should cover these issues. Outsourcing of IT services delegates many
responsibilities, but the accountability stays with the data owner.
Amazon Web Services offers all the possibilities needed to run any business case in the Cloud. We highly recommend reading the online documentation on the AWS websites. Understanding the built-in security functionalities (e.g. two-factor authentication for the management GUI) helps customers secure their Cloud environment according to their business needs.
Costs
One of the main drivers for moving services into the Cloud (or for most other outsourcing activities, for that matter) is to drive down the costs of internal IT. It is therefore important to elucidate beforehand the cost structures and pricing models of the different Cloud providers in order to find a suitable model. It has been elaborated earlier that one scenario for utilizing the Cloud may be to absorb peak loads of computational jobs with very high demand on CPU, memory, and storage. For such a repetitive but non-constant demand, cost reduction can be realized by stopping Cloud servers whenever there is no workload. If Cloud servers are running permanently (7 days x 24 hours), it can be argued that the total cost of an in-house virtualized server is not essentially higher than that of a Cloud-based alternative with the same sizing parameters. Stern discipline or the use of automation (e.g. via API) is needed to get the best out of the Cloud.
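The effect of such discipline can be illustrated with a small back-of-the-envelope calculation in R. The hourly rate of USD 2 used below is a purely hypothetical figure, not an actual AWS price; the point is the ratio between an always-on instance and one that only runs during a 40-hour working week.

# Illustrative cost comparison with an assumed rate of USD 2 per instance hour
rate_per_hour <- 2                               # assumed rate, USD
always_on     <- rate_per_hour * 24 * 30         # ~ USD 1,440 per month
working_hours <- rate_per_hour * 40 * 4.33       # ~ USD 346 per month
c(always_on = always_on, working_hours = working_hours)

Under these assumptions, simply stopping the instance outside working hours cuts the monthly bill to roughly a quarter, which is exactly the kind of saving that automation via the provider API makes repeatable.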
FUTURE RESEARCH DIRECTION: UTILIZING CLOUD COMPUTING TO ANALYZE SNP6 ARRAYS
Study Aim
The large amount of genomic data produced by new high-throughput technologies like RNA sequencing (Chen et al., 2011; Wang et al., 2009) or high-resolution microarrays (Pinkel & Albertson, 2005) drives the development of new bioinformatics tools for analyzing these data (McPherson et al., 2011). A main challenge is that many tools for genomic analysis are, in most cases, able neither to parallelize the different computational steps on a cluster nor to parallelize these steps on a single computer with more than one CPU. Hence, the analysis of large datasets often requires a lot of time and money for computational resources. We address this issue by using AWS EC2 services (Fusaro, Patil, Gafni, Wall, & Tonellato, 2011), the common programming language R, and a flexible, non-task-bound parallelization of different jobs in an AWS EC2 environment.
Materials and Methods
In our study we focus on a software tool called PICNIC (Greenman et al., 2010). This tool is able to predict copy number variation (CNV) from SNP6 array data (McCarroll et al., 2008) produced with the Affymetrix genome-wide SNP6.0 array, using a Bayesian Hidden Markov Model (HMM). For segments with a constant copy number (CN), we identify genes at the edges of the segment and compare the SNP measures with each other to calculate probabilities for fusion genes.
Applying PICNIC to SNP6 array data is computationally intensive. The average resources needed for the processing of one sample are about 10 GB of Random-Access Memory (RAM) and 3 hours of Central Processing Unit (CPU) time on an AWS EC2 High-Memory Quadruple Extra Large instance (m2.4xlarge). For a large dataset consisting of 1,000 samples, a sequential analysis would take around 125 days. One way to speed up the analysis of large datasets is parallelization, applied e.g. on AWS EC2 cloud computing instances. Since each sample requires about 10 GB of RAM, roughly six samples can be processed concurrently on one AWS EC2 High-Memory Quadruple Extra Large compute node (8 CPUs and 68.4 GB RAM), so the calculation of such a dataset on a single node would take 21 days instead of 125 days, at a cost of USD 1,095. However, 21 days for 1,000 samples is not acceptable, because often there are more than 1,000 samples to process, which would take even longer. Due to the flexibility of AWS, the computing resources can be adjusted on demand. Hence, if 21 days are too long, it is possible to run PICNIC on, e.g., 20 such compute nodes and therefore 120 samples in parallel, which reduces the runtime further to about 24 hours at the same cost.
The second step of our pipeline (fusion calculation) has no predictable processing time: depending on the output of the PICNIC step, its CPU usage may vary between 3 minutes and 2 hours (Table 4).
Table 4. Study pipeline
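The runtime figures quoted above can be reproduced with a short calculation in R, assuming the stated 3 CPU hours and 10 GB RAM per sample; the 20-node estimate of about 25 hours matches the roughly 24 hours given above up to rounding and scheduling overhead.

# Back-of-the-envelope check of the quoted runtimes
samples        <- 1000
hours_each     <- 3
ram_per_sample <- 10      # GB
ram_per_node   <- 68.4    # GB, m2.4xlarge

sequential_days <- samples * hours_each / 24                    # ~125 days
parallel_slots  <- floor(ram_per_node / ram_per_sample)         # 6 samples per node
one_node_days   <- samples * hours_each / parallel_slots / 24   # ~21 days
twenty_nodes_h  <- samples * hours_each / (20 * parallel_slots) # ~25 hours
c(sequential_days, one_node_days, twenty_nodes_h)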
Technical Setup (In This Use Case)
The administrative entry point from a local workstation to the AWS EC2 infrastructure is a server reached via a secure remote desktop connection from within the local security area. The parallelization takes place on a master server. The master server can be accessed from the entry point via SSH and hosts an NFS share with large disk space. The master server interacts with several cloud compute nodes, on which the PICNIC processes are executed in parallel. For the central processing of the incoming results from the compute nodes, the master server is set up with approximately 17 GB RAM (High-Memory Extra Large instance). The communication with the AWS EC2 compute nodes, as well as the parallelization itself, is implemented in R (Version 2.12.1), using the R package snow (Simple Network of Workstations, Version 0.3-10) (Rossini, Tierney, & Li, 2003). The R program establishes a queue that distributes PICNIC processes and their data to the available compute nodes, which are accessible by IP address.
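A minimal sketch of this layout in R is shown below. The private IP addresses are placeholders, and password-less SSH from the master server to the compute nodes (using the private access key mentioned above) is assumed to be configured.

# Hedged sketch: create one snow worker per compute node, addressed by private IP
library(snow)

node_ips <- c("10.0.0.11", "10.0.0.12", "10.0.0.13")  # assumed R-server IPs
cl <- makeSOCKcluster(node_ips)    # one R worker per compute node

# ... distribute PICNIC jobs with clusterApplyLB / clusterApply (see below) ...

stopCluster(cl)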
Parallelization
Efficient usage of the compute nodes is achieved by using the functions of the snow package. The function clusterApplyLB balances the workload if the number of data items to be processed is larger than the number of compute nodes. At first, the master sends one task to each node and controls the process
with a built-in communication protocol. As soon as one computing node has finished its task, the master sends the next task to this node, until
the whole task list is completed.
Usage:
• clusterApplyLB(cl, x, f)
Arguments: cl is the cluster object returned when the cluster is created, x is the list (or vector) of tasks to be distributed, and f is the function applied to each element of x on the compute nodes.
This dynamic method fits the requirements of the second step in our study pipeline perfectly. Although the CPU usage per input file is variable, the load-balancing algorithm leads to an excellent utilization of the allocated computing resources (Figure 7). The main disadvantage is that the order of job execution is nondeterministic and that the increased communication with the compute nodes reduces the overall performance of the parallelization (Table 5).
Figure 7. The advantage of the function clusterApplyLB, shown for a task list with 8 tasks distributed to 3 compute nodes
Table 5. Parallelization methods
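The following hedged sketch illustrates this load-balanced pattern for the fusion-calculation step. The file names and the worker function are illustrative placeholders only; the variable runtime is simulated here so that the example is self-contained.

# Hedged sketch: load-balanced processing of PICNIC output files with clusterApplyLB
library(snow)

cl <- makeSOCKcluster(c("10.0.0.11", "10.0.0.12", "10.0.0.13"))

input_files <- sprintf("picnic_output_%03d.txt", 1:8)   # assumed task list

calc_fusion <- function(file) {
  # placeholder for the actual fusion-probability calculation
  Sys.sleep(runif(1, 1, 5))   # simulate a variable runtime per file
  paste("processed", file)
}

results <- clusterApplyLB(cl, input_files, calc_fusion)
stopCluster(cl)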
Looking at the first step of our pipeline, with the nearly constant CPU usage of the PICNIC step, we wanted to avoid this communication overhead (Figure 8).
Figure 8. In this example the 8 tasks have an almost constant
runtime. Using do.call distributes the workload at first and
creates task-lists for all compute nodes
This can be solved by utilizing do.call. This function constructs and executes a function call; here it applies the simple concatenation function 'c' to the list of arguments passed to it, namely the per-node results returned by clusterApply. The data to be processed are split beforehand according to the number of available compute nodes (function call length(cl), where cl is the cluster object), using splitList. Each job then contains a list of elements of approximately equal length.
Usage:
• do.call("c", clusterApply(cl, x, f))
Arguments: cl is the cluster object, x is the list of pre-split task chunks (one chunk per compute node), and f is the function applied to each chunk; do.call then concatenates the per-node result lists with c.
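A hedged sketch of this pattern is given below. The sample names and the run_picnic function are illustrative placeholders; for the splitting step, snow's documented helper clusterSplit is used here, which divides the tasks into one chunk per node in the same way as the splitList call described above.

# Hedged sketch: static distribution with clusterApply and concatenation via do.call
library(snow)

cl <- makeSOCKcluster(c("10.0.0.11", "10.0.0.12", "10.0.0.13"))

samples <- sprintf("sample_%03d", 1:60)     # assumed sample list
chunks  <- clusterSplit(cl, samples)        # one chunk of ~20 samples per node

run_picnic <- function(chunk) {
  lapply(chunk, function(s) paste("PICNIC finished for", s))  # placeholder for the real call
}

results <- do.call("c", clusterApply(cl, chunks, run_picnic))
stopCluster(cl)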
CONCLUSION
In this chapter, it was established that, by virtue of its size and complexity, genomic data is Big Data. New technologies and processes are therefore necessary for analysis, storage and risk assessment. This chapter is intended to support researchers and IT consultants in making better-informed decisions on the most suitable IT infrastructure for genomic data analysis. It is an intentional outcome of this chapter that there is no one-size-fits-all solution, regardless of whether the IT infrastructure is internal, outsourced, or Cloud based. Rather, it is argued here that the specific application should be taken into account and that there are certain scenarios where Clouds emerge as the most suitable infrastructure:
• Clouds are suitable for applications with high demand towards CPU, memory and storage
• Clouds are suitable for long term storage of raw data in large scale (several TB)
This last point concerns a novel emerging trend in the genomics field, especially in the Big Data era: As the size and amount of data sets increase,
it will soon no longer be feasible to download and centralize data in order to process them. Rather, data will be processed where they reside. It is
predicted here that there will be 'Data Clouds' in the near future, established by the data-generating institutions or consortia, where computational power can be leased and where additional internal data and algorithms can be 'attached', e.g. by transferring virtual machines (VMs) into that Cloud, with only the results being reported back. From this trend, a number of consequences follow. It becomes clear that the need to
benefit from the wealth of public data will make ‘Cloud knowledge’ a highly valuable skill for IT experts. Therefore, it can be beneficial for an IT
organization to identify use cases suitable for the Cloud in order to make some members of staff ‘Cloud ready’. Such use cases can best be found if
they match one of the three demand types named above. Furthermore, such a use case must always take into account the prerequisites for Cloud
use detailed above in order to identify the proper infrastructure:
1. Capability of the infrastructure with respect to CPU, memory and storage
2. Availability of bandwidth between the local network and the Cloud
3. Data protection regulations (internal regulations as well as laws, e.g. data privacy)
4. Costs
There are further considerations, e.g. the suitability and reliability of the Cloud service provider. Here, an analytical approach, e.g. as pursued by Leong et al. (2013), is a recommended method for obtaining a ranked list of Cloud providers.
In the use case study presented in this chapter, R programs were parallelized on AWS EC2 Clouds, utilizing the Cloud for massively parallel
processing of large scale genomic data. In addition, different options for parallelization were demonstrated to enable consideration of the
advantages and disadvantages of each parallelization approach.
The main advantage of the Cloud environment utilized in this use case study lies in the independence from pre-configured Amazon Machine Images (AMIs) or fixed IT environments. With AWS, such fixed environments are typically bound to certain regions, for example the Bioconductor Cloud AMI, which is located in the US East 1 region. With the introduced approach, users have full control over the installed packages and the location of their data. Identification and pursuit of such a use case is a promising first step for the analysis of genomic data in a Cloud environment.
This work was previously published in Big Data Analytics in Bioinformatics and Healthcare edited by Baoying Wang, Ruowang Li, and
William Perrizo, pages 186-214, copyright year 2015 by Medical Information Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
The authors would like to acknowledge the support of the following people: Dan Akers, Andreas Friese, Ulf Hengstmann, Jens Hohmann, Heinz
Rakel, Felix Reichel, Florian Reuter, Thomas Schilling, and Karsten Tittmann
REFERENCES
Abecasis, G. R., Altshuler, D., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., & McVean, G. A. (2010). A map of human genome variation
from population-scale sequencing. Nature ,467(7319), 1061–1073. doi:10.1038/nature09534
Association, T. G. (2010). Risk Management: Research on Risk Management, Assessment and Prevention. Geneva Association Information
Newsletter, 47. Retrieved January, 2014, from http://aws.amazon.com/glacier/pricing/
Barreiro, L. B., Laval, G., Quach, H., Patin, E., & Quintana-Murci, L. (2008). Natural selection has driven population differentiation in modern
humans. Nature Genetics , 40(3), 340–345. doi:10.1038/ng.78
Botstein, D., & Risch, N. (2003). Discovering genotypes underlying human phenotypes: Past successes for mendelian disease, future approaches
for complex disease. Nature Genetics, 33(Suppl), 228–237. doi:10.1038/ng1090
Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D. H., Johnson, D., & Corcoran, K. (2000). Gene expression analysis by massively
parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnology , 18(6), 630–634. doi:10.1038/76469
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., & Haussler, D. (2000). Knowledge-based analysis of microarray
gene expression data by using support vector machines. Proceedings of the National Academy of Sciences of the United States of America , 97(1),
262–267. doi:10.1073/pnas.97.1.262
Chen, G., Wang, C., & Shi, T. (2011). Overview of available methods for diverse RNA-Seq data analyses. Sci China Life Sci ,54(12), 1121–1128.
doi:10.1007/s11427-011-4255-x
Clarke, J., Wu, H. C., Jayasinghe, L., Patel, A., Reid, S., & Bayley, H. (2009). Continuous base identification for single-molecule nanopore DNA
sequencing. Nature Nanotechnology , 4(4), 265–270. doi:10.1038/nnano.2009.12
de Paula Careta, F., & Paneto, G. G. (2012). Recent patents on High-Throughput Single Nucleotide Polymorphism (SNP) genotyping
methods. Recent Patents on DNA & Gene Sequences ,6(2), 122–126. doi:10.2174/187221512801327370
Eberle, M. A., Stone, J., Galver, L., Hansen, M., Tsan, C., & Seagale, D. (2011). A new whole-genome genotyping array of almost 4.5 million SNPs
based on data from the 1000 Genomes Project. In Proceedings of12th International Congress of Human Genetics (ICHG). ICHG.
Edgren, H., Murumagi, A., Kangaspeska, S., Nicorici, D., Hongisto, V., Kleivi, K., & Kallioniemi, O. (2011). Identification of fusion genes in breast
cancer by paired-end RNA-sequencing. Genome Biology , 12(1), R6. doi:10.1186/gb-2011-12-1-r6
Fusaro, V. A., Patil, P., Gafni, E., Wall, D. P., & Tonellato, P. J. (2011). Biomedical cloud computing with Amazon Web Services.PLoS
Computational Biology , 7(8), e1002147. doi:10.1371/journal.pcbi.1002147
Graux, C., Cools, J., Melotte, C., Quentmeier, H., Ferrando, A., Levine, R., & Hagemeijer, A. (2004). Fusion of NUP214 to ABL1 on amplified
episomes in T-cell acute lymphoblastic leukemia.Nature Genetics , 36(10), 1084–1089. doi:10.1038/ng1425
Greenman, C. D., Bignell, G., Butler, A., Edkins, S., Hinton, J., Beare, D., & Stratton, M. R. (2010). PICNIC: An algorithm to predict absolute
allelic copy number variation with microarray cancer data. Biostatistics (Oxford, England) , 11(1), 164–175. doi:10.1093/biostatistics/kxp045
Huang, N., Shah, P. K., & Li, C. (2012). Lessons from a decade of integrating cancer copy number alterations with gene expression
profiles. Briefings in Bioinformatics , 13(3), 305–316. doi:10.1093/bib/bbr056
Johansson, B., Mertens, F., & Mitelman, F. (1996). Primary vs. secondary neoplasia-associated chromosomal abnormalities-balanced
rearrangements vs. genomic imbalances? Genes, Chromosomes & Cancer , 16(3), 155–163.
Kasianowicz, J. J., Brandin, E., Branton, D., & Deamer, D. W. (1996). Characterization of individual polynucleotide molecules using a membrane
channel. Proceedings of the National Academy of Sciences of the United States of America , 93(24), 13770–13773. doi:10.1073/pnas.93.24.13770
Kearney, L., & Horsley, S. W. (2005). Molecular cytogenetics in haematological malignancy: Current technology and future
prospects. Chromosoma , 114(4), 286–294. doi:10.1007/s00412-005-0002-z
Kerr, M. K., Martin, M., & Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. [Comparative Study]. Journal of
Computational Biology , 7(6), 819–837. doi:10.1089/10665270050514954
Lashkari, D. A., DeRisi, J. L., McCusker, J. H., Namath, A. F., Gentile, C., Hwang, S. Y., & Davis, R. W. (1997). Yeast microarrays for genome wide
parallel genetic and gene expression analysis.Proceedings of the National Academy of Sciences of the United States of America , 94(24), 13057–
13062. doi:10.1073/pnas.94.24.13057
Leong, L., Toombs, D., Gill, B., Petri, G., & Haynes, T. (2013).Magic Quadrant for Cloud Infrastructure as a Service. Retrieved from
http://www.gartner.com/technology/reprints.do?id=1-1IMDMZ5&ct=130819&st=sb
Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., & Law, M. (2012). Comparison of next-generation sequencing systems. Journal of Biomedicine &
Biotechnology, 251364. doi:10.1155/2012/251364
Long, M. (2000). A new function evolved from gene fusion.Genome Research , 10(11), 1655–1657. doi:10.1101/gr.165700
McCarroll, S. A., Kuruvilla, F. G., Korn, J. M., Cawley, S., Nemesh, J., Wysoker, A., & Altshuler, D. (2008). Integrated detection and population-
genetic analysis of SNPs and copy number variation.Nature Genetics , 40(10), 1166–1174. doi:10.1038/ng.238
McPherson, A., Hormozdiari, F., Zayed, A., Giuliany, R., Ha, G., Sun, M. G., & Shah, S. P. (2011). deFuse: An algorithm for gene fusion discovery
in tumor RNA-Seq data. PLoS Computational Biology , 7(5), e1001138. doi:10.1371/journal.pcbi.1001138
Meaburn, K. J., Misteli, T., & Soutoglou, E. (2007). Spatial genome organization in the formation of chromosomal translocations.
[Review]. Seminars in Cancer Biology , 17(1), 80–90. doi:10.1016/j.semcancer.2006.10.008
Mitelman, F., Johansson, B., & Mertens, F. (2007). The impact of translocations and gene fusions on cancer causation. Nature Reviews.
Cancer , 7(4), 233–245. doi:10.1038/nrc2091
Müller, H., & Röder, T. (2004). Der Experimentator Microarrays(Vol. 1). München: Spektrum Akademischer Verlag, Elsevier GmbH.
Pinkel, D., & Albertson, D. G. (2005). Comparative genomic hybridization. Annual Review of Genomics and Human Genetics ,6(1), 331–354.
doi:10.1146/annurev.genom.6.080604.162140
Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, I., Kowbel, D., & Albertson, D. G. (1998). High resolution analysis of DNA copy number
variation using comparative genomic hybridization to microarrays. Nature Genetics , 20(2), 207–211. doi:10.1038/2524
Rajagopalan, H., & Lengauer, C. (2004). Aneuploidy and cancer.Nature , 432(7015), 338–341. doi:10.1038/nature03099
Reinartz, J., Bruyns, E., Lin, J. Z., Burcham, T., Brenner, S., Bowen, B., & Woychik, R. (2002). Massively parallel signature sequencing (MPSS) as
a tool for in-depth quantitative gene expression profiling in all organisms. [Review]. Briefings in Functional Genomics & Proteomics , 1(1), 95–
104. doi:10.1093/bfgp/1.1.95
Risch, N. J. (2000). Searching for genetic determinants in the new millennium. Nature , 405(6788), 847–856. doi:10.1038/35015718
Ritz, A., Paris, P. L., Ittmann, M. M., Collins, C., & Raphael, B. J. (2011). Detection of recurrent rearrangement breakpoints from copy number
data. BMC Bioinformatics , 12(1), 114. doi:10.1186/1471-2105-12-114
Roix, J. J., McQueen, P. G., Munson, P. J., Parada, L. A., & Misteli, T. (2003). Spatial proximity of translocation-prone gene loci in human
lymphomas. [Comparative Study]. Nature Genetics , 34(3), 287–291. doi:10.1038/ng1177
Rossini, A., Tierney, L., & Li, N. (2003). Simple Parallel Statistical Computing in R . The Berkeley Electronic Press.
Sanger, F., & Coulson, A. R. (1975). A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of
Molecular Biology , 94(3), 441–448. doi:10.1016/0022-2836(75)90213-2
Stratton, M. R., Campbell, P. J., & Futreal, P. A. (2009). The cancer genome. Nature , 458(7239), 719–724. doi:10.1038/nature07943
Suzuki, S., Tenjin, T., Shibuya, T., & Tanaka, S. (1997). Chromosome 17 copy numbers and incidence of p53 gene deletion in gastric cancer cells.
Dual color fluorescence in situ hybridization analysis. Nippon Ika Daigaku Zasshi , 64(1), 22–29.
Tomlins, S. A., Rhodes, D. R., Perner, S., Dhanasekaran, S. M., Mehra, R., Sun, X. W., & Chinnaiyan, A. M. (2005). Recurrent fusion of TMPRSS2
and ETS transcription factor genes in prostate cancer. Science , 310(5748), 644–648. doi:10.1126/science.1117679
van Berkum, N. L., Lieberman-Aiden, E., Williams, L., Imakaev, M., Gnirke, A., Mirny, L. A., & Lander, E. S. (2010). Hi-C: A method to study the
three-dimensional architecture of genomes. Journal of Visualized Experiments, (39). doi:10.3791/1869
Velculescu, V. E., Zhang, L., Vogelstein, B., & Kinzler, K. W. (1995). Serial analysis of gene expression. Science , 270(5235), 484–487.
doi:10.1126/science.270.5235.484
Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: A revolutionary tool for transcriptomics. [Review]. Nature Reviews. Genetics , 10(1), 57–
63. doi:10.1038/nrg2484
Yu, J., Mani, R. S., Cao, Q., Brenner, C. J., Cao, X., Wang, X., & Chinnaiyan, A. M. (2010). An integrated network of androgen receptor, polycomb,
and TMPRSS2-ERG gene fusions in prostate cancer progression. Cancer Cell , 17(5), 443–454. doi:10.1016/j.ccr.2010.03.018
ADDITIONAL READING
Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., & Walter, P. (2007). Molecular Biology of the Cell (M. Anderson & S. Granum Eds. Vol.
5): Garland Science, Taylor & Francis Group, LLC.
Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D. H., Johnson, D., & Corcoran, K. (2000). Gene expression analysis by massively
parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnology , 18(6), 630–634. doi:10.1038/76469
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., & Haussler, D. (2000). Knowledge-based analysis of microarray
gene expression data by using support vector machines. Proceedings of the National Academy of Sciences of the United States of America , 97(1),
262–267. doi:10.1073/pnas.97.1.262
Kerr, M. K., Martin, M., & Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational
Biology , 7(6), 819–837. doi:10.1089/10665270050514954
Lashkari, D. A., DeRisi, J. L., McCusker, J. H., Namath, A. F., Gentile, C., Hwang, S. Y., & Davis, R. W. (1997). Yeast microarrays for genome wide
parallel genetic and gene expression analysis.Proceedings of the National Academy of Sciences of the United States of America , 94(24), 13057–
13062. doi:10.1073/pnas.94.24.13057
Müller, H., & Röder, T. (2004). Der Experimentator Microarrays(Vol. 1). München: Spektrum Akademischer Verlag, Elsevier GmbH.
Reinartz, J., Bruyns, E., Lin, J. Z., Burcham, T., Brenner, S., Bowen, B., & Woychik, R. (2002). Massively parallel signature sequencing (MPSS) as
a tool for in-depth quantitative gene expression profiling in all organisms. Briefings in Functional Genomics & Proteomics , 1(1), 95–104.
doi:10.1093/bfgp/1.1.95
Velculescu, V. E., Zhang, L., Vogelstein, B., & Kinzler, K. W. (1995). Serial analysis of gene expression. Science , 270(5235), 484–487.
doi:10.1126/science.270.5235.484
KEY TERMS AND DEFINITIONS
Cloud Computing: A computational resource or infrastructure of any kind (software, hardware, network) which is available over the Internet, sold on demand and typically managed by a third-party provider. An important feature is the pay-as-you-go concept.
Cloud Provider: A company, which offers Cloud Computing, taking over the responsibility for a well described part of the customer’s value
chain, e.g. providing compute nodes (hardware).
Fusion Gene: A merged genomic sequence containing the entire or partial information of at least two spatially separate genes. The merger can lead to a protein with novel functions or with no function, and may increase susceptibility to diseases like cancer.
Genomic Data: The data generated through research on genes or entire genomes and their function.
Microarray: Describes a wide set of technologies for molecular characterization of RNA and DNA. Microarrays typically consist of thousands or
millions of microscopic spots (probes) of known short RNA or DNA sequences of interest, chemically fixed on a solid surface like glass used to
measure the abundance of most of the messenger-RNA (mRNA) or DNA within a sample.
Parallelization: Efficient usage of computational resources by running algorithmic pipelines or part of these at the same time on different
CPUs or compute nodes in order to reduce the overall runtime.
Personal Data: Any information concerning the personal circumstances of an identified or identifiable individual. Such data, especially when
containing information about the health of an individual, are protected by specific data privacy laws in many countries.
PICNIC: An algorithm, published by a group of scientists and freely available from the Wellcome Trust Sanger Institute, to detect allelic copy number variation from SNP6 arrays for cancer samples.
Risk Analysis: A process to identify threats, their possible impact, and the probability of occurrence. This type of analysis is needed when dealing with personal data in a Cloud.
APPENDIX
Table 6. Abbreviations and their meanings
Abbreviation | Meaning
Sara D’Onofrio
University of Bern, Switzerland
Edy Portmann
University of Bern, Switzerland
ABSTRACT
The fuzzy analytical network process (FANP) is introduced as a potential multi-criteria-decision-making (MCDM) method to improve digital
marketing management endeavors. Today’s information overload makes digital marketing optimization, which is needed to continuously
improve one’s business, increasingly difficult. The proposed FANP framework is a method for enhancing the interaction between customers and
marketers (i.e., involved stakeholders) and thus for reducing the challenges of big data. The presented implementation takes reality's fuzziness
into account to manage the constant interaction and continuous development of communication between marketers and customers on the Web.
Using this FANP framework, the marketers are able to increasingly meet the varying requirements of their customers. To improve the
understanding of the implementation, advanced visualization methods (e.g., wireframes) are used.
INTRODUCTION
The soaring technology level enhances communication between marketers and customers in digital marketing and alters its methods and goals to
improve the interaction between them. The ability of marketers to understand the requirements of their customers increases, which is necessary
to be economically competitive. Today, the challenge is that information gains ever more value and multiplies itself continuously. All marketers
are confronted with big data and possible information overload. It has become more difficult to analyze the huge amounts of data and to filter out the relevant parts in order to accurately understand the requirements of one's customers.
Since it is necessary to continuously adjust one’s own business to meet varying customers’ requirements and to make the best offer for a specific
group of customers, it is important to select the right kind of information from the growing data pool. Additionally, personalization plays a crucial
role in marketing as customers currently get spammed with unwanted or uninteresting advertisements. It is crucial to understand and meet
stakeholders’ requirements, because only information matched to their requirements is valuable. Since the marketer has to understand his
customers, stakeholder management is needed to specify the relevant requirements.
Zadeh's (1979) information granulation theory proposes a way to deal with big data naturally. It clusters data and represents them in a structured way (Yao, 2005; Zadeh, 1998) so that it is easy to see which information is relevant, which in turn is important for increasing one's competitiveness. The analytical
network process (ANP) enables information granulation (Saaty, 2006) by representing the organization of given information in a networking
structure (i.e., networking granulation). This networking structure corresponds to clustering (Punj & Stewart, 1983), a widely used method in
marketing.
Since the interaction between multiple stakeholders cannot be captured exactly, fuzzy logic (Zadeh, 1988) is applied as a way to address
vagueness. Instead of searching for the best solution, it is often better to search for good enough (i.e., approximate) solutions that fit the
requirements of the stakeholders (i.e., customers) (Yao, 2000) and, thus, to make enhanced personalized offers. Fuzzy logic is added to
conventional ANP to create the fuzzy analytical network process (FANP). FANP makes it possible to work with uncertain information (e.g., see
Ahmadi, Yeh, Martin, & Papageorgiou, 2014) and to structure the information in a networking form. Thus, the presented implementation enables
improved digital marketing through FANP.
The intention is to create a cooperative decision support system (DSS) (Haettenschwiler, 2001) to improve data acquisition and decision making
that is focused on digital marketing measures. The authors combine ANP as a multi-criteria-decision-making (MCDM) method with fuzzy
research on the soft handling of big data. Existing knowledge (i.e., a conceptual framework) can be applied and enhanced, as the implementation
is based on previous research (Portmann & Kaltenrieder, 2015).
First, the theoretical background of all used concepts for this chapter will be provided. The concepts of digital marketing, stakeholder
management, requirements engineering, big data, granular computing (GrC), fuzzy logic and fuzzy sets, fuzzy cognitive maps (FCMs), ANP and
FANP will be explained in this part. Afterwards, the implementation and its process steps will be presented, accompanied by elicited
requirements of involved stakeholders for a digital marketing use case. Thereafter, several visualization methods will be considered to find the
most suitable one that fits the requirements.
BACKGROUND
The evolution of marketing is closely associated with developments in technology. Towards the end of the 20th century, the Web became a widely
used business and communication tool, which indicated the beginning of the era of digital marketing. According to Ryan and Jones (2009), the
potential audience grows, because the market penetration of digital channels is growing rapidly and so does the attraction of digital marketing.
An increasing growth in digital marketing is expected. The interaction between companies (i.e., marketers) and their customers has to be
reinvented, as it moves from demographically targeted print and broadcast messages to data-driven, contextually relevant brand communication
(Mulhern, 2009). Today’s technology enables people to connect more effectively with each other. This section will present the used concepts (i.e.,
stakeholder management, digital marketing, requirements engineering, big data, GrC, fuzzy logic, FCMs, ANP, FANP, as well as visualization
methods) of the proposed FANP framework in more detail.
Companies and Their Stakeholders
Marketers find themselves in a growing network of interest groups, so-called stakeholders (e.g., customers, suppliers, regulators, shareholders,
and socio-ethical organizations). To be competitive over time against the market, it is crucial to understand the stakeholders and to integrate
their behavior, values, and backgrounds in the marketing strategy (Freeman, 1984, 2004). It is important to add their claims into the process of
decision making from the beginning (Dimitrov & Russell, 1994).
As Argenti (2011) argues, technological development has led to stakeholder empowerment. Empowerment can be defined as a progression that
helps stakeholders gain control over their own lives and increases their capacity to act on certain situations that they themselves define as
important (Luttrell, Quiroz, Scrutton, & Bird, 2009). They are empowered to achieve their desired outcome; that is, to make effective choices and
to translate them into desired actions (Alsop & Heinsohn, 2005).
Because of social media, stakeholders can communicate with each other all over the world, exchange experiences, act globally, and can even
endanger the good reputation of a company. Thus, the power is shifted from companies to stakeholders, which makes stakeholder management
necessary. Argenti (2011) and Portmann (2013) argue that it is crucial to communicate with stakeholders online (e.g., through digital marketing).
Stakeholders not only address their criticism directly to companies, but also to the whole supply chain (Portmann & Thiessen, 2013). Marketers
have to use stakeholder management to go beyond information activity in order to create dialogues and visualization for integrating claims into
the decision-making processes (Dimitrov & Russell, 1994).
Digital Marketing
To reach stakeholders online, digital marketing has to be used. Digital marketing is the promotion of products or brands via one or more forms of
electronic media. Examples of electronic media are the Web, social media, mobile phones, electronic billboards, as well as digital, television, and
radio channels (“Digital Marketing,” n.d.). According to Ryan and Jones (2009), within the emergent development of the Web, customers
developed into “Customer 2.0”. This term captures the notion that customers have control over content. They can choose the content and
determine when and how it should be provided to them.
For digital marketing, marketers have to pay attention to customer contexts and focus on delivering value, not only exposure impressions.
Traditionally, delivering value has not been part of a marketing strategy, but today digital media enable well-adapted forms for delivering this real value, instead of traditional “interrupt and repeat” advertising (Kim, 2008). A study (International Business Machines Corporation [IBM], 2011)
found that marketers’ first imperative is to deliver value to empowered customers (i.e., finding out who they are, what they want, and how they
want to interact with the company). This is also what Portmann (2013) argues for in his thesis. The term “empowered customers” captures the
notion that Web 2.0 changes customers’ behavior by enabling them to exchange information and create content. Since companies cannot control
this channel of influence on customers, this poses problems and has important consequences for digital marketing. Besides understanding
customers’ requirements and preferences, the marketer also has to understand their values and behavior (IBM, 2011). Pophal (2014) stated that
marketers increasingly try to target their advertising to those who are most likely to be interested, which is useful for both the marketer and
customers. Thereby, customers become satisfied and the marketer gets more competitive.
As the scope of knowledge (e.g., news, opinions, information) grows broader and deeper than ever before and emerges bottom up (i.e., grassroots
growth) through interactions among stakeholders and becomes available to everyone, emergence plays a key role in today’s digital marketing.
Emergence is the spontaneous arising of new and coherent structures out of a system, which exceed the properties of their components
(Goldstein, 1999). Because of the increasing power of customers, they search online more and more often for opinions of other customers to get
information about their experiences before purchasing a good or a service. They do not hesitate to share their opinions about (online) purchases.
Also, because of the transparency on the Web, the power shifts from marketer to customer. Customers are no longer a homogeneous mass, but diverse niche groups with increasingly varying individual requirements, with which the marketers have to deal. Thus, the power of customers increases immensely, and they want personalized information.
I-Customer relationship management (i-CRM) as a form of fuzzy stakeholder management (see Portmann & Kaltenrieder (2015) and the
following subchapters) is used, because personalization is essential and required in digital marketing. i-CRM is customer relationship
management (CRM) on the Web (i.e., Web-based), meaning that it is concerned with personalization on the Web that aims to provide
information to the customer that fits his needs and preferences (Wong, Fung, Xiao, & Wong, 2005). For personalization, information about a
product or a service is selected for a specific customer by using data about him (e.g., from the customer profile) to match the information to his
profile (Schubert & Koch, 2002). To guarantee that the personalized digital marketing fits the requirements of a particular customer, stakeholder
management is needed to define these requirements, which concern the claims of stakeholders. Therefore, the information should first be
aggregated and categorized before being delivered.
Elicitation of Customers’ Requirements
Along with stakeholder management, requirements engineering is needed to define the requirements of the stakeholders. In recent years, there
has been an increased general interest in requirements specification, because ill-defined requirements, which lead to a system application that is
not preferable, do not satisfy customers; therefore, huge costs arise to remake the system application (Roman, 1985).
There exist functional and non-functional requirements, which differ in the following way: A functional specification describes what the system is
to be. It defines the way the component interacts with its environment (i.e., in terms of the functions it must accomplish) (Roman, 1985; Ross &
Schoman, 1977). According to Glinz (2007), functional requirements address functions that a system must be able to perform and, accordingly,
what the product must do. Functional requirements are those requirements that specify the inputs to the system, the outputs from the system,
and how they are linked together. Non-functional requirements, in contrast, are the required overall attributes of the system, including portability,
reliability, efficiency, human engineering, testability, understandability, and modifiability. Examples for non-functional requirements are
response time, response to failure, maintainability, or high reliability (Roman, 1985). To define the functional and non-functional requirements,
data about customer needs, companies, and the system itself are required; however, the selection of the relevant information poses a challenge
for companies. They find themselves in the context of big data, which makes it demanding to process information and make decisions. They have
to select all relevant information from a big pool of (random) data.
Before requirements of a system can be analyzed and specified, they must be collected through requirements elicitation methods from users,
customers and all other relevant stakeholders (Burge, 2001; Goguen & Linde, 1993; Rowel & Alfeche, 1997). Techniques for requirements
elicitation include interviews, questionnaires, user observation, scenarios, soft system methods, observations, social analysis, use cases, role
playing, prototyping, focus groups, team meetings and creative techniques like mind-mapping, brainstorming, and workshops (Burge, 2001;
Goguen & Linde, 1993).
Big Data
Since the beginning of the 21st century, the Web has become increasingly social (Snijders, Matzat, & Reips, 2012). Not only scholars and
employees, but also users generate data, increasing the amount of new data (Raine & Wellman, 2012), and computers have started to generate information themselves (e.g., software logs, cameras, microphones, RFID readers, and wireless sensor networks) (Hellerstein, 2008). This
development leads to big data. The term “big data” refers to data sets that are too large and complex to be processed by traditional database
management tools, which means that they are difficult to capture, store, manage, and analyze (Leeflang, Verhoef, Dahlström, & Freundt, 2014;
McKinsey Global Institute, 2011). Big data is characterized by the three basic Vs: volume, velocity, and variety (Laney, 2012) and by the four
additional Vs: variability (van Rijmenam, 2013), value (Kaisler, Armour, & Espinosa, 2014), veracity (Hoppe, 2013), and visualization (van
Rijmenam, 2013).
The phenomenon of big data provides large opportunities for companies because it gives access to real-time information, enabling them to
understand their environment at a more granular level (i.e., more specific), to create new products, and to respond to changes in behavior
(Davenport, Barth, & Bean, 2012). Given that marketers today have the ability to gather and monitor increasingly huge amounts of data, they are
able to know more than ever before about (real-time) market needs. The big challenge is to manage these huge amounts of data, to draw useful
conclusions and to make meaningful decisions (Fulgoni, 2013; Pophal, 2014; Leeflang et al., 2014). This development of data acquisition also
influences digital marketing. Big data makes it possible to follow customers during their customer journey (Lemke, Clark, & Wilson, 2011; Onishi
& Manchanda, 2012) and, thus, to understand their preferences. It is crucial to efficiently track customers’ journey to optimize marketing
campaigns and budgets (Leeflang et al., 2014) and, therefore, to be more competitive.
Often, it is difficult to classify information, because it is either too detailed or incomplete and imprecise. Specifically, information about customer
needs can be quite complex and multifaceted. Therefore, humans developed the ability to simplify presented information by generalizing it
(Pedrycz, 2013), which can be particularly helpful for information systems (Yao, 2005; Zadeh, 1998). Hence, to handle the huge amounts of data,
GrC can be useful. GrC is similar to clustering, a common tool for marketing. For this method, groups are created based on similarity to keep
track of the whole case. Cluster analysis, meanwhile, is primarily used for market segmentation, which aims to define groups of people (or
markets or organizations) to find certain characteristics in common. It helps gain a better understanding of customer behavior, identify
homogenous groups, develop new products that meet the requirements of these groups, and select test markets. Referring to GrC, this approach
divides a case into a set of granules (of different complexity) (Zadeh, 1998) and offers a structured way to recognize, analyze, illustrate and solve
problems (Yao, 2005). The data of this case are clustered into groups, classes or families, based on indistinguishability, similarities or
functionality in a biomimetic sense (Lin, 2003; Yao, 2000; Zadeh, 1998, 2014). Thereby, these granules enable one to filter out the relevant
information (Pedrycz, 2013) and easily map the complexities of a situation (Yao, 2005).
Fuzzy Logic and FCMs
An overview of all relevant information can be useful in the decision-making process, especially in an environment of imprecision and
uncertainty (Zadeh, 1988). As customers’ perceptions are often described in natural language (e.g., on the Web) (Portmann, 2013) and the
interactions among stakeholders are often vague, information is imprecise and uncertain. Fuzzy logic allows the modeling of imprecise
information and plays an important role in human reasoning in an uncertain environment. This logic is based on classical or multivalued (e.g.,
Lukasiewicz 3-valued logic (Lukasiewicz, 1920)) and on fuzzy sets (Zadeh, 1988). A fuzzy set is, contrary to a classical set, not defined through its
elements. Instead, every element is contained in every fuzzy set having a (maybe) different degree of membership to each set. This degree, and
thereby the fuzzy sets, are defined by a membership function $\mu$, which maps an element $x$ to the real interval $[0,1]$ for every fuzzy set (Zadeh, 1965). An example of a fuzzy set is a fuzzy number. Since they enhance decision making, triangular fuzzy numbers are used. They are written as $\tilde{A} = (a, b, c)$, where $a$, $b$ and $c$ belong to the reals and $a \le b \le c$. These three numbers describe the membership function $\mu_{\tilde{A}}$, since $a$ and $c$ mark the limits of the support of $\tilde{A}$ and $b$ the most likely point of $\tilde{A}$ (i.e., where $\mu_{\tilde{A}}(b) = 1$). Thus,

$$\mu_{\tilde{A}}(x) = \begin{cases} (x-a)/(b-a), & a \le x \le b \\ (c-x)/(c-b), & b \le x \le c \\ 0, & \text{otherwise} \end{cases}$$

is the membership function of $\tilde{A}$. Similarly, reciprocal fuzzy numbers $\tilde{A}^{-1} \approx (1/c, 1/b, 1/a)$ (Sarfaraz, Jenab, & D'Souza, 2012) and crisp
numbers can be expressed in a triangular form (Opricovic & Tzeng, 2003). Therefore, classical logic is enhanced through partial truth,
where a proposition is not true or false, but has a degree of truth instead (Zadeh, 1988). This is especially helpful when handling real world
problems, since human perception is often full of uncertainty and of imprecise or partial knowledge. Fuzzy logic allows, for instance, handling
linguistic variables as degrees (e.g., see Groumpos, 2010) or computing with words (Zadeh, 1996).
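For illustration, the triangular membership function defined above can be written as a few lines of R; the fuzzy number (2, 4, 7) used in the example is a hypothetical value, not one taken from the chapter.

# Minimal sketch of a triangular membership function; a, b, c are lower limit, peak, upper limit
tri_membership <- function(x, a, b, c) {
  ifelse(x < a | x > c, 0,
         ifelse(x <= b, (x - a) / (b - a), (c - x) / (c - b)))
}

# Example: membership degrees of several points in the fuzzy number (2, 4, 7)
tri_membership(c(1, 3, 4, 6, 8), a = 2, b = 4, c = 7)
# returns 0.00 0.50 1.00 0.33 0.00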
Kosko (1986) enhanced the traditional definition of cognitive maps (Tolman, 1948) with fuzzy logic to create FCM. Thus, FCMs are uncertainty-
extended enlargements of cognitive maps, able to handle vagueness or partial truth. Following Groumpos (2010), a FCM consists of concepts,
called nodes, and weighted edges connecting these nodes. Concepts (i.e., key factors or characteristics) can, for instance, be goals, states,
or outcomes with values usually in [0,1] or [-1,1]. The causal relationship between the concepts can be depicted as a matrix , where the entry
of stands for the weight of the respective edge, thus denoting the influence of the concept on the concept . In an FCM these can be positive
( increases ), negative ( decreases ) or zero (no relationship between and ) and they typically lie in [-1,1]. Since the edge is not
necessarily equal to the inverse relationship , the matrix is not symmetric in general. The diagonal entries of , describing the influence of a
concept on itself, may or may not be equal to zero (e.g., Boutalis, Kottas, & Christodoulou, 2009; Kosko, 1997).
Since an FCM describes a system with influencing concepts and changing values, it should evolve too. Following Salmeron (2012), consider the initial state vector $A(0) = (A_1(0), \ldots, A_n(0))$ describing the state of the concepts in time step zero. The value of each concept $C_i$ in time step $t$ can be computed iteratively by

$$A_i(t) = f\Big(k_1 \sum_{j \neq i} w_{ji} A_j(t-1) + k_2 A_i(t-1)\Big).$$

The function $f$ is an activation function to assure that the concept values stay in the respective interval, for instance a sigmoid function for $[0,1]$. $k_1$ and $k_2$ stand for proportions to weight the influence of the concepts on each other, as used by Groumpos (2010). The authors set $k_1 = k_2 = 1$ to keep the notation simple. The iteration has three possible outcomes, which depend on the initial state vector (Kosko, 1997). The state vector either reaches a fixed point, falls into a cyclic repetition of the same values, or proceeds chaotically in every further iteration step. For certain types of $f$, including the sigmoid above, there exist conditions on the matrix $W$ such that the iteration converges to a fixed point for every initial state vector (Boutalis et al., 2009). Figure 1 is an example of an FCM.
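A minimal sketch of this iteration in Python (using NumPy) is shown below, with a sigmoid activation and $k_1 = k_2 = 1$ as in the text; the three-concept weight matrix and its labels are hypothetical values chosen only to make the loop runnable.

```python
import numpy as np

def sigmoid(x, lam=1.0):
    """Squashes concept values into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-lam * x))

def run_fcm(W, a0, steps=50, tol=1e-5):
    """Iterate A_i(t) = f(sum_j w_ji * A_j(t-1) + A_i(t-1)) with k1 = k2 = 1.

    W[i, j] is the weight of the edge from concept i to concept j; the diagonal is zero here,
    so self-influence is carried only by the A_i(t-1) term.
    """
    a = np.asarray(a0, dtype=float)
    for _ in range(steps):
        a_next = sigmoid(W.T @ a + a)         # W.T @ a sums incoming influences w_ji * A_j
        if np.max(np.abs(a_next - a)) < tol:  # fixed point reached
            return a_next
        a = a_next
    return a                                   # may be cyclic or chaotic if no fixed point exists

# Hypothetical 3-concept FCM (e.g., "ad spend", "brand awareness", "sales")
W = np.array([[0.0, 0.6, 0.3],
              [0.0, 0.0, 0.7],
              [0.2, 0.0, 0.0]])
print(run_fcm(W, a0=[0.5, 0.2, 0.1]))
```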
Different FCMs can be aggregated if they consist of the same concepts. That is, the matrices describing the same concepts can be aggregated by summing up the related matrix entries and mapping them back to the interval $[-1,1]$ through a sigmoid function or the mean. In the case that the matrices consist of disjoint concepts, they are aggregated by building one large matrix with the individual matrices as blocks (e.g., see Kosko (1997), Groumpos (2010)).
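The two aggregation cases can be sketched as follows; the mean is used here for matrices over the same concepts (one of the two options mentioned above), and the weight values are hypothetical.

```python
import numpy as np
from scipy.linalg import block_diag

def aggregate_same_concepts(matrices):
    """Aggregate FCMs over the same concepts by averaging the edge weights (stays in [-1, 1])."""
    return np.mean(matrices, axis=0)

def aggregate_disjoint_concepts(matrices):
    """Aggregate FCMs over disjoint concepts as blocks of one larger matrix."""
    return block_diag(*matrices)

W1 = np.array([[0.0, 0.5], [-0.3, 0.0]])      # two experts, same two concepts
W2 = np.array([[0.0, 0.7], [-0.1, 0.0]])
print(aggregate_same_concepts([W1, W2]))

W3 = np.array([[0.0, 0.4], [0.2, 0.0]])       # a third FCM over two different concepts
print(aggregate_disjoint_concepts([W1, W3]))  # 4x4 block matrix
```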
Analytical Framework
To enhance the decision making process to optimize digital marketing strategies, the authors aim to create a cooperative DSS (Haettenschwiler,
2001). Since problems in marketing become more and more complex and solutions cannot be derived in a straightforward manner, MCDM methods are needed to make the best possible decisions (Lin, Lee, & Wu, 2009). The underlying idea is that decision making should be based on multiple criteria (Cheng, Li, & Yu, 2005). Clustering of information (i.e., a GrC approach) is useful for MCDM and can either be hierarchical or feature interdependencies and feedback, like a network. To produce a comprehensive framework, the analytic hierarchy process (AHP) and the ANP can be used as MCDM tools (Cheng et al., 2005).
The advantages of a hierarchical structure for MCDM are exploited in the AHP (Saaty, 2004). The first level is the goal of the decision, the lower levels are built from criteria and sub-criteria, and the bottom level presents the alternative solutions to reach the predefined goal. To find
the best alternative, the criteria are prioritized. According to Saaty (2008), this is done by comparing the elements pairwise and assigning an
absolute number from 1 to 9 (or reciprocals of these) to each pair. The higher the number, the more important is the first element for the chosen
criteria of the previous level compared to the other element. The inverse relationship, which is the reciprocal value, implies inferior importance of
the element compared to the other. 1 means they are both equally important to the criteria of the previous level, 9 denotes the highest possible
dominance of the element over the other, and 1/9 is the highest possible inferiority. These relationships are depicted in a matrix, in which the
entries are normalized by dividing the values of each column by their sum. To obtain the approximate priority weight of an element, the mean over its row is taken (for the advanced solution, see Saaty (2001)). This computation is repeated for every criterion in every
level except the bottom one containing the alternatives. To evaluate the alternatives in the bottom level, the product of each priority of the criteria
in this path is taken (i.e., the weights are multiplied with the weights of the criteria in the direct upper level). The highest score indicates the
(approximate) best alternative.
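The approximate priority computation described above (column normalization followed by row means) can be sketched in a few lines of Python; the 1-9 comparison values below are hypothetical.

```python
import numpy as np

def ahp_priorities(pairwise):
    """Approximate AHP priorities: normalize each column by its sum, then average each row.
    (Saaty's advanced solution uses the principal eigenvector instead.)"""
    A = np.asarray(pairwise, dtype=float)
    normalized = A / A.sum(axis=0)    # divide every column by its column sum
    return normalized.mean(axis=1)    # row means = approximate priority weights

# Hypothetical comparison of three criteria on Saaty's 1-9 scale (reciprocal matrix)
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
print(ahp_priorities(A))  # weights sum to 1, largest for the first criterion
```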
In the case of a network-like clustering, ANP can be used, which is a generalization of the AHP, also developed by Saaty (2001). In contrast to the
AHP, the criteria and elements of different levels (or on the same level) of the ANP no longer have to be independent and may include feedback.
That is, the AHP is a special case of the ANP if the ANP has a top-down structure. Saaty (2004) outlined 12 steps of the ANP: (1) Know the
problem you are facing. That includes everything, for instance the criteria or the possible outcomes. (2) For each of the four merits of the decision, Benefits, Opportunities, Costs, and Risks (BOCR), find control criteria and sub-criteria. Some (or all) of them might appear in all of the merits. Then obtain their priorities from paired comparisons as in the AHP. In this step, low-priority criteria may be eliminated (e.g., those with a global priority lower than 3%). (3) Organize the components of the network (the clusters and their elements, including the alternatives of the decision) and determine a set of these which are relevant to every control criterion. It is suggested to number the clusters and to organize them in a column. (4) Connect the clusters with arrows indicating the elements' influence on each other regarding one control criterion (i.e., establish one network for each control criterion). (5) To analyze the dependence of the clusters or elements, either regard how they influence the other components (suggested by Saaty (2004)) with respect to a control criterion or how they are influenced by them. Once decided, the same approach must be
taken for the entire decision. (6) Establish the supermatrix by arranging the numbered clusters (see point 3) vertically and horizontally. The
elements of the clusters are arranged as subcolumns and the entries are the priorities derived from the paired comparison in step 7. (7) Execute
paired comparisons of elements with respect to their inner dependence (i.e., compare to elements in the same cluster) and their outer
dependence (i.e., compare to elements of another cluster), according to the chosen control criterion. Then derive priority vectors from these comparisons as in the AHP. (8) Repeat this process on the level of the clusters and use the result to weight the corresponding subcolumn in the supermatrix. Clusters that do not influence one another are assigned a zero. This results in the weighted, column-stochastic supermatrix. (9) Raise the stochastic supermatrix to successive powers until the difference between two consecutive iterations is smaller than a predefined value, to compute the limit priorities. Whether the outcome is a limit or a cycling limit of the matrix depends on the type of matrix (see Saaty (2001)). (10) The limit priorities of the supermatrix can now be synthesized by multiplying each idealized limit vector with the weight of its control criterion and adding them up for each of the four merits. For each alternative one builds the ratio of benefits times opportunities to costs times risks, $\frac{B \cdot O}{C \cdot R}$, and takes the alternative with the highest ratio as the result. (11) Another way of synthesizing the priority vectors to the general solution is to use strategic criteria (their priorities) to rate the best-ranked alternative for each of the four merits. After normalizing these four ratings, the sum of the weighted costs and risks is subtracted from the sum of the weighted benefits and opportunities for each alternative. Again, the highest value describes the best alternative. (12) Ask some "what if" questions to test the stability of the final answer and to compute the compatibility index (see e.g., Saaty (2001)) for the original and each new outcome.
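Step 9, raising the weighted supermatrix to successive powers until it stabilizes, can be sketched as follows; the column-stochastic matrix is a hypothetical example, and cycling limits are not handled.

```python
import numpy as np

def limit_priorities(weighted_supermatrix, tol=1e-9, max_iter=10_000):
    """Raise a column-stochastic weighted supermatrix to successive powers until two
    consecutive powers differ by less than tol, returning the limit matrix.
    Cycling limits (possible for some matrix types, see Saaty (2001)) are not handled."""
    M = np.asarray(weighted_supermatrix, dtype=float)
    for _ in range(max_iter):
        M_next = M @ M                       # repeated squaring raises the power quickly
        if np.max(np.abs(M_next - M)) < tol:
            return M_next
        M = M_next
    return M

# Hypothetical 3x3 column-stochastic weighted supermatrix
W = np.array([[0.2, 0.5, 0.3],
              [0.5, 0.3, 0.4],
              [0.3, 0.2, 0.3]])
L = limit_priorities(W)
print(L[:, 0])  # limit priorities (all columns converge to the same vector here)
```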
To take into account imprecision and uncertainty, ANP is combined with fuzzy logic to create FANP. Fuzzy logic is a useful method to deal with
vagueness and uncertain information as it occurs in the interaction among stakeholders. Often, it is not possible to provide exact comparison
ratios because information is incomplete, the environment is complex and uncertain, and there are no appropriate measurement units and scale.
Thus, according to Mikhailov and Singh (2003), to express comparison ratios as fuzzy sets can help deal with uncertain judgments. Contrary to
the fuzzy AHP, FANP cannot handle fuzzy priorities since the matrix operations executed to derive priorities from the supermatrix are too
complex. Therefore, an approach that derives crisp priorities from the comparison matrices is needed. There exist several such approaches, for example Mikhailov's fuzzy preference programming method, which was later further developed by Wang and Chin (2011); Chang's (1996) extent analysis; or Opricovic and Tzeng's (2003) defuzzification approach. Following Opricovic and Tzeng (2003), the FANP approach used here is to defuzzify the fuzzy comparison matrices (i.e., matrices with triangular fuzzy numbers as entries) before computing the crisp priorities that are entered in the supermatrix. The chosen defuzzification algorithm maps triangular fuzzy numbers to the real numbers, as Opricovic and Tzeng (2003) proposed for fuzzy MCDM methods. Thus, following the lex parsimoniae, fuzzy information can be processed in a simple way.
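As a rough sketch of this defuzzification step, the snippet below maps a matrix of triangular fuzzy comparisons to a crisp matrix using a simple centroid (l + m + u)/3; this is a stand-in for, and not identical to, the defuzzification algorithm of Opricovic and Tzeng (2003), and the fuzzy entries are hypothetical.

```python
import numpy as np

def defuzzify_triangular(l, m, u):
    """Map a triangular fuzzy number (l, m, u) to a crisp value (simple centroid stand-in)."""
    return (l + m + u) / 3.0

def defuzzify_matrix(fuzzy_matrix):
    """Turn a matrix of triangular fuzzy comparisons into a crisp comparison matrix."""
    return np.array([[defuzzify_triangular(*entry) for entry in row] for row in fuzzy_matrix])

# Hypothetical 2x2 fuzzy comparison matrix with triangular entries (l, m, u)
F = [[(1, 1, 1), (2, 3, 4)],
     [(1/4, 1/3, 1/2), (1, 1, 1)]]
print(defuzzify_matrix(F))  # crisp matrix ready for the AHP/ANP priority computation
```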
To visualize the best possible solution, three main visualization methods are common: wireframes, mockups, and prototypes, which are often
used as a three-step process. In a first step, a wireframe, that is, a schematic representation of a computer interface that shows a product's functionality, features and content, is created (e.g., Angeles, 2014). In a further step, the wireframe is expanded with colors, images, typography and
graphics which leads the company to the development of a mockup (e.g., Hellberg, 2010). The last step is the building of a prototype which is
used to develop and test the design, by letting the involved parties test the content, interaction, and aesthetics (e.g., Hellberg, 2010; Garrett,
2011). These visualization techniques should indicate whether a proposed solution is really effective in achieving the defined goal or whether the FANP framework should be run again with modified data.
The Idea behind the FANP Framework
The presented approach is based on the concept of the intelligence amplification loop, which augments knowledge by hybrid learning that arises
through the interaction between computers and humans (Kaufmann, Portmann, & Fathi, 2012; Portmann, Kaufmann, & Graf, 2012). This
concept uses the in statu nascendi learning theory of connectivism as its basis (Siemens, 2006). The bidirectional continuous interaction between
computers and humans augments not only human but also machine intelligence as they learn and adapt through ongoing (inter-)actions and
dependencies (Kaufmann et al., 2012; Portmann et al., 2012). This idea of learning through interaction is captured and specified by the concept of
emergence. The interactions among stakeholders enhance knowledge through the intelligence amplification loop (i.e., knowledge emerges
bottom up), which is called grassroots growth. Since the presented approach concerns stakeholder management and is thus intended for
grassroots growth, emergence plays a key role. Furthermore, the Web should understand what customers want, which can be attained through
the Semantic Web. The main objective of the Semantic Web is the conversion of unstructured sources to structured information and thus to
create meaning and to enhance the cooperation between information systems and humans (Brooks, 2002; Grimes, 2011).
The authors aim to create a cooperative DSS (Haettenschwiler, 2001) to optimize digital marketing strategies and the cooperation between an
information system (i.e., DSS) and humans. In the following subchapters, the proposed FANP framework is explained in more detail,
accompanied by a marketing use case for didactic reasons.
The FANP Framework for Digital Marketing
The initial position is a company that, together with its clothing shops, operates an online shop offering numerous products all over the world.
The marketing officer wants to develop a new DSS to optimize decision making for digital marketing to acquire more new customers and to
satisfy existing ones.
The proposed FANP framework combines ideas from various research (e.g., Kaltenrieder, D’Onofrio, & Portmann, 2015; Kaltenrieder, Portmann,
Binggeli, & Myrach, 2015; Kaltenrieder, Portmann, D’Onofrio, & Finger, 2014; Portmann & Kaltenrieder, 2015).
In the marketing use case, the DSS process for digital marketing contains four steps: acquisition, processing, testing, and execution, as shown in
Figure 2:
The foundation of this DSS, specifically of steps 1 and 2 (i.e., Acquisition and Processing), is an underlying architecture. Figure 3 shows the architecture of the FANP framework with the three building blocks Acquisition, Representation, and Processing.
Acquisition
As illustrated in Figure 4, the first step of the DSS (i.e., the first building block) is the acquisition of data. Thus, the first main task is to collect all
case-relevant information. Information comes from three major sources: data entered manually into the system by the customer, customer databases (e.g., customer profiles, past purchases (i.e., the history)), and the Web. There are
different sources of Web data (e.g., networked sensors embedded in devices such as mobile phones, automobiles, and industrial machines).
Additionally, companies generate a large amount of “exhaust data” (i.e., data created as a by-product of other activities) by interacting with
individuals. Social media sites, mobile phones and laptops allow individuals to contribute to the creation of big data, by communicating, sharing,
searching, buying etc. (McKinsey Global Institute, 2011). Big data also arises from software logs, cameras, wireless networks etc. (Hellerstein,
2008). Manual data and data from databases can be processed easily. The challenges of big data arise because data stemming from the Web
causes an information overload (Fulgoni, 2013; Pophal, 2014; Leeflang et al., 2014). Thus, approaches are needed that structure such data, identify the relevant parts, and make it possible to optimize the decision-making process. A possible solution would be to combine
computational intelligence techniques with suitable big data databases (e.g., graph databases (cf., Robinson, Weber, & Eifrém, 2013)).
Figure 4. Acquisition
In this case, the acquisition is based on Portmann & Kaltenrieder (2015). The data sources vary, ranging from internal data to externally available data. The external data is crawled from the Web, while the internal data comes from internal databases, such as the consumer database, and can be acquired from multiple sources. One method is to acquire internal data through creativity techniques (Kaltenrieder et al., 2015), where experts discuss the challenges of digital marketing. Another method to gather internal data is through the company-internal big data
(e.g., customer reviews, customer profiles, and loyalty programs). The main data acquisition method will be data from previous purchases (i.e.,
history), where the direct feedback based on the applied marketing approaches is visible.
But to know which data is needed, companies first have to use stakeholder management to identify their stakeholders. A well-accepted definition
of a stakeholder is “any group or individual who can affect or is affected by the achievement of a company’s purpose” (Fiedler & Kirchgeorg, 2007,
p. 178), such as in the marketing use case to be competitive in the market segment of clothing. The idea of stakeholder management is to
formulate and implement processes which satisfy all and only those groups who have influence over the company. In this case, this means to
implement marketing processes to acquire new customers and to continuously satisfy the existing customers. The clothing company has to
manage and integrate the relationships and interests of their stakeholders (i.e., customers, employees, suppliers and other groups) in a way that
ensures the long-term success of the company (Freeman & McVea, 2001).
According to Prensky (2001) and Lehr (2011), there are in general three groups of digital users: digital natives, digital immigrants and silver
surfers. Digital natives grow up with digital technology, digital immigrants adopt digital technology later in their lives, and silver surfers, individuals around 60 years or older, get in touch with digital technologies later in life. These different dimensions of digital users help to cluster the stakeholders and thus to analyze their requirements, since their experience and practice with previously used systems shape what they expect.
A main stakeholder for a clothing company is obviously the customer. For digital marketing, it is important to follow customers during their
customer journey (Lemke et al., 2011), evolving from awareness or orientation on a product to purchasing and becoming loyal to this product. In
order to find users and acquire them as customers of one's own company, it is crucial to react efficiently and in a targeted manner at touch-points.
According to Clatworthy (2011) “touch-points are the points of contact between a service provider and customers”. Therefore, data about the
customer has to be collected when he looks for information, makes comparisons, and buys a product (Leeflang et al., 2014). This is made possible
because blogs, product ratings, discussion groups etc. show how customers collect and use information for making decisions, for shopping, and in
their behavior after the purchase (Onishi & Manchanda, 2012). To have a closer look at the relevant data, the clothing online shop is considered
once more. Data that is easily accessible and manageable includes information about the customer, delivery information (indicating the regions
in which the marketer operates), product information, price information, and delivery and payment conditions. Information about the customer
consists of profile information (e.g., name, age, address, e-mail address, telephone number) and information about past transactions (e.g.,
past purchases, past communications with the company (i.e., the history)). The product information contains data about the type of clothing (e.g.,
dress, trousers, elegant, casual, size, color, cut, etc.). Information from the Web comes from fashion blogs and social media sites in general (“likes”
etc.), product reviews, fashion shows, magazines, competitors and competing products (what are other companies doing better?). Fashion trends
(e.g., do customers like tight or wide jeans? What is presented at fashion shows?), seasonal trends, color trends, preferences of the customers,
clothes that are worn by stars, the general market situation and many more are all relevant information to the marketer.
After analyzing all information from the various sources and picking out the most relevant ones, a customer record card can be created for a
specific customer. Table 1 shows an example for a customer: Monica, a 43-year-old mother working in human resources, is presented here as a
customer of the clothing company.
Table 1. An example of a customer
Monica Customer (digital immigrant)
Demographics
Gender: Female
Age: 43
Living Place: Bern
Family: Husband and 2 Children (both female: 1992, 1995)
Education: University Degree
Job: Human Resources
Working Place: Zurich
Income: More than average
Hobbies: Yoga, Reading, Cooking
Attitudes: Liberal
Anticipated Behavior
After the company has identified which stakeholders are relevant to be competitive in the clothing market and created customer record cards,
requirements engineering is used to elicit their needs and preferences. A main question is how to find out what stakeholders, especially
customers, really need (Goguen & Linde, 1993). The important information can be elicited from customer databases (i.e., from customers’ profiles and reviews, and previous purchases), from creativity techniques (Kaltenrieder et al., 2015), or from customer needs and preferences expressed on the
Web. There exist various techniques of requirements elicitation: interviews, questionnaires, user observation, scenarios and many more as well as
creative techniques like mind-mapping, brainstorming, and workshops. Creative techniques like mind-mapping or brainstorming are useful at
the beginning of the requirements elicitation process, to determine in which direction the elicitation of requirements should lead. In the case of
Monica, it is obvious to first clarify whether she buys clothes only for herself or also for her family. This can be done with the help of observation, either by looking at the history of her profile or by following her customer journey. To keep the example simple, Monica only buys clothes for herself. Based on this information, creative techniques can be used to find out, for example, what kind of products can be offered to her. After listing all possible products, a further step is to find out which products could be sold to Monica. Obviously, there exist a lot of requirements elicitation
techniques and the question is, which one is the most appropriate. Questionnaires are widely used methods for requirements elicitation.
However, the relevant questions have to be determined in advance and it has to be made sure that they are correctly understood, which can be
difficult. To avoid the difficulties of questionnaires, open ended interviews can be used. In an open ended interview, the interviewer influences
the interview to a lesser extent, because he simply asks a question and lets the respondent answer as he wishes. Nevertheless, to find adequate
questions to get valid responses remains a crucial but challenging task. A widely used practice in marketing research that allows more natural
interactions are focus groups, for which a group of people discusses a specified topic. This method can have some drawbacks if the participants
are not experts on the topic (Goguen & Linde, 1993). In this chapter’s example, focus groups do not seem to be an appropriate method, because the main stakeholder group are the customers. They are not experts, the marketer wants to gather as much information (i.e., requirements) about as many customers as possible in a short time, and it is not possible for the customers to meet in a group. Thus, some form of
interview would seem to be more appropriate. There is some empirical evidence that interviews are highly effective techniques for requirements
elicitation. This is especially true for structured interviews (Davis, Dieste, Hickey, Juristo, & Moreno, 2006). This would speak in favor of
questionnaires, which are more structured than other forms of interviews. Additionally, since many forms of interviews take a lot of time (e.g.,
open-ended interviews), presenting a questionnaire to Monica would be more appropriate to assess the requirements. This questionnaire can
also be presented to other customers and thus information from many customers can be collected in an easy and efficient way. Thus, an overall
picture of the requirements can be obtained and the online shop can be created and optimized by taking the various requirements into account.
According to Goguen & Linde (1993), once the need is expressed, the company tries to identify what kind of products and services its online shop
should have to meet these. But because customers may change their minds once they see other options more clearly, it is important to continuously elicit the stakeholders’ requirements. The requirements may change every day, and it takes time to gain knowledge about
what the involved stakeholders really want. There are good reasons why customers often do not, or cannot, know exactly what they need; they are
always confronted with new things that seem to be better than the old ones. The company has to be ready to react immediately to a change of
requirements. As already mentioned in the background, there exist functional and non-functional requirements.
Referring to functional requirements and to the defined stakeholders groups, digital natives require instant provision of information, which
should be aggregated and categorized (i.e., personalized) (Ryan & Jones, 2009). Digital immigrants and silver surfers have the same
requirements, but to a lesser extent. From the view of customers the online shop should surely have different pages to click on to look at the
different products. It is desirable that these products are categorized in a structured manner and described with all relevant product information
(i.e., price, size, color etc.) to facilitate the information search. Furthermore, a shopping basket, a purchase order posting and among other things
a method to contact an employee of the clothing shop are important functions which should be provided.
In addition, the customers have non-functional requirements (i.e., they wish for functionality, operability, ease of use, and security). This is especially true for the silver surfers (cf., Chung & do Prado Leite, 2009; Hunke & Reidl, 2011; Lehr, 2011). Moreover, the silver surfers seek convenience and competence (Hunke & Reidl, 2011) to a higher degree than digital natives and immigrants. In general, all customers are
concerned about product safety, proper information disclosure (Maignan & Ferrell, 2004), consumer rights and digital privacy (Ashworth & Free,
2006).
Summing up, the system should be reliable (i.e., information should be accurate and complete), efficient, functional, usable (e.g., aesthetic,
consistent), secure, and its performance should be good (i.e., concerning speed, efficiency and fast response time). Additionally, from the point of
view of a company, the system should be testable, extensible, maintainable, compatible, and it should be adapted, serviced, installed and localized
easily for the own business (Chung & do Prado Leite, 2009). In contrast, for the employee, ease of use, usability of tools, digital privacy, product
safety, proper information disclosure (based on Sadri & Clarke Bowen, 2011) are important. To conclude, the system has to provide proper
information disclosure, clear information and high usability. The platform has to work properly and to help to disseminate information. It has to
be failure-resistant, secure and to enable updates. There have to be possible interfaces to other systems (integration) (cf., Chung & do Prado
Leite, 2009).
To get more specific the stakeholder analysis showed that Monica is one of the relevant stakeholders for this clothing company and she has the
following characteristics: She loves to go shopping, but does not have the time to go into the retail shops. Thus, she prefers to buy her clothes and
accessories on the Web. Her functional requirements are that the online shop has the functions to search products, to present (i.e., visualize)
them, enable her to put them into the shopping basket, and to reserve them for a specific time (with cookies (Bewersdorff, 2014) determining how long the chosen products will stay in the shopping basket). It should allow her to accomplish all relevant steps of the order, such as payment
and delivery submission and confirmation, in a clear and easy way and in a final step to confirm the order. Especially, there should be several
payment options. Additionally, the online shop should make it easy to get information if there is a problem or a lack of clarity, and it should offer the possibility to give reviews. Because of Monica’s working experience, she knows how important it is to give reviews. The online shop
should provide the possibility to evaluate the bought products and the corresponding service. Additionally, there should be the possibility to
contact the company through a contact form, a telephone number or an e-mail address or even a voice-mail. Since Monica’s time is limited and
she does not want to waste her time, the online shop should provide her with help to find the right products and especially the right size. Another
feature that Monica would appreciate, is to be directed automatically to products that match her previous purchases (Spiller & Lohse,
1997/1998). Furthermore, the online shop should fulfill the non-functional requirements: be user-friendly, intuitive, self-explanatory, reliable,
efficient and secure. Finally, Monica wants to have a good experience with the clothing online shop. She does not want to waste energy searching for a specific kind of clothing which fits her (i.e., right size, color and cut) or to be disturbed by advertisements for products she is not interested in buying. Instead, she would appreciate personalized advertisements, because they would enable her to save time when searching for what
she is looking for. Thus, it is essential to know her wishes and preferences. This is the reason why it is crucial to use requirements engineering in
an efficient way for the company.
Of course, there are more customers than only Monica, and many other stakeholders are important for the clothing company. Thus, there are a lot of customer record cards to create; to keep track of all of them, GrC is helpful, and with FCMs it can be shown to which degree a specific customer belongs to a certain defined cluster. Thus, the second building block in Figure 2 acts as the information and knowledge
representation for the architecture. The representation is also primarily based on Portmann and Kaltenrieder (2015). FCMs are used as a main
representation method. Graph databases (e.g., Neo4j (www.neo4j.com)) will be used to store the information. The information from the FCMs
will then serve as input for the next building block. Unfortunately, data are not always precise and clearly defined. Some customers cannot be assigned perfectly to one cluster; rather, they belong to several clusters simultaneously to a certain degree (i.e., determined by a membership function as in the section Fuzzy Logic and FCMs). Figure 5 shows a simple example of this: Based on the criterion “visiting time”, the clothing company wants to find out at which time Monica mostly buys clothes, in order to know when to utilize personalized advertisement. According to her manually entered information, her preferred buying time is midday. However, her history (i.e., previous purchases) shows that she also bought clothes at other times.
Figure 5. Clustering FCM
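A minimal sketch of how such graded cluster membership could be computed for the visiting-time example is given below; the cluster boundaries (hours of the day) and the purchase times are hypothetical.

```python
# Hypothetical visiting-time clusters as triangular fuzzy sets over the hour of the day
CLUSTERS = {
    "morning": (5, 9, 13),
    "midday":  (10, 12.5, 15),
    "evening": (14, 19, 23),
}

def triangular_membership(x, l, m, u):
    """Membership degree of x in the triangular fuzzy number (l, m, u)."""
    if x == m:
        return 1.0
    if x <= l or x >= u:
        return 0.0
    return (x - l) / (m - l) if x < m else (u - x) / (u - m)

def cluster_memberships(hour):
    """Degree to which a purchase at a given hour belongs to each visiting-time cluster."""
    return {name: round(triangular_membership(hour, *lmu), 2) for name, lmu in CLUSTERS.items()}

# Monica states she buys at midday, but her history also shows an 18:00 purchase
print(cluster_memberships(12.5))  # clearly "midday"
print(cluster_memberships(18.0))  # partly "evening", not "midday"
```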
Fuzzy set theory (Zadeh, 1965), fuzzy logic (Zadeh, 1988) and information granulation (Zadeh, 1979) are ways to deal with uncertain information
and big data (Yao, 2005; cf., Portmann & Kaltenrieder, 2015). The FANP framework, explained in the next section, introduces fuzzy logic to meet
all kinds of requirements, even those that are ill-defined.
Processing
The second step (i.e., third building block) is to process the data (i.e., to make marketing based decisions adapted from existing information). To
find the best possible decision in this uncertain and imprecise environment, the incoming data from the FCMs is processed by the MCDM
method. FANP is used, where all available data is taken into account and complemented with the information gained through the personas, and
the most suitable solution for the customer is calculated. After processing the data, as shown in Figure 6, the FANP is feeding the data into a
database where all historic decisions and information are stored. This information is also fed back into the FANP to enhance the available
information. The processed information from the FANP (i.e., the processed decisions) can then be used for personalized marketing.
Figure 6. Processing
Functionality of the FANP
This subsection shows the functionality of the FANP, which is built according to the FANP process flow. Figure 7 illustrates the first
three processing steps of the FANP.
Figure 7. First steps of the FANP
In a first step, the main goal has to be identified. Additionally, the four merits (BOCR) have to be defined. In a second step, individual control criteria have to be determined for all these merits. Figure 7 shows this second step for the merit Benefits. The other three merits can have similar or different control criteria, but their processing is similar to that of Benefits. Therefore, Benefits is shown as an example. In the third step, cluster networks are built for every control criterion. It is important that the used alternatives are the same for all control criteria. As specified before, all control criteria are processed similarly to control criterion 1; therefore, only control criterion 1 is shown. A comparison matrix is built for every element in every cluster. Figure 8 shows the triangular fuzzy numbers that are used to represent the values in the Comparison Matrix.
In a next phase the Comparison Matrix is defuzzified (see Background), resulting in a Crisp Matrix as shown in Figure 9. After the defuzzification,
the values will be processed with the AHP (see Background).
Figure 9. Crisp Matrix of the FANP
The results of the AHP, shown in Figure 10, are priority vectors. The priority vectors are calculated for all clusters under a control
criterion. Inserted into the supermatrix these priority vectors are processed as in the classical ANP (i.e., the weighted supermatrix is raised to a
sufficiently high power to derive the limit priority of this control criterion).
Figure 10. Supermatrix of the FANP
The limit priority vectors of all control criteria are then synthesized for each merit (e.g., Benefits), and for each alternative the ratio $\frac{B \cdot O}{C \cdot R}$ is built (see step 10 of the ANP); the alternative with the highest ratio is taken as the result.
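Assuming the BO/(CR) ratio of step 10, the synthesis across the four merits can be sketched as follows; the merit scores for the three alternatives of the use case are hypothetical.

```python
# Hypothetical synthesized merit scores per alternative
# (higher is better for B and O, higher is worse for C and R)
scores = {
    "personalized e-mail": {"B": 0.45, "O": 0.30, "C": 0.25, "R": 0.20},
    "coupon":              {"B": 0.30, "O": 0.25, "C": 0.35, "R": 0.30},
    "pop-up advertising":  {"B": 0.25, "O": 0.45, "C": 0.40, "R": 0.50},
}

def bocr_ratio(s):
    """BO/(CR) ratio from step 10; the alternative with the highest ratio wins."""
    return (s["B"] * s["O"]) / (s["C"] * s["R"])

print({alt: round(bocr_ratio(s), 2) for alt, s in scores.items()})
print("best alternative:", max(scores, key=lambda alt: bocr_ratio(scores[alt])))
```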
To adapt the functionality of FANP to the specific case of Monica, the clothing company first has to identify their goals. One of them is to increase
the click rate of its clothing online shop. Afterwards the four merits are defined. Based on the goal and the merits, the corresponding control
criteria are determined. In this case, there would be many criteria. To keep it simple, the authors assume only three criteria: costs, visiting time, and visualization. Based on the goal, possible alternatives are: personalized e-mail, coupon, and pop-up advertising. There would be
more alternatives, but to keep it simple, only three are mentioned. In the sub-network for costs there are for instance the following clusters: legal
aspects, alternatives, customers, measurability and implementation. These clusters (with the contained elements) are then processed by the
above described FANP (i.e., (fuzzy) comparisons regarding the control criteria costs are made between the elements of these clusters). These
matrices are defuzzified and processed with the result that the personalized e-mail is the best alternative regarding the merit Costs. After the
computation of the priority vectors of the other merits, the merit Opportunities yields that pop-ups are the best solution to increase the click
rate, while the merits Benefits and Risks support the statement that personalized e-mail is the best alternative.
Testing
The third phase is to test the proposed solution. This is achieved by using visualization methods with different levels of detail to check if the
proposed solution matches the identified requirements of the customers. The idea is to present the framework using visualization methods in a
user-friendly way so that everyone can apply it and to analyze in a further step if all participants get the same outcome as the one the FANP
framework suggests for the company. In other words, market analysis can be conducted with the different visualization methods. In case the
results differ immensely, it is necessary that the company continuously conducts requirements analysis and engineering to optimize the decision-
making process. To allow the FANP framework to find the solution that fits the requirements of all stakeholders, an iterative process is needed.
This means that it is necessary to go back to the second step many times (i.e., to the processing part to modify the data or to put new data into the
FANP framework), to go through the whole FANP framework and to test the outcome again until all stakeholders and relevant information are
known. Following the three-step process of the visualization methods, the FANP framework is tested first with a wireframe, then with a mockup
and finally with a prototype.
Testing with Wireframes
A wireframe is a schematic representation of a computer interface that shows a product’s functionality, features and content. It should not
represent the visual design of the product, but should be kept as simple as possible (Angeles, 2014). Wireframes do not include colors, styling or
graphics and they are created iteratively in software, on paper or materials such as whiteboards. They show paths between pages of a website and
they prioritize content by determining the amount of space and the location of an item. Their advantage is that they can be created and changed
quickly and easily (Lynch & Horton, 2009; “Wireframing,” 2004). Additionally, they build on the conceptual structure and at the same time lead toward the surface design. They integrate the interface design, the navigation design, and the information design (Garrett, 2011). Two types of
wireframes can be distinguished, low- and high-fidelity wireframes. Low-fidelity wireframes can be developed quickly and are more abstract.
They support the communication within the team. Since high-fidelity wireframes are more detailed, including information about each item, they
can be used for documentation (“Wireframing,” 2014). The lowest level of fidelity is attained by using simple sketches. Other forms of wireframes
are page zones, detailed wireframes and flow diagrams. Page zones roughly represent contents within one view through boxes and labels.
Detailed wireframes contain more detail, like labels on interface elements, sample content, media and illustration of flow. Flow diagrams indicate
flow within the view or between views through lines and arrows (Angeles, 2014).
In the example, first, all steps of the FANP framework are visualized with the help of wireframes, to show the participants how the framework would look and how it would work. Additionally, the wireframe will present all used data and illustrate how the example is calculated. In the case of Monica, a wireframe for the proposed FANP framework has to be created that would provide a personalized e-mail as the result. This framework has to be discussed with experts and pretest participants to verify that the chosen data and the applied steps make sense, or to find out whether they would change, delete or add something. Thus, this wireframe has to present the intended functionality of the framework to the test audience. The audience’s feedback either leads to a re-evaluation with another set of wireframes or to the next step of testing (i.e., the mockup).
Testing with Mockups
Based on the structure of the wireframe, a mockup is created as the next step. The aim is to develop a preliminary design with single visualized
elements. Thus, colors, images, typography and graphics can be used. Mockups include more details than wireframes and although not
representing the final design, they should provide look and feel of the finished website (Hellberg, 2010). They are designed to get feedback early
in the design process (“Mockups,” n.d.) and are changed until all relevant parties are satisfied (Reimer, 2011). To create a mockup, conventional
graphics programs like Photoshop can be used (Hellberg, 2010). As the testing participants and experts agreed on the presented wireframes
before, for the presented example, a mockup for the FANP framework has to be prepared and then discussed with the testing audience. Once
again, the result is the personalized e-mail. The FANP framework is presented with the same data and steps as before, but with more details and
closer to reality. As in the first step, the feedback either leads to another round of mockup testing or to the third step in testing.
Testing with Prototypes
After having analyzed the steps of the FANP framework via wireframes and in a further step via mockups, the next step is building a prototype in
the design process (Hellberg, 2010). It is used to develop and test the design (i.e., the FANP framework), in that the involved parties can test the
content, interaction, and aesthetics. There are low- and high-fidelity prototypes, from rough sketches on paper to click-through prototypes that
resemble the finished product (Garrett, 2011). High-fidelity prototypes are more similar to the final product and have the same techniques and
appearance, but their production is more expensive and time-consuming compared to low-fidelity prototypes (Walker, Takayama, & Landay,
2002). Low-fidelity prototypes are developed using low-fidelity-materials like paper or user friendly programming tools (Sefelin, Tscheligi, &
Giller, 2003). Walker et al. (2002) compare user testing with low- and high-fidelity web prototypes in computer and paper media. They find that
both prototypes uncover usability issues equally well, which suggests that low-fidelity prototypes should be used, because they are less costly and allow the team to stay focused on information architecture and interaction design. Concerning the used medium, they find that both paper and computer
media are well-suited for user-testing. The advantages of computer prototypes are the easy recording and distribution of tests and documentation
of the design process. Paper prototyping makes participatory design easy and enables exploratory and dynamic testing. Rettig (1994) sees
additional benefits of low-fidelity paper prototyping in the fact that it can be applied very early in the process, brings results fast, and allows more ideas to be investigated than high-fidelity prototypes. However, respondents seem to prefer computer prototypes, which could be an argument in favor of using computer-based prototypes. Nevertheless, paper prototypes have many advantages as well (cf., also Rettig, 1994; Walker et al.,
2002; Sefelin et al., 2003).
This last and most important testing step would result in a functioning prototype of the FANP framework, according to the level of fidelity. But in
a first step, the authors prefer to test the FANP framework using a paper prototype, to receive fast results and to allow more ideas to be investigated in a subsequent optimized prototype (i.e., a computer-based prototype) (Rettig, 1994). Furthermore, the outcome of the prototype FANP framework cannot be predicted, since the two previous steps of testing would first have to be performed.
Execution
The last step of the process, shown in Figure 12, presupposes that the proposed framework has passed the testing phase. The framework is then implemented and integrated into the digital marketing process to support the company’s strategy.
Figure 12. Execution
FUTURE RESEARCH DIRECTIONS
The proposed FANP framework for an improved digital marketing (i.e., personalized marketing) is the main topic for proposed future research
directions. This section describes planned and additional future research directions.
1. Storage of FCMs: Comparison of graph databases to find the most suitable database. A possible research topic is the enhancement of FCMs and their storage in graph databases to enhance knowledge representation. Thereby, FCMs can be used to visualize stakeholder relationships in digital marketing.
2. MCDM through FANP: Detailed comparison of existing FANP approaches to define the most suitable FANP approach for digital marketing. A possible research topic is the enhancement of FANP and its processing to act as an MCDM method. There are existing FANP approaches using various methods to calculate the most suitable alternative. A detailed comparison is needed to continue the development of the FANP framework.
3. Requirements Engineering: Analysis of requirements to develop and specify the functional and non-functional requirements and the technical specification to process FANP for digital marketing.
4. Testing of the FANP Framework: Once functional prototypes are developed, the entire FANP framework can be tested and enhanced for digital marketing. A possible research topic is the enhancement of the entire framework and its potential for digital marketing. Then, the framework has to be implemented and tested in a digital marketing environment.
   a. Analysis techniques
CONCLUSION
This chapter introduced a FANP framework to enhance digital marketing in the field of personalized marketing. Following an introduction to the topic, the theoretical background for the chapter offers an insight into the applied concepts. The proposed FANP framework, including different
concepts (e.g., stakeholder management, requirements engineering, FCMs etc.) acts as a MCDM method. The digital marketing process (i.e., all
process steps), the architecture and especially the functionality of the FANP give an insight into the applied framework. A customer (i.e., Monica)
was used as an example for didactical reasons. Besides various strengths (e.g., the application of fuzzy logic in digital marketing), limitations of the
FANP framework are among other things privacy concerns of customers which could jeopardize the whole framework, lack of compatibility
between the framework architecture and the system of a company, missing availability of relevant data, possible negative influence of the
framework on the appearance and reputation of a company (e.g., because customers might not appreciate the fact that their personal data is
collected and used for marketing purposes) or the fact that the development of the framework is in an early stage. At present, the FANP
framework is theoretically elaborated. Further theoretical research should enhance the framework and thereby minimize the current limitations.
Once the FANP framework is tested using the proposed visualization methods, specific customer concerns and inputs will be taken into account
for the further development of the framework. As a next step, the ongoing requirements engineering will be concluded and a first prototype will
be built to advance the implementation of the FANP framework.
This work was previously published in Fuzzy Optimization and Multi-Criteria Decision Making in Digital Marketing, edited by Anil Kumar and Manoj Kumar Dash, pages 202-232, copyright year 2016 by Business Science Reference (an imprint of IGI Global).
REFERENCES
Alsop, R., & Heinsohn, N. (2005). Measuring empowerment in practice: Structuring analysis and framing indicators. World Bank Policy
Research Working Paper, 3510, 1-123.
Ashworth, L., & Free, C. (2006). Marketing dataveillance and digital privacy: Using theories of justice to understand consumers' online privacy
concern. Journal of Business Ethics , 67(2), 107–123. doi:10.1007/s10551-006-9007-7
Bewersdorff, J. (2014). Cookies: Manchmal schwer verdaulich. In Objektorientierte Programmierung mit JavaScript (pp. 263–264). Wiesbaden, Germany: Springer. doi:10.1007/978-3-658-05444-1_30
Boutalis, Y., Kottas, T. L., & Christodoulou, M. (2009). Adaptive estimation of fuzzy cognitive maps with proven stability and parameter
convergence. IEEE Transactions on Fuzzy Systems ,17(4), 874–889. doi:10.1109/TFUZZ.2009.2017519
Brooks, T. A. (2002). The semantic web, universalist ambition and some lessons from librarianship. Information Research, 7(4), 7–4.
Burge, J. E. (2001). Knowledge elicitation tool classification. Artificial Intelligence Research Group, Worcester Polytechnic Institute.
Chang, D. Y. (1996). Applications of the extent analysis method on fuzzy AHP. European Journal of Operational Research , 95(3), 649–655.
doi:10.1016/0377-2217(95)00300-2
Cheng, E. W. L., Li, H., & Yu, L. (2005). The analytic network process (ANP) approach to location selection: A shopping mall
illustration. Construction Innovation , 5(2), 83–97. doi:10.1108/14714170510815195
Chung, L., & do Prado Leite, J. C. S. (2009). On non-functional requirements in software engineering . In Conceptual modeling: Foundations and
applications (pp. 363–379). Berlin Heidelberg, Germany: Springer. doi:10.1007/978-3-642-02463-4_19
Clatworthy, S. (2011). Service innovation through touch-points: Development of an innovation toolkit for the first stages of new service
development. International Journal of Design , 5(2), 15–28.
Davenport, T. H., Barth, P., & Bean, R. (2012). How “big data” is different. MIT Sloan Management Review , 54(1), 22–24.
Davis A. Dieste O. Hickey A. Juristo N. Moreno A. M. (2006). Effectiveness of requirements elicitation techniques: Empirical results derived from
a systematic review. In Proceedings of the 14thIEEE International Requirements Engineering Conference (RE’06). Minneapolis, MN: IEEE.
10.1109/RE.2006.17
Dimitrov, V., & Russell, D. (1994). The fuzziness of communication: A catalyst for seeking consensus. In L. Fell, D. Russell, & A. Stewart
(Eds.), Seized by agreement, swamped by understanding. Retrieved October 8, 2014, from
http://www.univie.ac.at/constructivism/pub/seized/fuzcom.html
Fiedler, L., & Kirchgeorg, M. (2007). The role concept in corporate branding and stakeholder management reconsidered: Are stakeholder groups
really different? Corporate Reputation Review, 10(3), 177–188. doi:10.1057/palgrave.crr.1550050
Fulgoni, G. (2013). Big data: Friend or foe of digital advertising? Five ways marketers should use digital big data to their advantage. Journal of Advertising Research, 53(4), 372–376. doi:10.2501/JAR-53-4-372-376
Garrett, J. J. (2011). The elements of user experience: User-centered design for the web and beyond . Berkeley, CA: Pearson Education.
Goguen, J. A., & Linde, C. (1993). Techniques for requirements elicitation. RE, 93, 152-164.
Grimes, S. (2011). 12 things the semantic web should know about content analytics. Open Text: The Content Experts. Retrieved September 29,
2014, from http://semanticnavigation.opentext.com/wp-content/uploads/2011/06/12-Things.pdf
Groumpos, P. P. (2010). Fuzzy cognitive maps: Basic theories and their application to complex systems . In Glyka, M. (Ed.), Fuzzy cognitive
maps: Advances in theory, methodologies, tools and applications (pp. 1–22). Berlin: Springer. doi:10.1007/978-3-642-03220-2_1
Hellerstein, J. (2008). Parallel programming in the age of big data. Gigaom Blog. Retrieved September 22, 2014, from
https://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/
Hoppe, A., Nicolle, C., & Roxin, A. (2013). Automatic ontology-based user profile learning from heterogeneous web resources in a big data
context. Proceedings of the VLDB Endowment , 6(12), 1428–1433. doi:10.14778/2536274.2536330
Hunke, G., & Reidl, A. (2011). Quo vadis 55plus marketing? In Hunke, G. (Ed.), Best Practice Modelle im 55plus Marketing (pp. 32–38).
Wiesbaden, Germany: Gabler Verlag. doi:10.1007/978-3-8349-6402-1_2
International Business Machines Corporation. (2011). From stretched to strengthened: Insights from the global chief marketing officer study.
Executive Summary. IBM CMO Csuite studies . Somers, NY: IBM Global Business Services.
Kaisler S. Armour F. Espinosa J. A. (2014). Introduction to big data: Challenges, opportunities and realities minitrack. In Proceedings of the 2014
47th Hawaii International Conference on System Sciences (HICSS) (pp. 728-728). Big Island, HI: IEEE. 10.1109/HICSS.2014.97
Kaltenrieder, P., Portmann, E., Binggeli, N., & Myrach, T. (2015). A conceptual model to combine creativity techniques with fuzzy cognitive maps
for enhanced knowledge management . In Fathi, M. (Ed.), Integrated systems: Innovations and applications (pp. 131–146). Berlin: Springer.
doi:10.1007/978-3-319-15898-3_8
Kaufmann M. Portmann E. Fathi M. (2012). A concept of semantics extraction from web data by induction of fuzzy ontologies. In Proceedings of
the 2013 IEEE International Conference on Electro/Information Technology (EIT) (pp. 1-6). New York, NY: IEEE.
Kim, S. J. (2008). Viewpoint-the long tail of media-A framework for Advertising in the Digital Age. Journal of Advertising Research, 48(3), 310–
312. doi:10.2501/S0021849908080367
Kosko, B. (1986). Fuzzy cognitive maps. International Journal of Man-Machine Studies , 24(1), 65–75. doi:10.1016/S0020-7373(86)80040-2
Leeflang, P. S. H., Verhoef, P. C., Dahlström, P., & Freundt, T. (2014). Challenges and solutions for marketing in a digital era.European
Management Journal , 32(1), 1–12. doi:10.1016/j.emj.2013.12.001
Lehr, U. (2011). Demografischer Wandel – Herausforderungen auch für Kommune, Wirtschaft und Handel . In Hunke, G. (Ed.),Best Practice
Modelle im 55plus Marketing (pp. 13–31). Wiesbaden, Germany: Gabler Verlag. doi:10.1007/978-3-8349-6402-1_1
Lemke, F., Clark, M., & Wilson, H. (2011). Customer experience quality: An exploration in business and consumer contexts using repertory grid
technique. Journal of the Academy of Marketing Science , 39(6), 846–869. doi:10.1007/s11747-010-0219-0
Lin, C. T., Lee, C., & Wu, C. S. (2009). Optimizing a marketing expert decision process for the private hotel. Expert Systems with
Applications , 36(3), 5613–5619. doi:10.1016/j.eswa.2008.06.113
Lin, T. Y. (2003). Granular computing . In Wang, G., Liu, Q., Yao, Y., & Skowron, A. (Eds.), Rough sets, fuzzy sets, data mining, and granular
computing (pp. 16–24). Berlin: Springer. doi:10.1007/3-540-39205-X_3
Lukasiewicz, J. (1920). On 3-valued logic . In McCall, S. (Ed.),Polish Logic (pp. 16–18). Oxford, UK: Clarendon P.
Luttrell, C., Quiroz, S., Scrutton, C., & Bird, K. (2009). Understanding and operationalising empowerment. London, UK: Overseas Development
Institute.
Lynch, P. J., & Horton, S. (2009). Web style guide: Basic design principles for creating web sites. London, UK: Yale University Press. Retrieved
October 8, 2014, from http://webstyleguide.com/wsg3/index.html
Maignan, I., & Ferrell, O. C. (2004). Corporate social responsibility and marketing: An integrative framework. Journal of the Academy of
Marketing Science , 32(1), 3–19. doi:10.1177/0092070303258971
McKinsey Global Institute. (2011). Big Data: The next frontier for innovation, competition, and productivity. Technical report . McKinsey Global
Institute.
Mikhailov, L., & Singh, M. G. (2003). Fuzzy analytic network process and its application to the development of decision support systems. IEEE
Transactions on Systems, Man and Cybernetics. Part C, Applications and Reviews , 33(1), 33–41. doi:10.1109/TSMCC.2003.809354
Mulhern, F. (2009). Integrated marketing communications: From media channels to digital connectivity. Journal of Marketing
Communications , 5(2-3), 85–101. doi:10.1080/13527260902757506
Onishi, H., & Manchanda, P. (2012). Marketing activity, blogging and sales. International Journal of Research in Marketing , 29(3), 221–234.
doi:10.1016/j.ijresmar.2011.11.003
Opricovic, S., & Tzeng, G. H. (2003). Defuzzification within a multicriteria decision model. International Journal of Uncertainty, Fuzziness and
Knowledge-based Systems , 11(05), 635–652. doi:10.1142/S0218488503002387
Pedrycz, W. (2013). Granular computing: Analysis and design of intelligent systems . Boca Raton, FL: CRC Press. doi:10.1201/b14862
Pophal, L. (2014). Digital advertising trends you need to know. EContent, June 2014. Retrieved September 22, 2014, from
http://www.econtentmag.com/Articles/Editorial/Feature/Digital-Advertising-Trends-You-Need-to-Know-97161.htm
Portmann, E. (2013). The FORA framework: A fuzzy grassroots ontology for online reputation management . Berlin: Springer. doi:10.1007/978-
3-642-33233-3
Portmann, E., & Kaltenrieder, P. (2015). The Web KnowARR framework: Orchestrating computational intelligence with graph databases. In W.
Pedrycz & S.-M. Chen (Eds.), Information granularity, big data, and computational intelligence (pp. 325-346). Berlin: Springer.
Portmann E. Kaufmann M. A. Graf C. (2012). A distributed, semiotic-inductive, and human-oriented approach to web-scale knowledge retrieval.
In Proceedings of the 2012 international workshop on Web-scale knowledge representation, retrieval and reasoning (pp. 1-8). New York, NY:
ACM. 10.1145/2389656.2389658
Portmann, E., & Thiessen, A. (2013). Web 3.0 Monitoring im Stakeholder-Management. HMD Praxis der Wirtschaftsinformatik, 50(5), 22–33.
doi:10.1007/BF03340850
Prensky, M. (2001). Digital natives, digital immigrants. On the horizon , 9(5), 1–6. doi:10.1108/10748120110424816
Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: Review and suggestions for application. JMR, Journal of Marketing
Research , 20(2), 134–148. doi:10.2307/3151680
Raine, L., & Wellman, B. (2012). Networked. The new social operating system . Cambridge, MA: MIT Press.
Reimer, L. (2011). Following a web design process. Smashing Magazine. Retrieved October 10, 2014, from
http://www.smashingmagazine.com/2011/06/22/following-a-web-design-process/
Rettig, M. (1994). Prototyping for tiny fingers. Communications of the ACM , 37(4), 21–27. doi:10.1145/175276.175288
Robinson, I., Weber, J., & Eifrém, E. (2013). Graph databases. Sebastopol, CA: O’Reilly Media.
Ross, D. T., & Schoman, K. E. Jr. (1977). Structured analysis for requirements definition. IEEE Transactions on Software Engineering, 3(1), 6–15.
Rowel, R., & Alfeche, K. (1997). Requirements engineering – A good practice guide . John Wiley and Sons.
Ryan, D., & Jones, C. (2009). Understanding digital marketing: Marketing strategies for engaging the digital generation . London: Kogan Page.
Saaty, T. L. (2001). Decision making with dependence and feedback: The analytic network process . Pittsburgh, PA: RWS Publications.
Saaty, T. L. (2004). Fundamentals of the analytic network process – Multiple networks with benefits, costs, opportunities and risks. Journal of
Systems Science and Systems Engineering , 13(3), 348–379. doi:10.1007/s11518-006-0171-1
Saaty, T. L. (2006). Rank from comparisons and from ratings in the analytic hierarchy/network processes. European Journal of Operational
Research , 168(2), 557–570. doi:10.1016/j.ejor.2004.04.032
Saaty, T. L. (2008). Decision making with the analytic hierarchy process. International Journal of Services Sciences , 1(1), 83–95.
doi:10.1504/IJSSCI.2008.017590
Sadri, G., & Clarke Bowen, R. (2011). Meeting employee requirements. Industrial Engineer , 43(10), 44–48.
Salmeron, J. L. (2012). Fuzzy cognitive maps for artificial emotion forecasting. Applied Soft Computing , 12(12), 3704–3710.
doi:10.1016/j.asoc.2012.01.015
Sarfaraz, A., Jenab, K., & D'Souza, A. C. (2012). Evaluating ERP implementation choices on the basis of customisation using fuzzy
AHP. International Journal of Production Research , 50(23), 7057–7067. doi:10.1080/00207543.2012.654409
Schubert, P., & Koch, M. (2002). The power of personalization: Customer collaboration and virtual communities. In Proceedings of the Americas
Conference on Information Systems (AMCIS 2002). Dallas, TX: Association for Information Systems.
Sefelin, R., Tscheligi, M., & Giller, V. (2003). Paper prototyping - What is it good for? A comparison of paper-and computer-based low-fidelity
prototyping. In CHI'03 extended abstracts on Human factors in computing systems (pp. 778-779). ACM.
Snijders, C., Matzat, U., & Reips, U.-D. (2012). “Big data”: Big gaps of knowledge in the field of Internet science. International Journal of
Internet Science , 7(1), 1–5.
Spiller, P., & Lohse, G. L. (1997/1998). A classification of internet retail stores. International Journal of Electronic Commerce , 2(2), 29–56.
Tolman, E. C. (1948). Cognitive maps in rats and men. Psychological Review , 55(4), 15–64. doi:10.1037/h0061626
Van Rijmenam, M. (2013). Why the 3V’s are not sufficient to describe big data. BigDataStartups. Retrieved September 22, 2014, from
http://www.bigdata-startups.com/3vs-sufficient-describe-big-data/
Walker, M., Takayama, L., & Landay, J. A. (2002). High-fidelity or low-fidelity, paper or computer? Choosing attributes when testing web prototypes.
In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (pp. 661-665). Thousand Oaks, CA: SAGE Publications.
10.1177/154193120204600513
Wang, Y. M., & Chin, K. S. (2011). Fuzzy analytic hierarchy process: A logarithmic fuzzy preference programming methodology. International
Journal of Approximate Reasoning ,52(4), 541–553. doi:10.1016/j.ijar.2010.12.004
Wong, K. W., Fung, C. C., Xiao, X., & Wong, K. P. (2005). Intelligent customer relationship management on the web. In Proceedings of TENCON 2005 -
2005 IEEE Region 10 Conference (pp. 1-5). Melbourne, Australia: IEEE. 10.1109/TENCON.2005.301163
Yao, Y. Y. (2000). Granular computing: Basic issues and possible solutions. In Proceedings of the 5th Joint Conference on Information Sciences
(JCIS 2000). Atlantic City, NJ.
Yao, Y. Y. (2005). Perspectives of granular computing. In Proceedings of the 2005 IEEE International Conference on Granular Computing (Vol. 1, pp.
85-90). Beijing, China: IEEE. 10.1109/GRC.2005.1547239
Zadeh, L. A. (1996). Fuzzy logic = computing with words. IEEE Transactions on Fuzzy Systems , 4(2), 103–111. doi:10.1109/91.493904
Zadeh, L. A. (1998). Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of
information/intelligent systems. Soft Computing - A Fusion of Foundations, Methodologies and Applications , 2(1), 23–25.
Zadeh, L. A. (2014). A note on similarity-based definitions of possibility and probability. Information Sciences , 267, 334–336.
doi:10.1016/j.ins.2014.01.046
KEY TERMS AND DEFINITIONS
Analytic Network Process: A multi-criteria-decision-making method based on connected networks. It is a more general form of the analytic
hierarchy process.
Big Data: A collective term for data sets that are too large or complex to be processed without specialized software.
Fuzzy Analytic Network Process: An extension of the analytic network process that incorporates fuzzy logic to model vagueness and partial
truth.
Fuzzy Cognitive Maps: An extension of cognitive maps that implements fuzzy logic and is therefore able to handle vagueness and partial truth.
Nodes and weighted edges are used to compute their relative impact.
Fuzzy Logic: A form of multi-valued logic that extends traditional binary logic to handle uncertainty and vagueness.
Granular Computing: A paradigm for processing information in which the information is first split into granules at various levels of abstraction
and then computed on.
Multi-Criteria Decision Making: A method of decision making that weighs multiple criteria (e.g., purchasing a house based on its value, age, and location).
Requirements Engineering: A process to formulate, document and maintain requirements in a project. Requirements can be separated into
different aspects (e.g., functional and non-functional requirements and technical specification).
Stakeholder Management: A management strategy to utilize gathered information about involved parties (individuals, groups, or
organizations) and to manage their stakes in a project.
CHAPTER 53
Application of Big Data in Healthcare:
Opportunities, Challenges and Techniques
Md Rakibul Hoque
University of Dhaka, Bangladesh
Yukun Bao
Huazhong University of Science and Technology, China
ABSTRACT
This chapter investigates the application, opportunities, challenges and techniques of Big Data in healthcare. The healthcare industry is one of the
most important, largest, and fastest growing industries in the world. It has historically generated large amounts of data, “Big Data”, related to
patient healthcare and well-being. Big Data can transform the healthcare industry by improving operational efficiencies, improving the quality of
clinical trials, and optimizing healthcare spending from patients to hospital systems. However, the health care sector lags far behind
other industries in leveraging its data assets to improve efficiencies and make more informed decisions. Big Data entails many new challenges
regarding security, privacy, legal concerns, authenticity, complexity, accuracy, and consistency. While these challenges are complex, they are also
addressable. The predominant Big Data management technologies, such as MapReduce, Hadoop, STORM, and others with similar combinations
or extensions, should be used for effective data management in the healthcare industry.
INTRODUCTION
The healthcare industry is one of the most important, largest, and fastest growing industries in the world. It has historically generated large
amounts of data, “Big Data”, related to patient healthcare and well-being (Nambiar, 2013). These data include clinical data from clinical decision
support systems, patient data in electronic patient records, physician’s prescriptions, pharmacies, insurance, administrative data, sensor data,
social media posts, blogs, web pages, emergency care data, news feeds, and articles in medical journals (Bian et al., 2012; Raghupathi &
Raghupathi, 2013). International Data Corporation, a global market research firm, estimates that the amount of digital data will grow from 2.8
trillion gigabytes in 2012 to 40 trillion gigabytes by 2020 (IDC, 2012). A recent study estimates that over 30% of all data stored in the world are
medical data and this percentage is expected to increase rapidly. In 2012, the volumes of worldwide healthcare data were 500 petabytes and are
projected to reach 25,000 petabytes in 2020 (Feldman et al., 2012).
It is widely accepted that Big Data can transform the healthcare industry by improving operational efficiencies, the quality of clinical trials, and
optimizing healthcare spending from patients to hospital systems (Koh & Tan, 2011). The potential for Big Data analytics in healthcare leads to
better outcomes by analyzing patient characteristics and outcomes of care. It identifies the most clinically and cost effective treatments and offers
analysis and tools. Big Data can assist patients to determine regimens or care protocols by collecting and publishing data on medical procedures.
For example, broad-scale disease profiling helps to identify predictive events and support prevention initiatives, and aggregating and synthesizing
patient clinical records brings such analysis much nearer to real time. Moreover, licensing data can assist pharmaceutical companies in identifying
patients for inclusion in clinical trials.
With Big Data analytics, doctors will be able to understand which tests are unnecessary, and patients will be able to access information on the
doctors for specific procedures (Chawla & Davis, 2013). By using high-quality data, Big Data analytics can reduce errors, improve diagnostic
accuracy, and improve coordination of care. In the USA, the Obama Administration has invested USD 200 million in the Big Data Research and
Development Initiative to transform the use of Big Data for biomedical research (STP, 2012). The government proposed “Health 2.0” to manage
hospitals, patients, insurance and government efficiently. The U.S. healthcare alliance network Premier has more than 2,700 member hospitals
and health systems, 90,000 non-acute facilities, and 400,000 physicians. It has assembled a large database of clinical, patient, financial,
and supply chain data to generate comprehensive and comparable clinical outcome measures. The Korean government plans to operate the
National DNA Management System which will offer customized diagnosis and medical treatment to patients (NICT, 2011).
However, the health care sector lags far behind other industries in leveraging their data assets to improve efficiencies and make more informed
decisions. While other industries such as the insurance, banking and retail sectors are far advanced in leveraging Big Data techniques, health care
remains poor at handling the flood of data. Researchers have raised concerns about how to ensure that Big Data has a central role in a health
system’s ability to secure improved health for its users (Ohlhorst, 2012). Big Data entails many new challenges regarding security, privacy, legal
concerns, authenticity, complexity, accuracy, and consistency.
There is increasing concern that millions of pieces of important data are lost every day because of traditional storage technologies. This is problematic,
as it does not allow health services to adapt to the needs of patients or diseases: although the technology exists, few tools capable of
storing and managing so much information are currently being utilized (Diaz et al., 2012). While these challenges are complex, they are also
addressable. Technologies for mass data storage and real-time processing are needed. The predominant Big Data management technologies, such as MapReduce,
Hadoop, STORM, and others with similar combinations or extensions, should be used for effective data management in the healthcare industry.
Big Data in Healthcare
About 2.5 quintillion bytes of data are generated every day, and almost 90% of the existing global data has been created during the past two years
(IBM, 2013). In 2011 alone, 1.8 zettabytes of data were created globally; this volume is equivalent to a collection of two-hour HD movies that one
person would need 47 million years to watch in its entirety (Hoover, 2013). In addition, this volume of data is expected to double each year. Social
networking users also generate large amounts of data. For instance, Facebook users generate 90 pieces of content (notes,
photos, links, stories, posts), while the 600 million active users of such social platforms spend over 9.3 billion hours a month on the site (McKinsey, 2011).
Every minute, 24 hours of video are uploaded to YouTube, which receives more than 4 billion views per day, while Twitter users send 98,000 tweets per
minute (HP, 2012).
International Data Corporation (IDC) forecasts that the Big Data market will grow from $3.2 billion in 2010 to $16.9 billion in
2015. This represents a 40 per cent compound annual growth rate (CAGR), about seven times that of the overall information and
communications technology (ICT) market (McAfee et al., 2012). A recent study predicts that the number of specialist Big Data employees in large
organizations will increase by more than 240% over the next five years in the UK alone (Power, 2014).
Nowadays, many organizations are collecting, storing, and analyzing massive amounts of data. These data are commonly referred to as “Big Data”
because of the volume and velocity with which they arrive, the variety of forms they take, and the value they realize (Kaisler et al., 2013). Using this
definition, the characteristics of Big Data can be summarized as the four Vs, i.e., Volume (great volume), Variety (various modalities), Velocity
(rapid generation), and Value (huge value but very low density). The US Congress defines Big Data as “a term that describes large volumes of high
velocity, complex, and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management,
and analysis of the information” (Hartzband, 2011, p.3).
Big Data is a collection of large and complex data sets which are difficult to process using common database management tools or traditional data
processing applications. According to Gartner “Big Data is high-volume, high-velocity and high-variety information assets that demand cost-
effective, innovative forms of information processing for enhanced insight and decision making” (Gartner, 2013). Elbashir et al. (2013) use the
term Big Data to refer to “the technology used to collect, process, store, share, and analyze huge volumes of data such as text, documents, videos
and pictures”.
Big Data, however, differs from data warehousing and data mining. Data mining can handle only limited amounts of data; it usually focuses on
abnormal data or errors and on discovering interesting patterns in previously stored data (Yoo et al., 2012). Data warehousing, on the other hand,
refers to simple data storage in a central repository of information from multiple sources. It stores current and historical data for further analysis
through Big Data technology or data mining (Kantardzic, 2011).
In the global economy, the Big Data revolution has begun and is now available to many industries. The healthcare industry is part of this
revolution and it is anticipated that Big Data will transform this industry by increasing accessibility and availability of information (Groves et al.,
2013). Big Data in healthcare means large and complex electronic health data sets which are very difficult to manage with traditional hardware,
software and data management tools and methods (Sun & Reddy, 2013). The key sources of Big Data in the healthcare industry are the following:
Electronic Health Records (EHRs) from healthcare service providers; clinical trials data; population health management data; new diseases trend
analysis; research and development data; and data from insurance agencies (Mancini, 2014). In the last few years, Big Data related to health has
been referred to by multiple terms, including e-Health, digital health, health information technology, mHealth, health 2.0, and e-medicine,
among many other terms (Barrett et al., 2013).
Big Data in healthcare is overwhelming because of its volume, the diversity of data types, and the speed at which it must be managed. In the
healthcare sector, a single patient generates thousands of data elements, including diagnoses, medications, medical supplies, digital images, lab
results, procedures, and billing (Liu & Park, 2014). External health data, such as social media posts and smartphone or wearable
sensor information on patients' heart rates, sleep patterns, brain activity, temperature, muscle motion, and numerous other clinically useful measures,
are also generated by patients outside the care provider facilities (Liu et al., 2012). Moreover, e-Health communication, health information
networks, health information organizations, associated insurance parties, government reporting, and so on also generate large volumes of data
in the healthcare industry.
Recently, the healthcare industry has been trying to understand all the innovative things that can be done with Big Data. Data from
multiple sources, together with new data collection techniques and technologies, will enable Big Data to generate new and innovative solutions for healthcare.
Healthcare stakeholders in developed and developing countries have realized that the ability to manage and create value from today’s large
volume of data, from various sources and in many forms (i.e., structured, semi-structured, unstructured), represents the new competitive
differentiation (Jee & Kim, 2013). Therefore, most governments, especially in developed countries, and the healthcare industry are
currently operating Big Data projects or are in the planning stage. All Big Data projects in healthcare industries have common goals, such as
better citizens’ healthcare services, easy and equal access to public health services, and the improvement of medically related concerns
(Raghupathi & Raghupathi, 2014). Side by side, each country or healthcare industry has its own priorities, opportunities, and threats, based on
that country’s unique environment (e.g., healthcare expenditure, inefficient healthcare systems, regional disparities), which Big Data projects
should address.
Many white papers and business reports have confirmed that Big Data could be used to safeguard public health through appropriate treatment for patients,
monitoring of healthcare safety, managerial control, and health-system accountability (Jee & Kim, 2013). The healthcare industry has already captured
value from Big Data. For example, Kaiser Permanente can now exchange data across all its medical facilities and has achieved USD 1 billion in savings
by implementing a new computer system, HealthConnect. Blue Shield of California, in partnership with NantHealth, improved healthcare
delivery by developing an integrated technology system; they can now provide evidence-based care to patients. AstraZeneca has developed more
than 200 innovative health-care applications in partnership with WellPoint's data and analytics subsidiary, HealthCore, to provide economical
treatments for some chronic illnesses and common diseases (Kayyali et al., 2013).
In recent years, the healthcare industry has applied Big Data analytics tools to detect diseases and guide medical treatment. A number of healthcare
organizations are applying Big Data tools to address multiple healthcare challenges, assist in diagnosing diseases, and support research (e.g.,
genomics). DNAnexus provides a cloud-based platform for storing, managing, analyzing, and visualizing next-generation DNA sequencing data
(Feldman et al., 2012). Genome Health Solutions, a network of physicians and technology providers, integrates personal genomics to streamline
care for patients with cancer and other diseases. Aggregating individual medical data into Big Data algorithms helps physicians provide
evidence-based medicine to patients (Groves, 2013).
Opportunities of Big Data in Healthcare
Big Data is creating a new generation of decision support data management. Healthcare sectors are recognizing the potential value of this data
and are putting the technologies, people, and processes in place to capitalize on the opportunities. Big Data is the key factor in competition and
particularly relevant in areas such as growth, innovation, productivity, and the efficiency and effectiveness of organizations (Chen et al., 2012).
Organizations are currently using Big Data to enhance customer experience and process efficiency (Bughin et al., 2010). It is also not
uncommon for some organizations to engage in more game-changing activities, developing new products and business models and
producing valuable information directly through Big Data.
Gartner, an IT research firm, surveyed 720 of its Research Circle members worldwide. It found nearly 64 percent of
organizations investing or planning to invest in Big Data technology: 30 percent have already invested, 19 percent plan to
invest within the next year, and the remaining 15 percent plan to invest within two years. Respondents from organizations that have already invested
report that their Big Data investments will exceed $10 million in 2013, rising to 50% by 2016. Of the executives at organizations that have
already tested the Big Data waters, 32 percent reported that their Big Data initiatives are fully operational, in production across the corporation (Eddy,
2013).
Worldwide Forecasts & Analysis (2013 – 2018) states that demand of Big Data applications, growth of consumer and machine data, and growth
of the unified appliances are playing a key role in shaping the future of Big Data market (Nirmala, 2014). From a regional point of view, North
America continues to lead investments and have invested in technology specifically designed to address the Big Data challenge. Big Data market
in Europe is projected to grow at a compound annual growth rate (CAGR) of 31.96 percent over the period 2012-2016. Big Data’s ability to
analyze data to forecast future accurately is the key factors contributing to this market growth.
Many organizations in the Asia-Pacific region, including healthcare organizations, are notably ambitious about investing during the next two years.
The Asia-Pacific Big Data market is expected to grow from $258.5 million in 2011 to $1.76 billion in 2016, a 46.8 percent five-year annual growth
rate (Cerra et al., 2012). This market will grow 20 percent or more over the next 24 to 36 months. Big Data adoption among Chinese companies is
fast gaining momentum, with an adoption rate of 21 percent, compared with 20 percent for Singapore and Indonesia and 19 percent for Malaysia.
The same study forecasts that Big Data business in China could touch $806 million by 2016 (Schönberger & Cukier, 2013). Cisco Systems Inc.,
a global networking major, conducted a survey among Chinese and overseas companies. This study showed that more than 60 percent of the
respondents believed that Big Data is essential for businesses to improve decision-making. More than 90 percent of the respondents from China
expressed their confidence in using Big Data technology (Jing & Yingqun, 2014).
For Big Data, 2013 is regarded as the year of early deployment and experimentation. Big Data investment and planned investment are led by the media
and communications, banking, healthcare, and services industries. Planned investments over the next two years are highest for transportation (50
percent), health care (41 percent), and insurance (40 percent) (Davenport & Dyché, 2013). Governments and healthcare sectors in developed and
developing countries have recognized that Big Data can be a fundamental resource for discovering useful information to enhance social and
economic growth. Moreover, Big Data can produce a wide range of innovations: personalized care, more preventive health care, a new
wave of efficiency and productivity, improved quality of health care delivery, better clinical and health policy decision making, and improved patient
engagement (Schouten, 2013). Big Data has also emerged as an essential tool for the improvement of e-Health.
Big Data has unlimited potential for effectively storing, processing, analyzing, and querying healthcare data, which profoundly influences
human health. It can transform health care by providing information directly to patients, enabling them to play a more active role (Groves et al.,
2013). Currently, patients' records reside with health care professionals, putting the patient in a passive position. In the future, medical records may
reside with patients. Big Data offers a chance to integrate the traditional medical model with the social determinants of health in a patient-
directed fashion by linking traditional health-related data (e.g., medication list, treatment history, and family history) to other personal data found
on other sites (e.g., income, education, diet habits, exercise schedules, and forms of entertainment), all of which can be retrieved without having to
interview the patient with an extensive list of questions (Murdoch & Detsky, 2013).
Big Data will be used to analyze the line of treatment by using relevant patient medical information and past experience from the database. It will
be beneficial in deciding on a line of treatment for deadly diseases such as cancer and HIV. Moreover, a patient's history and bioinformatics analysis
can be used for personalized medicine if required. Analysis will also help to advise appropriate diagnostics, thereby assisting the fast tracking of
diseases and their cures. Effective Big Data management will also reduce the length of hospital stay and treatment (Michele,
2012).
Big Data provides new opportunities to index and store previously unstructured, siloed and unusable data for additional uses by health care
stakeholders. In the health care sector, access to Big Data is enabling stakeholders to exploit the potential of previously unusable data (Khorey,
2012; Srinivasan & Arunasalam, 2013). Big Data creates new business value by transforming previously unusable data into actionable knowledge
and new predictive insights. It is estimated that the health care sector in the US could create savings of more than USD 200 billion every year,
reducing healthcare expenditure by about 8 percent, if it uses Big Data creatively and effectively (Kayyali et al., 2013).
Healthcare data is continuously and rapidly growing, and it contains abundant information of value for the decision-making process (Joseph &
Johnson, 2013). Data-driven firms that effectively use Big Data analytics in their decision-making processes derive five to six per cent better
“output and productivity” than if they had not used Big Data analytics. Applying this same five to six per cent average to Big Data analytics for
health care in Canada would translate into $10 billion in cost savings annually, which comprises about five per cent of the $207 billion in healthcare
expenditures in 2012 (Wallis, 2012).
It is widely believed that the use of Big Data can reduce the cost of health care while basing it on more extensive continuous monitoring. Minimizing
length of stay, ordering appropriate investigations, and choosing the best line of treatment reduce the cost of healthcare services. An effective use of Big Data
can not only reduce the cost of healthcare but is also rightly considered crucial to improving health services. McKinsey & Company estimates that
$300 billion to $450 billion can be saved in the healthcare industry worldwide through Big Data analytics (Chris, 2012).
Recently, health care organizations ranging from single-physician offices to multi-provider groups, accountable care organizations, and large
hospital networks have begun digitizing, combining, and effectively using Big Data. This helps to improve the quality and efficiency of health care delivery,
detect diseases at earlier stages, manage specific health populations, and detect health care fraud more quickly and efficiently. In the last
few years there has been a move toward evidence-based medicine instead of the traditionally used physician's judgment when making treatment
decisions. Big Data helps evidence-based medicine, which involves systematically reviewing clinical data and making treatment decisions based on
the best available information (LaValle et al., 2014). Moreover, aggregating individual medical data sets into Big Data algorithms offers robust
evidence that helps physicians, patients, and other healthcare stakeholders identify value and opportunities. Health care providers are
using Big Data to identify patients at high risk for certain medical conditions before major problems occur.
It is claimed that Big Data can discover population health patterns, identify non-traditional intervention points, and predict long-term conditions by
linking petabytes of raw information to health records, demographic data and genetic information. Many white papers and business reports
focusing on healthcare have claimed that Big Data could be used to guarantee public health, support clinical improvement, determine
appropriate treatment paths for patients, monitor the safety of healthcare systems, and promote health system accountability to the public
(Burghard, 2012). Disease prevention is now more achievable, and Big Data is being used to build better diagnostic tools and to increase access to healthcare.
Important application areas of Big Data are nutrition, accidents and injury, chronic and infectious diseases, mental health, environmental health,
and social health (Cambria et al., 2013).
Recently, Big Data has also been used to find causes of, and treatments for, diseases and to actively monitor patients so that clinicians are alerted to the potential for
an adverse event before it occurs (Pentland et al., 2009). The healthcare sector has used Big Data to detect diseases more effectively and to
aid medical research. For example, HIV researchers in the European Union worked with IBM, applying Big Data tooling to perform clinical
genomic analysis. IBM Big Data tooling played a crucial role in helping researchers understand clinical data from different countries in order to
discover treatments based on accumulated empirical data.
Challenges of Big Data in Healthcare
Still, few people in healthcare, including patients and doctors, realize that they are part of the Big Data generation process. Although Big Data
holds out enormous opportunities for improving health systems, there are also dangers that must be avoided. In particular, Big Data entails many
new challenges regarding its privacy and security risks, ownership issues, complexity, as well as the need for new technologies and human skills.
Privacy issues will continue to be a major concern for Big Data in healthcare. The inappropriate use of personal data, due to the linking of data from
multiple sources, is a great public fear regarding Big Data. Managing privacy in healthcare Big Data is a problem that has to be addressed from
both technical and sociological perspectives (Agrawal et al., 2011). There is also a risk of misuse when so much personal data is placed in the hands of
either government or companies. The privacy concern for healthcare data only increases in the context of Big Data.
Several researchers have argued that it is very difficult to ensure the security of Big Data in health systems while securing improved health for their users
(Lodha et al., 2014; Feldman et al., 2012). Although new computer programs can freely remove names and other personal information from
records being transported into large databases, stakeholders across the industry are still worried about potential problems as more information
becomes public. In most cases, Big Data software has no safeguards, which can have serious negative consequences (Villars et al., 2011). It is
very difficult for the healthcare industry to ensure that the data in Big Data operations is itself secured.
Data usability and ownership have been identified as major concerns, especially with respect to healthcare decision making. Eighty per cent of
medical data is unstructured, existing as images, documents, and transcribed notes, which makes it difficult to access for effective analytics (Grimes, 2012).
While physicians can read narrative text within an Electronic Health Record (EHR), most current analytics applications cannot effectively utilize
this unstructured data. Another important challenge of Big Data is data ownership. Important questions are being raised over who will
own certain forms of health information, for what purpose, and how and by whom that information can be used. It is also challenging for healthcare
organizations to decide how to store their data, and how and when to delete and/or archive it. Moreover, the unfamiliar and correlational nature of
Big Data increases the probability of misinterpretation, which can cause serious harm.
Big Data also poses internal IT challenges. Big Data deployments require new IT staff and application developer skills, which are likely to be in short
supply for quite some time. There is already a worldwide shortage of data scientists and data analysts. The McKinsey Global Institute predicts
that by 2018 the U.S. alone will face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts able to
analyze Big Data and make decisions (Manyika et al., 2011). Existing staff in many healthcare organizations may not have the required
skills and competencies to execute Big Data projects. Moreover, existing team members will sometimes be highly sought after by competitors and Big
Data solution providers.
Although there are serious issues involved in the ethics of online data collection and analysis, little is understood about the ethical implications of
the healthcare research being done with Big Data (Ess, 2002). Still, many ethics boards do not understand the processes of mining and
anonymizing Big Data. Finally, it is not surprising that Big Data creates a new kind of digital divide: the Big Data poor and the Big Data rich.
Big Data Technologies
Facing the challenges and advantages derived from collecting and processing vast amounts of medical data, a new technological schema is
needed to address the needs arising from Big Data applications. First, the sheer size of the data generated may require more powerful
data storage technologies, which points to distributed file systems. Second, manipulating the massive data stored
implies the need for new database management systems capable of handling structured and unstructured data, built on top of
the corresponding distributed file systems. Third, large data sets may allow for more insightful explorations that reveal valuable hidden
patterns, which drives the development of new analytics tools.
In this section, some emerging Big Data technologies are described in three categories: file systems, data management systems,
and analytics tools. In recent years, such file systems, data management systems, and analytics tools have been widely used in the healthcare industry (Russom, 2011).
Healthcare service providers can provide better care to patients, gain more valuable insights, detect fraud faster, and reduce administrative cost
by using data analytics tools such as Spark and MapReduce. They can store, analyze, and correlate various data sources to extrapolate knowledge by
using Big Data technologies (Basu, 2014). It should be noted that most of the technologies described here cannot work alone and always appear as
one component of a larger Big Data technical schema; their functionalities sometimes overlap or lack crisp boundaries. Another note
is that the development of related Big Data technologies is very fast, and it might be impossible to capture every new track of each
technology in this chapter.
File Systems
Hadoop Distributed File System
Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.
Its Hadoop Distributed File System (HDFS) splits files into large blocks and distributes the blocks amongst the nodes in the cluster. For
processing the data, the Hadoop Map/Reduce ships code (specifically Jar files) to the nodes that have the required data, and the nodes then
process the data in parallel. This approach takes advantage of data locality, in contrast to conventional HPC architecture which usually relies on a
parallel file system (compute and data separated, but connected with high-speed networking) (IBM, 2014).
Since 2012, the term “Hadoop” often refers not to just the base Hadoop package but rather to the Hadoop Ecosystem, which includes all of the
additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache
Spark, and others (Yahoo, 2012).
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. HDFS
stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across
multiple hosts, and hence theoretically does not require RAID storage on hosts (although some RAID configurations are
still useful for increasing I/O performance). A Hadoop cluster is the main working unit and usually has a single name node along with a cluster of data nodes. Each data node
serves up blocks of data over the network using a block protocol specific to HDFS. Communication within the cluster is carried over TCP/IP,
while clients interact with the file system through remote procedure calls (RPC).
HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write-operations (White, 2009). HDFS
can be mounted directly with a File system in User space (FUSE) virtual file system on Linux and some other Unix systems. File access can be
achieved through the native Java API, the Thrift API to generate a client in the language of the users' choosing (C++, Java, Python, PHP, Ruby,
Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, browsed through the HDFS-UI webapp over HTTP, or via
3rd-party network client libraries.
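To make the file-access options above concrete, a minimal sketch of reading a file through the native Java API is given below. The NameNode address and the file path are illustrative assumptions; in a real deployment the address would normally be resolved from the cluster's core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Hypothetical file holding one record per line.
        Path path = new Path("/healthcare/admissions/part-00000");
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // process each record
            }
        }
        fs.close();
    }
}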
HDFS has matured over time. In May 2012, high-availability capabilities were added, letting the main
metadata server (the NameNode) fail over manually to a backup. HDFS Federation, a newer addition, aims to allow multiple namespaces served
by separate name nodes. In general, HDFS serves as the foundation of any Hadoop system or package.
The potential for Hadoop in healthcare and healthcare data is exciting. Recently, Raghupathi and Raghupathi (2014) reported that Hadoop is
the most significant data processing platform for Big Data analytics in healthcare. Hadoop opens up the platform to new research
domains: researchers can now correlate air quality data with asthma admissions or use genomic data to speed drug development.
Google File Systems
Google File System (GFS or GoogleFS) is a proprietary distributed file system developed by Google for its own use (Carr, 2006; Ghemawat,
2003). It is designed to provide efficient, reliable access to data using large clusters of commodity hardware.
GFS is enhanced for Google's core data storage and usage needs (primarily the search engine), which can generate enormous amounts of data
that needs to be retained. Google File System grew out of an earlier Google effort, “Big Files”, developed by Larry Page and Sergey Brin. Within
GFS, files are divided into fixed-size chunks of 64 megabytes, similar to clusters or sectors in regular file systems, which are only extremely rarely
overwritten, or shrunk; files are usually appended to or read. It is also designed and optimized to run on Google's computing clusters, dense
nodes which consist of cheap, “commodity” computers, and thus precautions must be taken against the high failure rate of individual nodes and
the subsequent data loss. Figure 1 depicts the general structure of GFS.
A GFS cluster consists of a single master and multiple chunk servers, and is accessed by multiple clients, as shown in Figure 1. Each of these is
typically a commodity Linux machine running a user-level server process. It is easy to run both a chunk server and a client on the same machine,
as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
A GFS cluster generally consists of one Master node and a large number of Chunk servers. Files are divided into fixed-size chunks. Chunk servers
store these chunks. Each chunk is assigned a unique 64-bit label by the master node at the time of creation, and logical mappings of files to
constituent chunks are maintained. Each chunk is replicated several times throughout the network, three times at a minimum, and even
more often for files that are in high demand or need more redundancy. Chunk servers store chunks on local disks as Linux files and read or write
chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunk servers; users can
designate different replication levels for different regions of the file namespace. The master maintains all file system metadata. This includes the
name space, access control information, the mapping from files to chunks, and the current locations of chunks.
The Master server does not usually store the actual chunks, but rather all the meta data associated with the chunks, such as the tables mapping
the 64-bit labels to chunk locations and the files they make up, the locations of the copies of the chunks, what processes are reading or writing to
a particular chunk, or taking a “snapshot” of the chunk pursuant to replicating it (usually at the instigation of the Master server when, due to node
failures, the number of copies of a chunk has fallen beneath the set number). All this metadata is kept current by the Master server periodically
receiving updates from each chunk server (“heartbeat messages”).
Data Management Systems
BigTable
BigTable is a compressed, high-performance, proprietary data management system built on Google File System, which is capable of storing
large amounts of data in a semi-structured manner. BigTable development began in 2004 and has been used by a number of Google applications,
such as web indexing, Google Maps, Google Book Search, “My Search History”, Google Earth, Blogger.com, Google Code hosting, Orkut,
YouTube, and Gmail. Google's reasons for developing its own database include scalability and better control of performance characteristics.
Following Google's philosophy, BigTable was an in-house development designed to run on commodity hardware. BigTable allows Google to have
a very small incremental cost for new services and expanded computing power (they don't have to buy a license for every machine, for example).
BigTable maps two arbitrary string values (row key and column key) and timestamp (hence three-dimensional mapping) into an associated
arbitrary byte array. It is not a relational database and can be better defined as a sparse, distributed multi-dimensional sorted map. BigTable is
designed to scale into the petabyte range across “hundreds or thousands of machines, and to make it easy to add more machines to the system
and automatically start taking advantage of those resources without any reconfiguration” (Chang, et al., 2006).
Each table is a multi-dimensional sparse map. The table consists of rows and columns, and each cell is time-versioned: there can be multiple
copies of each cell with different timestamps, so changes can be tracked over time. In the canonical web-indexing example, the rows were URLs and the columns had
names such as “contents:” (which would store the file data) or “language:” (which would contain a string such as “EN”). Tables are optimized for
Google File System (GFS) by being split into multiple tablets – segments of the table are split along a row chosen such that the tablet will be ~200
megabytes in size. When sizes threaten to grow beyond a specified limit, the tablets are compressed using the algorithm BMDiff and the Zippy
compression algorithm publicly known and open-sourced as Snappy. The locations in the GFS of tablets are recorded as database entries in
multiple special tablets, which are called “META1” tablets. META1 tablets are found by querying the single “META0” tablet, which typically
resides on a server of its own since it is often queried by clients as to the location of the “META1” tablet which itself has the answer to the
question of where the actual data is located.
Hive
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed
by Facebook, Hive is now used and developed by other companies such as Netflix (Venner, 2009).
Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 file system. It provides an SQL-
like language called HiveQL with schema-on-read and transparently converts queries to MapReduce, Apache Tez, and (in the future) Spark jobs. All
three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes. By default, Hive stores
metadata in an embedded Derby database, and other client/server databases like MySQL can optionally be used.
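As a sketch of how an application might submit HiveQL through the HiveServer2 JDBC interface, consider the following; the host, port, credentials, and the hospital_admissions table are assumptions made purely for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint.
        String url = "jdbc:hive2://hiveserver:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL but is transparently compiled to MapReduce/Tez jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT diagnosis_code, COUNT(*) AS admissions " +
                "FROM hospital_admissions " +
                "WHERE admit_year = 2014 " +
                "GROUP BY diagnosis_code " +
                "ORDER BY admissions DESC " +
                "LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}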
HBase
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of
Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like
capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught
within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero
items representing less than 0.1% of a huge collection).
HBase features compression, in-memory operation, and Bloom filters on a per-column basis as outlined in the original BigTable paper
(George, 2011). Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API
but also through REST, Avro or Thrift gateway APIs.
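The Java API mentioned above can be sketched as follows; the table name "observations", the column family "vitals", and the row-key scheme are hypothetical, and the client calls follow the HBase 1.x style, so this should be read as an illustrative sketch rather than a prescribed usage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseObservationStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("observations"))) {

            // Write one sparse cell: row key = patientId#timestamp.
            Put put = new Put(Bytes.toBytes("patient42#20141008T1015"));
            put.addColumn(Bytes.toBytes("vitals"), Bytes.toBytes("heart_rate"), Bytes.toBytes("88"));
            table.put(put);

            // Read the cell back.
            Result result = table.get(new Get(Bytes.toBytes("patient42#20141008T1015")));
            byte[] value = result.getValue(Bytes.toBytes("vitals"), Bytes.toBytes("heart_rate"));
            System.out.println("heart_rate = " + Bytes.toString(value));
        }
    }
}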
HBase is not a direct replacement for a classic SQL database, although recently its performance has improved, and it is now serving several data-
driven websites, including Facebook's Messaging Platform.
Analytics Tools
MapReduce
There are several tools that have been designed to address Big Data analytics problems. Among them, the most popular is MapReduce.
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed
algorithm on a cluster (Zikopoulos et al, 2011).
A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues,
one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue,
yielding name frequencies). In short, “Map” is used for per-record computation, whereas “Reduce” aggregates the output from the Map functions
and applies a given function for obtaining the final results. The model is inspired by the map and reduce functions commonly used in functional
programming, although their purpose in the MapReduce framework is not the same as in their original forms. The key contributions of the
MapReduce framework are not the actual map and reduce functions, but the scalability and fault-tolerance achieved for a variety of applications
by optimizing the execution engine once. As such, a single-threaded implementation of MapReduce (such as MongoDB's) will usually not be faster
than a traditional (non-MapReduce) implementation; any gains are usually seen only with multi-threaded implementations. Only when the
optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework
come into play, is the use of this model beneficial (Lämmel, 2008; Grzegorz et al, 2014).
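The canonical illustration of this programming model is a count of terms in a text corpus. The following sketch, written against the Hadoop Java MapReduce API, is offered only as an assumed example of the Map() and Reduce() roles described above; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermCount {

    // Map: emit (term, 1) for every token in a record.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text term = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                term.set(tokens.nextToken());
                context.write(term, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each term.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "term count");
        job.setJarByClass(TermCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}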
MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source
implementation is Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been
genericized.
The growth of medical data, including images, requires parallel computing and algorithm optimization so that analysis and indexing can scale.
The MapReduce framework has been used to speed up medical data processing, for example in parameter optimization for lung texture segmentation using
support vector machines, content-based medical image indexing, and three-dimensional directional wavelet analysis for solid texture
classification (Markonis et al., 2012).
Spark
In addition to MapReduce, Spark (Karau et al., 2014) is a newer system developed to support data reuse across multiple computations. It supports
iterative applications and in-memory processing while retaining the scalability and fault tolerance of MapReduce. This emergent
technology aspires to become the new reference for the processing of massive data.
Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009 and open sourced in 2010 under a BSD license. In 2013, the
project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became an Apache Top-
Level Project. In November 2014, the engineering team at Databricks used Spark to set a new world record in large-scale sorting.
In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster
for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine
learning algorithms. Spark can interface with a wide variety of file or storage systems, including Hadoop Distributed File System (HDFS),
Cassandra, OpenStack Swift, or Amazon S3.
Spark is one of the most actively developed open source projects. It had over 465 contributors in 2014, making it the most active project in the
Apache Software Foundation and among Big Data open source projects.
The Spark project consists of multiple components. Spark Core is the foundation of the overall project. It provides distributed task dispatching,
scheduling, and basic I/O functionalities. The fundamental programming abstraction is called Resilient Distributed Datasets, a logical collection
of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems, or by applying coarse-grained
transformations (e.g. map, filter, reduce, join) on existing RDDs.
The RDD abstraction is exposed through a language-integrated API in Java, Python, and Scala, similar to local, in-process collections. This simplifies
programming complexity because the way applications manipulate RDDs is similar to manipulating local collections of data.
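As an assumed illustration of this style of programming, the following sketch counts encounters per diagnosis code from a hypothetical comma-separated file on HDFS using the Java RDD API; the cluster address, file path, and column layout are chosen only for the example.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class DiagnosisCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DiagnosisCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical CSV of encounter records: patientId,diagnosisCode,cost
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/healthcare/encounters.csv");

        // Cache in memory because the RDD is reused by two separate actions.
        JavaRDD<String[]> records = lines.map(line -> line.split(",")).cache();

        long total = records.count();

        // Coarse-grained transformations: map each record to a pair, then reduce by key.
        JavaPairRDD<String, Integer> perDiagnosis = records
                .mapToPair(r -> new Tuple2<>(r[1], 1))
                .reduceByKey((a, b) -> a + b);

        perDiagnosis.take(10).forEach(t ->
                System.out.println(t._1() + " -> " + t._2() + " of " + total + " encounters"));

        sc.stop();
    }
}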
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured
and semi-structured data. Spark SQL provides a domain-specific language to manipulate SchemaRDDs in Scala, Java, or Python. It also provides
SQL language support, with command-line interfaces and ODBC/JDBC server.
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs
RDD transformations on those mini-batches of data. This design enables the same application code written for batch analytics to be used for
streaming analytics, on a single engine.
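A minimal sketch of this mini-batch style, assuming a hypothetical TCP feed of "patientId,heartRate" readings, might look as follows; the socket address, batch interval, and alert threshold are illustrative assumptions only.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class VitalSignsAlert {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("VitalSignsAlert").setMaster("local[2]");
        // Mini-batches of five seconds.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Hypothetical feed of "patientId,heartRate" readings on a TCP socket.
        JavaReceiverInputDStream<String> readings = jssc.socketTextStream("localhost", 9999);

        // The same RDD-style transformations, applied to each mini-batch.
        JavaDStream<String> alerts = readings.filter(line -> {
            String[] parts = line.split(",");
            return Integer.parseInt(parts[1].trim()) > 120; // flag unusually high readings
        });

        alerts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}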
Machine Learning Library (MLlib) is a distributed machine learning framework on top of Spark that, because of Spark's distributed memory-based
architecture, is ten times as fast as the Hadoop disk-based Apache Mahout and even scales better than Vowpal Wabbit. It implements many
common machine learning and statistical algorithms to simplify large scale machine learning pipelines, including summary statistics,
correlations, stratified sampling, hypothesis testing, random data generation, SVMs, logistic regression, linear regression, decision trees, naive
Bayes, collaborative filtering, k-means, singular value decomposition (SVD), principal component analysis (PCA), optimization primitives and so
on. GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computation that can model the
Pregel abstraction. It also provides an optimized runtime for this abstraction.
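As an assumed illustration of MLlib in a healthcare setting, the following sketch clusters a handful of made-up patient feature vectors (age, body mass index, systolic blood pressure) with k-means; the data and the choice of two clusters are purely illustrative.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class PatientClusters {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("PatientClusters").setMaster("local[*]"));

        // Toy feature vectors (age, BMI, systolic blood pressure); purely illustrative.
        JavaRDD<Vector> features = sc.parallelize(Arrays.asList(
                Vectors.dense(34, 22.1, 118),
                Vectors.dense(67, 31.4, 150),
                Vectors.dense(71, 29.8, 160),
                Vectors.dense(29, 24.0, 121)));

        // Cluster the patients into 2 groups using at most 20 iterations.
        KMeansModel model = KMeans.train(features.rdd(), 2, 20);

        for (Vector center : model.clusterCenters()) {
            System.out.println("cluster center: " + center);
        }
        sc.stop();
    }
}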
A Spark-based approach to healthcare data processing produces very accurate data marts, which are used by statisticians and epidemiologists
studying the occurrence of diseases such as AIDS, leprosy, and tuberculosis. Spark, a tool offering in-memory processing, scalability, and ease of
programming, can link disparate databases of socioeconomic and healthcare data, serving as a basis for decision-making processes and
for the assessment of data quality (Pita et al., 2015).
STORM
Storm is a distributed computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz
and team at BackType, the project was open sourced after being acquired by Twitter. It uses custom created “spouts” and “bolts” to define
information sources and manipulations to allow batch, distributed processing of streaming data.
A Storm application is designed as a topology in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices.
Edges on the graph are named streams, and direct data from one node to another. Together, the topology acts as a data transformation pipeline.
At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real-
time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must
eventually end. Storm can provide real-time analytics, continuous computation and online machine learning to healthcare industry data analysis.
It is fault-tolerant, scalable and guarantees healthcare data will be processed, and is easy to set up and operate (Hu et al., 2014).
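To make the spout-and-bolt vocabulary concrete, the following sketch wires a toy topology that flags abnormally high heart-rate readings. It assumes the org.apache.storm package names of Storm 1.x (earlier releases used backtype.storm), and the synthetic spout, field names, and threshold are illustrative assumptions only.

import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class HeartRateTopology {

    // Spout: the information source; here it emits synthetic heart-rate readings.
    public static class ReadingSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("patient-" + random.nextInt(5), 60 + random.nextInt(90)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("patient", "heartRate"));
        }
    }

    // Bolt: a manipulation step; here it flags abnormally high readings.
    public static class AlertBolt extends BaseRichBolt {
        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) { }

        @Override
        public void execute(Tuple tuple) {
            if (tuple.getIntegerByField("heartRate") > 120) {
                System.out.println("ALERT " + tuple.getStringByField("patient")
                        + " heartRate=" + tuple.getIntegerByField("heartRate"));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        // The topology is a DAG: the spout feeds the bolt over a shuffle-grouped stream.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("readings", new ReadingSpout());
        builder.setBolt("alerts", new AlertBolt()).shuffleGrouping("readings");

        // Run in-process for demonstration; a real deployment would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("heart-rate-alerts", new Config(), builder.createTopology());
        Utils.sleep(30000);
        cluster.shutdown();
    }
}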
CONCLUSION
The health care industry is under tremendous pressure to deliver quality service to patients across the globe. Big Data application in healthcare
is the drive to capitalize on the growing availability of patient and health system data to generate healthcare innovation. By making smart use of the
ever-increasing amount of data available, we can find new insights by re-examining the data or combining it with other information. In
healthcare this means not just mining patient records, medical images, bio-banks, test results, etc., for insights, diagnoses and decision support
advice, but also continuously analyzing the data streams produced for and by every patient in a hospital, at a doctor's office, at home, and even while
on the move via mobile devices. Many governments and healthcare providers have already increased transparency by making decades of stored
healthcare data searchable and actionable. However, healthcare Big Data has different values and attributes, which pose different
challenges. Addressing the challenges associated with biomedical Big Data must of necessity engage all parts of the Big Data ecosystem. Although
all Big Data projects in healthcare industries have similar common goals, such as better citizens' healthcare services, equal access to public
services, and improvement of the quality of medical services, each government or healthcare stakeholder has its own priorities based on its nation's unique
environment. Governments and healthcare stakeholders should collaborate with entities that possess the relevant technologies and expertise.
They should develop strategies and policies on how Big Data can best be managed. Governments and healthcare service providers should
explore advanced analytics, legislation, and privacy and security technologies for real-time analysis of Big Data in healthcare. Big Data research in
healthcare requires knowledge about standards, filters, metadata, techniques for storing, finding, analyzing, visualizing and securing data, and
sector-specific editing of data. Researchers and practitioners should carefully work to determine the best ways of using Big Data in healthcare.
This work was previously published in Managing Big Data Integration in the Public Sector edited by Anil Aggarwal, pages 149-168, copyright
year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., ... Widom, J. (2011). Challenges and Opportunities with Big Data.
Academic Press.
Barrett, M. A., Humblet, O., Hiatt, R. A., & Adler, N. E. (2013). Big Data and disease prevention: From quantified self to quantified
communities. Big Data , 1(3), 168–175. doi:10.1089/big.2013.0027
Bian, J., Topaloglu, U., & Yu, F. (2012, October). Towards large-scale twitter mining for drug-related adverse events. In Proceedings of the 2012 international workshop on Smart health and wellbeing (pp. 25-32). ACM. 10.1145/2389707.2389713
Bughin, J., Chui, M., & Manyika, J. (2010). Clouds, Big Data, and smart assets: Ten tech-enabled business trends to watch. The McKinsey
Quarterly , 56(1), 75–86.
Burghard, C. (2012). Big Data and Analytics Key to Accountable Care Success . IDC Health Insights.
Cambria, E., Rajagopal, D., Olsher, D., & Das, D. (2013). Big social data analysis. Big Data Computing, 401-414.
Cerra, A., Easterwood, K., & Power, J. (2012). Transforming Business: Big Data, Mobility, and Globalization . John Wiley & Sons.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., … Gruber, R. E. (2006). Bigtable: A Distributed Storage System
for Structured Data. Google.
Charles, D., Furukawa, M., & Hufstader, M. (2012). Electronic Health Record Systems and Intent to Attest to Meaningful Use Among Non-federal
Acute Care Hospitals in the United States: 2008-2011. ONC Data Brief , 1, 1–7.
Chauhan, R., & Kumar, A. (2013, November). Cloud computing for improved healthcare: Techniques, potential and challenges. In EHealth and
Bioengineering Conference (EHB), 2013 (pp. 1-4). IEEE.
Chawla, N. V., & Davis, D. A. (2013). Bringing Big Data to personalized healthcare: A patient-centered framework. Journal of General Internal
Medicine , 28(3), 660–665. doi:10.1007/s11606-013-2455-8
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data to Big Impact. Management Information Systems Quarterly, 36(4), 1165–1188.
Diaz, M., Juan, G., & Oikawa, L. A. R. (2012). Big Data on the Internet of Things. Sixth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing. 10.1109/IMIS.2012.198
Feldman, B., Martin, E. M., & Skotnes, T. (2012). Big Data in Healthcare Hype and Hope. October 2012. Dr. Bonnie, 360.
Ghemawat, S., Gobioff, H., & Leung, S. T. (2003). The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating
Systems Principles SOSP '03. ACM.
Groves, P., Kayyali, B., Knott, D., & Van Kuiken, S. (2013). The ‘Big Data’ revolution in healthcare. The McKinsey Quarterly .
Hoover, W. (2013). Transforming Health Care through Big Data Strategies for leveraging Big Data in the health care industry . Institute for
Health Technology Transformation.
Hu, H., Wen, Y., Chua, T. S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. Access, IEEE, 2, 652–687.
doi:10.1109/ACCESS.2014.2332453
Huai, Y., Lee, R., Zhang, S., Xia, C. H., & Zhang, X. (2011, October). DOT: A matrix model for analyzing, optimizing and deploying software for Big Data analytics in distributed systems. In Proceedings of the 2nd ACM Symposium on Cloud Computing (p. 4). ACM. 10.1145/2038916.2038920
IBM. (2014). What is the Hadoop Distributed File System (HDFS)?. IBM.
Jee, K., & Kim, G. H. (2013). Potentiality of Big Data in the medical sector: Focus on how to reshape the healthcare system. Healthcare Informatics Research, 19(2), 79-85.
Jing, L., & Yingqun, C. (2014, April 21). When Big Data can lead to big profit. The China Daily. Retrieved from
http://www.chinadailyasia.com/business/2014-04/21/content_15131425.html
Joseph, R. C., & Johnson, N. A. (2013). Big Data and Transformational Government . IT Professional , 15(6), 43–48. doi:10.1109/MITP.2013.61
Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013, January). Big Data: Issues and challenges moving forward. In System Sciences (HICSS), 2013 46th Hawaii International Conference on (pp. 995-1004). IEEE.
Kantardzic, M. (2011). Data mining: concepts, models, methods, and algorithms . John Wiley & Sons. doi:10.1002/9781118029145
Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2014).Learning Spark: Lightning Fast Big Data Analytics (1st ed.). O’Reilly Media.
Kayyali, B., Knott, D., & Van Kuiken, S. (2013). The big-data revolution in US health care: Accelerating value and innovation . Mc Kinsey &
Company.
Khorey, L. (2012). Big Data, Bigger Outcomes. Journal of American Health Information Management Association , 83(10), 38–43.
Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of Healthcare Information Management ,19(2), 65.
Lämmel, R. (2008). Google's Map Reduce programming model — Revisited. Science of Computer Programming , 70(1), 1–30.
doi:10.1016/j.scico.2007.07.001
LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2013). Big Data, analytics and the path from insights to value. MIT Sloan Management Review, 21.
Liu, W., & Park, E. K. (2014, February). Big Data as an e-Health Service. In Computing, Networking and Communications (ICNC), 2014
International Conference on (pp. 982-988). IEEE. 10.1109/ICCNC.2014.6785471
Liu, W., Park, E. K., & Krieger, U. (2012, October). eHealth interconnection infrastructure challenges and solutions overview. In eHealth
Networking, Applications and Services (Healthcom), 2012 IEEE 14th International Conference on (pp. 255-260). IEEE.
Lodha, R., Jain, H., & Kurup, L. (2014). Big Data Challenges: Data Analysis Perspective. International Journal of Current Engineering and
Technology.
Markonis, D., Schaer, R., Eggel, I., Muller, H., & Depeursinge, A. (2012). Using MapReduce for large-scale medical image analysis. In 2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology (p. 1). IEEE. 10.1109/HISB.2012.8
Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A revolution that will transform how we live, work, and think . Houghton Mifflin Harcourt.
McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big Data. The management revolution. Harvard Business
Review , 90(10), 61–67.
Michele, O. C. (2012, October). Big Data, Bigger Outcomes: Enterprise Systems and Data Management. Journal of American Health Information Management Association, 83(10), 38–43.
Murdoch, T. B., & Detsky, A. S. (2013). The inevitable application of Big Data to health care. Journal of the American Medical
Association , 309(13), 1351–1352. doi:10.1001/jama.2013.393
Nambiar, R., Bhardwaj, R., Sethi, A., & Vargheese, R. (2013, October). A look at challenges and opportunities of Big Data analytics in healthcare.
In Big Data, 2013 IEEE International Conference on (pp. 17-22). IEEE. 10.1109/BigData.2013.6691753
NICT. (2011). President's Council on National ICT Strategies. Establishing a smart government by using Big Data . Seoul, Korea: President's
Council on National ICT Strategies.
Nirmala, M. B. (2014). A Survey of Big Data Analytics Systems: Appliances, Platforms, and Frameworks. Handbook of Research on Cloud
Infrastructures for Big Data Analytics, 392.
Ohlhorst, F. J. (2012). Big Data analytics: Turning Big Data into big money. Hoboken, NJ: John Wiley & Sons. doi:10.1002/9781119205005
Pentland, A., Lazer, D., Brewer, D., & Heibeck, T. (2009). Using reality mining to improve public health and medicine. Studies in Health
Technology and Informatics , 149, 93–102.
Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based workflow for probabilistic record linkage of healthcare
data. In the Workshop Proceedings of the EDBT/ICDT 2015 Joint Conference (March 27, 2015, Brussels, Belgium) on CEUR WS.org.
Power, D. J. (2014). Using ‘Big Data’ for analytics and decision support. Journal of Decision Systems, 23(2), 222–228. doi:10.1080/12460125.2014.888848
Raghupathi, W. (2010). Data Mining in Health Care . In Kudyba, S. (Ed.), Healthcare Informatics: Improving Efficiency and Productivity (pp.
211–223). Taylor & Francis. doi:10.1201/9781439809792-c11
Raghupathi, W., & Raghupathi, V. (2013). An Overview of Health Analytics. J Health Med Informat , 4(132), 2.
Raghupathi, W., & Raghupathi, V. (2014). Big Data analytics in healthcare: Promise and potential. Health Information Science and
Systems , 2(1), 3. doi:10.1186/2047-2501-2-3
Russom, P. (2011). Big data analytics . TDWI Best Practices Report, Fourth Quarter.
Srinivasan, U., & Arunasalam, B. (2013). Leveraging Big Data Analytics to Reduce Healthcare Costs . IT Professional , 15(6), 21–28.
doi:10.1109/MITP.2013.55
STP. (2012). Office of Science and Technology Policy, Executive Office of the President of the United States. The Obama administration unveils
the “Big Data” initiative: announces $200 million in new R&D investments . Washington, DC: Executive Office of the President.
Sun, J., & Reddy, C. K. (2013, August). Big Data analytics for healthcare. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1525-1525). ACM. 10.1145/2487575.2506178
Wallis, N. (2012). Big Data in Canada: Challenging Complacency for Competitive Advantage. IDC.
Yahoo. (2014). Continuuity Raises $10 Million Series A Round to Ignite Big Data Application Development Within the Hadoop Ecosystem.
Retrieved from finance.yahoo.com
Yoo, I., Alafaireet, P., Marinov, M., Pena-Hernandez, K., Gopidi, R., Chang, J. F., & Hua, L. (2012). Data mining in healthcare and biomedicine: A
survey of the literature. Journal of Medical Systems , 36(4), 2431–2448. doi:10.1007/s10916-011-9710-5
Zikopoulos, P. C., Eaton, C., deRoos, D., Deutsch, T., & Lapis, G. (2011). Understanding Big Data - Analytics for Enterprise Class Hadoop and
Streaming Data (1st ed.). McGraw-Hill Osborne Media.
ENDNOTE
1 The term “Hadoop” often refers not to just the base Hadoop package but rather to the Hadoop Ecosystem, which includes all of the additional
software packages that can be installed on top of or alongside Hadoop, such as Pig, Hive, HBase, Spark, and others.
CHAPTER 54
Shaping Content Strategies with User Analytics and Identities:
How User Analytics is Shaping Editorial Strategy, Driving Marketing, and Generating New Revenue
Tricia Syed
Penton, USA
ABSTRACT
User data has become the foundation of many businesses. The ability to increase the breadth and depth of user data to analyze trends is now the
roadmap for information companies – providing direction for new business, content strategy, print to digital shifts and overall retention and
engagement. In this chapter, the author will explore user identity and the three key core data buckets – Profile, Activity, and Behavioral – that
define how to decipher audience members and their ‘user records.’ The chapter will specifically showcase how user identity shapes editorial
strategy, marketing messaging and drives revenue. It will look at the impact specific technologies are having on what data can be captured as well
as the complexities around data capture in general – standardization, preservation, storage, relational data opportunities and data optimization.
INTRODUCTION
Recently, the author was shopping online for family in preparation for summer. She needed children’s bathing suits, new beach towels and some
rack equipment for an automobile. She went to at least ten consumer sites, including Lands End, L.L. Bean, Amazon, Thule.com, REI.com and
Zappos, comparing styles, prices and item availability. While shopping, she added items to her online cart, either purchasing them or abandoning
them, and in some cases abandoning the site entirely. Customer service as well as exit intent automation interrupted the shopping process to ask
if any help was needed or to attempt to stop her from exiting the site(s). She did not interact with any online support and in the end did purchase
merchandise from three sites.
As an expert in user marketing, the author is intrigued by the approaches companies use to process customer activity for upsell or follow up
engagement opportunities. It is impressive when the process is well executed and, more specifically, when there is contextual relevancy and
strong timing with the follow up. In this recent shopping expedition, Zappos was the clear winner of the day on both fronts. When checking
Facebook while shopping, the author noticed almost within seconds that items left in the Zappos shopping cart or items viewed, or items like
them were suddenly appearing in the margins and body of the Facebook feed. The same dynamics are shaping content delivery as well.
STRATEGIES WITH USER ANALYTICS AND IDENTITIES
Contextual Targeting and Retargeting
There will be those who are “spooked” by the Big Brother aspect of contextual marketing and retargeting, viewing these practices as creepy,
disruptive or as an invasion of privacy. Yet for digital media and marketing professionals, it is interesting to observe the tactics and how well or
poorly methods and technology are executed. Companies in general, and news and media companies in particular, need to continually drive for greater sophistication in editorial and marketing practices to increase readership, user engagement and revenue.
In the contextual marketing space, contextually targeting shoppers based on online behavior makes a connection and/or a presumption about
what the shopper is thinking. As one shops, companies that leverage contextual targeting are anonymously cookie-ing visitors so that they can
follow their actions. By tracking what links are clicked, what pages are perused, and what items are searched, the business is investing in re-
marketing to place their products in front of shoppers again and again as they surf the web. This level of data capture, or what is also referred to
as activity data, has a short shelf life. The re-marketed ads are relevant for a limited period of time before the shopper makes a purchase and for a
short period of time afterward. The companies that have built and sell this technology and level of targeting are attempting to create an
experience that is tied to anecdotal or potentially statistically proven ‘human nature.’ They are trying to re-engage, remind or re-attract buyers to
what they were initially considering for purchase.
Retargeting has become a big business. Companies such as Bizo, ReTargeter and Adroll specialize in retargeting. Adroll, for example, has been
around since 2007 and today it partners with some of the largest online social companies such as Facebook, Google, Yahoo and Microsoft.
Retargeting enables their clients to reach 98 percent of sites on the Internet. Adroll claims extremely high return on investment (ROI) values for
clients and cites advanced targeted techniques as one of their main differentiators. Like Adroll, Bizo is a multi-channel business-to-business
(B2B) marketing solutions company. They also partner with major companies allowing them to reach more than 90 percent of the business
population in the United States. ReTargeter is a display advertising solutions company that specializes in audience targeting and retargeting, claiming to reach 98 percent of sites on the Internet.
There are some lessons digital media companies can learn from early experiences in this space. Despite the improvements in contextual
targeting, issues such as discrepancies, irrelevancy and intrusion still permeate the user experience. Intelligently targeting ads or content and
making sure they are relevant still present challenges for businesses. Repeatedly displaying an ad for something the buyer has already purchased
is not effective. As an example, the author recently registered to attend what looked to be an informative and interesting marketing conference,
yet after the registration was confirmed, she continued to receive emails asking her to register for the conference. Now, instead of focusing on
looking forward to the event, she is thinking, get your act together! Most consumer companies and B2B companies relying on retargeted ads or
even their own marketing automation platforms do not go beyond ‘recent activity’ data and geo-targeting. If they collected deeper demographic and/or firmographic data (organizational data or attributes tied to the company’s industry), as well as historical activity and behavioral data, they would be able to provide a more holistic view of users and their needs, interests and experiences. Imagine how this might improve predictability and conversion rates. Perhaps the problem lies in too few linked data sets, or too little human effort to discern the patterns within big data.
The Impact of Big Data and Little Data
Big data has become a term that is influencing businesses, data analysts, technologists and journalists alike. Understanding consumers’ viewing and purchasing behavior must extend beyond cookies; it requires careful management of the volume, multiplicity and speed at which the data are being generated and stored. Thomas (2014) defines the types of data you can trust as companies maneuver through and attempt to cull out
what is usable and valuable, versus the “noise” so much data can create. The data types range from experimental data to marketing-mix modeling
data, which is the creation of an analytical database. Then there is the cleansing and normalizing of those data and the use of multivariate
statistics and modeling to isolate and neutralize some of that noise. Sales data and social media data are also important, but Thomas finds the
most value in the “little data,” which are statistically significant subsets of the data.
“Corporate decision-makers often would be better served if they rely on tried-and-true tools and systems from the world of little data, rather than
illusions from big data,” Thomas says. Specifically, he talks to the validity of marketing research based on data derived from random sampling of
target audiences. “Marketing research can be designed to be forward-looking and predictive, rather than backward-looking,” he claims.
“Experienced researchers can create alternative futures and measure the relative appeal of the differing visions of the future.” Such “little data
often provides a more accurate basis for sound corporate decision-making.” With the ability to capture enormous amounts of data sets and fields,
it’s interesting to consider what might be categorized as the little data within the bigger data movement.
Shaw (2014, para. 1) references Professor Gary King, the Weatherhead University Professor at Harvard, who makes the case for quality over quantity in big data gathering:
There is a big data revolution. But it is not the quantity of data that is revolutionary. The big data revolution is that now we can do something
with the data. The revolution lies in improved statistical and computational methods, not in the exponential growth of storage or even
computational capacity.
“Doing something” with the data means accurate interpretation and appropriate actions to build new layers of the business and generate new
revenue based on the user data.
Capitalizing on Big Data
Capitalizing on big data includes: a) defining the audience, b) defining the valued trends of the audience, and c) ensuring one has the tools and
capabilities to conduct analysis and act on opportunities within those trends. User data has become the foundation of many businesses’ marketing strategy, and the ability to identify user segments, and trends within those segments, is what makes each of these steps possible.
In a time when so many B2B media companies are making a shift from print ad dollars to digital offers, the ability to parse user data, predict
behaviors and engage users is mandatory. With a combined methodology, alongside platforms that support data capture and data access,
marketers, editors, publishers and sales representatives have an opportunity to make smarter choices for the business.
Developing the “Who” of User Personas
Whether it’s big data or little data, content creation teams within successful media and information service companies need to be dedicated to
truly understanding the day in the life of the customer. This entails the creation of user personas to better understand users and customers. The
user persona identity that represents the ideal segment or class within the target audience includes demographics, firmographics, subscriptions,
mediums of interest, job function and requirements, daily pain points, challenges within their lines of work, what technologies they rely on to
access solutions, and the concerns of their like-minded peers. Persona data provide for improved user experiences and deep segmentation, enabling content creators, community managers and marketers to execute sophisticated targeting and engagement by providing relevant content, solutions and connections.
The concept of developing user personas to identify like communities within a broader audience has been a common practice within the
marketing field, as developed by Jenkinson (1994) in his paper, “Beyond Segmentation”. He created a tool called CustomerPrints to better
categorize customer segments beyond the traditional segmentation based on demographic and behavioral similarities. The purpose was to
achieve a higher level of knowledge about a customer’s daily life, needs and desires based on distinct groups of customers. Today, the challenge is choosing the platforms and technologies to support this user persona work and deciding how to parse through what are, at times, inordinate amounts of data – in essence, capitalizing on big data.
Developing user personas is an important technique in meeting the challenge of deep relevancy by contextual targeting and user transparency
(see Figure 1). User personas enable media and information services companies to dissect and analyze user data to better understand their
business roadmaps.
The key to user access and interpretation is in understanding how to employ database methodologies, platforms and cloud technologies to
generate user-centric analytics. Decision-makers must determine how to parse data, generate user health checks and extrapolate data to discover
or uncover significant trends about content, topics, and industry impact so that they can better project and predict the needs and interests of
content consumers. For news and information organizations, the first step in building a foundation to analyze user data is acquiring the ability to
capture a person’s complete, and ideally most up-to-date, demographic and firmographic information. The second step is to ensure there is a
platform in place that allows for data analysis and marketing automation in order to segment users based on demographic differentiators. Before
a media organization begins to evaluate service options, it must answer the question: who is the user? This is the starting point. Once established,
the business can expand the definition with niche user segments that align with their content and product offerings.
When capturing demographic data, it’s important to decide in advance what demographic fields or questions are important and impactful. Time
invested in this sometimes-tedious process will be rewarded. Next, it is important to standardize how questions are asked – specifically, how
registration forms collect the data as well as how the data are stored and accessed. Open fields in registration forms can create multiple variations
of the same data point, unless a sophisticated system or dedicated data manager oversees normalization of the data. It’s better to carefully define
fields to prevent variations in user input.
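As a concrete illustration of the normalization problem described above, the following Python sketch maps free-text job titles onto a small controlled vocabulary. The patterns and category names are hypothetical and would in practice be maintained by a data steward.

```python
import re

# Ordered, illustrative mapping from free-text patterns to canonical titles.
TITLE_MAP = [
    (r"vice president|\bvp\b|v\.p\.", "Vice President"),
    (r"\bdirector\b|\bdir\b", "Director"),
    (r"\bmanager\b|\bmgr\b", "Manager"),
    (r"\bchief\b|\bceo\b|\bcio\b|\bcto\b|\bcfo\b", "C-Level"),
]

def normalize_title(raw):
    """Map a free-text job title from a registration form onto a controlled value."""
    text = raw.strip().lower()
    for pattern, canonical in TITLE_MAP:
        if re.search(pattern, text):
            return canonical
    return "Other"

print(normalize_title("Sr. Marketing Mgr"))       # Manager
print(normalize_title("V.P., Digital Strategy"))  # Vice President
```

Constraining registration forms to picklists avoids most of this work, but a normalization pass like this remains useful for legacy records and third-party lists.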
Penton’s Experience
There are web analytics platforms that can aid in data normalization. Web analytics platforms such as Omniture offer an option to upload
backend user database fields and identifiers by email so that businesses can then track within Omniture what those records (or users) are reading
and map it back to their unique record data. The process is called Sainting. At Penton, a professional information services company (the author’s
company), record data has been merged with Omniture to enable editors and marketers to see what topics people and companies are reading,
downloading and revisiting. Here, the demographic data are imperative as they become the identifier of the activity and provide trends at both
the individual and account level.
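The underlying idea can be sketched as a simple join: activity events keyed by email are merged with backend record data so that reading behavior can be rolled up by company or job function. The sketch below uses pandas with entirely hypothetical column names and data; it is not the Omniture/SAINT workflow itself.

```python
import pandas as pd

# Hypothetical backend record data keyed by email.
records = pd.DataFrame({
    "email": ["[email protected]", "[email protected]"],
    "company": ["AcmeFarm", "HaulCo"],
    "job_function": ["Agronomist", "Fleet Manager"],
})

# Hypothetical page-view events exported from the web analytics platform.
page_views = pd.DataFrame({
    "email": ["[email protected]", "[email protected]", "[email protected]"],
    "topic": ["crop prices", "soil health", "fuel costs"],
})

# Join activity to identity, then report reading trends at the account level.
joined = page_views.merge(records, on="email", how="left")
print(joined.groupby(["company", "topic"]).size())
```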
The user persona represents the reader, website visitor, or user who needs or willingly consumes content. At Penton, marketers and editors were
recently tasked with implementing a phased approach to developing user personas. Penton is focused on five core industry segments, including
agriculture, transportation, infrastructure, design & manufacturing, and natural products & food, and each also includes niche segments. As a
result, the persona exercise is not one size fits all. Rather, each marketer was tasked to work closely with their editor, publisher and sales teams to
define their key audience members. Fields being selected include target group, job title/function, geography, employee size, and company
revenue.
Data on users’ daily influences, including motivations, job pressures and industry pressures or needs, were also collected. The goal, again, is to ensure that the user universe, and the segments or niches within that broader universe, are understood. Once user definitions, user needs and clarity on their roles and the challenges specific to those roles were collected, it was time to look at which topics and content mediums would be viewed as valuable. What content will resonate with the audience? Some user segments need free or fee-based education to help propel them to the next level in their careers. Others want a better understanding of the scope of technologies appropriate to their company size. Another segment wants more information about the manufacturers on which they currently rely, or about how new government laws will impact their business. Alongside the demographic and activity make-up of the user audience, Penton also looked at consumption habits based on the day in the life of a consumer and what it reveals. Dayparting, the process of dividing the day into different time segments, allows offers to be adjusted accordingly, which is critical to editorial and marketing efforts because of the impact it has on response rates and user engagement.
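A minimal dayparting analysis along these lines might look like the following Python sketch, which buckets click timestamps into parts of the day and finds each user's preferred window. The bucket boundaries, addresses and timestamps are illustrative assumptions.

```python
import pandas as pd

# Hypothetical log of newsletter click timestamps.
clicks = pd.DataFrame({
    "email": ["[email protected]"] * 4 + ["[email protected]"] * 3,
    "clicked_at": pd.to_datetime([
        "2015-06-01 06:40", "2015-06-02 07:05", "2015-06-03 06:55", "2015-06-04 12:10",
        "2015-06-01 20:15", "2015-06-02 21:30", "2015-06-03 20:45",
    ]),
})

# Illustrative daypart boundaries; real ones would be tuned per audience.
bins = [0, 6, 12, 18, 24]
labels = ["overnight", "morning", "afternoon", "evening"]
clicks["daypart"] = pd.cut(clicks["clicked_at"].dt.hour, bins=bins,
                           labels=labels, right=False)

# Preferred daypart per user drives when offers or newsletters are sent.
print(clicks.groupby("email")["daypart"].agg(lambda s: s.mode().iat[0]))
```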
The user persona identification exercise is broken into two parts: the “Who,” which captures the demographic make-up and a day in the life of the user, and the “What,” which provides a list or roadmap of the content and assets with which the user engages. Because technology is influencing how and
when users consume data, this exercise should occur at least annually, if not more frequently, to take into account new niche topics, changes in
the industry and changes in content delivery and consumption. For example, with mobile content consumption, the recent focus is on responsive
design to optimize the user experience, but this will also result in new data sets about how and when the users access that content on their mobile
device.
In Penton’s exercise with editors and marketers, it was emphasized to the team that analysis of user trends within defined user personas will
become the roadmap for publishers. Before questions on the registration page can be finalized, the question of who is the audience must be
answered. Are they small to mid-sized businesses, global enterprises or somewhere in between? Are the visitors c-level executives, directors and
managers or developers? Once the segments – or buckets into which the user best fits – have been defined, one can begin to build on that identifiable record data by adding more complex data points that describe job expertise, areas of interest and, more importantly, the specific concerns of the audience or audience segment.
The Role of Marketing Automation
Standardization, agreed-upon user data fields, consistent firmographic data (particularly for enterprise companies, with defined company naming conventions, company locations and company size ranges), data optimization, data preservation, and relational data opportunities are all critical to building a concise and minable user database. Companies today can find a faster path to quality data by investing in
marketing automation software platforms designed primarily for marketing departments, but increasingly used by other divisions such as
editorial to more effectively communicate across online channels. Marketing automation capabilities provide transparency on what resources
users are consuming in terms of content (what newsletters they are opening, click through rates), where users are consuming content (social
media sites) and when. These activities are track-able and users can be identified by the content-type, topics and social channels they access.
Garrison (2013, para. 2) explains the marketing automation approach as a means to quickly analyze key customer segments:
…most businesses have a broad customer base. And within that broad base are niche groups with distinct wants and needs, and unique
motivations. A powerful tool to expand your understanding of those specific needs and motivations is to create a persona; this is an archetypal
character that represents different user types within a targeted demographic that might use your product or service in a similar way.
There are over 200 marketing automation solutions from which companies can choose. Three of the more well-known include Eloqua, Marketo
and Pardot. Many organizations have only begun to tap the power of these solutions. Of the B2B marketers using an automation platform in
2014, 85 percent of them feel that they are not using them to their full potential (Senatore, 2014).
When leveraged correctly, marketing automation provides companies with an opportunity to capture, store and define data segments that result
in deeper data capture and new trends. For a media or information services company relying on behavioral-based trigger marketing campaigns
(messaging the audience at preplanned points in time based on their behavior, for example, if they click on a certain link in a newsletter, they
receive a contextually relevant follow-up email associated with what they first clicked), custom data cards that enable deep user segmentation and relevancy are less about driving leads, scoring leads or uncovering prospects, and more about engagement and understanding the user lifecycle based on topics of interest, mediums of interest and aligning website traffic trends with the user base. Businesses rely on the marketing automation tool to enhance targeting and personalization. Trigger campaigns can be parsed according to user reaction, and marketers can subsequently create a series of unique messages based on the direction, or lack of direction, the user takes.
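A behavior-based trigger of this kind can be reduced to a simple rule: a click on a given newsletter topic schedules a contextually related follow-up. The sketch below is a toy illustration; the event fields, topics and follow-up messages are hypothetical.

```python
# Hypothetical mapping from clicked newsletter topics to follow-up messages.
FOLLOW_UPS = {
    "cloud-security": "Invite: cloud security webcast next week",
    "fleet-telematics": "White paper: choosing a telematics platform",
}

def trigger_follow_up(event):
    """Return a contextually relevant follow-up for a newsletter click, else None."""
    if event.get("action") != "newsletter_click":
        return None
    return FOLLOW_UPS.get(event.get("topic"))

print(trigger_follow_up({"action": "newsletter_click", "topic": "cloud-security"}))
print(trigger_follow_up({"action": "page_view", "topic": "cloud-security"}))
```

In a real automation platform such a rule would also respect frequency caps and suppression lists so that trigger messages do not become over-messaging.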
How Marketing Automation Platforms Work
Most marketing automation platforms provide insight and guidance on user persona or buyer persona best practices. These platforms foster deep
segmentation, tight targeting, and the ability to retarget, nurture, and upsell. Companies including Marketo, Eloqua and Pardot offer marketing
automation platforms that store, collect, optimize, normalize, and target data for email marketing efforts. Eloqua and Marketo have developed
strong marketing communities. Pardot, part of Salesforce.com, also works with preferred partners, namely cloud applications such as SnapApps.
They offer pre-configured interactive content, including interactive infographics, with the ability to capture user data and provide analytics. More
specifically, Eloqua has created something it labels as Custom Data Objects (CDO), which enables companies to store additional information for
user contacts, such as product purchase history or event attendance data, without cluttering up contact records. Marketers and editors can use
these tools to maintain clean and organized databases, but more importantly, store data that can then be segmented at the contact level – to build
out more sophisticated messaging based on the “Who” and the “What” of the audience.
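The design idea behind such custom data objects can be illustrated without any particular vendor: supplementary history lives in its own table-like structure linked back to the contact by an identifier, rather than as extra fields on the contact record. The field names below are hypothetical, not Eloqua's schema.

```python
# A conceptual sketch of keeping supplementary data out of the core contact
# record: one contact row, many linked event rows. All names are illustrative.
contacts = {
    101: {"email": "[email protected]", "job_function": "Agronomist"},
}

# Supplementary objects linked back to the contact by id, not stored on it.
event_attendance = [
    {"contact_id": 101, "event": "Precision Ag Summit", "year": 2014},
    {"contact_id": 101, "event": "Soil Health Webcast", "year": 2015},
]

def history_for(contact_id):
    return [row for row in event_attendance if row["contact_id"] == contact_id]

print(contacts[101]["email"], "->", [e["event"] for e in history_for(101)])
```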
Another provider in this space, Televerde, goes a step further by appending and enriching data at the program level based on a campaign to a
specific audience. They provide data quality analysis to ensure data is current and accessible across multiple platforms. Another company and
preferred partner of Eloqua, Sureshot Media, provides a variety of creative services for demand generation or targeted marketing programs
driving visibility for a company’s products, services, or brand awareness and one in particular is a set of apps called SparkPlugs. Sparkplugs help
businesses extend the capabilities of their marketing automation application by addressing a variety of areas, including data quality and
marketing channels. Specifically, these data quality apps integrate with third-party data providers to validate data, such as email and postal
addresses and enhance data, such as appending company or contact information to records.
Data Optimization and Normalization
Once key demographics have been defined and a platform is in place to capture and communicate user records, the next critical step is data
optimization and normalization. Despite the inconsistent data capture that might occur on a registration form or missing data as a result of a
tiered registration approach, there are many service companies that can append demo and firmographic data based on unique identifiers such as
email or company domain. Often these companies require access to a house file to do a match. They charge based on the data field or overall
volume of appends. Such companies provide more complete and updated data to develop accurate segmentation and trend analysis. Typically,
the cost of services is worth the investment. Traditional list services companies as well as cloud data companies can assist in this process.
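A domain-based append of this sort can be sketched as a simple match between a house file and a vendor table keyed by company domain. The vendor table and its fields below are hypothetical.

```python
import pandas as pd

# Hypothetical in-house file keyed by email.
house_file = pd.DataFrame({
    "email": ["[email protected]", "[email protected]"],
})

# Hypothetical vendor firmographics keyed by company domain.
vendor_firmographics = pd.DataFrame({
    "domain": ["acmefarm.com", "haulco.com"],
    "industry": ["Agriculture", "Transportation"],
    "employee_range": ["200-500", "1000-5000"],
})

# Derive the domain from each email, then append the firmographic fields.
house_file["domain"] = house_file["email"].str.split("@").str[1]
enriched = house_file.merge(vendor_firmographics, on="domain", how="left")
print(enriched)
```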
Most of these service providers have proprietary databases and technologies. The costs of their services vary. It is important, as a company
decides on which one to work with to consider data status, data health and business need. For example, the company Fresh Address focuses on
email appending (adding email addresses to audience records by extrapolating or paying for the email through a list broker) and optimization. As
an email database service provider around since 1999, Fresh Address services range from building, updating, segmenting and cleaning email lists
from a customer’s database. Televerde also provides appends like emails or address information, and data optimization, promoting data that are
globally and socially sourced with several layers of validation to ensure job titles and job functions are up to date. Most likely they rely on
companies such as LinkedIn and telemarketing outreach to ensure accurate data capture.
Some service companies are more niche focused. For example, iProfile is a sales and marketing intelligence provider that provides access to a
database of company profiles and contact information for information technology decision makers. One of their unique offerings is a service of
real-time deal alerts that notifies customers when IT purchasing decisions are about to be made. They also update clients when IT decision
makers change jobs within their organization or move to a new company.
When enhancing or optimizing user data (updating demographic data or company data to ensure accuracy), a simple best practice is to test a
small segment with one or more potential data vendors to ensure quality and engagement. As a rule, it is best to test small amounts of data with
missing elements and conduct a permission pass campaign through email to trigger a response or conversion. This helps to ensure they are
accurate and active before making a larger investment or commitment to enhance the user database.
Developing the “What” of User Personas
The next step in creating value and opportunity from the user persona is the “What.” From the “What” of user activity data (see Figure 2),
companies can begin to identify user interests based on content consumption including both content topics and content types. These content and
product lifecycle patterns provide more sophisticated opportunities for user segmentation resulting in the three R’s – relevancy, response and
retention.
What are activity data? What can or should be captured? Most importantly how is this information used to direct content strategy, product
strategy, and marketing to drive engagement? Simply put, activity data are defined as the record of any user’s action.
Defining user activities is a key part of the user persona exercise and should be revisited often as industries and their audiences as well as the
content channels and social platforms continually evolve. Activity data, as defined by Franklin et al. (2011), can fall into three key categories (a small tagging sketch follows the list):
• Access: Logs of user access to systems indicating where users have travelled (e.g., log in/log out, passing through routers and other
network devices, premises access turnstiles).
• Attention: Navigation of applications indicating where users are paying attention (e.g., page impressions, menu choices, searches).
• Activity: ‘Real activity’, records of transactions which indicate strong interest and intent (e.g., purchases, event bookings, lecture
attendance, book loans, downloads, ratings).
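The small Python sketch below tags raw event types with these three categories; the event-to-category mapping is an illustrative assumption rather than a fixed standard.

```python
# Illustrative mapping of raw event types onto the Access/Attention/Activity
# categories described above.
CATEGORY_BY_EVENT = {
    "login": "Access",
    "logout": "Access",
    "page_view": "Attention",
    "search": "Attention",
    "menu_click": "Attention",
    "purchase": "Activity",
    "event_booking": "Activity",
    "download": "Activity",
    "rating": "Activity",
}

def categorize(event_type):
    return CATEGORY_BY_EVENT.get(event_type, "Unclassified")

events = ["login", "search", "page_view", "download", "rating"]
print({e: categorize(e) for e in events})
```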
Once engaged in this process, there are important questions to ask, such as: Are all actions critical to track? Once tracked, how well can the trends within those activities be analyzed? If users are logged in to the system, web analytics can be merged with user records. Even if users are not logged in and cannot easily be identified, web traffic can be tracked by IP address and geo-targeted to identify companies.
Some companies attempt to track traffic and identify users based on IP and geographic data in order to identify account-level data or the
companies from which the traffic is coming. Sites can then take that information and map it against their internal databases to potentially locate
look-a-like data or users from the same or similar companies, to better understand what topics are resonating across these organizations.
LSC Digital is one of those service companies focused on data processing that uses website tracking mechanisms employing profiling and
mobility to build new segmentation plans. The end goal is to improve response rates and decrease promotion costs. This is the ability to locate
profitable customers that companies can identify in their existing databases to support ongoing promotions and direct acquisition. This is an
interesting approach because the desired outcome is a combination of the web traffic on sites alongside existing user base trends. If users are not
logged in, it is typically not possible to track their activity back to their record data. With services such as that offered by LSC, one can at least see
traffic trends and apply them to the internal audience.
Other service companies, such as Chartbeat, provide real-time analytics to editorial and marketing teams to help direct content based on web traffic. With Chartbeat, one can see visitors, load times, and referring sites in real time. The difference, unlike the retargeters of ads, is in differentiating what people have clicked from what they have read. Many think that the value lies not in what a user clicks, but in what happens between the clicks. The goal for editors then becomes less about the click-through data and more about the attention of users, how long they stayed on a page or article, and their repeat-visit data.
Web traffic and website topics are one level of activity data. Companies such as LSC that track the type of traffic, or Chartbeat, which uncovers
content value based on user consumption and attention rather than clicks, have the ability to better capture relevant organic traffic. Once these
data sources are combined or the findings leveraged within the “member” database, one can begin to decipher deeper patterns and create a much
richer, more relevant story for the target audience.
Haile (2014) of Chartbeat makes interesting and critical points on how best to maneuver through data and identify which actions are meaningful.
For example, he writes, “As pageviews have begun to fall, brands and publishers have embraced social shares such as Facebook likes or Twitter
retweets as a new currency” (para.14). He goes on to point out that data from these engagements tends to be misinterpreted:
…the people who share content are a small fraction of the people who visit that content. Among articles we tracked with social activity, there
were only eight Facebook likes and one tweet for every 100 visitors. The temptation to infer behavior from those few people sharing can often
lead media sites to jump to conclusions that the data does not support. (para.2)
Instead of an overreliance on clicks or social sharing activities, Chartbeat contends that a more accurate relationship between content and engagement can be determined by the actual attention a reader gives to an article, or what Haile refers to as the “largest volume of total engaged time” – or the Attention Web (para. 16).
Having access to topics of engagement from a media company’s web traffic, through tools like Google Analytics or Adobe Analytics, enables editors to see what stories are being read, what topics are resonating with the audience and how long the audience spends with a particular article or story. Reader engagement data can help direct editorial strategy. Taxonomies, sub-topics and new content strategies can all be generated as a
result of traffic patterns. Similarly, the same level of “attention” to the user base, whether it is data cards within an automation platform or some
other form of data storage, should be given before emailing and marketing to audiences. What are the users reading, downloading, and signing
up for? Are they logged in, are they repeatedly visiting the site, when do they access content (dayparting, or the practice of dividing a day into
several parts and directing different forms of advertising to each part), and how long are their visits? These activities become the beginning of the
process of applying value to users, as well as engaging them in a meaningful way.
How to Leverage Activity Data
At Penton, such activity data became the “What” of understanding users, customers, readers and members. “What” actions result in learning
more about needs, pain points and interests as well as actions resulting in revenue for the business divisions. Each media company will create its
own exercise, but in the case of Penton, the definition included a combination of website content access and consumption, including white
papers, e-newsletters, webinars, virtual events, in-person events, fee-based habits and print subscriptions. Because Penton is an information
services company that covers so many varied industries, there are multiple segments of user lifecycles or lists of activities. The values placed on
those activities are also highly industry-specific.
How is activity data collected, and how is it best used? This may depend on the type of media used to access the information. In some cases,
opening and clicking on an e-newsletter article provides the basis for critical activity data. Capitalizing on hot topics and an engaged subscriber
base helps to drive needed traffic. It’s important to look closely at the articles that do drive messaging for new subscriptions. Higher subscription
counts lead to more internal traffic, as well as additional opportunity to capture more activity data per each user when they react, click and read.
Activity data aligned with what articles users like to read (topics of interest) and the time of day they read (dayparting analysis) can then be used
for other efforts.
The same exercise works with other media that require registration or user log in. By noting the actions taken on sites around media
consumption, users become identifiable by the media to which they gravitate. In some industries, such as technology for example, where IT
professionals have access to educational webcasts, white papers, learning guides, eBooks, interactive infographics, virtual trade shows, research
and surveys, one can uncover interesting trends by record or record type. A researcher or analyst can begin to establish affinity lists based on the
media users seem to repeatedly access.
In addition, it is possible to align the activity with a topic and create a segment within the medium to further target the audience. For example,
some users rely on webcasts to keep up on the latest information or solutions around their roles. Others might gravitate toward white papers and
others might register for a webcast with good intentions, but never attend. In all cases, user behaviors emerge and marketing or content outreach
should be adjusted accordingly. An effective tactic for webcast attendees is to regularly send updates, rewards and/or sneak peeks to maintain
engagement. Those who register and never attend (have the intention but not the time for follow through) might benefit from a documented,
high-level overview. Here they can still access the content, but in a different medium, such as an executive summary. If the goal is lead generation
for the media organization, this is a win. If the objective is user paid activity, one may still win the conversion. If the goal is education and
deepening thought leadership, the goal is accomplished with a condensed version of the presentation in a shorter format.
User Segmentation and User Lifecycles
Another best practice that results from capturing activity or content-type consumption data is dividing users up based on their preferences. Some
users tend to read lengthier documents, some prefer brief summaries or roll ups, some gravitate to slide shows, and others participate regularly
in virtual events, such as webinars and virtual trade shows. Understanding each user’s content consumption preferences leads to better
engagement and more upsell opportunities.
At Penton, users have also been identified as community participants by providing valuable user-generated content within niche communities.
The user data compiled based on these activities provides email segmentation opportunities and trend analysis to better understand the personas
– but most importantly, it provides the opportunity to glean user value or customer lifetime value. To determine this value, a weighted value is
assigned to each activity. The more users engage, the more their value increases. Analyzing user patterns provides insight into user behavior
around content consumption that leads to new ways to segment and ultimately reach the audience. There is no limit to these variables, as
explained in Urbanski (2014, para.3):
Segments are everyone’s business—increasingly, even B2B marketers. They recently asked a number of data suppliers to cite which common customer segmentation variables, such as size and industry type, they provide to clients. What Stevens and Grossman found surprised them: “Some suppliers provide as many as 100 variables, including details on companies’ installed technology and spending on items like insurance and legal services.” Stevens goes on to make the point that “B2B data is richer in segmentation variables than most of us had thought.”
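The weighted-value approach described above can be sketched in a few lines: each activity type carries a weight, and a user's engagement value is the sum of the weights of their activities. The weights and activity names below are entirely hypothetical and would be set per business unit.

```python
# Hypothetical weights per activity type; real values are industry-specific.
ACTIVITY_WEIGHTS = {
    "newsletter_open": 1,
    "article_read": 2,
    "white_paper_download": 5,
    "webinar_attended": 8,
    "paid_subscription": 20,
}

def user_value(activities):
    """Sum weighted activities to approximate a user's engagement value."""
    return sum(ACTIVITY_WEIGHTS.get(a, 0) for a in activities)

print(user_value(["newsletter_open", "article_read", "webinar_attended"]))  # 11
```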
Segmentation, and the ability to act on activity data, is not only about relevancy but also agility. Much of the data collected has a shelf life, or it evolves as the user takes more actions. It is important to understand user activities and to build a data dictionary or encyclopedia so that the entire team, from marketing to editorial to sales, understands how the audience is defined.
Defining, identifying and capturing activity data also leads to a clearer understanding of the user lifecycle. Companies can employ these
techniques to understand user flow across content, content products, topics and frequency. User activity data, coupled with user demographics,
becomes the foundation toward building user value and developing user lifetime value. Customer and user lifecycles are focused on reach,
acquisition, conversion, retention, affinity, and brand loyalty. With so many options for content consumption as well as ways to measure activity,
the “funnel” (the user’s path from initial discovery to engagement or purchase) has become more complex. The question then becomes how to
take advantage of user activity data to help define user or customer lifecycles. As editorial departments are more focused on content creation,
topic relevancy, and user-generated content to increase readership and deepen engagement, the user lifecycle or user value is tied more to the
types of activities and topics and how they impact the business.
Activity intelligence dashboards are just as relevant for users as for buying customers. With data-capture tools and a platform that provides
transparency beyond static data points, editors and marketers have the ability, with customization and user knowledge, to motivate behaviors
based on what the user lifecycle data reveal.
Developing the “Why” of User Personas: Behavioral and Attitudinal User Data
The final and most complex aspect of analytics and big data with regard to readership or user persona is behavioral data – the “Why.” Before
delving into tools and methodologies on how to capture and assign behaviors, the author will first define behavioral data and differentiate it from
attitudinal data (see Figure 3). Dupre (2013) interviewed Eric Tobias, VP of Web products at Exact Target, who observed there are significant
differences between attitudinal and behavioral data:
Attitudinal data is explicit—asking customer questions such as through a survey and evaluating their direct feedback. Behavioral data is
implicit— observing and inferring customer motivations based on actions such as online clicks. Independently, both types of data have virtues,
but relying exclusively on one can leave gaps in understanding the customer’s overall journey. For instance, neglecting attitudinal data can
lead to a lack of understanding as to why a customer converted.(para. 2)
How do media companies today capture behavioral (see Figure 4) and/or attitudinal data and, more importantly, how can they use these data to guide editorial and business initiatives?
Figure 4. Behavioral: insights gained from a collection of activities
To gather these insights, implicit data can be collected via web analytics, opens, and clicks. Service companies such as Chartbeat are providing
editors and community developers the ability to see in real time what their audience is accessing, reading, and where the hot spots are throughout
the day. The goal is to provide quality data – not volume data – and to provide key metrics around user behavior that editors can use to make
smart decisions. The differentiator with this platform from other web analytics platforms is that the focus is on readers’ experiences – namely
how much time they spent on an article, what they read next, and which articles resulted in valued actions or return visits. The ability to measure
audience attention by article in real time is referred to by Chartbeat as “engaged time.” This level of analytics, along with geographic data and
platform usage data, begins to tell a story or stories about readership and what articles, pathways, and external sites are triggering the most
valuable responses.
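An “engaged time” style metric can be approximated from periodic activity pings, as in the sketch below. The ping interval, identifiers and data are hypothetical assumptions, and this is not Chartbeat's actual implementation.

```python
# Approximate engaged time per article from activity pings, assuming the page
# emits a ping every 15 seconds while the reader is active. Data is hypothetical.
from collections import defaultdict

PING_INTERVAL_SECONDS = 15

pings = [  # (article_id, visitor_id)
    ("a1", "v1"), ("a1", "v1"), ("a1", "v1"), ("a1", "v2"),
    ("a2", "v1"), ("a2", "v3"), ("a2", "v3"),
]

engaged_seconds = defaultdict(int)
for article_id, _visitor in pings:
    engaged_seconds[article_id] += PING_INTERVAL_SECONDS

# Articles ranked by total engaged time rather than by raw clicks.
for article_id, seconds in sorted(engaged_seconds.items(),
                                  key=lambda kv: kv[1], reverse=True):
    print(article_id, seconds, "seconds of engaged time")
```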
Similar to Chartbeat, Evergage is software aimed at helping sites create a highly personalized user experience in real-time to enhance customer
engagement. In terms of attitudinal or explicit data, the best path is typically customer insight surveys, polls, field sales, and marketing outreach
to obtain accurate and customized answers, which then become the measure for user satisfaction or consumption patterns. Then, the ideal is to
combine these disparate data sources to enable a complete or holistic view of your audience (Dupre, 2013).
There are advantages to combining attitudinal survey data and behavioral data captured on the web, including content access as well as content
premium access such as white paper downloads and webcast participation. Attitudinal surveys, much like registration forms, only measure what the audience self-reports, which is subject to inaccuracy, dishonesty and bias. Dayparting and the level of engagement play an important part as well: learning when to deliver surveys to different audience types is critical to the success and completion of lengthy surveys.
Self-reported data, coupled with behavioral data, provide a more transparent view into user interest as well as a comprehensive view of the
segment to which each user belongs. Understanding behavior first – knowing topics of interest, content mediums of interest and the times of day when users are most engaged – can ultimately improve survey engagement. All of these behavioral data can help in positioning and deploying insight surveys with greater success.
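Combining the two data types can be sketched as an outer join on a shared user key, flagging where stated interests and observed reading agree. The fields and values below are hypothetical.

```python
import pandas as pd

# Hypothetical explicit (survey) answers.
survey = pd.DataFrame({
    "email": ["[email protected]", "[email protected]"],
    "stated_interest": ["cloud security", "fleet telematics"],
})

# Hypothetical implicit (behavioral) signals from web analytics.
behavior = pd.DataFrame({
    "email": ["[email protected]", "[email protected]"],
    "top_topic_read": ["cloud security", "soil health"],
    "engaged_minutes_30d": [42, 7],
})

# Keep every user from either source and flag agreement between the two views.
combined = survey.merge(behavior, on="email", how="outer")
combined["interest_confirmed"] = (
    combined["stated_interest"] == combined["top_topic_read"]
)
print(combined)
```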
The Role of Web Analytics
Web analytics plays a key role in understanding users. It’s important to employ web analytics data that captures traffic trends to accurately reflect
the motivations of the user base. Ensuring that web behavior data are linking back to the user base and enhancing user records provides a more
comprehensive and transparent view into the audience than by simply profiling them. To reach this capability, it’s important to attach behavior
to the user database.
More service companies have emerged to go beyond the traditional services once provided by list brokers. Seeing the need for enhanced insights
into the user base, they provide the ability to capture, gain access to, and act on behavioral data with a shelf life. For example, I-Behavior, a
database marketing and behavior profiling company, leverages cooperative members who contribute transactional data to a database. In
exchange, they receive higher-quality prospects for their own marketing, through the use of predictive models based on data about previous
transactions. Behavioral-based service providers also offer capabilities that help enrich customers’ in-house databases by appending and
overlaying data elements such as interests, lifestyle, and behavior, to help increase the relevance of marketing messages to each customer.
Beyond these tools, another layer of capturing and interpreting behavioral data harkens back to the concept outlined by Jenkinson (1994) that
like-minded behaviors coupled with demographics results in like-minded actions. In platforms such as Eloqua, these communities of like-minded
behaviors are referred to as List Groups or Custom Data Objects, and they enable personalization as a starting point. With the significant increase in digital content consumption, communities or user buckets are becoming more niche and more splintered. Users are falling into multiple sub-cultures of existing communities, and the potential for over-messaging caused by too much data threatens relevancy, affinity and a positive user experience. With personalized content being the key driver of user experience success, investing more time in defining the broader segments or buckets of the audience may pay off. For example, some broader segments and steps along the user lifecycle could be defined as seeds, influencers, advocates, and the influenced. Conyard (2014, p. 9) explains one way to interpret these buckets:
The idea is initially seeded through both idea creation and advertising (paid and unpaid). It is likely the idea will be picked up initially (but not
exclusively) by existing advocates. Through them, the message will spread to those that will be influenced by the idea, namely further
advocates and influencers within the market space. When conducted correctly, this can result in a snowball effect – an influencer aligns with
an idea in a strategy and in turn encourages support from their own advocates, who go on to become ambassadors for the idea themselves.
Similarly, he points out that the counterpoint is that one can also create extremely negative reactions that are subsequently “broadcast to a wider sphere.”
Behavioral Data
Behavioral data help to more robustly uncover and define the user lifecycle. It is not just about implicit trends, interests, topics, and pain points.
Lifecycle or implicit actions of the user are also contingent on the campaigns, content, and topical asset-types being consumed. Areas of interest,
sub-categories, hot buttons, and community threads have a known shelf life. Campaign interest such as registering and attending a webcast on a
particular topic provides insight into a user’s medium of interest.
How long will the “topic” of interest be relevant? This harkens back to examining layers of the user persona and the, at times, unsophisticated
delivery of retargeted ads to browsers. Is the product in the shopping cart still relevant? Is that behavioral data worth capturing? How long do
editors and marketers developing persona-based databases in marketing automation platforms hold onto these kinds of data? The answer lies in the ability to access audience segments by profile and activity, along with behavioral data, in real-time. Colwell and Taylor (2014) refer to
the ability to target in real-time based on user behavior as “Always-On Marketing (AOM) or data-driven and content-led experiences that are
delivered across channels and devices in real-time.” They surveyed 685 c-level marketing, technology, and business executives and found that
fewer than “20 percent have the technology, creative execution, and integrated data to deliver a targeted experience to a recognized customer
across channels.”
Ultimately, the information collected needs to be actionable by the editor or marketer within the media organization. Success lies with companies that are able to generate unified views of their users, or what might be called the complete user persona (profile, activity, and behavior), with defined groupings or communities, and to “develop actionable rules for experiences in each moment across channels, with the creative assets pulling dynamically in each millisecond” (Colwell & Taylor, 2014). The Razorfish-Adobe study revealed that more than 75 percent of these companies, spanning six vertical industries and company sizes with revenues exceeding $500 million annually, were unable to convert behavioral data into usable directives for targeted messaging. To address needs such as this, Razorfish and Adobe work in concert to rapidly identify the roadmap needed to extract the most value for the client. The result is actionable plans to fill the gaps between marketing
goals and identified roadblocks. This is just one example of how creative agencies and web and data analytics platforms are coming together to
offer solutions, dashboards, formulas and speed to real-time data usage.
CONCLUSION
Now that components of user personas, analytical tools, access to big data capture, and methods for defining the audience have been addressed,
how does all this shape the user experience and impact decision making on user and content strategies?
Although this chapter included many companies that support the ability to utilize data and better understand user personas, there is no perfect
combination of tools, applications, platforms, and data structure. The answer will be unique based on a media company’s needs, budget, and
internal skill set. The trend is that all companies should consider the need for logic around user identities and broader segmentation. As Colwell
(2014) points out, “It’s just not good enough to invest in only the technology or data analytics skills. You need your digital marketing ecosystem
working together.”
The following questions must be asked and answered in order to define each organization’s ecosystem (see Figure 5): What should the ecosystem
look like based on the direction of the industry? What are the customer insight surveys revealing coupled with implicit data capture? What are
the user ‘aha’ moments that define the user, content, and product roadmap?
Figure 5. Market segmentation defined by profile, activity,
behavioral and attitudinal data
At the beginning of the chapter, the author described an online shopping experience. Her expectations for seeing relevant ads based on her user behavior and web perusing were not high. In contrast, the author has high expectations for this type of activity from media and information services companies to which she has subscribed, of which she has become a member, or whose premium and gated content she has accessed. Those activity-based and implicit trends or choices should be capitalized upon. For these companies, there is an expectation that they are storing users’ activities and will use these data to create more relevance when messaging the user. Mining and interpreting user or audience data is a critical component
of success in today’s media and information services companies. Employing some of the strategies outlined here would be beneficial to align with
this metrics-driven culture:
• Leverage the right tools to capture and append accurate demographic and firmographic data quickly
• Locate the right marketing automation platform and take advantage of segmentation, trigger programs and the ability to nurture and warm
audiences intelligently
• Ensure data sets are linking – explore if and how the content contributors and the marketers and analysts unify
• Visualize the user path, make it a recurring exercise to recognize associations around activity and behavior to build out algorithms
• Set benchmarks to monitor increased affinity, loyalty and ROI – define what the user lifecycle looks like and what the trends are within the
communities
Other questions to keep in mind include how the user personas evolve and represent the industry. Understanding users and defining personas is
just a part of the user experience roadmap. What is the user’s journey now and in the future? Do the content assets, products and channels align
with the user-centric direction?
Data sources are increasing every year, from Google Analytics, Chartbeat, marketing automation data platforms, B2B contact lists, data appends,
lifecycle processes, behavior profiling, consumer transaction databases, database service providers, and more. They all play a critical role. The job
of editors and marketers is to determine which data are important and how to leverage them effectively.
Understanding the logical flow of data requires clarity with regard to the day in the life of the customer, reader, consumer or subscriber. Their
logical interests and needs will vary based on profile data such as who they are, how senior their role, their concerns and areas of interest,
company size or geography, and other firmographic information. Shaw (2014, para. 38) points out how important it is to turn data into
visualizations. He states:
To make sure that what we will be presenting to the user is understandable, which means we cannot show it all. Aggregation, filtering, and
clustering are hugely important for us to reduce the data so that it makes sense for a person to look at
With regard to leveraging big data in terms of the relationship between data analytics and user experience, Shaw (2014, para. 38) indicates that
this is a different method of scientific inquiry that ultimately aims to create systems that let humans combine what they are good at—asking
the right questions and interpreting the results—with what machines are good at: computation, analysis, and statistics using large datasets.
To rely on the little data within the larger data sets to impact user experience, editors and marketers must form a collaborative team to clarify user personas and then delve into the combination of technical solutions and platforms that support analytics directives.
This work was previously published in Contemporary Research Methods and Data Analytics in the News Industry edited by William J. Gibbs and Joseph McKendrick, pages 111-132, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Colwell, B., Taylor, M. A., & Razorfish. (2014). The state of always-on marketing study. Retrieved 2014, from http://www.alwayson.razorfish.com/#intro
Dupre, E. (2013, November 1). Attitudinal and behavioral data: Better together. Direct Marketing News. Retrieved June 26, 2014, from http://www.dmnews.com/attitudinal-and-behavioral-data-better-together/article/317695/
Franklin, T., Harrop, H., Kay, D., & van Harmlen, M. (2011). Exploiting activity data in the academic environment. Retrieved from http://www.activitydata.org/
Garrison, K. (2014, June 18). 5 key questions to ask before developing customer personas. It’s All About Revenue: The Marketing Blog. Retrieved June 26, 2014, from http://blog.eloqua.com/5-key-questions-to-ask-before-developing-customer-personas/
Haile, T. (2014, March 9). What you think you know about the web is wrong. TIME.com. Retrieved June 26, 2014, from http://time.com/12933/what-you-think-you-know-about-the-web-is-wrong/
Shaw, J. (2014, March-April). Why big data is a big deal. Harvard Magazine. Retrieved from http://harvardmagazine.com/2014/03/why-big-data-is-a-big-deal
Thomas, J. (2014, March 17). Little data vs. big data: Nine types of data and how they should be used. MarketingProfs. Retrieved June 26, 2014, from http://www.marketingprofs.com/articles/2014/24670/little-data-vs-big-data-nine-types-of-data-and-how-they-should-be-used
Urbanski, A. (2014, January 1). 4 modern themes in segmentation. Direct Marketing News. Retrieved June 26, 2014, from http://www.dmnews.com/4-modern-themes-in-segmentation/article/326267/
KEY TERMS AND DEFINITIONS
Activity Intelligence Dashboard: Activity intelligence dashboard provides a comprehensive view of a sales lead as well as the account
associated with that lead.
Attitudinal Data: Attitudinal data attempts to determine how someone “feels” about a product or service by measuring the importance
someone places on attributes of a product or service.
B2B: Business-to-business describes commerce transactions between businesses usually in the context of marketing and sales.
B2C: Business-to-consumer describes commerce transactions between businesses and consumers usually in the context of marketing and sales.
Behavioral Data: Behavioral data is a collection of all data on an individual’s behavior. Behavioral data are typically collected online, but can
also be derived from offline sources.
Big Data: Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using traditional
database management tools.
Contextual Targeting: Contextual advertising is a form of targeted advertising on websites or content displayed in mobile browsers. The
advertisements are selected and served by automated systems based on the content that is relevant to the user.
Data Normalization: Database normalization is the process of organizing the fields and tables of a relational database to minimize
redundancy.
Dayparting: Dayparting is the practice of dividing a day into several segments and directing different forms of advertising to each segment.
Firmographics: Firmographics define the characteristics of organizations that are most likely to spend money on your product or service. What
demographics are to people, firmographics are to organizations.
Infographic: Infographics or information graphics are visual representations of data presented in a form that is quick and easy for readers to
visually interpret.
Little Data: The measurement, tracking, and analysis of the minutiae of everyday lives. Little Data is the small, unique preferences of users.
Marketing Automation: Marketing automation allows companies to streamline, automate, and measure marketing tasks and workflows, so
they can increase efficiency and grow revenue.
Marketing Mix Modeling: Marketing mix modeling is a term for the use of statistical analysis such as multivariate regressions on sales and
marketing data to estimate the impact of various marketing tactics on sales.
Organic Traffic: Organic traffic is traffic that comes to a website via unpaid links from other sites such as search engines, directories, and other
websites.
Permission Pass Email: A permission pass involves sending out a new bulk email to your list asking the recipients to confirm that they want
to continue to be subscribers.
Push Traffic: Push traffic refers to marketing promotions presented to large groups of people through channels such as email campaigns and
banner ads.
Retargeting: Retargeting is a cookie-based technology that uses a JavaScript code to anonymously ‘follow’ your audience all over the Web.
Retweet: A Retweet is a re-posting of someone else’s Tweet on the Twitter social media site.
Sainting: Sainting is a process where back-end user database fields are uploaded and merged with identifiers by email to track what users are
reading and then map that data back to the user’s record data.
SMB: Small to mid-size business.
User Persona: A user persona is a representation of the goals and behavior of a hypothesized group of users. Personas are fictional characters
created to represent the different user types that might use a site, brand, or product in a similar way.
Web Analytics Platform: A Web analytics platform is a tool that measures web traffic and enables businesses to measure, analyze and
improve the effectiveness of a web site.
CHAPTER 55
Improving Healthcare with Data-Driven Track-and-Trace Systems
Eldar Sultanow
XQS Service GmbH, Germany
Alina M. Chircu
Bentley University, USA
ABSTRACT
This chapter illustrates the potential of data-driven track-and-trace technology for improving healthcare through efficient management of
internal operations and better delivery of services to patients. Track-and-trace can help healthcare organizations meet government regulations,
reduce cost, provide value-added services, and monitor and protect patients, equipment, and materials. Two real-world examples of
commercially available track-and-trace systems based on RFID and sensors are discussed: a system for counterfeiting prevention and quality
assurance in pharmaceutical supply chains and an indoor real-time monitoring system. The system-generated data (such as location, temperature, movement, etc.)
about tracked entities (such as medication, patients, or staff) is “big data” (i.e. data with high volume, variety, velocity, and veracity). The chapter
discusses the challenges related to data capture, storage, retrieval, and ultimately analysis in support of organizational objectives (such as
lowering costs, increasing security, improving patient outcomes, etc.).
INTRODUCTION
The healthcare sector is an important driver of economic growth in many countries around the world. In 2013, an estimated $6.15 trillion was spent on healthcare worldwide, with the majority of the expenditures, and the largest share of a country’s wealth, observed in the US
(18% of GDP) and other developed countries such as the Netherlands (11.9% of GDP), France (11.6% of GDP), Germany (11.3% of GDP), Canada
(11.2% of GDP) and Switzerland (11.0% of GDP) (Hoovers.com, 2013; Plunkett Research, 2013). However, many healthcare organizations today
are faced with increasing challenges, including government regulation, stringent safety requirements, aging population, increasing labor costs,
and expensive medical equipment and pharmaceuticals. Manufacturers of medications, medical technology, and supplies, hospitals, nursing
homes, doctor’s offices, pharmacies, and other healthcare sector players all need to better control costs, safety, and quality in order to successfully
meet patient needs, regulatory requirements and competitive pressures.
One information technology (IT) touted for its potential to improve processes and reduce costs is big data – a term coined to reflect very large
and complex data sets, which can be analyzed using sophisticated tools to extract improvement insights. Annually, the healthcare sector in the US
alone could generate $300 billion in value and a productivity increase of 0.7% from using big data. Big data can also create new markets, such as
a market for medical clinical information providers, which is predicted to reach $10 billion by 2020 (Manyika et al., 2011).
Traditionally, big data was defined in terms of three dimensions - volume, variety, and velocity. Recently, a fourth dimension, veracity, was added (IBM, 2014). Volume refers to the size of the data sets, which is considered too large to be managed using traditional database tools. Volume is time- and industry-dependent: today, big data range from hundreds of terabytes (10¹² bytes) of data (the volume stored by most US companies today) to petabytes (10¹⁵ bytes) of data, but these volumes are expected to increase in the future as more detailed data is captured (Manyika et al., 2011; IBM, 2014). Variety refers to the different types of structured and unstructured data, such as traditional transactional data as well as data from sensors, social media, mobile devices, and other real-time monitoring technologies. Velocity measures the speed at which the data is streamed – in near real-time or fully real-time. Finally, veracity refers to the degree of certainty in the data – in terms of sampling rates from a population or in terms of data quality, for example (IBM, 2014).
While the academic and practitioner communities have focused heavily on explaining the value of big data, they did so largely in generic terms,
with little attention given to the technologies and information systems that make big data happen. This chapter presents two real-world case
studies of data-driven track-and-trace technology. Tracking and tracing were terms originally defined in the context of supply chain
management. Tracking implies collecting data about product location and movement from a supplier to a customer, throughout the supply chain.
Tracing denotes the retrieval of the tracking information in order to identify a product’s real-time location, past movements, or origin (Ha &
Choi, 2002; Laurier & Poels, 2012). As sensor technologies evolved, track-and-trace technology today can capture and retrieve not just location,
but other types of data, such as temperature and acceleration of objects and vital signs of people, and do so reliably and in real-time. Thus, track-
and-trace technologies generate data in high volumes and with high variety, velocity and veracity. We believe that understanding their technical
architecture and benefit affordances is essential for helping healthcare organizations analyze this data and generate value.
Track-and-trace technologies enable “real world awareness” (RWA) – a term originally coined by SAP and defined as “the ability to sense
information in real-time from people, IT sources, and physical objects – by using technologies like RFID and sensors – and then to respond
quickly and effectively” (Heinrich, 2005). RWA is intended to reduce or resolve breaks in information sensing and responding, and thus closes
the gap between the natural and the virtual world. The natural world comprises physical and operational reality, e.g. people, products, inputs and
resources, while the virtual world consists of the depiction of reality in IT, i.e. in IT systems and local, regional and global information networks.
The RWA concept is based on the growing trend towards automatic data entry and retrieval processes and an ever-increasing wealth of available
information.
In our increasingly complex world, big data methods that allow tracking, investigation, and pattern recognition of irregularities can help
companies understand and monitor the movement of people and products and take action when necessary. In this chapter, we present two real-
world examples of track-and-trace technology as implemented by a German technology provider and used in several healthcare settings: a system
for tracking medication through a pharmaceutical supply chain, and an indoor real-time location tracking system. We explain how these systems
can capture, transmit, store, and analyze organizational data (such as location, temperature, movement, etc.) about tracked entities (such as
medication or patients) and the embedded knowledge about those entities (such as safety and security risks, service needs, etc.), with the goal of
fulfilling organizational objectives (such as lowering costs, increasing security, improving patient outcomes, etc.). These examples are based on
primary data about the details of the design, implementation and adoption of each system, as well as secondary data about component
technologies and similar systems.
The chapter is organized as follows. We provide an overview of each example, focusing on the business issue, the design of the track-and-trace
systems, the data collection process, and the system benefits. We then summarize the challenges for data collection, transmission, storage,
retrieval and analysis. We conclude with further research directions.
A TRACK AND TRACE SYSTEM FOR PHARMACEUTICAL SUPPLY CHAINS
The first example describes the use of track-and-trace technology to support pharmaceutical supply chains in Germany. Specifically, we describe the XQS track-and-trace system for counterfeit security and drug quality assurance, developed by XQS Service GmbH and first introduced in the German oncology market several years ago.
Issues in Pharmaceutical Supply Chains
Drug quality plays a key role in providing the general public with a universal system of safe health care. Pharmaceutical supply chains, however,
exhibit weaknesses that can affect drug quality. Counterfeit drugs and illegal exports and imports, usually of lower quality, can enter unsecured
supply chains. Quality deficits in drugs, which decrease their effectiveness, can be caused by failure to comply with transportation and storage
requirements (the so-called “cold chain” requirements) by manufacturers, logistics providers or wholesalers. Last, but not least, drugs can be
stolen by criminals and resold outside of the regular supply channels –usually without proper storage and transportation, which affects drug
quality.
Counterfeiting is a major problem in pharmaceutical supply chains. Within just two months, the European Union confiscated 34 million
counterfeit medicine tablets in targeted customs controls in all Member States (Schiltz, 2009). According to the World Health Organization, 70%
of fraudulent drugs appear in developing countries; in addition, 50% of counterfeit medicines contain no active ingredient, 19% contain incorrect
amounts of active ingredients, and 16% are composed of completely wrong ingredients, with very few containing a quality and quantity of
ingredients comparable to their legitimate counterparts (Schweim, 2007). Even when the source of medication is legitimate, processes throughout the
supply chain have a crucial influence on quality, as medication often needs to be kept within specific temperature limits and has to be transported
without damage to fragile bottles or excessive shaking. In Germany, there are currently around 250 drugs requiring cold chain distribution, and
the number of such drugs is steadily increasing (Sultanow, Brockmann, & Wegner, 2011). Temperature fluctuations outside the permissible range
often lead to irreversible changes to the active ingredients. Patient safety is at risk without the patients being aware of this. The use of cold chain
drugs which have lost their chemical effectiveness due to the violation of the prescribed temperature range can ultimately lead to treatment
failure in patients.
The price pressure as well as political and governmental influences in the pharmaceutical market lead not only to the current cut-throat competition, but also to an increased risk of forgery. In the course of globalization, the expansion of export markets, and parallel imports, collaboration processes are becoming increasingly complex. For example, in the last few years, news organizations have reported that in Italy,
expensive drugs are stolen, stretched or manipulated - and then shipped to northern Europe, only to emerge in countries such as Germany. Drug
thefts are a lucrative trade for criminals, as single doses of a drug can be worth hundreds of Euros, and no companies are immune from it. For
example, Trans-o-flex, a large cross-European logistics group, experienced such a six-figure theft in its North Rhine-Westphalia Neuss warehouse
in 2014 – amounting to truckloads of drugs, and half of its inventory. The incident affected dozens of manufacturers and re-importers, including
Roche, Johnson & Johnson, McNeil, and others (Apotheke Adhoc, 2014). Criminals prefer targeting logistics companies or even hospitals
because security in these locations is easier to breach, as opposed to manufacturing facilities, where security is much tighter. The thieves then
follow well-known strategies: the stolen drugs are either thrown on the black market or “laundered” through fictitious intermediate or
wholesalers in Italy and Eastern Europe and then resold in high-price countries. Business with stolen drugs is extremely lucrative: the goods are inconspicuous, clean, easy to transport, and bring in a lot of money. In addition, the penalties for the theft of legal drugs in countries such as Italy are not nearly as high as those for drug trafficking. As the list of stolen drugs grows longer, pharmacists already see the proper drug supply at
risk. It is becoming increasingly difficult to verify if the drugs received are genuine or are from batches of stolen or manipulated drugs. But while
stolen drugs are a lucrative business for thieves due to the high resale value, they represent a major risk for patients who unknowingly take the
stolen and possibly contaminated or tampered medicine.
When detected, counterfeiting, illegal shipments and quality problems lead to major recalls and significant costs for manufacturers. For example,
batches of anti-HIV drugs in German pharmacies, including Viramune, manufactured by Boehringer Ingelheim, and Combivir, manufactured by
GlaxoSmithKline, were recalled in 2009 due to counterfeiting. Many batch recalls relating to (hazardous) quality defects such as bacterial
contamination were also conducted in 2009. Examples include Vicks Sinex in Germany, Great Britain and the United States, and contaminated
batches of the drug Doreperol in Germany and Luxembourg. Unfortunately, when quality and counterfeiting problems are not detected, the use
of the compromised drugs in patients can have serious, sometimes fatal, consequences. In 2010, three newborns being treated at Mainz
University Hospital died after being given contaminated nutritional infusions from a damaged bottle (Sultanow, Brockmann, & Wegner, 2011).
And in Germany, the authorities are frequently discovering illegally imported drugs – such as Herceptin and other cancer drugs, generating
recalls from hospitals and pharmacies.
The European Medicines Agency has also been issuing warnings about manipulated (and with false shipment certificates) medicines – as in the
case of cancer medication containing antibiotics instead of the active ingredient, or being sold as a liquid instead of the normal powder form. This
is likely to adversely affect patient treatment, as the cancer drugs have to be sterile and the doses should be tailored to each individual; any
deviation in package integrity or dosing can lead to serious health problems.
Legislators have recognized the risks associated with these problems and laid down basic requirements for storage and transport in the German
Ordinance on the Production of Pharmaceuticals and Active Substances (AMWHV), which came into effect in November 2006 (Krüger, 2007).
The laws are designed to limit liability for consequential harm caused if other supply chain participants do not follow the guidelines. However,
pharmaceutical companies which (due to data complexity and diversity) cannot trace their product pathways are subject to significant fines.
Thus, tracking and tracing of medication through the pharmaceutical supply chains becomes essential for both manufacturers and patients.
Technologies for Track-and-Trace Systems
The basic concept of real-world awareness as reflected in the literature (Heinrich, 2005; Heinrich, 2006; Diessner, 2007) suggests the following
criteria for a track-and-trace system for pharmaceutical supply chains: labelling and serialization, sensor-based monitoring, traceability (track-
and-trace), visibility, and reduction of breaks in information capture and transmission. However, most pharmaceutical supply chains today are
not supported by track-and-trace systems. Barcodes are still used as the main identifier of medicine packages – and they do not allow companies
to uniquely identify a drug vial (as in serialization). Other real-world awareness functions are also hard to implement with barcodes without
significant labor costs. For example, tracking the movement of the products from the manufacturer through the supply chain would require
manual scanning of the product bar code at every transfer – a very labor-intensive and time-consuming process that few, if any, companies are
willing to implement. Even so, the products are still vulnerable to theft and adulteration as thieves will find it easy to circumvent the barcode
scanning – and no alarms can be generated if a scan does not take place. Furthermore, barcodes can become damaged in transit, and difficult to
read – thus making tracking and tracing impossible.
Newer labelling technologies include 2D barcodes (also known as Data Matrix) and radio-frequency identification (RFID). Data Matrix (GS1, 2014) is a labeling technology based on a two-dimensional square or rectangular code composed of black-and-white square cells, which encode the binary digits 0 and 1. Depending on its size (usually between 10x10 and 144x144 cells in a square label), a Data Matrix label can encode a maximum of 2,335 alphanumeric characters or 3,116 numeric digits. Thus, the label can store a unique serial number and other relevant information.
Data Matrix labels can be printed on product labels (such as on food and drug packaging), or engraved directly on products (such as equipment).
The labels can be read using optical devices – and as long as the label resolution is high and the reading devices are properly positioned in the
line of sight, reading accuracy is very high. While Data Matrix labels are more expensive to print than regular barcodes, they are still relatively
cheap, making them an attractive option for tracking applications.
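To illustrate what such a serialized payload might look like, the sketch below composes a GS1-style element string of the kind a Data Matrix label on a drug pack could carry and checks it against the alphanumeric capacity quoted above; the specific field values and layout are assumptions made for illustration.

```python
# Illustrative sketch (not XQS code): compose a GS1-style serialized payload of the
# kind a Data Matrix label on a drug pack could carry, and sanity-check that it fits
# well within the label's alphanumeric capacity mentioned above.

MAX_ALPHANUMERIC = 2335  # upper bound for a large (144x144) Data Matrix symbol


def build_serialized_payload(gtin: str, expiry_yymmdd: str, batch: str, serial: str) -> str:
    """Concatenate GS1-style application identifiers: (01) GTIN, (17) expiry,
    (10) batch/lot, (21) serial number. The field values used below are fictitious."""
    payload = f"(01){gtin}(17){expiry_yymmdd}(10){batch}(21){serial}"
    if len(payload) > MAX_ALPHANUMERIC:
        raise ValueError("payload exceeds Data Matrix capacity")
    return payload


label = build_serialized_payload(
    gtin="04012345678901", expiry_yymmdd="261231", batch="A1B2C3", serial="SN000000123"
)
print(label)                                            # string an encoder would turn into a symbol
print(len(label), "characters of", MAX_ALPHANUMERIC, "available")
```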
RFID (RFID Journal, 2014) is a wireless technology that involves tags that can encode information (such as unique product serial numbers and
other information about the lot number, manufacturer, etc.) and communicate it wirelessly when queried through a radio signal transmitted by a
reader. RFID tags have been developed for high frequency (HF) and ultra high frequency (UHF) radio waves. Passive RFID tags consist of an
electronic chip and an antenna, which are usually affixed to a paper substrate in the form of a flat label, which can be then attached to goods.
They can communicate information to readers up to 4 meters away. Active tags also include a battery that amplifies the signal power and enables
the tag to be “read” from a longer distance - up to 100 meters away. Because the reading is done wirelessly, the tag does not need to be in the line
of sight of the reader, thus improving reading capabilities. Depending on the memory capacity on the chip, an RFID tag can encode from tens to
thousands of characters. A special encoding device is required to write the data on the tag, and the tag can also be erased as needed, allowing it to
be reused. RFID tag prices vary – from cents to tens of dollars per tag - depending on their characteristics, such as memory capacity or reading
distance.
A comparison of Data Matrix and RFID technologies reveals both have several advantages and disadvantages. While Data Matrix codes are more
sophisticated than the regular barcodes, they still suffer from some of the same issues. In particular, they can be easily photo-copied and attached
to counterfeit medication packages, thus giving them a seal of authenticity. The counterfeit packages usually move more quickly through the
supply chain to the consumer as thieves tend to price them lower than the original drugs. As the market participants are price-sensitive, the first
drugs to sell to a pharmacy are the lower-priced counterfeit ones, while the original drugs may sit in storage at a wholesaler’s location for a longer
time. The code on the counterfeit package is scanned and verified at the pharmacy as genuine (since the real drug the code belongs to did not make it
to the pharmacy yet), and the counterfeit drug is administered to the patient, usually very soon after. This is especially true for temperature-
sensitive drugs which have high requirements for storage and transportation. Later, the original drug enters the market and is flagged as a
forgery during a scan at the pharmacy. Even if an investigation is initiated at that time, the fake drug was already administered and the patient may
be adversely affected as a result. As industry insiders confirm, this creates a systematic, hard to detect problem for pharmaceutical supply chains,
and similar scenarios are already happening in countries such as Italy, for example.
Even if RFID is more expensive and requires more complex processes for attaching tags to products, it presents a significant advantage versus
barcodes (regular or 2D) in many applications, and especially in a pharmaceutical environment. First, RFID has bulk reading capability that can
speed up the track and trace processes throughout the supply chain. And second, RFID tags are harder to falsify and allow for continuous
monitoring, thus preventing problems such as those described in the previous section. Furthermore, because of the higher cost (and profit) of
medication, as well as because of the higher cost of potential patient adverse events from labeling and supply chain failures, there is a stronger
business case for using RFID tags for pharmaceuticals.
A System for Improving Safety and Quality of Pharmaceutical Supply Chains
In this section, we present an RFID-based system, developed by the German company XQS, that helps improve the safety and quality of pharmaceutical supply chains. This XQS track-and-trace system allows the implementation of the five real-world awareness criteria discussed previously. We discuss each one of these criteria below.
To achieve labelling and serialization, each pack of drugs is given a unique identification number. This number is persistently stored in a central
database – called a Trust Center. Electronic checks can be carried out so that pharmacists, doctors and patients can always be sure that drugs are
original and come from an authentic source. Because of the advantages of RFID discussed earlier, the XQS track-and-trace system was
implemented using RFID tags to store the drug unique identification number.
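The electronic check can be pictured roughly as follows. This sketch models the Trust Center as a simple in-memory registry of serialized identifiers and logs each verification attempt; it is an illustration of the idea, not the actual XQS implementation, and all identifiers are fictitious.

```python
# Hedged sketch: verify a pack's unique ID against a Trust Center registry and log
# the read event. In the real system the registry is a central database and the reads
# come from QS terminals; here both are simulated in memory.
from datetime import datetime, timezone

trust_center = {"XQS-0001234": {"product": "ExampleDrug 50mg", "status": "active"}}
read_log = []  # each verification attempt is recorded for later tracing

def verify_pack(unique_id: str, terminal_id: str) -> bool:
    record = trust_center.get(unique_id)
    authentic = record is not None and record["status"] == "active"
    read_log.append({
        "unique_id": unique_id,
        "terminal": terminal_id,
        "time": datetime.now(timezone.utc).isoformat(),
        "authentic": authentic,
    })
    return authentic

print(verify_pack("XQS-0001234", terminal_id="PHARMACY-17"))  # True
print(verify_pack("XQS-9999999", terminal_id="PHARMACY-17"))  # False -> flag for investigation
```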
The system also features sensor-based monitoring, where ambient conditions (temperature, shock) affecting the drug are recorded, monitored
and stored so that the pharmacist, doctor and patient can all be sure of the product’s quality and effectiveness. Sensory components are used in
order to monitor compliance with ambient conditions and to provide precise information on whether or not the quality of the pharmaceutical
product has been impaired, e.g. during storage and transport. This applies first and foremost to temperature, as a great many drugs are
temperature-sensitive. For example, cold chain drugs must be stored and transported at a temperature between 2°C and 8°C, while dry goods
must be stored and transported at temperatures between 15°C and 25°C. For drugs requiring cold chain distribution, there must be no
interruption in the maintenance of the correct temperature at any stage in the supply chain until the product is dispensed to the patient. A further
aspect of relevance to quality assurance is monitoring of inclination (tilt), impact, and shaking. Shock effects can cause hairline cracks in
ampoules or syringes, through which impurities (toxins or infectious agents) can get into the medicine. This can have unexpected adverse side-
effects or even fatal consequences, as in the above-mentioned case of newborn deaths. The XQS infrastructure includes RFID-based ambient
conditions monitoring devices, about the size of a credit card, which are placed in the package along with the drugs, activated with the XQS
software prior to dispatch, and later read by the recipient using specialized QS terminals located at the wholesaler’s premises, in pharmacies or
hospitals.
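A minimal sketch of the kind of rule such monitoring implies, using the 2°C-8°C cold-chain and 15°C-25°C dry-goods ranges mentioned above; the reading format is invented for the example.

```python
# Sketch of an ambient-conditions check against the temperature ranges quoted above.
# The ranges come from the text; the reading format is illustrative.

LIMITS = {"cold_chain": (2.0, 8.0), "dry_goods": (15.0, 25.0)}  # degrees Celsius

def find_excursions(readings, product_class):
    """Return readings that fall outside the permitted range for the product class."""
    low, high = LIMITS[product_class]
    return [(ts, temp) for ts, temp in readings if temp < low or temp > high]

readings = [("08:00", 4.5), ("09:00", 7.9), ("10:00", 9.3)]  # (timestamp, °C)
print(find_excursions(readings, "cold_chain"))  # [('10:00', 9.3)] -> cold chain broken
```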
The system also has traceability (track and trace) features: the origin, interim stages and ambient conditions arising along the entire supply chain
for products can be identified. This requires persistent storage of data relating to packs of drugs, including the information on packages to be
dispatched, their content and ambient conditions, delivery information, and time and place of receipt. This is performed at the QS terminals
which are linked to the Trust Center. The terminals are located at various sites. The terminals themselves and their location data are registered in
the Trust Center. They read the unique ID marked on the package label. By reading the number, the terminal is able to verify the product’s
authenticity and displays its origin and detailed product information. Because each reading of the pack at the terminal is also logged in the
database, the geographical flows of medicinal products within the XQS infrastructure can also be tracked and traced.
Visibility at every stage is closely linked with traceability. Awareness applications generate up-to-the-minute information about the status of the
supply chain, as well as irregularities (which may point to counterfeiting), ambient conditions affecting the drugs and medicinal products, and
drug flows. These applications are real-time systems for monitoring and visualization. The drug pack’s identification, including the drug
identification key (a uniform nationwide number used in Germany and abbreviated as PZN), batch number and other attributes, is linked to a
database hosted on a Trust Center server. At any stage in the supply chain – whether it be the wholesaler’s outbound logistics or the pharmacist’s
receipt of the goods – data checks with the Trust Center can take place. This means that the identification number can be read locally and
compared with the data held on the server; the reading event is also logged. This data collection forms the basis for visibility at every stage and
includes temperature and order histories, drug flows and statistical evaluations. The latter are used to detect irregularities, such as the purchase
of large quantities which never reappear in the market. These “black holes” may be linked to illegal exports, counterfeiting and product
manipulation (Sultanow, Brockmann, & Wegner, 2011).
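The “black hole” evaluation can be thought of as a reconciliation between quantities purchased upstream and quantities later observed downstream. The sketch below illustrates that idea with invented figures; it is not the statistical evaluation actually used by the system.

```python
# Hypothetical sketch of a "black hole" check: packs bought in bulk that never
# reappear at downstream read points may indicate illegal export or manipulation.

purchased = {"batch-A": 1000, "batch-B": 400}        # packs sold to a wholesaler
reseen_downstream = {"batch-A": 980, "batch-B": 15}  # packs later scanned at pharmacies

def flag_black_holes(bought, reseen, threshold=0.5):
    """Flag batches where less than `threshold` of purchased packs reappear."""
    flagged = []
    for batch, qty in bought.items():
        ratio = reseen.get(batch, 0) / qty
        if ratio < threshold:
            flagged.append((batch, ratio))
    return flagged

print(flag_black_holes(purchased, reseen_downstream))  # [('batch-B', 0.0375)]
```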
Finally, reduction of breaks in information capture and transmission is another feature of the technology. As indicated above, medicinal products and their supply data (e.g. recipient, items, quantity, batch number, etc.) are equipped with a unique identifier that can be input directly into pharmacy and hospital systems. Seamless technological integration of business functions along the pharmaceutical supply chain
prevents capture and transmission errors and reduces transaction costs. Special printers that are part of the XQS infrastructure enable RFID tags
to be printed out on paper, so that delivery notes can be labelled and all the items on the delivery note can be linked to the corresponding (already
serialized) drug packs. This RFID-tagged delivery note is known as an electronic delivery note. Each QS terminal is capable of reading electronic
delivery notes and can automatically transmit the linked delivery data to the ERP system at the pharmacy or hospital. In the same way, the
terminal also transmits the data from RFID-tagged drug packs to the ERP system, so there is no need for manual data entry. Within the XQS infrastructure, the data flows through all participants’ systems without any breaks.
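Linking the delivery note to the serialized packs also makes a simple receiving check possible. The sketch below illustrates one such check with fictitious identifiers; it is a plausible use of the linked data rather than a documented XQS feature.

```python
# Illustrative reconciliation of an electronic delivery note against the packs
# actually read on receipt, before the data is handed to the pharmacy's ERP system.

delivery_note = {"XQS-0001234", "XQS-0001235", "XQS-0001236"}  # serialized packs listed
received_reads = {"XQS-0001234", "XQS-0001236"}                # packs read by the terminal

missing = delivery_note - received_reads      # listed but not received -> investigate
unexpected = received_reads - delivery_note   # received but not listed -> investigate

print("missing packs:", missing)        # {'XQS-0001235'}
print("unexpected packs:", unexpected)  # set()
```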
This gap-free collection and cross-referencing of all data allow analyses to be carried out for the purpose of counterfeit security and quality
assurance as the basis for optimizing the supply chain, as well as for health care research.
The XQS pilot was initially launched by the wholesaler Max-Pharma in cooperation with its manufacturing partner Sun Pharmaceuticals
Germany (Swedberg, 2011), and generated significant business value for stakeholders in the pharmaceutical supply chain (Chircu, Sultanow, &
Saraswat, 2014). Since then, various other manufacturers as well as oncology practices have adopted the XQS model. Other pharmaceutical
manufacturers have also deployed similar RFID systems in their supply chains, in search of security and quality benefits. For example, in the US,
Purdue Pharma has implemented item-level RFID tags for the OxyContin painkiller subject to counterfeiting and abuse, and Abbott Labs started
testing a temperature tracking solution for its supply chain (RFID Journal, 2014). As an additional validation of the value such a track-and-trace
system has for pharmaceutical supply chains in general, several major global shipping companies have implemented similar technologies. For
example, DHL has developed a new air freight service dubbed DHL Thermonet, which allows customers to track the temperature during goods
transportation using a passive RFID tag and a battery-powered temperature logger. Similarly, a top global logistics company, Panalpina, has used
RFID and temperature sensors to develop a cold chain system that allows management of multiple temperature ranges during shipping (RFID
Journal, 2014).
A TRACK AND TRACE SYSTEM FOR INDOOR MONITORING
The second case study describes the use of track-and-trace technology in a real-time location system (RTLS) developed by the same German
technology provider, XQS Service GmbH, and introduced in the healthcare market in Germany as well as in Asia.
Issues in Monitoring Patients and Assets in Healthcare
As many healthcare organizations are focusing on reducing costs, they are taking a closer look at the efficiency of their processes and the
productive utilization of their specialized staff. What many discover is that their healthcare workers spend considerable time and effort trying to
track down wandering patients or misplaced equipment. In nursing care or mental health facilities, patients with mental or physical deficiencies
(such as those suffering from dementia or Alzheimer’s) can wander off without anybody noticing, and due to their illness cannot remember how
to get back, or ask for directions. If they suffer a health-related event, they are usually not able to call for help or provide relevant personal
information to healthcare providers, thus jeopardizing their treatment. The traditional solution for monitoring such patients is to limit their
freedom of movement in the facility, but this can result in a decreased quality of life for patients and increased stress for healthcare providers. A
related problem exists in neonatal care units: although newborn babies cannot wander off by themselves, they could be subject to kidnapping.
While rare, kidnapping attempts still occur in hospitals today, and unfortunately some are successful (BBC, 2010). And in any facility, expensive
medical instruments, equipment, prostheses, etc. can get misplaced, lost or stolen. Searching for these items can shift the healthcare providers’
focus away from patients, affecting efficiency, quality of care and staff morale.
Achieving real-world awareness in healthcare facilities implies monitoring the movement of patients and medical instruments in real time.
Unfortunately, most healthcare organizations today do not have the money and human resources to do so. Currently, instruments are labeled
with barcodes. Patients are identified with bracelets which contain their names and date of birth, and sometimes a barcode as well. However, as
we discussed before, barcodes require labor-intensive reading – using a barcode reader – in order to provide the tracking data. It is impractical
for healthcare facilities to read these barcodes any time an instrument is moved or a patient changes location. The barcodes can easily become
damaged and unreadable, thus preventing tracking. It is also easy for the patient or a third party (as in the case of newborn kidnapping) to avoid
the manual reading of the barcode, and there are no alarms that can alert the medical staff something is wrong. However, the same RFID
technology discussed in the previous section can be used in this case as well. In the next section, we present an RFID-based system, developed by
German company XQS, that solves these problems.
A Track-and-Trace System for Real-Time Monitoring
QS-Locate is an indoor track-and-trace system that can monitor organizational entities – people or assets – inside a certain perimeter (such as
inside a medical ward, or an entire building). It captures the position of persons as well as items (medical equipment, medication, other hospital
assets, etc.) relative to pre-set boundaries. It can also capture the temperature of the tracked items or vital signs of tracked persons in real time.
The basic function of the QS-Locate system is to identify an entity – a person or an item – and compare their location and other relevant data
with pre-determined limits. If the entity data is outside of the pre-approved boundaries, the system generates alerts or performs other corrective
actions. QS-Locate uses RFID tags and sensors (both passive and active) placed on the entity. Data on location, temperature, and/or vital
functions is captured from each entity using the tags and sensors. The system contains high performance middleware which collects all data from
the RFID devices and dispatches it to all connected client devices. Up to 10,000 clients can be accommodated by this system. Figure 1 presents an
overview of the system.
Figure 1. Schematic structure of the architecture of the QS-
Locate system (Source: XQS)
The middleware is streaming oriented: it sends real-time data to the client devices, using RFID loggers placed on entities of interest (beds,
equipment, medication, patients, staff, etc.). The data is logged into the system in two modes: Live Mode and Offline-Log-Mode. The Live Mode
is used for real-time data collection, while the Offline-Log-Mode is used for collecting data during transportation and logging it into the system at
a later time. The QS-Locate system can detect offline loggers and import all logged data into one central database. Each logger device has its own
identification number, which is registered in the system before it begins to collect data. Depending on the setting, the logger begins to save data
every few seconds or minutes after being activated. The collected data include: date, time, temperature, location, battery conditions, etc. The data
from the logger device can be read by fixed or mobile readers.
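A small sketch of the kind of record a logger produces and of an Offline-Log-Mode import, based on the fields listed above (date, time, temperature, location, battery); the data structure itself is an assumption made for illustration, not the QS-Locate schema.

```python
# Sketch of a logger record and an offline-log import, using the fields named above.
from dataclasses import dataclass

@dataclass
class LoggerReading:
    logger_id: str
    timestamp: str       # date and time of the reading
    temperature_c: float
    location: str        # zone or reader that saw the tag
    battery_pct: int

central_db = []  # stands in for the central QS-Locate database

def import_offline_log(readings):
    """Import readings collected in Offline-Log-Mode into the central store."""
    central_db.extend(readings)
    return len(central_db)

offline_log = [LoggerReading("LOG-42", "2014-05-01T08:00", 6.1, "transport", 87)]
print(import_offline_log(offline_log))  # 1 record now in the central store
```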
When the loggers operate in Live-mode, data saved in the logger is sent through the readers, and is saved by the system database, so that
particular values can be found in the database when they are needed later (see Figure 2). Data exchanges take place in the system database, where
the normal and critical (outside of established ranges) data can be automatically differentiated by the system. When a critical event is identified,
e.g., when a person wearing a logger enters the monitoring area without authorization, or the temperature rises above or falls below the preset range, the system will activate an alarm and display an alert on the monitor. Thus, the live-mode operation allows the monitoring of
persons and items – an application suitable for neonatal care units, mortuaries, or for tracking expensive equipment or devices. For example, the
location of at-risk patients, such as newborns, can be monitored by combining the live-mode location data with a map of the area where the
patients are allowed to be at any given time (i.e. a secure neonatal care unit), and additional information about the allowed duration of their stay
in a certain area (such as an incubator). Any deviation (leaving the secure area, spending more time in an area than allowed, breach of the
security perimeter, etc.) is flagged for investigation (see Figure 3).
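The live-mode rules described above (allowed areas and allowed duration of stay) can be sketched as a simple policy check; the zone names, limits, and identifiers below are invented for illustration.

```python
# Hedged sketch of live-mode monitoring: flag an entity that leaves its allowed zones
# or stays in one zone longer than permitted. Zones and limits are illustrative.

policy = {
    "newborn-07": {"allowed_zones": {"neonatal_ward", "incubator"},
                   "max_minutes": {"incubator": 120}},
}

def check_reading(entity, zone, minutes_in_zone):
    alerts = []
    rules = policy[entity]
    if zone not in rules["allowed_zones"]:
        alerts.append(f"{entity} left the secure area (seen in {zone})")
    limit = rules["max_minutes"].get(zone)
    if limit is not None and minutes_in_zone > limit:
        alerts.append(f"{entity} exceeded allowed stay in {zone}")
    return alerts

print(check_reading("newborn-07", "incubator", 150))  # dwell-time alert
print(check_reading("newborn-07", "corridor", 1))     # perimeter alert
```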
When the logger operates in the offline mode, data saved in the logger will be transferred into the system after it is collected, for retroactive
analysis. In both modes, the system can provide a historical view of which items and which persons have been in specific locations – including
how long they stayed and which ways they moved from one zone to another. Reports can show statistics on which items spend the most time in specific zones, which can be useful in detecting bottlenecks for equipment and personnel. The temperature data collected in both Live and Offline modes can be listed, analyzed and exported to other applications, such as Excel, and can help companies meet the requirements of good manufacturing practices
and government regulations during manufacturing, transportation and storage of temperature-sensitive medications – both by live monitoring
and retroactive analysis of historical data – ultimately increasing patient trust and treatment safety.
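The zone statistics mentioned above amount to aggregating dwell time per item and zone; the following sketch illustrates that aggregation with invented events.

```python
# Illustrative retroactive analysis: total time each item spent in each zone,
# computed from a sequence of zone-entry events. The event data is invented.
from collections import defaultdict

# (item, zone, minutes spent before moving on)
events = [("infusion-pump-3", "OR-1", 95), ("infusion-pump-3", "storage", 400),
          ("infusion-pump-3", "OR-1", 60), ("wheelchair-9", "ward-B", 300)]

dwell = defaultdict(lambda: defaultdict(int))
for item, zone, minutes in events:
    dwell[item][zone] += minutes

for item, zones in dwell.items():
    busiest = max(zones, key=zones.get)  # zone where the item spends the most time
    print(item, dict(zones), "-> most time in", busiest)
```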
One potential extension to QS-Locate is integration with more sophisticated logger devices in order to capture vital function data from patients.
Stand-alone loggers can incorporate more sophisticated health monitors; alternatively, these can be seamlessly integrated in patient clothing.
Patient monitoring using a wireless system was a concept patented many years ago (Heinonen & Okkonen, 1998), and the idea of integrating
sensors into textiles has been around for more than a decade as well. Scientists have been working since then to develop the technology for
commercial use. The review that follows draws from some of these works (Edmison, Jones, Nakad, & Martin, 2002; Bonato, 2003, Paradiso,
Loriga, Taccini, Gemignani, & Ghelarducci, 2005; Mashable, 2014) to present a summary of technology opportunities and challenges and
potential benefits.
There are many challenges with sensor integration in textiles and clothing, primarily related to the sensor design and reliability. In the simplest
case, extra pockets can be added to garments to house the components of vital sign monitors. However, for seamless integration and data
collection, sensors can be added directly during the textile manufacturing processes by incorporating special yarns during weaving or knitting. In
order to capture valid vital metrics, interfering factors must be eliminated - for example sweating due to physical stress; this is accomplished by
using special materials with sweat-reducing features and active moisture transport (Edmison, Jones, Nakad, & Martin, 2002; Paradiso, Loriga,
Taccini, Gemignani, & Ghelarducci, 2005).
The sensor-enhanced materials can then be used for clothes, bandages, or bed linens. For example, specific body functions (regularity of
breathing, heartbeat, sleep cycles in case of irregularities or snoring) can be monitored while sleeping by sensory bed sheets. And temperature
sensors placed in socks can diagnose cold feet, which may be a sign of magnesium deficiency. This method of monitoring is unobtrusive, and can
increase patient self-esteem by eliminating the feeling of always being “watched” by or “chained to” a traditional vital signs monitor. The
integrated sensors can capture detailed metrics, such as humidity (perspiration), body temperature and heart rate on the skin surface, and the
data can be further transferred with RFID technology. The system creates the opportunity to protect patients by monitoring their vital signs,
predicting their health risks based on collected data, and generating alarms or automated emergency calls if needed (Edmison, Jones, Nakad, &
Martin, 2002; Bonato, 2003, Paradiso, Loriga, Taccini, Gemignani, & Ghelarducci, 2005; Mashable, 2014).
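As a purely illustrative (and non-clinical) sketch of the alarm-generation idea, the example below flags readings that fall outside configurable bounds for two of the metrics named above; the thresholds are placeholders, not medical guidance.

```python
# Illustrative (non-clinical) sketch of generating alerts from streamed vital signs.
# The bounds are placeholders chosen for the example only.

BOUNDS = {"heart_rate_bpm": (40, 140), "skin_temp_c": (35.0, 38.5)}

def vital_sign_alerts(sample):
    alerts = []
    for metric, value in sample.items():
        low, high = BOUNDS[metric]
        if not (low <= value <= high):
            alerts.append(f"{metric}={value} outside [{low}, {high}]")
    return alerts

print(vital_sign_alerts({"heart_rate_bpm": 155, "skin_temp_c": 36.8}))
```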
Prototypes indicate that it is possible to monitor vital physiological data by means of special shirts that measure ECG data, pneumographic
values, movement values, skin and internal temperature as well as information about body position, and send it to another location for evaluation
using a wireless transmission device. This is especially useful in the early diagnosis and monitoring of cardiovascular diseases and helps the
patient have higher mobility and independence (Paradiso, Loriga, Taccini, Gemignani, & Ghelarducci, 2005). The applications are not limited to
healthcare. Emergency technicians, firefighters, soldiers, even high-performance athletes can benefit from such monitoring (Paradiso, Loriga,
Taccini, Gemignani, & Ghelarducci, 2005). Several companies are demonstrating different possible designs. Sensatex integrates sensors, a
microphone and a multi-function processor into a smart shirt (Lugmayr, 2006). Adidas proposes a heart rate analyzer that can help athletes
optimize their training levels by integrating with a mobile application that acts as personal trainer and tailors training advice in real time
depending on the levels of vital signs (Adidas, 2014). Workers exposed to extreme conditions (high or low temperatures, high altitudes, high
stress situations, etc.) can wear these types of clothing with integrated sensors, which can generate alerts if the workers’ vital signs indicate an
increase in stress or the risk for an adverse health event. This information can be used for making decisions regarding staffing levels and work
assignments. This system can therefore improve the safety of employees under potentially dangerous working conditions (Paradiso, Loriga,
Taccini, Gemignani, & Ghelarducci, 2005).
Real-time location systems such as QS-Locate are increasingly popular in hospitals and other healthcare facilities around the world.
Implementations of such systems in neonatal care units were reported in Korea, Spain, and the US, among other countries. After birth, babies
and their mothers receive passive RFID tags, sometimes enhanced with biometric information, that ensure the hospital staff always works with the
correct mother-baby pair and prevent mix-ups. Babies’ tags are monitored with a location system that does not allow nurses or anybody else to
take any baby outside of the secure area or tamper with the tag. In case of irregularities, the systems can sound an alarm or lock down the
maternity ward area to prevent a potential kidnapping. Similar “wanderer management” systems are implemented by various nursing care
facilities using a technology provided by Kimaldi Electronics. Both patients and staff have passive RFID bracelets, which communicate location
information to antennas installed on the ceiling (a technical solution designed to facilitate reading of multiple tags without errors). And last, but
not least, the value of tracking medical equipment with RFID has been independently verified in several successful implementations in various
hospitals (RFID Journal, 2014).
TRACK AND TRACE SYSTEMS: DATA-RELATED CHALLENGES
As can be seen from the above descriptions, track-and-trace systems have to collect, transmit, store, retrieve and analyze a variety of data from
many tracked entities in real-time. This presents several challenges, which are discussed below.
Data collection challenges relate to the difficulty of ensuring data quality along dimensions such as accuracy, completeness, and timeliness. The
Healthcare Information and Management Systems Society (HIMSS) identifies data quality as an essential component for successful analytic
applications (HIMSS, 2013). To this end, sensors and RFID tags need to be calibrated so that they capture accurate data on location, ambient
conditions and vital functions, respectively. Data such as temperature also needs to be recorded without interruption – without complete data,
the integrity of the cold chain cannot be determined. And in order to ensure traceability and real-time visibility, data capture has to be timely – in
real-time or almost real-time. From a legal standpoint, data collection also needs to be secure, so that unauthorized tampering / manipulation
can be prevented. But because the systems we described do not collect sensitive patient-related data, just temperature data and location data
(where assets and/or persons are located), there are no legal requirements for additional privacy (as data like this is normally collected in order
to ensure controlled access to buildings or work time monitoring for employees, for example).
Data transmission and data storage present several challenges as well. One technical challenge is to transmit data across organizational
boundaries without any errors or interruptions, even when data velocity and variety are high. This may increase the costs of the solutions, as the
transmission systems need to be designed with reliability in mind. Reports indicate that serialized product data is much more dynamic, granular
and diverse than the transactional data traditionally captured by ERP and decision support systems in the past (Galli, 2012). Also, data should
only be exchanged via encrypted channels to prevent unauthorized access that could compromise the tracing ability or the privacy of the patients.
When used for certification of cold chain processes, data must be securely stored and historical information should be available to the authorities
in case of an investigation into quality issues. The volume of data will thus continue to increase as more drugs, patients and assets need to be
tracked in real-time. As a result, organizations may face storage capacity and data retention problems. For example, in a hypothetical supply
chain example of a retailer, Galli (2012) estimates that tracking 1,500 items from 20 suppliers results in 750,000 serialized items; if the retailer
has 5 data read points, this results in 3.75 million RFID data points that need to be organized and stored for future analysis.
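To make the scale concrete, the short Python sketch below reproduces the arithmetic of Galli's (2012) retailer example; the variable names and the breakdown are our own illustrative framing, not part of the original source.

```python
# Parameters following Galli's (2012) hypothetical retailer example.
item_types = 1_500          # distinct items tracked
suppliers = 20              # suppliers shipping those items
serialized_items = 750_000  # serialized item instances reported in the example
read_points = 5             # data capture points along the retailer's chain

# Every serialized item generates one data point per read point.
rfid_data_points = serialized_items * read_points
print(f"{rfid_data_points:,} RFID data points to organize and store")  # 3,750,000
```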
Data retrieval also presents several challenges related to the volume and variety of data. For smaller applications, relational databases and
structured query language (SQL) may work at first, but as the volume of data reaches millions of records these tools become inadequate. Big data
solutions based on more advanced, faster (but also more expensive) technologies such as HANA (SAP’s in-memory computing platform) are
needed in this case. For example, XQS implementations in German pharmacies, wholesale and outpatient centers and hospitals do not have a
need for big data technologies at the moment, but another implementation in Asia is already exceeding 6 million records, which require in-
memory computing capabilities (Chircu, Sultanow, & Saraswat, 2014). With time, however, the volume of data for all implementations will grow, underscoring the need to adopt more sophisticated big data tools and to filter the data so as to identify and quickly retrieve the parts that are most useful for analysis – the “RFID beeps that matter” (Galli, 2012).
One additional challenge relates to the creation and interpretation of the analyses based on the collected data, especially in the case of patient
vital signs monitoring. Medical professionals have yet to develop a comprehensive model of how variations in different vital metrics are correlated, and how they relate to various health risks. Researchers have started to apply data mining methods to the large amounts of data available from body sensors and to identify patterns that may inform future treatment guidelines, but this work is still in its infancy (Bonato, 2003). In the absence of such models, misinterpretation of what the data indicates can become a real problem.
Last, but not least, track-and-trace systems raise data security and privacy issues. The design of the system needs to match the privacy rules of the
clients. For example, the XQS system uses encrypted connections for data transmission. The system is fully integrated in the IT infrastructure of
hospitals, enabling hospitals to keep sensitive patient data on their own servers. In addition, healthcare organizations using the track-and-trace
system have to ensure the basics rules of personal patient data processing are met. In particular, the retention of records of treatment (at least 10
years after completion of treatment) and the prevention of unauthorized disclosure require careful access management so that data privacy is
maintained.
CONCLUSION AND FUTURE RESEARCH DIRECTIONS
Very high standards of transparency, safety and security are required in the healthcare environment. In this book chapter we illustrate how track-and-trace technology can use data to achieve these goals. We describe two real-world examples of commercially available track-and-
trace systems, based on RFID and sensor technologies, and used for preventing counterfeiting and increasing quality of medication in
pharmaceutical supply chains, and for real-time monitoring of objects and people, respectively.
In the process of illustrating these technologies that support data-based wisdom, we start opening the black box of data and its capture,
transmission, storage and analysis. We believe that healthcare industry executives struggling with questions such as “How to deal with the
mounds of track-and-trace data?” need to first understand how data is created, and what contributes to, or detracts from, its high volume, variety,
velocity, and veracity. Technology, as described in this chapter, has affordances that need to be described before the journey from data to wisdom
can begin. To this end, we present the first step in the journey – an analysis of the business problems, the system architecture and the expected
benefits of the data-driven track-and-trace systems, as well as some of the data-related challenges they create.
If current predictions are true, today’s track-and-trace systems are barely scratching the surface of big data opportunities in healthcare. It is
therefore very possible that, for big data, the best is yet to come – perhaps in the form of better track-and-trace architectures, metric interaction
models, and filtering, search and predictive mechanisms that make it easier to identify relevant data and derive meaningful insights and
responses in real time.
This work was previously published in Strategic Data-Based Wisdom in the Big Data Era edited by John Girard, Deanna Klein, and Kristi Berg, pages 65-82, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Bonato, P. (2003). Wearable sensors/systems and their impact on biomedical engineering. IEEE Engineering in Medicine and Biology Magazine, 22(3), 18–20. doi:10.1109/MEMB.2003.1213622
Chircu, A. M., Sultanow, E., & Chircu, F. (2014). Cloud computing for big data entrepreneurship in the supply chain: Using SAP HANA for pharmaceutical track-and-trace analytics. In Proceedings of the 2014 IEEE 10th World Congress on Services (SERVICES 2014) - Industry Summit (pp. 450-451). Anchorage, AK: IEEE. doi:10.1109/SERVICES.2014.84
Chircu, A. M., Sultanow, E., & Saraswat, S. (2014). Healthcare RFID in Germany: An integrated pharmaceutical supply chain perspective. Journal
of Applied Business Research , 30(3), 737–752.
Clemens, A. (2009). RFID: identifikation mit innovation: Vision und wirklichkeit. Retrieved from http://www.lsb-
plattform.de/innologist09/Herr_Clemens_RFID_Identifikation mit Innovation_Vision und Wirklichkeit.pdf
Diessner, P. (2007). SAP's view of supply chain visibility: managing distributed supply chain processes with the help of supply chain event
management (SCEM) . In Ijioui, R., Emmerich, H., & Ceyp, M. (Eds.), Supply chain event management: Konzepte, prozesse, erfolgsfaktoren und
praxisbeispiele (pp. 71–83). Heidelberg, Germany: Physica-Verlag. doi:10.1007/978-3-7908-1740-9_5
Edmison, J., Jones, M., Nakad, Z., & Martin, T. (2002). Using piezoelectric materials for wearable electronic textiles. In Proceedings of the Sixth International Symposium on Wearable Computers (ISWC) (pp. 41-48). Seattle, WA: IEEE. doi:10.1109/ISWC.2002.1167217
Ha, S.-C., & Choi, J.-E. (2002). A model design of the track & trace system for e-logistics. Operations Research , 2(1), 5–15.
doi:10.1007/BF02940118
Heinonen, P., & Okkonen, H. (1998). Method for monitoring the health of a patient. United States Patent No. 5,772,586.
Heinrich, C. E. (2005). RFID and beyond: Growing your business through real world awareness . Indianapolis, IN: John Wiley & Sons.
Heinrich, C. E. (2006). Real world awareness (RWA) - nutzen von RFID und anderen RWA-technologien . In Karagiannis, D., & Rieger, B.
(Eds.), Herausforderungen in der wirtschaftsinformatik(pp. 157–161). Berlin, Germany: Springer. doi:10.1007/3-540-28907-0_11
Laurier, W., & Poels, G. (2012). Track and trace future, present, and past product and money flows with a resource-event-agent model. IS
Management , 29(2), 123–136.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: the next frontier for innovation,
competition, and productivity. McKinsey Global Institute. Retrieved from
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
Paradiso, R., Loriga, G., Taccini, N., Gemignani, A., & Ghelarducci, B. (2005). WEALTHY – A wearable healthcare system: New frontier on e-
textile. Journal of Telecommunications and Information Technology , 4, 105–113.
Sultanow, E., Brockmann, C., & Wegner, M. (2011). RFID & pharma (teil 1). Ident, 16(4), 36-40. Retrieved from
http://www.ident.de/uploads/media/ident_2011_4_WEB.pdf
Swedberg, C. (2011). German drug company tracks products with UHF tags. RFID Journal. Retrieved from
http://www.rfidjournal.com/article/view/8509
ADDITIONAL READING
Laheurte, J.-M., Ripoll, C., Paret, D., & Loussert, C. (2014). UHF RFID technologies for identification and traceability. London, England: ISTE; Hoboken, NJ: John Wiley & Sons. doi:10.1002/9781118930939
Liao, W., Lin, T. M. Y., & Liao, S. (2011). Contributions to radio frequency identification (RFID) research: An assessment of SCI-, SSCI-indexed
papers from 2004 to 2008. Decision Support Systems , 50(2), 548–556. doi:10.1016/j.dss.2010.11.013
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review , 90(10), 60–68.
Tzeng, S., Chen, W.-F., & Pai, F.-Y. (2008). Evaluating the business value of RFID: Evidence from five case studies. International Journal of
Production Economics , 112(2), 601–613. doi:10.1016/j.ijpe.2007.05.009
Wamba, F. S., & Ngai, E. W. T. (2012). Importance of the relative advantage of RFID as enabler of asset management in the healthcare: Results from a Delphi study. In Proceedings of the 45th Hawaii International Conference on System Sciences (HICSS) (pp. 2879-2889). Maui, HI: IEEE. doi:10.1109/HICSS.2012.315
Zhou, W., & Piramuthu, S. (2010). Framework, strategy and evaluation of health care processes with RFID. Decision Support Systems , 50(1),
222–233. doi:10.1016/j.dss.2010.08.003
Zhu, X., Mukhopadhyay, S. K., & Kurata, H. (2012). A review of RFID technology and its managerial applications in different industries. Journal
of Engineering and Technology Management ,29(1), 152–167. doi:10.1016/j.jengtecman.2011.09.011
KEY TERMS AND DEFINITIONS
Big Data: A term coined to reflect very large and very complex data sets.
Radio Frequency Identification (RFID): A technology that uses radio frequency electromagnetic fields to communicate data between a tag (a transponder that stores data electronically) and an interrogating reading device. RFID tags have been developed for high frequency (HF) and
ultra high frequency (UHF) radio waves. Passive RFID tags use the energy of the interrogating radio waves to reflect back the information over a
relatively short distance, while active RFID tags have a battery that allows for a bigger transmission range.
Real-Time Location System (RTLS): A hardware and software system that enables tracking of physical entities such as objects or humans in real time, usually in a finite area such as a room, a floor, or a building, and generates alerts or triggers other relevant actions if entities go outside of
the pre-approved perimeter.
Real-World Awareness (RWA): A term coined to describe the sensing of real-time data from tracked physical entities such as people and
objects and virtual entities such as information systems, and the associated responses.
Supply Chain: A network of organizations, individuals, technologies, and associated processes supporting the movement of products from
suppliers to customers.
Tracing: The set of activities that retrieve tracking information and identify a product’s real-time location, past movements, or origin.
Tracking: The set of activities for collecting information about product movements, usually through the supply chain.
CHAPTER 56
Big Data Applications in Healthcare
Jayanthi Ranjan
Institute of Management Technology, India
ABSTRACT
Big data is in every industry. It is being utilized in almost all business functions within these industries. Basically, it creates value by transforming human decisions into automated algorithms using various tools and techniques. In this chapter, the authors look at big data
analytics from the healthcare perspective. Healthcare involves the whole supply chain of industries from the pharmaceutical companies to the
clinical research centres, from the hospitals to individual physicians, and anyone who is involved in the medical arena right from the supplier to
the consumer (i.e. the patient). The authors explore the growth of big data analytics in the healthcare industry including its limitations and
potential.
INTRODUCTION OF BIG DATA
Big data has generated a lot of hype in the analytics industry and draws attention from around the world. Big data is not being created only now; it has been present from the beginning. It is just that more and more data is being generated in today's technology-driven world. Big data is simply the old form of data processed in traditional data warehouses, augmented by real-time and operational data stores. It is not only too big to be processed using traditional tools but also available in a wide variety of forms (Datastax Corporation, 2012). Reports suggest that every day the world generates 2.2 million terabytes of new data, of which 10% is structured and 90% is unstructured. Also, with the introduction of the latest technologies in this digital world, over 90% of all existing data was created in the past two years.
As discussed above, big data is defined using three different dimensions – volume, variety and velocity. In the end, the main challenge lies in extracting from the data the relevant knowledge and information that can be used to make the right decisions (Oracle, 2013).
BACKGROUND: BIG DATA ANALYTICS IN HEALTHCARE INDUSTRY
A few decades ago, the flow of information in the healthcare industry was relatively simple with minimal usage of technology. Nowadays,
technology has become much more advanced, and the flow of information has become more complicated. Today, data mining and big data analytics are being used to manage inventories, develop new drugs, manage patients’ records and the cost of medicines, and administer clinical trials. The methodologies involved in the data extraction and information retrieval process are the same as in other sectors, but the usage and relevance of that data have changed to suit the needs of hospitals, pharmaceutical companies, healthcare organisations, medical research labs and drug trial clinics. Data collection methods have improved, but unfortunately the techniques used to process the data are still lagging behind (Ranjan, 2007).
Role of Big Data in Healthcare Sector
Big data is being actively used in the healthcare sector these days to change the way decisions are made. It is changing the entire healthcare
ecosystem by providing cost-effective measures, better resources and measurable value around the globe. As per the KPMG (2012) report, 70% of the healthcare data is generated in Europe and North America, and by 2015 the world is expected to generate 20 exabytes of healthcare data. The big data market in the healthcare sector has become so massive that in 2010 it contributed 7% of global GDP and reduced global healthcare expenditure by around 8% by providing better solutions. The figure below shows various initiatives and developments undertaken by agencies around the world to revamp the healthcare sector using big data.
• Differentiating hype about Big Data Analytics from fact and application
MAIN FOCUS OF THE CHAPTER
Impact of Big Data Analytics on the Healthcare Industry
The impact of big data from the patient’s point of view can be divided into five value-based components, where value is derived from the total cost incurred and the impact or outcome on the lives of patients. These value-based components are also called pathways, as they provide guidance for further improvement along each of them:
1. Right Living: According to this pathway, consumers are supported in taking an active role in their own well-being, for example by making informed decisions about their lifestyles (Groves, 2013).
2. Right Care: According to this path, all the healthcare points will have the same data about the patients, from a common database, that
will help them in coordinating efforts to help the patients in the same manner, and ensure that no glitches occur, like duplication of effort
and wrong strategies (Groves, 2013).
3. Right Provider: This pathway defines that patients should always be treated by the right professional, a person who is best suited to the
task required for that particular patient. It ensures that patients are given attention to the level required, and by the person with the best
track record (Groves, 2013).
4. Right Value: This pathway ensures that the right value is added to the system without compromising on quality and efficiency. It helps
to eliminate fraud in the system and allows for outcome related compensation measures that give an incentive to healthcare providers to
provide the best possible services (Groves, 2013).
5. Right Innovation: This pathway addresses the stakeholders to continuously nurture innovation in this field by encouraging R&D and
trials by their researchers. It maintains that only through innovation can organisations hold a competitive advantage, and they must use data mining results for continuous improvement (Groves, 2013).
Applications of Big Data Analytics in the Healthcare industry
There are numerous applications of business data analytics in the healthcare industry. They can be general applications, such as customer relationship management and healthcare effectiveness, or medical-specific applications, such as disease detection, predictive medicine and personalized treatment. In this chapter, we describe some of the more important applications in this industry:
Customer Relationship Management: Customer Relationship Management is not just restricted to banks, retail outlets, hotels and food chains,
but it is also being extended to the healthcare industry these days. There are numerous contact points in healthcare organizations such as call
centres, billing points, welcome desks, physicians’ offices, etc. (Rygielski, 2002). There are four stages in the relationship between a customer and the business, known as the customer lifecycle. These stages define the role of CRM at each step and help companies improve their relationship at every stage. Data mining in CRM helps on both the input side and the output side: on the input side, it provides an understanding of the customer database, and on the output side, it provides techniques for finding interesting results in that database. Data mining also helps healthcare organisations to change their interactions with customers over time. For example, customers entering the retirement stage of their lives will have changed needs, and data mining models will detect that change, along with customers with similar behaviour patterns (Rygielski, 2002). Customer Relationship Management can be modelled through data mining techniques to target individual customers for long-lasting relationships built on specific applications developed for their convenience. CRM can be used to predict which medicines a patient will buy, which doctor has attended a particular patient the most, or whether the patient will comply with prescriptions. The use of data mining to build CRM models can help healthcare organizations to improve their service, reduce waiting times and provide better information to patients (Chye Koh, 2005). Pharmaceutical companies can utilise healthcare CRM to their advantage by deciding which customer base to target for their latest drugs, deciding which physicians to choose for clinical trials, and predicting the future demand for drugs according to the genetic makeup of customers and their past ailments (Chye Koh, 2005).
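As an illustration of the kind of CRM prediction described above, the following sketch trains a simple logistic regression on synthetic patient features to flag likely compliance with prescriptions. It is a minimal sketch under stated assumptions: the features, the toy labels and the choice of scikit-learn are illustrative and do not describe any system cited in this chapter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, illustrative patient-CRM features: age, number of past refills,
# days since last contact, and a 0/1 label for prescription compliance.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(20, 85, n),    # age
    rng.integers(0, 12, n),     # past refills
    rng.integers(1, 180, n),    # days since last contact
])
# Toy label: compliance assumed more likely with more refills and recent contact.
y = ((X[:, 1] > 4) & (X[:, 2] < 90)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

In practice the features would come from the CRM contact points listed above, and the predicted probability would feed outreach decisions such as refill reminders.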
Big data provides many opportunities by engaging the players involved in the healthcare system. It provides the same set of solutions but with more accuracy, privacy, consistency and facility. Alternative solutions are also provided, which can bring benefits through better and faster planning, cost-effectiveness, better research, quality service and effective optimization. External data is processed using various tools to improve the quality of internal processes. All external stakeholders, such as the National Health Information Network (NHIN), come together and share information on a common platform to integrate the data within the healthcare system. All forms of big data – in terms of volume, velocity and variety – are used in healthcare. Large volumes of data from different stakeholders, arriving in different forms – structured and unstructured – are brought together and processed at a very fast pace to provide better treatment options for patients and customers.
However, the scientists and analysts who process healthcare big data face certain challenges. The first problem is the lack of information available for decision-making: the volume of data is no doubt very large, but integrating data from different players becomes a very tedious task. The data is held by each party in different forms, and concerns such as privacy and propriety hinder analysts from processing the data easily. Another challenge arises with the security of the data: when personal data is shared freely with different parties, security threats and the chances of fraud become high. Countries have, however, enacted different laws to curb this issue, one such act being the Health Insurance Portability and Accountability Act (HIPAA). Lastly, the challenge lies in defining the standards needed to generate accurate data for processing, and all decisions have to be taken within particular timelines (Cognizant, 2012).
Research Methodology
The research approach is qualitative in nature, as the basic idea was to gain information from officials and staff of customer-centric organisations regarding big data implementation in healthcare. The research methodology adopted is based on an in-depth study of big data in healthcare and its growing significance in the business world. The application of big data in healthcare helps firms to gain insights into healthcare preferences, healthcare experiences, healthcare products and their usage patterns among users. No analysis of real data was performed, and no specific industry was targeted for deriving principles, as the aim was to understand the impact of big data in healthcare. The research approach is based on discussions with officials and staff and is not confined to numbers. The research was not confined to a single healthcare or technology firm or type of organisation, as the emphasis was on formulating the challenges and role of big data in healthcare. The study is exploratory, so that the reader becomes familiar with the basic idea of big data with respect to the healthcare industry. As the research aims to generate new understanding and principles on the basis of a formulated questionnaire in the future, it is primarily exploratory research.
Impact of Big Data on the Healthcare System
McKinsey & Company (2013) developed a value-based model that discusses how big data has changed the paradigm by making a big impact in
five different ways.
1. Right Living: Well-being of consumers is taken care of by helping them to take informed decisions about their lifestyles.
2. Right Care: Right set of treatment and care is provided using data backed by complete evidence and outcomes to ensure customer safety.
3. Right Provider: Matching the right set of care provider with the patient is also done to deliver prescribed clinical impact.
4. Right Value: Cost-effective measures are designed to enhance healthcare value by generating continuous sustainable approaches.
5. Right Innovation: Adequate resources are provided for advancing innovation and R&D to discover new approaches and safety measures for healthcare development.
Google Flu Trends
The Google search engine encounters a huge volume of search queries in a variety of formats every day – what we now term big data. Table 1 gives a sense of the sheer volume of data arriving at Google’s search engines from all over the world.
Table 1. Adapted from (Google Official History, 2013)
Google has long been analyzing search query data to find popular trends and patterns on the Internet. These trends were unique in the sense that they were derived from data generated all over the globe and were therefore truly representative of what is trending online globally – what captures the interest of the global population right now – rather than just one country. In 2008, Google decided to explore the promising world of big data analytics further in order to extract more powerful and socially useful information from the huge mountain of search query data available to it. The problem it decided to focus on was the outbreak of infectious diseases such as influenza, which are potentially alarming and dangerous for mankind for two reasons: first, epidemics of seasonal influenza are responsible for millions of deaths around the world; and second, doctors often encounter the birth of a new strain of influenza virus against which there is no prior immunity, making the disease far more dangerous and deadly. This gave birth to a new platform, or web service, called Google Flu Trends, whose main objective is to accurately model real-world phenomena using patterns in search queries (GoogleBlog, 2008).
Google found that certain aggregated search queries tend to be very commonly used during the flu season. Its algorithm distinguishes between general flu-related search terms used to gather information on the subject and specific queries used by people experiencing influenza-like illness (ILI) (Google Trends, 2013). The research further observed a close relationship between the frequencies of these search queries and the number of people experiencing flu-like symptoms over a span of years. By counting the number and frequency of such search queries, Google Flu Trends could estimate how much flu is circulating in different countries and regions around the world.
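A minimal sketch of this idea is shown below: a linear model is fitted on the log-odds of the flu-related query fraction against the log-odds of the ILI visit fraction, in the spirit of Ginsberg et al. (2009). The weekly figures are invented for illustration and do not come from Google Flu Trends.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

# Synthetic weekly data (illustrative only): fraction of searches that are
# flu-related, and the corresponding fraction of physician visits that are ILI.
query_fraction = np.array([0.002, 0.004, 0.008, 0.015, 0.010, 0.005])
ili_fraction   = np.array([0.010, 0.018, 0.035, 0.060, 0.042, 0.022])

# Fit logit(ILI) ~ b0 + b1 * logit(query fraction); coefficients are illustrative.
X = np.column_stack([np.ones_like(query_fraction), logit(query_fraction)])
b0, b1 = np.linalg.lstsq(X, logit(ili_fraction), rcond=None)[0]

# Nowcast ILI activity for a new week from its observed query fraction.
new_q = 0.012
estimate = 1 / (1 + np.exp(-(b0 + b1 * logit(new_q))))
print(f"estimated ILI visit fraction: {estimate:.3f}")
```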
In regions with a large Internet user base, the general observation is that people first try to gather information online about any illness or symptoms they might be experiencing. When they have sufficient evidence, they visit the doctor; the majority of people visiting a clinic have first gathered information about the illness or symptom online. There is a lag between this online search and the doctor’s visit, which Google Flu Trends uses to its advantage: it can estimate the prevalence of a disease at the first interaction that people have with the Google search engine, some time before they actually visit the doctor. Hence, Google utilized people’s health-care-seeking behavior in the form of queries on its online search engine.
Benefits from Google Flu Trends
The model developed by Google for tracking and monitoring flu activity around the globe gives estimates that closely match traditional flu activity indicators (government records, etc.). It is a very cost-effective method for estimating flu activity in an area of interest, yielding a significant level of accuracy, and can hence be used in practice by healthcare stakeholders. It utilizes the concept of collective intelligence to harness its huge data source in real time (GoogleOrg, 2013).
Traditional flu surveillance systems take 1-2 weeks to collect and release surveillance data, but Google search queries can be counted automatically and very quickly. Data collection is easier and less time-consuming than with traditional systems of recording. Google Flu Trends can predict regional outbreaks of the flu up to 10 days before the Centers for Disease Control and Prevention (CDC) report them (Wikipedia, 2013; Ginsberg et al., 2009).
Because of its faster detection, Google Flu Trends acts as an early-warning system for outbreaks of influenza, which is crucial for infectious diseases where time is of the essence, given the nature of the challenges posed by such diseases as mentioned at the beginning of the discussion. Early detection of a disease outbreak can reduce the number of people affected and hence reduce the mortality rate substantially. The official response to a flu pandemic, such as vaccine distribution and timing, can now be greatly enhanced by such an early warning. Healthcare resources such as doctors, medicines and equipment, which are always scarce in times of epidemics, can now be put to more efficient and targeted use than ever before, resulting in hundreds of lives being saved.
Challenges
Like any other big data analytics system, Google Flu Trends faces a few challenges, such as combing through the mass of big data to produce meaningful results. In the case of analyzing search queries in particular, it is crucial to identify the context in which a query was submitted in order to distinguish between relevant data (actual flu incidence) and irrelevant data (e.g. gathering information on flu for a school project). Smarter algorithms have to be built to overcome this challenge on a daily basis. Another challenge is the sampling bias resulting from prevailing patterns of Internet usage. For example, data generated from Twitter accounts is skewed in the sense that Twitter is predominantly used by younger demographics of the population. It therefore cannot be truly representative of the entire population, and leads to missing points in the data for the age groups that do not use Twitter (e.g. the elderly and children). Here too, the underlying algorithms need to be adjusted to overcome this weakness (HBR, 2013). What analysts have to ensure in the end is that the interpretations drawn from the data actually fit with what we know and expect about the world.
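One standard way to adjust for such sampling bias is post-stratification: reweighting each demographic group by its share in the target population rather than in the sample. The sketch below illustrates the idea with invented group shares; it is not the adjustment actually used by Google Flu Trends or any other system mentioned here.

```python
# Post-stratification: reweight age groups so a Twitter-skewed sample better
# reflects the general population. All shares below are invented for illustration.
sample_share = {"18-29": 0.55, "30-49": 0.30, "50+": 0.15}      # who is in the sample
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}  # who is in the population
signal_by_group = {"18-29": 0.08, "30-49": 0.05, "50+": 0.03}   # e.g. flu-term rate per group

# The unweighted estimate over-represents younger users.
unweighted = sum(sample_share[g] * signal_by_group[g] for g in sample_share)

# Weight each group's signal by its population share instead.
weighted = sum(population_share[g] * signal_by_group[g] for g in population_share)

print(f"unweighted: {unweighted:.3f}, population-weighted: {weighted:.3f}")
```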
Google Flu Trends: Potential Applications in India and Abroad
1. Widespread global usage of online search engines may enable models to eventually be developed in international settings, which can be of
great use to international health bodies like WHO to monitor health issues around the globe and gather resources to fight such outbreaks in a
successful manner. Leveraging help from other countries to tackle such outbreaks would require less time than before.
2. Google Flu trends can be extended to include Dengue trends (Beta version running) / Malaria trends especially in tropical countries like
India.
3. The eradication of HIV and of female feticide are two of the most pressing social issues in India. Because of the stigma attached and the secrecy maintained by those affected, it is almost impossible to identify the suffering before it is too late. Using the Google Flu Trends model, we could identify regions or areas showing high activity of HIV-like symptoms, or of searches suggesting contemplated female feticide, so that government interventions and targeted welfare programs can be built and disseminated in such areas to raise awareness and save lives.
The new technologies in medicine also raise many ethical questions. There will be future cases in personalized medicine in which we can clearly say yes or no, and also therapies that we have to exclude based on what we have interpreted from big data. For all the hardship of such a diagnosis, a terminally ill patient could be certain of his or her future treatment and also be spared much suffering. Another major challenge is to determine to what extent the data you have is relevant. In the case of a seriously debilitating disease like cancer, there are tens of variables which may be presented to you, and you will have to choose which of them are relevant. Privacy is another major hurdle on the way to a more networked system of cancer data: the more centralized the available data, the more devastating unauthorized access would be, for example if the information were tapped by criminals.
PENETRATION OF BIG DATA ANALYTICS IN PHARMACEUTICAL INDUSTRY
The adoption of big data in healthcare can change the delivery of healthcare and also impact pharmaceutical research and drug development. In the current scenario, healthcare companies around the world have started using big data analytics to analyze the claims and clinical data they have about patients, in order to draw conclusions about patients' risk exposure and, subsequently, the choice of prescribed drug. This analysis can be taken a level up and extrapolated to arrive at conclusions on personalized medicine and customized care. Big data on healthcare, and the insights it yields, can help pharmaceutical companies with marketing strategies; provide commercial pointers to guide the R&D decision-making process; supply real-time data on healthcare outcomes that improve development strategies; identify unmet medical needs and the limitations of prevailing therapies; support an evidence-based value proposition; and improve lifecycle value and asset maximization. But the power of data lies in the way it is used. There is high ambiguity over transparency and compliance, and apprehension regarding big data in the pharmaceutical industry; hence the players need to ensure strong coordination between commercial and research & development entities, build analytical rigor, and develop critical hypotheses that can be tested systematically. Otherwise, big data might lead to incorrect conclusions that have detrimental effects on a product or an entire portfolio (Mahajan, 2010). The growth of the business intelligence market in the pharmaceutical industry will be driven by factors such as the need to develop an effective and transparent information infrastructure, the identification of the latest sales forecasts and customer trends, the attainment of cost efficiency, and the increase in outsourcing. The data-intensive nature of the pharmaceutical industry makes BI deployment more challenging than in other industries. The adoption of business intelligence is a challenge for various reasons, such as the availability of data at different time periods, frequent changes in legislation, and the continuously growing consolidation in the pharmaceutical industry (Mahajan, 2010). Besides achieving operational excellence, pharmaceutical companies are investing in BI tools for implementation in technical areas such as improving research and drug development, clinical performance, resource management, sales force tracking, and regulatory compliance reporting. Another lucrative use of business intelligence tools is in the decision-making process for launching new products and services in existing markets. Pharmaceutical companies are more likely to invest significantly in an operational reporting BI tool in order to fortify their marketing and sales functions (Mahajan, 2010).
Comparative Study of Pharmaceutical Industry from Indian and Global Perspective
Indian Perspective
Ranbaxy Laboratories holds huge data repositories procured from various internal and external sources. Internally, the sources of data were the various applications dispersed across multiple offshoots of the organization, as well as the backbone of the environment – the ERP system. The company could collate the data but could not provide its users with an integrated view of all the data within the organization. This significantly delayed access to information, report generation and, subsequently, the decision-making process (computer.financialexpress.com, 2003). Ranbaxy considered making SAP its centralized database, but since SAP was a transaction system and could not substitute for a warehouse, it did not implement this. Ranbaxy required a data warehouse coupled with a powerful extraction tool that would dig into heterogeneous data sources to find historical data. Hence, Ranbaxy implemented both data warehousing and data mining solutions in early 2002. Initially, the solution was implemented only for its marketing division in India and subsequently rolled out to all the marketing offices. It was later implemented in the manufacturing division as well (computer.financialexpress.com, 2003). The data warehouse held over 100 GB of critical data, out of the 1 TB of data residing in Ranbaxy’s data repository, that was relevant for its business intelligence activities. The data was extracted from SAP and other sources using ETL, and the presentation layer was handled by business intelligence tools. The extraction of data from SAP was done by tools tailored specifically for the SAP environment. ETL was used to retrieve legacy data from MIS operations and data from a SQL database. After extraction, the data was cleansed to provide a unified version of the data across multiple systems (computer.financialexpress.com, 2003).
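The extract-cleanse-load pattern described above can be illustrated with a minimal Python sketch using pandas and SQLite as stand-ins for the SAP extractors, ETL tool and warehouse mentioned in the case; the table, columns and cleansing rules are invented for illustration and do not reflect Ranbaxy's actual systems.

```python
import sqlite3
import pandas as pd

# Extract: pull raw records from a legacy SQL source (an in-memory DB here).
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, qty INTEGER);
    INSERT INTO sales VALUES ('North ', 'drug_a', 120),
                             ('north', 'drug_a', NULL),
                             ('South', 'drug_b', 80);
""")
raw = pd.read_sql("SELECT * FROM sales", src)

# Transform: cleanse into a unified version (trim/normalize text, fill gaps).
clean = raw.assign(region=raw["region"].str.strip().str.title(),
                   qty=raw["qty"].fillna(0).astype(int))

# Load: write the cleansed data into a warehouse/data-mart table for reporting.
dwh = sqlite3.connect(":memory:")
clean.to_sql("sales_mart", dwh, index=False)
print(pd.read_sql(
    "SELECT region, SUM(qty) AS total FROM sales_mart GROUP BY region", dwh))
```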
The extracted data is translated into information useful for strategic decision making through query and reporting functionality. Information from multiple sources, including SAP, spanning the organization is consolidated into a data mart, which is in turn used for reporting and analysis. Business intelligence tools are used for report generation and analysis from the data marts. The formats of these reports can be customized for different users and distributed accordingly. Either the relevant data is given to users on demand, or the company decides what kind of information to give to the respective users (computer.financialexpress.com, 2003).
Advantages
• The time taken to create ad-hoc reports for strategic analysis is significantly lower after the implementation.
• ETL has eased the process of consolidating data from multiple sources.
• Business data analytics has enabled the organization to analyze existing data and also to predict accordingly.
Global Perspective
Pfizer is investing effort to get more information from existing clinical trial data. The company is implementing sophisticated data mining techniques to improve the design of new trials, to understand possible new uses for existing drugs from a different perspective, and to help examine how drugs are being used after they have been approved. Companies seldom examine the data from past clinical trials. Typically, pharmaceutical companies run a set of trials as part of a new drug application and generate a report on the drug’s efficacy and safety. All the information with respect to the trial is saved and not studied again until additional analysis is called for by regulatory agencies (Bio-IT World, 2008). Diverging from the conventional methodology, Pfizer has taken up additional analysis of clinical trial data using data mining techniques to look for unknown or specific patterns. The information gained from this secondary analysis is used to design new studies. For example, the information gleaned from data mining can be used to determine a sample size when designing a new trial (Bio-IT World, 2008). For instance, if a company wants to test a drug that has been approved in one country in a different country, the company would have to conduct a study proving that the drug works within the population of that particular country. Secondary analysis can also be used to test whether an approved drug can be used for other purposes in a subgroup within the trial population. It can further be used for deeper analysis aimed at minimizing the risks associated with a drug: data mining techniques can be used to study drug interactions or safety-related issues that affect a particular population (Bio-IT World, 2008).
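For example, once a secondary analysis of past trials suggests an expected effect size, a standard power calculation yields the sample size for the new study. The sketch below uses statsmodels for an independent-samples t-test design; the effect size, significance level and power target are illustrative assumptions, not figures from Pfizer's work.

```python
from statsmodels.stats.power import TTestIndPower

# Suppose secondary analysis of earlier trials suggests a standardized effect
# size of about 0.4 for the endpoint of interest (an illustrative value).
effect_size = 0.4
analysis = TTestIndPower()

# Patients needed per arm for 80% power at a 5% significance level.
n_per_arm = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80)
print(f"required sample size per arm: {n_per_arm:.0f}")
```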
FUTURE OF BIG DATA IN PHARMACEUTICAL AND HEALTH CARE
The pharmaceutical industry – one of the fastest growing industries, often referred to as “Big Pharma” – has an absolute need for big data because its data is complex, and big data has many applications there. Global trends and technology advancements have made big data more than necessary for future growth. Particularly in India, global investments in outsourcing, external research and manufacturing have been pushing more applications of big data. Also, in the past ten years India has accumulated immense data digitally, which can now be used directly for analytics – something that was not practiced earlier by Indian companies.
The heat map above geographically summarizes big data opportunities in India in the coming years. Offshoring and product manufacturing show huge potential and growth opportunities. With innovation in big data analysis techniques, a new, revolutionized era can arrive in the healthcare industry in India.
The global vis-à-vis Indian context for big data can be examined based on different parameters, which are summarised in the following table.
1. Integration of data
2. The mind-set of organizations in the pharmaceutical industry is risk-averse; unless they see clear future value addition, they are reluctant to invest.
4. Particularly in India, the denial of patent protection is keeping the likes of Novartis, Bayer and Roche at bay.
5. Misuse of data-driven revolutions by some organizations to maximize their own value in healthcare.
6. Technology and analytics: there is a serious dearth of bio-scientists, with an estimated shortage of 200,000 big data analysts in the USA, in addition to a shortage of 660,000 biostatisticians and health professionals. India can produce only 12,000 such professionals in the next three years.
Hence, to overcome these issues, integration of data between the different sources within an organization and between different organizations is needed. From the heat map of opportunities in India that has been analyzed, outsourcing and product manufacturing are the key business opportunities for big data.
Limitations of Business Data Analytics in Healthcare Industry
Although big data analytics has been a great boon to the healthcare industry, there are some serious limitations:
1. The input data used for data mining techniques is not available in the same format throughout healthcare organisations. Hospitals, pharmaceutical companies and medical research centres hold data in different settings and systems. There is a need for a common data warehouse to be built for data sharing in this industry, but it is a costly affair and cannot be afforded by every organisation.
2. Another important issue is related to the quality of data. Data obtained from various sources is often erroneous, corrupted, inconsistent or
non-standardized. It takes huge effort to convert that data into usable form or to extract meaningful information (Chye Koh, 2005).
3. There are legal and ethical considerations with the data being collected and processed. Patients are generally not comfortable about
sharing their private information with organisations, as they fear it might be misused. There is global apprehension about issues regarding the privacy of individuals and the ethics of intruding into the personal lives of patients. These problems have to be tackled before organisations can effectively use databases for data mining.
4. There are concerns among experts of big data about the quality of information obtained after processing large volumes of medical data, in
which there are numerous variables and constraints. They argue that randomly “fishing” through the data in the hope of finding patterns and relationships will expose fluctuations in the results (Chye Koh, 2005).
5. Data mining and big data techniques can only be applied by a person knowledgeable in this field. There is a significant dearth of experts
from the big data industry and it costs organisations a lot to hire such individuals (Chye Koh, 2005).
6. It requires a substantial investment on the part of companies to acquire resources, hire personnel, train personnel and buy the technology.
Also, big data companies selling their products to medical organisations and individual physicians have to convince them of the superiority
associated with using big data analytics. The world at large is still not ready to accept big data analytics on a large scale.
CONCLUSION
The applications of business data analytics in healthcare are numerous and its benefits plenty. The penetration of big data in healthcare is still at a very nascent stage, with many organisations apprehensive about adopting it and installing it into their daily routine. There are success stories of data mining in healthcare, but they are few and far between. To adopt this technology on a global scale, some measures need to be taken. Data needs to be better captured before it can be processed. Organisations need to standardize the format for storing data across all healthcare facilities, be they pharmaceutical companies, hospitals, research centres or medical insurance organisations (Chye Koh, 2005). Also, much of the data that is recorded is not quantitative but textual or visual, so text mining techniques and digital image diagnosis need to be adopted by medical institutions before all data can be processed. Data from social websites can also be harnessed to reveal interesting health patterns and behaviour.
There is no nobler cause than saving lives, and what better way to accomplish it in the digital age than with big data analytics? Using big data analytics to ensure right living, right care, right provider, right value and right innovation is at the heart of this application. More and more big data is becoming publicly accessible, for example through Google Trends and the Cancer Genome Atlas data portal; hence, developing big data analytics tools and techniques is the need of the hour in healthcare and pharma. The more innovative and sophisticated the analysis, the greater the benefit to society and business. India, renowned globally for providing service solutions and for being a favorite outsourcing hub, should identify this burgeoning opportunity before it is too late.
In the future, all the healthcare constituents – pharmaceutical companies, hospitals, medical organisations, research centres and labs, physicians and government-run medical facilities – will be impacted by the revolution in big data analytics in this industry. Organisations with a growing need to cut costs and sustain comparative advantage will accept this new trend and will be able to provide immediate benefits to patients and end users (Hamilton, 2012). The falling cost of obtaining information about patients will break financial barriers; personalized medicine will become accessible to the common man; and data from unconventional sources such as social media will be utilised to predict and solve patients’ medical problems. The big data issue is so vast that no individual organisation or person can solve it alone. It needs a coordinated effort on the part of the whole healthcare community to sustain this big data momentum and to transfer its benefits to the world (Costa, 2012).
This work was previously published in Impact of Emerging Digital Technologies on Leadership in Global Business edited by Peter A.C. Smith
and Tom Cockburn, pages 202-214, copyright year 2014 by Business Science Reference (an imprint of IGI Global).
REFERENCES
Chye Koh, H. (2005). Data Mining Applications in Healthcare. Journal of Healthcare Information Management, 19, 64–70.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search
engine query data. Retrieved 10 October 2013 from http://li.mit.edu/Stuff/CNSE/Paper/Ginsberg09Mohebbi.pdf
GoogleTrends. (2013, August). Explore Swine Flu. Retrieved 10 September 2013 from http://www.google.com/trends/explore - q=Swine Flu
Groves, P. (2013). The 'Big Data' Revolution in Healthcare. McKinsey & Company. January-2013, 1-18. Retrieved on 12 August 2013 from
http://www.mckinsey.com/insights/health_ systems_and_services/the_bigdata_revolution_in_us_health_care
Groves, P., Kayyali, B., & Knott, D. (2013). The big data revolution in Healthcare. Center for US Health System Reform Business Technology
Office, McKinsey & Company. Retrieved 10 August 2013 from http://www.mckinsey.com/insights/health_systems_and_services/the_big-
data_revolution_in_us_health_care
Ranjan, J. (2007). Applications of Data Mining Techniques in Pharmaceutical Industry. Journal of Theoretical and Applied Information
Technology , 3(4), 61–67.
KEY TERMS AND DEFINITIONS
Big Data: Defined as the processing of massive amounts of data that collect over time and are difficult to analyze and handle using common database management tools. The data are largely unstructured and are analyzed for marketing trends in business as well as in the fields of manufacturing, medicine and science.
Data Analytics: Defined as the collection and analysis of data associated with customers, business processes, market economics or practical experience, in which the collected data is categorized, stored and analyzed to study purchasing trends and patterns, and the resulting patterns facilitate decision making.
Google Flu Trends: A service from Google that provides estimates of influenza activity for more than 25 countries. By aggregating Google search queries, it attempts to make accurate predictions about flu activity.
Healthcare Analytics: Defined as the massive collection of data related to various healthcare domains – defensive medicine, billing, and fee-for-service – together with the mass adoption of EMRs and the resulting data proliferation, directed towards improving clinical efficiency, quality of care, affordability and fee-for-value, and culminating in a new age of healthcare analytics.
CHAPTER 57
Evaluation of Topic Models as a Preprocessing Engine for the Knowledge Discovery in
Twitter Datasets
Stefan Sommer
Telekom Deutschland GmbH, Germany
Tom Miller
T-Systems Multimedia Solutions GmbH, Germany
Andreas Hilbert
Technische Universität Dresden, Germany
ABSTRACT
In the World Wide Web, users are an important information source for companies or institutions. People use the communication platforms of Web 2.0, for example Twitter, to express their sentiments about products, politics, society, or even private situations. In 2014, Twitter users worldwide submitted 582 million messages (tweets) per day. Processing the mass of Web 2.0 data (e.g. Twitter data) is a key functionality in the modern IT landscapes of companies and institutions, because users’ sentiments can be very valuable for the development of products, the enhancement of marketing strategies, or the prediction of political elections. This chapter’s aim is to provide a framework for extracting, preprocessing, and analyzing customer sentiments in Twitter across all different areas.
INTRODUCTION
Twitter is one of the fastest growing social network platforms in the world. In 2014 the social network consisted of over two billion users, who submitted 582 million messages per day (Twopcharts, 2014). In comparison to 2011, the number of users increased by a factor of four and the number of tweets per day increased by a factor of five (Stieglitz, Krüger, & Eschmeier, 2011). Using Twitter, people share news, information or sentiments in a short message limited to 140 characters, called a tweet. Twitter is a so-called microblog, a special kind of blog that combines an
ordinary blog with features of social networks. The communication platforms of Web 2.0 gain in importance, as interaction increases between
users through these media (Stephen & Toubia, 2010). Due to the positive development of microblogs and Twitter in particular, these services
become a valuable source for companies or institutions (Pak & Paroubek, 2010; Barnes & Böhringer, 2011).
Today users are considered to be key communication partners for companies: providing relevant feedback, requests, and testimonials to the
company’s performance, a political party or an institution (Richter, Koch, Krisch, 2007; Tumasjan et al., 2010; Sommer et al., 2012). They share
their sentiments with other users through the communication platforms of Web 2.0 (Jansen, Zhang, Sobel, Chowdury, 2009). By spreading their
thoughts through platforms, such as blogs, communities, or social networks, users influence other users in the process of their own sentiment
creation (O'Connor, Balasubramanyan, Routledge, Smith, 2010). But how to deal with this huge amount of unstructured or semi-structured data
of social networks within a complex IT infrastructure?
Our research aim is to provide a preprocessing framework for Twitter data, which extracts, transforms and supplies the relevant tweets out of a huge amount of data in order to make the data applicable for sentiment analysis. The framework can be used for different approaches and areas: on the one hand, to enhance a company’s products, e.g. by adding required features and ultimately involving users in product development as so-called ‘prosumers’; on the other hand, to optimize a company’s marketing activities by analyzing which advertising is most discussed or referenced on Web 2.0 platforms, especially in the case of viral marketing (Jansen et al., 2009; Lee, Jeong, & Lee, 2008; Liu, Hu, & Cheng, 2005). Other examples show the use of Twitter to predict voters’ opinions in political elections (Tumasjan et al., 2010). In this chapter, we focus on the evaluation of our framework, which is based on topic models. Our Twitter dataset for the evaluation contains tweets covering the TV debate between the candidates for the election of the Deutsche Bundestag in Germany.
RELATED WORK
In the last five years many articles have been published in the area of Sentiment Analysis. Liu (2007) gives a broad overview of characteristics,
tasks and methods of Sentiment Analysis and places them in the context of Web Data Mining. In addition to the theoretical descriptions of Liu (2007), various articles cover different existing Sentiment Analysis systems, which are systematically presented by Lee et al. (2008), as Table 1 shows.
Table 1. An overview of current Sentiment Analysis systems (modified from Lee et al., 2008)
[The original table compares sentiment analysis systems – including Review Seer (2003), Red Opal (2007), and Opinion Observer (2004) – along dimensions such as sentiment resource, syntactic analysis, feature extraction, extraction of sentiment expressions, and sentiment assignment. The techniques listed include probabilistic models and Naïve Bayes classifiers, WordNet, CBA mining, frequent and infrequent noun and noun-phrase feature selection, and outputs such as thumbs up/down, average star ratings, star ratings, and the dominant polarity of each phrase.]
Due to the popularity of microblogs and Twitter, in particular, there are many papers covering this research area. Böhringer & Gluchowski
(2009) describe the microblogging service Twitter, and how users are able to communicate with each other by using this platform. Tweets can
contain different kinds of content, for example short expressions about the user’s personal situation, but also sentiments about, or testimonials on, products, services or even political parties. It is interesting to analyze this content in order to get useful insights into users’ sentiments. Hence, many researchers analyze Twitter posts and report their results; the objectives of such analyses range from general insights to concrete extraction of sentiments and their application in a specific area. Oulasvirta, Lehtonen, Esko, Kurvinen & Raento (2010) explain common findings, such as the characteristics of users’ self-disclosures, and explore the behavior of microblog users publishing current events and experiences. Golder & Macy (2011), in turn, show differences in the diurnal mood of Twitter users. They state that individual mood is an indicator of well-being, working memory, creativity and the immune system. With this knowledge they want to draw inferences about the
health status of users. In contrast to these general insights, many researchers use Twitter in more specific case studies in order to reveal the
sentiments of the tweet authors. In this context the most popular aim is to extract political sentiments (O’Connor, 2010; Tumasjan, Sprenger,
Sandner, Welpe, 2010; Chung & Mustafaraj, 2011). For example, Tumasjan et al. (2010) want to answer the question of whether the sentiments expressed on Twitter can mirror the real sentiment of voters. They come to the conclusion that the content of Twitter entries can
represent the political attitude of the users. The example shows that Twitter is a valuable resource for important insights into users’ sentiments,
not only for political expressions, but also for expressions about products or society topics.
Before we are able to analyze the different expressions in the tweets, we have to deal with the huge amount of tweets and the selection of relevant
content which is required for sentiment analysis. In the first step we use our Twitter crawler, which accesses the Twitter API to select the relevant
tweets by a keyword search. After several steps of data cleaning and data transformation, we then want to separate the posts into groups corresponding to specific topics. For this separation we use generative topic models. Blei & Lafferty (2009) give a fundamental explanation of
topic models and their usage. In their work they adapt topic models to a digital archive of the journal ‘Science’, which contains articles from 1980
to 2002. Searching for interesting articles within this huge amount of information can be difficult. Therefore, they postulate the possibility of
exploring this collection through the underlying topics in the articles. With probabilistic topic models they are able to automatically detect a
structure of topics in this collection, even without manually constructed training data. In the case of the ‘Science’ archive Blei & Lafferty (2009)
found 50 topics corresponding to the articles. So, while searching for documents, scientists are able to browse through a list of similar documents
that deal with the same topic.
Since their publication, topic models have been successfully used by other authors in order to identify topics. There are also researchers who have
used topic models to analyze tweets. Ramage, Dumais, & Liebling (2010) detect dimensions of content in Twitter entries with the help of topic
models. They introduce five dimensions into which Twitter content can be mapped. For example, the dimension 'substance' contains expressions
with generally interesting meanings; they have found words like ‘president’, ‘American’, and ‘country’ for this dimension. Another dimension is
called ‘status’ and they have detected words like ‘still’, ‘doing’, and ‘sleep’, which express the current situation of the user. With this partially
supervised approach Ramage et al. (2010) show the distribution of these five dimensions for specific Twitter users in order to get an idea of the
user's tweeting behavior. Zhao et al. (2011), in contrast, used unsupervised topic modeling on a Twitter dataset. They aim for an empirical comparison of Twitter content and a traditional news medium. They applied topic models to both the Twitter content and the New York
Times news. Afterwards, they have compared the topics and have focused on the differences between them. As a result, Zhao et al. (2011) have
found out that Twitter users actively and rapidly spread news of important or extraordinary events. Twitter is a popular data source, because of
the great number of text posts and the heterogeneous audience (Pak & Paroubek, 2010). In addition to that, in Twitter, users express a certain
sentiment about different topics, so it is possible to analyze their sentiments regarding those topics. In this context, Pak & Paroubek (2010) have
shown examples of entries with expressed sentiments in order to demonstrate the high potential of sentiment analysis in tweets. Tumasjan et al.
(2010) have used 100,000 tweets for revealing political sentiment in Germany. They have figured out that the majority of the analyzed tweets
reflect voter preferences and even come close to traditional election polls. Users not only express their sentiments, but also discuss them with
other users. Beside the content-based analysis of a microblog, the social network structure of Twitter also yields valuable information about a
person's behavior. Heinrich (2011), for example, has analyzed relationships in Twitter that predict the reference potential of a person as a determinant of the degree of opinion leadership and social centrality. Filtering those important users helps to reduce the amount of Twitter updates that one has to scan and monitor over time while researching dynamic effects in topic and sentiment trends. These approaches illustrate
the capabilities that topic modeling offers for uncovering latent structure in text documents.
Our final step in preparing the Twitter dataset for sentiment analysis is to evaluate the performance of our topic modeling approach. The
commonly used evaluation metric for topic models is the document held-out probability (Wallach et al., 2009). This metric measures the probability that a trained topic model assigns to an unseen document, i.e. how well the model generalizes to documents it was not trained on. In our work, this evaluation approach is called "extrinsic", because the evaluation is done on external objects – the unseen documents. The extrinsic evaluation is the right approach if the application domain is focused on the classification aspect of the topic model. If, on the other hand, the application domain focuses on the topics themselves, the evaluation is done in an "intrinsic" way. In that case, "intrinsic" means that the evaluation focuses on the words of the topics and the topics' semantic nature. Newman, Lau, Grieser, and Baldwin (2010) "proposed the novel task of topic coherence evaluation as a form of intrinsic topic evaluation".
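As a rough illustration of the extrinsic approach, the held-out probability of an LDA model can be estimated with an off-the-shelf library; the sketch below uses gensim on toy token lists, which is an assumption for illustration, whereas the chapter itself works with the MALLET toolkit.

```python
# A minimal sketch of extrinsic (held-out) evaluation with gensim on toy data;
# the chapter's own models are built with MALLET instead.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

train_docs = [["merkel", "cdu", "tvduell"], ["steinbrueck", "spd", "tvduell"],
              ["maut", "koalition", "tvduell"]]
heldout_docs = [["merkel", "steinbrueck", "tvduell"]]

dictionary = Dictionary(train_docs)
train_corpus = [dictionary.doc2bow(d) for d in train_docs]
heldout_corpus = [dictionary.doc2bow(d) for d in heldout_docs]

lda = LdaModel(train_corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

# per-word likelihood bound on the unseen documents; higher (less negative)
# values mean the model assigns more probability to the held-out tweets
print(lda.log_perplexity(heldout_corpus))
```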
RESEARCH APPROACH
In this chapter we focus on the evaluation of our preprocessing framework for Twitter datasets. The basis of our preprocessing framework is
our developed algorithm, which is able to identify microblog entries (tweets) containing sentiments in a particular context (Sommer, Schieber,
Heinrich, & Hilbert, 2012). The evaluation of the algorithm is mandatory, as we need to be able to analyze the performance of our preprocessing approach. In our research we follow the design science approach by Hevner, March, Park, and Ram (2004). The purpose of Hevner's approach is
gaining new insights by developing an artifact that solves a specific problem. In our case the specific problem is the preprocessing of Twitter
datasets in order to gain insights into the user’s sentiments or testimonials in different areas. In this chapter we focus on the sentiments of users
during the election of the Deutsche Bundestag in Germany. With our approach we give answers to the following research questions:
1. How does the preprocessing framework for knowledge discovery in Twitter datasets work?
2. What kinds of criteria or metrics exist to evaluate topic models as the basis of the preprocessing framework?
3. What are the results of the evaluation regarding our example of the TV debate between the candidates for the election of the Deutsche Bundestag in Germany?
The first question covers the composition of the different components of our framework and how these work together. It describes a powerful
technique, which is able to automatically create topic clusters out of a huge amount of textual data by analyzing terms that co-occur with other
terms. The algorithm is the basis of our framework. The second and third questions deal with the problem of how to evaluate topic models in this special case. We discuss different criteria and metrics for an evaluation and calculate them for our example of the TV debate.
PREPROCESSING FRAMEWORK FOR SENTIMENT ANALYSIS IN TWITTER
Knowledge Discovery in Databases
In order to build a framework to extract, transform and analyze Twitter datasets we align our approach to the process of Knowledge-Discovery-
in-Databases (KDD), which is widely accepted by academics and practitioners. It was published by Fayyad (1996) and shows the nontrivial
process of identifying patterns in huge amounts of data. Following Azevedo & Santos (2008), other approaches for pattern recognition in datasets, such as SEMMA or CRISP-DM, can be regarded as implementations of the KDD process by Fayyad (1996). The KDD process contains five
stages: Selection, Preprocessing, Transformation, Data Mining, and Interpretation.
The first stage in the KDD process contains the selection of the target dataset, which can be an entire dataset, a subset of variables, or data
samples. In the next two stages (Preprocessing and Transformation) the target data is prepared for the data-mining methods. Data cleaning and
transformation algorithms provide consistent data for example by purging double entries, filtering outliers, reducing dimensionality or creating
new variables. The fourth stage includes the analysis of the preprocessed and transformed data in order to gain patterns of interest in a particular
form, depending on the analysis objective. These patterns have to be evaluated and interpreted for the purpose of generating knowledge in the
last stage (Azevedo & Santos, 2008; Fayyad, 1996).
The KDD process is iterative. So if necessary, it is possible to improve the results by enhancing the activities within the earlier stages. The five
stages in the KDD process lead the data analyst from the raw datasets to significant patterns and formerly unknown insights.
Knowledge Discovery in Twitter Datasets
To adopt the KDD process for the analysis of Twitter datasets we have to consider some changes referring to the special capabilities and
challenges of the medium Twitter. Böhringer & Gluchowski (2009) introduce the microblogging service Twitter and its functionalities: First,
Twitter users can communicate with each other by referencing the name of the communication partner with a prefixed ‘@’. For example, user A
writes an entry containing ‘@userB’ in order to address user B. In addition, users can distribute the entry of another user by forwarding this entry
with the prefixed characters 'RT'. For example, in order to forward the original entry 'tweet' of user A, user B publishes the tweet with 'RT @userA tweet'. In this way, the range of an expression is increased, which ultimately benefits the reputation of the original author. Finally, there is a
very important function of microblogs: the tagging of the entries. Authors can tag their entries by adding keywords to the tweet. These keywords
are called hashtags and can be recognized by the prefixed ‘#’. In summary, the technical functionalities of Twitter provide several possibilities for
analysis. The character limitation of 140 characters leads to the main challenge in the analysis of microblogs. In order to write as much
information as possible users tend to use abbreviations (for example, ‘4ever’ is used as an abbreviation for ‘forever’). In addition, the informal
way of speaking in microblogs and syntactic errors complicate the mining procedure. In contrast, Bermingham & Smeaton (2010) consider the
short length to be a strength of microblogs, because the limited text can contain compact and explicit sentiments. In their paper they found sentiments in microblogs easier to classify than those in blogs. Therefore, the brevity might be an advantage, too. The modified process for
knowledge discovery in Twitter datasets is shown in figure 1.
Figure 1. Preprocessing framework for Twitter datasets
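The three conventions described above can be recognized with simple regular expressions; the patterns in the following sketch are illustrative assumptions, not the prototype's actual rules.

```python
# A minimal sketch of recognizing Twitter conventions with regular expressions.
import re

MENTION = re.compile(r"@\w+")          # addressing another user, e.g. '@userB'
RETWEET = re.compile(r"^RT\s+@\w+")    # forwarded entries, e.g. 'RT @userA tweet'
HASHTAG = re.compile(r"#\w+")          # tagged keywords, e.g. '#Merkel'

tweet = "RT @userA Spannendes #tvduell mit @userB"
print(MENTION.findall(tweet))          # ['@userA', '@userB']
print(bool(RETWEET.match(tweet)))      # True
print(HASHTAG.findall(tweet))          # ['#tvduell']
```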
The first stage in the knowledge discovery process for Twitter datasets contains the selection of the target data, as shown in figure 2. We have
implemented a Twitter crawler as a prototype, which accesses the open Twitter API to select our target data from the complete raw Twitter data. With the crawler we are able to gather a huge analysis corpus of tweets. The crawler searches recursively via the Twitter API for user-selected
keywords and stores the raw Twitter data in a database.
Figure 2. Prototype: Data selection for Twitter crawling job
The crawler can process several Twitter search jobs in a parallel operating mode, as shown in figure 3. We are also able to extract the relationship
between Twitter users corresponding to the tweets. The job overview is divided into search jobs or relationship jobs.
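A stripped-down sketch of such a keyword-based crawling job is shown below; it assumes the historical Twitter REST search endpoint (API v1.1) with bearer-token authentication, and the pagination and SQLite storage details are illustrative rather than the authors' implementation.

```python
# A minimal crawling-job sketch: search for a keyword, page backwards through
# results, and store the raw tweets in a local database. Endpoint, auth and
# storage are assumptions for illustration.
import sqlite3
import requests

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def crawl(keyword, token, db_path="tweets.db", pages=5):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT)")
    headers = {"Authorization": f"Bearer {token}"}
    max_id = None
    for _ in range(pages):
        params = {"q": keyword, "count": 100}
        if max_id:
            params["max_id"] = max_id              # continue below the oldest tweet seen so far
        statuses = requests.get(SEARCH_URL, headers=headers, params=params).json().get("statuses", [])
        if not statuses:
            break
        con.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?)",
                        [(s["id"], s["text"]) for s in statuses])
        max_id = min(s["id"] for s in statuses) - 1
    con.commit()
    con.close()
```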
After the selection of our target data, we have to perform some data cleaning and transformation tasks in the next two stages. We have to point
out that our data cleaning and transformation tasks focus on the relevant task to perform the topic modeling for our preprocessing framework.
For a sentiment analysis, additional transformation tasks might be added to our process, which, for instance, have already been discussed
by Kaufmann (2010), Parikh & Movassate (2009), Lee et al. (2008) and Lo, He, Ounis (2005). The dataset contains only textual content from
tweets. Thus, we have only one variable. In order to obtain useful results, we process the data and filter some parts from each entry in the second
stage of our preprocessing framework. The data cleaning tasks start in the first step with the tokenization of each tweet. We use the whitespace-
tokenization to keep hashtags (e.g. #Merkel) and links with the URL as separate tokens. In the next step we filter these hashtags and links with
regular expressions as discrete entities. These entities are removed from the dataset for the topic modeling, but we store them as separate variables in order to perform sentiment analysis in later stages. In the next step we filter non-relevant text components such as stopwords,
the keywords of our search string, single characters or numbers and cross-references to other users (e.g. @userA). The cross-references to other
users are also stored separately for network analysis in later stages. In the last step of the data cleaning we eliminate spam and advertising tweets from our dataset. We use a combination of regular expressions together with duplicate detection to identify spam or advertising tweets.
This approach filters some of the unwanted tweets, but there is still a potential for optimization. The result of this stage is an adjusted corpus
collection of our previous Twitter dataset. In the transformation process we first build our vocabulary automatically. Afterwards we transform our corpus by lexicalizing it to this vocabulary. The result of this stage is co-occurrence-based data on which our algorithm for building the topic models can operate.
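A minimal sketch of these cleaning and transformation steps could look as follows; the stopword list, the search keywords and the duplicate-based spam heuristic are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal cleaning/transformation sketch: whitespace tokenization, separation of
# hashtags, links and mentions, stopword and keyword filtering, duplicate-based
# spam removal, and a simple vocabulary/bag-of-words transformation.
from collections import Counter

STOPWORDS = {"und", "der", "die", "das", "nicht"}   # illustrative German stopwords
SEARCH_KEYWORDS = {"tvduell"}                       # the crawler's search string

def clean(tweet, seen_texts):
    tokens = tweet.lower().split()                  # whitespace tokenization keeps #tags and URLs intact
    hashtags = [t for t in tokens if t.startswith("#")]
    links = [t for t in tokens if t.startswith("http")]
    mentions = [t for t in tokens if t.startswith("@")]
    body = [t for t in tokens
            if not t.startswith(("#", "@", "http"))
            and t not in STOPWORDS and t not in SEARCH_KEYWORDS
            and t != "rt" and len(t) > 1 and not t.isdigit()]
    key = " ".join(body)                            # crude spam filter: drop exact duplicates
    if key in seen_texts:
        return None
    seen_texts.add(key)
    return {"tokens": body, "hashtags": hashtags, "links": links, "mentions": mentions}

seen = set()
docs = [clean(t, seen) for t in ["RT @userA #tvduell Merkel und Steinbrück heute im ZDF"]]
docs = [d for d in docs if d]

# transformation: build the vocabulary and lexicalize the corpus to it
vocabulary = {w: i for i, w in enumerate(sorted({w for d in docs for w in d["tokens"]}))}
corpus = [Counter(vocabulary[w] for w in d["tokens"]) for d in docs]
```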
The next step contains the topic modeling of the tweets. We implement the LDA algorithm by Blei, Ng and Jordan (2003) in order to identify
topic clusters in our Twitter datasets. Sommer et al. (2012) successful implemented the LDA algorithm for preprocessing Twitter datasets. Figure
4 shows the results of the LDA algorithm in the frontend of our prototype. Blei et al. (2003) introduced Topic Models as a powerful technique for
finding useful structure in an otherwise unstructured collection of data. Topic models are probabilistic models that uncover the underlying semantic structure of a document collection (e.g. a collection of tweets) in an unsupervised way. With the special characteristics of tweets in mind, several approaches have
to be validated for our purpose of analyzing sentiments in Twitter entries that are related to a certain topic. Therefore we have to assign a certain
probability to those documents that are likely to represent such a topic. We therefore only discuss methods that are probabilistic in their
approach. Well-known methods like 'tf-idf' are only suitable as a preprocessing step: besides not being probabilistic, tf-idf achieves only a small amount of dimensionality reduction, which would lead to unsatisfying scaling behavior with the fast-growing database of Twitter streams. The
‘pLSI’ model by Hofmann (1999), while probabilistic in its nature, lacks a representation on a document level. This fact is quite important when
looking at our research goal of not only exploring topics and filtering them according to our interests, but also using the method for representing
those documents in terms of topics so that we can reduce the amount of data that is generated by the Twitter entries.
Figure 4. Prototype: Resulting topic models after the LDA
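The topic modeling step itself can be sketched with an off-the-shelf LDA implementation; the snippet below uses gensim on toy token lists, which is an assumption for illustration, since the chapter relies on its own implementation and the MALLET toolkit.

```python
# A minimal LDA sketch with gensim on toy token lists; library, data and the
# hyperparameters shown here are illustrative (the chapter later reports 20 topics,
# a Dirichlet parameter of 1 and 2000 Gibbs sampling iterations in MALLET).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

token_lists = [["merkel", "cdu", "tvduell", "kanzlerin"],
               ["steinbrueck", "spd", "tvduell", "kandidat"],
               ["maut", "koalition", "tvduell", "zdf"]]

dictionary = Dictionary(token_lists)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]

lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=3, alpha=1.0,
               passes=50, random_state=0)

for topic_id, words in lda.show_topics(num_topics=3, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in words])   # the top words of each detected topic cluster
```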
EVALUATION
Selection of Evaluation Criteria and Metrics
The first step in the evaluation of our preprocessing is to select criteria and metrics that divide "good" from "bad" topic models in the sense of our intrinsic perspective. The selection of criteria for the intrinsic evaluation is done in an inductive way. Consequently, only a topic model with the most essential data transformation and data cleaning is required. We used the "MALLET topic modeling toolkit" by McCallum et al. (2002) to generate the LDA model. The only preparation that we did beforehand was the transformation to lowercase strings and the filtering of all non-Latin characters, except ß, ä, ü and ö.
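This preparation step can be sketched with a single regular expression; the pattern below is one assumed reading of the described filtering.

```python
# A minimal sketch of the described preparation: lowercase the tweet and keep only
# Latin letters plus ß, ä, ü, ö (everything else is replaced by whitespace).
import re

def normalize(text):
    return re.sub(r"[^a-zäöüß\s]", " ", text.lower())

print(normalize("TV-Duell: Merkel vs. Steinbrück!"))   # punctuation and digits become spaces
```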
The underlying data has been collected during the Bundestag elections in Germany in the week before the TV debate between chancellor Angela
Merkel (CDU) and the challenger Peer Steinbrück (SPD). In Table 2, three example topics of the generated model are represented.
The first criterion for the intrinsic evaluation is interpretability. This becomes obvious by comparing topic one and topic three. Nearly all terms within topic one form a meaningful topic: "frau", "angela", "merkel" and "cdu" refer to the chancellor, and "peer", "steinbrück" and "spd" apply to the challenger. The TV debate is characterized by "tvduell", "tv", "btw" and "zdf". The terms "keine", "maut", "große" and "koalition" refer to two of the main issues of the TV debate. In contrast, no combination of the words within topic three makes sense.
The second criterion is the importance of words. Topic three contains only meaningless, but frequently used words – so called stopwords. Those
words have no importance for the topics, because they are not delivering any valuable input. Even worse, due to the limitation of representative
words for a topic, those stopwords are wasted space. A good topic model would not contain any meaningless words. Thus, the importance of
words is added to the criteria that should be evaluated.
The last criterion in this chapter is the separation of topics. A topic model that separates different contents to different topics is better than a
topic model that mixes different contents. The second topic of our example (Table 2) contains content about a German party called “AFD –
Alternative für Deutschland”. This topic is clearly separated from the TV debate topic and also separated from the “stopwords” topic. If the
content of topic two were mixed into topic one, misinterpretations of this mixed topic would arise, because the AFD was neither represented in the TV debate by a challenger nor an aspect of the debate. A model that avoids this type of misinterpretation is obviously a better one. Depending on the model being evaluated, more or different criteria can be used.
Table 2. Topic model example (only topic 3 is reproduced)
Topic 3: die nicht als mehr ist auch zu nur aber noch ja sind das viel besser keine immer leider gar sondern
Newman et al. (2010) evaluate various metrics for the measurement of "topic coherence", which is in substance the same as the interpretability
in this paper. In their work the “best-performing method was term co-occurrence within Wikipedia based on pointwise mutual information”
(Newman et al., 2010). Therefore, for each pairwise combination of the terms from one topic, the pointwise mutual information (PMI)-value is
calculated. This calculation is done on the basis of all Wikipedia articles. If a combination of two terms occurs often within a sliding window of
ten words, this combination gets a higher PMI-value than a combination that co-occurs rarely. Co-occurring terms tend to be a good indicator for
interpretable combinations. To obtain a topic-level value for the criterion interpretability, the average over all term combinations of that topic is formed. For the model-level value, both the average and the median over all topic values are calculated. For the criterion importance of words, two possible metrics were proposed. The first one is the well-known "tf-idf" measure by Salton & McGill (1983). The second metric is an adapted form of the Kullback-Leibler divergence (KLD), described by Lo et al. (2005). Both metrics were applied to a Wikipedia basis and to a basis formed of all crawled tweets. Analogously to the interpretability value, the tf-idf values and the KLD values are aggregated as the average from word level to topic level; from topic level to model level, both the average and the median are calculated. The separation of topics is measured by the inverse cosine similarity. The popular cosine similarity measures the similarity between two vectors, and since the topics of a topic model are word vectors, this is a practicable option. The model-level value for separation of topics is likewise calculated as the average and the median of the inverse cosine similarity values on topic level.
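The two core computations – PMI over a ten-word sliding window and the inverse cosine similarity between topic word vectors – can be sketched as follows; the reference-corpus handling, the absence of smoothing and the reading of "inverse" as one minus the cosine similarity are simplifying assumptions, not the authors' exact implementation.

```python
# A simplified sketch of PMI-based topic coherence (sliding window of ten words)
# and of an inverse cosine similarity between two topics.
import math
from collections import Counter
from itertools import combinations

def pmi_coherence(topic_words, reference_docs, window=10):
    word_counts, pair_counts, n_windows = Counter(), Counter(), 0
    for doc in reference_docs:                       # e.g. tokenized Wikipedia articles
        for i in range(max(1, len(doc) - window + 1)):
            w = set(doc[i:i + window])
            n_windows += 1
            word_counts.update(w)
            pair_counts.update(frozenset(p) for p in combinations(sorted(w), 2))
    scores = []
    for a, b in combinations(topic_words, 2):
        p_a = word_counts[a] / n_windows
        p_b = word_counts[b] / n_windows
        p_ab = pair_counts[frozenset((a, b))] / n_windows
        if p_a and p_b and p_ab:
            scores.append(math.log(p_ab / (p_a * p_b)))
    return sum(scores) / len(scores) if scores else 0.0    # topic-level interpretability value

def inverse_cosine_similarity(topic_a, topic_b):
    # topics as dictionaries mapping words to weights (word vectors);
    # "inverse" is read here as 1 - cosine similarity (one plausible interpretation)
    common = set(topic_a) & set(topic_b)
    dot = sum(topic_a[w] * topic_b[w] for w in common)
    norm = (math.sqrt(sum(v * v for v in topic_a.values()))
            * math.sqrt(sum(v * v for v in topic_b.values())))
    return 1.0 - (dot / norm if norm else 0.0)             # higher value = better separated topics
```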
The selection of the metrics is done in a procedure presented by Newman et al. (2010). For this purpose, N = 5 users had to score 13 different LDA models, while each LDA model consists of 10 topics with 20 words. The users scored the three identified criteria interpretability, importance of words and separation of topics on a three-point scale from 1 (topic is not interpretable / word is unimportant / topics are not clearly separated) to 3 (topic is interpretable / word is important for the topic / topics are clearly separated). On top of the user scores, the gold standard for the evaluation is built as the inter-annotator agreement (IAA). The 13 topic models differ in the parameters stopword count and use whitelist. Stopword count sets the number of words that should be filtered by the stopword filter, and the use whitelist parameter turns the whitelist on and off.
Table 3. Parameterization of the user-evaluated LDA models (stopword count and whitelist usage for each of the 13 models)
The 13 LDA models that had been scored by the users were also evaluated with the preselected metrics described above. For all criterion scores of each LDA model a Spearman's rank correlation coefficient (RCC) was calculated. For each criterion, the metric with the highest RCC between the IAA and the metric itself was selected as the preferred evaluation metric.
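The metric selection step can be sketched with SciPy's Spearman correlation; the numbers below are purely illustrative placeholders, not the values reported in the chapter.

```python
# A minimal sketch of selecting the metric that best tracks the inter-annotator
# agreement (IAA) across the 13 user-evaluated LDA models (illustrative numbers).
from scipy.stats import spearmanr

iaa_scores = [2.1, 1.8, 2.5, 2.9, 1.2, 2.0, 2.7, 1.5, 2.2, 2.4, 1.9, 2.6, 2.8]
metric_scores = {
    "pmi_wikipedia_median": [0.4, 0.3, 0.6, 0.7, 0.1, 0.4, 0.6, 0.2, 0.5, 0.5, 0.3, 0.6, 0.7],
    "tfidf_twitter_mean":   [0.2, 0.5, 0.1, 0.3, 0.4, 0.2, 0.1, 0.6, 0.3, 0.2, 0.5, 0.1, 0.3],
}

best = max(metric_scores, key=lambda m: spearmanr(iaa_scores, metric_scores[m])[0])
print(best)   # the metric with the highest rank correlation to the IAA
```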
Table 4. Spearman's rank correlation coefficients between inter annotator agreement and the preselected metrics

Criterion | Data basis | Metric | Topic-level to model-level | RCC
Interpretability | Wikipedia | PMI | Ø | 0.4560
Interpretability | Wikipedia | PMI | Median | 0.6154
Importance of words | Twitter | Tf-idf | Ø | -0.0055
Importance of words | Twitter | Tf-idf | Median | 0.1923
Importance of words | Twitter | KLD | Ø | 0.0330
Importance of words | Twitter | KLD | Median | 0.2637
Importance of words | Wikipedia | Tf-idf | Ø | 0.9615
Importance of words | Wikipedia | Tf-idf | Median | 0.9451
Importance of words | Wikipedia | KLD | Ø | 0.4451
Importance of words | Wikipedia | KLD | Median | 0.5385
Separation of topics | / | Inverse cosine similarity | Ø | 0.7777
Separation of topics | / | Inverse cosine similarity | Median | 0.5960
For interpretability, the PMI approach on the Wikipedia basis, with the median as the model-level aggregate, was selected. The "tf-idf" metric on the Wikipedia basis, averaged over all topics, was chosen for the importance of words criterion. This combination of metric, database and model-level aggregation reaches a nearly perfect representation of the IAA. For separation of topics, the average of the inverse cosine similarity values of all topic combinations within one model was selected.
Evaluation
The last part of the framework, besides the criteria and the metrics, is the topic model scorecard. The idea of the topic model scorecard is to
aggregate the individual criteria to one meaningful value – the topic model score. Before the aggregation, all measurements within one criterion
and over all evaluated topic models have to be normalized. This normalization is fundamental to the ability to compare the different criterion
values. The topic model score is aggregated as the product of the weighted topic model criterion values. E.g. within a use case where the
interpretability is the most important criterion and the importance of words is much more important than the separation of topics, a topic score
could be aggregated by a 60/30/10 weighting. With this topic model score it is possible to compare different types of topic models or different
parameter settings for topic models with an application-domain-fitted value.
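A compact sketch of such a scorecard is given below; the min-max normalization and the interpretation of "product of the weighted criterion values" as a weighted geometric mean are assumptions made for illustration, and the weights follow the 60/30/10 example.

```python
# A minimal topic-model-scorecard sketch: normalize each criterion across all
# evaluated models, then aggregate the weighted criterion values per model.
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def topic_model_scores(interpretability, importance, separation,
                       weights=(0.6, 0.3, 0.1)):
    criteria = [normalize(interpretability), normalize(importance), normalize(separation)]
    scores = []
    for model_values in zip(*criteria):
        score = 1.0
        for value, weight in zip(model_values, weights):
            score *= value ** weight      # weighted geometric mean (one reading of the description)
        scores.append(score)
    return scores

# e.g. three models evaluated on the three criteria (illustrative numbers)
print(topic_model_scores([0.4, 0.6, 0.5], [0.2, 0.8, 0.9], [0.9, 0.7, 0.8]))
```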
For the evaluation 65 LDA-models have been created on the TV debate tweets. Those models differ in the parameters stopword count and use
whitelist. The first model was not filtered for any stopwords. Models two to ten have a stopword count rising by ten stopwords. From the eleventh
model the count rises by 50 stopwords. Model 29 filters 1000 stopwords. From this model the rise is 1000 stopwords until the 33rd model, which
filters 5000 stopwords. The models 2 to 33 are generated without a whitelist. The models 34 to 65 have been generated in the same way, but this
time with the usage of a whitelist. For all models the topic count was set to 20, the Dirichlet parameter α was set to 1 and the count of the Gibbs sampling iterations was set to 2000. Figure 5 provides the normalized scores for each criterion and the overall topic model score.
Figure 5. Normalized scores for each criterion for the topic
model evaluation
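Under the assumption of hypothetical helper functions for training and scoring (stubbed out below), the sweep over the model configurations could be sketched as follows; only the stopword-count schedule and the whitelist switch are taken from the text.

```python
# A minimal sketch of the model sweep. train_lda and score_model are hypothetical
# stand-ins for the MALLET training and the scorecard described above; the schedule
# follows the text: 0, 10, ..., 90, then steps of 50 up to 1000, then steps of 1000
# up to 5000 (the chapter generates 65 models, counting the unfiltered model once).
def train_lda(stopword_count, use_whitelist):       # stub for the LDA training step
    return {"stopwords": stopword_count, "whitelist": use_whitelist}

def score_model(model):                             # stub for the topic model scorecard
    return 0.0

stopword_counts = ([0] + list(range(10, 100, 10))
                   + list(range(100, 1001, 50))
                   + list(range(2000, 5001, 1000)))

results = [(count, use_whitelist, score_model(train_lda(count, use_whitelist)))
           for use_whitelist in (False, True) for count in stopword_counts]
best = max(results, key=lambda r: r[2])
```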
The interpretability increases the more stopwords are filtered – up to 250 filtered stopwords. This results from the fact that those models include a lot of stopwords without any interpretative value. From 250 to 750 stopwords a plateau is formed, because from this point on only meaningful words are replaced by other meaningful words. The filtering of more than 750 stopwords leads to poorer models, because now meaningful words are replaced by very rare words, which are more difficult to interpret, and the chance of accidentally filtering non-stopwords increases.
The criterion importance of words increases nearly constantly with a rising count of filtered stopwords. The more stopwords are filtered the more
rare words are included in the models and this leads to a higher value of the underlying metric. Models that have been generated with the
whitelist are always better than those without the whitelist. This is because the whitelist consists of words out of a political domain. Those words
are very frequent in the TV-Debate-tweets, and therefore they are often included in the models. In contrast, those words are very rare in the
Wikipedia, and this leads to a better importance of words score. The usage of the whitelist also has a disadvantage – the separation of topics score is constantly lower than without the whitelist. Filtering more than 300 stopwords without a whitelist leads to a nearly perfect topic
separation.
The topic model score is aggregated by the example weighting of 60/30/10. With this set of weights the topic model score looks similar to the
interpretability diagram. Only the region above 750 stopwords differs, because the decreasing interpretability score is replaced by the constantly
increasing importance of words score. In addition to this, the separation of topics score lowers the overall score of models with an active whitelist. Although the model without a whitelist and with 5000 filtered stopwords reaches the highest topic model score, it might not be the best model. The topic model score of this model is only 17% better than the score of the model with 250 stopwords, and this improvement has been achieved through the filtering of 4750 additional words. Filtering 10,000 words would lead to an even higher topic model score. With this in mind, it seems better to favor a model at the beginning of the plateau.
The following diagram shows that the topic model score is better suited than an extrinsic metric like the log document held-out probability for choosing the right topic model variant. In the diagram a high document held-out probability value means that the model is good at predicting an
unseen document. By choosing this metric as the basis for the model decision, the model with only 60 stopwords and without a whitelist would
be taken. However, models with a higher stopword count are better application-domain-fitted than this model.
CONCLUSION
The analysis of user communication via Web 2.0 technologies, and Twitter in particular, is an important evolutionary step to gain insights into users' sentiments about products, politics, society or even private situations. Nevertheless, it is also a big challenge for companies or institutions. Due to the mass of semi-structured or unstructured data in Web 2.0 environments, it is difficult to integrate the data into existing IT landscapes and turn the data into relevant information.
We presented a preprocessing framework which deals with the question of how companies or institutions can integrate Web 2.0 data, and Twitter
data in particular into their IT landscapes. We pointed out different challenges in order to extract, transform and supply the relevant content out
of a huge amount of Twitter data and make the data finally applicable for sentiment analysis. Our prototype showed in the case of the Bundestag
elections in Germany how to perform all relevant steps in the KDD process for Twitter datasets and how to gain the relevant topic models as the
final result. Furthermore, we discussed the evaluation of our framework in detail by identifying different criteria and metrics for the evaluation of our LDA algorithm.
Further research needs to verify that our evaluation methods also work with English vocabulary and with more complex topic model structures. Our evaluation example focused only on the German language, which is a limitation of our work. However, we have already shown in other publications that the LDA algorithm works for German and English Twitter datasets. We will extend our framework to German and English language support, including the evaluation, and also implement different LDA algorithms.
This work was previously published in Recent Advances in Intelligent Technologies and Information Systems edited by Vijayan Sugumaran,
pages 4662, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Azevedo, A., & Santos, M. F. (2008). KDD, SEMMA and CRISP-DM: A parallel overview. In A. Abraham (Ed.), IADIS European conference on data mining (pp. 182–185). IADIS.
Barnes, S. J., & Böhringer, M. (2011). Modeling Use continuance behavior in microblogging services: The case of Twitter. Journal of Computer
Information Systems , 51(4), 1–13.
Bermingham, A., & Smeaton, A. F. (2010). Classifying sentiment in microblogs: Is brevity an advantage? In ACM (Ed.), Proceedings of the 19th ACM International Conference on Information and Knowledge Management (pp. 1833–1836). New York: ACM. 10.1145/1871437.1871741
Blei, D., & Lafferty, J. (2009). Topic models. Retrieved September 25, 2011 from http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Chung, J., & Mustafaraj, E. (2011). Can collective sentiment expressed on Twitter predict political elections? In W. Burgard & D. Roth (Eds.), Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (pp. 1770–1771). AAAI Press.
Dave, K., Lawrence, S., & Pennock, D. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In ACM (Ed.), Proceedings of the 12th International Conference on World Wide Web (pp. 519–528). New York: ACM. 10.1145/775224.775226
Fayyad, U. M. (1996). Advances in knowledge discovery and data mining . Menlo Park, CA: AAAI Press.
Golder, S. A., & Macy, M. W. (2011). Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science, 333(6051),
1878–1881. doi:10.1126/science.1202775
Heinrich, K. (2011). Influence potential framework: Eine Methode zur Bestimmung des Referenzpotenzials in microblogs . In Gluchowski, P.,
Lorenz, A., Schieder, C., & Stietzel, J. (Eds.),Tagungsband zum 14: Interuniversitären doktorandenseminar wirtschaftsinformatik (pp. 26–36).
Universitätsverlag Chemnitz.
Hevner, A., March, S., Park, J., & Ram, S. (2004). Design science in information systems research. Management Information Systems
Quarterly , 28(1), 75–105.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In ACM (Ed.), Proceedings of the Twenty-Second Annual International SIGIR Conference (pp. 50–57). ACM.
Jansen, B. J., Zhang, M., Sobel, K., & Chowdury, A. (2009). Twitter power: Tweets as electronic word of mouth. Journal of the American Society
for Information Science and Technology , 60(11), 2169–2188. doi:10.1002/asi.21149
Lee, D., Jeong, O.-R., & Lee, S.-G. (2008). Opinion mining of customer feedback data on the web. In ACM (Ed.), Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication (pp. 230–235). New York: ACM. 10.1145/1352793.1352842
Liu, B. (2007). Web data mining: Exploring hyperlinks, contents, and usage data . Berlin: Springer.
Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the web. In ACM (Ed.), Proceedings of the 14th International Conference on World Wide Web (pp. 342–351). New York: ACM. 10.1145/1060745.1060797
Lo, R. T.-W., He, B., & Ounis, I. (2005). Automatically Building a stopword list for an information retrieval system. In Proceedings of the Fifth
DutchBelgian Information Retrieval Workshop (pp. 17-24). Academic Press.
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Proceedings of Human Language
Technologies. The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 100-108).
ACL.
O'Connor, B., Balasubramanyan, R., Routledge, B. R., & Smith, N. A. (2010). From Tweets to polls: Linking text sentiment to public opinion time
series. In W. Cohen & S. Gosling (Eds.), Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (pp. 122–129).
The AAAI Press.
Oulasvirta, A., Lehtonen, E., Kurvinen, E., & Raento, M. (2010). Making the ordinary visible in microblogs. Personal and Ubiquitous
Computing , 14(3), 237–249. doi:10.1007/s00779-009-0259-y
Pak, A., & Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, & D. Tapias (Eds.), Proceedings of the International Conference on Language Resources and Evaluation (pp. 1320–1326). European Language Resources Association.
Popescu, A.-M., & Etzioni, O. (2007). Extracting product features and opinions from reviews. In Kao, A., & Poteet, S. R. (Eds.), Natural language
processing and text mining (pp. 9–28). London: Springer. doi:10.1007/978-1-84628-754-1_2
Ramage, D., Dumais, S., & Liebling, D. (2010). Characterizing microblogs with topic models. In W. Cohen & S. Gosling (Eds.), Proceedings of the
Fourth International AAAI Conference on Weblogs and Social Media (pp. 130–137). The AAAI Press.
Richter, A., Koch, M., & Krisch, J. (2007). Social commerce - Eine Analyse des Wandels im E-Commerce (technical report no. 2007-03) . Munich:
Faculty Informatics, Bundeswehr University Munich.
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval . New York: McGraw-Hill.
Scaffidi, C., Bierhoff, K., Chang, E., Felker, M., Ng, H., & Chun, K. (2007). Red Opal: Product-feature scoring from reviews. In ACM (Ed.), Proceedings of the 8th ACM Conference on Electronic Commerce (pp. 182–191). New York: ACM.
Sommer, S., Schieber, A., Heinrich, K., & Hilbert, A. (2012). What is the conversation about? A topic-model-based approach for analyzing
customer sentiments in Twitter. International Journal of Intelligent Information Technologies , 8(1), 10–25. doi:10.4018/jiit.2012010102
Stephen, A., & Toubia, O. (2010). Deriving value from social commerce networks. Journal of Marketing Research, 47(2), 215–228. doi:10.1509/jmkr.47.2.215
Stieglitz, S., Krüger, N., & Eschmeier, A. (2011). Themenmonitoring in Twitter aus der Perspektive des Issue Managements. In K. Meißner & M.
Engelien (Eds.), Virtual enterprises, communities & social networks (pp. 69–78). Dresden: TUDpress.
Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting elections with Twitter: What 140 characters reveal about political
sentiment. In W. Cohen & S. Gosling (Eds.), Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (pp. 178–
185). The AAAI Press.
Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning (pp. 1–8). ACM Press.
Yi, J., & Niblack, W. (2005). Sentiment mining in WebFountain. In Proceedings of the 21st International Conference on Data Engineering (pp. 1073–1083). IEEE Computer Society.
Zhao, W., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., & Li, X. (2011). Comparing Twitter and traditional media using topic models. In Clough, P., Foley, C., Gurrin, C., Jones, G., Kraaij, W., Lee, H., & Murdock, V. (Eds.), Advances in information retrieval (pp. 338–349). Berlin: Springer.
doi:10.1007/978-3-642-20161-5_34
KEY TERMS AND DEFINITIONS
Design Science: In computer science, the design science approach is a popular research method whose goal is to gain new insights by developing an IT artifact that solves a specific problem.
Knowledge Discovery in Databases (KDD): KDD describes the iterative process of identifying patterns in huge amounts of data. The
process is divided into five stages: Selection, Preprocessing, Transformation, Data Mining, and Interpretation.
Microblog: A microblog is a special kind of a blog that combines an ordinary blog with features of social networks, but with a limitation of
characters for each entry. Twitter is the most popular microblog with a limitation of 140 characters for each entry.
Preprocessing: A preprocessor is an engine that transforms input data into defined output data for other programs. In our case it performs
different steps of data selection, cleaning and transformation for analysis purposes (sentiment analysis).
Sentiment Analysis: Sentiment Analysis (as a synonym for Opinion Mining) uses methods of natural language processing, linguistics and text
analysis to extract sentiment patterns in structured or unstructured documents.
Topic Modeling: A topic model is a statistical model that discovers the occurrence of defined elements (topics) in a collection of documents.
Topic modeling is mainly used in the areas of machine learning or natural language processing.
Web Data Mining: Web Data Mining (as a synonym for Web Mining) covers the application of data mining methods to discover different
patterns in the web. It is divided into web usage mining (user behavior), web content mining (text analysis) and web structure mining (structure
of linking).
CHAPTER 58
Network Text Analysis and Sentiment Analysis:
An Integration to Analyse Word-of-Mouth in the Digital Marketplace
Veronica Ravaglia
Catholic University of the Sacred Heart, Italy
Luca Zanazzi
University of Bologna, Italy
Elvis Mazzoni
University of Bologna, Italy
ABSTRACT
Through Social Media, like social networking sites, wikis, web forums or blogs, people can debate and influence each other. For this reason, the analysis of online conversations has been recognized as relevant to organizations. In this chapter we introduce two strategic tools to
monitor and analyze online conversations, Sentiment Text Analysis (STA) and Network Text Analysis (NTA). Finally, we propose one empirical
example in which these tools are integrated to analyze Word-of-Mouth regarding products and services in the Digital Marketplace.
INTRODUCTION
The Social Web environment is an important and free public space in which virtually everyone may express his or her impressions, opinions and beliefs about a product, an event or a cause. More and more people, of every age, use the social web before they make any kind of decision, be it to buy something, change something in their lives or make a difficult choice. Through Social Web tools, i.e. social networking sites, wikis,
web forums or blogs, people can debate and influence each other’s opinion. This kind of User Generated Content is very important for every
organization because people interact with each other and generate, on their own, a topic of discussion. This is the major advantage, compared to
the other traditional forms of assessment that an organization can set. Most of the User Generated Contents on the web are public, which means
that they are free to view. These contents and commentaries, therefore, represent a valuable and accessible resource to assess people’s ideological
positions. But, at the same time, people play an important role in influencing each other about a specific brand, product, service or something
else (a film, an actor and so on) that has attracted the online debate. From this point of view, we can refer to the public and free online debate
that arose around the choice of the actor for the second episode of Spider-Man, which seems to have strongly influenced the production of the film.
There is no doubt about the direct effect of the word-of-mouth on the digital marketplace and for this reason it is clear how these phenomena
have to be taken into account for an efficient market strategy. Monitoring and intervening on brands and products’ social perception is nowadays
very important for all organizations. How an organizational marketing campaign affects the consumption of the products, how the brand is perceived by users and non-users, and whether there are any differences in the sentiment about the product based on the characteristics of the users are some of the questions that need to be answered.
Many tools can be used to analyze these data, but one of the most valuable is Sentiment Text Analysis (STA). Once the most relevant or important words are identified or specified, this technique allows not only understanding the polarity of the feelings underlying a comment (positive, neutral, negative), but also identifying the single emotion involved. Moreover, the STA is based on a conventional Text Analysis (TA) and, therefore, several other text analyses can be performed at once. Some of these analyses count, for instance, the contingency of two single words (how often they appear close together and linked in the text); others perform the grammatical analysis of a word (noun, adjective or verb), or simply count their occurrences in the text and their relevance amongst all comments.
To get a better understanding of these types of data, a complementary methodology that can be integrated with the STA is the Social Network
Analysis (SNA) applied to text, i.e. the Network Text Analysis (NTA). Traditionally the SNA is a set of methods for analyzing the structure of
whole social entities as well as a variety of theories explaining the patterns observed in these structures (Wasserman & Faust, 1994). Within this
chapter, the SNA is mixed with STA to analyze the network and connections of the most used words about a specific topic, brand, service or
organization. Each word will be a node in the network and each node may have several properties, such as type (noun, adjective, verb), speaker,
location (where the word has been spoken) or sentiment (positive, neutral or negative) of the various words.
The aim of this chapter is to give a theoretical framework about Sentiment Analysis and Network Text Analysis and some concrete examples of
their use to better understand the word-of-mouth effect in the digital marketplace.
BACKGROUND
Content Analysis and Sentiment Analysis
The term Big Data is often invoked to describe the overwhelming volume of information produced by and about human activity, made possible by
the growing ubiquity of mobile devices, tracking tools, always-on sensors, and cheap computing storage. “In a digitized world, consumers going
about their day – communicating, browsing, buying, sharing, searching – create their own enormous trails of data” (Manyika et al., 2011).
Social media today have become very popular communication tools among Internet users. Millions of messages are appearing daily in popular
websites for social interaction such as Twitter, Tumblr, Facebook, blogs or forums. In these User Generated Contents, people write about their
life, share opinions on variety of topics and discuss current issues. Moreover, because of a free format of messages and an easy accessibility of
platforms, Internet users tend to post more and more comments about products and services they use, or express their opinion about brands.
Thus, social media websites become valuable sources of people’s topics of conversation, opinions and sentiments, particularly for business
intelligence units.
• How positive (or negative) are people about our products (or brand)?
Technological advances have made it easier than ever to harness, organize, and scrutinize massive repositories of these digital traces; while computational techniques once required supercomputers to perform large-scale data analysis, now they can be deployed on a desktop or laptop computer
(Manovich, 2011).
Within the context of mass communication, Content Analysis is one of the most familiar methods (Berelson, 1952; Krippendorff, 2012; Riff, Lacy,
& Fico, 2014). The classical definition of content analysis is “a research technique for the objective, systematic, and quantitative description of the
manifest content of communication” (Berelson, 1952, p. 18). Content analysis generally conforms to the following procedure: 1) the research
questions and/or hypotheses are formulated; 2) the sample is selected; 3) categories are defined for coding; 4) coders are trained, the content is
coded, and reliability is assessed; 5) the coded data are analyzed and interpreted (McMillan, 2000; Riff et al., 2014). For content analysts in
particular, the development of digital media’s architecture has led to a fast-changing array of new structural features associated with
communication, such as the hashtag on Twitter, as well as the development of socio-cultural contexts around those new features – both
representing virgin terrain for content-analysis exploration. This opportunity has attracted all manner of content analysis work. In one small but
flourishing branch of this research, there are content analyses of Twitter use by journalists (Bruns, 2012; Herrera & Requejo, 2012; Lasorsa,
Lewis, & Holton, 2012), news organizations (Blasingame, 2011; Greer & Ferguson, 2011; Messner, Linke, & Eford, 2011; Newman, 2011), foreign
correspondents (Bruno, 2011; Cozma & Chen, 2013; Heinrich, 2012), nonprofit organizations (Waters & Jamal, 2011), and even the homeless
(Koepfler & Fleischmann, 2012).
Nevertheless, although Content Analysis seems appropriate for Big Data Analytics, much of the scholarly literature does not agree whether
classical approaches are adequate for studying online content. McMillan (2000) and Weare & Lin (2000), for example, identify a number of
challenges to applying content analysis to the Web, including difficulties in obtaining a representative sample because of the vastness of the Web;
in defining the unit of analysis; and in ensuring that coders are presented with the same content for purposes of reliability. However, ultimately
both conclude that minor adaptations to traditional approaches of content analysis are sufficient, such as using lists to help generate sampling
frames and using software to capture snapshots of websites.
Frequently, researchers and stakeholders are not fully satisfied with the mere detection of contents in online conversations, and insights about the opinion and sentiment associated with these topics are required. Sentiment analysis, also called opinion mining, is the field of study that
analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services,
organizations, individuals, issues, events, topics, and their attributes.
With the growing availability and popularity of opinion-rich resources such as online review sites (like Trip Advisor) and personal blogs, new
opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of
others (Pang & Lee, 2008). A very broad overview of the existing work was presented by Pang & Lee (2008). In their survey, the authors describe
existing techniques and approaches for an opinion-oriented information retrieval. Yang, Lin, & Chen (2007) use web-blogs to construct corpora
for sentiment analysis and use emotion icons assigned to blog posts as indicators of users’ mood. The authors applied support vector machine
(SVM) and conditional random field (CRF) learners to classify sentiments at the sentence level and then investigated several strategies to
determine the overall sentiment of the document. As a result, the winning strategy considers the sentiment of the last sentence
of the document as the sentiment at the document level.
Read (2005) used emoticons such as “:-)” and “:-(” to form a training set for the sentiment classification. For this purpose, the author collected
texts containing emoticons from Usenet newsgroups. The dataset was divided into “positive” (texts with happy emoticons) and “negative” (texts
with sad or angry emoticons) samples. Emoticons-trained classifiers: SVM and Naive Bayes, were able to obtain up to 70% of accuracy on the test
set.
Go and colleagues (Go, Huang, & Bhayani, 2009) used Twitter to collect training data and then to perform a sentiment search. The approach is
similar to that of Read (2005). The authors construct corpora by using emoticons to obtain “positive” and “negative” samples, and then use
various classifiers. The Naive Bayes classifier obtained the best result with a mutual information measure for feature selection. The authors were
able to obtain up to 81% of accuracy on their test set. However, the method showed a bad performance with three classes (“negative”, “positive”
and “neutral”).
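The distant-supervision idea described by Read (2005) and Go et al. (2009) can be sketched with a standard Naive Bayes text classifier; the library (scikit-learn), the toy tweets and the feature choices below are illustrative assumptions, not the original experimental setup.

```python
# A minimal sketch of emoticon-based distant supervision: emoticons provide noisy
# positive/negative labels, and a Naive Bayes classifier is trained on the texts
# with the emoticons stripped out.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great phone, love it :)", "awesome service :)",
               "terrible battery :(", "worst support ever :("]
labels = ["positive" if ":)" in t else "negative" for t in train_texts]   # emoticons as noisy labels

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit([t.replace(":)", "").replace(":(", "") for t in train_texts], labels)

print(model.predict(["the battery is terrible"]))   # expected: ['negative'] on this toy example
```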
Interest in using both Content and Sentiment Analysis as a marketing research tool is due to many reasons. Firstly, not only academics but even
marketers have recognized the profound influence of social communities on consumer behavior (Gruen, Osmonbekov, & Czaplewski, 2006; Park,
Lee, & Han, 2007). Secondly, recent developments in technology have enhanced classification accuracy, and user friendliness. Marketers can
collect a large body of data in real time, unobstructed and uncontaminated by the presence of external researchers. Therefore, through online data
analysis, market research cost and sampling error are reduced and validity and reliability of research findings are enhanced (Rambocas & Gama,
2013).
Because of all the possible applications, there are a good number of companies, large and small, that have opinion mining and sentiment analysis
as part of their mission. Besides reputation management and public relations, one might perhaps hope that by tracking public viewpoints, one
could perform trend prediction in sales or other relevant data. Yet, this Internet development is also vexing because “the glittering promise of
online data abundance too often proves to be fool’s gold” (Karpf, 2012, p. 652). Firstly, researchers struggle to anticipate when and how to “trap”
such data streams, given that most objects of study (e.g., tweets, web traffic data, online news homepages) go un-archived (Karlsson &
Strömbäck, 2010). Additionally, public data often pales in quality to proprietary data, which is rarely available to scholars. As Boyd and Crawford
(2012, p. 669) note, even while scholars are able to access vast numbers of tweets via Twitter’s public Application Programming Interface (API),
many researchers are not getting the “firehose” of the complete content stream, but merely a “gardenhose” of very limited numbers of public
tweets – the randomness of which is entirely unknown, raising questions about the representativeness of such data to all tweets, let alone to all
users on the service. Finally, controversial questions about consumer privacy, research ethics, and the overall quantification of social life should
never be left out (Boyd & Crawford, 2012; Oboler, Welsh, & Cruz, 2012).
THE SOCIAL NETWORK ANALYSIS
We are not alone. Everyone, in the everyday life, is connected with someone else. These ties may be represented e.g. by the family, a group of
friends or a team of colleagues that create our personal network. In a social network, the actors and their actions are viewed as interdependent
rather than independent, autonomous units. Relational ties (linkages) between actors are channels for transfer or “flow” of resources (either
material or nonmaterial). For these reasons, the data describe the connections, the contacts and the ties between all the entities (individuals or
organizations) in a defined group (people, family, society, enterprises, country). Network models focusing on individuals view the network as a
structural environment that provides opportunities for or constraints on individual actions. In contrast, network models focusing on structures (social, economic, political, and so forth) view the network as lasting patterns of relations among actors (Gaggioli, Riva, Milani, & Mazzoni, 2012; Riva,
Banos, Botella, Wiederhold, & Gaggioli, 2012; Scott & Amaturo, 1997; Wasserman & Faust, 1994).
From this brief introduction, it is clear that the focus of attention of the SNA is the relation between a set of entities rather than their attributes,
even though, as will become clearer later in this chapter, attributes are important to better describe and analyze the structure of
relations. In this definition, the actor is a social entity, a discrete individual, corporate or collective social units, for instance people, departments,
agencies. The relational ties are social ties, such as the evaluation or the selection of one person by another, the transfer of resources, association,
behavioral interaction, formal relations, or biological relationships.
In the SNA each individual is a node and has its own attributes (gender, age, work position, etc.), while the relationships between individuals are
represented by ties.
These social relations vary along three dimensions – direction, strength, content:
• Direction: Who originates the communication and/or receives a contact. Ties may be “MAN” – mutual, asymmetric, null. Asymmetric ties
are called directed ties, in contrast to undirected or non-ties.
• Strength: The frequency of interaction. Simplest strength measure is binary (present/absent); other strength indicators may take discrete
or continuous values (for instance, class attendance is binary; talking duration is continuous).
• Relational Content: A specific substantive connection among actors. Relations are multiplex when actors are connected by more than
one type of tie (e.g., friendship and business). Some of the relations are forcibly biunivocal (brother-brother), others exclusive (wife-
husband) and others not exclusive (teacher-alumni) with all the possible combinations.
The power of social network analysis stems from its difference from traditional methods in social scientific studies, which assume that it is the attributes of individual actors that matter – whether they are friends or not, how many friends or acquaintances they have, how old they are, and other psycho-social dimensions (such as self-esteem, life satisfaction, job satisfaction, etc.). The network approach, by contrast, has turned out to be useful for explaining
many real-world phenomena, but leaves less room for individual agency, the ability for individuals to influence their success, because so much of
it rests within the structure of their network. Social networks have been used to examine e.g. how organizations interact with each other,
characterizing the many informal connections that link executives together, as well as associations and connections between individual
employees at different organizations. For example, power within organizations often comes more from the degree to which an individual within a
network is at the center of many relationships, rather than actual job title. Social networks play also a key role in hiring, in business success, and
in job performance. Networks provide ways for companies to gather information, deter competition and collude in setting prices or policies
(Kilduff & Brass, 2010).
Metrics (Measures) in Social Network Analysis
SNA is a corpus of analyses that allows one to describe and examine the relational structure of a specific context by means of some metrics (indexes) and the related sociograms (graphs). Each analysis allows one to focus attention at two levels: the whole network of relations (whole network or full network analysis) or the individuals (ego-network analysis) (Scott, 2000; Wasserman & Faust, 1994). Not all analyses are suited to every
setting: the type of analysis and the index to calculate is determined by the context and the aims of the study (Gaggioli, Milani, Mazzoni, & Riva,
2011). Examples of the most relevant types of analysis, and the related network and individual indexes, are the following:
• Neighborhood Analysis: Density/inclusiveness | nodes degree. In this type of analysis, two nodes are 'adjacent' (connected) if there is a
line (connection) between them. A node is 'incident' to a line if the node is one of the pair of nodes defining the line. Nodal 'degree' (of
connection) is the number of lines that are incident with it. It measures the size of its direct neighborhood. Two types of measurements are
present: In-Degree and Out-Degree. The nodal ‘In-Degree’ is the number of lines to which the node is incident as target (receiver); the‘Out-
Degree’ is the number of lines to which the node is incident as source. Density is the ratio of the number of present lines (connections) with
respect to the maximum possible, i.e. how much the network of relations approaches the situations in which all possible relations are
activated (complete graph). Inclusiveness is the number of connected nodes expressed as a proportion of the total number of nodes.
• Cohesion Analysis: Cliques | clique participation index. A clique is a maximal complete subgraph composed of three or more nodes – briefly, a subset of nodes that are all connected to one another. A node can be a member of more than one clique, so the cliques in a network may overlap. The clique participation index (CPI) describes the average number of cliques each participant is involved in (Gaggioli, Riva, Milani, & Mazzoni, 2012). (These measures, together with the centrality indexes discussed below, are illustrated in a short code sketch after the next paragraph.)
Nevertheless, one of the most famous and important SNA measures, particularly relevant for the scope of this chapter, is centrality analysis, with the related indexes of centralization (network level) and centrality (individual level). This type of analysis tells us how central or peripheral a specific entity is (centrality index) and how far the network is centralized on its most central entities (centralization index). In other terms, centrality is a measure of the relevance, prestige, power or status characterizing a set of entities in a specific context (Scott, 2000; Wasserman & Faust, 1994). There are many types of centrality (and centralization) indexes, depending on the relational dynamic on which attention is focused. For instance, degree centrality describes the relevance of the nodes of a network based on the quantity of relations: e.g., the more connections an actor receives, the higher the actor's status. Betweenness centrality focuses attention on nodes that lie between other nodes in the network, since they have power (greater relevance) in controlling the transmission of information across the network: e.g., the more an actor acts as a bridge between many nodes, the more that actor can “control” the flow of information. A further interesting centrality index is eigenvector centrality. It is similar to degree centrality, but the centrality of a node is also determined by the centrality of its adjacent (directly connected) nodes: e.g., the more an actor is connected to highly relevant nodes, the more relevant it becomes.
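As a minimal illustration of these measures – using a small, made-up friendship network and the third-party networkx Python library, which is not mentioned in the chapter and is only one possible tool – the sketch below computes the density, degrees and cliques described in the list above, together with the three centrality indexes just discussed:

```python
# Illustrative sketch only: a tiny hypothetical network analysed with networkx.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Ann", "Carl"), ("Bob", "Carl"),
                  ("Carl", "Dora"), ("Dora", "Eve")])

print("density:", nx.density(G))               # present ties / possible ties
print("degrees:", dict(G.degree()))            # size of each node's neighborhood
# Maximal complete subgraphs, filtered to >= 3 nodes to match the definition above
print("cliques:", [c for c in nx.find_cliques(G) if len(c) >= 3])

# Centrality indexes: quantity of relations, brokerage, and relevance of neighbours
print("degree centrality     :", nx.degree_centrality(G))
print("betweenness centrality:", nx.betweenness_centrality(G))
print("eigenvector centrality:", nx.eigenvector_centrality(G))
```

For directed relations, the same library's DiGraph class exposes in_degree and out_degree, which correspond to the In-Degree and Out-Degree measures described in the list.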
NETWORK TEXT ANALYSIS AND THE ANALYSIS OF WORD-OF-MOUTH IN THE DIGITAL MARKETPLACE
SNA is a very solid and specific method of quantitative analysis. It can be used in several contexts and has many positive aspects, such as handling big data and complex networks. However, by focusing on the mere communication exchange or on similarities between nodes, it considers neither the context in which the conversations took place nor the quality (content) of the exchange. Take, for example, a conversation about a product on a blog. From an SNA point of view this conversation could be very dense – density being the extent to which all possible relationships between the elements are in fact present, evaluated on a scale ranging from 0 (absence of aggregation) to 1 (maximum aggregation) – and the centrality indexes could be meaningful. However, the conversation could have many topics and could also touch on other areas of interest.
Starting from this issue, as anticipated before, the aim of this chapter is to create an integration between content analysis (CA) and social network analysis (SNA). It is, indeed, possible to use SNA on qualitative data and not only on the relations and exchanges within a network of individuals (Mangione, Mazzoni, Orciuoli, & Pierri, 2011). There are several examples of an integration between CA and SNA, especially as regards the study of ontologies (Hoser, Hotho, Jäschke, Schmitz, & Stumme, 2006; Yu, 2008) and the semantic aspects of a network (Erétéo, Gandon, Corby, & Buffa, 2009; Mika, 2005).
From this original idea, a new type of analysis joins traditional SNA: Network Text Analysis (NTA). NTA is based on the content and quality of the data, while considering, at the same time, the relations between the nodes (lemmas) in the dimensions of time, space and the conditions of the communicative exchange. The basic assumption of NTA is that a given text is a genuine network formed by different words. The words, as in traditional SNA, are the nodes of the network, but unlike in the latter, each word has a specific property (it can be a noun, a verb or an adjective). This property is not intrinsic but, rather, based on more or less explicit rules (grammar and logic). The functionality of, and relations between, words can also be based on the content of the sentence.
The purpose of this methodology is to explore the nature of the relationships between words in order to achieve a deeper understanding that goes beyond the simple numerical data of SNA. In particular, NTA allows us to gain a deeper understanding of the central and peripheral arguments of a speech, the strengths and weaknesses of a message or conversation, which arguments are more or less central in real or virtual conversations about the same topic, and so on.
Similar to SNA, NTA has several fields of application; however, the choice of the correct indexes depends on what you need to get from the text (Gaggioli, Riva, Milani, & Mazzoni, 2012). The same index has different meanings: in SNA, the adjacency matrix shows the relational information among the nodes, while in NTA it shows the co-occurrences between topics or words in a given piece of User Generated Content. An interesting field of application is that of knowledge graphs, i.e., schemas representing the graphical synthesis of a particular construct (e.g., a theory) by means of keywords (Popping & Strijker, 1997; Popping, 2003). To construct these types of graphs, NTA is needed to extract information about the relations between the lemmas of a given text. Starting from the relations detected by NTA, in a knowledge graph such relations are enriched with an indication of their type: CAU describes a relation of cause and effect; PAR indicates the relation “is part of”; AKO indicates the relation “a kind of”; and so on.
One of the most interesting and promising fields of application of NTA, which could have important implications for the analysis of word-of-mouth in the digital marketplace, is its use to understand the social representations characterizing a given topic (Gaggioli, Riva, Milani, & Mazzoni, 2012). A theoretical framework is useful to understand how NTA may be exploited as a driving force of change in social representations. To address this need, Abric's view of social representations turns out to be particularly interesting and effective (Abric, 1989). He describes social representations as based on a double system:
• The central system, which corresponds to the core of the representation and is fixed socially, as it is linked to historic, sociological, and
ideological conditions. The central core is fundamental for the stability and coherence of each representation, guaranteeing that it will
endure in time. This system specifies the fundamental elements around which representations are generated, and constitutes the social and
collective basis of the representations, which defines the degree of consensus and homogeneity of a group, independently of each individual.
• The peripheral system, which depends strictly on characteristics of the individuals and the context in which they are positioned. Such a
system therefore makes a differentiation within the group of individuals and is able to adapt itself to specific situations and to explore certain
everyday experiences. It is much more versatile than the central system, and it enables the integration of information and practices of
varying types, thus showing the heterogeneity of behavior and contents. The peripheral system is essential in the identification of the
changes and transformations underway in the representations, which indicate their evolution and likely future modifications.
Therefore, stability and rigidity of representations depend on the core, strongly linked to the value systems shared by members of the group,
while the richness and variety of individual experiences and the evolution of everyday practices determine a representation’s mutability and
flexibility.
Since the central system creates and organizes the representation, Abric avers that it is also its most stable element, the one that more than any other resists change. This means that a representation's evolution will begin with the modification of its peripheral, less central elements. A transformation of the central nucleus would instead bring about a modification in the structure and the totality of the representation. We can therefore presume that the change and evolution of a representation will only be superficial if it comes about through the modification of the meaning, or the nature, of the peripheral elements, while the involvement of the core would radically alter the representation itself. There are therefore two interesting aspects of the use of NTA in analyzing the representations that characterize a network of individuals:
• The possibility to differentiate between elements of the central core and peripheral elements in a social representation;
• The possibility to carry out a longitudinal analysis of the representation’s evolution, paying particular attention to the transition of certain
aspects from peripheral to central, and vice versa.
NTA in Practice
In practice, NTA can be very useful for analyzing user-generated content such as conversations or discussions about products, policies or events, particularly in online contexts where data collection is easy and can be automated. An example of data collection and processing for a network text analysis is the following.
The first step is identifying a meaningful conversation on the web. It can be taken directly from a blog, a forum or even from a social network site. The collected data have to be cleaned of unnecessary elements, such as the timeline of the conversation, the name of the speaker, the number of the post, etc. Part of the preparation for the analysis is also the selection of the keywords, i.e., the words with a high incidence in the given text. Most text analysis software has a built-in dictionary and algorithms to reliably identify keywords. Care must be taken not to include among the keywords self-referring words (for instance, when a speaker starts a post with the name of another speaker), pronouns, auxiliary verbs, prepositions, articles or conjunctions. Moreover, some words have to be treated as identical if they share the same meaning, even if they have different grammatical forms: pear = pears, or write = wrote = writing.
Next, the collected data have to be processed with a specific text analysis application (such as software for text mining or content analysis) in order to obtain the co-occurrence matrix, which indicates the number of times two words appear together within the text. The co-occurrence matrix is formed by the words that are most central in the discussions analyzed, namely those which not only have a greater number of co-occurrences than other entries, but which would also be given greater weight by the interlocutors for their relevance in terms of centrality. From an operative point of view, co-occurrence matrices are square matrices similar to typical SNA adjacency matrices and show, at the crossing of two terms, the strength of the link (the number of times they appear together within the text). It is therefore possible to transpose the data of this matrix onto an adjacency matrix and then perform SNA on nodes represented by lemmas, i.e., the Network Text Analysis. The eigenvector centrality index, for instance, allows us to discriminate the concepts that are most central and most peripheral in a speech. The cohesion analysis describes to what extent a principal conversation is composed of many sub-conversations with specific topics. The connectivity analysis shows which parts of a speech are more solid and which arguments are weaker (peripheral in terms of the centrality index).
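A compact sketch of this pipeline is given below, under the assumption that the keywords of each sentence have already been extracted; the sentences are invented for illustration, and the networkx library (not mentioned in the chapter) is used for the final SNA step:

```python
# Sketch of the NTA pipeline: keyword co-occurrences per sentence are
# accumulated into a symmetric matrix, read as a weighted adjacency matrix,
# and analysed with SNA indexes.
from itertools import combinations
from collections import Counter
import networkx as nx

sentences = [                      # hypothetical cleaned keyword lists
    ["beer", "raspberry", "taste"],
    ["beer", "trademark"],
    ["raspberry", "taste", "lambic"],
    ["beer", "lambic", "trademark"],
]

cooc = Counter()
for kws in sentences:
    for a, b in combinations(sorted(set(kws)), 2):
        cooc[(a, b)] += 1          # times the two keywords appear together

# Co-occurrence counts become edge weights of the lemma network
G = nx.Graph()
for (a, b), w in cooc.items():
    G.add_edge(a, b, weight=w)

# Eigenvector centrality discriminates central vs. peripheral concepts
print(nx.eigenvector_centrality(G, weight="weight"))
```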
At the conclusion of this chapter, we propose a purely demonstrative example: an analysis of the conversations held on an online blog by a group of users about the controversial (and elusive) case of the Pink Beer made by one of the most famous brands.
THE PINK BEER CASE STUDY
This case concerns a Danish brand that has been producing beer since 1856, traditionally a strong ale with low fermentation and a robust, bitter taste. In March 2014 a lot of information appeared online (more or less official) announcing a new product, the Soft Ale: a light, kitschy, pink beer made with raspberry and ginger. The new product was advertised on a website, accompanied by a video spot and the official trademark of the brand. The advertisement eventually proved to be an April Fools' joke, but in the days before the reveal many conversations took place, triggering a heated debate about the taste of the future beer.
More than a mere April Fools' joke, this advertising campaign may have been an attempt by the brand to understand, in advance, the reaction of consumers to a new product. We believe that, although consumers did not on the whole appreciate the pink beer, with careful analysis, a meaningful interpretation of the data and a conscious reworking of the unpopular peripheral topics, the brand's new product could meet with a positive reception. We took one of those conversations with the purpose of identifying the general sentiments and topics about the Pink Beer.
The first step was to clean the text file of all the unnecessary words. This was done with text mining software (such as T-Lab) followed by a manual check. The plain text file was then analyzed with the same software and several keywords emerged, obtained from the occurrences of the words in the text file. As anticipated before, words with the same semantic meaning were treated as the same word (for instance: beer = beers). Next, the co-occurrence matrix was obtained and transformed into the adjacency matrix. By comparing the keywords with the sentences in which they appeared, a topic and a sentiment were assigned to each sentence. A domain expert performed the sentiment analysis manually: a sentiment (negative, neutral or positive) and a topic were given to each sentence. Each keyword was then related to all the sentences in which it was included, and to the respective sentiments and topics. For instance, if a keyword is included in five sentences, the keyword is also associated with five sentiments and topics. Finally, the most recurrent topic and sentiment related to each keyword were chosen and attributed as its main ones.
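The majority-vote step just described can be sketched as follows; the annotated sentences and labels are hypothetical and stand in for the expert's manual coding:

```python
# Each keyword inherits its main topic and sentiment from the sentences
# that contain it, by simple majority vote.
from collections import Counter

annotated = [  # (keywords in sentence, topic, sentiment) produced by the expert
    ({"beer", "raspberry"}, "product", "negative"),
    ({"beer", "trademark"}, "brand", "negative"),
    ({"beer", "lambic"}, "product", "negative"),
    ({"kriek"}, "product", "positive"),
]

keyword_attrs = {}
for kw in {k for kws, _, _ in annotated for k in kws}:
    topics = Counter(t for kws, t, _ in annotated if kw in kws)
    sentiments = Counter(s for kws, _, s in annotated if kw in kws)
    keyword_attrs[kw] = (topics.most_common(1)[0][0],
                         sentiments.most_common(1)[0][0])

print(keyword_attrs)   # e.g. {'beer': ('product', 'negative'), ...}
```

These (topic, sentiment) pairs are the node attributes used in the figures discussed next.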
These tags became the node attribute in the following NTA. Several analyses were conducted, but for explanatory purposes, we only report the
neighborhood analysis (Figure 1) and the degree centrality (Figure 2).
Figure 1.
Figure 2.
As shown in Figure 1, each node is a keyword and the ties indicate co-occurrences. The shape stands for the topic of the sentence in which the keyword appeared (whether it concerns the trademark or the product, i.e., the beer), while the color indicates the general sentiment of the sentence (positive, neutral or negative). It is quite evident that the discussion concerns principally the product (beer) and in a negative way (red color). Particularly interesting is that, while the discussion is strongly focused on the product (beer; see also Figure 2), it also affects the brand, which has a high frequency (node size) and becomes one of the focal points of the negative sentiment.
Further, on the left of the picture, we have a representation of “beer” as something belonging to the industrial or artisanal field, characterized by procedures such as filtering and tasting. The upper side of the picture, instead, shows the relations between the most negative keywords characterizing the discussion: raspberry (the taste of the beer), Lambic (the type of beer obtained by spontaneous fermentation), passion (described as something that will decrease) and, finally, the trademark (whose negative sentiment is mainly indirect, determined by the negativity of the other keywords with which it co-occurs).
The second picture (Figure 2) shows in more detail the relative ranking (centrality) of the most important lemmas. While in the previous picture (Figure 1) the relations between lemmas were evident, in this second one the centrality of both the trademark and the product in the discussion is clearer. But since the product and the trademark were the focus of the discussion, for a better understanding of the central and peripheral lemmas characterizing it, it can be useful to remove the focus of the discussion (product and trademark) and concentrate on the remaining lemmas, i.e., the arguments connected to the principal product and the trademark (Figure 3).
Figure 3.
Now the focus of the negative sentiment characterizing the discussion about the “pink beer” is clear. The production and existence of a Lambic pink beer made with raspberries does not meet with the approval of the people involved in the discussion. Strangely, the “kriek bier”, a sort of beer originally obtained by adding fresh black cherries to a Lambic beer, has a positive sentiment. Probably the fact that this is a historical type of beer, already well known and conceptualized in people's “beer representation”, plays an important role in its acceptance, while the new one is probably seen as a mere copy of the other and does not enjoy the same sentiment.
CONCLUSION
The existence of a vast, publicly accessible reservoir of observable peer-to-peer communications is unprecedented. There is meaningful information in these conversations which can be accessed at minimal cost. Compared with the more costly survey-based methods typically employed, this data source is significantly more efficient and may offer an easy and cost-effective opportunity to capture the flow, drivers and sentiment associated with a specific product or brand.
In this chapter, we have described three types of analysis that can be very useful for analyzing word-of-mouth in the digital marketplace: Content Analysis, Sentiment Analysis and Network Text Analysis. While all three types of analysis have interesting potential for analyzing dialogues, conversations and commentaries in online contexts, we suggest integrating the three in order to gain a deeper understanding of the arguments, sentiment and structures that characterize people's representations of a given product or brand.
From this point of view, Network Text Analysis (integrated with SA information) seems to have considerable room for development, since it can be combined with classic Social Network Analysis to obtain a double level of analysis: the first on the relations between people and the second on the structural representation of their discussions. This could be very useful for analyzing influence processes such as emotional contagion and virtual influence in online contexts.
From a brand perspective, the integrated exploration of these elements could help improve brand engagement and brand image in many ways. Firstly, the detection of people's arguments about products or brands – and the sentiment associated with them – helps in the construction of a brand image more in tune with consumers' desiderata.
Secondly, understanding the topics, sentiment and structures of a brand's representations in online conversation is extremely useful when it comes to customer relationship management. The most efficient way to create consumer engagement is to create a brand–consumer exchange relationship. This particular relationship is highly consistent with Web 2.0 and SNS environments, where both players can be active content creators and the communication is characterized from the outset by bidirectional flows. The detection of the drivers and the sentiment of conversations helps the brand create engaging conversations with its customers. Moreover, the study of the network helps in identifying the brand's ambassadors, what they care most about, and the strengths and weaknesses in their representation of the brand. Finally, the analysis of online user-generated content generates insights relevant to product improvement.
However, this study is not without limitations; while online conversations provide a huge range of useful elements for managing a brand's online presence, brand image and product improvement, the relationship between the online and the offline worlds should be investigated, too. Future research is needed to explore the match between the insights hidden in online conversations and the offline behavior of consumers.
This work was previously published in Capturing, Analyzing, and Managing Word-of-Mouth in the Digital Marketplace edited by Sumangla Rathore and Avinash Panwar, pages 137-153, copyright year 2016 by Business Science Reference (an imprint of IGI Global).
REFERENCES
Abric, J.-C. (1989). L’étude expérimentale des représentations sociales. Les Représentations Sociales , 4, 187–203.
Blasingame, D. (2011). Gatejumping: Twitter, TV news and the delivery of breaking news. #ISOJ Journal: The Official Research Journal of the
International Symposium on Online Journalism,1(2). Retrieved from http://online.journalism.utexas.edu/ebook.php
Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly
phenomenon. Information Communication and Society , 15(5), 662–679. doi:10.1080/1369118X.2012.678878
Brass, D. J. (1984). Being in the right place: A structural analysis of individual influence in an organization. Administrative Science
Quarterly , 29(4), 518–539. doi:10.2307/2392937
Brass, D. J. (2011). A social network perspective on industrial/organizational psychology . In Kozlowski, S. W. J. (Ed.),The Oxford handbook of
organizational psychology (pp. 667–695). New York: Oxford University Press.
Bruno, N. (2011). Tweet first, verify later? How real-time information is changing the coverage of worldwide crisis events. Oxford, UK: Reuters
Institute for the Study of Journalism, University of Oxford. Retrieved from
http://reutersinstitute.politics.ox.ac.uk/fileadmin/documents/Publications/
Bruns, A. (2012). Journalists and Twitter: How Australian news organisations adapt to a new medium. Media International Australia
Incorporating Culture and Policy, (144), 97–107.
Cozma, R., & Chen, K.-J. (2013). What’s in a tweet? Foreign correspondents' use of social media. Journalism Practice , 7(1), 33–46.
doi:10.1080/17512786.2012.683340
Erétéo, G., Gandon, F., Corby, O., & Buffa, M. (2009). Semantic social network analysis . In Daniel, B. K. (Ed.), Handbook of Research on
Methods and Techniques for Studying Virtual Communities: Paradigms and Phenomena (pp. 122–156). Academic Press.
Gaggioli, A., Milani, L., Mazzoni, E., & Riva, G. (2011). Networked flow: A framework for understanding the dynamics of creative collaboration in
educational and training settings. The Open Education Journal , 4(1), M3. doi:10.2174/1874920801104010041
Gaggioli, A., Riva, G., Milani, L., & Mazzoni, E. (2012). Networked flow: Towards an understanding of creative networks . Berlin, Germany:
Springer.
Go, A., Huang, L., & Bhayani, R. (2009). Twitter sentiment analysis. Final Projects from CS224N for Spring 2008/2009 at The Stanford Natural Language Processing Group.
Greer, C. F., & Ferguson, D. A. (2011). Using Twitter for promotion and branding: A content analysis of local television Twitter sites.Journal of
Broadcasting & Electronic Media , 55(2), 198–214. doi:10.1080/08838151.2011.570824
Gruen, T. W., Osmonbekov, T., & Czaplewski, A. J. (2006). eWOM: The impact of customer-to-customer online know-how exchange on customer
value and loyalty. Journal of Business Research , 59(4), 449–456. doi:10.1016/j.jbusres.2005.10.004
Heinrich, A. (2012). Foreign reporting in the sphere of network journalism. Journalism Practice , 6(5-6), 766–775.
doi:10.1080/17512786.2012.667280
Herrera, S., & Requejo, J. L. (2012). 10 Good Practices for News Organizations Using Twitter. Journal of Applied Journalism & Media
Studies , 1(1), 79–95. doi:10.1386/ajms.1.1.79_1
Hoser, B., Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006). Semantic network analysis of ontologies . Berlin, Germany: Springer.
doi:10.1007/11762256_38
IEEE. (2007). WI ’07: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence. Washington, DC: IEEE Computer
Society.
Karlsson, M., & Strömbäck, J. (2010). Freezing the flow of online news: Exploring approaches to the study of the liquidity of online
news. Journalism Studies , 11(1), 2–19. doi:10.1080/14616700903119784
Karpf, D. (2012). Social science research methods in Internet time.Information Communication and Society , 15(5), 639–661.
doi:10.1080/1369118X.2012.665468
Kilduff, M., & Brass, D. J. (2010). Organizational social network research: Core ideas and key debates. The Academy of Management
Annals , 4(1), 317–357. doi:10.1080/19416520.2010.494827
Koepfler, J. A., & Fleischmann, K. R. (2012). Studying the values of hard-to-reach populations: Content analysis of Tweets by the 21st Century homeless. Proceedings of the 2012 iConference, 7(1), 48-55.
Krippendorff, K. (2012). Content analysis: An introduction to its methodology . Thousand Oaks, CA: Sage Publications;
doi:10.1145/2132176.2132183
Lasorsa, D. L., Lewis, S. C., & Holton, A. E. (2012). Normalizing Twitter: Journalism practice in an emerging communication space. Journalism
Studies , 13(1), 19–36. doi:10.1080/1461670X.2011.571825
Mangione, G. R., Mazzoni, E., Orciuoli, F., & Pierri, A. (2011). A pedagogical approach for collaborative ontologies building . Berlin, Germany:
Springer. doi:10.1007/978-3-642-19814-4_7
Manovich, L. (2012). Trending: The promises and the challenges of big social data. In Gold, M. K. (Ed.), Debates in the Digital Humanities (pp. 460–475). Minneapolis, MN: The University of Minnesota Press.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation,
competition, and productivity. Technical report. McKinsey Global Institute.
McMillan, S. J. (2000). The microscope and the moving target: The challenge of applying content analysis to the World Wide Web. Journalism &
Mass Communication Quarterly , 77(1), 80–98. doi:10.1177/107769900007700107
Messner, M., Linke, M., & Eford, A. (2011, April). Shoveling tweets: An analysis of the microblogging engagement of traditional news
organizations. Paper presented at the 12th International Symposium for Online Journalism, Austin, TX.
Mika, P. (2005). Ontologies are us: A unified model of social networks and semantics . Berlin, Germany: Springer.
Newman, N. (2011). Mainstream media and the distribution of news in the age of social discovery . Oxford, UK: Reuters Institute of Journalism.
Oboler, A., Welsh, K., & Cruz, L. (2012). The danger of big data: Social media as computational social science. First Monday , 17(7).
Pang, B., & Lee, L. (2008, February). Using Very Simple Statistics for Review Search: An Exploration. Poster presented at the International
Conference on Computational Linguistics, Haifa, Israel.
Park, D.-H., Lee, J., & Han, I. (2007). The effect of on-line consumer reviews on consumer purchasing intention: The moderating role of
involvement. International Journal of Electronic Commerce , 11(4), 125–148. doi:10.2753/JEC1086-4415110405
Popping, R. (2003). Knowledge graphs and network text analysis.Social Sciences Information. Information Sur les Sciences Sociales, 42(1), 91–
106. doi:10.1177/0539018403042001798
Popping, R., & Strijker, I. (1997). Representation and integration of sociological knowledge using knowledge graphs. Social Sciences Information.
Information Sur les Sciences Sociales , 36(4), 731–747. doi:10.1177/053901897036004006
Rambocas, M., & Gama, J. (2013). Marketing Research: The Role of Sentiment Analysis. FEP Economics and Management, 1-24.
Read, J. (2005). Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL-2005 Student Research Workshop (pp. 43–48). Ann Arbor, MI: ACL. doi:10.3115/1628960.1628969
Riff, D., Lacy, S., & Fico, F. (2014). Analyzing media messages: Using quantitative content analysis in research . London, UK: Routledge.
Riva, G., Banos, R. M., Botella, C., Wiederhold, B. K., & Gaggioli, A. (2012). Positive technology: Using interactive technologies to promote
positive functioning. Cyberpsychology, Behavior, and Social Networking , 15(2), 69–77. doi:10.1089/cyber.2011.0139
Scott, J., & Amaturo, E. (1997). L’analisi delle reti sociali . Roma, Italy: La Nuova Italia Scientifica.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications (Vol. 8). Cambridge, UK: Cambridge University Press.
doi:10.1017/CBO9780511815478
Waters, R. D., & Jamal, J. Y. (2011). Tweet, tweet, tweet: A content analysis of nonprofit organizations’ Twitter updates. Public Relations
Review , 37(3), 321–324. doi:10.1016/j.pubrev.2011.03.002
Weare, C., & Lin, W.-Y. (2000). Content Analysis of the World Wide Web Opportunities and Challenges. Social Science Computer Review , 18(3),
272–292. doi:10.1177/089443930001800304
Yang, C., Lin, K. H.-Y., & Chen, H.-H. (2007). Emotion classification using web blog corpora. Academic Press.
KEY TERMS AND DEFINITIONS
Consumer-Brand Engagement (CBE): The cognitive, emotional and behavioral factors that connect a brand with its followers.
Content Analysis (CA): The study and analysis of the text contained within the various types of human (cultural) artifacts such as writing,
images and recordings.
Digital Marketplace: A kind of e-commerce that engages the stakeholder with the multimedia device and the web technologies, protocols and
tools.
Network Text Analysis (NTA): A set of methodologies, supported by software, that make it possible to extract networks of concepts from texts and to discern the “meaning” represented or encoded therein.
Online Conversation Monitoring: A set of tools and activities aimed at tracking customers' beliefs through online channels.
Sentiment Analysis: A type of analysis that aims to determine an attitude, a judgment, a belief or an evaluation toward a given topic.
Social Network Analysis (SNA): A methodology that allows studying and visualizing the ties between the nodes of a given network.
Word of Mouth: A spontaneous and free oral or written recommendation by a satisfied customer to the prospective customers of a good or
service.
CHAPTER 59
Big Data and Web Intelligence for Condition Monitoring:
A Case Study on Wind Turbines
Carlos Q. Gómez
University of Castilla-La Mancha, Spain
Marco A. Villegas
University of Castilla-La Mancha, Spain
Fausto P. García
University of Castilla-La Mancha, Spain
Diego J. Pedregal
University of Castilla-La Mancha, Spain
ABSTRACT
Condition Monitoring (CM) is the process of determining the state of a system according to a certain number of parameters. This ‘condition’ is tracked over time to detect any developing fault or undesired behaviour. As the Information and Communication Technologies (ICT) continue expanding the range of possible applications and gaining industrial maturity, the appearance of new sensor technologies such as Macro Fiber Composites (MFC) has opened a new range of possibilities for addressing CM in industrial scenarios. The huge amount of data collected by MFC could overwhelm most conventional monitoring systems, requiring new approaches to take true advantage of the data. A Big Data approach makes it possible to profit from these large volumes of data by integrating the appropriate algorithms and technologies in a unified platform. This chapter proposes a real-time condition monitoring approach, in which the system is continuously monitored, allowing online analysis.
INTRODUCTION
Condition monitoring (CM) is defined as the process of determining the state of a system according to a parameter of the system. The main purpose of CM in this chapter is to identify a significant change in this condition that is indicative of a developing fault. CM is usually considered part of a predictive maintenance strategy, in which maintenance actions, and therefore preventive maintenance tasks, are scheduled to prevent failure and avoid its consequences. The objective is to extend the life cycle of the system analysed and to avoid major failures, resulting in considerable reductions in cost and associated downtime.
The so-called Information and Communication Technologies (ICT) have grown at an unprecedented pace, and all aspects of human life have been transformed in this new scenario. All industrial sectors have rapidly incorporated the new technologies, and some of them have become de facto standards, like supervisory control and data acquisition (SCADA) systems. Large amounts of data started to be created, processed and saved, allowing the automatic control of complex industrial systems. In spite of this progress, some challenges are not yet well addressed, among them: the analysis of massive volumes of data, as well as continuous data streams; the integration of data in different formats coming from different sources; making sense of data to support decision making; and getting results in short periods of time. These are all characteristics of a problem that should be addressed through a big data approach.
This chapter proposes a real-time condition monitoring approach, in which the system is continuously monitored, allowing online analysis and actions. The system is fed by data streams received from different sensors adequately located on the machine.
The proposed methodology is applied to the wind energy industry, in particular to the detection of blade failures such as surface cracking, scuffing, pitting, etc.
Another interesting application is the detection of ice on wind turbine blades. It is known that icing causes a variety of problems for wind turbines, such as increased fatigue of components due to load imbalance, or power reduction due to disrupted aerodynamics (Homola, Nicklasson, & Sundsbø, 2006).
All the information analysed by the system is obtained through non-destructive techniques using transducers, which are being used in the wind power industry with great success. However, it is worth mentioning that wind power is just an illustrative example of application; the methodology is applicable in many different scenarios across several industries.
BACKGROUND
Wind energy is inexhaustible and environmentally friendly. It is becoming one of the most widespread and productive methods of generating electrical energy (see Figure 1). Today it is a mature technology, and this energy source is applied to both large-scale and small installations. It has certainly become a mainstay of the energy systems of many countries, and is recognized as a reliable and affordable source of electricity (Beattie & Pitteloud, 2012).
In 2013, wind energy represented 3.5% of total energy demand, and by 2016 the global installed capacity is expected to reach 500,000 MW. In addition to onshore wind farms, wind farms are built at sea (offshore), several kilometres from the coast, to take advantage of the best wind conditions and to avoid the negative effects of the terrain. In these installations it is common to find much more powerful machines than those installed onshore. The rotor diameter is a crucial parameter: longer blades mean more swept area and more energy produced.
This trend towards building ever larger blades brings certain problems. The blades have to bear ever greater weight and loads due to their larger swept area. This means increased fatigue in the blade structure, and any blade failure therefore entails very high costs. It has been estimated that the mean time between failures in wind turbine blades is 5 years. The time spent repairing one of these blades is 2 days on average for onshore wind turbines; in the case of offshore wind turbines, however, the downtime can increase to up to a month.
The repair costs of a wind turbine blade may vary between €20,000 and €50,000 depending on the required operations, e.g. whether it is necessary to take the blade down, and whether it can be repaired in the field or has to be carried back to the factory. These costs can be multiplied by 10 if the turbine is located offshore, due to the associated costs of transporting the new blade and the difficult working conditions for replacement (Söker, Berg-Pollack, & Kensche, 2007). In addition to these costs, the lost profits caused by the downtime must be added.
To deal with these problems, companies have invested considerable resources in developing reliable preventive maintenance, which ensures the adequate functioning of wind turbines by means of periodic reviews consisting of gearbox oil changes, visual inspections of blades, screw retightening, etc. Nowadays a new form of maintenance has emerged, called predictive maintenance, which aims to avoid defects as far as possible; its function is to try to detect a potential failure before it occurs, so as to avoid triggering a fatal error. For this reason, this approach requires a system capable of providing the real-time status of the machine independently, safely and accurately (Figure 2).
Figure 2. Wind turbine condition monitoring for blade and
tower
BIG DATA IN STRUCTURAL HEALTH MONITORING
Non-destructive inspection tests are used with the purpose of detecting superficial or even internal discontinuities in a certain material, as well as assessing its properties. Structural Health Monitoring (SHM) is the process of implementing a damage detection strategy in a given structure. Through SHM it is possible to detect structural changes. It is commonly used for checking welding points and components, or for assessing the density of a material. Most of the time, the data obtained from these tests are not directly understandable and may require analysis by qualified professionals.
The implementation of such a system for predictive maintenance purposes represents a big challenge, and many factors make it harder. Some are related to the nature of the data, which primarily consist of time-domain signals, while data mining techniques have traditionally focused on cross-sectional data (with no time dimension). Another concern is the problem of integrating the results of multiple signal analyses in a unified and consistent framework. But probably the most challenging issue is dealing with huge amounts of data, as traditional algorithms are not designed to scale over terabytes or even petabytes of data.
This large amount of data to analyse is mainly due to the development of new types of low-cost sensors and the possibility of transmitting huge volumes of data anywhere. These factors make it possible to build a 'digital projection' of a machine's life, consisting of all the data that have been collected about it, including its surrounding working conditions. For example, the health monitoring of a machine should focus on the most critical components: not only those that cause higher failure rates, but also those that would produce longer downtimes. In wind turbines it is known that the highest failure rates among structural components are caused by the blades, especially the pitch system, and the drive system (Pinar, García, Tobias, & Papaelias, 2013). In this light, it would be reasonable to have at least 16 sensors on each blade and 48 sensors on the tower. These sensors are commonly Macro Fiber Composite (MFC) devices that work at ultrasonic frequencies, in the MHz range. A typical signal sampled at 4 MHz during one second represents 9.72 MB of data. With 96 sensors on a single turbine, this amounts to 933 MB of data in just a matter of a second. These figures get even bigger when multiple wind turbines are tracked: e.g., a wind farm with 80 turbines would generate 72.9 GB of data each second. In addition, all the information about environmental working conditions, which usually comes from SCADA systems, must be added.
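The scaling of these figures can be reproduced with a few lines of arithmetic; since the chapter does not state the per-sample encoding, the sketch below simply takes the stated rate of 9.72 MB per sensor per second as given:

```python
# Rough data-volume estimate for MFC condition monitoring, using the chapter's
# figure of ~9.72 MB per sensor per second (the per-sample encoding is not
# specified in the text, so that rate is taken as given).
MB_PER_SENSOR_SECOND = 9.72
SENSORS_PER_TURBINE = 96        # 16 per blade x 3 blades + 48 on the tower
TURBINES_PER_FARM = 80          # wind farm size used in the text

per_turbine_mb = MB_PER_SENSOR_SECOND * SENSORS_PER_TURBINE     # ~933 MB/s
per_farm_gb = per_turbine_mb * TURBINES_PER_FARM / 1024         # ~72.9 GB/s

print(f"Per turbine: {per_turbine_mb:.0f} MB/s")
print(f"Per farm   : {per_farm_gb:.1f} GB/s")
```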
Even though Big Data has become one of the most popular buzzwords, the industry has converged towards a definition of the term based on three dimensions: volume, variety and velocity (Zikopoulos & Eaton, 2011).
Data volume is normally measured by the quantity of raw transactions, events or amount of history that creates the data volume. Typically, data analysis algorithms have used smaller data sets, called training sets, to create predictive models. Most of the time, businesses use predictive insights that are fairly coarse, since the data volume has purposely been reduced to fit storage and computational processing constraints. By removing the data volume constraint and using larger data sets, it is possible to discover subtle patterns that can lead to targeted, actionable decisions, or enable further analyses that increase the accuracy of the predictive models.
Data variety came into existence over the past couple of decades, as data has increasingly become unstructured and the sources of data have proliferated beyond operational applications. In industrial applications, such variety emerged from the proliferation of multiple types of sensors, which enable the tracking of multiple variables in almost every domain. The most relevant technical factors include the sampling rate of the data and their relative ranges of values.
Data velocity is about the speed at which data is created, accumulated, ingested and processed. An increasing number of applications are required to process information in real time or with near real-time responses. This may imply that data is processed on the fly, as it is ingested, in order to make real-time decisions or to schedule the appropriate tasks.
PROPOSED METHODOLOGY
The approach proposed in this chapter is based on the use of three sources of information. These sources are not independent, because they provide information about the same physical event. The signals picked up by the transducers are processed in the very first step by three parallel filters, which are responsible for extracting the useful information to be used in the condition monitoring. The results are three sets of signals – vibrations, acoustic emissions and ultrasonic signals – each of them analysed by an independent ‘line’ of the system (Figure 3).
Figure 3. Functional Schema of Signal Analysis
Vibrations
The first approach analyses only the low frequencies that are characteristic of vibrations. It is possible to obtain valuable information about the integrity of a blade structure by analysing its vibrations in dynamic conditions (Abouhnik & Albarbar, 2012). The extracted information shows the natural frequencies of a blade and their respective harmonics. These vibrations are registered in a model that analyses their amplitude and frequency. Because blade manufacturing is manual and blades are not identical, the system creates a unique model for each blade. The model learns these two parameters over a period of time, and the learned parameters are associated with a fault-free model. The signals are therefore processed online in both the time domain and the frequency domain. In the time domain, upper and lower amplitude limits, based on what has been learned in the previous period, are applied to the signals: values beyond these limits correspond to very strong vibrations that can break fibres through fatigue, and they trigger an alarm. The same data are analysed in the frequency domain, where the natural frequencies and their harmonics are compared with those learned by the model. If a new energy peak appears in the frequency domain, which may indicate that a structural failure has altered the natural frequency of the blade, an alarm should also be triggered.
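A minimal sketch of how such online checks might look is given below; the amplitude limits, natural frequencies and thresholds are assumed to have been learned beforehand for the specific blade, and the function and parameter names are illustrative rather than part of the actual system:

```python
import numpy as np

def vibration_alarms(signal, fs, amp_limits, learned_freqs,
                     freq_tol=0.5, power_ratio=5.0):
    """Check one vibration record against a per-blade 'fault-free' model.

    signal        : 1-D array of samples
    fs            : sampling frequency in Hz
    amp_limits    : (lower, upper) amplitude bounds learned in the training period
    learned_freqs : natural frequencies and harmonics learned for this blade (Hz)
    """
    alarms = []

    # Time domain: amplitudes outside the learned envelope trigger an alarm
    lo, hi = amp_limits
    if signal.min() < lo or signal.max() > hi:
        alarms.append("amplitude outside learned limits")

    # Frequency domain: strong energy peaks far from any learned frequency
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    threshold = power_ratio * np.median(spectrum)
    for f, p in zip(freqs, spectrum):
        if p > threshold and not any(abs(f - lf) <= freq_tol for lf in learned_freqs):
            alarms.append(f"unexpected spectral peak near {f:.1f} Hz")
    return alarms
```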
By employing the Big Data approach, it is possible to analyse large amounts of information, even coming from different sensors and places. In this sense, the information obtained from meteorological sensors on the turbine becomes highly valuable. These data are usually extracted from SCADA systems and should include wind speed and direction, which strongly determine the blade's vibrations at its natural frequencies. That information is also taken into account in the design of the trained model, making it possible to predict the blade vibration for different wind directions and speeds.
Acoustical Emissions
The second approach is focused on the detection of acoustic emissions. When repetitive loads are applied to a certain material, it is known that they produce micro-breaks which release energy from the material (Beattie, 1997). This energy takes the form of elastic waves which produce sound. With the help of properly placed sensors it is possible to capture and record this sound. These sensors translate that mechanical energy into small electrical signals, which are usually pre-amplified in order to obtain a clearer signal. During the installation of the system it is important to locate the sensors adequately and to fix them to the material with good coupling. The signals are captured, amplified and recorded on a computer for later analysis. The frequency of the signal produced by the micro-breaks depends on several factors, e.g. the nature of the material, the type of discontinuity and the source of the emission. On this basis it is possible to characterize the source of the emission by isolating certain frequencies with the help of appropriate filters. In many works it is common to use three sensors, properly located to determine the source of the emission with high accuracy, usually with the help of triangulation algorithms.
In the domain of wind turbine blades, acoustic emission is a major way of detecting micro-breaks between the glass fibers of a blade in real time.
The method proposed in this chapter consists of the use of 16 MFCs on each blade (Figure 4). Most of the sensors are located in the first third of the blade, where breaks most commonly occur.
These sensors are constantly recording data. When a fiber break occurs, the elastic waves reach the MFCs and the signal is recorded (Figure 5). Knowing the distance between the MFCs, and measuring the time delay between the activation of each sensor, it is possible to accurately compute the location of the break and its characteristics, depending on the type of wave emitted (amplitude, frequency, etc.).
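For a one-dimensional placement of sensors along the blade axis, this localization reduces to simple arithmetic on the arrival-time difference. The sketch below is only illustrative: the wave speed is an assumed value, and a real system would use three sensors and triangulation as mentioned above.

```python
def locate_break_1d(x1, t1, x2, t2, wave_speed):
    """Estimate the position of an acoustic-emission source along the blade axis.

    For a source at position x between sensors at x1 and x2 (metres), the
    arrival-time difference dt = t1 - t2 gives x = (x1 + x2 + v * dt) / 2,
    assuming straight-line propagation at constant wave speed v (m/s).
    """
    return (x1 + x2 + wave_speed * (t1 - t2)) / 2.0

# Example with assumed values: sensors at 2.0 m and 6.0 m, wave speed ~3000 m/s,
# sensor 1 detects the event 0.5 ms later than sensor 2 -> source nearer sensor 2.
print(locate_break_1d(2.0, 0.0015, 6.0, 0.0010, 3000.0))   # 4.75 m
```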
The large amount of data generated by all the sensors, as well as the meteorological data, is processed by the Big Data system, and a specific model is generated for each blade. It is important to note that when meteorological data predict rain or hail, acoustic emission detection should be disabled, because the impact of each raindrop produces similar sound waves, which can be confused with those emitted by a fault. The system creates a proper model for each blade and records the detected acoustic-emission source locations and the most probable types of defect. In parallel, when an acoustic emission is detected, the ultrasonic inspection is activated in order to corroborate the possible damage.
The last approach is an active search for defects using the technique of pulse-echo ultrasound. In this case the transducer is excited with a short pulse at frequencies above 20 kHz. These signals are applied to the material, and the received echoes are studied. In order to strengthen the ultrasound analysis, the signals are emitted in the form of white noise. This line can be subdivided into two parts: the first performs general periodic monitoring; when a failure event is detected, the second is initiated to perform a more exhaustive analysis that verifies the actual existence of a break and its location.
Ultrasound Inspection
This approach is an active search for defects using ultrasonic inspection, and it also supports ice detection. For fault detection, in contrast to acoustic emissions, ultrasonic short pulses are applied to the material to be examined. A combination of the pulse-echo and pitch-catch techniques is used. The ultrasonic short pulses produce waves that travel through the material, and these waves are reflected when they reach an interface or a discontinuity (Su, Ye, & Lu, 2006). These echoes give information such as the break location, obtained by measuring the arrival time of the echo, as well as the type of defect (Figure 6).
For ice detection on blades, meteorological data from SCADA systems are also used. It is known that ice on the blade changes the properties of wave propagation, especially the velocity. If the meteorological information predicts frost and ice build-up on the blades, the ultrasonic transducer emits the short pulses and the system compares the collected signals with those coming from normal (previous) conditions.
BIG DATA AND CLOUD COMPUTING
The large amount of data generated by the sensors requires a powerful computing architecture that can manage these volumes of data without saturation or response delays. Nowadays, the concept of Cloud Computing has gained popularity in industry as well as in the academic community. It introduces the idea of ‘elastic resources’ that expand and contract according to the system load dynamics (Taniar, 2009). In this way, computational resources can be shared across applications, resulting in a great reduction of costs and latency (Minelli, Chambers, & Dhiraj, 2012). Even though this is a novel trend in industry, it is well founded on rather well-known concepts like parallel computing, concurrent computing and operating system architecture. All of these have existed for more than four decades in the academic and research community, and they have now been revived under the name of Cloud Computing. There are various reasons for this resurgence, e.g. the penetration of the Internet into every sector and industry of the economy, the increasing power of computers at lower prices, the advent of embedded systems and, finally, the evolution of the software industry. This new paradigm has offered a solution to an immense number of business models. Some of them fall into one of the following service models (see Figure 7).
Infrastructure as a Service (IaaS)
This is the most basic architectural layer of a computing system. The services that users can access include storage, load balancers, firewalls, virtual private networks (VPN) and virtual local area networks (VLAN), but the most generally consumed service at this level is computing power: the user is provided with network access to an operating system, which can be executed either by a real machine or by an emulator. The latter has crystallized in the industry under the name of virtual machine (VM), and many products have extended these techniques not only for infrastructural services but also for final users; some of them are VMware, VirtualBox, Xen and Hyper-V. Most of the tools provided by an IaaS cloud are focused on managing such virtual machines in software containers called pools. VM pools are able to run large numbers of virtual machines, and to scale resources up and down according to the overall load on the system (Zikopoulos & Eaton, 2011). From the customer's point of view, the services provided in this layer require considerable technical knowledge and expertise, because they involve installing, patching and maintaining operating system images as well as the actual final application software and its dependencies.
Platform as a Service (PaaS)
The technical complexity involved in IaaS products became a barrier to their adoption in many domains. For that reason, cloud providers started to build a new layer of services, closer to customer requirements: users should not have to worry about infrastructure issues, but should be able to focus on their concrete business requirements. PaaS is conceived to make this possible: the user is provided with a computing platform that ideally fits all their requirements – an operating system, a programming language execution environment, database and HTTP servers, and even an integrated development environment (IDE). The underlying resources are automatically scaled by the infrastructure, so that users do not have to allocate resources manually (Zikopoulos & Eaton, 2011). These features allow programmers to develop large and complex software solutions, deploy them in the cloud and reach production much more easily than with traditional server development techniques.
Software as a Service (SaaS)
The successful and rapid adoption of platform services in the cloud made the industry evolve towards a new layer of services, those oriented to final software consumers. As programmers were endowed with cloud development platforms and tools, the most foreseeable evolution was indeed a remarkable increase in the volume of cloud solutions targeting final consumers. These services were quickly and massively adopted in the software industry: on the one hand, software developers install and operate applications without having to deal with the infrastructure's complexity; on the other hand, final users receive additional benefits because they do not have to install and run the application on local machines, but simply use cloud clients to access it. These changes dramatically reduce the tasks of support and maintenance, and unify the user experience in a single interface (Zikopoulos & Eaton, 2011). The application is completely executed in the cloud, allowing the implementation of new, valuable features and functionalities that were unfeasible in local software solutions. This is made possible mainly by the hardware scalability provided by the infrastructure at the lower levels, but some techniques have been designed to optimize time and resources at the software layer as well. A software optimization technique that has become very popular among SaaS providers is multitenancy, which consists of grouping several logical instances of an application so that they are served by a single shared resource.
Another major by-product of this trend is the so-called Internet of Things (IoT), which is becoming more and more popular these days. It is a natural consequence of the previous developments, since it consists of the interconnection of uniquely identifiable embedded computing devices with the existing Internet infrastructure. Companies like Xively, AT&T, Axeda, Cisco and others are offering solutions specifically targeted at this new, growing market.
PROPOSED INFRASTRUCTURE
The sensors are controlled by a node (in some domains it is called a wasp-mote), which is a device capable of receiving streams of data from sensors, splitting them into packages and sending them through the network with the appropriate metadata. Several considerations should be made at this point. With respect to the volume of data generated, it is important to make accurate estimations in order to acquire hardware that can support it. For example, if forty MFC sensors need to be installed on a wind turbine, capturing data at 25 kHz, this would mean approximately 360 MB of data every second. Would the node support this? Is its data bus sufficient for ingesting such a stream? These are all factors that should be considered when hardware options are being evaluated.
Once the data have been properly collected by the node, they have to be transmitted to the computing centre. This process is usually done by means of REST web services, which work over the HTTP request/response layer; nevertheless, some alternative technologies have recently emerged, such as WebSockets and other streaming solutions, some of which have crystallized in the HTML5 standard that web browsers must comply with. Given all these alternatives, it is important to select the most appropriate data transmission technology. Some criteria that help in this choice are:
• Granularity: The granularity of the requests (batches of data sent) from the node to the server has to allow the latter to scale its resources as needed. In other words, it is necessary to balance the trade-off between small and large requests. Small requests let the server scale more smoothly, but at the price of a higher number of requests; since a certain amount of protocol-related bytes is attached to each request, the total efficiency of the system can decrease when the size of the requests is arbitrarily reduced (a short numeric sketch of this trade-off follows the list).
• Responsiveness: One of the strongest requirements in industrial applications is the responsiveness of the whole system. This is true also for
monitoring systems, where no real-time decision is made but streams of data have to be continuously processed to guarantee safe working
conditions. This is the case of wind turbine condition monitoring. The next sections present how this can be achieved in a cloud computing
environment.
Task Queues
Regardless of the cloud provider selected, the system should be designed to scale and balance resources as needed; otherwise it will most
probably collapse at demand peaks. The most common approach consists of splitting the load appropriately: some tasks are processed 'in the
background' by worker threads, while other tasks are dispatched directly in the 'main thread'. The correct separation depends on several factors,
such as the quantity of expected requests, the volume of the received data and the computing cost of processing each request. In all cases, the
incoming requests are organized into small, discrete units of work, called 'tasks', and pushed into a queue, while 'worker' processes dispatch
them as soon as possible. The scalability of the system comes from replicating the 'workers' so they can consume the queued tasks in a timely
manner, and this is done dynamically according to the current system load.
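A minimal, provider-agnostic sketch of this producer/worker pattern is shown below, using Python's standard queue and threading modules. The task payload and the fixed pool of four workers are arbitrary choices for illustration; in a real deployment the pool size would be adjusted dynamically to the load.

```python
# Minimal task-queue sketch: producers push tasks, replicated workers consume them.
import queue
import threading

task_queue = queue.Queue()

def worker(worker_id):
    while True:
        task = task_queue.get()       # blocks until a task is available
        if task is None:              # sentinel used to shut the worker down
            task_queue.task_done()
            break
        # 'Background' processing of one unit of work (e.g. one sensor batch).
        print(f"worker {worker_id} processed {task}")
        task_queue.task_done()

# Replicate workers according to the expected load (here: a fixed pool of 4).
workers = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for w in workers:
    w.start()

# The 'main thread' only enqueues small, discrete units of work.
for batch_id in range(10):
    task_queue.put({"batch": batch_id})

task_queue.join()                     # wait until every task has been dispatched
for _ in workers:                     # stop the workers
    task_queue.put(None)
for w in workers:
    w.join()
```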
In today's cloud industry this approach is widely used, though under different names depending on the provider. For example, in Google
Cloud Platform it is called 'Task Queues', while in Amazon Web Services (AWS) it is named 'Simple Queue Service'. In the Windows Azure Cloud
Platform it is a bit trickier to match this functionality: it is a combination of the 'Service Bus' and 'Scheduler' functionalities, both provided in
the Azure platform. The Scheduler provides a mechanism for orchestrating multiple processes, integrating them all on a single logical basis, but
it needs the help of the Service Bus, which provides the messaging and buffering tools that make such integration possible.
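For a managed variant of the same pattern, the hedged sketch below uses the AWS Simple Queue Service via boto3. The queue name, region and message payload are placeholders, and comparable code could be written against the Google Cloud or Azure SDKs; this is a sketch of the idea, not the chapter's implementation.

```python
# Hedged sketch: the producer/worker pattern on a managed queue (AWS SQS).
# Queue name, region and payload are placeholders; credentials must be configured.
import json
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
queue_url = sqs.create_queue(QueueName="sensor-tasks")["QueueUrl"]

# Producer side: the ingestion service enqueues one sensor batch per message.
sqs.send_message(QueueUrl=queue_url,
                 MessageBody=json.dumps({"turbine": "T-01", "batch": 42}))

# Worker side: poll for messages, process them, then delete them from the queue.
response = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=10)
for msg in response.get("Messages", []):
    task = json.loads(msg["Body"])
    # ... run the signal-processing task here ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```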
Processing Algorithms
The algorithms executed by these tasks depend on the concrete application being built. In wind turbine condition monitoring with MFC sensors,
task processing will probably consist of analysing the received raw signal, starting with a feature extraction step. Feature extraction is a form of
dimensionality reduction in which the data is transformed into a reduced representative set of features. The objective of this step is twofold: to
reduce the redundancy in the data, and to highlight the relevant information it contains.
Once this information is computed, the common next step is to compare it with previously stored data. The most elegant and efficient way of
doing this is by means of statistical modelling techniques. A statistical model is an abstract formalization of the relationships between the
principal variables in a certain phenomenon. By using a model, it is possible to condense huge amounts of data into just a few values, usually
corresponding to the parameters of the model. The process of optimizing the values of the parameters is called model training, because the
model is fitted to the data according to certain restrictions (Sheikh, 2013). There are many different approaches to modelling data. For signal
analysis, the Fast Fourier Transform (FFT) and the Wavelet Transform (WT) are very commonly used.
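As a minimal illustration of FFT-based feature extraction, the sketch below reduces a raw vibration signal to a handful of frequency-band energies using NumPy. The sampling rate mirrors the 25 kHz example above, but the band edges and the synthetic input signal are illustrative assumptions, not values taken from the chapter.

```python
# FFT-based feature extraction: condense a raw vibration signal into band energies.
import numpy as np

def band_energies(signal, fs, bands):
    """Return the spectral energy of `signal` in each (low, high) frequency band."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in bands])

if __name__ == "__main__":
    fs = 25_000                                    # 25 kHz, as in the example above
    t = np.arange(0, 1.0, 1.0 / fs)
    # Synthetic signal standing in for one second of MFC sensor data.
    signal = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(len(t))
    bands = [(0, 500), (500, 2_000), (2_000, 10_000)]   # illustrative band edges
    features = band_energies(signal, fs, bands)
    print(features)    # a few values instead of 25,000 raw samples
```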
Using the appropriate modelling technique, and having a set of well-trained models, it is possible to detect novelties in the data simply by testing
the new data against the models. Depending on the case, the appropriate system message has to be triggered (warning, error, fault, etc.).
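One simple way to test new data against a trained statistical model is to keep the mean and standard deviation of each feature from healthy operation and flag large deviations. In the sketch below, the 3-sigma warning and 6-sigma fault thresholds, and the random stand-in for healthy data, are arbitrary illustrative choices rather than values from the chapter.

```python
# Novelty detection against a stored statistical model of healthy behaviour.
import numpy as np

class BaselineModel:
    """Per-feature mean/std 'model' trained on data from healthy operation."""

    def fit(self, healthy_features):                  # shape: (n_samples, n_features)
        self.mean = healthy_features.mean(axis=0)
        self.std = healthy_features.std(axis=0) + 1e-12
        return self

    def score(self, features):
        """Largest per-feature deviation, in standard deviations."""
        return np.max(np.abs(features - self.mean) / self.std)

def check(model, features, warn=3.0, fault=6.0):      # thresholds are illustrative
    z = model.score(features)
    if z > fault:
        return "fault"
    if z > warn:
        return "warning"
    return "ok"

# Usage: train on archived healthy band-energy vectors, then test new vectors.
healthy = np.random.rand(1000, 3)                     # stand-in for stored features
model = BaselineModel().fit(healthy)
print(check(model, np.array([0.5, 0.5, 5.0])))        # triggers a fault message
```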
Data Logging
The next step consists of storing the relevant data in a proper way. When the stream of received data is extremely large, it is worth considering
a feature extraction process designed to reduce the dimensionality of the data for storage purposes. In such cases, instead of storing the full set
of received data, the system stores only a representation of it. Part of this process would imply re-training the stored models, in order to update
them according to the newest system state.
Conventional relational databases are not well suited to this kind of application, firstly because of the huge volume of the data gathered, and
secondly due to the nature of the data itself: relational databases were designed mainly for enterprise applications, where the data is essentially
transactional and relational. In big data applications it is very common to work with unstructured data, even with dynamic formats and
relationships. For that reason, many efforts have been devoted to building schema-less databases, usually referred to under the general name of
'NoSQL' databases, which are better suited to this new kind of application. MongoDB is probably the most popular product in this sector. NoSQL
stores were rapidly adopted by cloud providers as one of the key infrastructure components; examples include Google's 'Cloud Datastore' and
Amazon S3 from AWS.
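As a hedged sketch of this logging step, the snippet below stores one document of extracted features per processed batch in MongoDB via pymongo. The connection string, database and collection names, and the document layout are assumptions made for illustration.

```python
# Hedged sketch: log extracted features (not the raw stream) into a NoSQL store.
# Connection string, database/collection names and document layout are illustrative.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["monitoring"]["turbine_features"]

def log_features(turbine_id, features, status):
    """Store a compact representation of one processed batch."""
    collection.insert_one({
        "turbine": turbine_id,
        "timestamp": datetime.now(timezone.utc),
        "band_energies": list(features),   # reduced representation, not raw samples
        "status": status,                  # e.g. 'ok', 'warning', 'fault'
    })

log_features("T-01", [0.42, 0.11, 0.03], "ok")
```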
FUTURE RESEARCH DIRECTIONS
Even though Big Data technologies are opening exciting new horizons in the ICT industry, some authors point out the need to approach data
exploitation from a more semantic and holistic perspective (Barone et al., 2010), due to the overwhelming volume of data coming from different
sources and even in different formats. The idea behind this new approach is to make sense of data in the light of a conceptual model that
encapsulates the semantics of the raw data in a unified framework. In the context of industrial applications, this would entail a tighter
integration between the theoretical and experimental (data-based) models of a machine's performance, with wide consequences throughout the
whole product life-cycle, from operation and maintenance management to the earlier stages of product development and design.
CONCLUSION
The proposed model covers the most important requirements of structural health monitoring and vibration monitoring in a wind turbine blade.
The system analyses the received signals online and acts depending on different events. After the signal processing stages, the system records
the status of each wind turbine in order to predict future failures and to identify failure patterns across different wind turbines. The advantages
of the proposed methodology include the online exploitation of large volumes of data, as well as the integration of continuous parallel analysis
in a unified framework. Additional advantages come from the fact that the system uses the same transducers for three different purposes, which
simplifies the whole system and reduces costs.
This work was previously published in the Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence edited by
Noor Zaman, Mohamed Elhassan Seliaman, Mohd Fadzil Hassan, and Fausto Pedro Garcia Marquez, pages 149-163, copyright year 2015 by
Information Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
The work reported herewith has been financially supported by the Spanish Ministerio de Economía y Competitividad, under Research Grant IPT-
2012-0563-120000 (IcingBlades).
REFERENCES
Abouhnik, A., & Albarbar, A. (2012). Wind turbine blades condition assessment based on vibration measurements and the level of an empirically
decomposed feature. Energy Conversion and Management , 64, 606–613. doi:10.1016/j.enconman.2012.06.008
Barone, D., Yu, E., Won, J., Jiang, L., & Mylopoulos, J. (2010). Enterprise modeling for business intelligence . In The practice of enterprise
modelling (PoEM’10) (Vol. 68, pp. 31–45). Berlin: Springer. doi:10.1007/978-3-642-16782-9_3
Beattie, A. G. (Ed.). (1997). Acoustic emission monitoring of a wind turbine blade during a fatigue test. In AIAA Aerospace Sciences Meeting
(AIAA/ASME ‘97). American Institute of Aeronautics and Astronautics. 10.2514/6.1997-958
Berson, A., & Smith, S. J. (1997). Data warehousing, data mining, and OLAP . New York: McGraw-Hill.
Homola, M. C., Nicklasson, P. J., & Sundsbø, P. A. (2006). Ice sensors for wind turbines. Cold Regions Science and Technology ,46(2), 125–131.
doi:10.1016/j.coldregions.2006.06.005
Lehmann, M., Büter, A., Frankenstein, B., Schubert, F., & Brunner, B. (2006). Monitoring System for Delamination Detection – Qualification of
Structural Health Monitoring (SHM) Systems. In Proceedings of the Conference on Damage in Composite Material (CDCM '06). Stuttgart.
Minelli, M., Chambers, M., & Dhiraj, A. (2012). Big data, big analytics: emerging business intelligence and analytic trends for today’s businesses .
Hoboken, NJ: John Wiley & Sons.
Pinar Pérez, J. M., García Márquez, F. P., Tobias, A., & Papaelias, M. (2013). Wind turbine reliability analysis. Renewable & Sustainable Energy
Reviews , 23, 463–472. doi:10.1016/j.rser.2013.03.018
Sheikh, N. (2013). Implementing Analytics: A Blueprint for Design, Development, and Adoption. Morgan Kaufmann.
Söker, H., Berg-Pollack, A., & Kensche, C. (2007). Rotor Blade Monitoring – The Technical Essentials. In Proceedings of the German Wind Energy
Conference (DEWEK '07). Bremen.
Su, Z., Ye, L., & Lu, Y. (2006). Guided Lamb waves for identification of damage in composite structures: A review. Journal of Sound and
Vibration, 295(3), 753–780. doi:10.1016/j.jsv.2006.01.020
Taniar, D. (2009). Progressive Methods in Data Warehousing and Business Intelligence: Concepts and Competitive Analytics . Hershey, PA: IGI
Global. doi:10.4018/978-1-60566-232-9
World Wind Energy Association. (2012). World wind energy report . Bonn: WWEA.
Zikopoulos, P., & Eaton, C. (2011). Understanding big data: Analytics for enterprise class hadoop and streaming data . New York: McGraw-Hill.
KEY TERMS AND DEFINITIONS
Acoustical Emission: A non-destructive technique used for detecting elastic waves produced by a material as repetitive loads are applied to it.
Big Data: The set of methodologies and technologies that allow the capture, management and exploitation of considerable amounts of data
within a tolerable period of time.
Cloud Computing: A set of internet-based computing services, resources, technologies and infrastructures that are remotely available for end-
users.
Data Mining: A subfield of computer science focused on patterns discovery in large data sets involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems.
Filter: A signal processing technique used for partial or complete suppression of some components of a given signal.
Macro Fiber Composite (MFC): A kind of low-profile actuator and sensor consisting of rectangular piezo-ceramic rods stacked between layers of
adhesive, electrodes and polyimide film.
Parallel Computing: A computation technique consisting of the concurrent execution of subtasks to accomplish a larger task.
CHAPTER 60
Towards Harnessing Phone Messages and Telephone Conversations for Prediction of
Food Crisis
Andrew Lukyamuzi
Mbarara University of Science and Technology, Uganda
John Ngubiri
Makerere University, Uganda
Washington Okori
Uganda Technology and Management University (UTAMU), Uganda
ABSTRACT
Food insecurity is a global challenge affecting millions of people, especially those in the least developed regions. Famine predictions are carried
out to estimate when a shortage of food is most likely to happen. The traditional datasets used for predicting food insecurity, such as household
information, price trends, crop production trends and biophysical data, are both labor intensive and expensive to acquire. Current trends are
towards harnessing big data to study various phenomena such as sentiment analysis and stock markets, and big data is said to be easier to obtain
than traditional datasets. This study shows that phone message archives and telephone conversations, as big datasets, have potential for
predicting food crises. This is timely given the current massive penetration of mobile technology, which means the necessary data can be
gathered to foster studies such as this one. Computation techniques such as Naïve Bayes, Artificial Neural Networks and Support Vector
Machines are prospective candidates in this strategy. Areas of concern are highlighted for the strategy to work in a nation like Uganda. Future
work points at exploring this approach experimentally.
1. INTRODUCTION
Food insecurity is among the terrors that have perturbed human welfare. Humans have lived with this challenge for many generations, and
numerous sources testify to its early manifestations, including scholarly works, religious books, story books and oral tradition. The famines that
struck Europe in the years between 1343 and 1345 were deadly (Mellor, 1987): about 43 million people lost their lives. In the period from 1959
to 1961 China fell into a similar trap (Mellor, 1987); it is estimated that between 16 and 64 million people in China perished for the same cause.
The devastating effects of food insecurity compelled the world to seek appropriate solutions. As a result, success stories of reduced food
insecurity have been recorded. According to the Food and Agricultural Organization et al. (2014), hunger cases have fallen by 100 million people
in the previous decade. The first Millennium Development Goal included a target to reduce hunger cases by half no later than 2015. Among the
countries that have achieved this target, Latin America and the Caribbean have made the greatest progress (Food and Agricultural Organization
et al., 2014).
It is important to examine the extent to which the world has managed to control or eliminate food insecurity. Unfortunately, the hard fact
remains that food insecurity is still a big challenge, and the available statistics leave little room to doubt this. About 842 million people in the
world are victims of chronic hunger (Food and Agricultural Organization et al., 2013). The report released by the Food and Agricultural
Organization et al. (2014) established that 805 million people are chronically undernourished. The Committee on World Food Security (2013)
has disclosed that more than 200 million children under five years of age are malnourished. In the period from 1995 to 1998 about 1 million
people lost their lives in the famines of North Korea (Committee for Human Rights in North Korea, 2005). In order to keep pace with population
growth by the year 2050, food production should increase by 70% (International Fund for Agricultural Development, 2010). On the other hand,
factors such as climate change, soil exhaustion, and biofuel practices are exacerbating this challenge (Faaij, 2008). It is therefore evident that
any innovation that can assist in mitigating this challenge is worthwhile.
Prediction of food insecurity is a possible remedy to this challenge. It is instrumental in guiding stakeholders on where to direct early
intervention relief. Such relief is helpful in several ways: (1) the impact of food insecurity can be reduced or eliminated completely, and (2) the
expense involved in amelioration can be minimized (Okori & Obua, 2011; Brown et al., 2008). This has proved successful in some parts of the
world. For instance, according to the United States Department of Agriculture (USDA, 2005), reliable monitoring of food insecurity contributes
to the effective operation of Federal programs, food assistance programs, and other government initiatives aimed at reducing food insecurity.
Several features have been proposed for the prediction of food insecurity. These include, but are not limited to, household information, price
trends, biophysical features, and economic growth. Care is needed in determining which features are suitable for predicting food insecurity.
While features used in one study can be applied to other studies, this is not always appropriate, because the drivers of food dynamics are not
necessarily the same. There are situations where the features are the same but vary in relevance, which can have a big impact on prediction
performance. Inspection is one possible approach to determining which features are appropriate for a particular study. This involves examining
the role of various features in food security dynamics; correlations are commonly computed to establish the strength of the relationship between
food insecurity and the proposed features. In this way the researcher is able to establish which features are appropriate for predicting food
insecurity. The second method is to review relevant literature. By examining previous related studies, it is possible to deduce justifications for
the selection of certain features as prediction parameters. In the process, patterns governing the choice of features in a given environment can
be established, and these patterns become an important reference point in determining which predictive features to use for a given study.
Identifying appropriate predictive features for a given study is not enough. It is equally important to consider the resources and time required
to gather the necessary datasets. Datasets for features such as household information, biophysical properties, and price trends require a certain
amount of resources and time. For some studies, individual researchers find it difficult to acquire the necessary datasets with their own
resources. In some situations even governments are unable to collect the required datasets, such as household information, frequently, because
of the high expenses involved. This has created an open window for researchers to seek innovative ways of re-using other readily available
datasets. For example, the web has become an alternative source of data for analyzing stock markets and people's sentiments. It is against this
background that this study explores the possibility of using phone message archives and telephone conversations for the prediction of food
crises. Few studies, if any, have attempted to use this kind of data in a similar way. Section 4 discusses the implications of this innovation.
This paper is organized as follows: The next section reviews related work. The methodology section follows immediately, then the discussion.
The paper ends with the conclusion and future work.
2. RELATED WORK
2.1. Existing Methods for Predicting Food Insecurity
Several categories of models for predicting food insecurity have been proposed, although these models are also applicable elsewhere. This is not
an exhaustive review, but a high-level review of these categories.
2.1.1. Statistical Models
These are developed based on statistical principles. Linear regression, nonlinear regression, and moving averages are examples of statistical
models that have been used in studying food patterns. According to Makowski and Michel (2013), these models analyze time series of past yields
to predict future yield trends; forecasts of poor yields can be a sign of pending food insecurity. Linear regression techniques offer a simple way
to estimate yields and are best for yields that grow at a fairly constant rate. For yields which vary nonlinearly, quadratic and exponential growth
methods can provide better approximations (Makowski & Michel, 2013). Makowski and Michel (2013) contributed to the state of the art by
exploring the application of a dynamic model for analyzing yield trends; their study revealed better performance in accommodating system
uncertainties than most available models such as Holt-Winters. Mbukwa (2013) has also explored the role of a statistical model (logistic
regression) in studying food patterns, showing that age and household size are potential variables for predicting food security status in the
households of Tanzania. Statistical models are well structured, and this makes them user-friendly. Their major drawback is that they are inclined
towards the manipulation of numerical data, and the data for studying food patterns is not necessarily in numerical format, for example satellite
imagery tracking vegetation growth patterns. Such imagery can be converted into numerical data, but this can be challenging.
2.1.2. Canonical Correlation Analysis Models
These models are developed using advanced statistical analysis techniques; both statistical and linear algebra techniques are used in their
formation. Some studies have explored this approach. According to Brown et al. (2008), Famine Early Warning System Network (FEWS NET)
partners have developed models that use canonical correlation analysis to relate variations in sea surface temperature to rainfall in Africa.
Global forecasts from the International Research Institute (IRI) at Columbia University predict climate through the Forecast Interpretation Tool
(FIT) (Brown et al., 2008). Examining and monitoring these rainfall patterns can give insights into the expected yield outcomes. While canonical
correlation analysis models offer an advanced mechanism for model formulation, they are complex to comprehend.
2.1.3. Mathematical Models
Most models have some mathematical formulation ingrained in their description, so drawing a dividing line between mathematical models and
other models is a challenging task. In this context, mathematical models refer mainly to models developed by researchers who have explored
mathematics in great depth. Mathematical models manipulate mathematical functions for prediction purposes. Some researchers in
mathematical modeling extend the capabilities of these predictive functions, and such extensions have positive impacts on the performance of
particular models. It is important to note that other modeling categories can do the same; the overlap between mathematical models and other
models is not clear-cut. Mathematical models provide avenues to explore beyond the limits of established models.
2.1.4. Markov Models
The term Markov comes from the Russian mathematician Markov, and Markov models are based on his ideas. According to the Markov
assumption, the historical behavior of a system is irrelevant for predicting its future state: prediction should be based only on the current state
of the system. For some systems, ignoring previous behavior is essential in eliminating bias that could otherwise be detrimental; for other
systems, this history is useful for the formation of more dependable models. Probabilistic behavior is embedded in the formation of Markov
models, and this makes them suitable for systems that exhibit some level of uncertainty. The study carried out by Brown et al. (2006)
demonstrates the application of a Markov model in predicting food insecurity.
2.1.5. Artificial Intelligence Models
Developers of Artificial Intelligence models strive to integrate human intelligence into their models. These models are suitable for systems
requiring sophisticated reasoning or the integration of human expertise. Artificial Intelligence models are capable of capturing and modeling
data in various formats such as textual data, numerical data, audio data, imagery data and symbols. Advances in technology have led to the
birth of another discipline emerging from Artificial Intelligence, namely Machine Learning. Most models in Artificial Intelligence are now
inclined towards Machine Learning, and current AI models for predicting food insecurity are linked to Machine Learning. As a result, Artificial
Intelligence and Machine Learning models are difficult to distinguish.
2.1.6. Machine Learning Models
Machine Learning (ML) is an emerging computational approach for modeling food dynamics. It is also applied in other fields such as disease
detection and diagnosis, web search, spam filters, recommender systems, credit scoring, fraud detection, stock trading, drug design, and many
other applications (Domingos, 2012). It integrates concepts from other disciplines such as statistics, Artificial Intelligence, psychology,
mathematics and biology; it appears as if Machine Learning has come to unify many disciplines of human endeavor. In relation to food security,
Machine Learning techniques are well suited to the prediction of risks like famine since they can enhance classification accuracy (Okori & Obua,
2011). Okori and Obua (2011) explored the application of Machine Learning to food dynamics. Their major contribution reveals that Machine
Learning techniques such as SVM, k-Nearest Neighbors, Naïve Bayes and Decision Trees are viable solutions for the prediction of famine from
household information such as size of land holding, size of household, labor input, number of livestock owned, and other features. The
possibility of deducing food crises from datasets such as mobile communication patterns requires investigation.
2.2. The Potential of Big Data in Predicting Food Insecurity
Today more devices and sources are available for collecting or generating data. These sources include, but are not limited to, sensors, Internet
transactions, social media, mobile devices and automated sensors (Vashist, 2015). As a result, the volume of medical data, remotely sensed data,
phone message archives, web postings, and online transaction data, among many other forms of data, is growing exponentially. This exponential
growth has led to the birth of the big data concept. What is big data? Big data is a term usually associated with large volumes of data. While this
is not untrue, the definition is not complete. According to Vashist (2015), other attributes of big data include velocity, variety, and veracity.
Velocity means that data is produced at a high rate. Variety means that the data collected is in various forms, such as structured, unstructured,
semi-structured, text and media (Pokorny et al., 2015). Veracity means there is variation in the quality of the data collected.
While big data has been instrumental in alleviating the problem of data scarcity, it is also causing information overload. According to Li and Li
(2011), information overload is a big challenge affecting users, mainly those in managerial positions. This is not strange, since traditional
techniques cannot effectively manipulate big data to support decision making. In response to this challenge, a lot of attention is directed towards
innovative techniques for mining big data to benefit various fields. Food insecurity is among the fields that can benefit from big data. This
section highlights opportunities in big data to promote the science of predicting food insecurity.
2.2.1. Mobile Device Data
Telecommunication companies collect various forms of data from subscribers, and movement patterns are one such data type. The towers of
these companies can identify the area where a mobile device is located, which reveals where the user of that device is at a particular time. Using
this technique, telecommunication companies are able to track the movements of their subscribers. Normal movement patterns can be
established and associated with particular times, days, seasons and places. Anything that disrupts social welfare impacts the movement patterns
of populations; food insecurity disrupts the social welfare of societies and so it affects people's movement patterns. For example, during the
North Korean famines of 1995-98, many people were forced to relocate and some are said to have escaped to neighbouring countries
(Committee for Human Rights in North Korea, 2005). Therefore examination of these movements can potentially aid in studying food dynamics.
Communication load is another possible source of data for studying food dynamics. During periods of food scarcity, communication loads such
as the frequency and duration of calls are affected. Deviation from the normal communication load can provide proxy information on food
status. The deviation can occur in two ways. First, the frequency of communication can increase during periods of food crisis as people seek
assistance from others. Secondly, the frequency of communication can also decrease because, during a food crisis, people can be too poor to
afford communication costs. For example, in research carried out by United Nations Global Pulse (2015), a correlation was observed between
airtime purchases for making calls and food security. Therefore communication loads (such as the frequency and duration of calls) are a possible
path for deducing food insecurity.
Phone messages and telephone conversations can also aid in deducing a pending food crisis. Communication companies have the ability to
record and back up communication contents such as phone messages and telephone conversations. This study gives special attention to how
this can be used to study food patterns; for details see the methodology section.
2.2.2. Web Data
The web is increasingly becoming an important source of data for studying various phenomena. Transaction records from client purchases
provide opportunities to examine stock markets. Postings and comments on Facebook and Twitter drive studies aimed at understanding people's
sentiments on various issues, such as the quality of products sold online. Web data can likewise provide opportunities to study food scarcity.
From web postings, patterns can be extracted, for example people's sentiments towards food security; a shift towards the negative can be a sign
of food crisis. In this way the web is a possible source of data for studying food patterns.
2.3. Parameters for Prediction of Food Insecurity
Previous investigations have revealed the key drivers of food dynamics. Brown et al. (2008) identify biophysical, economic and political factors
as key players in predicting food insecurity. According to Braun (1991), food insecurity is the outcome of an interaction between environmental
and socioeconomic factors, both in the short and the long term, and a failure of policy to deal with them; policy in this context refers to politics.
In brief, the key players in modeling food dynamics are categorized as political, economic, biophysical, and socioeconomic factors. In the
following paragraphs, parameters and variables derived from these factors are identified.
Biophysical factors focus on environmental influences on food security; they provide relevant metrics during the growing season that can give
insight into the future food supply at the next harvest (Brown et al., 2008). Biophysical parameters identified in this research include the
Normalized Difference Vegetation Index (NDVI), the Rainfall Estimate (RFE) and the Water Requirement Satisfaction Index (WRSI). Brown
(2008) gives a much bigger list of parameters, which includes precipitation gauges and gridded data from merged satellite models, vegetation
data from a variety of sensors, gridded cloudiness products, global climate indicators, precipitation forecasts, modeled soil moisture, gridded fire
products, snow extent products, hydrological models for flood forecasting, and seasonal forecasts. These data products were either developed
directly by or for Famine Early Warning System Network (FEWS NET) partners or were adapted to their needs (Brown, 2008). Gathering this
type of data requires various weather stations strategically located in a given region, which ensures that the data collected is reliable for
predictive studies. Installing and maintaining equipment in weather stations is expensive.
Socioeconomic factors are related to supply and demand, as seen with many economic goods (Okori & Obua, 2011). According to Brown (2008),
socioeconomic parameters for food dynamics include agricultural production, market prices, food economy zones, employment, population,
school attendance, and infrastructure maps. With socioeconomic data, according to Mellor (1986), famine is predicted by successive years of
poor crops, a rapid rise in food prices, a decline in the prices of goods that the poor sell (particularly the livestock of pastoralists), and a decline
in employment. In the small, isolated, and informal markets that are typical of the region, food prices are intimately linked with local food
production (Brown et al., 2008). Hazard monitoring uses this baseline profile to determine the normal situation from which the impact of both
socioeconomic and biophysical anomalies can be measured. It is expensive to mobilize these records.
Identifying political variables for food insecurity is not straightforward. According to the United Nations Development Programme (2007),
technical and political failures, such as a lack of information to guide farmers in planning their agricultural activities and misguided policies,
also contribute to famine. With effective institutions and adequate physical supplies, the occurrence of famine increasingly signals not a lack of
food or capacity, but some fundamental political or governance failure (United States Committee for Human Rights, 2005). This is why Mellor
(1986) believes that a democratic form of government plays a key role in famine prevention.
2.4. Categorizing Text Data and Audio Recording
For audio recordings, this study recommends converting them into text format; tools already exist to do this, so all the data will then be in text
format. Textual data is considered unstructured and as a result it is difficult to analyze (Wajeed & Adilakshmi, 2009). Analyzing text involves
classifying or categorizing it. Classification can be flat or hierarchical. In flat classification, categories are on the same level in a parallel
arrangement and one category does not supersede another (Wajeed & Adilakshmi, 2009); see Figure 1 for an illustration. Hierarchical
classification, as illustrated in Figure 2, is a multilevel organization of classes. Classification can also be binary or multi-class. Binary
classification is also known as single-class categorization: it provides two options, where the text is either identified as belonging to that single
class or not. With multi-class classification there are more than two classes to which the text can belong. Another distinction is between hard
and soft classification. With hard classification an entity either fully belongs or does not belong to a given class; with soft classification an entity
can partially belong to a class to a varying degree.
2.5. Extracting Patterns from Textual Data
Several stages are involved in the extraction of patterns from textual data; see Figure 3 for an illustration. At each stage the data undergoes a
transformation. The early stages mainly deal with pre-formatting the data, ensuring that it is appropriately trimmed, which improves the
performance and accuracy of the extraction process. These stages are described in the following paragraphs.
Figure 3. Steps for extracting patterns from textual data (Ikonomakis et al., 2005)
The contents of the documents to be analyzed are first read. Tokenization involves splitting the contents into separate entities and removing
punctuation marks; the tokenized words are then treated as individual words. There are some challenges at this stage. For example, some
groups of words, like New York, need to be treated as one word, and tokenization techniques have to address challenges such as these.
Stemming is the process of reducing textual features by considering their word stems. Words such as trainers, training, train and trained are
reduced to their common stem, train. Stop words such as auxiliary words, conjunctions, and articles are considered useless for pattern extraction,
which is why they are eliminated (Ikonomakis et al., 2005). Both stemming and the removal of stop words reduce the amount of data, which
improves performance. After stemming, the remaining data is represented in vector format to ease text manipulation using linear algebra
techniques.
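A minimal sketch of these preprocessing steps (tokenization, stop-word removal, stemming, and vector representation) is shown below using NLTK and scikit-learn. The example messages and the English stop-word list are illustrative, and the NLTK tokenizer and stop-word corpora need to be downloaded once before use.

```python
# Tokenize, remove stop words, stem, and vectorize a small set of messages.
# Example messages are invented; run nltk.download('punkt') and
# nltk.download('stopwords') once before using this sketch.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(message):
    tokens = word_tokenize(message.lower())                   # tokenization
    tokens = [t for t in tokens if t.isalpha()]                # drop punctuation
    tokens = [t for t in tokens if t not in stop_words]        # stop-word removal
    return " ".join(stemmer.stem(t) for t in tokens)           # stemming

messages = ["We have no food left in the village",
            "Prices of maize are rising again this week"]
cleaned = [preprocess(m) for m in messages]

vectorizer = CountVectorizer()                 # vector (bag-of-words) representation
X = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```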
Feature selection deals with the generation of relevant features for text extraction. At this stage, less relevant features are eliminated to reduce
dimensionality; irrelevant features can add noise to the data, which impacts model accuracy negatively (Singh & Chauhan, 2012). With lower
dimensionality, time complexity is reduced and resource utilization improves.
Feature transformation maps a given set of features to a lower dimension without simply dropping features. This reduces time complexity with
little risk of compromising accuracy. The next step is to train selected machine learning techniques (also called classifiers) on this formatted
data. The next section explores machine learning techniques that are potential candidates for text classification.
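A possible sketch of these two steps with scikit-learn is shown below: chi-squared feature selection keeps the k terms most associated with the labels, and truncated SVD then maps them to a lower-dimensional space. The choice of k, the number of components, and the toy documents and labels are illustrative assumptions.

```python
# Feature selection (chi-squared) followed by feature transformation (truncated SVD).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD

docs = ["no food in the village", "maize prices rising", "harvest was good",
        "children are hungry", "market full of fresh produce"]
labels = np.array([1, 1, 0, 1, 0])        # 1 = food-insecurity indicator (toy labels)

X = TfidfVectorizer().fit_transform(docs)

# Feature selection: keep only the k terms most associated with the labels.
X_selected = SelectKBest(chi2, k=5).fit_transform(X, labels)

# Feature transformation: project the selected features to a lower dimension.
X_reduced = TruncatedSVD(n_components=2).fit_transform(X_selected)
print(X_reduced.shape)                    # (5 documents, 2 transformed features)
```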
2.6. Machine Learning
Several text classifiers, such as naive Bayes, neural networks, k-Nearest Neighbors, and Support Vector Machines (SVM), have been proposed
for text extraction (Ikonomakis et al., 2005; Pang et al., 2002). The naive Bayes classifier takes a probabilistic approach. It assumes that all terms
occur independently of each other (Bijalwan et al., 2014), which is known as conditional independence. While this assumption seems unrealistic,
naive Bayes is competitive with more elaborate methods (Bijalwan et al., 2014). The conditional independence assumption can nevertheless
jeopardize the classifier's performance: naive Bayes degrades for more complex classification problems where features are usually correlated
(Singh & Chauhan, 2012).
k-Nearest Neighbors is a nonparametric classifier. It operates on the assumption that similar or close attributes (features) yield similar labels or
classes. It uses the minimum distance between the query instance and the training dataset to determine the k nearest neighbors; instances
similar or close to the example whose label is to be determined are given higher weight. Empirical evidence shows that k-Nearest Neighbors
produces comparably accurate classifiers. However, it suffers from high time complexity resulting from this neighbor search (Bijalwan et al.,
2014), which makes it resource hungry, and it is also susceptible to degradation in the presence of irrelevant parameters (Singh & Chauhan, 2012).
The Support Vector Machine is suitable for datasets that are separable. It is an optimization technique that does not only provide a separating
plane or line but also strives to find the best separation, achieved by finding the margin that is as large as possible (Amara et al., 2014). In real
life, the separation is not always clean; this is overcome by transforming the data into another dimension in which separation is possible.
Support Vector Machines have played an instrumental role in text classification.
Neural Networks mimic the operation of the human brain. The brain, and the nervous system as a whole, is a complex network of interconnected
neurons: there are approximately 10^11 neurons, each connected, on average, to 10^4 others (Mitchell, 1997). Electrical signals trigger a section
of interconnected neurons, and the resulting response is forwarded to the brain for interpretation so that an appropriate action is taken. While
Neural Networks are known for their good performance on complex phenomena, users find it hard to comprehend the resulting models.
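The sketch below shows how the four reviewed classifiers could be compared on the same vectorized text data with scikit-learn. The toy documents, labels and classifier settings are placeholders for a real tagged message sample, not results from the study.

```python
# Comparing the four reviewed classifiers on the same bag-of-words features.
# The documents and labels are toy placeholders for a real tagged message sample.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

docs = ["no food in the village", "maize prices rising fast", "harvest was good",
        "children are hungry", "market full of fresh produce", "we need food aid"]
labels = [1, 1, 0, 1, 0, 1]      # 1 = food-insecurity indicator

X = TfidfVectorizer().fit_transform(docs)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "Support Vector Machine": LinearSVC(),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000),
}

for name, clf in classifiers.items():
    clf.fit(X, labels)
    print(name, clf.predict(X))   # training-set predictions, just to show the API
```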
2.7. Evaluating Machine Learning Techniques
During performance evaluation, the trained algorithm (classifier) is assessed to establish its viability. It is recommended to train the classifier on
two-thirds of the data and reserve the rest for evaluation (Kotsiantis, 2007). Several assessment parameters have been proposed by previous
researchers. The first is accuracy, which is computed from four outcome counts of a classifier: TP, FP, TN and FN. TP stands for true positives
and counts items correctly categorized as belonging to the class of interest. FP stands for false positives and counts items falsely categorized as
belonging to the class when they do not. TN stands for true negatives and counts items correctly identified as nonmembers of the class. FN
stands for false negatives and counts items incorrectly identified as nonmembers when they actually belong to the class. According to
Ikonomakis et al. (2005), accuracy (C) is computed as follows:
C = (TP + TN) / (TP + FP + TN + FN)   (1)
This measure of performance is not reliable if the numbers of members and nonmembers are unbalanced. If nonmembers far outnumber
members, accuracy can be high even when the classifier is unable to identify any member correctly, which can be misleading. Recall (R) and
precision (P) have been proposed in response to this challenge. These two measures are computed as follows (Ikonomakis et al., 2005):
R = TP / (TP + FN)   (2)
P = TP / (TP + FP)   (3)
For any research there is a need to strike a balance between these two. One way of capturing this trade-off is the F-measure (F), computed as
follows (Ikonomakis et al., 2005):
F = 2PR / (P + R)   (4)
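A short sketch of equations (1)-(4) computed from the four outcome counts is given below; the example counts are invented simply to show the misleading effect of class imbalance on accuracy described above.

```python
# Accuracy, recall, precision and F-measure from the four outcome counts
# (equations (1)-(4)); the counts below are invented for illustration.

def evaluate(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)             # equation (1)
    recall = tp / (tp + fn) if tp + fn else 0.0            # equation (2)
    precision = tp / (tp + fp) if tp + fp else 0.0         # equation (3)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)           # equation (4)
    return accuracy, recall, precision, f_measure

# Imbalanced example: 990 nonmembers, 10 members, classifier finds only 2 members.
print(evaluate(tp=2, fp=5, tn=985, fn=8))
# Accuracy is ~0.99 even though recall is only 0.2 -- the imbalance problem above.
```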
3. THE PROPOSED STRATEGY
This section describes the proposed strategy for harnessing archives of phone messages and telephone conversations in order to deduce any
possible food vulnerability. Telecommunication companies are the potential sources of this archived data, which can be in textual or verbal
format. For telecommunication companies, this data is kept for future reference; for example, if a user deletes a message, that message can be
recovered from the telecommunication company. Below is a step-by-step process for harnessing this archived data to reveal a pending food crisis.
The study adapts the general approach for extracting patterns from textual data described by Ikonomakis et al. (2005). The general approach is
the portion of the illustration that is not in grey; it was reviewed previously (see subsection: Extracting Patterns from Textual Data). Two
modifications have been made to the general approach, as described below:
• Modification One: The general approach assumes that data is in textual format. To satisfy this requirement, all the audio recordings are
converted into textual data; tools exist to do this. The textual data from audio recordings is then aggregated with the phone messages, so all
the data is in the required (textual) format. This modification is shown in the grey portion to the left of the illustration;
• Modification Two: The general approach gives limited attention to the algorithm stage. To close this gap, a second modification is
suggested: a description providing more details on the algorithm stage, which has been integrated into the description of the proposed
approach and is covered towards the end. It is shown in the grey portion at the bottom of the illustration.
3.1. A Step-by-Step Description
After modification one, all the data has been aggregated and is in textual format. This data is read and retrieved as shown in Figure 4.
Tokenization then follows, which converts the data into entities of individual words; symbols such as punctuation marks that are not relevant
for text analysis are removed at this stage. Stemming, as described previously (see subsection: Extracting Patterns from Textual Data), reduces
a group of words to their stems; for example, the words food and foods are reduced to food. The next step is the removal of stop words such as
auxiliary words, conjunctions, and articles, which are considered useless for pattern extraction (Ikonomakis et al., 2005). Stemming and removal
of stop words are aimed at trimming the available data for improved performance. Vector representation packages the data into vector (matrix)
format; data, especially in high dimensions, is easier to manipulate in vector format using linear algebra techniques. The next stage (feature
selection) deals with identifying the features needed to tag a message as an indicator of food crisis or not.
Identifying features or key words that signal food insecurity is challenging. The individual messages in the aggregated data are studied to
establish which key words indicate food insecurity. The key words found form a set known as a bag of words. Algorithms check for the existence
of these words in order to classify a message as an indicator of food insecurity or not (a minimal sketch of this tagging idea is given after
Figure 4). Feature transformation is a technique for reducing dimensionality; it is more of an optimization technique, and in an explorative study
such as this it is reserved for tuning purposes. Selecting a machine learning classifier is also an important stage. Four classifiers (Naïve Bayes,
k-Nearest Neighbors, Support Vector Machines, and neural networks) were reviewed previously; see subsection Machine Learning for a recap.
This being an exploratory study, it is recommended to apply one classifier at a time. Each classifier is then tuned towards the desired
performance, and the classifier that performs best on the dataset is selected based on the performance criteria. See subsection Classifier
Evaluation for details on tuning and computing the evaluation parameters.
Figure 4. Steps for extracting food crisis patterns, adapted from Ikonomakis et al. (2005)
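A minimal sketch of the bag-of-words tagging idea mentioned above is given below: a message is flagged as a potential food-insecurity indicator if it contains any stemmed keyword from the bag. The keyword list is a tiny invented example, not the corpus the study would build from the archive.

```python
# Tagging messages with a bag of food-insecurity keywords (keywords are invented).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A tiny illustrative bag of words; the real set would be built from the archive.
bag_of_words = {stemmer.stem(w) for w in
                ["hunger", "famine", "drought", "starving", "aid", "harvest"]}

def is_food_insecurity_indicator(message):
    stems = {stemmer.stem(t) for t in message.lower().split()}
    return bool(stems & bag_of_words)

print(is_food_insecurity_indicator("The drought destroyed our harvest"))   # True
print(is_food_insecurity_indicator("See you at the market tomorrow"))      # False
```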
3.2. Selection of Tools
Candidate tools include Matlab, Octave, R and Python. They provide an environment in which to execute Machine Learning techniques such as
those reviewed above (see subsection Machine Learning for details). Python deserves high priority as an experimentation environment. It
supports functionality for extracting patterns not only from text data but also from other formats such as numerical and graphical data. It is also
freely available, which makes it appropriate for research purposes such as this.
3.3. Classifier Evaluation
Using the bag of words identified, a sample of archived messages is manually tagged as indicating food crisis or not. This tagged sample of
messages is split into two parts: one reserved for training the algorithm and the other for evaluating the classifier's performance. It is
recommended to evaluate the classifier on two parameters, precision (P) and recall (R); the computations for these are described in subsection:
Evaluating Machine Learning Techniques (see equation (2) and equation (3)). If the classifier yields satisfactory performance, it is retained for
pattern extraction. If the performance is unsatisfactory, three options are suggested for tuning: (1) adjusting the selected features; (2) selecting
another classifier; and (3) increasing the volume of the datasets.
3.4. Deducing Food Crisis
The classifier obtained in the above stage can now be applied to the aggregated data (both phone messages and text converted from recordings
of telephone conversations). Periodic trends showing the normal proportion of messages tagged as indicating food insecurity can be established
from the aggregated data. If the trends deviate from this normality, it is an indication of an abnormality in the food supply chain. If these
patterns cover a long period, a normal distribution can also be fitted and abnormal changes relative to it can be detected; for example, messages
labeled as indicating food insecurity can increase, which can signal a pending food crisis. Further investigation can then be carried out to
ascertain the cause of the abnormality. If the cause is linked to food insecurity, stakeholders can begin to take mitigation actions before it is too
late.
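One possible way to operationalize this deviation check is sketched below: weekly counts of messages tagged as food-insecurity indicators are compared against a baseline mean and standard deviation, and weeks that exceed an assumed 2-sigma threshold are flagged for further investigation. The counts and the threshold are illustrative assumptions.

```python
# Flag weeks whose count of food-insecurity-tagged messages deviates from normal.
import statistics

# Weekly counts of tagged messages during a 'normal' baseline period (invented).
baseline_counts = [12, 15, 11, 14, 13, 12, 16, 14]
mean = statistics.mean(baseline_counts)
std = statistics.stdev(baseline_counts)

def assess_week(count, threshold_sigmas=2.0):       # 2-sigma threshold is assumed
    deviation = (count - mean) / std
    if deviation > threshold_sigmas:
        return "possible food crisis signal -- investigate further"
    return "within normal range"

for week, count in [("week 1", 14), ("week 2", 31)]:
    print(week, count, "->", assess_week(count))
```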
4. DISCUSSION
The previous section described the proposed strategy. This strategy differs in some respects from other strategies that address the same
challenge, and it therefore has its own strengths and weaknesses; at the same time, it bears some similarities to its counterparts. This section
addresses these issues.
4.1. Uniqueness and Strengths of the Proposed Approach
The proposed strategy harnesses mobile communication data (phone messages and telephone conversations) for the prediction of food crises.
Previous approaches have focused on utilizing other datasets such as price trends, biophysical properties, precipitation patterns, economic
trends, household information and the refractive index of melting ice. Little attention, if any, has been given to this communication data as
proposed here; this research reveals the potential of this dataset for the prediction of food crises. Secondly, the communication data is linked to
mobile technology. The increasing penetration of, and accessibility to, mobile technology in society today is a potential mechanism for collecting
this data in large amounts from users, and moreover continuously in real time. Other datasets, such as household information, are expensive and
time-consuming to collect. For real-time decisions and more accurate results, a substantial amount of data and real-time data are a necessity,
and this communication data can potentially satisfy both. Another aspect to consider is added value. Service providers collect this
communication data for other primary purposes: it acts as a backup for future use (for example, if a client accidentally deletes a message, a
backup copy can be requested from the service provider), and service providers also use it to monitor the communication patterns of their
clients. This study is an attempt to provide additional value for this communication data beyond its primary purpose.
4.2. Anticipated Challenges
In many countries, several languages are used as a medium of communication, so this communication data can equally be in many languages.
In Uganda these archives can be in any of the over thirty languages spoken in the country. Capturing features that indicate food insecurity
amidst these several languages is a challenge. To address it, it is recommended to base the study on the dominantly used languages; this
simplification can unfortunately affect prediction performance. In the context of Uganda, the study can focus on English and Luganda, which are
the most widely spoken. Privacy is another challenge. Subscribers expect privacy in their communications, and using this communication data
for investigations like this guarantees no privacy; this is risky if the archives fall into the wrong hands, such as fraudsters. In the medical field
this challenge is overcome by trimming off patients' identities, and the same technique can be applied here: sender and receiver identities can
be removed from the message records. Unfortunately, the message bodies of this archived data can also contain traces of sender identities, such
as names commonly appended at the end. To address this, a brute-force technique can be used to remove these identities based on a developed
corpus; success depends on the strength of the corpus. The need to use this communication data with consent from the subscribers is also a
challenge. To address it, an appropriate policy should be put in place to allow data collected on citizens to be used for the common good.
Without this guidance, those using the data are open to charges of violating personal privacy. The concept of Open Data is making access to
public data easier; in Uganda, Data.ug is making data publicly available for researchers, so the privacy challenge is becoming less of a threat to
researchers.
5. CONCLUSION AND FUTURE WORK
Alternative methods for forecasting future phenomena are in demand. This is important especially given current trends of increasing scarcity of
resources such as water and food. In this study a strategy to harness phone messages and telephone conversations as potential big datasets has
been described. This data can be acquired in real time, and the expense involved in acquiring it is comparatively low; benefits such as these
make the proposed approach attractive. Web data can also be explored for studying disasters that threaten the social welfare of a nation or any
society. This research provides the theoretical setup for how food crises can be predicted using phone archive data. Future work aims at
exploring the proposed approach experimentally.
This work was previously published in the International Journal of System Dynamics Applications (IJSDA), 4(4); edited by Ahmad Taher
Azar, pages 1-16, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
REFERENCES
Amara, M., Zidi, K., Zidi, S., & Ghedira, K. (2014). Arabic Character Recognition Based M-SVM: Review. In Hassanien AE, Tolba M, Azar AT
(Eds) Advanced Machine Learning Technologies and Applications:Second International Conference, AMLTA 2014 (pp. 18-25). Cham
Heidelberg New York Dordrecht London: Springer
Bijalwan, V., Kumar, V., Kumari, P., & Pascual, P. (2014). KNN based Machine Learning Approach for Text and Document Mining. International
Journal of Database Theory and Application, 7(1), 61–70. Retrieved February 28, 2015, from http://arxiv.org/ftp/arxiv/papers/1406/1406.1580.pdf.
doi:10.14257/ijdta.2014.7.1.06
Braun, J. V. (1991). A Policy Agenda for Famine Prevention in Africa. Food Policy Report. Washington, D.C.: International Food Policy Research
Institute. Retrieved August 3, 2014, from http://www.ifpri.org/sites/default/files/publications/pr1.pdf
Brown, M. E. (2008). Famine Early Warning Systems and Remote Sensing Data . Berlin, Heidelberg: Springer-Verlag.
Brown, M. E., Pinzon, J. E., & Prince, S. D. (2008). Using Satellite Remote Sensing Data in a Spatially Explicit Price Model: Vegetation Dynamics
and Millet Prices. Land Economics , 84(2), 340–357. http://onlinelibrary.wiley.com/doi/10.1111/j.1749-8198.2009.00244.x/pdf Retrieved
August 3, 2014
Committee on World Food Security. (2013). Global Strategic Framework for Food Security & Nutrition. Retrieved March 28, 2015 from
http://www.fao.org/fileadmin/templates/cfs/Docs1213/gsf/GSF_Version_2_EN.pdf
Domingos, P. (2012). A Few Useful Things to Know About Machine Learning. Communications of the ACM, 55(10), 78–87. Retrieved February
28, 2015, from http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf. doi:10.1145/2347736.2347755
Faaij, A. (2008). Bioenergy and global food security. Berlin. Retrieved August 1, 2014, from http://www.wbgu.de/wbgu jg2008 ex03.pdf
Food and Agricultural Organization, International Fund for Agricultural Development, and World Food Programme. (2013). The State of Food
Insecurity in the World. Retrieved August 18, 2014, from http://www.fao.org/docrep/018/i3434e/i3434e.pdf
Food and Agricultural Organization, International Fund for Agricultural Development, and World Food Programme. (2014). The State of Food
Insecurity in the World. Retrieved March 28, 2015, from http://www.fao.org/3/a-i4030e.pdf
Ikonomakis, M., Kotsiantis, S., & Tampakas, V. (2005). Text Classification Using Machine Learning Techniques. Wseas Transactions on
Computers, 8(4), 966- 974. Retrieved September 6, 2014, from http://www.infoautoclassification.org/public/articles/Ikonomakis-et.-al._Text-
Classification-Using-Machine-Learning-Techniques.pdf
International Fund for Agricultural Development. (2010). Rural Poverty Report 2011. Retrieved March 28, 2014, from
http://www.ifad.org/rpr2011/report/e/rpr2011.pdf
Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of Classification Techniques. Informatica, 31, 249 -268. Retrieved October 10,
2014 from http://www.informatica.si/PDF/31-3/11_Kotsiantis%20-%20Supervised%20Machine%20Learning%20-%20A%20Review%20of...pdf
Li, T., & Li, M. (2011). An Investigation and Analysis of Information Overload in Manager's Work. iBusiness, 2011(3), 49-52.
doi:10.4236/ib.2011.31008
Makowski, D., & Michel, L. (2013). Use of dynamic linear models for predicting crop yield trends in foresight studies on food security. Sixth
International Conference on Agricultural Statistics. Thiverval-Grignon, France.
Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill. Retrieved December 5, 2014, from
http://personal.disco.unimib.it/Vanneschi/McGrawHill_-_Machine_Learning_-Tom_Mitchell.pdf
Okori, W., & Obua, J. (2011). Machine Learning Classification Technique for Famine Prediction. Proceedings of the World Congress on
Engineering, 2, pp. 991-996. London, U.K. Retrieved August 3, 2014, from http://www.iaeng.org/publication/WCE2011/WCE2011_pp991-
996.pdf
Okori, W., Obua, J., & Baryamureema, V. (2009). Famine Disaster Causes and Management Based on Local Community’s Perception in Northern
Uganda. Journal of Social Sciences , 4(2), 21–32. http://docs.mak.ac.ug/sites/default/files/21-32.pdf Retrieved August 3, 2014
Pokorny, J., Skoda, P., Zelinka, I., Bednarek, D., Zavoral, F., Krulis, M., & Saloun, P. (2015). Big Data Movement: A Challenge in Data Processing
. In Hassanien, A. E., Azar, A. T., Snasel, V., Kacprzyk, J., & Abawajy, J. H. (Eds.), Big Data in Complex Systems: Challenges and Opportunities,
Studies in Big Data (Vol. 9, pp. 29–69). Cham, Heidelberg, New York, Dordrecht, London: Springer. doi:10.1007/978-3-319-11056-1_2
Poppy, G. M., Chiotha, S., Eigenbrod, F., Harvey, C. A., Honza, M., & Hudson, M. D. (2014) Food security in a perfect storm: using the ecosystem
services framework to increase. Phil. Trans. R. Soc. B, 369. Retrieved August 20, 2014 from
http://rstb.royalsocietypublishing.org/content/369/1639/20120288.full.pdf
Sajda, P. (2006). Machine Learning for Detection and Diagnosis of Disease. Annual Review of Biomedical Engineering, 8, 8.1-8.29. Retrieved
August, 3, 2014 from http://liinc.bme.columbia.edu/publications/SajdaAnnRevBME.pdf
Singh, M., & Chauhan, B. (2012). Classfication: A holistic View. ISSN 3(1), 69-72 (2012). Retrieved November 5, 2014 from
http://www.csjournals.com/IJCSC/PDF3-2/Article_13.pdf
United Nations Development Programme (UNDP). (2007). Famine in Malawi:Causes and Consequences. Human Development Report
2007/2008. Retrieved December 28, 2015 from http://hdr.undp.org/sites/default/files/menon_roshni_2007a_malawi.pdf
United Nations, Office for the Coordination of Humanitarian Affairs (OCHA). (2011). Somalia. Famine & Drought Situation Report No.19. New
York. Retrieved August 3, 2014, from
http://reliefweb.int/sites/reliefweb.int/files/resources/OCHA%20Somalia%20Situation%20Report%20No.%2019_2011.10.25.pdf
United States Committee for Human Rights in North Korea. (2005). Hunger and Human Rights: The Politics of Famine in North Korea.1101 15th
Street, NW, Suite 800, Washington, DC USA. Retrieved August 3, 2014 from http://www.asiapress.org/rimjingang/english/archive/pdf/2012-
2013_Hwanghae_Report_ASIAPRESS_North_Korea_Reporting_Team_V004.pdf
Vashist, R. (2015). Cloud Computing Infrastructure for Massive Data: A Gigantic Task . In Hassanien, A. E., Azar, A. T., Snasel, V., Kacprzyk, J.,
& Abawajy, J. H. (Eds.), Big Data in Complex Systems: Challenges and Opportunities, Studies in Big Data (Vol. 9, pp. 1–29). Cham, Heidelberg,
New York, Dordrecht, London: Springer. doi:10.1007/978-3-319-11056-1_1
Wajeed, M. A., & Adilakshmi, T. (2009). Text Classification Using Machine Learning. Theoretical and Applied Information Technology, 7(2), 119-
123. Retrieved November, 10, 2014 from http://www.jatit.org/volumes/research-papers/Vol7No2/4Vol7No2.pdf
Zawbaa, H. Z., Abbass, M., Hazman, M., & Hassenian, A. E. (2014). Automatic Fruit Image Recognition System on Shape and Color Features. In
Hassanien AE, Tolba M, Azar AT (Eds)Advanced Machine Learning Technologies and Applications:Second International Conference, AMLTA
2014 (pp. 18-25). Cham Heidelberg New York Dordrecht London: Springer.
CHAPTER 61
Towards Security Issues and Solutions in Cognitive Radio Networks
Saed Alrabaee
Concordia University, Canada
Mahmoud Khasawneh
Concordia University, Canada
Anjali Agarwal
Concordia University, Canada
ABSTRACT
Cognitive radio technology is the vision of pervasive wireless communications that improves spectrum utilization and offers many social and individual benefits. The objective of cognitive radio network technology is to use the spectrum left unutilized by primary users and to fulfill the secondary users' demands irrespective of time and location (any time and any place). Due to their flexibility, Cognitive Radio Networks (CRNs) are vulnerable to numerous threats and security problems that affect the performance of the network, yet little attention has been given to security aspects in cognitive radio networks. In this chapter, the authors discuss the security issues in cognitive radio networks, present an extensive list of the main known security threats in CRN at various layers together with the adverse effects such threats have on performance, and survey the existing paradigms for mitigating these issues and threats. Finally, the authors highlight proposed directions for making CRN more authenticated, reliable, and secure.
1. INTRODUCTION
Nowadays, there is an unexpected explosion in the demand for wireless network resources, due to the intense increase in the number of emerging services. For wireless computer networks, limited bandwidth, together with the transmission quality requirements of users, makes quality of service (QoS) provisioning a very challenging problem, as highlighted by Bhargava et al. (2007). To overcome the spectrum scarcity problem, the Federal Communications Commission (FCC) has already started working on the concept of spectrum sharing, where unlicensed users (also known as secondary users or SUs) can share the spectrum with licensed users (also known as primary users or PUs), provided they respect the PUs' right to use the spectrum exclusively. The underutilization of the allocated spectrum has also been reported by the Spectrum Policy Task Force appointed by the Federal Communications Commission (FCC) (2002).
In spectrum sharing based CR networks, secondary users (SUs) coexist with a primary user (licensed) system. A fundamental challenge is how to
serve SUs while ensuring the quality of service (QoS) of the primary user (PU).
Cognitive radio networks (CRNs) are smart networks that automatically sense the channel and adjust the network parameters accordingly. In CRN, the unlicensed user (SU) has the possibility of using large amounts of unused spectrum in an efficient way while reducing interference with the licensed users (PUs). The key technology that enables the SUs to sense and utilize the spectrum is cognitive radio itself: an emerging technology that enables the dynamic deployment of highly adaptive radios built upon software defined radio (SDR) technology, as mentioned in Dhar et al. (2006) and Qusay et al. (2007). Moreover, it allows unlicensed operation in the licensed bands. Table 1 summarizes the cognitive radio network research tasks conducted from 2006 to 2014.
Table 1. Cognitive radio network research history
In CRN, security threats are much more complex and the possibility of an attack is higher than in other networks, since the network nodes are much more intelligent by design. Hence, security measures and policies should be developed to reduce the opportunity for malicious nodes to attack the CR network. Here we consider different scenarios of the attacks that target the different layers of the protocol stack, in order to propose new models to detect and mitigate these attacks.
2. BACKGROUND
In this section, we discuss the security properties and review the literature on CRN security.
2.1 An Overview of Cognitive Radio Networks
The principle of Cognitive Radio was first introduced and explained by Mitola et al. (1999). Cognitive Radio can be defined as an efficient technology that allows more users to use the available spectrum. Spectrum sensing is the basic functionality of CR; it aims to find the vacant spectrum holes for dynamic use. In general, there are two sensing modes, reactive sensing and proactive sensing, as mentioned by Jin et al. (2009). The spectrum sensing techniques can generally be categorized as transmitter detection, cooperative detection, and interference-based detection, as highlighted by Mitola et al. (1999). In transmitter detection, the presence of the PU transmitter in its spectrum band is determined. Three schemes generally used for transmitter detection are matched filter detection, energy detection, and cyclostationary feature detection. Matched filter detection is used if the secondary user has information about the primary user's signal; if there is not enough information about the PU's signal, energy detection is applied; and cyclostationary feature detection exploits the periodic features that modulated signals are coupled with. In the cooperative detection technique, cooperation between the SUs is applied in order to improve the sensing results. The last technique, interference-based detection, was introduced by the FCC (2001); here the interference temperature is measured and compared with statistical information to make the decision about the PU's presence in its spectrum band. Different schemes, such as Zeng et al. (2010), represent the spectrum sensing functionality, which can be classified as follows (a minimal sketch of energy detection with cooperative fusion is given after this list):
• Centralized cooperative scheme: A controller coordinates cooperation between the SUs to sense the spectrum holes; each SU individually senses the spectrum holes and sends its sensing information to the controller, which makes the final decision about the spectrum.
• Centralized non-cooperative scheme: The controller itself senses the spectrum holes and manages access to the holes for the different SUs.
• Distributed cooperative scheme: There is no controller in this approach; each SU senses the spectrum holes and then distributes its spectrum sensing information to the other SUs.
• Distributed non-cooperative scheme: As in the previous scheme there is no controller, but each SU senses the spectrum holes and then decides which spectrum hole to use without considering the other SUs' sensing information.
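To make the interplay between local sensing and centralized fusion concrete, the following Python sketch combines simple energy detection at each SU with OR-rule fusion at a controller, as in the centralized cooperative scheme above. The function names, threshold, and signal parameters are illustrative assumptions, not taken from Zeng et al. (2010) or any other cited scheme.

```python
import numpy as np

def energy_detect(samples, threshold):
    """Return True if the average energy of the received samples exceeds the threshold."""
    energy = np.mean(np.abs(samples) ** 2)
    return energy > threshold

def controller_decision(local_decisions, rule="OR"):
    """Fuse per-SU decisions at the controller (centralized cooperative scheme)."""
    if rule == "OR":   # declare the PU present if any SU reports it
        return any(local_decisions)
    if rule == "AND":  # declare the PU present only if every SU reports it
        return all(local_decisions)
    raise ValueError("unknown fusion rule")

# Three SUs sense the same band and report to the controller.
rng = np.random.default_rng(0)
noise_power = 1.0
pu_signal = 0.8 * rng.standard_normal(1000)  # hypothetical PU waveform
reports = []
for _ in range(3):
    received = pu_signal + np.sqrt(noise_power) * rng.standard_normal(1000)
    reports.append(energy_detect(received, threshold=1.2 * noise_power))
print("PU present?", controller_decision(reports))
```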
Spectrum management is another functionality of CR, as introduced by Faulhaber et al. (2002). The objective of spectrum management is to share the spectrum between the many users, PUs and SUs, in such a way that it accomplishes their different goals and requirements. The main objective of the SU is to attain its QoS; factors that represent the QoS of an SU include using a high data rate for sending its data, using proper power values in the transmission process, and reducing the interference caused to other users in the network. The PU, in turn, tries to lease its unused frequency channels to the SUs that pay the most, which results in higher revenue.
Three different models used to represent the spectrum sharing functionality in cognitive radio networks are as follows:
• Public Commons Model: The radio spectrum is open to anyone for access with equal rights; this model currently applies to the wireless
standards (e.g., WiFi and Bluetooth radio) operating in the license-free ISM (Industrial, Scientific, and Medical) band.
• Exclusive Usage Model: The radio spectrum can be exclusively licensed to a particular user; however, spectrum utilization can be
improved by allowing dynamic allocation and spectrum trading by the spectrum owner.
• Private Commons Model: Different users in a cognitive radio network (e.g., primary, secondary, tertiary, and quaternary users) can
have different priorities to share the spectrum. Secondary users can access the spectrum using an underlay or overlay approach.
2.2 Security Properties
2.2.1 Authentication
Ngo et al. (2010) defined authentication as the verification of the claimed identity of a principal. It is a primary security property, since other properties often rely on authentication having occurred. Authentication is sometimes taken to be of two types:
• Message authentication: Ensuring that a received message matches the message that was sent; sometimes it also provides proof of the identity of the message's creator.
• Entity authentication: Verifying the claimed identity of the communicating party itself.
2.2.2 Authorization
The authorization property stipulates which principal has access to what resource or operation. It distinguishes between legal and illegal
accesses. Legal principals are granted authorization to the resource/operation in question while illegal ones are denied access to the resource or
operation as introduced by Eronen et al. (2004).
2.2.3 Integrity
Integrity is the property of ensuring that information will not be accidentally or maliciously altered or destroyed. It means that data is transmitted from source to destination without alteration, as highlighted by Balasubramanyn et al. (2007); only the sender can alter the message data without detection. Integrity protects against unauthorized creation, alteration, or destruction of data. If it were possible for a corrupted message to be accepted, this would show up as a violation of the integrity property, as noted by Jung et al. (2013).
2.2.4 Non-Repudiation
Non-repudiation is defined as the impossibility for one of the entities involved in a communication to deny having participated in all or part of the communication. It provides protection against false denial of having been involved in the communication. The general goal of non-repudiation is to collect, maintain, make available, and validate irrefutable evidence concerning a claimed event or action in order to resolve disputes about the occurrence or non-occurrence of that event or action, as introduced by Tandel et al. (2010).
2.2.5 Fairness
In fair protocols, agents require protection from each other rather than from an external hostile agent. In electronic contract signing, for instance, we want to avoid one of the participants being able to gain some advantage over another by halting the protocol part-way through. Bob could, for example, refuse to continue after Alice has signed, but before he has signed. Some efficient fair protocols are conceived to run between two agents and occasionally call upon a trusted third agent in case of disputes, as introduced in Tychogiorgos et al. (2012).
2.2.6 Availability
This property deals with the availability of certain resources manipulated by the protocol. For instance, for a key-exchange protocol, we want to be confident that a session will indeed be established. Generally, to verify the availability property in cryptographic protocols, we have to restrict the capabilities of the intruder; in particular, we cannot allow the intruder an unlimited ability to kill messages, as introduced by Maple et al. (2007).
2.2.7 Secrecy
Secrecy protects against unauthorized disclosure of information; it is the property of keeping some information confidential. Secrecy allows the intended receiver in a communication session to know what was sent, while unintended parties cannot determine what was sent. We say that a protocol preserves the secrecy of one of its parameters if it does not leak any information about this parameter during the execution of the protocol. The parameters for which we want secrecy are often cryptographic keys, but broadly speaking they can be any sensitive data. Encryption is, in general, the mechanism used to ensure secrecy, as highlighted in Perron et al. (2009).
2.2.8 Discussion
The most important security properties in CRN are availability (the spectrum/channel should be returned to the primary user when it is active), reliability of transmitting sensing results for the SU, non-repudiation (agreement between the PU and the SU), authentication (to assure the credibility of the CR users), and stability, which is defined by Butt et al. (2013) as the ability to come back to an equilibrium state after being hindered by a physical disturbance. Nodes may join or leave a cognitive radio network, so radio networks require mechanisms to authenticate, authorize, and protect the information flows of participants. The issues to be addressed in security are as follows (Steenkiste et al. (2009), Alrabaee et al. (2012), Alrabaees et al. (2012)):
• The effects of increasingly software-based radio implementations, since software flaws are known to be a major security problem in the Internet today.
• The guarantee that CRs operate as proposed and designed, and whether there is a trusted cognitive radio architecture that can address some of these security concerns.
Table 2. Summary of security properties in terms of CRN
According to infographic statistics published at the end of 2013, there are more mobile devices on earth than people. These devices include regular cell phones, smart phones, tablets, and laptops, and this large number of devices generates a huge amount of traffic over different types of networks. Therefore, CR is considered the best solution to avoid traffic congestion and provide the required quality of service to the different users. Cognitive radio networks differ from other wireless networks in that some reliability issues are unique to CRN, such as high sensitivity to weak primary signals, unknown primary receiver location, the tight synchronization requirement in centralized cognitive networks, and the lack of a common control channel, as in Mathur et al. (2007). As noted in the introduction, security threats in CR networks are more complex and the possibility of an attack is higher than in other networks because the network nodes are much more intelligent by design, so security measures and policies should be developed to reduce the opportunity for malicious nodes to attack the CR network.
The most important behaviours of attackers can be categorized as follows: (i) misbehaving: acting wrongly by providing false information about its own identity and other nodes' identities in order to fully utilize the spectrum; (ii) selfish: trying to allocate the network resources for its own use and denying other nodes the use of these resources; (iii) cheating: an attacker lies to other nodes about its identity, luring them into believing that it is the best path to all other nodes so that they use it for forwarding their packets; and (iv) malicious: a node behaves maliciously to affect the QoS of other nodes as well as stealing their confidential information for its own benefit, as introduced in Weifang et al. (2010). These behaviours clearly have a significant impact on network performance.
The attacks generally follow a layered approach, as in El-Hajj et al. (2011). Attacks such as Primary User Emulation (PUE), as proposed in Jin et al. (2012), Jamming, as in Xu et al. (2005), and the Objective Function attack, as in Leon et al. (2010), occur at the Physical Layer. Attacks such as Spectrum Sensing Data Falsification (SSDF), as in Chen et al. (2008), Control Channel Saturation DoS, as introduced in Bian et al. (2011) and Zhu et al. (2008), and Selfish Channel Negotiation, as in Zhu et al. (2008), occur at the Link Layer. Attacks such as Sinkhole and HELLO Flood, as in Karlof et al. (2003), and the Wormhole and Sybil attacks, as in Reddy et al. (2013), occur at the Network Layer. Attacks such as the Lion Attack, as in Leon et al. (2009) and Hernandez-Serrano et al. (2011), and the Key Depletion Attack, as in Zhang et al. (2009), occur at the Transport Layer. Some of these attacks, such as the Jamming Attack, may target multiple layers.
One of the oldest and most widely used attack strategies is to reduce the received SNR below the required threshold by transmitting noise over the received channel (Xu et al. (2005), and Altman et al. (2011)). In another type of security threat, an eavesdropper might gain access to the content of data exchanged over wireless links, such as in CRNs as in Liang et al. (2009), and then exploit this information against the end users or the network. In addition, the ability of cognitive nodes to alter their transmission specification can pose security threats when a malicious node takes advantage of this flexibility for its own benefit. The attacks on the various layers are summarized in Table 3.
Table 3. Summary of security attacks on various layers of CRN
3. DETECTION AND MITIGATION OF SECURITY THREATS IN CRN
The first step in utilizing the unused spectrum is the spectrum sensing process, as mentioned above, which is considered as cataleptic context for
malicious nodes to arise and attack the CRN. Security comprises two issues in PUs signals’ detection, which are misdetection and false detection.
False detection means that a SU records that a PU is present in its band while in real it is not where a malicious node alleges as a PU and sends
strong signal to SUs. Misdetection issue is the opposite of the false detection issue. The previous mentioned issues are one example of some
security issues that can arise and make CRN more challenging solution. Stronger security mechanisms should avoid the harmful effects of the
different attacks such as overhearing other users’ information, interfering with other users’ transmission signals, degrading the quality of service
of licensed users, and therefore increasing the spectrum scarcity problem intended to be solved by CR technology. In order to develop new
approaches to detect and mitigate the security attack in CRN, security issues especially related to CRN should be studied. First issue is to define
the different types of security attacks, second is to implicate security issue through the implementation of cognitive radio, and third is to design
cognitive radio nodes to be considered as trusted nodes by applying the security concepts.
In CRN, many challenges can be faced in proposing new approaches and models to detect and mitigate the different security attacks.
The main application of secured CRN will be in the TV white spaces, which were the main motivation for the CR concept. Another application is in vehicular communication: cognitive radio technology could be applied in vehicles such as cars, trucks, and buses, so that CR devices can alert drivers about traffic conditions and the status of roads, allowing them to take precautions ahead of time. Mobile networks are another rich context where CR technology can be applied. Mobile service providers would like this technology to be deployed as soon as possible, since it could help them greatly increase their number of subscribers and hence their profits; however, the possibility of different users accessing the same spectrum at the same time should be taken into consideration before launching CR technology in such networks, because it might cause problems in the networks and bring the servers down. The threats in CRN can be categorized according to the layers that they target: physical layer attacks, data link layer attacks, network layer attacks, and transport layer attacks.
The following sections illustrate the different attacks, and their detection methods and mitigation techniques in detail.
3.1 Security Threats Detection and Mitigation in Cognitive Radio Networks: Layers 1 and 2
3.1.1 Physical Layer Attacks
The physical layer focuses on data transmission through the physical medium, which in CRN is the frequency spectrum. The main difference between CRN and traditional wireless networks at this layer is the spectrum sensing process: the secondary users should correctly sense the licensed users' absence/presence in their spectrum bands, and they can use a spectrum band only in the absence of the licensed user(s). Transmitting over various frequencies across a wide frequency spectrum makes the physical layer in CRN more complex than in general networks. Different attacks can target the physical layer; the following are some of them:
Primary User Emulation (PUE) Attack
Spectrum sensing is the process of distinguishing between the signals of different users in CRN, wherein different detection techniques are used, such as matched filter detection, energy detection, and cyclostationary feature detection. The PUE attack is carried out by a malicious secondary user emulating a primary user, or masquerading as a primary user, to obtain the resources of a given channel without having to share them with other secondary users. This attack disturbs the legitimate users' communication and therefore degrades their quality of service (QoS). Cooperation among the different secondary users throughout the spectrum sensing process is the first step to mitigate this attack. Chen et al. (2007) propose a transmitter verification scheme, called LocDef (localization-based defense), which verifies whether a given signal is that of an incumbent transmitter by estimating its location and observing its signal characteristics. To estimate the location of the signal transmitter, LocDef employs a non-interactive localization scheme to detect and pinpoint a PUE attack.
Lin et al. (2011) introduce a robust technique based on principal component analysis for the spectrum sensing process. All SUs send their observation matrices about the different PUs to one fusion center, which tracks the SUs' transmission signal power in another, low-rank matrix. The fusion center uses this matrix to decide which nodes are suspect and notifies the other, legitimate nodes. The data cache is no longer poisoned, and the results of the primary user sensing process become more accurate.
Another method of defense against the primary user emulation attack is proposed by Yuan et al. (2012), based on the concept of belief propagation. All secondary users in the network follow a sequence of steps until the suspect nodes are detected and excluded from the spectrum sensing process. Each SU iteratively calculates two functions, a location function and a compatibility function, which are used to determine and check the location and the compatibility of PUs. After that, each SU makes its decisions about the PUs, prepares sensing messages, exchanges these messages with neighboring SUs, and calculates the belief level of other SUs until convergence. At convergence, any existing attacker will be detected, and the secondary users will be notified via a broadcast message of the attacker's signal characteristics, so that they neglect and exclude that attacker's sensing results. This allows all secondary users to avoid the attacker's primary emulation signal in the future. Chen et al. (2011) propose another method for detecting and mitigating a primary user emulator: a fusion center receives the sensing information from the different SUs in the network and uses estimation algorithms to detect the primary user in the presence of the attacker.
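A common thread in these defenses is checking whether a claimed primary signal is consistent with what is known about the real primary transmitter. The deliberately simplified sketch below flags a sensed transmitter whose estimated position does not match any known PU site; the coordinates, tolerance, and helper names are invented for illustration and are not part of LocDef or the other cited schemes.

```python
import math

KNOWN_PU_TOWERS = [(0.0, 0.0), (5.0, 5.0)]  # assumed fixed primary transmitter sites (km)
LOCATION_TOLERANCE_KM = 0.5                  # assumed localization error budget

def is_plausible_primary(estimated_xy):
    """Return True if the estimated transmitter position matches a known PU site;
    otherwise the signal is treated as a possible primary user emulation."""
    x, y = estimated_xy
    return any(math.hypot(x - tx, y - ty) <= LOCATION_TOLERANCE_KM
               for tx, ty in KNOWN_PU_TOWERS)

print(is_plausible_primary((0.2, -0.1)))  # True: consistent with a known tower
print(is_plausible_primary((3.0, 1.0)))   # False: likely an emulated primary signal
```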
Objective Function Attack
The objective function of a CR is to maximize the transmission rate (R) and the security level (S). Whenever the CR aims to maximize the security level, the attacker creates jamming traffic on the radio, which reduces R and therefore reduces the objective function. To detect and mitigate this attack, a predefined threshold for each of these parameters is proposed by Leon et al. (2010): if the value of either parameter falls below its threshold, the communication stops and the communicating nodes are reported to a fusion center, which has to re-authenticate each of them.
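A hedged sketch of this threshold check is given below; the linear form of the objective, the weights, and the threshold values are assumptions made for illustration and are not prescribed by Leon et al. (2010).

```python
def objective(rate, security, w_rate=0.6, w_security=0.4):
    """Illustrative linear objective combining transmission rate R and security level S."""
    return w_rate * rate + w_security * security

def check_link(rate, security, rate_min=1.0, security_min=0.5):
    """Stop the communication and report to the fusion centre for re-authentication
    if either parameter drops below its predefined threshold."""
    if rate < rate_min or security < security_min:
        return "stop and re-authenticate"
    return "continue"

print(objective(rate=2.0, security=0.8))   # healthy link
print(check_link(rate=2.0, security=0.8))  # continue
print(check_link(rate=0.3, security=0.8))  # stop: rate driven down by jamming traffic
```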
Jamming Attack
The attacker (jammer) maliciously sends out packets to hinder legitimate participants in a communication session from sending or receiving data, thereby creating a denial-of-service situation. However, there are other reasons why transmission channels can become saturated, such as network congestion due to the message exchange between nodes. To mitigate this attack, the secondary users have to keep track of the primary user's location, which can be obtained by contacting a base station or relayed via other participating network nodes; comparing the PU's location to the location of the node making requests can then alert the network that a malicious node may exist.
3.1.2 Data Link Layer Attacks
Framing the data, regulating access to the physical resources, error correction, and modulation are the main functions of the link layer. There are several differences between the link layer in CRN and in traditional networks. First, in traditional networks, fixed predefined communication channels are assigned to the users for their data transmission according to their protocols, whereas in CRN the communication channels are not fixed and might exist anywhere in the spectrum because of the dynamic spectrum access (DSA) feature. Another difference is that users in CRN use many channels to transmit data simultaneously in order to increase their throughput. Thus, resource management is more complex and requires intelligent scheduling models to avoid data collisions at this layer. Different attacks target the link layer; the following are some of them:
Spectrum Sensing Data Falsification (SSDF)
This attack takes place when an attacker sends false local spectrum sensing results to its neighbors or to a fusion center, causing the receiver to make a wrong spectrum-sensing decision, as highlighted by Chen et al. (2008). Rawat et al. (2011) propose a mitigation method for the SSDF attack. During the sensing period, the malicious nodes and the other SUs all make their own decisions about the presence/absence of PUs in their bands and forward these decisions to a central fusion center. The fusion center keeps track of how many times each node fails to report the right decision about the PU; this count is called the measure. The higher the value of the measure, the less reliable the node's observations are considered, and nodes with a higher measure are excluded from the next iteration of sensing-result collection.
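The sketch below illustrates this reputation idea with majority-vote fusion and a per-node mismatch counter standing in for the measure; the exclusion threshold and the fusion rule are assumptions for illustration rather than the exact formulation of Rawat et al. (2011).

```python
from collections import defaultdict

mismatch_count = defaultdict(int)  # per-SU "measure" kept by the fusion centre
EXCLUSION_THRESHOLD = 3            # assumed cut-off, not taken from Rawat et al.

def fuse_reports(reports):
    """Majority-vote fusion over {node_id: decision} reports from non-excluded SUs,
    then update each node's measure by counting disagreements with the fused decision."""
    active = {n: d for n, d in reports.items()
              if mismatch_count[n] < EXCLUSION_THRESHOLD}
    decision = sum(active.values()) > len(active) / 2
    for node, report in reports.items():
        if report != decision:
            mismatch_count[node] += 1
    return decision

# su3 persistently falsifies its report, accumulates mismatches, and is excluded.
for _ in range(5):
    fuse_reports({"su1": True, "su2": True, "su3": False})
print(mismatch_count["su3"] >= EXCLUSION_THRESHOLD)  # True: su3 no longer trusted
```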
Control Channel Saturation Attack (CCSA)
Usually a single control channel is used by all the secondary users to send information about the spectrum channels; however, this channel has limited capacity. An attacker can keep generating fake frames over this control channel to overload it, which degrades network performance. To mitigate this attack, a CR network can be divided into many clusters, each using its own common control channel. If an attacker targets one control channel, the nodes of the other clusters are not affected, and hence the affected network area is reduced, as mentioned by Khasawneh et al. (2014).
Selfish Channel Negotiation (SCN)
This attack occurs when the attacker (one of the secondary users) refuses to forward any data to the other secondary users in the same network. This maximizes its own throughput but decreases the total throughput of the CR network, as introduced by Zhu et al. (2008). To mitigate this attack, a backup route should be known in advance to all the nodes and used in that case. Moreover, the destination node should notify the fusion center when a requested message is not received; the fusion center then investigates, with the collaboration of the network nodes, to identify the node that did not forward the message, as mentioned by Fragkiadakis et al. (2012).
3.1.3 Network Layer Attacks
The main function of the network layer is routing. In traditional networks, the routers build the routing table, which identifies the paths from/to each node in the network. In CRN, routing is more complex because the spectrum can be openly accessed. Attackers in CRN try to access the routing table and change its contents by sending messages to their neighbors telling them that the attacking node is their next hop, as highlighted by Alrabaees et al. (2012). Two main relevant attacks on the network layer are:
Sinkhole Attack
The attacker announces itself as the best route to a specific destination, luring neighbors to use it for forwarding their packets. Neighboring nodes trust the attacker and advertise it further; by doing so, the attacker builds a trust base and can then start attacking the network, as introduced by Karlof et al. (2003). Since the attacker is considered the best route to all the network nodes, it can use a higher power level to direct all received packets to the base station and thus advertise that it is one hop away from the base station. The different nodes will consider this attacking node the best route in the network, and the attacker will build further trust by correctly forwarding packets to their proper destinations in the beginning. After a high level of trust has been built, the attacker can begin its attack using techniques such as eavesdropping, selective forwarding, packet modification, and packet dropping, as in Burbank et al. (2008).
The sinkhole attack is most effective when the destination is not in the same network as the sender or the attacker, i.e. when all packets must be forwarded to a base station that forwards them to the destination network's base station.
Many techniques are available to mitigate this attack. If a new node wants to join the network, an authentication process should be applied; a new node is added to the network if and only if it is properly authenticated and identified.
If the attacker is one of the already authenticated nodes, periodic notification messages should be sent by the base station to all network nodes about any suspicion or communication issues concerning forwarded, dropped, or modified packets, and the attacker should be excluded and discarded from the network. Another solution to mitigate the sinkhole attack is to apply one of the on-demand routing protocols used in wireless sensor or ad hoc networks, such as a security-aware ad hoc routing protocol, AODV, or DSR.
In these routing protocols, a source node that wants to send a packet to another node establishes the path by sending a route request message. This message contains a security metric (level) that is processed by the intermediate nodes to check whether the level is satisfied. The message is forwarded to the next intermediate node only if the security level is satisfied; otherwise it is dropped. If the request reaches the destination properly and correctly, the destination prepares and sends a route reply to the sender through the intermediate nodes that processed the route request message earlier.
The attacker can still be present in networks that use this type of protocol by changing or altering the security level. However, the route request and reply messages contain a ciphered key that prevents any node that does not know this key from decrypting the messages. Therefore, even if the attacker generates messages with changed security levels, the legitimate nodes will drop these packets since they do not contain the correct ciphered key generated by the base station.
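A minimal sketch of how an intermediate node might process such a route request is shown below; the field names, the security-level scale, and the plain key comparison (standing in for the ciphered-key check) are illustrative assumptions rather than the actual protocol format.

```python
NODE_SECURITY_LEVEL = 2           # this node's clearance on an assumed 0-3 scale
NETWORK_KEY = "base-station-key"  # shared secret distributed by the base station

def handle_route_request(request):
    """Forward the route request only if the key check passes and this node
    satisfies the required security level; otherwise drop it."""
    if request.get("key") != NETWORK_KEY:
        return "drop: missing or forged key"
    if request.get("required_level", 0) > NODE_SECURITY_LEVEL:
        return "drop: security level not satisfied"
    return "forward to next hop"

print(handle_route_request({"required_level": 1, "key": "base-station-key"}))
print(handle_route_request({"required_level": 1, "key": "forged"}))  # sinkhole attempt
```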
HELLO Flood Attack
An adversary broadcasts a message to all nodes of a network stating that it is the best route to a specific destination node in the network, as highlighted by Karlof et al. (2003). The attacker uses a high power level to send the broadcast message in order to convince all the other nodes that the attacking node (adversary) is their neighbor.
When the attacker uses a high power level, the other nodes receive the message with good signal strength and assume that the attacking node is very close to them, while in reality it is not. The network nodes then forward their packets destined for a particular node through the attacking node at their regular signal power level, but the messages are lost due to the large distance to the attacking (forwarding) node. Since all nodes of the network forward packets to the attacking node and their packets are lost, they eventually find themselves with no reachable neighbors. The following methods can be used to mitigate this attack. All links between nodes should be bidirectional, and this can be checked and verified by exchanging a message over each link in the presence of a trusted node, namely the fusion center. The fusion center initiates and verifies the session keys between any pair of network nodes; the session key serves two purposes, verifying the identities of the communicating nodes to each other and providing a ciphered link between them. If one node claims to be a neighbor of an implausibly large number of network nodes, an alarm should be raised to signal attacker detection.
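The neighbour-count check at the end of this mitigation can be sketched as follows; the plausibility limit and the data layout are assumptions chosen only to illustrate the idea.

```python
from collections import Counter

MAX_PLAUSIBLE_NEIGHBOURS = 10  # assumed limit based on radio range and node density

def detect_hello_flood(neighbour_claims):
    """Raise an alarm for any node claiming to be a neighbour of an implausibly
    large number of nodes, which is the signature of a HELLO flood."""
    counts = Counter(sender for sender, _ in neighbour_claims)
    return [node for node, c in counts.items() if c > MAX_PLAUSIBLE_NEIGHBOURS]

claims = [("attacker", f"node{i}") for i in range(50)] + [("node1", "node2")]
print(detect_hello_flood(claims))  # ['attacker']
```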
Sybil Attack
In the Sybil attack, as described by Yenumula et al. (2013), an attacker uses different fake identities to represent a single entity, using the same node with its different fake identities to cheat the legitimate nodes. The effect of this attack is clearest in the cooperative spectrum sensing technique, in which all nodes participate cooperatively in deciding whether a PU is present or absent on its spectrum. Here the attacker can send wrong sensing information, which leads to a wrong sensing decision and hence leaves the PUs' channels unused or exclusively used by the attacker itself. A node identity validation technique is used to mitigate this attack, with two forms of validation: direct and indirect. In direct validation, each node directly tests whether another node's identity is valid; in indirect validation, already-verified nodes can validate other nodes or send reputation reports about them. In either form of validation, a node's resources are tested; since communication, storage, and computation resources are limited, a single physical node cannot convincingly sustain many identities.
3.1.4 Transport Layer Attacks
Flow control, error control, and congestion control are the main functionalities of the transport layer. Two factors are involved in transport layer control: the round-trip time and the packet loss probability. These factors are influenced by characteristics specific to CRN, such as the spectrum access technology and the operating frequency. This layer, like the other layers, is vulnerable to many attacks that target cognitive radio networks.
Key Depletion Attack
Cognitive radio networks have short transport layer session durations due to frequently occurring retransmissions and high round-trip times, as introduced by Balasubramanyn et al. (2007). Therefore, a large number of sessions are initiated between communicating parties. Most transport layer protocols, such as secure socket layer (SSL) and transport layer security (TLS), establish cryptographic keys at the beginning of each transport layer session. With the great number of session keys generated, it becomes more likely that a session key is repeated, and repetitions of a key can provide an avenue for breaking the underlying cipher system, as introduced by Tychogiorgos et al. (2012). It has been established that the wired equivalent privacy (WEP) and temporal key integrity protocol (TKIP) protocols used in IEEE 802.11 are prone to key repetition attacks. An attacker can therefore eavesdrop on the communication traffic between the two communicating users, recover the session key, and use this key to obtain the session data. To mitigate this attack, new ciphering algorithms have to be developed so that session keys are shared in a more secure way.
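The likelihood of such a key repetition can be estimated with the standard birthday bound, as in the sketch below; the session counts and key sizes are illustrative, not measurements of any particular CRN deployment.

```python
import math

def key_repetition_probability(num_sessions, key_bits):
    """Approximate probability that at least two of num_sessions independently
    generated session keys coincide (birthday bound)."""
    key_space = 2.0 ** key_bits
    return 1.0 - math.exp(-num_sessions * (num_sessions - 1) / (2.0 * key_space))

# Frequent short CRN sessions mean many key establishments: a small effective key
# space (e.g., WEP's 24-bit IV) repeats quickly, whereas a 128-bit space does not.
print(round(key_repetition_probability(10_000, 24), 3))  # ~0.949
print(key_repetition_probability(10_000, 128))           # ~0.0
```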
Some attacks might target one layer yet have influence and consequences on other layers; these are known as cross-layer attacks. In cognitive networks there is an inherent need for greater interaction between the different layers of the protocol stack, so cross-layer attacks need to be given more attention.
Figure 1 illustrates the different concepts of cognitive radio networks and the way they are linked to each other. Because each layer communicates with the other layers to provide its functionality, new security threats arise in CRN, such as the Lure attack problem highlighted by Long et al. (2012). In the Lure attack, during the process of finding routes from source to destination, a malicious node first modifies the routing request packet it receives by adding false available-channel information to it. This false channel information lures other nodes into routing through it, after which it drops the forwarded packets. This threat can seriously affect the communication performance of the network.
Figure 1. Cross-layer attack effects, as introduced by Long et al. (2012)
4. FUTURE DIRECTIONS
4.1 Genetic Algorithm for CRN
Genetic Algorithm (GA) is an optimal searching method that simulates natural selection and genetic evolution. Chen et al. (2010) apply GA to a cognitive engine model. In the model of Zhang et al. (2008), combinations of CR parameters build up different chromosomes; a fitness function evaluates each chromosome to decide whether it will contribute to the next generation of solutions, and then, through operations analogous to gene transfer in sexual reproduction, the algorithm creates a new population of candidate solutions. The parameters of the CR terminals are treated as the genes of a chromosome, a mixture of different genes forms a chromosome, and many chromosomes form the population. The relation between the elements in GA and CRN is shown in Table 4. Evolution starts from an initially generated population and proceeds iteration by iteration, creating new and better chromosomes until a pre-specified stopping condition is satisfied. The procedure of the general GA in CR is highlighted in Zhao et al. (2012) as follows:
Table 4. Cognitive radio network based on Genetic Algorithm
1. Choose the parameters to be configured, then set the number of bits per gene according to their range and accuracy, and set the number of chromosomes in a population.
2. Randomly generate the initial population of chromosomes.
3. Iteration: calculate the fitness of each chromosome of the i-th population P(i), use wheel selection to pick the parents from P(i), and execute two-point crossover and random mutation; the new chromosomes form P(i+1). (A minimal sketch of this loop is given after the list.)
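The compact, self-contained sketch below follows the steps above with a toy fitness function; the chromosome encoding (two 8-bit CR parameters), the fitness expression, and all constants are illustrative assumptions rather than the actual models of Chen et al. (2010), Zhang et al. (2008), or Zhao et al. (2012).

```python
import random

random.seed(1)
BITS_PER_PARAM = 8   # bit width per CR parameter (assumed)
NUM_PARAMS = 2       # e.g., a transmit-power index and a modulation index
CHROM_LEN = BITS_PER_PARAM * NUM_PARAMS
POP_SIZE, GENERATIONS = 20, 40

def decode(chrom):
    """Split the bit string into per-parameter integers in [0, 255]."""
    return [int("".join(map(str, chrom[i * BITS_PER_PARAM:(i + 1) * BITS_PER_PARAM])), 2)
            for i in range(NUM_PARAMS)]

def fitness(chrom):
    """Toy objective: reward both parameters but penalise driving both up at once
    (a stand-in for the rate-versus-interference trade-off a cognitive engine faces)."""
    power, modulation = decode(chrom)
    return 1.0 + power + modulation - 0.005 * power * modulation  # always positive

def wheel_select(population, scores):
    """Roulette-wheel selection: pick a parent with probability proportional to fitness."""
    pick, acc = random.uniform(0, sum(scores)), 0.0
    for chrom, score in zip(population, scores):
        acc += score
        if acc >= pick:
            return chrom
    return population[-1]

def two_point_crossover(a, b):
    i, j = sorted(random.sample(range(1, CHROM_LEN), 2))
    return a[:i] + b[i:j] + a[j:]

def mutate(chrom, rate=0.01):
    return [1 - g if random.random() < rate else g for g in chrom]

population = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):   # step 3: iterate until the stopping condition
    scores = [fitness(c) for c in population]
    population = [mutate(two_point_crossover(wheel_select(population, scores),
                                             wheel_select(population, scores)))
                  for _ in range(POP_SIZE)]
print("best decoded parameters:", decode(max(population, key=fitness)))
```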
If a chromosome with quite high fitness exists at an early stage of evolution, it will be chosen frequently for crossover with other chromosomes and will produce many offspring. The number of similar chromosomes then grows larger and larger with the iterations, and the evolutionary process falls into premature convergence, as discussed in Cao et al. (2010); even after mutation, this trend does not change. Conversely, if almost all chromosomes have similar fitness after some iterations, each chromosome is chosen for crossover with almost equal probability. GA can play an important role in improving security; in particular, it can adapt to dynamic environments and to situations where the complexity of user behaviour is very high. Moreover, GA has many features that make it well suited to security systems, especially intrusion detection systems (Lu et al. (2004), Chittur et al. (2001), and Folino et al. (2005)), such as robustness to noise, self-learning capability, and the fact that initial rules can be built randomly, so there is no need to know the exact attack machinery at the outset, as in Bankovic et al. (2007). The numerous gains of using genetic algorithms can be summarized as in Majeed et al. (2014): (i) GA possesses strong capabilities for parallel processing, (ii) GA provides a wider solution space, (iii) GA is easy to modify, (iv) GA handles functions with noise efficiently, and (v) GA does not need prior knowledge of the problem space.
4.2 Game Theory for CRN
Game theory has been used mostly in economics, in order to model competition between firms. It has also been applied to networking, generally to solve routing and resource allocation problems in a competitive environment, as introduced by Akkarajitsakul et al. (2011). Recently, game theory has also been applied to wireless communication, where the decision makers in the game are rational users who control their communication devices, as in Akkarajitsakul et al. (2011), Alrabaees et al. (2012), and Khasawneh et al. (2012). Due to the nature of cognitive radio networks, where any change in the environment triggers the network to re-allocate the spectrum resources, game theory can be used as an important tool to analyse, model, and study these interactions. By using game theory we can obtain an efficient model as well as self-enforcing and scalable schemes. Furthermore, the secondary users who compete for spectrum may or may not cooperate with other users; the latter case leads to selfish user behaviour, which lends itself naturally to a game-theoretic model. There are various kinds of games, such as cooperative, non-cooperative, static, dynamic, repeated, and Stackelberg games. Brief descriptions of some of these games are provided below:
• Dynamic Game: It is suited to settings with multiple periods, and any change in the parameters of the system affects the game.
• Stackelberg Game: It consists of a leader and a follower; the leader announces a policy and the follower chooses its policy based on the leader's action.
Table 5 shows the correspondence between each element in a cognitive radio network and the corresponding element in game theory.
Table 5. Cognitive radio network based on wireless networking game
Most game-theoretic approaches to spectrum and power management do not consider security issues and make assumptions related to security, such as that no users are malicious, all users are trusted, all users are authorized as well as authenticated, and the primary user is a trusted party. However, in some environments these assumptions are not valid, which requires changes to the existing models to prevent attacks or denial of service.
In a simple scenario of applying game theory, attackers target a system that applies CR and defenders defend against these attacks. The attackers aim to minimize the QoS level provided to regular network users by increasing the noise or interference level, while the regular network users serve as defenders by communicating with each other. The game is played between the attackers and the defenders by tuning a few parameters, such as the detection rate and false alarm rate, the cost of attacking and of monitoring, and the probability that a node is malicious. The players use these parameters to develop proper payoff functions that reduce the effects of the attacks. Therefore, game theory should be studied and applied to provide secure approaches to spectrum sharing between the different network nodes.
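To make this concrete, the sketch below sets up a small attacker-versus-defender matrix game and searches for pure-strategy Nash equilibria; every payoff number, including the costs of attacking and monitoring, is invented purely for illustration.

```python
# Payoffs as (attacker, defender); the attacker jams or idles, the defender monitors or ignores.
ACTIONS_A = ["jam", "idle"]
ACTIONS_D = ["monitor", "ignore"]
PAYOFF = {
    ("jam", "monitor"): (-2, 1),   # attack detected: attacker penalised, defender rewarded
    ("jam", "ignore"):  (3, -3),   # undetected jamming degrades the defenders' QoS
    ("idle", "monitor"): (0, -1),  # monitoring cost paid with no attack to catch
    ("idle", "ignore"):  (0, 0),
}

def pure_nash_equilibria():
    """Return action profiles where neither player gains by deviating unilaterally."""
    equilibria = []
    for a in ACTIONS_A:
        for d in ACTIONS_D:
            ua, ud = PAYOFF[(a, d)]
            best_a = all(ua >= PAYOFF[(a2, d)][0] for a2 in ACTIONS_A)
            best_d = all(ud >= PAYOFF[(a, d2)][1] for d2 in ACTIONS_D)
            if best_a and best_d:
                equilibria.append((a, d))
    return equilibria

print(pure_nash_equilibria())  # [] -> no pure-strategy equilibrium for these payoffs
```

With these particular payoffs no pure-strategy equilibrium exists, which is why randomized (mixed) monitoring strategies are usually studied for attacker-defender games of this kind.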
4.3 Capabilities in CRN against Security Issues
Basically, cognitive radios offer significant new capabilities to counter attacks such as intrusion and denial-of-service attacks. The capabilities are as follows: (i) the spectrum sensing and Software Defined Radio (SDR) capabilities of the radio make it possible to employ physical layer properties, such as RF signatures, for authentication or secure communication; (ii) the spectrum scanning and agility associated with cognitive radios enable networks to move away from frequency channels experiencing a denial-of-service attack; and (iii) location is another vital feature of a cognitive radio network, and information on geographic position can also be used to defend against certain types of attacks on cognitive networks, such as Primary User Emulation attacks. Spectrum trading, as in Alrabaees et al. (2012) and Khasawneh et al. (2012), is one important issue that opens the door for malicious nodes to launch attacks in CRN.
4.4 Efficient Spectrum Sensing Techniques
If the secondary users sense the primary users correctly, they can efficiently use the unused licensed bands. An information-exchange method has been proposed by Khasawneh et al. (2012c), in which clustering, sureness, and cooperation concepts are used to exchange spectrum sensing information between the secondary users. Comparing the different proposed schemes will lead to more efficient and robust spectrum sensing techniques that prevent fraudulent nodes from attacking cognitive radio networks.
5. CONCLUSION
Security is affected by changing technology, environments, demands, and performance, and security has traditionally been considered at the application layer; designing security models and mechanisms is very challenging. Security in cognitive radio networks, in particular, is a technical area that has received little attention to date. Although the main objective of using cognitive radios is to increase spectrum utilization by allowing the unlicensed (secondary) users to opportunistically access the frequency band owned by the licensed (primary) users, the classification of users into two different categories gives rise to several security issues that are unique to cognitive radio communications. In this chapter we have presented the main security requirements and some of the attacks that target the different layers of the protocol stack.
This work was previously published in the Handbook of Research on Software-Defined and Cognitive Radio Technologies for Dynamic Spectrum Management edited by Naima Kaabouch and Wen-Chen Hu, pages 813-834, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Akkarajitsakul, K., Hossain, E., Niyato, D., & Kim, D. I. (2011). Game theoretic approaches for multiple access in wireless networks: A
survey. IEEE Communications Surveys and Tutorials ,13(3), 372–395. doi:10.1109/SURV.2011.122310.000119
Alrabaee, S., Agarwal, A., Goel, N., Zaman, M., & Khasawneh, M. (2012, August). Higher layer issues in cognitive radio network. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics (pp. 325-330). ACM. doi:10.1145/2345396.2345450
Alrabaees, S., Agarwal, A., Goel, N., Zaman, M., & Khasawneh, M. (2012a, November). Comparison of spectrum management without game
theory (smwg) and with game theory (smg) for network performance in cognitive radio network. In Proceedings of the 2012 Seventh
International Conference on Broadband, Wireless Computing, Communication and Applications (pp. 348-355). IEEE Computer Society.
Alrabaees, S., Agarwal, A., Goel, N., Zaman, M., & Khasawneh, M. (2012d, November). Routing management algorithm based on spectrum
trading and spectrum competition in cognitive radio networks. In Proceedings of the 2012 Seventh International Conference on Broadband,
Wireless Computing, Communication and Applications (pp. 614-619). IEEE Computer Society.
Alrabaees, S., Khasawneh, M., Agarwal, A., Goel, N., & Zaman, M. (2012e, December). A game theory approach: Dynamic behaviours for spectrum management in cognitive radio network. In Proceedings of Globecom Workshops (GC Wkshps), (pp. 919-924). IEEE.
Alrabaees, S., Agarwal, A., Goel, N., Zaman, M., & Khasawneh, M. (2012b, October). A game theoretic approach to spectrum management in cognitive radio network. In Proceedings of Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), (pp. 906-913). IEEE.
Alrabaees, S., Agarwal, A., Anand, D., & Khasawneh, M. (2012c, August). Game Theory for Security in Cognitive Radio Networks. In Proceedings
of Advances in Mobile Network, Communication and its Applications (MNCAPPS), (pp. 60-63). IEEE.
Altman, E., Avrachenkov, K., & Garnaev, A. (2011). Jamming in wireless networks under uncertainty. Mobile Networks and Applications , 16(2),
246–254. doi:10.1007/s11036-010-0272-4
Balasubramanyn, V. B., Thamilarasu, G., & Sridhar, R. (2007, June). Security solution for data integrity in wireless biosensor networks.
In Proceedings of Distributed Computing Systems Workshops, (pp. 79-79). IEEE.
Banković, Z., Stepanović, D., Bojanić, S., & Nieto-Taladriz, O. (2007). Improving network security using genetic algorithm approach. Computers
& Electrical Engineering , 33(5), 438–451. doi:10.1016/j.compeleceng.2007.05.010
Bian, K., Park, J. M., & Chen, R. (2011). Control channel establishment in cognitive radio networks using channel hopping. IEEE Journal on Selected Areas in Communications, 29(4), 689–703.
Burbank, J. L. (2008, May). Security in cognitive radio networks: The required evolution in approaches to wireless network security.
In Proceedings of Cognitive Radio Oriented Wireless Networks and Communications, (pp. 1-7). IEEE.
Butt, M. A. (2013). Cognitive radio network: Security enhancements. Journal of Global Research in Computer Science ,4(2), 36–41.
Cao, D. Y., & Cheng, J. X. (2010). A genetic algorithm based on modified selection operator and crossover operator. Computer Technology and
Development , 20(2), 44–47.
Chen, C., Cheng, H., & Yao, Y. D. (2011). Cooperative spectrum sensing in cognitive radio networks in the presence of the primary user emulation
attack . IEEE Transactions on Wireless Communications , 10(7), 2135–2141.
Chen, R., Park, J. M., Hou, Y. T., & Reed, J. H. (2008). Toward secure distributed spectrum sensing in cognitive radio networks.IEEE
Communications Magazine , 46(4), 50–55. doi:10.1109/MCOM.2008.4481340
Chen, R., Park, J. M., & Reed, J. H. (2007). Defense against primary user emulation attacks in cognitive radio networks . IEEE Journal
on Selected Areas in Communications , 26(1), 25–37.
Chen, S., Newman, T. R., Evans, J. B., & Wyglinski, A. M. (2010, April). Genetic algorithm-based optimization for cognitive radio networks.
In Proceedings ofSarnoff Symposium, (pp. 1-6). IEEE. 10.1109/SARNOF.2010.5469780
Clancy, T. C., & Goergen, N. (2008, May). Security in cognitive radio networks: Threats and mitigation. In Proceedings of Cognitive Radio
Oriented Wireless Networks and Communications, (pp. 1-8). IEEE.
Dhar, R., George, G., Malani, A., & Steenkiste, P. (2006, September). Supporting integrated MAC and PHY software development for the USRP
SDR. In Proceedings of Networking Technologies for Software Defined Radio Networks, (pp. 68-77). IEEE. 10.1109/SDR.2006.4286328
El-Hajj, W., Safa, H., & Guizani, M. (2011). Survey of security issues in cognitive radio networks. Journal of Internet Technology, 12(2), 181–198.
Eronen, P., & Arkko, J. (2004, November). Role of authorization in wireless network security. In Proceedings of DIMACS Workshop. Academic
Press.
Faulhaber, G. R., & Farber, D. (2002). Spectrum management: Property rights, markets, and the commons. AEI-Brookings Joint Center for Regulatory Studies Working Paper, (02-12), 6.
Folino, G., Pizzuti, C., & Spezzano, G. (2005). GP ensemble for distributed intrusion detection systems . In Pattern Recognition and Data
Mining (pp. 54–62). Springer Berlin Heidelberg. doi:10.1007/11551188_6
Fragkiadakis, A. G., Tragos, E. Z., Tryfonas, T., & Askoxylakis, I. G. (2012). Design and performance evaluation of a lightweight wireless early
warning intrusion detection prototype. EURASIP Journal on Wireless Communications and Networking , (1): 1–18.
Hernandez-Serrano, J., León, O., & Soriano, M. (2011). Modeling the lion attack in cognitive radio networks. EURASIP Journal on Wireless
Communications and Networking , 2, 2011.
Jin, Z., Anand, S., & Subbalakshmi, K. P. (2009, June). Detecting primary user emulation attacks in dynamic spectrum access networks.
In Proceedings of Communications, (pp. 1-5). IEEE. 10.1109/ICC.2009.5198911
Jin, Z., Anand, S., & Subbalakshmi, K. P. (2012). Impact of primary user emulation attacks on dynamic spectrum access networks . IEEE
Transactions on Communications , 60(9), 2635–2643. doi:10.1109/TCOMM.2012.071812.100729
Jung, S. M., Kim, T. K., Eom, J. H., & Chung, T. M. (2013). The Quantitative Overhead Analysis for Effective Task Migration in Biosensor Networks. BioMed Research International, 2013.
Karlof, C., & Wagner, D. (2003). Secure routing in wireless sensor networks: Attacks and countermeasures. Ad Hoc Networks , 1(2), 293–315.
doi:10.1016/S1570-8705(03)00008-8
Khasawneh, M., Agarwal, A., Goel, N., Zaman, M., & Alrabaee, S. (2012a, September). A game theoretic approach to power trading in cognitive
radio systems. In Proceedings of Software, Telecommunications and Computer Networks (SoftCOM), (pp. 1-5). IEEE.
Khasawneh, M., Agarwal, A., Goel, N., Zaman, M., & Alrabaee, S. (2012b, October). A price setting approach to power trading in cognitive radio
networks. In Proceedings of Ultra Modern Telecommunications and Control Systems and Workshops(ICUMT), (pp. 878-883). IEEE.
Khasawneh, M., Agarwal, A., Goel, N., Zaman, M., & Alrabaee, S. (2012c, July). Sureness efficient energy technique for cooperative spectrum
sensing in cognitive radios. In Proceedings of Telecommunications and Multimedia (TEMU), (pp. 25-30). IEEE.
Khasawneh, M., & Agarwal, A. (2014, March). A survey on security in Cognitive Radio networks. In Proceedings of Computer Science and
Information Technology (CSIT), (pp. 64-70). IEEE. 10.1109/CSIT.2014.6805980
León, O., Hernandez-Serrano, J., & Soriano, M. (2009, June). A new cross-layer attack to TCP in cognitive radio networks. In Proceedings of Cross Layer Design, (pp. 1-5). IEEE. 10.1109/IWCLD.2009.5156526
León, O., Hernández‑Serrano, J., & Soriano, M. (2010). Securing cognitive radio networks. International Journal of Communication
Systems , 23(5), 633–652.
Liang, Y., Somekh-Baruch, A., Poor, H. V., Shamai, S., & Verdú, S. (2009). Capacity of cognitive interference channels with and without secrecy
. IEEE Transactions on Information Theory ,55(2), 604–619.
Lin, F., Hu, Z., Hou, S., Yu, J., Zhang, C., Guo, N., et al. (2011, July). Cognitive radio network as wireless sensor network (ii): Security
consideration. In Proceedings of Aerospace and Electronics Conference (NAECON), (pp. 324-328). IEEE.
Long, T., & Juebo, W. (2012). Research and analysis on cognitive radio network security. In Proceedings of Wireless Sensor Network. Academic
Press.
Lu, W., & Traore, I. (2004). Detecting new forms of network intrusion using genetic programming. Computational Intelligence, 20(3), 475–494.
doi:10.1111/j.0824-7935.2004.00247.x
Majeed, P. G., & Kumar, S. (2014). Genetic Algorithms in Intrusion Detection Systems: A Survey. International Journal of Innovation and
Applied Studies , 5(3), 233–240.
Mathur, C. N., & Subbalakshmi, K. P. (2007). Security issues in cognitive radio networks. In Cognitive networks: Towards selfaware networks,
(pp. 284-293). Academic Press.
Mitola, J. (1999). Cognitive radio for flexible mobile multimedia communications. In Proceedings of Mobile Multimedia Communications, (pp.
3-10). IEEE. 10.1109/MOMUC.1999.819467
Ngo, H. H., Wu, X., Le, P. D., & Srinivasan, B. (2010, April). An Authentication Model for Wireless Network Services. InProceedings of Advanced
Information Networking and Applications (AINA), (pp. 996-1003). IEEE. 10.1109/AINA.2010.40
Perron, E., Diggavi, S., & Telatar, I. E. (2009, April). On cooperative wireless network secrecy. In Proceedings of INFOCOM 2009, (pp. 1935-
1943). IEEE.
Qusay, H. M., & Mahmou, D. (2007). Cognitive networks: Towards selfaware networks. John Wiley & Sons Ltd.
Rawat, A. S., Anand, P., Chen, H., & Varshney, P. K. (2011). Collaborative spectrum sensing in the presence of Byzantine attacks in cognitive radio
networks . IEEE Transactions on Signal Processing , 59(2), 774–786. doi:10.1109/TSP.2010.2091277
Reddy, Y. (2013, June). Security Issues and Threats in Cognitive Radio Networks. In Proceedings of AICT 2013,the Ninth Advanced
International Conference on Telecommunications (pp. 84-89). Academic Press.
Sadek, A. K., Han, Z., & Liu, K. R. (2010). Distributed relay-assignment protocols for coverage expansion in cooperative wireless networks . IEEE
Transactions on Mobile Computing ,9(4), 505–515.
Steenkiste, P., Sicker, D., Minden, G., & Raychaudhuri, D. (2009, March). Future directions in cognitive radio network research.NSF Workshop
Report, 4(1), 1-2.
Tandel, P., Valiveti, S., Agrawal, K. P., & Kotecha, K. (2010). Non-Repudiation in Ad Hoc Networks. In Proceedings of Communication and
Networking (pp. 405-415). Springer Berlin Heidelberg.
Tychogiorgos, G., Gkelias, A., & Leung, K. K. (2012). Utility-Proportional Fairness in Wireless Networks. In Proceedings of 2012 IEEE 23rd
International Symposium on Personal, Indoor and Mobile Radio Communications, PIMRC, 2012. IEEE.
Xu W. Trappe W. Zhang Y. Wood T. (2005, May). The feasibility of launching and detecting jamming attacks in wireless networks. In
Proceedings of the 6th ACM International Symposium on Mobile Ad Hoc Networking and Computing (pp. 46-57).
ACM.10.1145/1062689.1062697
Yenumula, B. R. (2013). Security Issues and Threats in Cognitive Radio Networks. In Proceedings of AICT:The Ninth Advanced International
Conference on Telecommunications, (pp. 85-90). AICT.
Yuan, Z., Niyato, D., Li, H., Song, J. B., & Han, Z. (2012). Defeating primary user emulation attacks using belief propagation in cognitive radio
networks . IEEE Journal on Selected Areas in Communications , 30(10), 1850–1860.
Zeng, Y., Liang, Y. C., Hoang, A. T., & Zhang, R. (2010). A review on spectrum sensing for cognitive radio: Challenges and solutions.EURASIP
Journal on Advances in Signal Processing , 2010, 2. doi:10.1155/2010/381465
Zhang X. Li C. (2009, June). The security in cognitive radio networks: A survey. In Proceedings of the 2009 International Conference on Wireless
Communications and Mobile Computing: Connecting the World Wirelessly (pp. 309-313). ACM.10.1145/1582379.1582447
Zhang, Z., & Xie, X. (2008, June). Application research of evolution in cognitive radio based on GA. In Proceedings of Industrial Electronics and
Applications, (pp. 1575-1579). IEEE.
Zhao, J. H., Li, F., & Zhang, X.-. (2012). Parameter adjustment based on improved genetic algorithm for cognitive radio networks.Journal of
China Universities of Posts and Telecommunications ,19(3), 22–26. doi:10.1016/S1005-8885(11)60260-4
Zhou, D. (2003, January). Security issues in ad hoc networks . InThe handbook of ad hoc wireless networks (pp. 569–582). CRC Press, Inc.
Zhu, L., & Mao, H. (2011, March). An efficient authentication mechanism for cognitive radio networks. In Proceedings of Power and Energy
Engineering Conference (APPEEC), (pp. 1-5). IEEE. 10.1109/APPEEC.2011.5748783
Zhu, L., & Zhou, H. (2008, December). Two types of attacks against cognitive radio network MAC protocols. In Proceedings of Computer Science
and Software Engineering, (Vol. 4, pp. 1110-1113). IEEE. 10.1109/CSSE.2008.1536
CHAPTER 62
Classification of Failures in Photovoltaic Systems using Data Mining Techniques
Lucía Serrano-Luján
Technical University of Cartagena, Spain
Jose Manuel Cadenas
University of Murcia, Spain
Antonio Urbina
Technical University of Cartagena, Spain
ABSTRACT
Data mining techniques have been used on data collected from photovoltaic systems to predict their generation and performance. To date, however, this computing approach has required the simultaneous measurement of environmental parameters collected by an array of sensors. This chapter presents the application of several machine learning techniques to electrical data in order to detect and classify the occurrence of failures (e.g. shadows, bad weather conditions, etc.) without using environmental data. The results of a 222 kWp (CdTe) case study show how machine learning algorithms can be used to improve the management and performance of photovoltaic generators without relying on environmental parameters.
INTRODUCTION
During recent years the number of large-scale PV (photovoltaic) systems has grown worldwide. In 2010, photovoltaic industry production more than doubled, reaching a worldwide production volume of 23.5 GWp of photovoltaic modules. Business analysts predict that investments in PV technology could double from €35-40 billion in 2010 to over €70 billion in 2015, while prices for consumers continue to decrease at the same time (European Commission, DG Joint Research Centre, Institute for Energy, Renewable Energy Unit, 2011).
The complexity of PV system configurations represents an additional problem for maintenance and control operations in large systems. For example, a failure in one PV module placed on a large façade is very difficult to detect. Quick detection of failures would avoid energy losses due to malfunctions of the PV system and therefore improve its performance and end-user satisfaction (Roman, Alonso, Ibanez, Elorduizapatarietxe, & Goitia, 2006).
When data-mining techniques are applied to a PV database, a wide variety of relations between parameters can be found. This study relies on expert knowledge to study the possible behaviours of PV generation performance that can be affected by changes in environmental conditions; furthermore, machine learning algorithms allow us to detect and classify failures without measuring the environmental parameters.
This study focuses on a methodology to control the correct performance of each group of modules that composes a large-scale PV generator, identifying failure occurrences and their most likely causes, by means of a procedure that analyses the performance of a group of modules without using environmental information.
DATA MINING AND RENEWABLE ENERGIES
Data Mining
The Knowledge Extraction Process tries to find useful, valid, important and new knowledge about a phenomenon or activity by computationally efficient procedures (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). It is also of paramount importance to present the results in a clear and easily interpretable way; therefore this process comprises data-mining techniques, in order to extract or identify patterns from data, complemented with both pre-processing and post-processing stages. The process implies several steps:
• To extract knowledge in order to select and apply the most appropriate Data Mining technique.
The concept "Data Mining" is academically considered a step within the Knowledge Extraction Process; nevertheless, from an applied point of view, both terms are used interchangeably (Fayyad et al., 1996).
Basically, a Data Mining process can be supervised or unsupervised, depending on whether or not the entries are assigned to a finite number of discrete classes, and it includes the selection of the tasks to be done, for instance classification, grouping or clustering, regression, etc. Data Mining processes find patterns that can be expressed as a model, or that can show dependences between data in a graphical manner. The induced model depends on its function (for instance, classification) and on the way it is represented (decision trees, rules, etc.). Besides, one should specify preference criteria to select a model within a group of possible models, and the search strategy to be used (which is normally determined by a particular Data Mining technique) (Fayyad et al., 1996; Hernández Orallo, Ramírez Quintana, & Ferri Ramírez, 2005; Witten & Frank, 2011). Two kinds of models can be distinguished:
1. Predictive, in which the aim is to estimate future or unknown values of variables of interest (objective or dependent variables) by using other variables from the database (referred to as independent or predictor variables); and
2. Descriptive, in which the aim is to identify patterns that explain or summarize the data, i.e., they serve to explore different properties of the examined data rather than to predict new data.
Different kinds of tasks can be distinguished within Data Mining. Each task has its own requirements and, likewise, the information obtained by one task can differ greatly from the information obtained by another. Among the tasks oriented to obtaining predictive models (predictive tasks), classification and regression can be found, while clustering, association rules, sequential association rules and correlation are tasks oriented to obtaining descriptive models (descriptive tasks) (Hernández Orallo et al., 2005; Witten & Frank, 2011).
Classification/regression of a feature's values as a function of other features' values is part of the supervised process. The classification goal is to learn a model, referred to as a classifier, which represents the correspondence between examples: for each value of the input group, an output value is given. Decision trees, induction rules, random forests, etc. are examples of classification techniques.
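As a hedged illustration of this supervised setting (not the chapter's actual implementation), the following Python sketch fits a small decision tree that maps invented electrical features to invented state labels; scikit-learn is assumed to be available, and every feature name and value is made up.

```python
# Minimal illustration of a supervised classifier: a decision tree learns a
# mapping from input feature values to a class label. Feature names and data
# are invented for illustration only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training examples: [performance_ratio, output_power_kW] -> state label
X_train = [[0.85, 5.9], [0.80, 5.5], [0.30, 1.2], [0.25, 0.9], [0.55, 3.0]]
y_train = ["OK", "OK", "shadow", "shadow", "dirty"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# The learned model assigns an output class to each new input vector.
print(clf.predict([[0.82, 5.7], [0.28, 1.0]]))
print(export_text(clf, feature_names=["PR", "Pac"]))  # interpretable rules
```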
In order to select or use a classification technique, several factors have to be taken into consideration. Among them, the following are highlighted:
1. Classification power: this is one of the main factors to keep in mind when a technique is chosen.
2. Explanation of the results, or understandable models: the usual application of Data Mining is within a professional field, and hence the user will expect an understandable description of the learned model. The most frequently used interpretable models are those obtained from decision tree and induction rule methods.
Once the model or classifier is obtained, a validation process should be carried out to check that the obtained conclusions are valid and sufficiently satisfactory. When several models are obtained from different techniques, they should be compared to find which one provides the best solution to the problem. If none of the models provides good results, some of the steps should be altered to generate new models.
The basic measure used to validate this kind of model is accuracy (or, equivalently, the error rate). It is obtained as the number of correct answers divided by the total number of instances. Nevertheless, when a model is evaluated, data different from those originally used during the training step must be utilized; otherwise the estimate is optimistic and overfitting goes undetected. To avoid this problem, training data and evaluation data are usually separated. The most widely used method is k-fold cross-validation, repeated several times in order to obtain an average accuracy. The k-fold cross-validation method consists in partitioning the data into k subsets, training the model on k−1 of them, evaluating it on the remaining subset to compute the accuracy, repeating the procedure k times so that each subset is used once for evaluation, and averaging the k accuracies.
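The procedure can be sketched in plain Python as follows; `train` and `evaluate_accuracy` are placeholders for any learning algorithm and its accuracy measure, not functions from the chapter.

```python
# A plain-Python sketch of k-fold cross-validation as described above.
import random

def k_fold_accuracy(records, labels, train, evaluate_accuracy, k=5, seed=0):
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k disjoint subsets

    accuracies = []
    for i in range(k):
        test_idx = set(folds[i])
        train_idx = [j for j in indices if j not in test_idx]
        model = train([records[j] for j in train_idx],
                      [labels[j] for j in train_idx])          # fit on k-1 folds
        acc = evaluate_accuracy(model,
                                [records[j] for j in folds[i]],
                                [labels[j] for j in folds[i]])  # test on held-out fold
        accuracies.append(acc)
    return sum(accuracies) / k                         # average accuracy
```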
Data Mining Techniques on Renewable Energy Issues
To date, several studies have applied data mining techniques in order to find solutions to specific renewable energy problems.
In (Chaabene, Ammar, & Elhajjaji, 2007) and (Ben Salah, Chaabene, & Ben Ammar, 2008) the authors develop a fuzzy decision-making algorithm, based on expert knowledge, to decide when to connect devices either to the PVP (photovoltaic panel) output or to the electric grid, achieving daylight energy savings of up to 90% of the energy generated by the PVP system. (Sallem, Chaabene, & Kamoun, 2009) also applies fuzzy rules, with the aim of extending the operation time of a water pump by controlling a switching unit.
(Ammar, Chaabene, & Elhajjaji, 2010) proposes an energy plan based on the estimated photovoltaic generation for the next day. It considers the PVG (photovoltaic panel generation) during the last ten days in order to forecast its behaviour for the following day using a neuro-fuzzy estimator. (Cirre, Berenguel, Valenzuela, & Klempous, 2009) uses two-layer hierarchical control strategies, fuzzy logic and physical model-based optimization, to automatically track the operating point despite any disturbances affecting the plant, taking operating constraints into account.
Nevertheless, other lines of research on PV-systems face the solar energy problems from a more general point of view. Increasing attention is
being paid to PV systems reliability in recent years due to rapid growth of PV power installation in residential and commercial buildings (Wang,
Zhang, Li, & Kan’an, 2012).
Primary interest for researchers in solar energy is related to the design and optimization of solar energy homes, also called “Solar Home Systems”
(SHS), while improving energy efficiency in buildings is a major priority worldwide (Baños et al., 2011). Many studies regard the reliability and
risk assessment of large-scale PV systems as a way to bring benefits for both utility companies and customers (Zhang, Li, Li, Wang, & Xiao, 2013).
The discussions are extended to emerging research topics including time varying and ambient-condition-dependent failure rates of critical PV
system components (Zhang et al., 2013). (Catelani et al., 2011) noticed that a crucial aspect in PV systems is the cleaning status of the panel
surface. They used FMECA (Failure modes effects and criticality analysis) in order to classify the occurrence of failures. The obtained results
allow the designer to identify the modification, which must be made in order to improve the RAMS (Reliability, Availability, Maintainability and
Safety). Similar is the solution found by (Collins, Dvorack, Mahn, Mundt, & Quintana, 2009) that uses FMEA (Failure Modes and Effects
Analysis) as a technique for systematically identifying, analysing and documenting the possible failure modes within a design and effects of such
failure modes on system performance or safety. A hybrid method for six-hour-ahead solar power prediction was developed by R. Hossain et al. (Hossain, Oo, & Ali, 2012).
UNDERSTANDING THE PROBLEM
222kWp CdTe Photovoltaic Generator: Case Study
A 222 kWp thin-film CdTe (Cadmium Telluride) parking-integrated grid-connected PV facility was commissioned in May 2009 at the University of Murcia, in the south-east of Spain. This facility started to operate and store information on 10th July 2009. Its main characteristics are:
• 7° tilt
• 2,263.36 m² surface
• 30 inverters of 7 kVA
The modules are divided into 10 groups and 30 subgroups, as Figure 1 shows; each subgroup is connected to an inverter, giving a total of 30 inverters.
Figure 1. Groups of PV modules forming the PV generator,
group organization and inverter identification
The 30 subgroups of modules are connected to the inverters as follows: 13 parallel strings of 8 series-connected modules each, except for 3 subgroups of FS-267 modules (First Solar FS Series PV Module, 2011), which have 14 parallel strings. Each of these subgroups feeds the grid through an inverter with a capacity of 7 kVA. The 30 single-phase inverters are grouped and work in coordination to generate a three-phase AC signal. The inverter model used at the PV facility is the SMC7000HV (SMA Solar Technology AG, 2012).
Each inverter can send information at a user-defined interval. In order to record the information provided by the inverters, a monitoring system has been installed; it is divided into three subsystems, as shown in Figure 2: the data acquisition system, permanent data storage and data access.
Figure 2. Relations between data acquisition system,
permanent data storage and data access
The aim of the Data Acquisition System is to gather the information from the inverters and environmental sensors.
The information sent by each one of the 30 SMC7000HV inverters every five minutes is shown in Table 1.
Table 1. Parameters sent by SMA SMC7000HV inverters: name (tag) and description
Table 2. Parameters sent by the Sunny Sensor Box: name (tag) and description
The database currently holds more than 200,000 records (16 field attributes) of electrical parameters for each inverter that forms the PV generator, and more than 200,000 records (6 field attributes) with environmental information.
The 222 kWp CdTe parking-integrated grid-connected facility can be considered a large-scale photovoltaic system. Fixed shadows, temporary shadows, bad weather or dirty modules are common circumstances that inverters cannot detect. Under these circumstances, a group of modules will lose efficiency, and the inverters connected to a group of modules affected by bad conditions will inject less electricity into the grid than the others. No mechanism exists that monitors these disparities in performance between inverters of the same installation.
Performance Assessment
With the aim of identifying malfunctions of the PV generator, a comparison between groups of modules could be a solution, as well as a comparison between a group of modules and the performance of the whole system. Each group of modules is connected to an inverter, which sends the parameters described in Tables 1 and 2. This comparison could check parameters that reflect the performance of the group of modules or possible inverter errors. The parameters used in this study are:
• Performance Ratio (PR). The PR is the ratio between the power generated by the modules and the theoretical output power they should produce under the given irradiation conditions. When an inverter has an average PR smaller than the rest, it could indicate a special condition, such as fixed shadows, or a potential error. Figure 5 shows the number of times each inverter reaches a given PR value, namely the frequency of PR for each inverter. For instance, the inverter named "Inverter I.7.2" reaches a high PR more often than the rest. Figure 3 shows the relation between the PR of the whole PV generator and that of four groups of modules.
• Output power. If an inverter injects less AC power (Pac) into the grid than the others, it may be due to causes that can be explained by the information in the database. For example, inverter "Inverter I.2.1" is affected by a shadow at certain times that reduces its power generation, as Figure 4 shows.
• Input/output performance. The relation between the input and output power of an inverter is a sign of a possible error in performance. Figure 6 shows how a shadow affects the input/output performance of inverter I.2.1.
Figure 3. Relation between the PR of the whole PV installation and the PR of four groups of modules
Figure 4. Power output of three inverters and the total power
output of the PV system during ten hours. A shadow affects
inverter I.2.1 in the morning
Figure 5. Frequency of performance ratios obtained for each
inverter
Figure 6. Relation between input and output power indicates a
possible error in the inverter or a shadow affecting this group
of modules. Inverter I.2.1 has smaller performance
Data mining techniques can thus help to monitor and improve the performance of large-scale PV generators.
DATA MINING TECHNIQUES AND THE PERFORMANCE OF PHOTOVOLTAIC GENERATOR
Preprocess
A necessary and laborious first step is pre-processing. Some of the stored data should not be taken into account in the learning process, because they are incomplete or erroneous.
The criteria used to select the useful dataset, which will be fed to the data mining techniques, are the following:
• The inverter Status parameter must be greater than 0, because 0 is the error code. In addition, the records were limited to the period between 6:00 in the morning and 22:00 at night.
From the total of 30 tables with the same attributes from different inverters, one was chosen for this study: the inverter named wr7k_023_2000457991, hereafter referred to as inverter I.1.1. A total of 30,550 instances from this inverter fulfil the above-mentioned criteria.
From all the fields that compose a record, "Zac" (described in Table 1) was discarded because its value is not relevant to the energy balance. The date and the time were transformed using the following equations:
date = year*365 + month*31 + day
time = hour*3600 + minute*60 + sec
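A hedged sketch of this pre-processing step is shown below, assuming the records sit in a pandas DataFrame whose column names ("Status", "year", "month", "day", "hour", "minute", "sec", "Zac") are illustrative guesses at the schema rather than the original field tags.

```python
# Hedged sketch of the pre-processing described above; column names are assumed.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df[df["Status"] > 0]                        # drop error-code records
    df = df[(df["hour"] >= 6) & (df["hour"] <= 22)]  # keep 6:00-22:00 only
    df = df.drop(columns=["Zac"])                    # not relevant to the energy balance
    # Date and time encoded as in the equations above
    df["date"] = df["year"] * 365 + df["month"] * 31 + df["day"]
    df["time"] = df["hour"] * 3600 + df["minute"] * 60 + df["sec"]
    return df
```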
The study of the PV generator behaviour aims at looking for special situations. These situations can be prolonged in time, or can be caused by other, previous special situations. In order to include the relation between parameters over time in our search for control algorithms, the values of the Performance Ratio (PR) and of the power output (Pac) from a few days before are taken into account. Thus 13 new fields, described in Table 3, were added to each record.
The last field in Table 3 (Label) is the state identified by the expert. The meaning of the labels is defined in Table 4. This attribute is necessary in order to apply the data mining techniques, such as decision trees, decision rules, linear models, etc. This process was carried out thanks to expert knowledge, as explained in the next sub-section.
Table 3. Parameters added to each record in the database (name and description)
Pvgis: Average global irradiation on a tilted surface (W/m²)
G
Table 4. Possible states of the group of PV modules considered (label name and state description)
OK: Correct performance
For each record, the PR has been calculated following Equation 1:
(1)
with:
κ = peak power of the group of modules connected to the inverter per square meter (Wp/m²), i.e. 74.88.
Looking at the values of each parameter, an expert labelled each record. The parameters taken into account in the expert criteria were defined in Tables 1, 2 and 3.
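Because Equation 1 itself is not reproduced in the extracted text, the following sketch assumes the conventional performance-ratio definition (measured AC power divided by the nominal array power scaled by irradiance), using the κ value given above; the module area is a placeholder.

```python
# Hedged sketch of a performance-ratio calculation. The exact form of Equation 1
# is not shown in the text; this uses the conventional definition
# PR = measured AC power / (nominal array power scaled by irradiance).
G_STC = 1000.0  # standard test irradiance, W/m^2

def performance_ratio(p_ac_w: float, g_w_m2: float,
                      kappa_wp_m2: float = 74.88, area_m2: float = 75.0) -> float:
    """p_ac_w: measured AC power (W); g_w_m2: in-plane irradiance (W/m^2);
    kappa_wp_m2: peak power per square meter (value from the chapter);
    area_m2: module area of the subgroup (placeholder value)."""
    nominal_power_w = kappa_wp_m2 * area_m2            # peak power of the subgroup, Wp
    expected_power_w = nominal_power_w * g_w_m2 / G_STC
    return p_ac_w / expected_power_w if expected_power_w > 0 else float("nan")
```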
Table 4 defines the possible states of the group of modules following the labelling process.
Applying Techniques
From the whole intelligent system, we started by analysing the models that are useful for detecting and classifying the states of the generator. In order to perform this analysis, five data mining techniques based on decision trees and rules were chosen, among them:
▪ Random Forest. A classification algorithm that constructs a multitude of decision trees at training time and outputs the predicted class.
▪ RippleDown Rule learner. An algorithm that generates a default rule first and then exceptions to the default rule.
▪ Conjunctive Rule. An algorithm that implements a single conjunctive rule learner that can predict numeric and nominal class labels. A rule consists of antecedents "ANDed" together and the consequent (class value) for the classification/regression.
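As a hedged sketch of this kind of evaluation (the chapter's rule learners have no direct scikit-learn counterpart), the following compares tree-based classifiers under a 3x5-fold protocol; DecisionTreeClassifier stands in loosely for C4.5.

```python
# Illustrative comparison of tree-based classifiers with 3x5-fold cross-validation,
# mirroring the evaluation protocol described in the Results section.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def compare_classifiers(X, y):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "Decision tree (C4.5-like)": DecisionTreeClassifier(random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: mean accuracy = {scores.mean():.4f}")
```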
RESULTS
Techniques Based on Decision Trees
A dataset of 30,550 instances with 35 attributes was labelled thanks to expert knowledge. The set of labels used is F, E1, T, T1, M, E2/S, F1, S and OK, as defined in Table 4 for the labelling process.
To analyse the results obtained in the learning process in terms of prediction accuracy, we use 3x5-fold cross-validation. Therefore, we report the average of the results obtained during the validation process.
The results obtained by applying the five techniques indicated before are shown in Table 5. The first column of this table indicates whether the method is interpretable or not; μ is the average percentage of correctly classified instances, and σ is the average root mean squared error.
Table 5. Average accuracy (μ) and root mean squared error (σ) of the applied methods
Interpretable  Method  μ  σ
C  RippleDown Rule learner  99.951  0.012
In particular, the decision tree obtained by C4.5 at the end of the process (25 leaves and size 749) is shown in Figure 7. This study focuses on C4.5 because it is interpretable and useful for our goals.
Figure 7. Decision tree obtained in C4.5 learning method,
reading 30550 instances of PV generator information
The obtained results show that the main parameter for deciding whether an inverter's performance is acceptable is the performance ratio (PR). This result is meaningful because a group of modules whose PR is greater than 0.7 can be considered to behave correctly, whereas when the PR is less than 0.7, from a PV expert's point of view, something is obstructing the correct operation of the group of modules.
The branch with the condition "Upv_Soll > 786.5" was considered by the expert to be a system error. The resulting tree also indicates this branch as an error, except for a small range that it classifies as bad atmospheric conditions; nevertheless, we have treated all states within this range as system errors. The information contained in the database that belongs to this set has attributes with wrong values.
The branch with the diffuse irradiation condition "Gd <= 28" can be explained as well. Diffuse irradiation is low when direct irradiation is high. Gd is a standard parameter given by PVGIS (European Commission, 2011). This comparison therefore indicates that when the average irradiation is high and the PR is low, a problem may be affecting the performance of the modules. When the PR obtained on the days before this finding is low as well, the tree interprets that the modules are dirty or that a shadow is affecting the area.
Managing Without Environmental Information
A new challenge is now addressed: the technique was trained on a dataset containing no environmental sensor information.
A full monitoring system, including the measurement of environmental parameters, is not installed in every large-scale installation. Inverters currently on the market are able to send data about input/output power and performance, but environmental information sent by sensors cannot be found in every PV facility. To tackle the problem of obtaining a decision tree that does not require environmental information, the same computational intelligence techniques were applied to the same instances as in the previous study, but changing some attributes in order to omit the environmental ones. Thus, the attributes described in Table 2, i.e. IntSollrr, OpTm, Windvel, TmpAmb and TmpMdul, are now omitted.
The considered attributes (described in Tables 1 and 3) in this new study are: day, hour, Iac_Ist, Ipv, Pac, Uac, Upv_Ist, Upv_Soll, PR, PR1, PR2,
PR3, PR4, PR5, Pac, Pac1, Pac2, Pac3, Pac4, Pac5, PVGIS mes, PVGIS hora, G, Gd, label.
A 3x5-fold cross-validation was executed in order to validate the obtained decision tree: 99.87% of the instances were classified correctly (μ), with a root mean squared error (σ) of 0.019.
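A hedged sketch of this experiment is shown below: the Table 2 attribute names come from the text, while the DataFrame layout, the label column name and the choice of scikit-learn's DecisionTreeClassifier as a rough C4.5 stand-in are assumptions.

```python
# Sketch of the "no environmental sensors" experiment: drop the Table 2
# attributes and re-estimate accuracy with the same 3x5-fold protocol.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

ENVIRONMENTAL_ATTRIBUTES = ["IntSollrr", "OpTm", "Windvel", "TmpAmb", "TmpMdul"]

def accuracy_without_environment(df, label_column="label"):
    X = df.drop(columns=ENVIRONMENTAL_ATTRIBUTES + [label_column], errors="ignore")
    y = df[label_column]
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                             cv=cv, scoring="accuracy")
    return scores.mean()
```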
The differences between the decision trees obtained with and without environmental attributes can be seen in Figures 7 and 8 respectively. Table 6 shows these differences in two rules.
Table 6. Disagreements found in the decision trees: equivalent branch with and without the irradiation attribute
In spite of the equivalence found, only 21 of the more than 30,000 instances belong to this branch of the tree. If this branch were discarded, the error would increase by only 0.0007%. This result shows that the insolation parameter is not a decisive parameter when the aim is to predict the state of the studied group of modules.
In the same way that an equivalence was found between these three parameters when the decision tree was being modelled, other relationships between attributes could be found by replacing or omitting some of them.
The results show that, with the set of information used in this study, it is possible to obtain an accurate decision tree while managing without the environmental information, using only the average PVGIS weather data.
Inference Phase
A subset of information from inverter I.1.1 was selected to infer the state of the PV generator using the decision tree obtained without environmental information. Between the 18th and 22nd of October 2012, the PV generator experienced complex weather conditions, as Figure 9 shows.
Figure 9. Performance Ratio of the whole PV generator and output power of the group of modules connected to inverter I.1.2 during days with special atmospheric conditions: sand storm, rain and a sunny day
▪ Days 2 and 3 (2012-10-19 and 2012-10-20). Modules are dirty due to the sand storm of the day before.
During this period the irradiance sensor was broken. The performance ratio was therefore re-defined by replacing the irradiation parameter G in Equation 1 with an approximation: the irradiation can be approximated by multiplying the current output by a constant parameter λ, which can be experimentally calculated from the stored information, as Equation 2 shows. The new performance ratio (PR′) is then defined in Equation 3.
G ≈ λ · I (2)
(3)
with:
λ= constant factor that defines the relation between irradiation and generated current from group of PV modules
κ = Peak power of the group of modules connected to the inverter per square meter, Wp/m2, i.e. 74.88.
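Since Equations 1 and 3 are not reproduced in the extracted text, the following sketch assumes the same conventional PR definition as before, with the irradiance replaced by the λ·I approximation of Equation 2; λ is fitted by least squares from periods when both irradiance and current were recorded, and the area value is a placeholder.

```python
# Hedged sketch of the modified performance ratio PR' used when the irradiance
# sensor is broken: irradiance is approximated as G ~= lambda * I (Equation 2)
# and substituted into the PR definition assumed earlier.
G_STC = 1000.0  # standard test irradiance, W/m^2

def estimate_lambda(g_samples, i_samples):
    """Least-squares estimate of lambda from stored (irradiance, current) pairs."""
    num = sum(g * i for g, i in zip(g_samples, i_samples))
    den = sum(i * i for i in i_samples)
    return num / den

def performance_ratio_prime(p_ac_w, i_out_a, lambda_factor,
                            kappa_wp_m2=74.88, area_m2=75.0):
    g_approx = lambda_factor * i_out_a                     # Equation 2
    expected_w = kappa_wp_m2 * area_m2 * g_approx / G_STC  # nominal power at g_approx
    return p_ac_w / expected_w if expected_w > 0 else float("nan")
```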
The decision tree shown in Figure 8 is used to infer the state from the varying-weather data set. Figure 10 shows the results of the model.
Figure 10. Result of classification during varying weather
The results show that a deeper study is needed in order to improve the identification of dirty modules, the appearance of shadows and special weather circumstances. In spite of the good behaviour of the model when the inverter works with a high performance ratio, most of the low-performance-ratio records were identified as fixed small shadows (F1), and a small fraction as adverse atmospheric conditions. The "dirty modules" state was not identified correctly.
CONCLUSION
Data mining techniques have so far been widely used to study and predict solar electricity production and to detect failures or losses due to malfunctions. In this brief study we propose a new methodology to identify the state of a PV generator when environmental information is not available.
A decision tree model was obtained with the C4.5 learning algorithm, which was trained with more than 30,000 instances of inverter technical information and environmental sensor data. The training was repeated without the environmental sensors' information and a relation between the two trees was identified. Nevertheless, wider training is needed to obtain more precise algorithms.
Some of the results are promising, since experts have assessed them, but a comparison between inverters of the same installation has to be made in order to find other performance parameters that can be used to obtain more accurate state predictions and to compare and validate the results.
This study is an incursion into decision-system design from a computational point of view, in order to identify and assess the state of large-scale PV systems. It presents a small step towards the construction of a hierarchical process to predict the states of a large-scale PV generator.
Future work on a small PV system is planned, carrying out in-situ real simulations of shadows, dust, etc. in order to compare the power output behaviour of the PV modules at the same time under different conditions.
This work was previously published in Soft Computing Applications for Renewable Energy and Energy Efficiency edited by Maria del Socorro García Cascales, Juan Miguel Sánchez Lozano, Antonio David Masegosa Arredondo, and Carlos Cruz Corona, pages 300-319, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
ACKNOWLEDGMENT
This work was supported by project TIN2011-27696-C02-02 of the Ministry of Economy and Competitiveness of Spain, and project MAT2010-21267-C02 of the Spanish Ministry of Science and Innovation.
REFERENCES
Ammar, M. B., Chaabene, M., & Elhajjaji, A. (2010). Daily energy planning of a household photovoltaic panel. Applied Energy, 87(7), 2340–2351. doi:10.1016/j.apenergy.2010.01.016
Baños, R., Manzano-Agugliaro, F., Montoya, F. G., Gil, C., Alcayde, A., & Gómez, J. (2011). Optimization methods applied to renewable and
sustainable energy: A review. Renewable & Sustainable Energy Reviews , 15(4), 1753–1766. doi:10.1016/j.rser.2010.12.008
Ben Salah, C., Chaabene, M., & Ben Ammar, M. (2008). Multi-criteria fuzzy algorithm for energy management of a domestic photovoltaic
panel. Renewable Energy , 33(5), 993–1001. doi:10.1016/j.renene.2007.05.036
Catelani, M., Ciani, L., Cristaldi, L., Faifer, M., Lazzaroni, M., & Rinaldi, P. (2011). FMECA technique on photovoltaic module. In Proceedings of 2011 IEEE Instrumentation and Measurement Technology Conference (I2MTC) (pp. 1-6). IEEE. 10.1109/IMTC.2011.5944245
Chaabene, M., Ammar, M. B., & Elhajjaji, A. (2007). Fuzzy approach for optimal energy-management of a domestic photovoltaic panel. Applied
Energy , 84(10), 992–1001. doi:10.1016/j.apenergy.2007.05.007
Cirre, C. M., Berenguel, M., Valenzuela, L., & Klempous, R. (2009). Reference governor optimization and control of a distributed solar collector
field. European Journal of Operational Research , 193(3), 709–717. doi:10.1016/j.ejor.2007.05.056
Collins, E., Dvorack, M., Mahn, J., Mundt, M., & Quintana, M. (2009). Reliability and availability analysis of a fielded photovoltaic system.
In Proceedings of 2009 34th IEEE Photovoltaic Specialists Conference (PVSC) (pp. 2316-2321). IEEE. 10.1109/PVSC.2009.5411343
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39(11), 27–34. doi:10.1145/240455.240464
Hernández Orallo, J., Ramírez Quintana, M. J., & Ferri Ramírez, C. (2005). Introducción a la minería de datos . Madrid: Pearson Prentice Hall.
Hossain, M. R., Oo, A. M. T., & Ali, A. B. M. S. (2012). Hybrid prediction method of solar power using different computational intelligence
algorithms. In Proceedings of Universities Power Engineering Conference (AUPEC), 2012 22nd Australasian (pp. 1-6). AUPEC.
10.4236/sgre.2013.41011
Roman, E., Alonso, R., Ibanez, P., Elorduizapatarietxe, S., & Goitia, D. (2006). Intelligent PV Module for Grid-Connected PV Systems. IEEE
Transactions on Industrial Electronics , 53(4), 1066–1073. doi:10.1109/TIE.2006.878327
Sallem, S., Chaabene, M., & Kamoun, M. B. A. (2009). Energy management algorithm for an optimum control of a photovoltaic water pumping
system. Applied Energy , 86(12), 2671–2680. doi:10.1016/j.apenergy.2009.04.018
SMA Solar Technology AG. (2012). Sunny mini central 7000HV. SMA Solar Technology AG. Retrieved from
http://www.sma.de/en/products/solar-inverters/sunny-mini-central/sunny-mini-central-7000hv.html
Wang, Y., Zhang, P., Li, W., & Kan’an, N. H. (2012). Comparative analysis of the reliability of grid-connected photovoltaic power systems.
In Proceedings of 2012 IEEE Power and Energy Society General Meeting (pp. 1-8). IEEE. doi:10.1109/PESGM.2012.6345373
Witten, I. H., & Frank, E. (2011). Data mining: Practical machine learning tools and techniques with Java implementations (3rd ed.). San Francisco, CA: Morgan Kaufmann.
Zhang, P., Li, W., Li, S., Wang, Y., & Xiao, W. (2013). Reliability assessment of photovoltaic power systems: Review of current status and future
perspectives. Applied Energy , 104, 822–833. doi:10.1016/j.apenergy.2012.12.010
KEY TERMS AND DEFINITIONS
Data Acquisition System: Process of sampling signals that measure real world physical conditions and converting the resulting samples into
digital numeric values that can be manipulated by a computer.
Data Mining: Computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems.
Electrical Data: Performance parameters sent by electronic devices, for instance a solar inverter.
Environmental Sensors: Devices that detect the magnitude of various ambient conditions such as humidity, light intensity, wind speed, etc.
Inverter: Device that converts the variable direct current (DC) output of a solar panel or group of solar panels into a utility frequency alternating
current (AC).
Photovoltaic Generator: Power system designed to supply usable solar power by means of photovoltaics. A photovoltaic system comprises solar panels, a solar inverter, mounting, cabling, and sometimes batteries.
CHAPTER 63
Costs of Information Services and Information Systems:
Their Implications for Investments, Pricing, and Market-Based Business Decisions
Steve G. Parsons
Washington University, USA
ABSTRACT
The large number of cost terms in use regarding information services contributes to confusion in discussion and cost analysis. This confusion can
largely be resolved by focusing on decisions rather than on products and cost terms. This decision focus is consistent with the proper application
of total cost of ownership approaches and a real options perspective for evaluating managerial flexibility. Information services also tend to
display public good-like characteristics. The non-rivalrous nature of production (i.e., low marginal cost of production) and the perishability of
services have critical implications for investments and complex pricing. In addition, some information services and other Internet-based services,
can display network effects, which also have important implications for managing the life cycle of the service. Finally, the implications of big data
for information services are briefly considered.
INTRODUCTION AND BACKGROUND
Every rational economic decision should be based on a comparison of the present value of the stream of costs of the decision with the present value of the stream of benefits of that same decision (often the benefits are changes in revenue streams). One then need only overlay an evaluation of the risks of the decision to be complete. However, the use of standard cost terminology and a lack of focus on the business decision can lead to errors in analysis and in the business decisions being made.
This chapter begins with a discussion of the types of cost terminology that are sometimes used and how this terminology often leads to confusion. It continues with a way to solve the terminology confusion problem by focusing on each decision and the consequences of that decision. This decision-based perspective is consistent with Total Cost of Ownership (TCO) evaluations. It is also consistent with identifying sources of potentially sunk costs and methods of avoiding sunk costs by using a real options perspective to search for sources of managerial flexibility.
A useful perspective is to recognize that information services and other internet-based services have public good-like characteristics. This has
important implications for thinking about the costs of the services, and how best to price those services. Perishability and value decay over time
are also important factors for pricing, and can be used to assist in price discrimination for information services sold outside the firm. Finally, the
chapter finishes with a discussion of network effects and their implications for information services.
The objective of this chapter is to draw upon the economics literature to assess topics of importance to the non-economists working in
information services. While the chapter invokes the technical literature in economics (and related disciplines), the chapter itself is designed to be
completely accessible to anyone with an interest in information services and other internet-based services. While there are graphs and simple
numerical examples, there is virtually no math, nor difficult mathematical notation.
COST TERMINOLOGY AND CONFUSION
Unfortunately, costs are often misunderstood, or misused, in a variety of industries and across departments within companies and government
agencies. Often, classic cost terminology actually can lead to more confusion than clarification in a decision-action context.
Consider the following list of typical cost terms and cost phrases, listed in alphabetical order, but not intended to be exhaustive: accounting cost,
activity based cost, administrative cost, average cost (often in combination with other terms such as average variable cost), avoidable cost,
capacity cost, conversion cost, cost of debt, cost of equity, cost of goods sold, cost of money, cost overrun, differential cost, direct labor cost,
explicit cost, external cost, feature cost, function cost, incremental cost, inventory cost, labor cost, life cycle cost, long-run cost (often in
combination with other terms such as long-run average total cost), manufacturing cost, marginal cost, materials cost, minimum efficient cost (the
cost at minimum efficient scale), non-manufacturing cost, non-period cost, opportunity cost, overhead costs, period costs, private cost, product
cost, public cost, psychic cost, repugnancy cost, search cost, short–run cost (often in combination with other terms such as short-run marginal
cost), social cost, standard cost, sunk cost, transactions cost, variable cost, and volume-sensitive cost.
This proliferation of cost terms may in part be due to a desire to have (or create) a cost term that is unique to a specific set of circumstances.
However, on balance, I find the proliferation of cost terms, and some cost terminology in particular, leads to more confusion than clarification.
Standard Economic Terms and Confusion
Fixed vs. Variable or Marginal Costs
Most economic textbooks have the discussion of costs follow a chapter on production in which some inputs are described as fixed in the short run
and some are described as variable. A classic definition of a fixed input is “when the quantities of plant and equipment cannot be altered” (Allen
et al., 2013, p. 174). Fixed costs are then described as the costs associated with fixed inputs. This discussion generally glosses over a distinction
between being fixed in acquisition (lumpiness in the practical ability to choose the size of the fixed input) and fixed in use (the input is either on
or off in use, e.g., a nuclear reactor, which has either reached sustained nuclear chain reaction or it has not). Unfortunately, an input which is
“fixed” with respect to one business decision (e.g., producing additional units in order to satisfy additional quantity demanded for a one-month
promotion) may not be fixed with respect to a different business decision (e.g., expanding into a new geographic territory).
The cost discussion has the appearance of rigor when marginal cost is defined as dC/dQ or the first derivative of the total cost function with
respect to quantity. But this apparent rigor simply shuffles the responsibility elsewhere to the definition of the total cost function. For example,
one might define the total monthly cost function as C = $1,000,000 + $2Q; where marginal cost (dC/dQ) is constant at $2, and the $1,000,000 is
implicitly assumed to be fixed. The calculus did not somehow solve for what cost is fixed; that was already determined in the specification of the
total cost function.
In a complex world, what is fixed or variable is not the only source of confusion. Many years ago Alchian (1968) mathematically described
alternate interpretations of the marginal costs of a firm producing only one product. He provided nine mathematical propositions related to
interpretations of marginal costs with respect to: lead time or the date at which production would begin, the length of the time period of the
production run, the rate of output per period, and the total volume of output. Cost issues become more complex in the context of a multi-product
firm. They also become more complex when the firm produces services rather than products. Another way of describing part of the complexity
mathematically is that, even if one could “solve” the fixed cost issue, “Q” could properly refer to a byte of a particular type of information, the
capacity to obtain that information, the ability to store the information, or the ability to analyze (in a very specific way) that byte. Moreover, the
relevant Q may not be best represented by a byte, but rather by a minute of usage, a message, a new customer, a sub-account for an existing
customer, a new report, a line item on a report, a line item for billing, etc.
Long-Run vs. Short-Run Costs
Another classic economic cost distinction is that between the long-run and short-run. The standard definition of the long-run is one in which no inputs are fixed and the short-run is one in which at least one input is fixed. While the terminology conjures images of a time period, it is not time per se that determines the long-run, but rather the fixity of an input. Hence, technically, there is no standard that suggests that the short-run is a period of less than a year, or less than five years, or any other period.
Of course, given the discussion above regarding shortcomings of fixed inputs and fixed costs constructs, one should rightfully be suspicious of the
long-run v. short-run distinction that is based on fixity of an input. This may be part of the reason why analysts are inconsistent in their
recommendation regarding which cost to use for pricing. While many regulators (in electric power and telecommunications) endorse or require
the use of long-run costs, many economists have suggested the use of short-run marginal costs as germane to pricing including: Brown &
Johnson (1969), Della Valle (1988), Fisher (1991), and Rees (1976). However, some (e.g., Coase (1938); Knight (1921), p. 322; Parsons (2002),
section 4.4; and Stigler (1966), p. 135) have suggested that the long-run v. short-run cost distinction should be more of a guide to thinking of a
continuum of costs for a potential continuum of different business decisions. This is different from believing that there is one long-run cost and
one short-run cost, and one must choose between them. The long-run/short-run distinction has created confusion in legal circles at times as well, for example with respect to the relevant costs to use in determining the existence of predatory pricing (Areeda & Turner, 1975; Bork, 1978; Baumol, 1989). Unfortunately, the same input that is "fixed" with respect to one decision is not fixed with respect to a different decision, even if the two decisions are made at virtually the same time.
THE SPECIFIC BUSINESS DECISION SHOULD DEFINE WHICH COSTS ARE RELEVANT AND WHICH ARE NOT
Fortunately, there is a way in which to resolve this confusion. The key in identifying relevant costs (and for that matter relevant revenues or
benefits) is to focus on the decision and the consequences of the decision. The term “decision” here refers to both the act of making the decision
and taking an action to implement the decision. In almost every instance, the decision will be to choose between two competing alternatives. For
example, one decision would be to “purchase” and implement a completely new information system for a company. The implied competing
alternative is to continue with the existing information system (perhaps with some minor changes if a new system is not implemented). This is a
completely different decision (likely with different costs) from a decision to change the pricing structure related to the services provided outside
the firm (with the services provided by an existing information system that will not change).
To employ this technique one must perform the following three steps:
1. Precisely define the decision, i.e., the two competing alternative scenarios being compared;
2. Identify what is different between the two scenarios; in particular, identify the difference in the resources that are used under each alternative;
3. Quantify the value of the resource difference between the alternatives (and the differences in benefits) in order to make the final decision.
After these three steps, one must still bring the value differences back to a common date with present value techniques and then overlay risks, but
for now we hold these two final steps to the side and focus on the primary three tasks.
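As a hedged illustration of these three steps and the subsequent present-value step, the following sketch compares two alternatives using invented cash flows; the helper function and all figures are assumptions, not figures from the chapter.

```python
# A minimal sketch of the decision-focused comparison: only the cash flows that
# differ between the two alternatives matter, and they are brought back to a
# common date by discounting. All figures are invented for illustration.
def present_value(cash_flows, annual_rate):
    """Discount a list of year-end cash flows (year 1, 2, ...) to today."""
    return sum(cf / (1 + annual_rate) ** t for t, cf in enumerate(cash_flows, start=1))

# Step 2: identify what differs between "accept the decision" and "do nothing".
incremental_revenues = [120_000, 120_000, 120_000]   # extra revenue per year
incremental_costs    = [90_000, 40_000, 40_000]      # extra resources used per year

# Step 3: value the difference, then discount to a common date.
incremental_net = [r - c for r, c in zip(incremental_revenues, incremental_costs)]
npv = present_value(incremental_net, annual_rate=0.10)
print(f"NPV of the decision: {npv:,.0f}")  # accept if positive (before the risk overlay)
```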
Consider a stylized hypothetical example (but one that is typical of those I have seen many times). Imagine that a potential customer has asked a telecommunications provider for a custom bid to supply a high-speed connection between point A and point C (with transit through point B) for three years. The alternative is that the telecommunications provider declines to bid and sells nothing new to the potential customer.
Focus first on the possible costs associated with capital assets. In this example, the provider has existing capacity to provide the service over the
next three years between points A and B. However, between points B and C existing facilities are insufficient and additional electronics would be
required in order to serve the custom bid (that would not have been spent in the absence of the custom bid). In this case, there are no capital
costs associated with the provision of the customer bid service between points A and B, although there could be some very small increase in the
costs of maintenance (or the probability of additional maintenance) for those facilities. In contrast there are real capital costs associated with the
new electronics required for transmission between point B and C. And there are difficult questions to be answered regarding the residual value of
the electronics three years hence when the custom bid comes to an end (and the new capital assets placed for the B-to-C route may, or may not,
be reused).
This is consistent with a thread in the economics literature that stresses cost and choice. This thread includes the work of Nobel Laureate Sir Ronald Coase (1938), Nobel Laureate James Buchanan (1969), and writings that are sometimes associated with the Austrian School of thought and the London School of Economics regarding costs, such as von Mises (1949), Robbins (1934) and von Hayek (1937). For example, Coase (1938) states:
The first point that needs to be made and strongly emphasized is that attention must be concentrated on the variations which will result if a
particular decision is taken, and the variations that are relevant to business decisions are those in costs and/or receipts.… Costs and receipts
which will remain unchanged whatever decision is taken can be ignored. (p. 1)
This perspective is more likely to be captured in managerial accounting (as compared to other branches of accounting). In addition, the focus on
comparing alternatives is more likely to exist in treatments on managerial economics and engineering economy than in most other sub-
disciplines of economics.
Practical Implications of a Focus on Decisions
Businesses must make a host of decisions over time. These include choice of technology, choice of vendor, choice to make or buy, length of
contract, basic business strategy, initial choice to enter an industry, entry into geographic areas, expansion, exit from a geographic area, exit from
an industry, sale of assets, other forms of contraction, initial price levels, initial price structures, changes in price structures or price levels, special
promotions, over-time runs and many others. If business analyses were costless and instantaneous, one would perform a special study for every
decision the business makes, and the relevant costs (and benefits/revenues) for that decision would likely be unique to that decision.
Unfortunately business and cost analyses take time and use resources. The practical implications of a focus on decisions include the following:
● Identify alternatives;
● Products and services do not (per se) have a cost; only a decision and a consequent action can have a cost. Precisely defining the decision
(i.e., the choice between specific alternatives) allows one to precisely define the relevant costs (and relevant revenues);
● A company must be forward looking; the consequences of a business decision all occur in the future. It is therefore the current and future
value of resources used (and revenues obtained) that are germane;
● Historical information may, in some circumstances, be used as a guide, when there is a reasonable expectation that history will repeat
itself;
● Design systems that generate cost (and revenue) information for those decisions that are the most common, or the most important;
● Employ longer, more detailed, descriptions for information (rather than cost terms that might be misunderstood);
● Use analysis short cuts (that may overstate costs and understate revenues) as a first pass. If the decision passes this test, no further
analysis is necessary. Additional analysis resources are only used for the “close” calls;
In some sense, one only needs a single cost term – “cost”; it is the value of the resources that are used up (or put to one use rather than another).
Cost is specific to an individual business decision and precisely defining the decision will precisely define the cost items to be measured. To be
sure, the cost measurement will often not be easy, but one cannot proceed with measurement until the items to be measured (or excluded from
measurement) are clearly defined.
This perspective does not mean that all cost terminology is useless. Rather, it means that one must be diligent to determine if the cost
information best matches the relevant decision. To the extent that a special study is not possible or warranted for a particular decision, one must
proceed with the information available. Ideally, systems that provide cost information have been designed to produce cost information that is
useful for evaluating the most important decisions the company will face and/or the most common decisions. If possible, the cost information
system should be sufficiently flexible, with more detailed descriptions of categories of costs (or potential costs), to approximately match the cost
information to the relevant decision.
Bounding Values for Cost Calculations
Because some business decisions are unique, the cost (and possibly the revenue) calculation process can be difficult. In such circumstances,
it will often be important to use short cuts in a first pass when evaluating the business decision. To illustrate, expand the short example above
with the customer bid for a customer needing a high speed telecommunications connection from point A to point C. Now imagine that this
requires a dozen links to provide the service (rather than two), and that there is currently no information on six of the links as to whether there is
sufficient capacity to serve the customer. In this case, one upper-bound approximation is to assume that new facilities will be required to serve
the customer. With this assumption, if the planned bid is still sufficient to produce a positive net present value (and/or to reach a hurdle internal
rate of return, discussed in more detail later in the chapter), then no further research on the capacity of the six links is required. One could also
perform a lower-bound calculation assuming that there is sufficient capacity on all six links and there is virtually no capital cost for those links. If
the estimate of the highest competitive bid is below the cost with this assumption, the company may wish not to spend resources bidding, or to
search for an alternate method of meeting the customers’ needs. Also, as part of the risk overlay (discussed in more detail later in this chapter), it
is critical to test the sensitivity of any key assumptions.
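As a hedged sketch of this bounding short cut, the following screens a bid under an upper-bound cost assumption (new facilities on every uncertain link) and a lower-bound one (spare capacity everywhere); the figures and hurdle rate are invented.

```python
# Sketch of the bounding short cut: evaluate the bid under upper- and lower-bound
# cost assumptions before spending resources on detailed capacity research.
def npv(cash_flows, rate):
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))

def screen_bid(annual_revenue, annual_cost_upper, annual_cost_lower,
               years=3, hurdle_rate=0.12):
    upper_npv = npv([annual_revenue - annual_cost_upper] * years, hurdle_rate)
    lower_npv = npv([annual_revenue - annual_cost_lower] * years, hurdle_rate)
    if upper_npv > 0:
        return "Bid: positive NPV even assuming new facilities on all uncertain links."
    if lower_npv < 0:
        return "Do not bid (or redesign): negative NPV even with spare capacity everywhere."
    return "Close call: research the capacity of the uncertain links before deciding."

print(screen_bid(annual_revenue=500_000, annual_cost_upper=430_000,
                 annual_cost_lower=300_000))
```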
Salvaging Economic Cost Terms under a Decision Focus
The endless variations on possible business decisions have greatly contributed to the proliferation and variation in cost terms. As noted above,
one could argue that the single term “cost” is all that is necessary and the relevant cost is determined only by the specific decision at hand. Given
this framework, one can salvage, at least partially, some of the standard economic cost terms, in the context of the types of decisions to which
they apply.
“Marginal Cost” (MC) is germane to decisions which cause small changes in the level of output, such as cutting the price of an existing service.
However, one must be careful in identifying to what “output” one is referring (e.g., minutes, messages, bandwidth, count of customers, or a
quantity at a specific time of day). The relevant rules include: 1) MC = MR (marginal revenue) for single-part monopoly pricing; 2) No unit sold
below MC, for complex pricing.
“Average Total Cost” (ATC, which includes “fixed” and “variable” costs) is best thought of as corresponding to a decision in which all factors are
within the control of the business. The basic rule is enter only if expected revenue per unit is greater than expected ATC, or expected Price > ATC
with single-part pricing.
“Average Variable Cost” (AVC) is best thought of as corresponding to a decision to exit an industry once some fixed and sunk costs have been
incurred. If all “fixed” costs are sunk (i.e., non-recoverable) then the basic rule is: exit if P < AVC.
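These rules can be restated compactly. The short sketch below simply encodes the entry and exit rules for a single product; the function names and figures are illustrative only, not drawn from the chapter:

# Textbook decision rules restated as simple predicates (illustrative only).
def expand_output(marginal_revenue, marginal_cost):
    # With single-part pricing, expand output while MR exceeds MC.
    return marginal_revenue > marginal_cost

def enter_market(expected_price, expected_atc):
    # Entry rule: all factors controllable, so price must cover average total cost.
    return expected_price > expected_atc

def stay_in_market(price, avc):
    # Exit rule once fixed costs are sunk: remain only if price covers average variable cost.
    return price >= avc

# Example: P = $8, ATC = $9, AVC = $6 -> do not enter, but do not exit if already committed.
print(enter_market(8, 9), stay_in_market(8, 6))   # False True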
Costs, Revenues, Cross-Elastic Effects and Profits
One complaint economists have is that some within the organization tend to focus on revenue, and not profit (e.g., sales or marketing personnel
compensated on the basis of revenue or revenue-based commission). The simple definition of profit is: revenue – cost. However, it is not always
obvious which revenues are germane to this definition. In the face of cross-elastic revenue effects (substitutes or complements) with other
services in the firm's service portfolio, net revenues may be greater than (in the case of complements) or less than (in the case of substitutes) gross
revenues from the primary product itself. Some cross-elastic revenue effects can indicate substitutes (representing possible cannibalization of
revenue between product or service offerings). Here the net effect of substitution yields a smaller measure of revenue than the gross measure
without accounting for cannibalization. If revenues net of cannibalization are used, the costs avoided (if any) by reduced sales of the substitute
should also be considered; one method by which to incorporate substitution/cannibalization effects is to subtract a measure of lost contribution
(revenues – cost avoided) from the substitute.
Some cross-elastic effects indicate complements, in which the sale of one product likely generates additional sales of complementary products;
here the net effect leads to a greater measure of revenue than the gross revenue measure for the primary product alone. Similarly, if net revenues (adding the revenues of
complements) are employed in the calculation, one must also account for any additional costs of producing greater quantities of the
complement(s). There can be a tendency for a product manager or a sales person to assert strong complementary relationships, or deny strong
substitution. Of course, good business decisions are not made on the basis of wishful thinking or simple assertion; rather they are based on good
data, sound analysis and testing the sensitivity of key assumptions in a risk overlay.
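For concreteness, the following sketch (with invented figures) nets out a cannibalized substitute and an induced complement when computing the incremental profit of a new offering; the lost contribution of the substitute is subtracted, and the complement's additional revenue is counted only net of its additional cost:

# Incremental profit of a new service, netting out cross-elastic effects (invented figures).
new_revenue = 1_000_000            # gross revenue of the new service
new_cost = 600_000                 # cost of providing the new service

# Substitute (cannibalization): lost contribution = lost revenue - costs avoided.
substitute_lost_revenue = 200_000
substitute_costs_avoided = 80_000
lost_contribution = substitute_lost_revenue - substitute_costs_avoided

# Complement: additional sales induced by the new service, net of the cost of supplying them.
complement_added_revenue = 150_000
complement_added_cost = 90_000
complement_contribution = complement_added_revenue - complement_added_cost

incremental_profit = (new_revenue - new_cost) - lost_contribution + complement_contribution
print(f"Incremental profit net of cross-elastic effects: ${incremental_profit:,.0f}")
# -> $340,000 rather than the $400,000 implied by the new service viewed in isolation.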
Bundling also creates difficulties in calculating relevant revenues. However, one can bound the values of revenue that are possible. When mixed-mode
bundling occurs (where each component of the bundle is also sold separately), the upper bound on the value of the revenue for any component of
the bundle is the price sold separately. The lower bound is the bundle price minus the sum of the separate prices of the other components of the
bundle. The larger the number of the bundled components and the larger the bundled discount (from the sum of the prices of the components
purchased separately), the more ambiguous the revenue value for individual components.
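Under mixed-mode bundling these bounds are straightforward to compute. The sketch below uses hypothetical component and bundle prices purely for illustration:

# Revenue bounds for one component of a mixed-mode bundle (hypothetical prices).
def component_revenue_bounds(component, standalone_prices, bundle_price):
    # Upper bound: the component's standalone price.
    # Lower bound: bundle price minus the standalone prices of the other components.
    upper = standalone_prices[component]
    others = sum(p for name, p in standalone_prices.items() if name != component)
    return bundle_price - others, upper

standalone = {"data_feed": 40, "analytics": 25, "support": 15}   # each also sold separately
print(component_revenue_bounds("data_feed", standalone, bundle_price=60))   # -> (20, 40)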
In the case of pure bundling (where the components are not available separately), it is better to think of decisions involving the entire bundle,
rather than considering each of the components separately.
Sunk Costs
While revenue calculations may be contentious, they tend to be far less contentious than cost calculations. One lesson from economics is that some costs
incurred in the past are simply not relevant to current decisions. If no aspect of the historical expenditure can be controlled or mitigated then it
can be considered a “sunk” cost. Sunk costs, like spilt milk, should not be cried over; they should simply be ignored. Imagine that a company
spent $7 million for a specialized software system six years ago that is used to provide an information system. Further, imagine that the
accounting organization assigned a life of seven years to this asset, to be depreciated on a straight-line basis. Some in the organization now argue
that the $1 million in depreciation for the coming year must be recognized in any cost calculations related to the services provided via this
software system; others in the organization argue that the $7 million purchase price is sunk, and now irrelevant.
This argument can only be resolved in the context of an actual business decision. First, consider a decision to re-price the information service that
is sold outside of the company and which relies upon the six-year-old software. Here, the software and its use are identical in the two scenarios: a)
the old price for the information service; b) the new price for the information service. The original software purchase, and its accounting depreciation, are
completely unaffected by the re-pricing and should not be part of the marginal cost (the cost per unit or subscriber) of information services.
The only circumstance that could possibly affect this conclusion is if the contract with the software company was one that was not a pure initial
purchase, but rather one that also had a form of revenue sharing, where ongoing software license fees were tied to the volume of information
services sold.
Consider now an alternate decision to shut down the information service and abandon the software. Here the two scenarios are: a) continue to
provide information services, and b) exit the business of providing information services (at least information services that use this software). This
decision is much more encompassing than the re-pricing decision, and it is at least possible that a broader array of factors is influenced by the
decision (i.e., differences between the scenarios). However, it is likely that even for this decision the abandonment of the software does nothing to
avoid any forward-looking costs. The costs of the software system are likely sunk and irrelevant to virtually any current or future decision,
regardless of the accounting treatment of the expenditure.
Managerial Flexibility and a Real Options Perspective
The discussion above does not mean that all past expenditures are sunk costs, nor that all past decisions are irrevocable. Imagine that the
company was allowed to resell the software system for $200,000. This $200,000 (but not the $7 million, nor the $1 million in remaining
depreciation) now becomes a real cost of continuing to provide the information services. The $200,000 is not a marginal cost (that is, it is not
affected by the quantity of information services sold, or the number of subscribers provided with the service). However, it is avoidable if the
company chooses to cease provision of the information service; the $200,000 differs between the scenarios of ceasing or continuing operations.
In essence, one could say that all but the $200,000 of the past expenditure is sunk, but that $200,000 is still an implicit opportunity cost of
continuing provision of the information service.
Existing capital assets can have two sources of opportunity costs. One, as in the example above, is a possible market opportunity to sell (or lease)
the asset outside the firm. Second, there may be a higher valued use of the asset elsewhere within the firm. Unfortunately, valuing the alternate
use within the company is often difficult, but no less real.
How then does one think about such sunk costs? As noted above, if a cost is truly sunk, then it should be ignored. However, part of the
implication is that the decision that creates a sunk cost (e.g., signing the contract to purchase the specialized software system) becomes even
more important (than if the asset were subsequently fungible).
There is a body of literature that deals with the topic of “real options” including Alleman (2002), Benaroch (2002), Benaroch & Kauffman (1999),
Benaroch et al (2007), Dixit & Pindyck (1994), Margrabe (1978), Trigeorgis (1999a), and Trigeorgis (1999b). This topic has its genesis in financial
markets and much of it is technical and mathematical. However, there are some intuitive implications for business decisions and business
management.
At the heart of these implications is the concept that flexibility has value. The literature (e.g., Angelou & Economedies, 2005, p. 4) describes
certain sources of flexibility, including: Defer (to wait to commit until additional information is available, or more favorable circumstances exist); Time to
Build (staged investment); Expand or Contract (to increase or decrease the scale of operations, usually in response to
demand); Abandonment (to discontinue and liquidate the assets); and Switch (to switch inputs or output mix).
One of the critical implications is that investments should be evaluated to determine the degree to which they retain some degree of flexibility.
Each of these sources of flexibility will tend to reduce the degree of sunk costs. Generally, the value of flexibility is driven in large part by risk and
uncertainty regarding demand and/or other aspects of the cost function. The greater the variation in key factors (risk) and uncertainty with
respect to that variability, the more dangerous sunk costs become.
There are, however, a variety of ways in which one can attempt to retain flexibility and limit the degree to which investments can become sunk. In
theory, one method would be to use insurance markets to insure against downside events (such as through Lloyds of London, famous for their
willingness – at a price – to insure almost anything). Unfortunately, insuring such business-specific outcomes is fraught with the economic issues
of asymmetric information; high search, transaction, and monitoring costs; and moral hazard. These characteristics make it highly impractical to
insure against such poorly defined events. However, there may be many opportunities to design contracts with suppliers to garner greater
managerial flexibility. Such contractual components could include:
1. The option to return a purchase for a partial refund if certain conditions exist (e.g., an insufficient number of customers using the item);
4. Partial rebates if the vendor lowers prices for the same item during a specified time period;
For example, an aluminum manufacturer could negotiate a price with the electric power provider (electric power is a major component of the
costs of manufacturing aluminum) under which the price it paid for electric power was tied, at least in part, to the market price of aluminum.
The greater the degree of managerial flexibility one can retain around an investment, the lower the risk. The less the degree of managerial
flexibility, the greater the degree of sunk (or potentially sunk) costs and the greater the risk.
TOTAL COST OF OWNERSHIP
Total Cost of Ownership (TCO, or total ownership cost, TOC, but unrelated to the Theory of Constraints, ToC), is a concept that is increasingly
used in business around the world (Ferrin & Plank, 2002). The essence of the concept is that the full costs of a decision should be evaluated,
rather than focusing on the initial purchase price (of hardware and software for example). The term TCO is relatively new, but the approach is
similar to notions of life cycle costs and other valid economic criteria for properly evaluating business decisions.
The term TCO is sometimes used to describe the full costs of one choice. In other instances, the term TCO is used to describe the comparison of
full costs of two competing alternatives. With either usage, the approach is fundamentally the same - to calculate the full opportunity costs of a
business alternative, over a period sufficient to capture the life of the primary asset. The term, and the approach, is especially popular (and
useful) in evaluating the possible deployment of a new technology, or the head-to-head comparison of two competing technology platforms.
When properly applied, TCO reflects the full opportunity cost of a decision (i.e., of taking one course of action rather than another). Therefore,
TCO reflects not only the initial purchase price of assets, but also the less obvious initial costs of training personnel to use the new assets, the
costs of upgrades over time, maintenance and operating expenses, net salvage value (negative or positive) of the asset at the end of its life, and
other costs.
Present Value
There are many different ways to consider organizing or categorizing the relevant costs. Some analysts recommend identifying direct costs (e.g.,
hardware and software purchase price) v. indirect costs (e.g., training and migration costs). I find it useful to also think about costs over time;
that is, the initial costs (both the obvious direct costs and the indirect costs) as well as the subsequent costs that will occur. Part of the reason this
distinction is important is that a decision should rest not on the nominal sum of the costs, but rather on the present value of the costs. A dollar
today is worth more than a dollar one year from now; a dollar of cost (or revenue) today is more important than a dollar of cost (or revenue) one
year from now. The relevant discount rate (interest rate) is the firm’s marginal cost of capital corresponding to a level of risk commensurate with
the project.
Example
Consider a hypothetical example. ABC retail is contemplating implementing a new billing system. The major components of the system will be:
hardware, software, initial training, other transition costs, subsequent software upgrades, subsequent training, ongoing maintenance and
operations. Salvage is zero. For simplicity, the billing system is assumed to last 3 years. ABC’s finance group informs us that the forward-looking
cost of capital is 15%.
The calculations in Table 1 implicitly assume initial purchase at time zero, and that other costs occur at the middle of each year (i.e., cash outlays
for year 1 are assumed to occur 6 months after the initial purchase, and those for year 2 occur 18 months after the initial purchase). In practice, this so-called
“mid-year” calculation is often a reasonable approximation if the initial purchase occurs at the very end of the prior year/start of year 1 and if
subsequent cash outlays are fairly evenly spread over the year. Monthly calculations (using a mid-month convention) are more accurate.
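The discount factors in Tables 1 and 2 follow directly from the 15% cost of capital and the mid-year convention just described; the brief sketch below reproduces them:

# Mid-year discount factors at a 15% annual cost of capital.
rate = 0.15
factors = {0: 1.0}
for year in (1, 2, 3):
    # Year-t outlays are treated as occurring t - 0.5 years after the initial purchase.
    factors[year] = 1 / (1 + rate) ** (year - 0.5)
print({year: round(f, 4) for year, f in factors.items()})
# -> {0: 1.0, 1: 0.9325, 2: 0.8109, 3: 0.7051}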
Part of the costs of changing to a new billing system is the loss of the human capital, expertise, and comfort employees have with the use of the
existing billing system. These may not readily appear as cash flows (unless some employees are paid overtime, due in part to additional time in
learning and using the new system). It is possible that adopting a new system could push employees’ use of time to a point that the firm hires an
additional employee. Unfortunately, measuring such effects is exceptionally difficult. Moreover, even if no overtime is billed and the new system
does not trigger (directly or indirectly) the hiring of a new employee, additional time spent in learning a new system has an opportunity cost; the
employee could have been doing something else of value.
Note that in Table 1 the nominal (not present value) cost of the new billing system is $9,850,000. The present value of the costs (brought to
today, time zero) is $8,970,000 over the three years. The most obvious costs are the hardware and software initial purchases, but these items
sum to only $3 million; not much more than a third of the total costs.
Table 1. Costs of new billing system in $1,000

                                Year 0    Year 1    Year 2    Year 3    PV Costs
Discount Factor                 1.0000    0.9325    0.8109    0.7051
initial hardware                $1,000                                    $1,000
initial software                $2,000                                    $2,000
initial training                  $800                                      $800
other transition costs            $800      $400                          $1,173
subsequent software upgrades                $1,000    $1,000      $500    $2,096
subsequent training                           $200      $200      $150      $454
maintenance & operations                      $500      $600      $700    $1,446
Is the new billing system a good decision? The information provided is far short of the necessary information to make such an assessment. Part of
the information necessary to evaluate the new billing system is the cost of continuing operations with the old billing system.
Note in Table 2 that the nominal (not present value) cost of the existing billing system is $8,270,000. The present value of the costs is
significantly smaller at $6,667,000 since the costs occur disproportionately towards the end of the period, while with the new billing system, the
costs occur disproportionately towards the start of the period. The differential between present value and nominal value is driven by: a) the
distribution of costs over time; and b) the discount rate (interest rate) employed.
Table 2. Costs of existing billing system in $1,000

                                Time 0    Year 1    Year 2    Year 3    PV Costs
Discount Factor                 1.0000    0.9325    0.8109    0.7051
hardware upgrades                 $200                $100                  $281
initial software
initial training
other transition costs
subsequent software upgrades                  $100      $100      $150      $280
subsequent training                            $40       $40       $40       $98
maintenance & operations                    $2,000    $2,500    $3,000    $6,007
One of the more important costs of the existing billing system is the relatively high costs of operations and maintenance. The new billing system
allows operations with a much smaller staff than the existing billing system.
Benefits
Should the new billing system be implemented? The total cost of ownership (in present value) of the new system is $8,970,000 v. a total cost of
ownership of $6,667,000 for the existing system. What one can say is that unless the new billing system allows new sources of contribution or net
revenue (net of any other additional costs not identified in the analysis so far) greater than $2,303,000, the new billing system is not a rational
business decision.
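A minimal sketch of this comparison, using the present-value totals from Tables 1 and 2, simply restates the break-even condition on incremental contribution:

# Comparison of the two billing systems (present values in $1,000, from Tables 1 and 2).
pv_new_system = 8_970
pv_existing_system = 6_667
required_incremental_contribution = pv_new_system - pv_existing_system
print(f"New system needs at least ${required_incremental_contribution:,}K of present-value "
      "contribution (new net revenue or other unmodeled savings) to be justified.")
# -> $2,303K, the figure cited in the text.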
In some instances, a new technology can be justified simply on the basis of the present value of cost comparison; reductions in operating costs
dominate (in PV) the cost comparison. Often, however, new technologies can only be justified (if at all) on the basis of cost comparisons as well as
opportunities for additional revenues.
It is possible that the new billing system could yield significant revenue benefits. These could include reduced billing errors, more rapid billing
(with the likelihood of increased present value of revenue streams), and perhaps the opportunity to engage in more complex pricing (the benefits
described in the subsequent section of this chapter).
Implications for Business Decisions
The most important implication of the TOC perspective is that the obvious and direct costs of a new system (such as a new billing system or a new
information system) are only part of the costs of fully implementing a new system. Often the indirect costs are significant, and can represent the
great majority of the total costs. In particular, it is easy to overlook the labor costs associated with changing systems and learning a new system,
for those using the new system. Knowledge of an existing system is human capital which is lost, and must be rebuilt if a new system is put in
place.
To help quantify costs, and control risks, it may be important to either perform a trial, or to learn from another firm that has already engaged in
the same (or similar) transition.
THE IMPLICATIONS OF INFORMATION SERVICES HAVING PUBLIC GOOD-LIKE CHARACTERISTICS
Non-Rivalrous and Non-Excludable Goods and Services
One sometimes hears the claim that information services are “public goods.” To evaluate such claims, and to understand the economic
consequences of information as a public good, one must first recognize the two critical characteristics of public goods; that the good be: 1) non-
rivalrous, and 2) non-excludable (Samuelson, 1954; Stiglitz, 1977; Stiglitz, 1999). A “non-rivalrous” good is one in which one consumer’s use of
the good does not reduce the availability or value of the good to other consumers. A private good, such as a hamburger, is fully “rivalrous.” One
consumer’s use of the good (eating the hamburger) completely displaces any other consumers from using or benefiting from the good. In the case
of a non-rivalrous good, such as a new song, one consumer listening to the song (or storing the song on a device) does not use up the song. From
the supply side, this means that “there is zero marginal cost from an additional individual enjoying the benefits …” of the good (Stiglitz, 1999, p.
309). Information services tend to have relatively low (or zero) marginal costs of additional customers using the information. The majority of the
costs are in collecting and creating the information and creating systems to analyze and distribute the information.
A “non-excludable” good is one in which it is not possible to exclude other users from benefitting from the good or service. With private goods,
again such as a hamburger, one customer’s consumption of the hamburger automatically excludes all other uses and users of the hamburger. Two
classic examples of non-excludable goods are national defense and a fireworks display. It is not possible (or certainly not practical) to exclude any
citizen from the benefits of national defense nor from the benefits of a fireworks display. Note, however that there may be geographic limits to a
non-excludable good. National defense systems are generally designed to provide little (or certainly less) value outside the boundaries of the
country creating the national defense system. The benefits of a fireworks display are relatively narrow geographically. With most information
services, geography plays relatively little role with respect to excludability.
In contrast to private goods like hamburgers, intellectual property (IP) (or IP-like services such as a patented process, a song, or a movie) has
characteristics that may make it difficult to exclude users. Indeed, protection of IP via patents, trademarks and copyrights provides codified legal
property rights protection to assist in exclusion. From a societal perspective, such enforceable IP property rights protection offers incentives for
innovation (to those developing IP) in exchange for disclosure and long-run release of the IP into the public domain (Stiglitz, 1999; Varian et al,
2004, part II).
One can think of any information system as producing information services. Those services may be used only internally, used internally and
externally (but not for sale) and/or provided (sold) outside the company. If the information service is sold outside the company, pricing (and
other marketing) decisions are important. In addition, when information services are sold outside the company (or if the firm is contemplating
sale outside the company), the degree to which the services display non-rivalrous and non-excludable characteristics is critical.
Information services often do not fit into the category of IP protected by patents, trademarks and copyrights. However, such information may still
be protected as confidential and proprietary. Moreover, many of the economic pricing and marketing implications for protected IP also apply to
information services. In any circumstance, in order to sell an information service, or IP, outside of one’s own company, the service must be
excludable, at least to some degree.
Pricing a Rivalrous Good
To illustrate the pricing implications of rivalrous vis-à-vis non-rivalrous goods, begin with a simple stylized demand curve. Figure 1a graphs DH,
a simple linear inverse demand function with the formula P = $10 - Qd. Qd is the monthly quantity of a good and P is the
price consumers are willing and able to pay for that good. With this simple linear demand, if single-part “monopoly” pricing is employed the
marginal revenue is also a straight line with twice the negative slope of the demand function or dR/dQd = $10 – 2Qd; where R is revenue and
marginal revenue (MR) is dR/dQd (the first derivative of the revenue function with respect to Qd).
Figure 1. (a) Monopoly pricing, rivalrous good; (b) DH two-part pricing
Now imagine the production of a private good with this demand curve DH and the cost function C = $0 + $6Qd, where fixed cost is $0 and
marginal cost = dC/dQd = $6 (which also equals average variable cost in this circumstance). Profit maximization with single-part pricing occurs
where marginal cost = marginal revenue, or where $6 = $10 - 2Qd. This occurs where Qd = 2 and P = $8. Total revenue is $16, total cost is $12 and
“profit” = $4.¹ At this price, consumer surplus is the small shaded triangle below the demand curve and above a price of $8, or .5(2x2) = $2. This is
shown in Figure 1a (“Monopoly” Pricing Rivalrous Good).
It is noteworthy that fixed costs will have no influence on the choice of the profit maximizing price and quantity. Any fixed (per period) cost
between $0 and $4 yields the same profit maximizing price and quantity; fixed cost only influences profit and whether the firm chooses to exit
the business (if per period volume insensitive costs are expected to be greater than $4).
First-best static economic efficiency in exchange occurs at a quantity at which marginal cost = demand (which represents the marginal value to
consumers of consuming the service). In this case that would occur at a price of $6 and a quantity of 4, shown in Figure 1b (DH Two-Part Pricing).
However, with single part pricing that would lead to a zero profit to the firm. Note that the marginal revenue curve is excluded from Figure 1b
since it is germane to single part “monopoly” pricing but not to first-best economic efficiency in exchange. In Figure 1b, at a price of $6, consumer
surplus is the triangular area below the demand curve and above that price, or .5 ($4x4) = $8. This is the point at which producer surplus ($0)
plus consumer surplus ($8) is maximized given demand and the cost function.
Expand the example now, to one in which there are a million customers outside the firm, each with a demand curve as shown in Figures 1a and
1b. Single part profit maximizing pricing can be represented by simply multiplying the quantities above by one million. Each customer would pay
a price of $8, consuming 2 units of service of the good, with resulting revenues of $16 million and profit of $4 million, and total consumer surplus
of $2 million.
The firm can improve its profits and capture a greater proportion of the available consumer surplus of $8 at that quantity. In contrast to single
part pricing, the firm could employ optimal two-part pricing by establishing a “club fee” (or per-period fee, corresponding to the period the
demand curve represents), and a usage or per-unit price equal to marginal cost (e.g., Allen et al, 2013, ch. 9). In this case the usage fee would
be $6 per unit (marginal cost); if the usage fee were the only price then each consumer would have a consumer surplus of $8. Therefore, the
profit-maximizing optimal two-part price (or tariff) is one with a usage fee of $6 per unit and a club fee of $8 (to capture the entire consumer
surplus shaded in Figure 1b). Here usage revenue is $6x4 = $24 per person, plus the $8 club fee = $32, and profit = $8 per person. Across all
customers in the market, total values simply require multiplying by 1 million customers (e.g., profit is $8 million). In this example, the firm will
still stay in business even with per period (“fixed”) costs of up to $8 million. Moreover, this is the economically efficient quantity where marginal
cost = demand; society in total (producer surplus + consumer surplus) can be no better off.
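The per-customer arithmetic for DH can be verified in a few lines; the sketch below assumes the linear demand P = $10 - Qd and the $6 marginal cost described above, and reproduces both the single-part and the two-part results:

# Per-customer pricing of the rivalrous good: demand P = 10 - Q, marginal cost = $6.
mc = 6.0

# Single-part "monopoly" pricing: set MR = 10 - 2Q equal to MC.
q_mono = (10 - mc) / 2                                  # 2 units
p_mono = 10 - q_mono                                    # $8
profit_mono = (p_mono - mc) * q_mono                    # $4
cs_mono = 0.5 * q_mono * (10 - p_mono)                  # consumer surplus of $2

# Optimal two-part tariff: usage fee = MC; club fee captures the resulting consumer surplus.
q_two_part = 10 - mc                                    # 4 units at a $6 usage fee
club_fee = 0.5 * q_two_part * (10 - mc)                 # $8
profit_two_part = club_fee                              # usage is priced at cost

print(p_mono, q_mono, profit_mono, cs_mono)             # 8.0 2.0 4.0 2.0
print(mc, q_two_part, club_fee, profit_two_part)        # 6.0 4.0 8.0 8.0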
Figure 2. DH non-rivalrous
Pricing a Non-Rivalrous Good
Consider what occurs with the same demand curve DH, but rather than for a private good, it is now the demand for an information service with a
non-rivalrous cost function. The marginal cost of providing additional units of the information service is assumed below to be = $0.00 per unit.
This is shown in Figure 2 (DH Non-Rivalrous), where the marginal cost line simply lies along the horizontal axis. With single-part pricing,
marginal revenue = marginal costs where marginal revenue intersects the horizontal axis. This yields a quantity of 5 with a per unit price of $5.
The revenue per customer is $25 and consumer surplus is the area of the triangle above the price of $5 and below the demand curve = .5 (5x5) =
$12.50.
Optimal two-part pricing with the same demand curve DH, and a non-rivalrous cost function (i.e., when marginal cost = $0) would ignore the
marginal revenue curve in Figure 2 and would yield a usage fee of $0. At a zero usage fee consumers will each consume 10 units; here the entire
triangular area under the demand curve would be the consumer surplus (without a club fee) or .5(10x10) = $50; the optimal club (per period) fee
per person would be $50. Obviously, the optimal two-part tariff produces greater revenue and greater profit than single-part monopoly pricing.
Unfortunately, customers will not all be clones. Consider Figure 3 (DL Non-Rivalrous) below, where a segment of demand DL (that we will call
“LOW”) has the inverse demand function P = $5 - 0.5Qd, with corresponding marginal revenue MR = $5 - Qd. If these demanders exist for the private
good with marginal cost = $6, they are irrelevant to the firm and irrelevant to the economically efficient output and consumption since none of
the LOW demanders has a marginal value for the good above the cost to society to produce the good.
Figure 3. DL non-rivalrous
However, if the LOW segment (DL) exists for the non-rivalrous information service with marginal cost of $0, the single part profit maximizing
quantity is 5, with a corresponding price of $2.50 (for LOW only). This would yield revenue of 5x$2.50, or $12.50. Also note that the optimal two-part
tariff for LOW is significantly different from that for the high segment. As with the higher-valued customers, the usage fee will be $0, but the
available consumer surplus in this circumstance is the area under the demand triangle: .5 (5x10) = $25.
With two (or more) different demand segments the firm should engage in third degree price discrimination, charging different prices for the
same good to the different groups. In order to engage in successful third degree price discrimination, two conditions must be met: 1) the ability to
identify two different groups with different own price elasticity of demand at the same price (i.e., different marginal revenue curves); and 2)
limited ability to engage in arbitrage (buying low and selling high). Begin with the second condition. In this case it means that there is limited
ability of the group buying at a low price to resell to those in the high priced group.
Assuming limited arbitrage, if single part monopoly pricing is used by the firm facing two demand segments, DH and DL, and with zero marginal
costs, as noted above, it should charge prices of $5 to DH and $2.50 to DL. Alternatively, the firm could use two different optimal two-part pricing
schemes; DH customers would pay a $0 usage charge and a $50 monthly club fee, while DL customers would pay a $0 usage charge and a $25
monthly club fee. This would combine third degree price discrimination with optimal two-part pricing.
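Under the zero-marginal-cost assumption, the segment-by-segment results can be reproduced with the short sketch below, which assumes the linear inverse demands used in the text (P = $10 - Qd for DH and P = $5 - 0.5Qd for DL):

# Non-rivalrous pricing (marginal cost = 0) for two segments with linear inverse demand
# P = a - b*Q: DH has a = 10, b = 1; DL has a = 5, b = 0.5.
def single_part(a, b):
    q = a / (2 * b)             # set MR = a - 2bQ equal to zero marginal cost
    p = a - b * q
    return p, q, p * q          # price, quantity, revenue per customer

def optimal_two_part(a, b):
    q = a / b                   # a $0 usage fee -> consume out to the demand intercept
    club_fee = 0.5 * q * a      # capture the entire consumer surplus at a zero usage price
    return 0.0, club_fee        # usage fee, club fee (= revenue per customer)

print(single_part(10, 1.0))       # DH: (5.0, 5.0, 25.0)
print(single_part(5, 0.5))        # DL: (2.5, 5.0, 12.5)
print(optimal_two_part(10, 1.0))  # DH: usage $0, club fee $50
print(optimal_two_part(5, 0.5))   # DL: usage $0, club fee $25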
Complex Pricing of Information Services
This raises the question of how one identifies different groups with different demand elasticities and separates them. In practice, a myriad of
factors are used to identify and separate high value customers like DH from low value customers like DL. This includes: children, senior citizen,
student, and military discounts; residential v. business price differentials; Saturday night layover fares (airlines); weekday v. weekend price
differentials; charging by number of employees; and charging by business type (e.g., discounts for non-profit or government agencies). Often one
can change the characteristics of the product slightly to help groups self-select into high and low value segments such as supplying first class
seats, premium service packages, leather seats (cars), and time of day discounts (matinee prices). Such changes in the quality and characteristics
of products and services begin to blur the boundaries of price discrimination; however, price discrimination should be an important part of the
motivation for providing variations in products and segmenting markets. Moreover, one can impose time costs on buyers to assist in price
discrimination. There is likely to be a strong correlation between customers with low opportunity cost of time and high responsiveness to usage
prices. Therefore, techniques such as requiring a coupon that is obtained from a newspaper or a website impose additional costs on customers
and allow the highly price-elastic customers to identify themselves, through what is sometimes called “self-selection”.
For information services sold outside the company, many of the same characteristics used in other industries (described above) could be useful to
assist in price discrimination. One could establish a pricing schedule based on company or group size measured by number of employees or
revenues. Such a schedule need not be linear. In addition, there may be relatively easy ways in which to intentionally degrade the value of the
information to the lower priced segment. This could include an intentional delay in providing the information to the low-priced segment. For
example, a customized “push” news service dealing with recent regulatory events could be provided in an AM version (the fastest delivery), a PM
version (later delivery), and a weekend version (slowest delivery). Value can also be added or detracted based on the form of the information
provided and the ease with which it is analyzed.
In addition to third degree price discrimination, one can also employ second degree price discrimination with different prices for different
volumes of the product or service.² This is sometimes called declining block pricing, where higher quantities demanded are sold at lower prices.
The classic example is utility pricing, in which lower per-unit prices are obtained beyond a certain quantity.³ Today, one can likely find over 1,000
examples in any large American grocery store, where many products are sold in different package sizes, with a corresponding quantity discount.
Unlike third degree price discrimination, one need not identify separate groups to engage in second degree price discrimination, because
customers self-select. However, to successfully engage in second degree price discrimination, there must still be limited practical ability for
arbitrage; otherwise those obtaining the lower prices will resell to those that would have paid the higher prices. If the original DH were the only
consumers in the market for a non-rivalrous (with zero marginal cost) information service, one could use a three-part declining block price (i.e.,
second degree price discrimination with a three-part price) as shown in Figure 4 (Declining Block Price).
Here, a price of $8 is charged for the first 2 units, after which the customer can now obtain a lower price of $6 for additional units. For any units
beyond 4 purchased, a per-unit price of $3 is obtained. This produces revenue for the firm of $8x2 + $6x2 + $3x3 = $37. This is greater than the
$25 obtained from single part pricing, but less than the $50 obtained from optimal two part pricing. Consumers would obtain consumer surplus
equal to areas A, B, and C. Area D is the welfare loss to society vis-à-vis optimal two-part pricing.
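The block-pricing revenue can be checked with the sketch below, which assumes the demand curve P = $10 - Qd and the three-block schedule just described; the consumer purchases each additional unit only while its marginal value is at least the block price:

# Revenue from the three-part declining block price with demand P = 10 - Q (marginal cost = 0).
def marginal_value(q):
    # Willingness to pay for the q-th unit under the inverse demand P = 10 - Q.
    return 10 - q

def block_price(q):
    # Price charged for the q-th unit: $8 for units 1-2, $6 for units 3-4, $3 thereafter.
    if q <= 2:
        return 8.0
    if q <= 4:
        return 6.0
    return 3.0

revenue, q = 0.0, 0
while marginal_value(q + 1) >= block_price(q + 1):
    q += 1
    revenue += block_price(q)
print(q, revenue)   # -> 7 units purchased, revenue of $37.0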
Imagine that one is supplying a non-rivalrous (with zero marginal cost) good to two groups of customers, DH (as shown in Figure 2) and DI (as
shown in Figure 5).
Figure 5. DI non-rivalrous
In this case, the single-part monopoly price will be identical for both groups, at $5, with DH consuming a quantity of 5 (and a consumer surplus
of $12.50) and DI consuming a quantity of 2.5 (with a consumer surplus of $6.25); this yields a combined revenue of (5+2.5) x $5 = $37.50. This is
relatively unusual; with two different groups of customers, a single-part monopoly price typically performs less well than it does here. (It is also
interesting that the optimal two-part price is identical for DL and DI: a zero usage fee and a club fee of $25.)
Combined Complex Pricing of Non-Rivalrous Goods
However, DL consumes 10 while DI only consumes 5. If one could identify the groups and engage in combined third degree price
discrimination/optimal two-part pricing, the resulting prices would be a $0 usage fee for both groups, and a club fee of $50 for DH and $25 for
DI, or a combined revenue of $75. However, what if one cannot separately identify members of DH and DI? Can one outperform the $37.50 in
revenue from single part monopoly pricing by engaging in more complex pricing? The answer is yes. A revenue-improving price structure is a
price of $25 for consumption of up to five units and an additional fee of $12.50 for an additional five units. In this case DH customers will self-
select to consume 10 units paying $25 + $12.50, while DI customers will self-select to consume 5 units and pay $25. Total revenue is $62.50;
more than the single part price, but less than the $75 from combined two-part pricing and third degree price discrimination. Here, the entire
consumer surplus for DI customers is captured by the firm. DH customers obtain a consumer surplus of $12.50 on the first 5 units, and barely
decide to consume the second five units at an additional price of $12.50.
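The self-selection result can be checked by comparing each segment's willingness to pay for five and for ten units against the menu prices. The sketch below assumes the linear demand P = $10 - Qd for DH and, for DI, the inverse demand P = $10 - 2Qd implied by the quantities and surpluses reported above:

# Self-selection under the menu: $25 for the first five units, plus $12.50 for five more.
def willingness_to_pay(a, b, q):
    # Area under the linear inverse demand P = a - b*Q from 0 to q units.
    return a * q - 0.5 * b * q * q

menu = {0: 0.0, 5: 25.0, 10: 25.0 + 12.5}   # quantity -> total charge

def best_choice(a, b):
    surplus = {q: willingness_to_pay(a, b, q) - charge for q, charge in menu.items()}
    # Break exact ties toward the larger quantity, matching the text's assumption that an
    # indifferent customer "barely decides" to take the additional block.
    return max(surplus, key=lambda q: (surplus[q], q)), surplus

print(best_choice(10, 1.0))   # DH takes 10 units; surplus of $12.50
print(best_choice(10, 2.0))   # DI takes 5 units; surplus of $0 (fully captured by the firm)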
Practical Implications for Pricing Information Services
These examples assume perfect information regarding the demand curves (even in the circumstances where we did not know which customers
belonged to each of the demand curves). What are the practical implications for pricing services in the real world?
● Price discrimination and two-part pricing are important for any provider to consider; but
● The higher marginal costs are, the more prices are constrained by marginal cost;
● To sell information services outside the firm, they must be at least partially excludable;
● Information services are likely to be non-rivalrous (with zero or very low marginal cost);
● With zero or low marginal cost of production, the critical decision then becomes the investments required to create new information or
new information services;
● With zero or low marginal cost, demand information should dominate marketing and pricing decisions;
● With zero or low marginal cost, profit maximization is equivalent to revenue maximization;
● This makes price discrimination and two-part tariffs particularly important for information services;
● This may include the need to degrade the information services in some way for lower-priced customers – this is often not popular with
technical staff;
● In real world pricing, it will often be important to use non-zero usage prices to help capture consumer value due to heterogeneity in
demand (and uncertainty in measuring demand);
● It is important to identify sources of value in aspects of the services you provide, and use those sources of value to engage in complex
pricing and/or price discrimination; and
● Higher usage fees are important when marginal costs are non-zero or when demand is more heterogeneous.
Non-Rivalrous Revenue Maximization, Elasticity and Cross-Elastic Effects
Own-price elasticity of demand can be defined as the percentage change in quantity demanded divided by the percentage change in price.
Rearranging, we have (P/Q)(dQ/dP), where dQ/dP is the inverse of the slope of the demand curve (Allen et al., 2013, chapter 2). With a linear
demand curve the slope is constant: -1.0 for DH (Figures 1, 2, and 4) and -0.5 for DL (Figure 3). However, because the term (P/Q) varies from
infinity to 0 over the demand segment, the own-price elasticity of demand varies from negative infinity (near the price intercept) to 0 (near the
quantity intercept). This can be seen in Figure 2.
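A quick numerical check of how elasticity varies along the DH segment (P = $10 - Qd, so dQ/dP = -1) is sketched below:

# Point elasticity along the linear demand P = 10 - Q (so dQ/dP = -1).
dQ_dP = -1.0
for q in (1, 2, 5, 8, 9):
    p = 10 - q
    elasticity = (p / q) * dQ_dP
    print(q, p, round(elasticity, 2))
# Elasticity runs from about -9 near the price intercept to about -0.11 near the
# quantity intercept, passing through -1 at the midpoint (Q = 5, P = 5).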
In order to take advantage of the revenue opportunities of the segment LOW, the information services provider must engage in third degree price
discrimination.
PERISHABILITY OF SERVICES AND THE IMPLICATION FOR COSTS AND DECISIONS
Products can vary drastically in the degree to which they are “perishable.” The reasonable “life” of a hot funnel cake at the state fair may be 20
minutes; an unrefrigerated tuna salad sandwich - 2 hours; a refrigerated fillet of fish - 4 days; a head of cabbage - 3 weeks; a potato - 2 months; a
can of soup - 5 years. For some products, it is not the physical decay that creates the perishable dimension; it is technological obsolescence or
simply shifts in demand. Software stored on a compact disc may function after decades, but the value to consumers may perish within a few
years, or even a few months. The perishability of a product is an important part of the practical ability to store the product (the cost of storage is
the other major part). Storage of products can be done by manufacturers or consumers. This has important implications for manufacturers in
managing variations in demand for products, and for pricing products (e.g., a temporary price cut for a product may simply shift consumers’
demand from future time periods to current time periods in combination with consumer storage).
Information can generally be stored (with some cost of storage). However, information services can also vary drastically in their perishability.
Current events/news-like services can have extremely rapid decay in value (e.g., a half-life in value of perhaps less than one day). Some customer
information (e.g., address and contact information for customers that currently, or previously, have purchased a particular service) has a much
longer half-life (perhaps more than a year). The degree to which information is perishable, like perishable produce, has important business
implications. The shorter the half-life of the information, the faster the required time to market and deliver the information. Rapid decay in the
value of the information also has important marketing and pricing implications. Variations in demand are an important method by which to engage in price
discrimination or market segmentation; e.g., highly price elastic segments may self-select for later delivery (and a correspondingly lower price).
With sufficient delay, there can also be opportunities to provide the older information as samples for free to potential customers who are
uncertain as to the value of the information.
Services, in general, can also vary in the degree to which they are “storable.” Consider traditional land-line telecommunications services. In a
hypothetical example, one company may have capacity to carry the equivalent of one million simultaneous voice calls between Chicago and St.
Louis. If, at 6 AM on a Sunday morning the actual utilization of that network path had only the equivalent of 1,000 simultaneous voice calls, the
other 999,000 units of capacity cannot be stored for later, they are simply lost forever. Services that are provided via capital assets that have
limited capacity are generally perishable in real time. Such real-time perishability, in combination with variability in demand over time or
geography, has critical implications for the calculation of costs as well as the pricing and marketing of services.
For real-time perishable services derived from capital assets with limited capacity, it is critical to examine business decisions with respect to their
effects on the use of capacity. While the details of capacity cost approaches are beyond the scope of this chapter, some important conclusions
from the capacity cost perspective can be briefly described here. First, it is critical to look for opportunities to provide services (either internally,
or externally for sale) that will NOT exhaust the capacity of capital assets; i.e., to seek ways to more fully utilize currently underutilized capital
assets. Second, if a business decision will exhaust the capacity of assets, it is important to pursue that decision only if it generates significant
benefits. When an existing capital asset is exhausted in its capacity, one of two results must occur. Either: a) a new capital asset is placed to
expand capacity; or b) some demand or use for the services provided must go unfulfilled (and some rationing device should be employed to
determine which potential use goes unfulfilled).
Other services, or aspects of services, are not perishable in real time (or do not face variability of demand across time or geography). For example,
electric power services are a peculiar mix of components that are perishable in real time, and those that are not. The capacity created by the
capital assets required for electric power generation and high-voltage transmission is designed for system peak load; it is perishable in real
time. Moreover, these facilities can display significant variability in demand across time and geography (electric power demand at off-peak may
be only 20% of the peak). Part of the process that electric power companies utilize to deal with exhaustible capacity (and variability of demand) is
to use real time pricing (to help ration demand to high valued users) and interruptible power contracts for very high volume (usually industrial)
customers (another form of rationing). In contrast, the coal, fuel oil, natural gas and even water (for hydroelectric) used in electric power
generation can be stored, when not used.
Part of the difficulty in managing electric power transmission and distribution networks and land-line telecommunications networks is that the
networks have many geographically-specific capital assets. Capacity constraints exist at specific links and nodes within the network. This
geographic characteristic greatly exacerbates the problems of real-time perishability of the limited capacity of the assets. This is because the
narrower the geographic area, the greater the variability in demand is likely to be vis-à-vis the capacity of the facilities (there is no opportunity to
use the law of large numbers that tends to smooth demand). Moreover, it is difficult or impossible to have the capacity in one part of a network
substitute for capacity in another part of the network. In telecommunications, this is slightly alleviated via preconfigured fault protection
(predetermined alternate routing in case of failure of a network link or node) and the use of internet protocol-based transmission which can more
easily smooth network demand across alternate routes. However, such techniques require facilities that do, in fact, provide alternate routes.
Consider four dimensions of information services: a) the creation of the information itself; b) the storage of the information; c) analysis of the
information; and d) access to the information or distribution of the information. Of these four dimensions, access to (or distribution of) the
information is the only one which is likely to display real-time perishability in combination with demand variability. If users/demanders of the
information choose the time at which they access the information, the demand variability (across time or geography) is likely to be high. In
contrast, if the producer of the information is in control of the distribution of the information (e.g., a specialized “push” internet news service)
capacity constraints on the distribution facilities are likely to be much less of an issue.
NETWORK EFFECTS
Externalities
The earliest discussion of network effects was in the telecommunications industry, often via use of the phrase “network access externality” (e.g.,
Katz & Shapiro, 1985; Rohlfs, 1974). Externalities are well known in economics and are said to exist when an economic agent (a producer or a
consumer) takes an action in which the full costs are not borne (or the full benefits of the action are not received) by the decision maker. Pollution
is the classic example of an external cost. The polluter does not bear the full costs of the pollution; other members of society bear some portion, or
perhaps all, of the costs. But in telecommunications, it was recognized that an external benefit accrues when a potential new subscriber gains
access to the network; that is, the existing subscribers to the network receive some additional benefit when a potential new subscriber joins the
network. However, the value others on the network receive from this potential marginal subscriber joining the network is not included in that
subscriber’s decision calculus. This leads to a lower level of network subscription than is economically optimal, and this represents the best
possible rationale for historic subsidization of telecommunications network access (Parsons, 1994). However, the high rates of penetration in
telecommunications networks today suggest a much weaker rationale for continued subsidization of telecommunications services (Parsons &
Bixby, 2010). The early literature in telecommunications noted two possible types of positive externalities: 1) network access (just discussed); and
2) call or usage (Armstrong, Dole & Vickers, 1996; Economedies, 1989; Laffont & Tirole, 2000; Liebowitz & Margolis, 2002; Rohlfs, 1974; Rohlfs,
2001; Squire, 1973; Vogelsang & Mitchell, 1997; and Wenders, 1987).
Direct Network Effects
Later discussions broadened to concepts of “network effects” and the “externality” terminology largely disappeared. Network effects are now
generally discussed as being of two forms: 1) direct network effects; and 2) indirect network effects (Varian et al, 2004; Rohlfs, 2005; Gruber,
2005). Varian et al (2004) uses the phrase “demand-side economies of scale” to refer to network effects. Under the modern terminology a direct
network effect is the same as the older telecommunications concept of a positive network externality: each new subscriber adds value to those
already subscribing to the network. Examples of direct network effects include: voice telecommunications, fax machines, email, Picturephone
(1973, failed to reach critical mass), the internet, and social networking sites (Rohlfs, 2001; Varian et al, 2004). With a direct network effect, the
first subscriber receives no value without another subscriber on the “network.” Recognize that this is different from most goods and services
consumed in society. For most goods, one consumer’s valuation of the good is completely independent of others’ consumption. The typical
consumer of a hamburger receives no greater (nor lesser) satisfaction if others consume (or do not consume) hamburgers.
Therefore, the demand function for a direct network effect can take on a peculiar shape. In some circumstances, it is possible that the demand
curve can rise with the quantity demanded (since a higher quantity represents more subscribers on the “network”), i.e., that the demand curve is
upward sloping at relatively low network subscription rates. This could take on the form shown in Figure 6.
One key business lesson when dealing with a direct network effect is that one must plan to reach critical mass. This includes very low prices for
network access early in the life cycle of the network. This can mean that it is necessary to obtain revenues from some other sources such as
complex pricing and revenues from high-valued usage (rather than prices for access). Once critical mass has clearly been reached, sustainable
and more typical business and pricing practices can prevail.
Another critical implication of direct network effects is that it is important for new entrants or smaller firms to obtain interconnection (or
interlinking) or network compatibility in order to leverage existing network subscriptions. In theory, when mobile voice communications services
began to emerge, it would have been possible to have created a separate and incompatible network of mobile users only. However, such a network
would have missed the value of interconnecting (in a compatible way) with tens of millions of existing land-line telecommunications network
subscribers. The very first mobile customer had relatively high value by being able to call an existing base of land-line voice customers. Rohlfs
(2001) states: “In bandwagon markets with interlinking, competitive rivalry is not much different than in non-bandwagon markets. Bandwagon
markets without interlinking tend to generate a winner-take-all contest among suppliers” (p 35). Early in the 20th century, AT&T gained market
share and market power by refusing to interconnect with small phone companies, forcing each to attempt to reach critical mass in its own small
isolated network; this eventually led to an antitrust suit by the U.S. government and the Kingsbury Commitment by AT&T (Brooks, 1976).
Interconnection and network compatibility are so powerful that they are now required by virtually every telecommunications regulator in the
world.
One could build, for example, an intra-network proprietary email system for a large company. Such a system could have a variety of customized
functions and features with value to employees. However, the theory of direct network effects suggests that significant additional value will
accrue to such a system when it is interconnected to email systems around the world. Interconnection here refers to both the physical connection
of the private intra-network to external networks, and consistency in protocols and standards.
Indirect Network Effects
The second form of network effect is the “indirect network effect.” An indirect network effect occurs when there are strong complementary cross-
elastic effects in demand between two products or services (Varian et al, 2004, section I). Classic examples are CD players and CDs, and DVD
players and DVDs. In these cases the complementary effects are so strong that consumers receive no value from either component without the
other. Here, customers do not receive value directly when other customers obtain a CD player; rather, each customer with a CD player
benefits indirectly as others purchase CD players, because a larger installed base expands the market for content (CDs). Such indirect network effects create a “chicken or
the egg” (which came first?) startup problem. Often, this startup problem is exacerbated by competition with existing (older) technology (vinyl
records and cassette tapes in the case of CDs) and customers with an existing inventory of complements for the old technology. The case of
compact disc players is perhaps instructive as a miniature case study (Rohlfs, 2001, Chapter 9; Varian et al, 2004, Section I). Early in the history
of this industry, several companies had the potential to successfully manufacture a form of compact discs and/or players (JVC, Sony, Philips, and
Telefunken). Here Sony and Philips agreed to cross-license their patents to form a superior product “standard” (without a technical standards
body), that would later be adopted by other manufacturers of CD players. Sony and Philips jointly solved the chicken and the egg problem by
planning the production of complementary content (CDs) consistent with the timing of the production of CD players.
Other examples of indirect network effects include VCRs and VHS tapes, early application software (e.g., VisiCalc) and early personal computers,
internet subscription and websites and email (having elements of both direct and indirect network effects), high-definition television, early
broadcast radio (early 20th century), vinyl records and players (78, 45 and LP rpm formats, each as an example), and electric power and electric lights
and electric motors (Rohlfs, 2001).
However, Stremersch et al (2007) find that “indirect network effects as commonly operationalized are less pervasive in the examined markets
than expected on the basis of prior literature” (p 68). The authors also conclude that, for hardware manufacturers, the quantity of related software
is less important than previously expected (though the quality of the related software is important).
Lessons and Conclusions
The existence of direct or indirect network effects has critical implications for business decisions. These include the following:
● With direct network effects, one must reach critical mass by initially pricing “network” subscription low, and/or otherwise encouraging
new subscribers;
● With direct network effects (and even indirect network effects to a somewhat lesser degree) it is critical for new entrants or smaller
providers to obtain interconnection, interlinking and compatibility with existing or other potential network providers;
● The new network products or services must have clear and strong advantages to consumers in order to overcome customers’ existing
choice of technology/network (e.g., the failure of video disc players in the face of existing VHS);
● Companies must look for partners with which to create superior combined technologies and/or to overcome the chicken and the egg or
critical mass problems;
o A firm should license (rather than hold close) an information technology product patent when network effects exist;
o A fixed fee patent license is optimal with strong network effects; and
o A royalty with a fixed fee is superior when network effects are weaker.
Costs to Consumers
Traditionally, one can think of the consumers of a service (such as an information service) as being outside and separate from employees within
the firm producing the service. In some instances the “consumers” of a service are employees within the firm. In each instance, consumers are
likely to incur costs of obtaining and consuming the service beyond the price paid for the service. These costs can include those related to search,
transactions and learning. Such costs were first discussed by Nobel Laureate Ronald Coase (1937, 1938, 1960), and later by Nobel Laureate
Oliver Williamson (e.g., 1981).
The fact that there is not a pecuniary market transaction corresponding to these costs makes them no less real to consumers choosing between
services and combinations of services. These costs have important implications for providers of services. First, rational consumers will choose
amongst competing services by comparing total costs (both pecuniary out-of-pocket costs as well as implicit search, transaction and learning costs)
to the benefits expected. Providers that can reduce consumers’ search, transaction and learning costs will improve the likelihood that consumers
will choose their product.
Second, to the extent that there are strong complementary relationships between products, bundling the services together, or making it easy for
consumers to purchase both types of services can help reduce search and transactions costs for consumers. Similarly, producers that can
anticipate downstream demanders’ uses of services may be able to reduce demanders’ opportunity costs by adding features that save
consumers’ time. For example, information services that are summarized to different degrees (to allow for different levels of use and interest), are
pre-analyzed, or have easy analytical components added on, can save demanders time and reduce their opportunity costs of using the
information.
For a service that is only used internally, transaction and learning costs can still be significant. The full opportunity cost of a decision is (or should
be) at the heart of the total cost of ownership (TCO) approach, discussed earlier.
Switching Costs
In many circumstances a consumer’s past choices influence the full opportunity costs of future choices. When there are strong complementary
relationships between products (as occurs with indirect network effects) this can create “switching costs” in which the customer must bear some
costs in order to switch from one technology, or one vendor, to another (Varian, Farrell & Shapiro, 2004). Consumers contemplating the purchase of a digital compact cassette player, a technology first marketed in 1992 that failed to reach critical mass (Rohlfs, 2001, p. 101), would have faced significant switching costs because of their existing libraries of compact discs. Often the switching costs are associated with a
particular vendor’s proprietary product. Many desktop printers employ proprietary printer cartridges that do not interlink with (i.e., are not compatible with) other vendors’ printers or cartridges. This creates switching costs for a consumer purchasing new printer cartridges (who, in order to switch cartridge suppliers, must now buy a new printer).
This has important business implications for circumstances in which indirect network effects exist or other strong complementary effects exist.
First, the higher the switching costs after a consumer makes an initial purchase of the primary product, the more intensively producers should
compete for the initial product. At the extreme, a producer might “give away the razor to sell the blades” (Mitchell, 2011).
Second, note that such an approach (low or negative margins on the initial/primary component and high margins on the subsequent/secondary
component) is contingent on variable proportions in consumption; that is, that demand for the subsequent/secondary strong complement varies
with use. (An example of strong complements in consumption with fixed proportions would be a right shoe and a left shoe). A higher margin on
the secondary complement allows producers to engage in price discrimination; higher use in the subsequent/secondary component (e.g., by using
more printer cartridges) likely implies higher value. And, this form of price discrimination can occur largely via self-selection; the high value and
high use customers simply buy more of the subsequent/secondary component (e.g., more printer cartridges).
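A small worked illustration may help; the figures here are hypothetical and not taken from the chapter. Suppose the primary component (the printer) is sold at a $10 loss and each proprietary cartridge carries a $20 margin. The contribution earned from a customer who buys q cartridges is then

\[ \pi(q) = 20q - 10, \]

so a light user (q = 2) contributes $30 while a heavy, presumably higher-value user (q = 10) contributes $190. The same posted prices thus extract more from heavier users purely through self-selection.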
BIG DATA
An explosion in the volume of data available for virtually every business creates opportunities and problems. Almeida & Calistru (2012) state:
“Big data is a disruptive force that will affect organizations across industries, sectors and economies” (p. 1). Lock (2012) suggests companies
experience an average annual growth in data of 38%; some suggest higher numbers. However, Odlyzko (2012) notes the difficulties in measuring
the volumes of information that are communicated. The issue of big data has even invaded the popular press (e.g., The New York Times, 2012).
Big data has critical implications for designing and using information systems, communication/data transmission, data storage, and data
analysis. As with designing information systems for cost analysis it is important to employ a decision focus. In particular, it is critical to design
information systems to be useful in providing information for the most important or the most frequent business decisions. LaValle et al (2011)
use the phrases “focus on the biggest and highest-value opportunities” and “within each opportunity, start with questions, not data” (p. 25).
The very nature of the big data problem is that there is an expanding set of data available, but not necessarily a corresponding improvement in
analytical use of the data. It will be tempting, in contemplating building or changing an information system, to focus on creating or capturing
more data. In addition, it will be tempting to create processes and mechanisms to distribute, and perhaps display, this new data. This data-oriented focus is a mistake. “Only about one of five respondents cited concern with data quality or ineffective data governance as a primary obstacle” to useful analytics (LaValle et al, 2011, p. 23).
Moreover, part of the practical implication of using a decision focus to deal with big data issues in a large corporation requires that the potential
decisions on which to focus are those consistent with the business strategy and those that currently attract senior management bandwidth.
Consider an example of how one might attempt to leverage big data. Senior management at a large retail company with a significant on-line
presence had identified that high return rates had been a problem or at least a perceived problem with on-line orders. This problem was
exacerbated by a policy that the company would pay for both the original shipping (beyond a purchase price of a certain dollar amount) and the
return shipping for on-line purchases that were returned. Consider three possible decisions related to the high returns problem: a) change the
free shipping policy; b) identify problem items (detailed decisions to follow); or c) identify problem customers (detailed decisions to follow).
In some circumstances, high return rates of retail merchandise exist for specific products or lines of products. This can be due to poor quality or
poor descriptions (such as size measures that are inconsistent with most customers’ expectations). Data on returns can be organized by product
or product line, in order to identify the high-return-rate offenders; follow-on decisions (such as improving product descriptions or discontinuing problem items) can then be made.
In such a circumstance, it will also be useful to examine customers with the highest return rates. This requires not only sorting the customer
records over some period of time for the highest number of returns (the numerator), but also for a measure of total purchases by the same
customer (the denominator). This analysis can indicate which customers likely cause more costs than revenue; you simply don’t want such
customers (McWilliams, 2004). While one cannot preclude such customers from patronizing your brick-and-mortar stores or your website, you can
eliminate them from your direct mail or email solicitations list. In particular, you can eliminate them from notifications regarding discounts,
coupons, or other sales.
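A hedged sketch of this customer-level return-rate analysis is shown below; the file name and column names (customer_id, order_id, returned) are illustrative assumptions, not taken from the case described above.

```python
import pandas as pd

# One row per order; "returned" is assumed to be a 0/1 flag.
orders = pd.read_csv("orders.csv")

by_customer = orders.groupby("customer_id").agg(
    total_orders=("order_id", "count"),  # the denominator: total purchases
    returns=("returned", "sum"),         # the numerator: number of returns
)
by_customer["return_rate"] = by_customer["returns"] / by_customer["total_orders"]

# Candidates for exclusion from discount and coupon mailings;
# the 0.5 threshold is purely illustrative.
problem_customers = by_customer[by_customer["return_rate"] > 0.5].sort_values(
    "return_rate", ascending=False
)
print(problem_customers.head())
```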
The last category of decision (to eliminate the free return shipping), is more difficult to evaluate. Free return shipping can have significant value
to customers, including those that seldom make returns. The value can include overcoming asymmetric information problems classic in online
shopping; with online shopping, customers simply do not have as much information as the provider, nor as much as they would when shopping in a brick-and-mortar store. It may be that once the high-return customers and the high-return products and product lines are eliminated, the free
return shipping is no longer a significant problem.
FUTURE RESEARCH DIRECTIONS
The possibilities for future research in areas related to this chapter include: empirical examination of the circumstances in which limited capacity
for capital assets affects cost; the relationship between the literature on Theory of Constraints (ToC) and the cost implications of limited capacity
of capital assets and the literature on TCO; a broader range of business cases published related to dealing with the big data problem; and how the
practical implications of real options theory can be used for developing information services and information systems.
CONCLUSION
Miscommunication and confusion regarding cost terms and relevant cost concepts can be resolved by focusing on the business decision; the
relevant costs (and revenues) will emerge from carefully defining the business decision and its competing alternative(s). Information services
often have some public good-like characteristics. Such low or zero marginal cost of production services (costs corresponding to decisions that
cause small changes in the quantity of the information service provided) make demand information (rather than marginal cost information)
critical to determining the appropriate price structure and price level. Moreover, this makes the costs of information services, and information
systems, very front end heavy; that is, the great majority of the costs are caused by the decision to initially establish the information system, or
create the information itself (rather than the cost of distributing the information to another user).
In contemplating the investments that would be necessary for a new information system, a full opportunity cost/full life cycle cost/total cost of
ownership approach is appropriate. The proper choice of the discount rate (interest rate) and proper application of discounted present value
techniques also becomes important as decisions involve the purchase of long-lived assets. As part of this process, it is also useful to consider the
value of managerial flexibility and to employ a real options perspective to identify sources of flexibility and reduce sunk costs.
This work was previously published in Approaches and Processes for Managing the Economics of Information Systems edited by Theodosios
Tsiakis, Theodoros Kargidis, and Panagiotis Katsaros, pages 1446, copyright year 2014 by Business Science Reference (an imprint of IGI
Global).
ACKNOWLEDGMENT
Thanks are due to Ellesse Henderson at the Washington University School of Law for research assistance and helpful comments.
REFERENCES
Alchian, A. A. (1968). Costs and outputs . In W. Breit & H.M. Hochman (EDs.), Readings in microeconomics (2nd ed., pp. 159–171). Hinsdale, IL:
Dryden Press.
Allen, (2013). Managerial economics: Theory, applications, and cases (8th ed.). New York, NY: W.W. Norton & Co.
Almeida, F., & Calistru, C. (2012). The main challenges and issues of big data management. International Journal of Research Studies in
Computing , 1(1), 1–10.
Angelou, G., & Economides, A. (2005). Flexible ICT investments analysis using real options. International Journal of Technology, Policy and Management, 5(2), 146–166.
Areeda, P., & Turner, D. F. (1975). Predatory pricing and related practices under Section 2 of the Sherman Act. Harvard Law Review, 88(4), 697–
733. doi:10.2307/1340237
Armstrong, M., Doyle, C., & Vickers, J. (1996). The access pricing problem: A synthesis. The Journal of Industrial Economics, 44(2), 131–150.
doi:10.2307/2950642
Baumol, W. J. (1979). Quasi-permanence of price reductions: A policy for prevention of predatory pricing. The Yale Law Journal, 89(1), 1–26.
doi:10.2307/795909
Benaroch, M. (2012). Managing information technology investment risk: A real options perspective. Journal of Management Information
Systems , 19(2), 43–84.
Benaroch, M., Jeffery, M., Kauffman, R., & Shah, S. (2007). Option-based risk management: A field study of sequential IT investment
decisions. Journal of Management Information Systems , 24(2), 103–140. doi:10.2753/MIS0742-1222240205
Benaroch, M., & Kauffman, R. (1999). A case for using real options pricing analysis to evaluate information technology project
investments. Information Systems Research , 10(1), 70–86. doi:10.1287/isre.10.1.70
Bork, R. H. (1978). The antitrust paradox: A policy at war with itself . New York, NY: Free Press.
Buchanan, J. M. (1969). Cost and choice: An inquiry in economic theory . Chicago, IL: University of Chicago Press.
Coase, R. H. (1960). The problem of social cost. The Journal of Law & Economics , 3(1), 1–44. doi:10.1086/466560
Della Valle, A. P. (1988). Short-run versus long-run marginal cost pricing. Energy Economics , 10(4), 283–286. doi:10.1016/0140-
9883(88)90039-4
Dixit, A. K., & Pindyck, R. S. (1994). Investment under uncertainty. Princeton, NJ: Princeton University Press.
Economides, N. (1989). Desirability of compatibility in the absence of network externalities. The American Economic Review, 79(5), 1165–1181.
Ferrin, B. G., & Plank, R. E. (2002). Total cost of ownership models: An exploratory study. Journal of Supply Chain Management , 38(3), 18–29.
doi:10.1111/j.1745-493X.2002.tb00132.x
Fisher, P. S. (1991, March). The strange career of marginal cost pricing. Journal of Economic Issues , 25(1), 77–92.
Gruber, H. (2005). The Economics of mobile telecommunications . Cambridge, UK: Cambridge University Press.
doi:10.1017/CBO9780511493256
Katz, M. L., & Shapiro, C. (1985). Network externalities, competition and compatibility. The American Economic Review, 75(3), 424–440.
Knight, F. H. (1921). Cost of production and price over long and short periods. The Journal of Political Economy , 29, 304–335.
doi:10.1086/253349
Laffont, J., & Tirole, J. (2000). Competition in telecommunications . Cambridge, MA: MIT Press.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan
Management Review , 52(2), 21–31.
Liebowitz, S., & Margolis, S. (2002). Network effects . In Cave, (Eds.), Handbook of telecommunications economics: Structure, regulation, and
competition . Academic Press.
Lin, L., & Kulatilaka, N. (2006). Network effects and technology licensing with fixed fee, royalty, and hybrid contracts. Journal of Management
Information Systems , 23(2), 91–118. doi:10.2753/MIS0742-1222230205
Lock, M. (2012). Data management for BI: Big data, bigger insight, superior performance. Aberdeen Group White Paper, 1, 4-20.
Margrabe, W. (1978). The value of an option to exchange one asset for another. The Journal of Finance , 33(1), 177–186. doi:10.1111/j.1540-
6261.1978.tb03397.x
McWilliams, G. (2004, November 8). Analyzing customers, Best Buy decides not all are welcome. The Wall Street Journal, p. A-1.
Mitchell, D. (2011, November). Amazon may cut itself on ‘razors and blades’ theory. CNN Money.
Parsons, S. G. (1994). Seven years after Kahn and Shew: Lingering myths on costs and pricing telephone service. Yale Journal on
Regulation , 11(1), 149–170.
Parsons, S. G. (2002). Laffont & Tirole’s competition in telecommunications: A view from the U.S. International Journal of the Economics of
Business , 9(3), 419–436. doi:10.1080/1357151021000010409
Parsons, S. G., & Bixby, J. (2010). Universal service in the U.S.: A focus on mobile communications. Federal Communications Law Journal , 62.
Robbins, L. (1934). Certain aspects of the theory of costs. The Economic Journal , 44, 1–18. doi:10.2307/2224723
Rohlfs, J. (1974). A theory of interdependent demand for a communications service. The Bell Journal of Economics and Management Science , 5,
16. doi:10.2307/3003090
Samuelson, P. (1954). The pure theory of public expenditure. The Review of Economics and Statistics , 36, 387–389. doi:10.2307/1925895
Squire, L. (1973). Some aspects of optimal pricing for telecommunications. The Bell Journal of Economics and Management Science , 4, 515.
doi:10.2307/3003051
Stiglitz, J. E. (1977). Theory of local public goods . In Feldstein, M., & Inman, R. (Eds.), The economics of public service . New York, NY: Halsted
Press.
Stiglitz, J. E. (1999). Knowledge as a global public good . In Kaul, I., Grunberg, I., & Stern, M. (Eds.), Global public goods: International
cooperation in the 21st century . Oxford, UK: Oxford University Press. doi:10.1093/0195130529.003.0015
Stremersch, S., Tellis, G., Hans Franses, P., & Binken, J. L. G. (2007). Indirect network effects in new product growth. Journal of
Marketing , 71(3), 52–74. doi:10.1509/jmkg.71.3.52
Taylor, L. (1994). Telecommunications demand in theory and practice (pp. 9, 28–31, 83). Dordrecht, The Netherlands: Kluwer Academic
Publishers. doi:10.1007/978-94-011-0892-8
Trigeorgis, L. (1999). Real options: A primer . In Alleman, J., & Noam, E. (Eds.), The new investment theory of real options and its implication
for telecommunications economics (pp. 122–138). Boston, MA: Kluwer Academic Publishers.
Trigeorgis, L. (1999). Real options: Managerial flexibility and strategy in resource allocation (4th ed.). Cambridge, MA: The MIT Press.
Varian, H., Farrell, J., & Shapiro, C. (2004). The economics of information technology: An introduction . Cambridge, UK: Cambridge University
Press. doi:10.1017/CBO9780511754166
Vogelsang, I., & Mitchell, B. (1997). Telecommunications competition: The last ten miles (p. 51). Cambridge, MA: MIT Press.
Von Mises, L. (1949). Human action . New Haven, CT: Yale University Press.
Wenders, J. (1987). The economics of telecommunications . Cambridge, MA: Harper and Row, Ballinger.
Williamson, O. (1981). The economics of organization: The transaction cost approach. American Journal of Sociology , 87(3), 548–577.
doi:10.1086/227496
ADDITIONAL READING
Alchian, A. A. (1968). Costs and outputs. In W. Breit & H.M. Hochman (EDs.), Readings in microeconomics (2nd ed., pp.159-171). Hinsdale, IL:
Dryden Press. (Reprinted from The allocation of economic resources, by M. Abramovitz, Ed., 1959, Stanford, CA: Stanford University Press).
Alchian, A. A. (1969). Cost. In International Encyclopedia of the Social Sciences (pp. 404–415). New York, NY: Macmillan.
Almeida, F., & Calistru, C. (2012, April). The main challenges and issues of big data management. International Journal of Research Studies in
Computing , 1(1), 1–10.
Bell, J., & Ansari, S. (2000). The kaleidoscopic nature of costs . U.S.: McGraw-Hill Companies.
Benaroch, M. (2012). Managing information technology investment risk: A real options perspective. Journal of Management Information
Systems , 19(2), 43–84.
Benaroch, M., Jeffery, M., Kauffman, R., & Shah, S. (2007). Option-based risk management: A field study of sequential IT investment
decisions. Journal of Management Information Systems , 24(2), 103–140. doi:10.2753/MIS0742-1222240205
Buchanan, J. M., & Thirlby (1981). L.S.E. essays on cost. New York, NY: University Press. (Original work published 1973).
Liebowitz, S., & Margolis, S. (2002). Network effects . In Cave, (Eds.), Handbook of telecommunications economics: Structure, regulation, and
competition (p. 76).
Lin, L., & Kulatilaka, N. (2006, Fall). Network effects and technology licensing with fixed fee, royalty, and hybrid contracts. Journal of
Management Information Systems , 23(2), 91–118. doi:10.2753/MIS0742-1222230205
Lock, M. (2012). Data management for BI: Big data, bigger insight, superior performance. Aberdeen Group White Paper, 1, 4-20.
Robbins, L. (1934). Certain aspects of the theory of costs. The Economic Journal , XLIV, 1–18. doi:10.2307/2224723
Rohlfs, J. (1974). A theory of interdependent demand for a communications service. The Bell Journal of Economics and Management Science , 5,
16. doi:10.2307/3003090
Stiglitz, J. E. (1999). Knowledge as a global public good . In Kaul, I., Grunberg, I., & Stern, M. (Eds.), Global public goods: International
cooperation in the 21st century . Oxford, England: Oxford University Press. doi:10.1093/0195130529.003.0015
Varian, H., Farrell, J., & Shapiro, C. (2004). The economics of information technology: An introduction . Cambridge, England: Cambridge
University Press. doi:10.1017/CBO9780511754166
Williamson, O. (1981). The economics of organization: The transaction cost approach. American Journal of Sociology , 87(3), 548–577.
doi:10.1086/227496
KEY TERMS AND DEFINITIONS
Complex Pricing: Pricing beyond single-part pricing such as the use of an optimal two-part price or price discrimination.
Marginal Cost: The first derivative of the total cost function with respect to output (the change in cost due to small change in output).
Nonrivalrous Good: A good for which the marginal cost of production is zero or close to zero (such as intellectual property). One consumer’s
use of the good does not displace another consumer’s use.
Real Options: Valuing managerial flexibility via the implications of real options theory, originally developed for financial instruments.
Rivalrous Good: A good with non-trivial marginal costs such that one consumer’s use largely or completely displaces other consumers’ use.
Total Cost of Ownership (TCO): A full opportunity cost approach similar in concept to some life cycle approaches and often used to evaluate
two competing alternatives.
ENDNOTES
1 The word profit is used in quotes here since this may be only a portion of the costs of a company operating in a multiproduct dimension. There
may be other joint, shared, or common costs not shown here. In such a case the term “contribution” would be better applied to reflect the need to
contribute to other costs of the multiproduct firm.
2 One can argue whether the old distinctions between first, second, and third degree price discrimination that were originally described by Pigou
(1932) are less useful than other forms of distinctions. Here I continue with the old delineations of price discrimination, largely because they have
been so commonly used.
3 With a new focus in recent decades on energy conservation, block pricing is not as common as it once was.
4 The one exception could be demonstrable strong positive cross elastic effects with other services (complements) sold in the portfolio, which
provide significant contribution.
APPENDIX
Annotated Additional Readings
Of the readings above, three are particularly accessible to non-economists and provide significant value, with relatively little reading.
1. Varian, Farrell & Shapiro (2004) provides a great introduction to the economics of information technology. The book is written by
economists embedded in the research of these topics, but written for the non-economist in a very intuitive fashion. It is available in paperback, and the text runs to only 86 pages.
2. Nobel Laureate James Buchanan’s Cost and Choice: An Inquiry in Economic Theory (1969) is also a quick read at 102 pages of text in
a small paperback. It describes much of the original history of economic thought on cost as it relates to decisions.
3. LaValle et al (2011) discuss the results of a survey with 3,000 business executives and managers involving big data, information systems
and information analytics and how information analytics drives performance.
CHAPTER 64
Big Data Analytics on the Characteristic Equilibrium of Collective Opinions in Social
Networks
Yingxu Wang
University of Calgary, Canada
Victor J. Wiebe
University of Calgary, Canada
ABSTRACT
Big data are products of human collective intelligence that are exponentially increasing in all facets of quantity, complexity, semantics,
distribution, and processing costs in computer science, cognitive informatics, web-based computing, cloud computing, and computational
intelligence. This paper presents fundamental big data analysis and mining technologies in the domain of social networks as a typical paradigm of
big data engineering. A key principle of computational sociology known as the characteristic opinion equilibrium is revealed in social networks
and electoral systems. A set of numerical and fuzzy models for collective opinion analyses is formally presented. Fuzzy data mining
methodologies are rigorously described for collective opinion elicitation and benchmarking in order to enhance the conventional counting and
statistical methodologies for big data analytics.
1. INTRODUCTION
The hierarchy of human knowledge is categorized at the levels of data, information, knowledge, and intelligence [Debenham, 1989; Bender,
1996; Wang, 2006, 2014a]. Big data is one of the fundamental phenomena of the information era of human societies [Jacobs, 2009; Snijders et
al., 2012; Wang, 2014a; Wang & Wiebe, 2014]. Almost all fields and hierarchical levels of human activities generate exponentially increasing data,
information, and knowledge. Therefore, big data analytics has become one of the fundamental approaches to embody the abstraction and
induction principles in rational inferences where discrete data represent continuous mechanisms and semantics.
Big data are extremely large-scaled data in terms of quantity, complexity, semantics, distribution, and processing costs in computer science,
cognitive informatics, web-based computing, cloud computing, and computational intelligence. Big data and formal analytic theories are a
pervasive demand across almost all fields of science, engineering, and everyday lives [Chicurel, 2000; Snijders et al., 2012; Wang, 2003, 2009a,
2014a]. Typical applications of big data methodologies in the sciences include mathematics, number theories, neuroinformatics, computing
systems, IT and web systems, brain science, memory capacity, genomes, linguistics, sociology, and management science [Wang, 2015; Wang &
Berwick, 2012]. Paradigms of big data applications in engineering include Internet technologies, web systems, search engines,
telecommunications, image processing, cognitive knowledge bases, multi-media databases, data mining, online text comprehension, machine
translations, cognitive informatics, and cognitive robotics [Wang, 2003, 2007b, 2009b, 2009c, 2010, 2012a, 2013a, 2013b, 2015]. Big data in the
modern society are fast approaching the order of petabytes (10^15 bytes) per year [Wiki, 2012].
Big data analytics in sociology and collective opinion elicitation in social networks are identified as an important field where data are often
complex, vague, incomplete, and counting-based [Wang & Wiebe, 2014]. Censuses and general elections are the traditional and typical domains
that demand efficient big data analytic theories and methodologies beyond number counting and statistics [Emerson, 2013; Saari, 2000]. Among
modern digital societies and social networks, popular opinion collection via online polls and voting systems becomes necessary for policy
confirmation in general elections.
This paper presents formal models and methodologies of big data analytics for collective opinions and representative equilibrium in social
networks and electoral systems. The cognitive and computing properties of big data are explored in Section 2. Potential pitfalls of conventional
counting-based voting methods for collective opinion elicitation and the majority rule embodiment are analyzed in Section 3. The characteristic
equilibrium of collective opinions in social networks is revealed via big data analyses and numerical algorithms in Section 4. A set of fuzzy models
for collective opinion elicitation and aggregation are rigorously described in Section 5. Case studies on applications of the formal methodologies
of big data analytics are demonstrated in big poll data mining, collective opinion elicitation, and characteristic equilibrium determination.
2. PROPERTIES OF BIG DATA
The sources of big data are human collective intelligence. It is noteworthy that the first principle of mathematics is abstraction [Bender, 1996;
Wang, 2009c, 2014a]. The essence of data is an abstract qualification or quantification of a real world entity or its attributes against a certain
scale or benchmark. Therefore, data are fundamental information and knowledge of human civilization. The human capability for manipulating
big data indicates the development of sciences and technologies in information, knowledge, and intelligence processing. Typical human activities
that produce big data include many-to-many communications, massive downloads of data replications, digital image collections, and
networked opinion forming.
Data modeling and representation in mathematics have advanced from nonquantitative (O), binary (B), natural number (N), integer (Z), real
(R), fuzzy (F) numbers, to hyper numbers (H) in line with the development of human civilization as illustrated in Figure 1 [Wang, 2014e].
Although decimal numbers and systems are mainly adopted in human civilization, the basic unit of data is a bit [Lewis & Papadimitriou, 1998;
Shannon, 1948], which forms the converged foundation of computer and information sciences. Based on the bit, complex data representations can be aggregated into higher structures such as bytes, natural numbers, real numbers, structured data, databases, and knowledge bases.
Definition 1: Data, D, are an abstract representation of the quantity Q of real-world entities or abstract objects by a quantification
mapping fq based on a certain scale ℝ, i.e.:
(1)
The physical model of data and data storage in computing and the IT industry is the container metaphor, where each bit of data requires a bit of physical memory. However, the relational metaphor is adopted in neuroinformatics and brain science, where data and memory are represented by
the synaptic connections between neurons in the brain [Sternberg, 1998; Wang, 2007c, 2014d; Wang & Wang, 2006; Wang & Fariello, 2012].
Definition 2: Big data are extremely large-scaled data across all facets of data properties such as quantity, complexity, semantics,
distribution, and processing costs.
Basic properties of big data are that they are unstructured, heterogeneous, monotonically growing, mostly nonverbal, and subject to decay of consistency or increase of entropy over time [Wang, 2007a, 2014a]. The inherent complexity and exponentially increasing demands create unprecedented problems in big data engineering such as big data representation, acquisition, storage, searching, retrieval, distribution, efficiency, standardization, consistency, and security. Typical mathematical and computing activities that generate big data are Cartesian products (O(n^2)), sorting (O(n • log n)), searching (exhaustive, O(n^2)), knowledge base update (O(n^2)), and permutation (O(2^n)) [Lewis & Papadimitriou, 1998; Wang, 2007a]. Although
the appearance of data is discrete, the semantics and mechanisms behind them are mainly continuous. This is the essence of the abstraction and
induction principles of natural intelligence.
Big data are fundamental materials for the inductive generation of knowledge. Inversely, big data can be deductively derived as instances of
knowledge in a vast state space. The syntax of data is concrete based on computation and type theories. However, the semantics of data is fuzzy
[Zadeh, 1965, 1975; Wang, 2012c, 2014b, 2014c, 2014e]. The analysis and interpretation of big data may easily exceed the capacity of
conventional counting and statistics technologies. Therefore, the nature of big data requires more efficient and rigorous methodologies for big data
engineering, which studies the properties, theories, and methodologies of big data as well as efficient technologies for big data representation,
organization, manipulations, and applications in industries and everyday lives.
3. CONVENTIONAL VOTING METHODS AND POTENTIAL PITFALLS IN SOCIOLOGY
One of the central sociological principles adopted in popular elections and voting systems is the majority rule where each vote is treated with an
equal weight [Davis et al., 1970; Chevallier et al., 2006; Goldsmith, 2011]. The conventional methods for embodying the majority rule may be
divided into two categories known as the methods of max counting and average weighted sum. The former is the most widely used technology
that determines the simple majority by the greatest number of votes on a certain opinion among multiple or binary options. The latter assigns
various weights to optional opinions, which extends the binary selection to a range of weighted rating. Classic implementations of these voting
methods are proposed by Borda, Condorcet, and others [Emerson, 2013; McLean & Shephard, 2005; Mokken, 1971]. Borda introduced a scale-based system where each cast vote is assigned a rank that represents an individual's preferences [Emerson, 2013]. Condorcet developed a voting technology that determines the winner of an election as the candidate who prevails when paired against each alternative in run-off votes [McLean & Shephard, 2005]. However, formal voting and general elections mainly adopt a mechanism that implements a selection of only one out of n options without any preassigned weight. In this practice for implementing the majority rule in societies, the average weighted sum method is impractical.
Definition 3: The max finding function, max, in sociology elicits the greatest number of votes on a certain opinion, Oi, 1 ≤ i ≤ n, as the
voting result Vn among a set of n options, i.e.:
(2)
where NOi is the number of votes cast for opinion Oi. When there are only two options for the voting, Eq. 2 is reduced to a binary selection.
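One plausible formal reading of Definition 3, offered here as an interpretation of the wording above rather than a verbatim reproduction of Eq. 2, is

\[ V_n = O_{i^*}, \qquad i^* = \arg\max_{1 \le i \le n} N_{O_i}, \]

i.e., the voting result is the single option that attracts the greatest vote count.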
Although the conventional max finding method is widely adopted in almost all kinds of voting systems for collective opinion elicitation, it is an
oversimplified method in terms of accuracy and representativeness. Its major pitfall is that the implied philosophy, the winner takes all, would
often overlook the entire spectrum of distributed opinions. This leads to a pseudo majority dilemma [Saari, 2000; Wang & Wiebe, 2014] as
follows.
Definition 4: The pseudo majority dilemma states that the result of a voting based on the simple max-finding mechanism may not
represent the majority opinion distribution cast in the voting, i.e.:
(3)
A typical case of the pseudo majority dilemma in voting can be elaborated in the following example.
Example 1: A voting with a distributed political spectrum from far right (NO0), right (NO1), neutral (NO2), left (NO3), and far left (NO4) is
shown in Figure 2, where the votes are distributed across the five opinion categories. According to the max-finding method as given in Eq. 2, the voting result is the single opinion receiving the greatest number of votes.
Figure 2. Distribution of collective opinions and their votes
The result indicates that opinion O0 is the winner and the other votes would be ignored. However, in fact, the sum of the votes for the remaining opinions exceeds the number of votes cast for the winning opinion, which illustrates the pseudo majority dilemma.
4. IDENTIFICATION OF THE CHARACTERISTIC EQUILIBRIUM OF COLLECTIVE OPINIONS VIA BIG DATA ANALYTICS
On the basis of the analyses in the preceding section, a key mechanism in collective opinion elicitation is identified, known as the characteristic equilibrium of opinions that exists in social networks and electoral systems. This leads to the introduction of a set of novel methodologies such as opinion spectrum regression, adaptive integration of collective opinions, and allocation of the opinion equilibrium according to given big data samples, beyond conventional counting technologies for max finding.
4.1. Numerical Regression of Opinion Distributions beyond Counting
It is recognized that an overall perspective on the collective opinions elicited from a social network or casted in a vote can be rigorously modeled
as a nonlinear function over the opinion spectrum. In order to implement a complex polynomial regression for the nonlinear function, a
numerical algorithm is developed in MATLAB as shown in Figure 3, which can be applied to analyze any popular opinion distribution against a
certain political spectrum represented by a big set of opinion data. In the analysis program, a 3rd-order polynomial is adopted for curve fitting, while other orders may be chosen when appropriate. The general rule is that the order of the polynomial regression m must be less than the number of collected data points n. Data interpolation technologies may be adopted to improve the smoothness of raw data or to fill in missing points in numerical technologies [Gilat & Subramaniam, 2011; Wang et al., 2013].
Applying the algorithm OpinionRegressionAnalysis(X, Y), a specific polynomial function and a visualized perception of the entire opinion distribution can be rigorously obtained.
Example 2: The seat distribution of Canadian parties in the House of Commons is given in Table 1 [Web-J.J., 2013]. In Table 1, the
relative position of each party on the political spectrum is obtained based on statistics of historical data such as their manifestos, policies,
and public perspectives [Sartori, 1976; Strom, 1990].
Table 1. Voting data distribution by seats in parliament
Party    Seats    Position on the political spectrum
P1       160      50
P2       34       -14
P3       1        -43
P4       4        -71
P5       100      -100
According to the data in Table 1, i.e., X = [-100, -71, -43, -14, 0, 50] and Y = [100, 4, 1, 34, 4, 160], the voting results can be rigorously represented
by the following function, f(x), as a result of the polynomial regression implemented in Figure 3:
(4)
where m = 3 and n = 5.
The above regression results are visually plotted in Figure 4. Because the polynomial regression is a continuous characteristic function, it can be
easily processed for multiple applications such as for opinion spectrum representation, equilibrium determination, and policy gains analyses
based on the equilibrium benchmark as described in Section 4.3.
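As a rough illustration of the regression step, the sketch below reproduces the curve fitting of Example 2 in Python with numpy; the chapter's Figure 3 algorithm is implemented in MATLAB, and the variable names here are illustrative.

```python
import numpy as np

X = np.array([-100, -71, -43, -14, 0, 50])  # positions on the political spectrum
Y = np.array([100, 4, 1, 34, 4, 160])       # seats won at each position (Example 2)

coeffs = np.polyfit(X, Y, 3)                # 3rd-order polynomial regression (m = 3)
f = np.poly1d(coeffs)                       # characteristic opinion function f(x)

xs = np.linspace(X.min(), X.max(), 201)     # evaluation grid for plotting, as in Figure 4
print(coeffs, f(xs[:3]))
```

The resulting coefficients can be compared against those reported in Eq. 4.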
4.2. Determination of the Characteristic Equilibrium of Collective Opinions on Opinion Spectrums
It is recognized that the representative collective opinion on a spectrum of opinion distributions is neither a simple maximum nor a weighted-sum average as conventionally perceived. Instead, it is the centroid of the area covered by the curve of the characteristic function, as marked by the red ⊕ sign shown in Figure 4.
Definition 5: The opinion equilibrium Ξ is the natural centroid of a given weighted opinion distribution where the total votes of the left and right wings reach a balance at the point k, i.e.:
(5)
where f(x) denotes the characteristic opinion function obtained by the regression of Section 4.1 and [a, b] are the bounds of the opinion spectrum.
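Assuming that notation, a plausible formal statement of this balance condition is

\[ \int_{a}^{\Xi} f(x)\,dx \;=\; \int_{\Xi}^{b} f(x)\,dx , \]

i.e., Ξ is the point k at which the integrated opinion mass to its left equals the mass to its right.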
The integration of distributed opinions based on the regression function can be obtained using a numerical integration method. For instance, the
iterative Simpson’s integration method for an arbitrary continuous function f(x) over [a, b] can be described as follows:
(6)
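For reference, a standard statement of the composite Simpson rule (offered as an illustration of Eq. 6 rather than a verbatim reproduction of it), with an even number of subintervals n, step h = (b - a)/n and nodes x_i = a + ih, is

\[ \int_a^b f(x)\,dx \;\approx\; \frac{h}{3}\Big[f(x_0) + 4\sum_{i\ \mathrm{odd}} f(x_i) + 2\sum_{\substack{i\ \mathrm{even}\\ 0<i<n}} f(x_i) + f(x_n)\Big]. \]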
The collective opinion equilibrium determination method as modeled in Eq. 5 is implemented in the algorithm as shown in Figure 5. The core
integration method adopted in the algorithm is based on a built-in function quad() in MATLAB [Gilat & Subramaniam, 2011] that implements
Eq. 6.
Figure 5. Algorithm for characteristic equilibrium determination via big data analytics
Example 3: Applying the characteristic equilibrium determination algorithm to the opinion distribution data in the Canadian general
election as given in Figure 4, the collective opinion equilibrium is obtained as Ξ1 = 20.3. The result indicates that the overall national
characteristic equilibrium of opinions was at the mid-right as cast in 2011.
Because the characteristic equilibrium Ξ is the centroid of the collective opinion integration as defined in Eq. 5, it is obvious that the equilibrium
cannot be simply determined or empirically allocated without the numerical algorithm (Figure 5), as demonstrated in Example 3.
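The sketch below is a hedged Python analogue of this determination step; the chapter's Figure 5 algorithm uses MATLAB's quad(), so scipy.integrate.quad and a root finder are substituted here, the data are those of Example 2, and the output need not reproduce Ξ1 = 20.3 exactly.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

X = np.array([-100, -71, -43, -14, 0, 50])
Y = np.array([100, 4, 1, 34, 4, 160])
f = np.poly1d(np.polyfit(X, Y, 3))      # regression function from Section 4.1

a, b = float(X.min()), float(X.max())

def imbalance(k):
    """Left-wing area minus right-wing area when the spectrum is split at k."""
    left, _ = quad(f, a, k)
    right, _ = quad(f, k, b)
    return left - right

equilibrium = brentq(imbalance, a, b)   # the point k where the two areas balance
print(equilibrium)
```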
4.3. Analysis of Collective Opinions based on Characteristic Equilibrium Benchmarking
Using the methodologies developed in Section 4.2, interesting applications will be demonstrated in this subsection with real-world big data. The
case studies encompass the analyses of a series of general elections in order to find out the dynamic equilibrium shifts and the extrapolation of
potential policy gains based on the historical electoral data.
A benchmark of opinion equilibrium can be established on the basis of a series of the historical data. Based on them, trends of the opinion
equilibriums can be rigorously analyzed in order to explain: a) What was the extent of serial shifts as cast in the general elections? and b)
Which party was closer to the political equilibrium represented by the collective opinions casted in the general elections?
Example 4: The trend in Canadian popular votes over time can be benchmarked by results from the last four general elections as given in
Table 2. Applying the opinion equilibrium determination algorithm opinion_equilibrium_analysis as given in Figure 5, the collective opinions as distributed in Figure 6 can be rigorously elicited, which indicates a dynamic shifting pattern of the collective opinion equilibriums, i.e., 5.0 → 7.4 → 7.0 → 10.9, on the political spectrum between [-100, 100] from 2004 to 2011.
Table 2. Historical electoral data distributions of Canadian general elections
The characteristic opinion equilibrium determination method provides insight for revealing the implied trends and the entire collective opinions
distributed on the political spectrum. An interesting finding in Example 4 is that, although several parties on the left of the spectrum, -100 ≤ x < 0, had won a significant number of votes as shown in Table 2, the collective opinion equilibrium remained largely unchanged in the centre-right area, where Ξ = 7.6 on average.
5. FUZZY METHODOLOGY FOR COLLECTIVE OPINION ELICITATION AND ANALYSIS BASED ON BIG DATA
Big data analytic technologies for collective opinion elicitation based on historical data have been demonstrated in preceding sections, which
reveal that a party may gain more votes by adapting its policy towards the political equilibrium established in past elections. It is recognized that
a social system is conservative and does not change rapidly over time, because of its huge population base and human cognitive tendencies, according to the long-lifespan system theory [Wang, 2012d]. However, the collective opinion equilibriums do shift dynamically. Therefore, an advanced technology for enhancing potential policy gains is to calibrate the current collective opinion equilibrium by polls in order to support up-to-date analysis and prediction.
The typical technology for detecting current collective opinion equilibrium is by polls. A poll may be designed to test the impact of a potential
policy in order to establish a newly projected equilibrium. The projected equilibrium will be used to update and adjust the historical benchmark.
In this approach, rational predictions of policy gains towards a general election or a social network vote can be obtained in a series of analytic
regressions as formally described in the remainder of this section.
Definition 6: An opinion oi on a given policy pi is a fuzzy set of degrees of weights expressed by j, 1 ≤ j ≤ m, groups in the uniform scale I,
i.e.:
(7)
where the bigR notation represents recurring entities or repetitive functions indexed by the subscript [Wang, 2007a].
The normalized scale for fuzzy analyses is a universal one because any other scale can be mapped into it.
Definition 7: A collective opinion on a set of n policies pi, 1≤ i ≤ n, is a compound opinion as a fuzzy set of average weights
on each policy, i.e.:
(8)
where the average weights may be aggregated against the averages of each row or column, which indicate the collective opinions on a certain policy cast by all groups or those on all policies of a certain group, respectively, as illustrated in Table 3.
Table 3. Sample poll data of collective opinions
       G1         G2         G3         G4         G5
p1     0.2, 0.1   0.4, 0.4   0.9, 0.7   0.7, 0.6   0.5, 0.9
p2     1.0, 0.9   0.5, 0.7   0.8, 0.9   0.5, 0.4   0.6, 0.5
p3     0.5, 0.2   0.6, 0.3   0.3, 0.5   0.7, 0.6   1.0, 0.8
Example 5: The collective opinion on the set of 3 testing policies against 5 groups on the political spectrum can be elicited based on a set
of large sample poll data as summarized in Table 3. The current average weights of opinions and those of the historical ones are
aggregated from the sample data of individual opinions according to Eqs. 7 and 8.
Definition 8: The complexity or size of poll data is proportional to the numbers of testing policies |P|, groups on the spectrum |G|, and sample individuals Nq, i.e.:
(9)
where 2,000 tests in a poll will result in a collection of big data of 30,000 raw individual opinions as in the settings of Example 5.
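Based on the wording of Definition 8, a plausible formal reading is

\[ \|D_{poll}\| \;=\; |P| \cdot |G| \cdot N_q , \]

which, for the settings of Example 5 (3 policies, 5 groups, and an assumed N_q = 2,000 sampled individuals), gives 3 × 5 × 2,000 = 30,000 raw individual opinions.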
Definition 9: The effect of a set of policies is a fuzzy matrix of the average weighted differences between the current opinion and the
historical ones for the ith policy on the collective opinion of the jth group, i.e.:
(10)
Example 6: Based on the summarized poll data as given in Table 3 with the average collective opinions, the fuzzy set of effects of the ith policy on the collective opinion of the jth group can be quantitatively determined according to Eq. 10, where the most effective policy is p3→{G1, G2} with a 30% improvement, while the most negatively effective policy is p1→G5 with a -40% loss.
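The sketch below recomputes this effect matrix with numpy as a check; reading each Table 3 cell as (current weight, historical weight) is an interpretation of Example 5, and the variable names are illustrative.

```python
import numpy as np

# Rows: policies p1..p3; columns: groups G1..G5 (values transcribed from Table 3).
current = np.array([[0.2, 0.4, 0.9, 0.7, 0.5],
                    [1.0, 0.5, 0.8, 0.5, 0.6],
                    [0.5, 0.6, 0.3, 0.7, 1.0]])
historical = np.array([[0.1, 0.4, 0.7, 0.6, 0.9],
                       [0.9, 0.7, 0.9, 0.4, 0.5],
                       [0.2, 0.3, 0.5, 0.6, 0.8]])

effect = current - historical   # Eq. 10: current minus historical average weights
print(np.round(effect, 2))
# The largest positive entries are p3 -> G1 and p3 -> G2 (+0.3), and the largest
# negative entry is p1 -> G5 (-0.4), matching the figures quoted in Example 6.
```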
Definition 10: The impact of a policy is a fuzzy matrix of products of effects and the corresponding group sizes , i.e.:
(11)
where the ± sign indicates a positive or negative impact on a target group, respectively.
Example 7: On the basis of Table 3 and Example 6, the impact of each tested policy is a fuzzy matrix of the products of individual group size and the effects, which projects the ith policy onto the jth group with its group size, i.e., as presented in Box 1.
Box 1.
Definition 11: The gain of policy impacts is a fuzzy set of the mathematical means of the cumulative impacts that each group obtains as a result of the series of aggregations from the initial poll data, i.e.:
(12)
Example 8: The potential average gain of policy impacts can be derived according to Eq. 12 based on the results in Example 7, as presented in Box 2.
Box 2.
The projected gains or losses, G, over the political spectrum produce a new set of estimated electoral distributions Y = Y’ + G = [4508474,
889788, 576221, 2783175, 5832401] + [751412, 29660, -19279, 278317, -194413] = [5259886, 919488, 557014, 3061492, 5637988]. On the basis
of the projected gains derived from current polls of collective opinions, the potential shift of the collective opinion equilibrium on the political spectrum can be predicted using the algorithm shown in Figure 5. The regression result is plotted in Figure 6, which indicates a slight shift of the collective opinion equilibrium towards the middle, i.e., ΔΞ = Ξ2 - Ξ1 = 9.8 – 10.9 = -1.1, in contrast to that of the historical collective opinion distributions.
6. CONCLUSION
Big data analytics has been introduced into sociology and social networks for collective opinion elicitation and characteristic opinion equilibrium
determination. A fundamental principle, the characteristic equilibrium of collective opinions, has been revealed in distributed big data in social networks and electoral systems; it is not simply a weighted average as in conventional sociological methods, but rather the natural centroid of the integrated area of the total opinion distribution over a spectrum. Interesting insights into the nature of large-scale collective opinions
have been presented in poll data mining and collective opinion equilibrium allocation by big data engineering methodologies. A set of numerical
and fuzzy models for collective opinion analyses has been presented for collective opinion elicitation and benchmarking, which enhance the
conventional counting and statistics methodologies.
This work was previously published in the International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 8(3); edited by
Yingxu Wang, pages 2944, copyright year 2014 by IGI Publishing (an imprint of IGI Global).
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their valuable comments on the previous version of this paper.
REFERENCES
Bender, E. A. (1996). Mathematical Methods in Artificial Intelligence . Los Alamitos, CA: IEEE CS Press.
Chevallier, M., Warynski, M., & Sandoz, A. (2006). Success Factors of Geneva's E-Voting System. The Electronic Journal of E-Government, 4, 55–61.
Davis, O., Hinich, M., & Ordeshook, P. (1970). An Expository Development of a Mathematical Model of the Electoral Process. The American
Political Science Review , 64(2), 426–448. doi:10.2307/1953842
Emerson, P. (2013). The Original Borda Count and Partial Voting. Social Choice and Welfare, 40(2), 353–358. doi:10.1007/s00355-011-0603-9
Gilat, A., & Subramaniam, V. (2011). Numerical Methods for Engineers and Scientists (2nd ed.). MA: John Wiley & Sons.
Jacobs, A. (2009). The Pathologies of Big Data . ACM Queue; Tomorrow's Computing Today , (July): 1–12.
Lewis, H. R., & Papadimitriou, C. H. (1998). Elements of the Theory of Computation (2nd ed.). NY: Prentice Hall.
McLean, I., & Shephard, N. (2005). A Program to Implement the Condorcet and Borda Rules in a Small-n Election, Technical Report . UK:
Oxford University.
Mokken, R. J. (1971). A Theory and Procedure in Scale Analysis with Applications in Political Research (pp. 29–233). Netherlands: Mouton &
Co. doi:10.1515/9783110813203
Saari, D. G. (2000). Mathematical Structure of Voting Paradoxes: II. Positional Voting . Journal of Economic Theory , 15(1).
Sartori, G. (1976). Parties and Party Systems: A Framework for Analysis (p. 291). UK: Cambridge University Press.
Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379–423 and 623–656.
Snijders, C., Matzat, U., & Reips, U.-D. (2012). ‘Big Data’: Big Gaps of Knowledge in the Field of Internet . International Journal of Internet
Science , 7, 1–5.
Sternberg, R. J. (1998). In Search of the Human Mind (2nd ed.). NY: Harcourt Brace & Co.
Strom, K. (1990). A behavioral theory of competitive political parties. American Journal of Political Science , 34(2), 565–598.
Wang, Y. (2006). On the Informatics Laws and Deductive Semantics of Software [Part C]. IEEE Transactions on Systems, Man, and
Cybernetics , 36(2), 161–171. doi:10.1109/TSMCC.2006.871138
Wang, Y. (2007a). Software Engineering Foundations: A Software Science Perspective . NY: Auerbach Publications. doi:10.1201/9780203496091
Wang, Y. (2007b). On Cognitive Computing . International Journal of Software Science and Computational Intelligence , 1(3), 1–15.
doi:10.4018/jssci.2009070101
Wang, Y. (2007c). The OAR Model of Neural Informatics for Internal Knowledge Representation in the Brain . International Journal of Cognitive
Informatics and Natural Intelligence , 1(3), 66–77. doi:10.4018/jcini.2007070105
Wang, Y. (2009a). Formal Description of the Cognitive Process of Memorization, Transactions of Computational Science, Springer ,5, 81–98.
Wang, Y. (2009b). Toward a Formal Knowledge System Theory and Its Cognitive Informatics Foundations . Transactions of Computational
Science, Springer , 5, 1–19.
Wang, Y. (2009c). On Abstract Intelligence: Toward a Unified Theory of Natural, Artificial, Machinable, and Computational Intelligence
. International Journal of Software Science and Computational Intelligence , 1(1), 1–17. doi:10.4018/jssci.2009010101
Wang, Y. (2010). Cognitive Robots: A Reference Model towards Intelligent Authentication . IEEE Robotics and Automation , 17(4), 54–62.
doi:10.1109/MRA.2010.938842
Wang, Y. (2012a). On Abstract Intelligence and Brain Informatics: Mapping Cognitive Functions of the Brain onto its Neural Structures
. International Journal of Cognitive Informatics and Natural Intelligence , 6(4), 54–80. doi:10.4018/jcini.2012100103
Wang, Y. (2012b). In Search of Denotational Mathematics: Novel Mathematical Means for Contemporary Intelligence, Brain, and Knowledge
Sciences . Journal of Advanced Mathematics and Applications , 1(1), 4–25. doi:10.1166/jama.2012.1002
Wang, Y. (2012c). Formal Rules for Fuzzy Causal Analyses and Fuzzy Inferences . International Journal of Software Science and Computational
Intelligence , 4(4), 70–86. doi:10.4018/jssci.2012100105
Wang, Y. (2012d). On Long Lifespan Systems and Applications. Journal of Computational and Theoretical Nanoscience, 9(2), 208–216.
doi:10.1166/jctn.2012.2014
Wang, Y. (2013a). Neuroinformatics Models of Human Memory: Mapping the Cognitive Functions of Memory onto Neurophysiological
Structures of the Brain . International Journal of Cognitive Informatics and Natural Intelligence , 7(1), 98–122. doi:10.4018/jcini.2013010105
Wang, Y. (2013b). Neuroinformatics Models of Human Memory: Mapping the Cognitive Functions of Memory onto Neurophysiological
Structures of the Brain . International Journal of Cognitive Informatics and Natural Intelligence , 7(1), 98–122. doi:10.4018/jcini.2013010105
Wang, Y. (2014a), Keynote: From Information Revolution to Intelligence Revolution: Big Data Science vs. Intelligence Science,Proc. 13th IEEE
International Conference on Cognitive Informatics and Cognitive Computing (ICCI*CC 2014), London, UK, IEEE CS Press, Aug., pp. 3-5.
Wang, Y. (2014b). Fuzzy Causal Inferences Based on Fuzzy Semantics of Fuzzy Concepts in Cognitive Computing . WSEAS Transactions on
Computers , 13, 430–441.
Wang, Y. (2014c), Towards a Theory of Fuzzy Probability for Cognitive Computing, Proc. 13th IEEE International Conference on Cognitive
Informatics and Cognitive Computing (ICCI*CC 2014), London, UK, IEEE CS Press, Aug., pp. 21-29. 10.1109/ICCI-CC.2014.6921436
Wang, Y. (2014d), Keynote: Latest Advances in Neuroinformatics and Fuzzy Systems, Proceedings of 2014 International Conference on Neural
Networks and Fuzzy Systems (ICNF-FS’14), Venice, Italy, March, pp. 14-15.
Wang, Y. (2014e). The Theory of Fuzzy Arithmetic in the Extended Domain of Fuzzy Numbers . Journal of Advanced Mathematics and
Applications , 3(2), 165–175. doi:10.1166/jama.2014.1063
Wang, Y. (2015), Keynotes: Big Data Algebra: A Rigorous Approach to Big Data Analytics and Engineering, 17th International Conference on
Mathematical and Computational Methods in Science and Engineering (MACMESE '15), Kuala Lumpur, April, in Press.
Wang, Y., & Berwick, R. C. (2012). Towards a Formal Framework of Cognitive Linguistics . Journal of Advanced Mathematics and
Applications , 1(2), 250–263. doi:10.1166/jama.2012.1019
Wang, Y., & Fariello, G. (2012). On Neuroinformatics: Mathematical Models of Neuroscience and Neurocomputing. Journal of Advanced
Mathematics and Applications , 1(2), 206–217. doi:10.1166/jama.2012.1015
Wang, Y., Liu, D., & Ruhe, G. (2004), Formal Description of the Cognitive Process of Decision Making, Proceedings of the 3rd IEEE International
Conference on Cognitive Informatics (ICCI'04), IEEE CS Press, Canada, August, pp.124-130.
Wang, Y., Nielsen, J., & Dimitrov, V. (2013). Novel Optimization Theories and Implementations in Numerical Methods. International Journal of Advanced Mathematics and Applications, 2(1), 2–12. doi:10.1166/jama.2013.1025
Wang, Y., Wang, Y., Patel, S., & Patel, D. (2006). A Layered Reference Model of the Brain (LRMB) [Part C]. IEEE Transactions on Systems, Man,
and Cybernetics , 36(2), 124–133. doi:10.1109/TSMCC.2006.871126
Wang, Y., & Wiebe, V. J. (2014), Big Data Analyses for Collective Opinion Elicitation in Social Networks, Proceedings of IEEE 2014 International Conference on Big Data Science and Engineering (BDSE'14), Beijing, China, Sept., pp. 630-637.
Zadeh, L. A. (1975). Fuzzy Logic and Approximate Reasoning. Synthese, 30(3-4), 407–428. doi:10.1007/BF00485052
Section 5
Jurgen Janssens
TETRADE Consulting, Belgium
ABSTRACT
To make the deeply rooted layers of catalyzing technology and optimized modelling gain their true value for education, healthcare or other public
services, it is necessary to properly prepare the Big Data environment in which the Big Data will be developed, and to integrate elements of it into the project approach. It is by integrating and managing these non-technical aspects of project reality that analytics will be accepted. This will enable data power to infuse the organizational processes and ultimately offer real added value. This chapter will shed light on complementary actions
required on different levels. It will be analyzed how this layered effort starts by a good understanding of the different elements that contribute to
the definition of an organization’s Big Data ecosystem. It will be explained how this interacts with the management of expectations, needs, goals
and change. Lastly, a closer look will be given at the importance of portfolio based big picture thinking.
INTRODUCTION
Big Data is an extremely vast field. Big Data can be all about Hadoop, MapReduce, Tableau, HANA, Nexidia and Stata. Big Data can be all about
crunching and capturing regional specificities. It can be all about statistical modelling, tendency plotting and data supporting technological
optimization. Big Data can be and is indeed all of this. But it is also much more: it is about elevating an organization to unexplored reflection
paths.
To make the deeply rooted layers of catalyzing technology and optimized modelling gain their true value for education, healthcare or other public services, it is necessary to properly prepare the Big Data environment in which the Big Data will be developed, and to integrate elements of it into the project approach. It is by integrating and managing these non-technical aspects of project reality that analytics will be accepted. This will enable data power to infuse the organizational processes and ultimately offer real added value.
This chapter will shed light on the organizational, human and change management actions required on different levels to maximize the unfolding of the Big Data potential. It will be analyzed how this layered effort starts with a good delineation of the different elements of the organization’s Big Data ecosystem. It will be explained how the management of expectations, needs and goals is essential for the fit between the silver lining and the technical realization. Lastly, to ensure feasibility and long-term contribution, a closer look will be taken at the importance of the bigger portfolio picture.
All together this chapter will illustrate that managing Big Data projects in a public context can only deliver a solid result if the organizational
context and the human reality are embraced together with the technical challenges.
BACKGROUND
In the Big Data definition provided by research and advisory firm Gartner (2014) a strong emphasis is put on technology related aspects and the
potential contribution to decision making. As indicated by Kobielus (2013), Big Data players like IBM base themselves on 4 Vs to characterize the
key elements of Big Data success: Volume, Variety, Veracity and Velocity.
This tendency to focus on the technical aspects is part of a larger reality where the Big Data potential is put in the perspective of the always
moving frontier of technological possibilities. With recent estimates by Turner, Reinsel and Gantz (2014) expecting the digital data created by humans and devices to increase 50-fold between 2012 and 2020, to almost 40 ZB, it seems very unlikely that the attention for the technological aspects of Big Data will fade soon.
At the same time, research has indicated repeatedly that the majority of projects focusing on data analytics fail because of non-technical reasons,
or because they do not deliver the benefits that are agreed upon at the start of the project (Young, 2003; Gulla, 2012; Van der Meulen & Rivera,
2013). As tools and technology are used by people who are working themselves in an organizational context, this technical IT focus of Big Data
endeavors should therefore be complemented with other aspects of the reality.
In that context, paying attention to change on the human level and opening the mind to a new way of thinking are part of the ‘soft’ factors that play a fundamental role in paving the road to the success of Big Data projects. There is, after all, limited added value in providing a hyper-advanced data cruncher to people who only want to execute the very same daily professional routine, and use the tools they have been using for years. If they are not gradually prepared for this change and don’t grasp the potential advantages, there will be no ‘real’ use of the data tool. If they have the impression that they don’t have the means to work efficiently anymore, they may even blame Big Data for a disruption of service.
Complementarily, managing data projects means being able to look beyond the most exciting Big Data facets, and taking time for the guiding project backbone. Freewheeling on the most stimulating ideas can only result in a concrete and valuable outcome if the efforts are framed, planned and guided towards a final goal.
Project Management frameworks exist that provide a sound base for structured project follow-up. Prince21 and PMBOK2, for instance, have been developed and improved since 1989 and 1996 respectively. They are nowadays applied in different sectors by private and public organizations all over the world.
These frameworks provide structured guidance on essential project aspects from the very start until final delivery. They cover soft and hard matters, ranging from the building of a solid Business Case or the management of Stakeholder dimensions, to a detailed product description or thorough quality reviews.
References to the use of these frameworks are however relatively limited in the context of Big Data. The author wants to translate their underlying philosophy into practical advice for Big Data projects, as data management is bound to become a powerful driver, not only of technological evolution, but also of strategic decisions and organizational change, especially in the public-private context.
MANAGING DATA POWER BY BLENDING IN PEOPLE AND ORGANISATION
Issues, Controversies, Problems
Due to its multiple dimensions and constant evolutions, Big Data is often regarded as a technical subject matter. By extension, there is a natural tendency to see Big Data projects as purely technical projects. Failing to also take organizational and human aspects into account in the management of projects, however, entails several risks.
Firstly, there is a risk that the delivered results are not in line with what is wanted, even if the outcome is technically very complete and efficient.
Senior stakeholders might have had something fundamentally different in mind, which could, if the expectations are not met, lead to a lack of
final acceptance or a lack of support for future projects.
For end users, the proposed change in the way of working could be perceived as unworkable or insurmountable, and therefore without added value.
This resistance could then lead to the outcome being wrongly used (or not used at all). In more severe cases, the project or its outcome could
simply be put in the fridge or thrown in the trash bin due to its perceived inadequacy.
Another, more fundamental, risk is that there might be a total misfit between the delivered results and the strategic direction the organization wants to follow. Managing and executing a project from an isolated technical perspective could even result in the organization being held back from exploring new directions in its current development, or missing evolutions that could have opened doors to the future. On a macro level, this would mean that public bodies miss the opportunity to keep their services to society on par with those offered by private counterparts. In the worst case, this lack of coordinated project advancement could lead to public organizations remaining structurally behind.
Technical possibilities to perfect the unfolding of the Big Data potential are not in the scope of this chapter. In this chapter, the author analyzes practical guidelines on the organizational and human level, to be used as a complement to the technical foundation. Once these guidelines are structurally integrated in the management of Big Data projects, they will contribute to making that foundation deliver its true value in the ecosystem for which it is intended.
In combination with representative examples from the public sphere, it will be detailed how the layered effort starts with a thorough understanding of the different elements of an organization’s Big Data ecosystem. It will be explained how this interacts with the management of expectations, the identification and control of the needs, and consistent big picture thinking on the portfolio level.
All together this chapter will illustrate that managing Big Data projects in a public context can only come to a satisfying result for the entire stakeholder community if the organizational and human reality is embraced in support of the technical efforts.
SOLUTIONS AND RECOMMENDATIONS
To optimize the benefit public organizations can obtain from their own efforts in optimized modelling and out-of-the-box experimentation, it is indeed necessary to combine data work with thorough guidance of the non-technical elements of the ecosystem in which Big Data3 will be developed and used. It is this synchronized interaction between technology and reality that will open the doors to real progress and new insights.
This requires efforts on different levels. It starts with a good definition of the components of the expected Big Data ecosystem. It encompasses the
management of expectations, needs, goals and change. Lastly, it involves some big picture thinking, by making sure the Data ambition is
compatible with the entire Project Portfolio.
The Elements to Build the Big Data Ecosystem
The socio-economic cycles and the competitive market reality put pressure towards higher efficiency of administrations, stronger decision taking by leaders and more sophisticated technology to keep public services on par with private ones. One of the means of getting there is Big Data, as it is at the intersection of increased data power and reinforced actionability.
Before kick-starting the analysis of the ‘real’ data, it is essential to reflect on the fundamental elements of the larger Big Data ecosystem.
Knowing what the organization wants and where the organization wants to go is essential in the shaping of the Big Data initiatives by the project
sponsor (in the very beginning) or by the project manager (along the different phases).
Traditionally, these discussions are driven by choices on technology, optimization layers, or expected improvements. From a pure management
perspective, however, the categorization that makes most sense is one that fits the angle of the decision taker(s).
This essence focused helicopter view can be articulated around some key drivers. What is the final objective for the organization, and what
dimensions are particularly important? Does the organization want to gain experience or target immediately solid transformation? What are the
technical and human means at hand?
Once these fundamentals are clarified, sleeves can be rolled up and the management of the Big Data project can start.
Possible Objectives
In the public sector, developing data powered decision taking is mostly done with one or more specific benefits and objectives in mind: Internal improvement, Innovation towards external stakeholders, Development of integrated solutions or, depending on the type of public body, possibly Building collaborations with public or private actors.
Projects focusing on Internal Improvement want to have a better view and understanding of the internal functioning of one or more departments or services. The factual data are used to define specific improvements or end-to-end process enhancements. The reason for doing so is to attain budgetary efficiencies, improve the image towards internal or external stakeholders, or, in the end, increase ‘customer’ satisfaction, for instance to avoid the community moving away from public services for which private alternatives exist.
Public or semipublic bodies may also desire to identify possibilities for Innovation. It is unlikely that this type of project will take place in an isolated way, as innovation should not be set in motion for the sole sake of innovation. Such projects will therefore complement the other categories mentioned. This can for instance be in an environment that desires to give itself the means for a significant improvement after many
years of technological status quo (see Figure 1).
Figure 1. The objective of Big Data projects can be Internal
improvement, Innovation towards External parties, the
Development of Integrated solutions, or Building
Collaborations
Environments with a higher Big Data maturity may want to take a leap forward by working on integrated solutions. Rather than working on
specific improvements, a combined program will focus on a larger quality injection, by joining complementary initiatives, or focusing on
improvements that go beyond departmental borders. City administrations may for example decide to join forces in offering a unique entry point
for all questions and administrative obligations of entrepreneurs. Cities like Brussels have this already for entrepreneurs4. By working on an
integrated data solution that is used as a collaborative platform for the different services and as a smooth front end service by the companies, all
concerned parties benefit from this advanced approach.
In specific contexts, the Big Data objective of starting or enhancing collaborations with public or private actors can be pursued through the above initiatives, or through a dedicated collaboration program, fueled with Big Data information. Such initiatives can bring additional automation, maturity or strategic value to an organization. This, in turn, can open the door to collaboration with others.
This could potentially be the case if several hospitals are owned by the same local body. In their desire to better allocate their financial resources
and offer a better service, it could be beneficial for the different hospitals to reach the same level of data intelligence. The resulting integrated
understanding from their shared ‘population’ would then make it possible to develop a complementary service portfolio. Over time, it could even
help to balance peaks within this community for the services that each institution would still provide separately (e.g. emergency treatments,
maternity…).
Similarly, public transportation companies could be very interested in levelling up in preparation of joint initiatives, for instance in areas where
railway transport and urban transport (underground, bus, tram) are managed by different entities. If companies attain a similar quality and
granularity of their transportation data, they could optimize their offerings or develop synchronized schedules for dense zones. It is even
imaginable to go a step further by sharing the physical customer passes, which would pave the way to even more shared optimization initiatives (and possible cost reductions). In Belgium, different Belgian public transportation companies are exploring this path by gradually using the shared smart travel card MOBIB (Baele & Devraux, 2014). This card, initially used only by one of them, now offers the possibility for multi-modal use and follow-up.
Overall, knowing and expressing the objectives thus offers a first element of focus. Note that none of these areas of improvement is ‘better’ or ‘more ambitious’ from a Big Data angle than the others. It is the fit of the project(s) with the organization that is the key to success. As will be explained in one of the coming sections, paving the road to this fit starts with a clear understanding of the needs, the approach and the available means. Once these are determined, one can evaluate which initiative is best suited to answer the needs.
Starting Small or Going for a Big Bang
Closely related to the reflection on the objectives, public bodies need to take sufficient time to define the suitable project size and the return on
experience that they want to obtain.
Although each Big Data project should leave some freedom for experimentation (Viaene, 2014), an organization can desire to start with a small
project to optimize the return on experience. The driver can be very down to earth, i.e. harvesting low-hanging, quick-win fruit. A more fundamental driver for starting with a limited perimeter can be to gain experience first, for example before larger projects. This experience development is possible on the technical level: on new data modeling techniques, on the use of Hadoop, HANA, Spotfire or Tableau, on new types of hardware, and the like. It is advised to manage the project in such a way that technical experimentation is stimulated, with the experience cascading to the overall stakeholder community.
The gaining of experience can also be focused on the human level. A small project can for instance have the advantage to develop concrete use
cases to create the necessary awareness and change in mindset amongst stakeholders. This will reduce a possible reluctance and resistance that
can arise in a more profound, disruptive context.
Besides the technical and human dimension, starting small can offer valuable experience on the managerial level. It can provide the necessary
lessons learned to adapt or fine-tune the way of working for subsequent projects, highlight the skills that need to be improved, and the like.
Instead of kicking off a small, focused project with specific added value (or after having started small), one may decide to go for a full-blown Big Bang. The
reason to do so can be that the desired level of experience is already present, or that important windows of opportunity cannot be missed for
organizational, economic or political reasons. If well prepared and managed, Big Bang projects have the potential advantage to bring higher
informational return.
In public bodies, the order-of-magnitude of this ‘big’ can be expressed in terms of impacted processes, the number of internal people, the size of
the (external) community, or the level of overall disruption compared to the existing situation.
Altogether, small projects thus offer the occasion for a focused gain of experience, whereas large projects offer potentially larger results, but require
more coordination.
In both scenarios, it is important to keep in mind that a Big Data project is not only an IT project, but also an organizational and human journey
requiring the necessary preparation, follow-up and guidance. The better the initial choices are made and the better the actions are supported, the
higher the likelihood that the initiative will be embraced by everyone, and offer the necessary transformational value.
At the same time, both small and large projects need to be ‘prepared for success’. Initiatives delivering the expected results with the necessary
visibility can initiate a demand for an extended use to other departments, or for additional analytical models in different contexts. It is therefore
advised, both for small and for large projects, to foresee the necessary room for capacity growth, in terms of technical and human means, skills
and budget. This aspect will be discussed in more detail in one of the next sections.
Types of Data
In complement of the preliminary reflection on the objective and the size of the Big Data initiative, it is important to do an assessment of the
desired and available data. Whether it concerns a pilot project that is ‘only’ intended to gain experience, or a ‘real’ full-fledged one, data can be grouped from two different angles.
A first categorization focuses from the technical angle on the distinction between structured data and non-structured data. Structured data are
‘traditional’ data, like databases, spreadsheets, or statistical analytics. Non-structured data come from less conventional data sources like free-
text payment descriptions, content of social media and blogs, emails, news feeds, instant messages and the like.
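As a minimal illustration of this distinction, the sketch below (in Python, with hypothetical file names and columns) reads a structured source with a fixed schema and treats a set of free-text notes as unstructured input that first has to be tokenized before any analysis can take place.

# Minimal sketch of the structured vs. non-structured distinction.
# The file names ("payments.csv", "payment_notes.txt") and the "amount" column are hypothetical.
import csv
from collections import Counter

# Structured data: rows and columns with a fixed schema.
with open("payments.csv", newline="") as f:
    rows = list(csv.DictReader(f))                    # e.g. columns: id, amount, date
total_amount = sum(float(r["amount"]) for r in rows)

# Non-structured data: free text that must be tokenized before analysis.
with open("payment_notes.txt", encoding="utf-8") as f:
    words = Counter(f.read().lower().split())         # crude word-frequency view

print(f"{len(rows)} structured records, total amount {total_amount:.2f}")
print("Most frequent words in the free text:", words.most_common(5))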
A second approach consists of gaining insights through a division in Human/Social data, Location data, Market Data, Machine related data and
Smart Data (Yasuda, Yasuharu, & Yoshida, 2014). This distinction of the data opens the door to more ‘contextual’ reflections on the project scope.
Human and Social Data result from initiatives intended to obtain data on the human body or to study collective human social behavior. On the
medical level, this can provide insights on individual habits or the evolution of certain variables. In the US, for example, the government funded
private organization UNOS5 is using the outcome of Big Data initiatives to continuously improve the algorithms used in the optimization of waiting
lists for organ transplants.
On the level of (sub)societal dynamics, the Social Data provide insights on patterned habits and recurring interactions between groups of
individuals. Public transportation companies or healthcare programs like healthcare.gov6 (also known as ‘Obamacare’) could bring significant
improvements to their online services by complementing attractiveness-focused redesign initiatives (Paskin, 2014) with intelligent mining of all information on the time people spend on specific pages, the type of information they search for, and the like. This information will not only help in optimizing the general layout and perceived efficiency of the services. It could also be the trigger for providing additional information that would fit queries done simultaneously by other people in the same area. This could prove very useful in periods of seasonal issues, extreme
climatological problems or pandemic concerns.
Using Human and Social data in the public sector requires paying attention to two subtle ‘traps’.
A first one is data security. No public body wants to announce that it has lost control over citizens’ private information. This is applicable to all types of data, but is a more sensitive matter for human, social and health related information. As hospitals are, for instance, hacked more and more often (Orcut, 2014), public bodies will want to dedicate additional attention to this dimension, to avoid citizens moving from public to private services due to (the impression of) weak data protection. For details on this matter, the author refers the reader to specialized literature.
A second attention point is avoiding being regarded as Orwellian in the use of these data. Public bodies need to reflect well on how sensitive or ‘intrusive’ the use of certain data is. This may lead to a fractional or gradual use until sufficient public acceptance and trust is available.
In certain exceptional cases, the opposite may be true. The advantages that the pooled use of medical or genetic information can offer then prevail over the impact on privacy. This is the case in Iceland, where the medical records and genealogical and genetic data of most Icelanders were, until some years ago, all present in a common database, deCODE7. As the population is rather small and most citizens have Icelandic roots, the initiative wanted to provide the keys to solving possible general health issues impacting the entire Icelandic population. In the meantime, the attempts to make an Icelandic Health Sector Database out of it have been stopped (Gertz, 2004). The outcome is used, amongst others, as the basis
for Íslendingabók8 (literally ‘the Book of Icelanders’), a search engine that allows Icelanders to know if they are dating distant family members.
In the same data pooling train of thought, an initiative was launched by Sage Bionetworks9. This nonprofit organization promotes open science,
and has developed the ethical procedures needed to create an open database of anonymized health and genetics related data from many sources.
Dutcher (2014) argues that the compilation of test results in one location would turn genetic info into Big Data, giving scientists new insights that
could accelerate findings, and strongly influence the current approach to healthcare research.
Location Data result from initiatives intended to obtain information on the location of people, based on mobile technology, radio frequency or
GPS powered devices like phones, computers, cars or public transport. The patterns that become apparent through the analysis of location data
are used to steer (or adjust) business decisions and marketing initiatives. If translated intelligently, they are then experienced as quality
enhancing services by the targeted people.
In a series of cases, this information is used by private companies. Local governments can however benefit greatly from this information, for example to optimize the fluidity of traffic. A noteworthy example on this level is under development in Finland. Inspired by a master’s thesis written in 2014 (Heikkilä, 2014), the city of Helsinki is planning to create a 2.0 transportation system by 2025. The goal is to have all types of
public and private traffic synchronized through the combination of transport sharing, solid technology to guarantee streamlined, affordable
payment, and ambitious localization programs (see Figure 2).
Market Data result from initiatives intended to collect and analyze the visual data collected in public spaces. Although the use of visual data is
subject to specific legislation in each country, it does offer public bodies useful information. Instead of using GPS data, cities could for instance
use Market Data to optimize traffic light synchronization or optimize road cleaning schedules as a function of the visually observed traffic density.
For certain niches of the public sector, Machine related data and Smart Data are increasingly important, for instance in the improvement of (sub)processes for the healthcare sector or in the leaner running of Utilities or other (semi)public facilities.
Machine Data result from initiatives intended to obtain operational knowledge of the (often real time) functioning of machines. These data are
generated by monitoring industrial devices. They are used to optimize machines and maintenance, with the real aim of improving business
processes, service quality or safety.
In the public sector, these data can potentially be obtained through healthcare equipment, IT hardware, school infrastructure, traffic lights,
public transport vehicles, mining or manufacturing machines and the like.
Smart Data result from smart infrastructure used by private or (semi)public actors. Whereas machine data are used to optimize the efficiency of machines and equipment, smart data focus on gaining insights into usage patterns or potential infrastructure optimizations. This concerns
for instance equipment used by telecom companies, or in industries like specific financial services, mining or energy distribution or supply (Ala-
Kurikka, 2010).
Although smart data are mostly generated and used by private or semipublic companies (like utility companies), it is worth noticing that smart
data can also be used in more ‘traditional’ public environments. In certain countries, data-warehousing-powered vending machines are already gathering consumption and machine data in real time to improve stock management and product freshness (Honaman, 2010). According to Nelson (2014), Coke’s Freestyle dispensing machines go a step further. They have the bold purpose of better grasping the consumption profiles and tastes of the customers in the different locations where people long for their daily drink.
Currently, these data are privately owned by the Coca-Cola Company. But when taking into account that the vast majority of public schools have
vending machines, it is not unthinkable that educational bodies start similar initiatives to have more information on consumption habits, or to
build and improve health awareness programs for children and students.
The evaluation of the different types of data does not imply that an organization needs to choose and focus solely on one category of data – quite the contrary. In practice, the power of Big Data projects lies exactly in obtaining the most from a constellation of data – be it from a range of interlinked (sub)categories or from a variety of categories where the interconnections seem less straightforward.
Determining the dataset(s) can even imply going beyond traditional borders. Whereas past initiatives would focus on improving the use (and outcome) of existing data and in-house algorithms, it is now possible to maximize the value of one’s own sources by combining them with public and private third-party data that have been made available to a larger audience.
New York City, for instance, set up the Mayor’s Office of Data Analytics (MODA)10 to tackle the challenges of rapid urbanization. Through this
vehicle, a series of local government agencies openly share their data. By 2018, all city agencies of New York will have to have their data openly shared. In October 2014, the data streamed into MODA had already resulted in more than 300 data sets. This allows each agency to work on different
dimensions of urbanization (such as crime, public safety, overcrowding, road incidents, and pollution), by benefitting from information that was
previously not directly available.
A similar potential exists beyond the collaboration with local sister organizations. The EU11, the US12, the OECD13, the Worldbank14 and many governments have made large amounts of data available that could significantly improve the granularity and power of data initiatives.
Although private open data initiatives are regularly under discussion (Herzberg, 2014), one can obtain the same collaborative advantage through some crossover platforms. Algorithmia15, for instance, builds a bridge between people who have developed algorithms for which they see specific possibilities, and organizations or companies that are looking for algorithms that solve certain problems or provide specific information. Similar collaborations exist between academic and non-academic actors. Reviewing such open initiatives can help to resize the scope and move forward more quickly.
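As a minimal, hypothetical illustration of combining own sources with such open data, the sketch below (in Python, using pandas) joins an internal operational extract with an open dataset that has already been downloaded as a CSV file; the file names, the join key and the column names are all assumptions.

# Sketch: enriching an internal dataset with an open third-party extract.
# File names, the "district" join key and the column names are illustrative assumptions.
import pandas as pd

internal = pd.read_csv("service_requests.csv")         # own operational data
open_data = pd.read_csv("open_population_stats.csv")   # downloaded open dataset

# Join on a shared key and derive a simple per-capita indicator.
combined = internal.merge(open_data, on="district", how="left")
combined["requests_per_1000"] = combined["request_count"] / combined["population"] * 1000

print(combined[["district", "requests_per_1000"]].head())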
Assessing Means and Skills
Private companies embrace the potential of new technologies to stay ahead of the competition. Certain public bodies or (semi)public companies might want to do the same with one of the aforementioned objectives in mind. At the same time, the preliminary review of the desired and available data might reveal that this is too challenging, due to a lack of means or skills. Healthy ambition requires, however, that an organization gives itself the necessary means to succeed.
Crossing the river that meanders between the as-is and to-be banks without going to the very end is rarely interesting. Therefore, three options
are at hand. A first one is to increase the technical, human or financial means. A second option is to limit the scope (without necessarily limiting
the reflection process), and to do the rest once sufficient means are at disposal. Lastly, the commitment can be taken that the final goal will be
attained, but by building upfront a phased plan.
The last option was chosen in the earlier mentioned transportation project in Helsinki. Initially, the city had only access to the public
transportation data. To solve this in due time, it increased its technical means by building collaboration platforms like Traffic Lab16 and plans to
extend existing initiatives to integrate gradually (by 2025) all desired data of the private transportation companies, taxis, and car sharing pools.
Besides a lack of means, there might be a (temporary or structural) shortage of qualified skills. In such situations, the scope or the project
planning needs to be adapted, knowledge needs to be ‘transferred’, or reinforcement needs to be foreseen with the help of additional resources.
A non-negligible side note for knowledge transfer or resource reinforcement is the time needed for a new (internal or external) resource to be up
to speed. It may indeed require a significant amount of time before a person sufficiently understands the practical or technical specificities of an
organization.
It is therefore advisable to foresee a solid scaling and contingency margin when evaluating the resource needs and the time needed for the
different phases. An alternative is to foresee the transition of internal resources from other projects. This has, however, an impact on the project
portfolio. This aspect will be discussed in a dedicated section.
Note that securing the necessary human means and technical skills might require the set-up of new teams, or new collaborative combinations
within the existing organizational framework. It might require people to change their way of working or the way they are organized. It might even
involve the investment in talent and the development of new skills. Given these potential reorganizations, it is advised to foresee the skill-based
analysis upfront, possibly in collaboration with the concerned HR manager(s) of the respective units (see Figure 3).
The skill and resource constraints can occur, for instance, in projects in an educational context. Every student corresponds to a lot of
(continuously growing) information in the files of the school. A school – or an association of schools – may decide to optimize the use of these
data for educational purposes. Such initiatives may trigger privacy concerns from parental associations, for example due to fear of misuse or selling of this information to third parties. Privacy concerns aside, schools may be faced with an important expertise problem in the realization of the data project. Many schools simply lack the technical abilities to structure these data efficiently, and, especially, the skills to manage the databases and the related infrastructure.
Similar challenges can happen in healthcare contexts. Such initiatives require, in addition to budget, the availability of skilled people to prepare
the set-up of the infrastructure and train the practitioners, people available during the trainings to ensure the continuity of activities and services,
and the necessary staff to actively use this information, while keeping a critical eye on potential improvements. Developing such data empowered
projects requires therefore the necessary sizing.
In 2012, a British hospital benefitted from supercomputers usually used by the McLaren group in monitoring hundreds of variables and
thousands of health indicators from their Formula 1 drivers. During this period, the hospital was able to enrich its traditional follow-up with predictive
early warnings through increased, real-time analytical power (The Health Foundation, 2012). A project with a similar potential exists in New
York City, at the Mount Sinai Hospital. They are injecting Big Data power into their daily functioning to forecast potential diseases and reduce the
number of hospital visits (Mount Sinai Hospital, 2014).
Paradoxically enough, such an issue may also arise in environments with a strong track record of following data sets. A data-rich environment can namely lead to an informational flow or data structure that requires even more means than are already at disposal. In those cases, it is advised to
evaluate if focusing on sub-segments of information (rather than on the full set) can bring sufficient catalytic value to the management in support
of the envisioned changes or decisions. The inclusion of other segments can then be planned for an eventual later phase.
On that level, reality has shown that improvement initiatives involving potentially overwhelming amounts of data have encountered big successes as well as significant failures.
During the last ten years, several European governments launched ambitious data projects, mostly with the goal of optimizing their internal functioning and improving their service to the citizens (Bové, 2013). One of the means was to have groups of data mapped and interconnected, to obtain more valuable information for courts of justice, tax departments or financial collection divisions.
Quite illustrative are the attempts made by the Belgian federal government since the early 2000s. Inspired by success in the Netherlands and driven by encouraging results of the ‘eHealth’ platform of the Belgian Ministry of Health, the plan was to realize a similar initiative for the Ministry of Justice: building a web portal able to deliver efficient services and data exchange, fueled (amongst others) by cross-combined data sources and strong analytics. During the first ten years, cluster projects with inspired names like Phoenix and Cheops turned out to be vast disasters (Bové, 2014; Vanleemputten, 2013), partly due to the high ambitions and the considerable complexity. Based on these lessons, the scope was reviewed and a new project, JustX, was started in 2014 (Peumans, 2014). It aims at bringing all data dynamics up to speed in the near future.
Overall, it is thus necessary to make an aggregated evaluation of the available means and to scale them in accordance with the organization’s ambitions. If assessed correctly and put together with the other pieces of the Big Data ecosystem, one will have a good view of the type of project that is actually at stake. The understanding of this big picture will be valuable throughout the subsequent steps of the project lifecycle.
Managing Projects, the Before and After Included
The goal of increasing the use of Big Data in public services is to develop or refine different types of information that contribute to operational efficiency. This dynamic turns fractional information into multidimensional results that make it possible to grasp a bigger picture, analyze more precise tendencies, and drive the strategic decision-making process.
As indicated earlier, the level of contribution is preconditioned by the components of the desired Big Data ecosystem: the nature of the data that
will be analyzed, the quality of the delivered information, and the supporting means.
It depends, however, at least as much – if not more – on a clear vision and strategy on the way forward. It is this vision and strategy that will
create the fit between analytical ambition, technological power and organizational reality.
In essence, five major steps should be followed to ensure that the data dynamic will be embraced and bring the necessary value: Preparing the
mindset, Determining the Business Case, Planning & Realizing the flow, Analyzing & Steering, and Guiding the Change (see Figure 4).
Preparing the Mindset
Organizations are led by people who either take decisions based on the analytic power of their tools, or based on experience and (interaction
enriched) intuition. In both cases, it will take time to change these habits. It is therefore important that the sponsor(s) or the initiator(s) of the
project prepare early the mindset of the future data powered decision takers, ideally already at the Ideation stage.
Decision takers need to understand what Big Data is really all about. They need to understand that it is not solely about adding a sexy App on the
governmental servers and gaining extra insights by pushing a button. It is about setting in motion an organizational transformation journey in a
structured way, based on revisited analytics through technical improvements on different levels, leading potentially to insights that could trigger
real game changers in the envisioned field(s) of activity (Van Driessche, 2014).
This is possible - even before really kicking off the project and its larger change management process - by focusing on correct understanding,
preparing for dedication and ensuring enriched thinking.
Firstly, empowering decision takers to support the project needs to be done with a sense of realism. They need to understand that success stories
from other countries cannot necessarily be transposed as such. Transportation projects like LIVE Singapore!17 in Singapore or Traffic Lab in Helsinki (Heikkilä, 2014) will not have the same influence in New York, Brussels or Mumbai. The education-focused hackathons in Boston18 may be inspiring, but the outcome will not be bluntly applicable in Hong Kong, Buenos Aires or Johannesburg.
Contextual caution is needed even within the same geographical sphere. Projects in, for example, healthcare are not necessarily transposable to education or transport, and vice versa, due to diverging realities. Even within the same geographical and operational area, one should take time to regard reference examples with sufficient caution, as the intrinsic maturity or the economic ecosystem may be incomparable.
In addition to realism on the management level, it is essential to have managerial dedication. Regular communication and collaboration with the different concerned parties is therefore key when bending the data into the correct shape.
In the case of greenfield projects, for instance, project sponsors can be very enthusiastic, because of the high exposure and/or the ‘innovative’ nature of the project. This type of project will therefore require close follow-up to ensure that the direction taken fits the (wild) plans of the fascinated sponsor. More traditional brownfield projects, on the other hand, might require more steering, due to the more mature context and the possible existence of more ambitious initiatives.
Thirdly, in preparing the mindset, awareness should be created that Big Data projects go beyond the mere provision of traditional data. The Big Data reality can offer much richer information than the existing view that some hospitals, governmental departments or educational working groups have. Managers should be aware of, and support, the fact that Big Data can – and should – lead to a different way of thinking (see Figure 5).
Figure 5. Preparing the mindset starts with a clear explanation
to the stakeholders of what Big Data really is about, goes
through the active involvement of senior members in the
project support, and in looking to unveil the Big Data potential
through a new way of thinking
The earlier mentioned UNOS activities, for instance, do not only focus on improving the organ donor’s waiting list with new names. The work is
done with a focus on creating more detailed ways to define the fit between donors and receivers. These different angles may lead to exploring new
data paths that, if intelligently combined, lead to a more powerful granularity, and, in the end, improve the fit.
Likewise, Helsinki does not only want to have a better traffic management with Traffic Lab (Heikkilä, 2014). It wants to create a new way of
living. This will be enabled by a vast source of data. But some of these data are triggered by intermediate changes in behavior that the town wants
to initiate amongst its citizens.
Note that the need to prepare the mindset (by building on a realistic mindset, ensuring focus on communication and collaboration, and making
sure that the high potential of big data is really understood) can also be true for the non-managerial stakeholders, like the technical teams. People
often look at a new problem through the frame of reference of previous experiences or old problems (Janssens, 2002). To unveil the full potential, it is however important to also be sufficiently open-minded on the technical level. This can mean pushing for trying and exploring new technologies (Mac Gregor, 2014; Van Driessche, 2014), new architectures or new technical catalysts that will stimulate the novel human way of thinking.
Overall, the expectations of the project team should thus be aligned on the fact that Big Data are not only about ‘plug and play’ or brushing up the existing. Going for Big added value requires Big implication from the very beginning by everyone, and should embrace the true potential of Big Data: going beyond the current frames of reference of the organization, with a sense of realism.
Determining the Business Case
For all endeavors, having a clear view on the final goal is a precondition to gauge success. This is also valid for Big Data projects. Generally
speaking, it is advised to prepare beforehand a definition of the expected benefits. This will be a valuable backbone for the preparation of the
project and the evaluation of the completed outcome.
To do so, the managers and senior users who will be using the data need to define the value in regard to their targets. From a methodological point of view, Kobielus (2014) advises paying attention to the 4 Vs of Big Data: Volume, Velocity, Variety and Veracity.
Although not excluding the latter, the value definition can be done from a complementary, more pragmatic angle with a Business Case. This includes a reflection on the reasons and objectives of the project, having an aligned view on the stakeholders and impacted users, as well as evaluating the expected return, the payback period and the exact scope. It is advised to align the budgetary dimension well with the reality of the concerned public body and the policy in place.
Classically, important indicators are also evaluated. This will take relative expectations into account for the project delivery in terms of time, cost,
scope and quality; and assess the project environment on risks, dependencies, constraints and assumptions (see Table 1).
Table 1. Major due diligence elements for writing the Business Case or related deliverables
Attention Points: 7. Timeline; 8. Project stakeholders
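One pragmatic way to keep such a due diligence consistent across initiatives is to capture these attention points in a small structured checklist. The sketch below (in Python) shows one possible representation; the fields shown are illustrative examples and not an exhaustive reconstruction of the table.

# Illustrative business case checklist; the fields are example attention points only.
from dataclasses import dataclass, field

@dataclass
class BusinessCaseCheck:
    objective: str
    expected_return: str
    payback_period_months: int
    timeline: str
    stakeholders: list = field(default_factory=list)
    risks: list = field(default_factory=list)
    assumptions: list = field(default_factory=list)

    def open_points(self):
        # Flag the dimensions still left empty before the case is approved.
        return [name for name, value in vars(self).items() if not value]

case = BusinessCaseCheck(
    objective="Reduce fraud detection time",
    expected_return="",                              # still to be quantified
    payback_period_months=24,
    timeline="Q1 kick-off, 12-month delivery",
)
print("Still to be completed:", case.open_points())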
To ease this reflection, it is worth having a closer look at how the project could contribute to the organization’s objectives.
A public bank or insurance company, for instance, could be interested in reproducing a project that has been reported to provide significant
results in lowering the fraud detection time for a specific private institution19, or could ‘simply’ try to reproduce the Big Data Leadership strategy taken
by a competing private bank (ING, 2015).
If the institution reviews the above-mentioned points before kicking off a Big Data initiative, it could come to the conclusion that it is not the right moment to start the project, because the government wants to significantly cut public expenses for the financial sector, or prefers to allocate the budget to more down-to-earth initiatives that differentiate them from private actors.
Similarly, such a due diligence could show that most aspects can be managed, but that the perception certain stakeholders have of Big Data
requires a significant review of the timing or the scope. On the other hand, the Business Case could also conclude that the envisioned project can
and should be started quickly, to avoid some strong constraints linked to a possible decisional status quo resulting from upcoming elections or
evolving international liquidity regulations.
Note that it can happen that people expect to estimate the return for the business case only at completion of the data project. The rationale behind this approach is that key stakeholders can have a strong gut feeling that improved analytical power on certain groups of data could lead to interesting information, which, if confirmed, would then open the door to reflection paths for actions and the associated value estimates. If this situation takes place in an organization with limited Big Data experience, the project will then have to be sliced into a smaller pre-project to prepare the data, and a second ‘full-fledged’ project that focuses on the real value dimension. Environments with a track record should be able to evaluate the potential value and build the business case based on prior experience.
If the business case shows that the success of the project requires some key decision takers to feel first more comfortable with the Big Data
dynamic at large, it might be useful to set up a best practice sharing or experience sharing with other organizations or departments. This will help
to benefit from lessons learned with a critical eye.
It is, for example, imaginable that major cities would want to anticipate city-focused data initiatives by first gaining insight into the road already travelled by similar projects like the MODA project, the London Data Store20, or the Brussels Smartcity21 platform.
If the goal of the initiative is to come to a compatibility level with a sister organization, experience sharing can have a beneficial influence on the
set up of the Big Data initiative, and on the shaping of foundations for future collaboration.
Overall, one should be cognizant in every context of the fact that obtaining an agreement on the business case means that a balanced choice will have to be made. The potential decision power will be directly impacted by the level of allocated means – or the lack thereof.
Planning and Realizing the Big Data Inflow
With the expectations clear and the business case agreed upon, it is very tempting to start the technical realization directly: working on the crunching core, adapting the external visualization layer, or – as Big Data projects without eye candy rarely offer a lasting outcome – finding an optimal combination of both. Still, it is essential to be aware that the real level of complexity (and the speed to start) does not only depend on the technical choices, but also on the way the work is managed and organized.
ETL projects involving limited data sources, for instance, are technically speaking less complex than projects that also involve the creation of new databases, the cleansing of existing sources, the development of new algorithms or the preparation of hardware upgrades (Willinger & Gradl, 2008). But if there is no clarity on matters like project ownership, deadlines and risks, even traditional ETL projects can become a challenge.
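As a minimal, hypothetical illustration of what such a limited-scope ETL flow can look like, the sketch below (in Python, using only the standard library) extracts records from a CSV export, applies one simple cleansing rule and loads the result into a local SQLite table; the source file, the cleansing rule and the target table are all assumptions.

# Minimal ETL sketch: extract from a CSV source, apply a simple cleansing rule,
# and load the result into a local SQLite table. All names are illustrative assumptions.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    cleaned = []
    for r in rows:
        amount = r.get("amount", "").strip()
        if not amount:                         # cleansing rule: drop rows with missing amounts
            continue
        cleaned.append((r["id"], float(amount), r["date"]))
    return cleaned

def load(records, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS payments (id TEXT, amount REAL, date TEXT)")
    con.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("payments_export.csv")))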
For the technical side of the distillation of the essence of the data, the author refers the reader to dedicated chapters and additional reading.
In the ideation or project initiation stage, it is essential to develop a structured project plan. It will be the base for coming to a shared view and understanding of all the elements necessary to attain the goal: the project timeline, the resources and the budget, the work breakdown and the associated schedule, plans for stakeholder management and communication, project governance and the like.
In addition, a good preparation needs to result in a transparent view on the as-is and a blueprint of the to-be situation. This is especially true for
Big Data projects, as they are prone to open contexts with dynamic requirements gathering. This preparation includes detailed technical
inventories, extended data mapping and qualitative technical specifications. One will also want to document clearly the impact of the data change
on the decisional process, as well as the future data ownership (Madrid, 2009; Merz, Hügens, & Blum, 2015).
During the realization itself, it is essential to keep the stakeholder(s) closely involved, and to foresee sufficient managerial feedback on the technical work. If the preparation is done too much behind closed doors, there is a real risk that the outcome will not fit the expectations, or that key decision makers will lose interest and support (see Figure 6).
Figure 6. Projects can be managed through waterfall phases,
or as work packages in scrum sprints. Both approaches require
regular follow up and feedback loops.
The feedback and follow-up loops are particularly important. The number of loops depends on the vastness of the scope and the adopted project
approach. For details on waterfall methodologies or agile scrum approaches, the author refers to dedicated literature22. The different project sizes
can however already provide a good indication.
For small projects or projects intended to bring Big Data experience quickly, it may be preferable to work with sprints, as they tend to deliver swift results compared to a traditional methodology with more formal periodic reviews. This being said, the structural fluidity one might associate with scrum and other agile methodologies requires a significant level of maturity and discipline from the key actors, and an unconditional availability of decision takers for rapid feedback. A balance therefore has to be struck during the set-up between the pros and cons of each methodology.
Besides the fact that the quantity of loops depends on the project approach, it is essential that they aim at quality. The real goal of these loops
goes beyond trying to obtain some rubber stamping. In the interest of the organization - and the people involved - these feedback rounds should
be regarded as stage gates, offering incremental quality to each step of the process foreseen in the overall planning, until the completion of the
project.
Analyzing. Using. Staying Critical.
Depending on whether the goal of the Big Data project is to have straightforward and clear ‘final’ data, or rather an outcome that will serve as a base for different interpretative scenarios, Big Data projects can – and mostly will – require additional analysis and interpretation of the information.
But in the end, the data will be the feeding influx for action focused initiatives.
From an organizational and managerial point of view, the delivery of new tangible information is therefore regarded as the most exciting stage,
especially in the expectation of previously unthinkable insights. The excitement after the long preparation journey is however also exactly the
reason why this stage requires a critical mindset and a sound quality review of the delivered results.
To start the quality review, the easiest is to re-assess the main expectations defined in the business case in terms of time, cost, scope and quality.
This will offer a relatively good view on the project execution.
If there is a real desire to go beyond a ‘one-off’ initiative, one should also evaluate the contribution of the project to the overall Big Data dynamic.
To do so, the adaptability and reusability of the solution are relatively good indicators, as well as the organizational acceptance and the actualized
payback period.
Combined with the latter, one might also want to include a review of the total cost of ownership over 3 to 5 years, as this provides an additional perspective over time, compared to the ‘basic’ review of the project actuals.
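To make the difference concrete, the short calculation below (in Python, with invented figures) contrasts the payback period with a 5-year total cost of ownership view; both the amounts and the cost categories are assumptions, not a prescribed model.

# Illustrative figures only: payback period versus a 5-year total cost of ownership view.
initial_investment = 400_000      # build cost of the project (example value)
annual_benefit = 150_000          # estimated yearly value delivered
annual_running_cost = 60_000      # licences, hosting, support, training

payback_years = initial_investment / (annual_benefit - annual_running_cost)

years = 5
tco_5y = initial_investment + years * annual_running_cost
net_value_5y = years * annual_benefit - tco_5y

print(f"Payback period: {payback_years:.1f} years")
print(f"5-year TCO: {tco_5y:,}  |  5-year net value: {net_value_5y:,}")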
The adaptability of the solution can be evaluated in terms of the use of open formats, the respect of governance defined standards, or the ease of
use of the front end layer.
The organizational acceptance can be evaluated through interviews and questionnaires of senior users and key stakeholders. This should be part
of more general project work, focusing on the change management guidance. This will be discussed more in detail in subsequent sections.
Note that a complete assessment report does not close the door to differentiated managerial decisions. If a big bang project delivered only a part
of the expected results, for example, one can still decide to be positively satisfied, because the project provided a solid experience, a positive
change climate or a specific technology base for further initiatives.
What really matters in the end is that criteria are set in the beginning, and evaluated at the end. What comes afterwards is a matter of managerial
decision taking.
Guiding the Change
Once the mindsets are aligned, the needs clear, the realization work done and the outcome used in an intelligent way, one could expect that the
organization is in motion and the circle therefore closed. In reality, this is the phase of the project requiring increased attention to have the ‘technical’ leap forward accepted and handled correctly by everyone. In other words: making sure that the changes are adopted and accepted as
evolution, in preparation of an eventual revolution.
Although organizational change management actions should be initiated from the very beginning of the project, this dimension is often overlooked, or at least underestimated, as Big Data projects tend to be regarded as technical – or reporting-focused at best.
This is quite contradictory to the spirit of Big Data. As explained earlier, tools offering power through advanced analytics are not a goal in themselves; they are a means to more.
The only exceptions are projects that focus solely on the intermediate reinforcement or increased automation of existing data mining tools. Such initiatives will not necessarily change the span or depth of the decisional process, but ‘only’ speed up the existing situation. In such contexts, the need for solid change management is less pronounced, and will mainly consist of ensuring that the people collecting the data in the existing situation correctly understand the functioning of the new data feed. Any calculation- or usage-related assumptions made to build the feed need to be documented and shared, to avoid the (wrong) perception of biased data. In complement, sufficient time should be taken to ensure that the concerned people trust the data (as they are created without ‘control’ of the concerned person).
Such cases aside, the outcome of Big Data projects is thus a means to more. As these means mostly need to be used by people, organizational acceptance and guided change of the results of the project are a conditio sine qua non to reap the fruits of the increased decisional intelligence.
A noteworthy example in the public sphere is the journey of the non-profit corporation inBloom. Founded in 2012 by a group of American
educators, and co-funded by the Gates and Carnegie foundations, the company was active in the educational field. Its goal was the creation
of an open source computer system for schools, to offer structured and efficient management of student-related data. The deeper
purpose was to provide insights into potential improvements for learning and teaching.
In mid-2014, inBloom announced it had to stop its activities, after strong protests from parents and privacy lawyers. One of the main reasons this
project failed was thus not IT or technology related, but human: the schools were not able to combine their role of ‘customer’ of
inBloom with their role of change agent towards parents and the like.
Streichenberger (2014), the group's CEO, phrased it as follows in his closing letter:
It is a shame that the progress of this important innovation has been stalled because of generalized public concerns about data misuse, even
though inBloom has world-class security and privacy protections that have raised the bar for school districts and the industry as a whole. (…)
We stepped up to the occasion and supported our partners with passion, but we have realized that this concept is still new, and building public
acceptance for the solution will require more time and resources than anyone could have anticipated.
Generally speaking, organizations or public bodies that have the ambition to launch a Big Data project with a large impact will have to follow a
structured approach. For change management matters, this can be summarized in a gradual process (see Figure 7).
The success and efficiency of the change management guidance depends on different factors.
First of all, change management actions have to be taken up early in the process. Ideally, this is already clearly highlighted and planned at the
beginning of the project, and translated into a clear change management strategy. This strategy will then be the starting point for a change
management plan, for subsequent communications, and for training plans. In small projects, this preparatory responsibility is taken up by the
project manager or owners of the concerned processes. For large projects, it is advised to have a dedicated change manager or change
management team.
Secondly, guiding change requires a down-to-earth sense of realism and vision, as transformational matters implying human change usually
meet resistance. Smoothing the road to acceptance is traditionally done through information and guidance, complemented by
training sessions and follow-up. The final goal of these initiatives is making sure that the people concerned fully accept and understand the new
tools, use them correctly, and, in the end, really own them.
The target ‘audience’ of these initiatives can be relatively large, as it concerns the different groups of internal stakeholders that will contribute to
the success. This includes three major groups. Firstly, there are the people offering the necessary senior support and structural (and/or political)
oxygen to make the project work. They need to understand the purpose of the project and need to be prepared to play their role of internal change
agent towards the different levels of the organization.
A second group consists of the people realizing the Big Data flow. They are less impacted by the final change as such. But to design the technical
layers appropriately, they need to be given a view on the way the data will be used.
Last but not least, the change initiatives focus on the end users of the information. In some cases, these end users are analysts working with the
data. It can also include their management, who will (have to) base their decisions on the analyst’s reports.
When preparing and guiding the Change Management process, change agents need to take into account the impact of their words and their
actions. As communicators they need to pay additional attention to the fact that their audience may have a totally different understanding due to
their specific background (Pietrucha, 2014) or their organizational culture. It would be regrettable to leave this unnoticed, as having powerful
analytics is good, but worth nothing if it is translated or used incorrectly due to a skewed understanding.
The organizational culture is strongly influenced by the tendencies and macro-evolutions in the field of activity, possibly by specific factors that
relate to the identity of certain departments, and by the cultural habits of the region or country.
Initiating a Big Data project in the financial department of the (government-owned) oil and gas company Statoil, for instance, requires another
mindset and way of working than data initiatives for the Brazilian Ministry of Finance. Likewise, optimizing the dense networks of public
transportation in the Czech Republic through advanced analytics will take into account totally different cultural codes than developing a similar
initiative in the eastern states of Australia. Specialized literature can help in depicting these specific geographical influencing factors (Kogut, 2012; Moll,
2012).
Including differences in perception and reaction in the change process can even be needed on an intra-organisational scale. Some departments
can, for instance, be more open to data sharing than others, for human or historical reasons.
Different tools and techniques are available to grasp the essential elements of the overall culture (Cameron & Quinn, 2011; Campens, 2011; HBR,
Kotter, Kim, & Mauborgne, 2011). The author will focus on the distillation of three dimensions: the Climate for Change, the Mindset of the
Stakeholders, and the Impact of the Project. All these elements are traditionally assessed through interviews of stakeholders at different levels. A
more detailed explanation of these techniques can be found in the dedicated section.
Change Management Tools
Periodically, research and articles evaluate the main reasons for failure of major IT transformation projects. Unsurprisingly, they highlight the
lack of proper management, the absence of real leadership, and underestimated planning or cost estimates. At the same time, attention is also drawn
to the importance of good understanding, communication, and change management.
As early as 2003, Gartner - the research and advisory firm - expressed the trend as follows (Young, 2003):
Collectively and individually, human beings respond to change in predictable ways. This predictability lends itself to a fairly standard, simple
set of change tactics that can build systematic support for change initiatives and radically reduce their risks.
Research consistently demonstrates that initiatives, investments and enterprise responses requiring high levels of organizational compliance
or agility fail more than 90 percent of the time, and that the drivers of failure are not found in the nature of the change decision itself, but in
how it’s implemented – that is, leadership’s failure to recognize and manage the magnitude of the change and its effects on those who must
adapt.
More recently, IBM estimated that only 3% of project failures are attributed to technical challenges (Gulla, 2012). In 2013, Gartner estimated that
more than 50% of the projects focusing on data analytics fail because they are not completed in time or on budget, or because they fail to deliver
the benefits that are agreed upon at the start of the project (Van der Meulen & Rivera, 2013).
One of the essential ingredients of successful data projects is thus the anticipation and preparation of the organization for the upcoming change.
Different tools and techniques exist to facilitate this process. Most of them focus on building the bigger picture through the evaluation of three
fundamental drivers: the Climate for Change, the Mindset of the Stakeholders, and the Impact of the Project. Each of them is traditionally built
through interviews or questionnaires of stakeholders.
The evaluation of the Change Management Climate gives an indication of the recent exposure to change, based on the past track record. The
track record is influenced by elements like the quality of the internal communication, the way resistance to change is usually managed, priority-
setting habits, the presence of change agents, the extent of any politicized environment, and the like. The results give a strong indication of
the organizational zones requiring additional attention. This can range from a clear need for consistent training for each project, to strong efforts
needed at the management level in priority setting, or consistency in the decision-taking process.
The Stakeholder assessment evaluates the willingness to change versus the capability to do so for a series of individual stakeholders. Usually,
several persons are interviewed in each group of stakeholders. This helps in identifying potential change agents capable of mentoring their
colleagues or bringing significant support to the initiative. Conversely, the assessment might shed light on the departments that are most resistant to
change; they will require specific change management attention. Note that the outcome can be aggregated per organizational unit to understand
differences per site or department.
The Project Impact assessment evaluates who will feel the biggest impact. The impact can be seen from different angles. It can come from having
to work in a different way with data. It can also relate to changes that the Big Data outcome may trigger amongst the management, for example in
contexts where the (conclusions on the) data will be the enablers for internal improvements or external collaborations.
Contrary to the evaluations of the Change Climate and the Stakeholders, which are plotted on two-dimensional axes, the outcome of the Project
Impact assessment is translated into a web diagram. In the case of a pentagram, the five angles can be Organization, Governance, Culture,
Technology and Business Processes.
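The web diagram itself can be produced with standard plotting tools. The snippet below is a minimal sketch of such a pentagram using matplotlib, with purely illustrative impact scores on an assumed 1-to-5 scale.

```python
# Minimal sketch (illustrative scores): plotting a Project Impact assessment as a
# web diagram with the five pentagram angles mentioned in the text.
import numpy as np
import matplotlib.pyplot as plt

dimensions = ["Organization", "Governance", "Culture", "Technology", "Business Processes"]
scores = [3.5, 2.0, 2.5, 3.0, 4.5]  # hypothetical impact scores on a 1-5 scale

angles = np.linspace(0, 2 * np.pi, len(dimensions), endpoint=False).tolist()
# Close the polygon by repeating the first point.
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dimensions)
ax.set_ylim(0, 5)
ax.set_title("Project Impact assessment (illustrative)")
plt.show()
```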
In Figure 8, the Change diagnostics have been done for an organization desiring to put in place a joint data project for its different departments.
The Change climate has been evaluated by the Executive Director of the organization through an assessment of 50 statements, scored on
a scale from 1 (strongly agree) to 5 (strongly disagree). The answers to the questions result automatically in an update of the graph.
Figure 8. Change Management climate, Stakeholder assessment, Project Impact assessment
The Stakeholder assessment has been obtained through a (different) self-assessment, completed by the senior managers of each division,
each resulting in a divisional image. The aggregated picture shows the final graph. The Project Impact assessment is based on the feedback of the
Executive Director.
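A minimal sketch of how a statement-based climate questionnaire such as the one described above could be condensed into a single climate indication follows; the eight sample answers and the banding thresholds are assumptions for illustration only.

```python
# Minimal sketch (hypothetical responses): condensing a Likert-style climate
# questionnaire (1 = strongly agree ... 5 = strongly disagree) into one indication.
from statistics import mean

responses = [2, 3, 4, 2, 5, 3, 1, 4]  # in practice, one answer per statement (e.g. 50 of them)

score = mean(responses)
if score <= 2.0:
    climate = "favorable climate for change"
elif score <= 3.5:
    climate = "mixed climate, targeted attention needed"
else:
    climate = "limited exposure/track record, strong guidance required"

print(f"average score: {score:.2f} -> {climate}")
```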
The combined change diagnostic illustrates a project in an environment with limited exposure to change and a limited change track record;
moderately positive stakeholder support; and the highest impact of the project expected on business processes and organizational fluidity.
To take this change management climate into account, one will have to capture the outcome of previous projects and make selective use of
good practices or recognized experts, focus on quick wins to reassure and involve users, and put extra focus on communication, business
involvement and transparent reporting.
The stakeholder assessment illustrates that the project will take place in an environment that is favorable to change management, but with a
strong need to guide the change management awareness that is latently present. Based on the aggregated assessment of the stakeholders, part
of the department appears to be composed of people able to support the initiative from a change management angle, if they receive some
additional support. In their position of change agent, they appear well placed to promote the project to key users or, at least, to have a positive
influence on the local promotion of change.
As the project is likely to impact business processes and organizational fluidity the most, one will have to dedicate extra attention to these
aspects – or to the expectations stakeholders have on this level. On the other hand, it is an indication that the stakeholders have their feet well on
the ground concerning the possibilities of Big Data. This will make it easier to embed the technological changes in a fitting
environment.
For more detailed guidelines on Change Management in IT transformation projects and specific information on (sub)organizational models, the
author advises the reader to consult dedicated literature.
Taking into Account the Portfolio
‘Early data projects’ tend to focus on obtaining quick wins, on fueling a very specific set of decisions, or on gaining experience. Subsequent waves
of similar initiatives will use the outcome of these first projects to nourish larger ones. Big Data analysis is thus rarely a one-shot action. It is
rather the start of a broader process, intended to deliver products or services with higher value to customers and society.
At the same time, these projects are also embedded in a larger reality, where a project runs in parallel with the daily activities or with other
projects. To make this portfolio reality function efficiently (including the earlier mentioned evaluation of resources and means), the management
needs to set a clear guiding thread.
Senior leaders should prepare a time-phased roadmap upfront to give contextual meaning to the portfolio reality. The roadmap should indicate
what sets of information are used now, what information is being reworked through other projects, and what sets of information will be included,
and when, in future initiatives. This will offer collaborators a clear sense of direction and reality. In addition, it will reassure people who might
otherwise be disappointed by the limited scope of early Big Data projects.
In parallel with thinking holistically and setting a clear direction, a portfolio requires a gradual preparation of the cumulated technical and
organizational change that will come out of the portfolio of Big Data initiative(s). With the help of the Enterprise Architect and the IT Architects,
reflections need to take place on the evolution in data and application architecture, and what it means for other projects and services.
One of the important discussions in that context will be about the impact on the use of existing strategic analytical tools, like Business
Intelligence applications based on ERP systems (SAP, Cognos, Oracle, Odoo and the like) (see Figure 9).
Choices might have to be made, for example, in a context where Big Data and Business Intelligence are used in the same perimeter. Big Data and
Business Intelligence both have their own pros and cons. Business Intelligence through ERPs can be much faster and much more precise than Big
Data powered reports. On the other hand, Big Data is much more powerful for drawing tendencies out of huge volumes of data. In other words:
despite a valid series of potentially overlapping tasks, both can and should be used for their own strengths.
As there is no general preference of Big Data vis-à-vis Business Intelligence (Breugelmans, 2014), ERP data will thus be impacted to a larger or a
lesser extent, depending on the span and the scope of the Big Data project(s). The success of this symbiosis (or cohabitation) implies therefore a
preliminary review of the impacts of the data related project portfolio on existing processes.
In such a review, a non-ambiguous understanding is needed of all the assumptions made. Ideally, these assumptions should be documented in the
business case of each project.
It could, for example, be decided to implement SAP HANA to ‘improve the analytical power’ of the organization. The understanding of this power
could diverge, as HANA is driven by three core components: parallel, in-memory relational query techniques; columnar stores; and improved
compression technology (Plattner & Zeier, 2012). Some will opt for it because of the possibilities the high compression rates offer, others for the
speed of data retrieval, or for the fact that it can process queries via HANA and other Hadoop-like tools. As the reasons for choosing this product
may differ, the realization might differ. Consequently, the impacts may differ as well.
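To illustrate why the choice between these components matters, the sketch below shows, in a purely conceptual way that does not reflect HANA's actual internals, how a columnar layout groups repeated values and therefore combines well with a simple compression scheme such as run-length encoding.

```python
# Minimal sketch (conceptual only, not HANA's actual implementation): why a
# column-oriented layout compresses well with run-length encoding (RLE).

rows = [  # row-oriented layout: one record per row (made-up records)
    ("2015-01-02", "Brussels", "paid"),
    ("2015-01-02", "Brussels", "paid"),
    ("2015-01-02", "Antwerp",  "open"),
    ("2015-01-03", "Antwerp",  "open"),
]

# Column-oriented layout: one list per attribute, which groups repeated values together.
columns = {name: [row[i] for row in rows] for i, name in enumerate(["date", "city", "status"])}

def run_length_encode(values):
    """Compress consecutive duplicates into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

for name, values in columns.items():
    print(name, run_length_encode(values))
# date   -> [('2015-01-02', 3), ('2015-01-03', 1)]
# city   -> [('Brussels', 2), ('Antwerp', 2)]
# status -> [('paid', 2), ('open', 2)]
```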
In addition to doing a sound transversal portfolio review, it is essential to create a climate in which it is consistently determined what source of
information will be used as the single source of truth for what type of decision. This is especially useful in the case of applications with (temporarily)
overlapping data. Confusion could, for instance, occur when variables with similar names are used but with a slightly different meaning, or when
two applications use exactly the same variable but with a synchronous/asynchronous delay. An additional advantage of such a climate is that it
will limit the tendency to drift back to Excel islands, and reinforce the focus on shared operational dashboards.
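One lightweight way to make such an agreement explicit is to record it in a machine-readable mapping; the sketch below assumes hypothetical decision types and system names.

```python
# Minimal sketch (hypothetical decision types and systems): making the agreed
# 'single source of truth' per decision type explicit and queryable.

SOURCE_OF_TRUTH = {
    "financial reporting": "ERP / Business Intelligence layer",
    "customer churn analysis": "Big Data platform",
    "operational dashboards": "Big Data platform",
    "regulatory filings": "ERP / Business Intelligence layer",
}

def source_for(decision_type):
    """Return the agreed authoritative source, or flag the gap explicitly."""
    try:
        return SOURCE_OF_TRUTH[decision_type]
    except KeyError:
        return "undefined - escalate to data governance before using any source"

print(source_for("customer churn analysis"))
print(source_for("pricing decisions"))
```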
Hence, a sound preparation of Big Data initiatives requires that one takes the entire portfolio of activities, projects and data tools into account.
The portfolio view and time-based roadmap will allow everyone to see to what extent they will be impacted by the data crunching applications as
such, by the evolving way of thinking, or by the shifting organizational approach. If, in addition, clarity is safeguarded on the different sources of
truth, the chances are real that the fruits of this preparation will be used at their true value for all elements of the portfolio.
FUTURE RESEARCH DIRECTIONS
Big Data is in the midst of a period where the technical possibilities and the number of implementations are experiencing a continuous growth.
According to outlooks of Gartner and IBM (Gulla, 2012; Van der Meulen & Rivera, 2013), this tendency will continue during the years to come.
Given this constant inflow of new projects, it is important to bring Big Data initiatives to a higher maturity level. One of the possibilities is to
capitalize on past experiences by integrating not only the lessons learned and incremental quality gains on the technical level, but also those that
take the organizational and human dimensions into account.
Currently, the consistent inclusion of a structural approach that reconciles the technological potential with the contextual reality is still going
through a crystallization process, despite the fact that elaborate project management methodologies exist.
From the methodological point of view, the project management toolbox has to be further enriched with specific best practices. This best practice
dynamic will have to dedicate specific attention to Change and Stakeholder Management during Big Data projects, and return on experience after
the Go-Live. This will not decrease the important structuring role of project management frameworks. On the contrary, it will make it easier for
the continuously growing number of Big Data projects to truly embrace the project methodologies and apply them appropriately.
Similarly, it would be valuable to build a best practice base for specific sub-sectors of the public sphere. By building on lessons learned, it is
imaginable to strengthen data projects in key domains like education, health or city management. Indirectly, this experience base could even
stimulate future improvements of the general project management frameworks.
Obviously, best practices will require regular revisiting as the Big Data world is in constant evolution. But the author believes that the recurring
efforts will be largely compensated by the advantages in quality and outcome of the data projects.
Lastly, future research should explore ways to use the potential of the public-private force field. Using private experiences to improve the public
reality is indeed only a part of the Big Data journey. The real satisfaction and added value will come when public organizations are able to think
ahead of their private counterparts. To do so, a management approach needs to be developed that brings the current inspirational relation to a
higher, more catalytic level, in combination with the structured project follow-up.
CONCLUSION
Three fundamental aspects need to be taken into account in the management of projects that intend to unveil the true potential of Big Data in the
public sector, especially if they have the ambition to nourish the organization with new insights.
Firstly, it is essential to shape the building blocks of the Big Data ecosystem from the very beginning. To do so, one needs to define from the
decisional angle what objective the data will have to support. This can be Internal improvement, Innovation towards external stakeholders, the
Development of integrated solutions or, possibly, Building collaborations with public or private actors. In addition, one has to decide whether the
idea is to go for a small project or a larger Big-Bang. It has been explained that an important factor in choosing between the two is the extent to
which the organization wants to focus on experimentation and the development of human and technical experience.
With these aspects in mind, it will have to be evaluated what types of data are desired, in addition to existing ones. Public bodies will also want to
pay additional attention to the integrity and security of these data.
To complete the base outline, an evaluation is needed of the means that are available on the human, technical and financial level. They need to be
scaled in accordance with the organization’s ambitions. Ideally, this scaling should go slightly beyond the actual ambitions, in order to be prepared
for strong demand resulting from the project’s success.
If assessed correctly, the combined configuration will offer a good view of the type of project that is actually at stake. This improved view and
understanding will be valuable in the ‘real’ preparation, the shaping and the management of the project throughout the entire lifecycle.
A second essential element in managing a Big Data project is being prepared to go beyond the follow-up of the technical creativity with Hadoop,
HANA and the like. Project management should not only encompass the different degrees of freedom of the project or the need to follow the
progress closely. Big Data projects are about developing new possibilities that can bring public organizations on par with the services offered by
private counterparts – or even ahead of them.
This is done by changing the frames of reference and setting the foundations of a new way of thinking. To ensure that this added value gets
captured efficiently, Big Data changes need to be embedded in a larger process. Effective management therefore starts already at the ideation of
the project, with the preparation of the mindset of the organizational sponsors. In combination with the clear definition of the business case, it
should be evaluated whether all elements of the public context feel right to launch the Big Data project in the envisioned circumstances. Once this
indicator is green, managing will require further follow-up through solid planning, iterative follow-up loops with the different stakeholders and the
necessary change management until organizational acceptance after the project.
The change management tools at hand are diverse. It has been illustrated that a multidimensional understanding can be built by evaluating the
Stakeholders’ capability and willingness to change, the Climate for Change and the Project Impact. The resulting bigger picture will offer a
complementary indication of the change management reality. On the one hand, it will bring clear indications of the different aspects that require
specific attention. At the same time, one will also have a better view on the possibilities of the organization, and the eventual presence of active
change agents.
Last but not least, the management of Big Data projects needs to be done with a holistic view. The entire portfolio of activities, projects and data
tools needs to be assessed, prepared and aligned upfront. This will bring a synchronized, contextual and functional coherence to otherwise loose
pieces. Doing so will benefit the data landscape and the internal functioning, and therefore the final outcome.
In the end, data management, Big Data and traditional decision taking are bound to converge through a cross-fertilizing interaction, also for
public bodies. Organizations should therefore see Big Data projects as an opportunity to step with both feet into a new era of possibilities.
Managing them with a clear vision, a good knowledge of the day-to-day reality and a layered approach will contribute to a stable continuity
of existing services, while assuring the transition of the existing organization to future evolutions.
This work was previously published in Managing Big Data Integration in the Public Sector, edited by Anil Aggarwal, pages 107-136, copyright
year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Ala-Kurikka, S. (2010). Enel: Italy reaping first-mover benefits of smart meters. Retrieved May 30, 2015, from http://www.euractiv.com/italy-reaping-first-mover-benefits-smart-meters-enel
Baele, E., & Devreux, H. (2014, December). MOBIB: the card of the future. Eurotransport. Retrieved May 30, 2015, from
http://www.eurotransportmagazine.com/
Boone, R. (2013). Er is geen blauwdruk voor de informatisering van Justitie. Legal World. Retrieved July 7, 2014, from
http://www.legalworld.be/
Bové, L. (2012, February 24). Justitie prutst met miljoenenclaim over computers. De Tijd. Retrieved Aug 19, 2014, from http://www.tijd.be/
Bové, L. (2013, August 29). België scoort slecht in informatisering justitie. De Tijd. Retrieved July 7, 2014, from http://www.tijd.be/
Gulla, J. (2012, February). Seven Reasons IT Projects Fail. IBM Systems Magazine. Retrieved May 13, 2014, from http://www.ibmsystemsmag.com/
Orcutt, M. (2014). Hackers Are Homing in on Hospitals. Technology Review. Retrieved September 15, 2014, from http://www.technologyreview.com/news/530411/hackers-are-homing-in-on-hospitals/
Paskin, J. (2014). Sha Hwang, the Designer Hired to Make Obamacare a Beautiful Thing. Business Week. Retrieved May 14, 2014, from
http://www.businessweek.com/
Vanleemputten, P. (2013, August 29). Informatisation de la Justice: la Belgique parmi les plus mauvais élèves. Datanews. Retrieved May 20,
2015, from http://www.datanews.be/
Viaene, S. (2014). Zorg ervoor dat je technologie mee kan. Business Bytes. Retrieved May 15, 2015, from http://business.telenet.be/nl/artikel/zorg-ervoor-dat-je-technologie-mee-kan
KEY TERMS AND DEFINITIONS
Agile: Project management methodology, in which the development is characterized by the breakdown of tasks into short periods, with frequent
reassessment of work and plans. Used in software related projects.
Big-Bang: Term used in project management to identify a project in which most of the changes are operated at once, contrary to a phased
implementation.
Business Case: Cost benefit analysis used for the justification of a significant expenditure at the initiation of a project.
Change Management: Management of changes impacting an organization by enabling changes while ensuring minimal impact on the
organization or its stakeholders. Involves the definition and implementation of new values and behaviour in an organization, the management of
expectations, the building of consensus, and the management of organisational changes.
Lean(ification): Approach that focuses on the optimization of processes by the elimination of waste in terms of time and resources.
Non-Structured Data: Data coming from non-structured data sources. Can be encountered as free-text communications in payments, or
unmapped information from external data sources or social media.
Stakeholder Management: Management of all people who have an interest in an organization or project, or that could be impacted by its
activities, targets, resources or deliverables. This can include the management of customers, suppliers, vendors, shareholders, employees and
senior management. Closely related to change management.
Structured Data: Data coming from structured data sources. Can be encountered as structured payment data or transactional data in internal,
data model based ERP systems.
ENDNOTES
1 For more information on Prince2, the author refers the reader to: http://www.prince-officialsite.com/
2 For more information on PMBOK, the author refers the reader to: http://www.pmi.org/PMBOK-Guide-and-Standards.aspx
3 For a detailed definition of Big Data, the author refers the reader to Gartner (2014)
4 For more information on the Entrepreneurs Desk, the author refers the reader to:
http://www.beci.be/services/je_cree_ma_societe/guichet_d_entreprises/enterprise_desk/
5 For more information on UNOS, the author refers the reader to: http://www.unos.org/
6 For more information on Healthcare.gov, the author refers the reader to: https://www.healthcare.gov/
7 For more information on deCODE, the author refers the reader to: http://www.decode.com/
8 For more information on Islendingabok, the author refers the reader to: https://www.islendingabok.is/Leidbeiningar.jsp
9 For more information on Sagebase, the author refers the reader to: http://sagebase.org/
10 For more information on MODA, the author refers the reader to: http://www.nyc.gov/html/analytics/html/home/home.shtml
11 For more information on the Open Data from the European Union, the author refers the reader to: https://open-data.europa.eu/en/data/
12 For more information on the Open Data from the USA, the author refers the reader to: https://www.data.gov/
13 For more information on the Open Data from the OECD, the author refers the reader to: http://data.oecd.org/
14 For more information on the Open Data from the Worldbank, the author refers the reader to: http://data.worldbank.org/
15 For more information on Algorithmia, the author refers the reader to: https://algorithmia.com/
16 The aim of the local authorities is to install Traffic Lab devices in more than 50 000 vehicles by 2015. When ready, the data will be made
available to any interested party, in order to continue to crunch the data and have the new transport up and running by 2025. For more details on
Traffic Lab, the author refers the reader to: http://trafficlab.fi/
17 For more information on LIVE Singapore!, the author refers the reader to: http://senseable.mit.edu/livesingapore/
18 For the Big Data events in Boston, the author refers the reader to: http://www.massbigdata.org/events/
19 For more information on data driven fraud detection, the author refers the reader to the Big Data insight of the Association of Certified Fraud
Examiners: http://www.acfe.com/
20 For more information on the London Data Store, the author refers the reader to: http://data.london.gov.uk/
21 For more information on the Brussels Smartcity initiative, the author refers the reader to: http://smartcity.bruxelles.be/
22 For details on waterfall methodologies, the author refers the reader to: http://www.pmi.org/PMBOK-Guide-and-Standards.aspx (PMBOK),
http://www.prince-officialsite.com/ (Prince2). For scrum, the author refers to https://www.scrum.org/ and to https://www.scrumalliance.org/.
CHAPTER 66
The Role of Geo-Demographic Big Data for Assessing the Effectiveness of Crowd-Funded Software Projects:
A Case Example of “QPress”
Jonathan Bishop
Centre for Research into Online Communities and E-Learning Systems, UK
ABSTRACT
The current phenomenon of Big Data – the use of datasets that are too big for traditional business analysis tools used in industry – is driving a
shift in how social and economic problems are understood and analysed. This chapter explores the role Big Data can play in analysing the
effectiveness of crowd-funding projects, using the data from such a project, which aimed to fund the development of a software plug-in called
‘QPress’. Data analysed included the website metrics of impressions, clicks and average position, which were found to be significantly connected
with geographical factors using an ANOVA. These were combined with other country data to perform t-tests in order to form a geo-demographic
understanding of those who are displayed advertisements inviting participation in crowd-funding. The chapter concludes that there are a number
of interacting variables and that for Big Data studies to be effective, their amalgamation with other data sources, including linked data, is
essential to providing an overall picture of the social phenomenon being studied.
INTRODUCTION
In the current digital age, we have seen an unprecedented global recession that could be seen to have challenged the willingness of persons to
take risks in innovation (Etzkowitz, 2013), but this is not always the case (Singh, 2011). One approach that has been suggested as an appropriate
means to help overcome such financial shortfalls is crowd-funding. Put simply, crowd-funding is the procurement of financial capital from those
who want to benefit from a particular innovation (Kshirsagar & Ahuja, 2015; Ordanini et al., 2011). The question that is often asked is how to
assess the effectiveness of crowd-funding projects and also how they should be benchmarked. This chapter argues that an important part of this
process is the use of what has become known as ‘Big Data.’ Big Data is still a maturing and evolving discipline, and Big Data databases and files have
already scaled beyond the capacities and capabilities of commercial database management systems (Kaisler et al., 2014). Big Data is defined as
datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyse, where the primary
characteristics are ‘volume, velocity, and variety’ (Malgonde & Bhattacherjee, 2014; Zhang et al., 2014).
It has been argued that geography might provide a useful lens through which to understand big data as a social phenomenon in its own right in
addition to providing answers to the complexity of social and spatial processes (Graham & Shelton, 2013). Even so, it has been argued that the
aggregation of social media as big data is not necessarily social science data, even in the fields of human geography and geographic information
science (Wilson, 2014). This chapter shows how using geo-demographic analyses with Big Data can improve the effectiveness of crowd-funded
projects.
BACKGROUND
This chapter is in essence looking at effective means for assessing the impact of a crowd-funded campaign supported by advertising. It is argued
that geo-demographic factors play a significant role in the effectiveness of crowd-funding projects, particularly those supported by advertising. It
is further argued that Big Data can be used to identify trends that go beyond the usual metrics for advertising campaigns – such as impressions,
clicks and average position – while at the same time supporting the use of such measures.
Big Data
According to the New York Times, many think Big Data is synonymous with “Big Brother,” in the form of mega-corporations collecting masses of
surveillance information on their customers or potential customers. However, as this chapter advocates, it can also be of use to smaller entities,
such as crowd-funded projects. It has been estimated that Google alone contributed 54 billion dollars to the US economy in 2009 as a result of
Big Data, but there is still no clear consensus on what it is (Labrinidis & Jagadish, 2012). Even so, Big Data is something that each business will
have to adopt as a normal way to develop business strategy (Woerner & Wixom, 2015).
Crowd Funding
It has been argued that the main objective of crowd-funding is to give entrepreneurs a way of raising money that does not involve banks or
venture capitalists, and it usually involves giving the customers of the product a stake in it in one way or another (Kshirsagar & Ahuja, 2015).
This can be seen as desirable in the current capitalist environment, where banks charge excessive rates of interest and venture capitalists extort
shares in profits totally unrelated to the assistance they have actually given a project. Crowd-funding has been used in fields as diverse as
archaeology (Bonacchi et al., 2015) and sustainable development (Kunkel, 2015). As the name suggests, by targeting potential
beneficiaries of a product or service and asking them to get on board with funding something they want, a strong customer base becomes possible
before products have even hit the shelves – or download pages, as may be more appropriate. Crowd-funding is increasingly becoming a
trusted way to provide finance to businesses and consumers alike (Laven, 2014). Given the existence of specialist websites like IndieGoGo (Dushnitsky
& Marom, 2013; Stern, 2013) and tailored charitable giving websites, opportunities for reaching potential investors exist whether one is
looking to be sponsored for a race or other effort, or to get a product off the ground. Crowd-funding is
usually conducted via the Internet and involves not only the investors and the initiator, but often third parties, such as programmers, who use the
funds collected through crowd-funding initiatives to produce the project being crowd-funded (Ferrer-Roca, 2014).
It is clear that crowd-funding has a lot of potential to replace the traditional capitalist way of doing things, where a product’s development is
driven by someone intending to make financial profit through efficient use of human and financial capital, rather than achieving benefits to
stakeholders in other ways as well (Harvey, 2011). This is because when customers become financers, the traditional model – in which risk is taken on by
a small number of shareholders in a company that rewards that risk with money brought in from selling to customers – appears less relevant. It is
known that the cost of private sector services can be equivalent to the value of human capital in the voluntary sector (Bishop, 2012b). On that
basis, whilst this study focuses on the crowd-funding of software through receiving funds to pay programmers, it is easily conceivable that in the
future crowd-funding will involve venture human capital as much as the micro-financing it does at present.
Online Advertising
Big Data is on the one hand distinct from the Internet, and on the other it can be seen that the Web makes it much easier to collect and share data
(Cukier & Mayer-Schoenberger, 2013). One situation in which Big Data can be collected is online advertising, which produces
a huge number of metrics. This has long been the case with online advertising, where tracking a user’s interests based on the websites they
visit, and using that to present custom adverts, is long established (Arthur et al., 2001). As can be seen from Table 1, the standardised metrics
collected from such platforms are clicks, impressions, cost and average position.
Table 1. Key metrics collected in the serving of online advertisements (columns: Metric, Description; table body not reproduced)
The cost of an online advertising campaign is usually measured in cost-per-impression, cost-per-click, or cost-per-action (Huang, 2013). By
exploring the overall picture around this, Big Data can improve the effectiveness of an online advertising campaign (Huang, 2013). Equally, as the
point of Big Data is to mine data in ways that bring out new information, it might be that a dataset could suggest other ways of determining the
cost of an online advert, such as ‘cost-per-friend,’ where advertising to the friends of someone who is already a customer may deliver new
customers. Such systems already exist, for example systems that recommend people join the buddy list of someone a friend has expressed positive
sentiment towards, or of someone who has been declared an enemy by one of their enemies (Bishop, 2011).
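For reference, the three classic cost measures can be computed directly from campaign totals; this is a minimal sketch and the figures used are illustrative only.

```python
# Minimal sketch (illustrative campaign totals): the standard cost measures mentioned
# above - cost-per-impression (CPM), cost-per-click (CPC) and cost-per-action (CPA).

def campaign_costs(total_cost, impressions, clicks, actions):
    """Return the three classic cost metrics; CPM is conventionally per 1,000 impressions."""
    return {
        "CPM": 1000 * total_cost / impressions if impressions else None,
        "CPC": total_cost / clicks if clicks else None,
        "CPA": total_cost / actions if actions else None,
    }

print(campaign_costs(total_cost=250.0, impressions=120_000, clicks=400, actions=12))
# {'CPM': 2.083..., 'CPC': 0.625, 'CPA': 20.833...}
```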
GeoDemographic Analysis
Effective use of analytics in relation to Big Data is now key to the success of many businesses, whether in scientific, engineering or government
endeavours (Herodotou et al., 2011; Hurwitz et al., 2013), but by the very nature of Big Data, choosing the correct method for analysis is a
challenge (Shim, 2012). This chapter has sought to propose that the effectiveness of crowd-funding campaigns be assessed through making use of
geo-demographic factors. Table 2 sets out some of the ones that will be used in this study.
Table 2. Geo-demographic factors that influence advertising campaign effectiveness (columns: Factor, Description; table body not reproduced)
In terms of geographic mobility, education has been known to be an important factor (Olwig & Valentin, 2014). Indeed, geo-demographic factors
such as size of household are known to be linked to education (Katircioğlu, 2014). Geographical factors are known to be causes of barriers to
education (Suhonen, 2014). Education and geographical factors are known to be linked to socioeconomic indicators (Wilson‑Ching et al., 2014).
On that basis it is likely education levels in geographical regions will affect the extent to which people participate in online advertising and crowd-
funding projects. Many crowd-funding projects that relate to software will often have a limited user-base. It might be that the less likely a mega-
corporation is to develop a particular product or service, the more popular a crowd-funded project – doing something many people want – will
be.
Intelligence is known to be linked to geographic factors (Suhonen, 2014), and crowd-funding is known to make the best use of finance and
knowledge to provide opportunities to multiple people from diverse backgrounds (Shiller, 2013). Indeed, crowd-funding has become a popular
means for people to improve the quality of their life through pooling financial and other resources to achieve their goals (Feldmann et al., 2013).
Crowd-funding therefore has the potential to increase opportunity for disadvantaged groups, including by making it possible
for those who have no apparent impairment to raise funds to help those who do have difficulties, but who may otherwise be denied equal
opportunities because they are not as profitable as mega-corporations deem they need to be in order to be worthwhile customers.
Table 3 shows two technological factors that could have an effect on geo-demographic differences in assessing crowd-funding effectiveness,
namely online ad expenditure and Internet access. It has been argued that when assessing the impact of expenditure in relation to geographical
issues, these should as far as possible reflect local market conditions (Bojke et al., 2013). This may be the case with ad expenditure, meaning that
factors such as productivity and Internet access are important considerations in assessing geo-demographic issues relating to the advertising
of crowd-funding projects. Factors such as the number of rooms in a house and the Internet access available in those rooms are known to be important
factors geo-demographically (Bishop, 2014a).
Table 3. Technology-based geo-demographic factors (columns: Term, Description; table body not reproduced)
A GEO-DEMOGRAPHIC ANALYSIS OF BIG DATA CONNECTED WITH ADVERTISING OF CROWD-FUNDING PROJECTS THROUGH THE CONTEMPORARY PRISM OF INTERNET TROLLING
Big Data as a term on the whole refers to datasets that are so complex that they become awkward to work with using standard statistical software
(Sagiroglu & Sinanc, 2013; Snijders et al., 2012). On that basis, testing its impact on crowd-funded projects can be challenging, especially as many
crowd-funding websites do not produce analytics data. The purpose of this study is on the one hand to show that Big Data can replicate the
findings of ‘small data’ (Bishop, 2014a) when it comes to geo-demographic datasets. On the other hand the study hopes to show that one of the
optimal ways to analyse Big Data is over a period of time based on monitored categories of data. It is argued that methods such as Panel Data are
suited to Big Data, such as for identifying trends in behaviour over periods of time. In order to provide evidence in support of this claim an
ANOVA is used along with t-tests to show relationships between metrics such as clicks, impressions, costs and position with the geo-demographic
data of the study to be replicated (Bishop, 2014a). As can be seen, however, even a dataset this small presents the same difficulties as the massive
datasets associated with Big Data, as the multitude of variables still produces a larger, more varied and complex structure, akin to the
difficulties associated with storing, analysing and visualising Big Data (Sagiroglu & Sinanc, 2013).
The Study Being Replicated
Internet trolling can be defined as the posting of a message to the Internet in order to provoke a reaction (Bishop, 2012a; Bishop, 2013a; Bishop,
2013b; Bishop, 2014b; Bishop, 2014c; Cowpertwait & Flynn, 2002; Crumlish, 1995; de-la-Peña-Sordo et al., 2013; Hardaker, 2010; Hardaker,
2013; Jansen, 2002; Jansen & James, 1995; McCosker, 2013; Pfaffenberger, 1995; Phillips, 2012; Walter, 2014). One might therefore regard
advertising – which attempts to provoke certain reactions in consumers and others – as a form of trolling. Indeed, a premise used in this
chapter is that online advertisements can be considered to be acts of trolling users – to draw their attention to something they may not have
originally been concerned with, such as a crowd-funding project. In order to provide evidence in support of this, the chapter will seek to replicate
the findings of a study linking Internet trolling to geographical factors (Bishop, 2014a). This previous study sought to identify factors that
could be used to predict whether a given locality was likely to have high levels of trolling based on its geo-demographical factors.
As can be seen from the data in Table 4, which had a CV of 3.07, it was safe to conclude that the locality in which one lives has an effect on
education outcomes, which is likely down to the geo-demographics of the area. Wales, which had the lowest productivity (164) – half
that of the South East of England (320) – had an education outcome of 2, compared to 3 for the South East. Intelligence was also lower in
Wales (IQ=92) compared to the South East (IQ=105), but this may be down to biases in the measure of intelligence, which can be linked to
factors that favour more prosperous communities over more deprived ones.
Table 4. Factors affecting propensity of Internet trolling in given geographies
Differences between the areas in relation to intelligence meant it was suitable to accept the claim that intelligence differs by geography, which is
likely because those who achieve higher qualifications are more likely to be well practised in the skills that form part of intelligence testing.
Perception of quality of life is also significant and may correlate with productivity levels, as the higher productivity figures seem to be directly
proportional to lower quality of life. The number of trolling incidents per unit of productivity in Wales, Scotland and the South East of England
provided a clear indicator that increased productivity does not result in reduced cybercrime. The extreme number of incidents of flame trolling in
the South East suggests that the police authorities in the region are not taking trolling seriously, and that it is not young people that are to blame for the
flame trolling. Indeed, one police force in that locality is known to be soft on flame trolling.
Sussex Police refused to take action against a police officer, aged 32 from Birmingham, who allegedly harassed Sussex resident, Nicola Brookes,
on Facebook. The police officer allegedly targeted Brookes directly, including hacking in to her email, for which computer forensics of his IP
address was available, but during the 19 months of appeals by Brookes he allegedly had his computer reconditioned. This clearly shows that even
where there are strong geographical links indicating trolling propensity, this does not necessarily translate into police action. Indeed, one
might argue that the reason trolling is so high in some geographical regions is police inaction.
Methodology
The study’s methodology was based on a broadly empirical approach, in which data were collected in the form of web metrics on the basis that they
could be analysed to provide truths about those in the countries the data were recorded from. In terms of methods, the data were collected through
the metrics that come from Google Ads, and analysed using ANOVA and t-tests. The dataset was reduced to groups based on countries, limited
to 13 OECD countries for which there were at least 30 observations per country and a good availability of quality country-specific data from the
OECD. This produced a total of 2787 observations. The Google Ad displayed is shown in Figure 1.
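A minimal sketch of this analysis step is given below: a one-way ANOVA per metric with country as the grouping factor, using pandas and scipy on a small synthetic stand-in for the 2787 observations (the column names and values are assumptions, not the study's data).

```python
# Minimal sketch (synthetic data, hypothetical column names): one-way ANOVA per web
# metric with country as the grouping factor, mirroring the analysis described above.
import pandas as pd
from scipy import stats

# In the study, 'data' would hold ~2787 Google Ads observations for 13 OECD countries.
data = pd.DataFrame({
    "country":     ["US", "US", "UK", "UK", "ES", "ES"],
    "clicks":      [0.00, 0.01, 0.02, 0.01, 0.11, 0.09],
    "impressions": [3.1, 3.4, 8.0, 8.3, 6.0, 5.5],
    "cost":        [0.00, 0.01, 0.01, 0.00, 0.02, 0.01],
    "position":    [4.4, 4.4, 4.1, 4.0, 3.2, 3.1],
})

for metric in ["clicks", "impressions", "cost", "position"]:
    # One group of observations per country, compared with a one-way ANOVA.
    groups = [group[metric].values for _, group in data.groupby("country")]
    f_score, p_value = stats.f_oneway(*groups)
    print(f"{metric:12s} F = {f_score:.3f}  p = {p_value:.3f}")
```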
Results
Table 5 shows the result of the outcome of the ANOVA that was run on the dataset. It is clear to see that for the 13 OECD countries there were
three significant outcomes, namely clicks, impressions, and average position. As can be seen below, these were used as part of a further analysis
with the means for each of the countries on each of these factors being compared on a country level.
Table 5. Analysis of variance of web metrics (columns: Factor, df, Mean Square, F, p, Null; only the Average Position row is reproduced)
Average Position: df = 12, Mean Square = 0.000, F = 18.745, p = 0.000, Null: Reject
Table 6 shows the OECD data on Internet access and online advertisement spending merged into the same table with the means for the factors
identified in Table 5, all listed for each of the 13 countries selected. The table is sorted with the country with the highest amount spent on
advertising at the top and the one with the lowest advertising expenditure at the bottom. No data were supplied by the OECD for Mexico, and as it
is low on many other indicators, it did not seem inappropriate to place it at the bottom due to its null value. With a CV of 1.786 it is clear
that it is safe to reject the null in the case of Clicks (F=2.271, p ≤ 0.008), Impressions (F=6.330, p ≤ 0.001) and Average Position (F=18.745,
p ≤ 0.001). In the case of Costs it was necessary to keep the null because the F-score of 1.229 did not exceed the CV of 1.786, and in any case the
difference was not significant (p ≤ 0.001). Table 6 shows all three of these factors with the means for each country next to the Internet access rate
and online ad spend for those countries, to enable the reader to easily see their relevance.
Table 6. Advertising and Internet access data for 13 OECD countries (ordered by ad expenditure)
Columns: Country; Online Ads (USD, 1/100 of a million); Internet Access (per cent); Impressions; Clicks; Position. Only the first two rows are reproduced:
United States: 324.79; 78.70; 3.25; 0.003; 4.41
United Kingdom: 69.98; 82.50; 8.15; 0.016; 4.06
This section seeks to find out the differences between those countries reporting the lowest scores on various socio-economic factors and those
showing the highest. The socio-economic factor of productivity was derived from the OECD variable of the same name. Education was calculated
from the Wolfram Alpha databank, which ranks the education level of a given country on a scale of 0 to 1. The value for each country in terms of
education was multiplied by 5, which is often the upper level for NVQs in the UK and which was used in the earlier study (Bishop, 2014a). The values
derived seemed comparable with the earlier study. Intelligence was calculated by modifying the literacy rate provided by the OECD for each
country: the median of all the countries in the dataset was used as the 100 baseline and the other countries were calculated in relation to that. The
number of rooms was calculated using the OECD variable of the same name.
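The two derivations just described can be expressed compactly; in the sketch below the index and literacy values are back-derived or invented for illustration and are not the actual Wolfram Alpha or OECD figures.

```python
# Minimal sketch (made-up inputs): the two derivations described above - education
# rescaled to the 0-5 NVQ-like range, and an intelligence proxy built from literacy
# rates normalized so that the median country sits at 100.
from statistics import median

education_index = {"US": 0.94, "UK": 0.816, "MX": 0.726}   # 0-1 index (illustrative values)
literacy_rate   = {"US": 99.0, "UK": 99.0, "MX": 94.5}      # per cent (illustrative values)

# Education level: multiply the 0-1 index by 5.
education_level = {c: round(v * 5, 2) for c, v in education_index.items()}

# Intelligence proxy: anchor the median country's literacy rate at 100.
baseline = median(literacy_rate.values())
intelligence_proxy = {c: round(100 * v / baseline, 1) for c, v in literacy_rate.items()}

print(education_level)      # {'US': 4.7, 'UK': 4.08, 'MX': 3.63}
print(intelligence_proxy)   # median country anchored at 100
```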
What is noticeable from Table 7 is that the countries with the highest impressions are generally associated with low levels of productivity. A BBC
investigation found that many of the likes on pages are bogus, and can be linked to the least prosperous countries. It could therefore be
assumed that the high numbers of impressions are linked to the fact that many websites in emerging economies are set up with bogus
content for the sole purpose of raising revenue from contextualised advertising, where those who click the adverts have no actual interest in what
is being advertised, but only click because doing so earns the website concerned money. On the one hand this demonstrates the value of considering
geo-demographic factors when choosing outlets for advertising; on the other hand it might suggest that datasets collected from online advertising are
not entirely reliable, as it cannot be guaranteed that they represent genuinely interested parties.
Table 7. Education and economic data (sorted by highest impressions)
Columns: Country; Impressions; Clicks; Productivity (USD per person); NEETs; Education. Only the first row is reproduced:
United Kingdom: 8.15; 0.016; 73.64; 20.60; 4.08
In terms of productivity, Mexico was the lowest at $23.68 per person and Switzerland was the highest at $135.83 per person. In terms of
Internet access, Switzerland was the highest, with 86.8 per cent of people having access, and Mexico was the lowest, with 38.7 per cent of its people
having access to the Internet. As can be seen from Table 8, there was a significant difference (p ≤ 0.001) between Mexico (M=2.35) and
Switzerland (M=4.01) in relation to average position, and with a t-score of 3.829 it was acceptable to reject the null. There was no significant
difference in impressions (p ≤ 0.033) between Mexico (M=18.73) and Switzerland (M=4.89), nor in clicks (p ≤ 0.005) between Mexico
(M=0.04) and Switzerland (M=0.00). Equally, there was no significant difference (p ≤ 0.037) in relation to cost between Mexico (M=0.01) and
Switzerland (M=0.00). It was therefore not possible to reject the null in these cases.
Table 8. Comparing Mexico’s Internet access rate with Switzerland’s using Big Data metrics
Columns: Metric; Mexico (M); Switzerland (M); t-score; p-value. Only the Average Position row is reproduced:
Average Position: 2.35; 4.01; 3.829; 0.000
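A comparison of this kind can be reproduced with a two-sample t-test per metric; the sketch below generates synthetic observations around the reported means and applies Welch's variant, which is an assumption, since the original analysis does not state whether equal variances were assumed.

```python
# Minimal sketch (synthetic observations): a two-sample t-test for one metric, as in
# the lowest-vs-highest country comparisons reported above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-observation 'average position' values centred on the reported means.
mexico      = rng.normal(loc=2.35, scale=1.0, size=40)
switzerland = rng.normal(loc=4.01, scale=1.0, size=40)

t_score, p_value = stats.ttest_ind(mexico, switzerland, equal_var=False)  # Welch's t-test
print(f"t = {t_score:.3f}, p = {p_value:.4f}")
```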
The significance of average position in relation to Internet access might be simple to explain. The average position of the advert in Mexico was
2.35, but in Switzerland it was 4.01, which might reflect the fact that, as more people are subscribed to the Internet in Switzerland (86.8%)
than in Mexico (38.7%), the number of ads displayed will be higher, meaning many will be lower down the list in Switzerland, where
there are likely to be more advertisers than in Mexico. Productivity in Switzerland ($135.83 per capita) is significantly higher than in Mexico ($23.68
per capita), which may also be a factor in why impressions are higher in Mexico, where revenue from adverts will be more important. No data on
the amounts spent on online advertising were available for Mexico, but if one considers Hungary, which has a similar productivity ($27.29) to
Mexico ($23.68) as well as similar numbers of impressions (17.43 versus 18.73), then Mexico's online advertising expenditure is likely
to be as low ($150m). Compared to Mexico, Switzerland’s spending is phenomenal ($578), and the fact that the cost of advertising is higher in
Mexico might suggest that crowd-funders would make a loss by advertising there.
Online Ad Expenditure: Portugal (Lowest) vs. United States (Highest)
The United States had the highest annual online ad expenditure, at $32,479,000, and Portugal was the lowest, with $42,800
of online ad expenditure. Table 9 shows that in the case of impressions there was a significant difference (p ≤ 0.001) between the United States
(M=3.25) and Portugal (M=14.37), with a t-score of -6.624. Equally, there was a significant difference (p ≤ 0.001)
between Portugal (M=2.79) and the United States (M=4.41) in terms of average position, with a t-score of 6.533. There was no significant
difference (p < 0.078) between the United States (M=0.06) and Portugal (M=0.02) in relation to clicks, as can be seen from Table 9. Equally, there
was no significant difference in terms of cost: although there was a p-value of less than 0.003, the t-score of 0.002 was not enough.
Table 9. Comparing Portugal’s online ad expenditure with the United States’ using Big Data metrics

Metric           | Portugal (M) | United States (M) | t-score | p-value
Average Position | 2.79         | 4.41              | 6.533   | 0.000
NEETs: Spain (Highest) vs. Japan (Lowest)
As there was no dedicated OECD indicator for youths not in education, employment or training (NEETs), the available indicator of youth unemployment was used in its place. The term NEETs will still be used for analysis purposes. The country with the highest number of NEETs was Spain, with 51.6 per cent of youths being unemployed, and the country with the lowest was Japan, with only 6.9 per cent of youths registered as unemployed. As can be seen from Table 10, in the case of clicks there was a significant difference (p < 0.001) between Spain (M=0.11) and Japan (M=0.03); with a t-score of 2.581, it was safe to reject the null hypothesis. It was decided to retain the null hypothesis due to insignificant differences in the case of impressions (p = 0.026), average position (3.165) and cost (1.256).
Table 10. Comparing Spain’s percentage of NEETs with Japan’s using Big Data metrics
As can be seen from Table 10, the number of clicks for Spain (M=0.11) was significantly higher than for Japan (M=0.03). Linking this to the metric of youth unemployment, such as those not in education, employment or training, could be conceptually challenging, even if statistically significant. Spain has a higher number of NEETs, in the form of 51.6 per cent youth unemployment, compared to Japan, where youth unemployment is 6.9 per cent. It has been argued that advertising on social media like Facebook is a cost-effective method of recruiting youth from a wide population (Chu & Snider, 2013). However, a 2012 investigation by the British Broadcasting Corporation found that many clicks on Facebook are made by people with fake accounts. It could therefore be that some of these clicks are attempts to raise revenue for the website on which the adverts are displayed, and this may be common in geographical regions where there is poverty.
Even so, the results may be more related to the fact that it was advertising a crowd-funding project. Even though it has been shown that entrepreneurial activity fell in Spain as a result of the global recession, entrepreneurship out of necessity increased (del Rio et al., 2014). A research study of 521 Spanish undergraduate design students found that they demonstrated a high entrepreneurial intention (62%) and that attitudinal factors outweighed the students' self-perceived inability to develop their own businesses (Ubierna et al., 2014). It could therefore be the case that young people are more likely to take risks with crowd-funded projects where they anticipate future returns.
Rooms in House and Education: Mexico (Lowest) vs. United States (Highest)
In terms of rooms in house, Mexico was the lowest (M=3.88) and the United States was the highest (M=4.65). The same was the case with education outcomes, with the United States being the highest (M=4.70) and Mexico the lowest (M=3.63). There was a significant difference with regard to impressions (p < 0.001) and average position (p < 0.001). In the case of impressions it was suitable to reject the null hypothesis, as there was a t-score of -9.887 and means of 18.73 for Mexico and 3.25 for the United States (see Table 11). In the case of average position, the significant difference could be seen between Mexico (M=2.35) and the United States (M=4.41), with a t-score of 6.003, meaning it was safe to reject the null hypothesis. There was no significant difference for clicks (p = 0.210) in relation to Mexico (M=0.04) and the United States (M=0.06), nor was there a significant difference in relation to costs (p = 0.015) for the United States (M=0.00) and Mexico (M=0.01), which had an insignificant t-score of -1.269.
Table 11. Comparing Mexico’s rooms in house and education outcomes with the United States’ using Big Data metrics

Metric           | Mexico (M) | United States (M) | t-score | p-value
Average Position | 2.35       | 4.41              | 6.003   | 0.000
It is known that in Mexico the number of rooms in a house is indicative of social and economic status more widely (López-Feldman, 2014; Mora-Ruiz et al., 2014). In Mexico, the size of the house and the number of rooms mould the number and type of activities people can do, and the relationship with those outside of the house is also connected with positive life experiences (Landázuri et al., 2014). It has been found that the highest numbers of rooms equated with areas with the lowest productivity and highest levels of trolling (Bishop, 2014a). Even though the difference was not significant, the mean number of clicks was higher in the United States, even though the number of impressions was nearly six times greater in Mexico. The t-score for impressions (i.e. -9.887), being a negative value, would suggest instead that a high number of impressions is equated with low levels of education and low room numbers. This, taken with the number of clicks, would suggest that those in Mexico are less prone to 'feed the troll' in the form of advertisers, which may be down to not being able to afford the product advertised. However, as QPress is marketed at academics, it might simply be that low education outcomes (3.65 out of 5) would lead to little interest in QPress's functions. The reason the average position is higher in the United States may be that there is more competition for advertising space in the US, or simply that QPress is advertising in the same areas as other products and services.
Intelligence: Mexico (Lowest) vs. Japan (Highest)
Table 12. Comparing Mexico’s intelligence levels with Japan’s using Big Data metrics

Metric           | Mexico (M) | Japan (M) | t-score | p-value
Average Position | 2.35       | 2.32      | -0.183  | 0.066
Even though the number of impressions for Mexico (M=18.73) is higher than for Japan (M=6.67), because the t-score is negative (t=-3.163) it suggests that a higher number of impressions is linked to lower intelligence (i.e. literacy) and not higher intelligence. One might expect the opposite: for searches that would call up an advert for QPress to be linked with more academic audiences. One might therefore want to consider whether it was the crowd-funding aspect that drew people to the advert. Mexico is associated with micro-financing more as a recipient than as an investor (Smith et al., 2014), but the relative poverty in the region might mean necessity entrepreneurship would be higher. Japan is associated with several high-profile crowd-funding initiatives, and micro-financing is an accepted practice (Ikeda & Marsumaru, 2012).
Mexico's productivity ($23.68 per capita) is lower than Japan's ($93.23 per capita), which might explain why the impressions for Mexico are nearly triple Japan's. Mexico needs the funding due to its low geo-demographic figures, and it might be that websites are being set up for the sole purpose of extorting funds from advertisers who have not optimised their advertising campaigns. The fact that the cost of an advert in Mexico is significantly lower than in Japan, which has higher intelligence, might suggest it is not the most optimal market for targeting crowd-funded schemes, which want to raise funds and not expend them.
GENERALISING THE DATA
This chapter has so far explained the differences between countries in terms of the geo-demographic data associated with them. This section aims to put this together to show how the findings from the advertising for QPress can be used to further its development, and how they might be generalised to other contexts.
Figure 2 shows the factors that are most associated with whether impressions, clicks, average position, and cost are high or low. A low number of impressions can be seen to be associated with countries with high educational attainment and the lowest intelligence, which may appear contradictory. Conversely, a high number of impressions was associated with low numbers of rooms in housing and low ad expenditure for a country. This could be because low ad expenditure means that one's online ad is likely to appear higher up the page, or more frequently, because fewer people are advertising.
Figure 2. Geo-demographic factors associated with online
advertising metrics
In terms of clicks, a high number of clicks is associated with a high number of young people classed as NEETs. It is known that online adverts are more likely to be clicked on by those who develop, or have already developed, strong brand awareness (Dahlen & Bergendahl, 2001). Online ads that offer high-involvement products, as opposed to low-involvement ones, are also more likely to be clicked on (Dahlén et al., 2000). The advert for QPress shown in Figure 1, by referring to "Invest in QPress", clearly suggests that an amount of involvement will come from the activity. There is, however, known to be an amount of cynicism about the actual engagement of young people when it comes to online advertising, which is seen as representing the commercialisation of the medium (Loader, 2007). Another reason may be that because young people are out of work, they are more likely to be interested in opportunities to make money, such as investing in QPress as the online advert displayed encouraged. The fact that few other factors were significantly associated with clicks might suggest that advertising schemes based on clicks are least likely to be effective in achieving a return on investment. With some advertising initiatives being based on pay-per-interaction, such as Pages or Posts that have been 'boosted' on Facebook, or those banners available on Tradedoubler requiring products to be purchased before fees are paid to the publisher, the pay-per-click and pay-per-impression models will likely become obsolete, as such methods are open to abuse, where they may be viewed or clicked on by people who have no interest whatsoever in what is being advertised.
In terms of average position, a place lower down the screen was associated with high education and low productivity, whereas being higher up the
page in terms of average position is associated with countries with a high number of rooms per house, high Internet access and where low ad
expenditure is the norm. Average position of an advert is an important issue in having an effective online advertising campaign (Rutz & Bucklin,
2011). Knowing the position of a given keyword can help one know whether to focus on it (Rutz et al., 2012). For instance, a low average position
could mean one has a lot of competitors, whereas a high average position could mean that one might not be marketing to the right people.
In terms of cost, the higher the cost, the lower that country's average intelligence. It is known that those who randomly surf the Internet are more likely to click on banner adverts than those who are being more purposeful, such as information-seekers (Li & Bukovac, 1999). On this basis, the view that those who do generic searches have little brand awareness (Rutz & Bucklin, 2011) is likely to be true of those from more deprived areas, who might not have awareness of what the product being advertised actually is.
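The associations summarised in Figure 2 can be checked, in a rough way, by correlating country-level indicators against advert metrics. The sketch below uses pandas with placeholder values (some loosely echoing figures quoted in this chapter, the rest invented) purely to show the mechanics; it is not a reproduction of the study's dataset.

```python
import pandas as pd

# Placeholder country-level indicators and advert metrics (illustrative only).
country_data = pd.DataFrame({
    "impressions":     [18.7, 4.9, 14.4, 3.3, 6.7],
    "clicks":          [0.04, 0.00, 0.02, 0.06, 0.03],
    "rooms_per_house": [3.9, 5.6, 4.0, 4.7, 4.8],
    "ad_expenditure":  [150, 578, 43, 32479, 9000],
    "neets_pct":       [20.6, 8.0, 37.0, 16.0, 6.9],
}, index=["Mexico", "Switzerland", "Portugal", "United States", "Japan"])

# Pearson correlations between geo-demographic factors and advert metrics,
# mirroring the kind of associations shown in Figure 2.
print(country_data.corr().loc[["rooms_per_house", "ad_expenditure", "neets_pct"],
                              ["impressions", "clicks"]])
```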
IMPLICATIONS AND FUTURE RESEARCH DIRECTIONS
This study has shown that it is possible to combine national data on countries with individual observations in order to understand social factors such as Internet access, education and intelligence. One might argue that, in the case of geographical factors, there are a number of interacting variables in the dataset used. This could be generalised to conclude that, in order for Big Data studies to be effective, they need to amalgamate with other data sources, including linked data, as an essential component of providing an overall picture of the social phenomenon being studied. The study has presented a model for understanding the links between online advert metrics and geo-demographic data. Future research will have to uncover whether this relates only to the advertisement associated with QPress (Figure 1), or whether it can be applied more generally to other crowd-funding projects.
DISCUSSION
Big Data is the use of datasets that are too big for the traditional business analysis tools used in industry. This chapter has explored the role Big Data can play in analysing the effectiveness of crowd-funding projects, using the data from such a project, which aimed to fund the development of a software plug-in called 'QPress.' Using an ANOVA, the website metrics of impressions, clicks and average position were found by this chapter to be significantly connected with geographical factors. To understand these further, they were combined with other country data to perform t-tests in order to form a geo-demographic understanding of those who are displayed advertisements inviting participation in crowd-funding.
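A minimal sketch of the ANOVA step described above is given below, assuming the campaign metrics are grouped by country. The lists of daily impression counts are hypothetical and stand in for the chapter's data; SciPy's one-way ANOVA is used as the test.

```python
from scipy.stats import f_oneway

# Hypothetical daily impression counts grouped by country.
impressions_mexico = [17, 19, 20, 18, 21]
impressions_japan = [6, 7, 6, 8, 7]
impressions_switzerland = [5, 4, 6, 5, 4]

f_stat, p_value = f_oneway(impressions_mexico, impressions_japan, impressions_switzerland)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value indicates that impressions differ across countries, which is
# the kind of geographical effect reported in this chapter.
```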
The question as to whether it was possible to replicate the findings of an earlier study now has an answer. It appears clear that factors such as productivity, number of rooms in house, education, intelligence and NEETs are indeed important to understanding what affects the extent to which an online communication is provocative. The study looked not at regions, as was the case with the earlier study, but at countries, and it was still found that these factors play an important part in understanding the geo-demographics that exist and are measurable in relation to human behaviour.
The chapter has presented a model linking the geo-demographic factors identified with the online advertising metrics. It shows that a high number of impressions is associated with low rooms per house and low ad expenditure. It also shows that high clicks are associated with high numbers of young people not in education, employment or training (NEETs), and that a high-cost campaign is associated with lower intelligence. It found that a high average position of an online advert is associated with low advert expenditure, high rooms per house and high Internet access, while a low average position is associated with high education outcomes and low productivity.
It can therefore be concluded that an online advertising campaign, like the one for the crowd-funded application called QPress, has its success dependent on geo-demographic factors. It is clear that, when considering how to advertise to a locality, these will have to be taken into account. When advertising to countries with low productivity and education outcomes, an advertising method based on cost-per-impression or cost-per-click would not be the most effective option, as the adverts are more likely to be served on websites that are clicked on by people with little interest in what is being advertised. Methods based on cost-per-interaction, such as where an ad is only paid for when a person buys from a website, leaves a comment or likes a post, would be most suitable for these economies. Cost-per-click models would be most suited to geographies where productivity is high, along with education outcomes and intelligence. This may be because people in these economies are least likely to click on adverts, meaning a cost-per-impression model would have the least return on investment. Cost-per-impression models seem to be most suited to developed economies, such as Japan, Canada and the USA, for the reason that the number of impressions is low but the number of clicks is high. These economies have the highest spending on online advertising, which results in fewer impressions and greater competition for a good average position. It is therefore necessary for advertisers to know their target market, as bidding high on a keyword most will search on, to get higher up the page, may pay off more than bidding on many keywords of little relevance.
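The conclusion above amounts to a simple decision rule for choosing a billing model from a country's geo-demographic profile. The sketch below is a rough heuristic under assumed thresholds and field names of my own choosing (productivity in the same per-capita units as the chapter's tables, education on the 0-5 scale it reports); it is an illustration of the reasoning, not the study's method.

```python
def recommend_billing_model(productivity: float, education_score: float) -> str:
    """Suggest an ad billing model for a target country (illustrative thresholds)."""
    if productivity < 30 or education_score < 4.0:
        # Low-productivity / low-education markets: clicks and impressions are more
        # open to abuse, so pay only on interaction (purchase, comment, like).
        return "cost-per-interaction"
    if productivity > 90 and education_score >= 4.5:
        # Developed markets: fewer impressions, strong competition for position.
        return "cost-per-impression"
    return "cost-per-click"

print(recommend_billing_model(productivity=23.68, education_score=3.63))  # -> cost-per-interaction
print(recommend_billing_model(productivity=93.23, education_score=4.70))  # -> cost-per-impression
```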
This work was previously published in GeoIntelligence and Visualization through Big Data Trends edited by Burçin Bozkaya and Vivek Kumar Singh, pages 94-120, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Arthur, C., Ball, S., Bray, P., Grossman, W. M., Queree, A., & Rowe, J. (2001). The hutchinson dictionary of computing and the internet (4th ed.).
Oxford, GB: Helicon Publishing Ltd.
Bishop, J. (2012b). Lessons from the emotivate project for increasing take-up of big society and responsible capitalism initiatives . In Pumilia-
Gnarini, P. M., Favaron, E., Pacetti, E., Bishop, J., & Guerra, L. (Eds.), Didactic strategies and technologies for education: Incorporating
advancements (pp. 208–217). Hershey, PA: IGI Global. doi:10.4018/978-1-4666-2122-0.ch019
Bishop, J. (2013a). The art of trolling law enforcement: A review and model for implementing 'flame trolling' legislation enacted in Great Britain (1981–2012). International Review of Law, Computers & Technology, 27(3), 301–318. doi:10.1080/13600869.2013.796706
Bishop, J. (2013b). The effect of deindividuation of the internet troller on criminal procedure implementation: An interview with a
hater. International Journal of Cyber Criminology , 7(1), 28–48.
Bishop, J. (2014a). Digital teens and the 'antisocial Network': Prevalence of troublesome online youth groups and internet trolling in great
Britain. International Journal of E-Politics , 5(3), 1–15. doi:10.4018/ijep.2014070101
Bishop, J. (2014b). Internet trolling and the 2011 UK riots: The need for a dualist reform of the constitutional, administrative and security
frameworks in great britain. European Journal of Law Reform , 16(1), 154–167.
Bishop, J. (2014c). Representations of ‘trolls’ in mass media communication: A review of media-texts and moral panics relating to ‘internet
trolling’. International Journal of Web Based Communities , 10(1), 7–24. doi:10.1504/IJWBC.2014.058384
Bojke, C., Castelli, A., Street, A., Ward, P., & Laudicella, M. (2013). Regional variation in the productivity of the english national health
service. Health Economics , 22(2), 194–211. doi:10.1002/hec.2794
Bonacchi, C., Bevan, A., Pett, D., & Keinan-Schoonbaert, A. (2015). Developing crowd-and community-fuelled archaeological research: Early
results from the MicroPasts project.
Brechner, I. (2013). The value of long-term client and search agency relationships in a digital world. Journal of Digital & Social Media
Marketing , 1(3), 258–264.
Cowpertwait, J., & Flynn, S. (2002). The internet from A to Z. Cambridge, GB: Icon Books Ltd.
Crumlish, C. (1995). The internet dictionary: The essential guide to netspeak . Alameda, CA: Sybex Inc.
Cukier, K., & Mayer-Schoenberger, V. (2013). The rise of big data: How it's changing the way we think about the world. Foreign Affairs, 92, 28.
Dahlen, M., & Bergendahl, J. (2001). Informing and transforming on the web: An empirical study of response to banner ads for functional and
expressive products. International Journal of Advertising , 20(2), 189–205.
Dahlén, M., Ekborn, Y., & Mörner, N. (2000). To click or not to click: An empirical study of response to banner ads for high and low involvement
products. Consumption Markets & Culture , 4(1), 57–76. doi:10.1080/10253866.2000.9670349
de-la-Peña-Sordo, J., Santos, I., Pastor-López, I., & Bringas, P. G. (2013). Filtering trolling comments through collective classification. Network
and system security (pp. 707–713). Springer.
del Rio, Maria de la Cruz, Garcia, J. A., & Rueda-Armengot, C. (2014). Evolution of the socio-economic profile of the entrepreneur in galicia
(spain). Business and Management Research , 3(1), 61.
Dushnitsky, G., & Marom, D. (2013). Crowd monogamy. Business Strategy Review , 24(4), 24–26. doi:10.1111/j.1467-8616.2013.00990.x
Etzkowitz, H. (2013). Silicon valley at risk? sustainability of a global innovation icon: An introduction to the special issue. Social Sciences
Information. Information Sur les Sciences Sociales ,52(4), 515–538. doi:10.1177/0539018413501946
Feldmann, N., Gimpel, H., Kohler, M., & Weinhardt, C. (2013). Using crowd funding for idea assessment inside organizations: Lessons learned
from a market engineering perspective. Paper presented at the Cloud and Green Computing (CGC),2013 Third International Conference On, pp.
525-530. 10.1109/CGC.2013.88
Ferrer-Roca, N. (2014). Business innovation in the film industry value chain: A New Zealand case study. International Perspectives on Business Innovation and Disruption in the Creative Industries: Film, Video and Photography, 18.
Graham, M., & Shelton, T. (2013). Geography and the future of big data, big data and the future of geography. Dialogues in Human
Geography , 3(3), 255–261. doi:10.1177/2043820613513121
Hardaker, C. (2010). Trolling in asynchronous computer-mediated communication: From user discussions to academic definitions. Journal of Politeness Research: Language, Behaviour, Culture, 6(2), 215–242.
Hardaker, C. (2013). “Obvious trolls will just get you banned”: Trolling versus corpus linguistics . In Hardie, A., & Love, R. (Eds.),Corpus
linguistics 2013 (pp. 112–114). Lancaster: UCREL.
Harvey, D. (2011). The enigma of capital: And the crises of capitalism . Profile Books.
Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F. B., & Babu, S. (2011). Starfish: A self-tuning system for big data analytics. Paper
presented at the CIDR, 11. pp. 261-272.
Huang, L. (2013). Visual analysis on online display advertising data. Paper presented at the LargeScale Data Analysis and Visualization
(LDAV),2013 IEEE Symposium On, pp. 123-124. 10.1109/LDAV.2013.6675170
Hurwitz, J., Nugent, A., Halper, F., & Kaufman, M. (2013). Big data for dummies . For Dummies.
Ikeda, M. M., & Marsumaru, M. (2012). Leadership and social innovation initiatives at the grassroots during crises. Japan Social Innovation
Journal , 2(1), 77–81. doi:10.12668/jsij.2.77
Jansen, E. (Ed.). (2002). NetLingo: The internet dictionary . Ojai, CA: Netlingo Inc.
Jansen, E., & James, V. (Eds.). (1995). NetLingo: The internet dictionary . Ojai, CA: Netlingo Inc.
Kaisler, S., Armour, F., & Espinosa, J. A. (2014). Introduction to big data: Challenges, opportunities, and realities minitrack. Paper presented at
the System Sciences (HICSS), 2014 47th Hawaii International Conference On, pp. 728-728.
Katircioğlu, S. T. (2014). Estimating higher education induced energy consumption: The case of northern cyprus. Energy , 66, 831–838.
doi:10.1016/j.energy.2013.12.040
Kshirsagar, V., & Ahuja, R. S. (2015). Crowd-funding and its perspectives. Indian Journal of Applied Research , 5(1).
Kunkel, S. (2015). Green crowdfunding: A future-proof tool to reach scale and deep renovation? World sustainable energy days next 2014 (pp.
79–85). Springer.
Labrinidis, A., & Jagadish, H. (2012). Challenges and opportunities with big data. Proceedings of the VLDB Endowment, 5(12), 2032–2033.
doi:10.14778/2367502.2367572
Landázuri, A. M., Mercado, S., & Terán, A. (2014). Sustainability of residential environments. Revista Suma Psicológica , 20(2), 191–202.
doi:10.14349/sumapsi2013.1463
Laven, M. (2014). Money evolution: How the shift from analogue to digital is transforming financial services. Journal of Payments Strategy &
Systems , 7(4), 319–328.
Li, H., & Bukovac, J. L. (1999). Cognitive impact of banner ad characteristics: An experimental study. Journalism & Mass Communication
Quarterly , 76(2), 341–353. doi:10.1177/107769909907600211
Loader, B. D. (2007). Young citizens in the digital age: Political engagement, young people and new media . Routledge.
López-Feldman, A. (2014). Shocks, income and wealth: Do they affect the extraction of natural resources by rural households? World Development.
Malgonde, O., & Bhattacherjee, A. (2014). Innovating using big data: A social capital perspective. Paper presented at the Twentieth Americas Conference on Information Systems, Savannah, GA.
Mora-Ruiz, M., Penilla, R. P., Ordonez, J. G., Lopez, A. D., Solis, F., Torres-Estrada, J. L., Rodriguez, A.D. (2014). Socioeconomic factors,
attitudes and practices associated with malaria prevention in the coastal plain of chiapas, mexico. Malaria Journal, 13(1), 157-2875-13-157.
doi:10.1186/1475-2875-13-157
Norcliffe, G., & Mitchell, P. (1977). Structural effects and provincial productivity variations in canadian manufacturing industry. The Canadian
Journal of Economics. Revue Canadienne d'Economique , 10(4), 695–701. doi:10.2307/134300
Odesola, I. A., & Idoro, G. I. (2014). Influence of labour-related factors on construction labour productivity in the south-south geo-political zone
of nigeria. Journal of Construction in Developing Countries , 19(1), 93–109.
Olwig, K. F., & Valentin, K. (2014). Mobility, education and life trajectories: New and old migratory pathways. Identities, (ahead-of-print), 1-11.
Ordanini, A., Miceli, L., Pizzetti, M., & Parasuraman, A. (2011). Crowd-funding: Transforming customers into investors through innovative
service platforms. Journal of Service Management ,22(4), 443–470. doi:10.1108/09564231111155079
Pfaffenberger, B. (1995). Que's computer & internet dictionary(6th ed.). Indianapolis, IN: Que Corporation.
Phillips, W. (2012). This is why we can't have nice things: The origins, evolution and cultural embeddedness of online trolling. Unpublished Doctor of Philosophy thesis. Oregon, USA: University of Oregon.
Rutz, O. J., & Bucklin, R. E. (2011). From generic to branded: A model of spillover in paid search advertising. JMR, Journal of Marketing
Research , 48(1), 87–102. doi:10.1509/jmkr.48.1.87
Rutz, O. J., Bucklin, R. E., & Sonnier, G. P. (2012). A latent instrumental variables approach to modeling keyword conversion in paid search
advertising. JMR, Journal of Marketing Research ,49(3), 306–319. doi:10.1509/jmr.10.0354
Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. Paper presented at the Collaboration Technologies and Systems (CTS),2013 International
Conference On, pp. 42-47.
Shim, K. (2012). MapReduce algorithms for big data analysis.Proceedings of the VLDB Endowment , 5(12), 2016–2017.
doi:10.14778/2367502.2367563
Singh, S. K. (2011). Organizational innovation as competitive advantage during global recession. Indian Journal of Industrial Relations , 46(4),
713–725.
Smith, R., Eaton, A., Arshad, H., & Ricketts, P. (2014). Global crowd funding to increase accessibility for small-scale biodigester projects. Boiling
Point, (62), 2-5.
Snijders, C., Matzat, U., & Reips, U. (2012). Big data: Big gaps of knowledge in the field of internet science. International Journal of Internet
Science , 7(1), 1–5.
Stern, J. S. (2013). Characteristics of content and social spread strategy on the indiegogo crowdfunding platform. (Master of Arts, University of
Texas at Austin).
Suhonen, T. (2014). Field-of-study choice in higher education: Does distance matter? Spatial Economic Analysis , 9(4), 355–375.
doi:10.1080/17421772.2014.961533
Ubierna, F., Arranz, N., & Fdez de Arroyabe, J. (2014). Entrepreneurial intentions of university students: A study of design undergraduates in
spain. Industry and Higher Education ,28(1), 51–60. doi:10.5367/ihe.2014.0191
Walter, T. (2014). New mourners, old mourners: Online memorial culture as a chapter in the history of mourning. New Review of Hypermedia
and Multimedia, (ahead-of-print), 1-15.
Wilson, M. W. (2014). Morgan freeman is dead and other big data stories. Cultural Geographies , 1474474014525055.
Wilson‑Ching, M., Pascoe, L., Doyle, L. W., & Anderson, P. J. (2014). Effects of correcting for prematurity on cognitive test scores in
childhood. Journal of Paediatrics and Child Health ,50(3), 182–188. doi:10.1111/jpc.12475
Woerner, S. L., & Wixom, B. H. (2015). Big data: Extending the business strategy toolbox. Journal of Information Technology ,30(1), 60–62.
doi:10.1057/jit.2014.31
Yuan, Y., Wang, F., Li, J., & Qin, R. (2014). A survey on real time bidding advertising. Paper presented at the Service Operations and Logistics,
and Informatics (SOLI),2014 IEEE International Conference On, pp. 418-423. 10.1109/SOLI.2014.6960761
Zhang, J., Li, J., Jiang, Y., & Zhao, B. (2014). The overview of the information technology industry chain in big data era. Future information
technology (pp. 429–432). Springer.
Zhu, Q., Kong, X., Hong, S., Li, J., He, Z., & Lewandowski, D. (2015). Global ontology research progress: A bibliometric analysis.Aslib Journal of
Information Management, 67(1).
KEY TERMS AND DEFINITIONS
Appalachia: A geographic and cultural region of the eastern United States. In the media, its population is portrayed as suspicious, backward, and isolated.
Big Data: The term used to describe information which is of a form that is difficult to analyse using traditional business software, or where data collected for one purpose is then used for another to improve a business's offerings.
Clicks: The number of times an advert has been clicked on by a user on a given website(s).
Average Position: The extent to which an advert is offset from the top of the advert stream where it is competing for the top spot with other
advertisers, usually based on a bidding price.
Google Ads: A platform that enables businesses to advertise on Google websites and on those websites that Google has agreed can display its adverts.
Impressions: The term that refers to how many times an advert has been displayed on a given website(s).
OECD: An international economic body whose data is used to understand the nations it covers.
QPress: A software application that allows for the collection and exporting of Q-methodology q-sorts via the Internet.
CHAPTER 67
Big Data Analytics in Retail Supply Chain
Saurabh Brajesh
JDA Software, India
ABSTRACT
The retail sector is in a state of flux. On one hand, retailers face hurdles such as new technologies, a tumultuous economy, and new sales and distribution channels; on the other hand, they have a rapidly increasing population of demanding consumers. To overcome these challenges, and to remain relevant and competitive in the market, the retail sector needs a paradigm shift in its approach. If we track and analyze the pattern of purchasing decisions of consumers, we find that it involves various stages of a decision lifecycle. Data generated at many of these stages can be recorded, digitized, and transformed into metrics and strategic information. These metrics and information would prove to be vital elements for retail industries in their strategic decisions. We need to focus on mechanisms to extract valuable insights from the retail supply chain. These insights could be further leveraged to provide competitive advantage to retailers and, at the same time, a better retail experience to customers.
INTRODUCTION
The success of the supply chain planning process depends upon how closely supply is managed, demand is forecasted, inventories are optimized and logistics are planned. The supply chain is the heart of the retail industry vertical and, if managed efficiently, it drives positive business and enables sustainable advantage (Howe, 2014). If we observe carefully, a huge amount of data is generated at each and every stage of the supply chain. In today's digital world we generate around 200 exabytes of data each year (Issa, 2013). Organizations are increasingly questioning their own ability to realize the full potential of the huge amount of data they have within their supply chain (Carraway, 2012). This huge amount of data being generated inside organizational boundaries is a contemporary problem, but at the same time it can provide an opportunity for retail organizations to find the information they have been looking for as an effective handle for decision making and planning. The core strength of a retail organization lies in its ability to align demand and supply. An organization's ability to collect, interpret and leverage data helps in making informed decisions which can lead to better profitability and growth.
In recent times there has been a massive explosion in the volume of data generated in the supply chain, and it has become a challenge for enterprises to extract maximum value out of this ever-growing volume of unstructured and structured data. Organizations are busy exploring options that would create the infrastructure required to capture, process, analyse and leverage data across their supply chain in order to optimize their capacities, inventories and logistics, without missing potential business opportunities. This infrastructure is expected to optimize processes and create analytical engines that would help deliver accurate and appropriate decisions (Provost & Fawcett, 2013).
Customers are difficult to predict. Accurately forecasting customers' needs is risk-laden, but businesses are increasingly shifting towards customer-driven production models. Retail organizations are familiar with using structured data for planning and decision making and are now looking to combine it with data from external sources to better predict future risks (Dandawate, 2014). The objective here is to understand the role of big data and predictive analytics in the retail industry supply chain and their impact on its future growth (see Figure 1).
Figure 1. Big data sources for the retail industry
(IDC Retail Insights, 2012).
The figure above presents a laundry list of Big Data sources for the retail industry. There are four dimensions of data source:
1. Market;
2. Customer;
3. Supply and
4. Social media.
• From a market perspective, data is derived from events, weather, economics, place or promotions.
• From a supply point of view, the company needs to assess data from stocks, shipments, purchase orders, product information and design specifications.
• From a customer relationship management perspective, the organization needs to collect and analyze data from surveys, email, ratings, reviews and loyalty profiles.
• From social media, high-volume unstructured data flows in at a high velocity. Some of the known sources of social data are YouTube, Twitter, Facebook, blogs and wikis.
This chapter also examines:
• The challenges ahead that will force retail companies to change their operations.
• The impact Big Data will have on the design of future retail supply chains.
Finally, based on our findings, we propose a framework for the adoption of Big Data Analytics in the retail supply chain. This framework can help retailers to overcome some of the organizational challenges Girard (2012) talked about. The framework can be used to provide a roadmap for big data implementation in organizations. Bughin, Livingston and Marwaha (2011) talk about the kind of big data architecture organizations can have. Oracle's big data architecture, specifically designed for the retail industry, provides more detailed information on this.
BACKGROUND
Big data and predictive analytics are forecast to be the next strategic lever for retail industries. They could be leveraged for product and service differentiation, provided the retail organizations are armed with the right tools and approaches to utilize these huge, unleveraged data assets. For example, utilizing customer transaction or loyalty information, predictive analytics can provide an organization with the right inputs required to make more informed decisions about pricing, products, promotions and assortment management. It can also be leveraged by retail organizations for better product lifecycle management and marketing mix.
In our endeavor to study the impact of big data and predictive analytics on the supply chain of the retail industry, we have tried to find out how they could be aligned with each other in creating new opportunities for the retail supply chain, from sourcing to in-store availability.
Handling Big Data challenges and transforming them into opportunities is the key to success for the retail industries. These challenges, in terms of the 3Vs (Volume, Velocity & Variety), were first identified by Gartner in early 2001. Later, IBM added three more Vs, Veracity, Visualization and Value, stating that they are closer to organizational needs. Without deliberating on the genesis of these Vs of Big Data, we have picked some of the key challenges, as described below:
• Volume: According to a McKinsey report, the number of RFID tags sold globally is projected to increase from 12 million in 2011 to 209 billion in 2021. Along with this, there has been a phenomenal increase in the usage of sensors, GPS devices and QR codes. The speed at which supply chain data is being generated has multiplied manifold beyond our expectations. Handling such a huge volume of data can be a challenge for most retail organizations.
• Velocity: Today's retail business environment is dynamic and volatile. Even unexpected events must be handled in a timely and efficient manner in order to avoid losing out on business. Enterprises are finding it extremely difficult to cope with this data velocity. Optimal business decisions need to be made quickly. The key is to have successful operational execution with shorter processing times, which is lacking in traditional data management systems.
• Variety: For the retail industry, data emerges in different forms which do not fit classical applications and models. Both structured (transactional) and unstructured (social) data have created nightmares for retail enterprises, which are currently not in a position to handle such diverse and heterogeneous data sets (Byrne, 2011).
• Veracity: Veracity deals with uncertain or imprecise data. In traditional data warehouses there was always the assumption that the data is certain, clean and precise; that is why so much time was spent on ETL (Extract, Transform & Load) / ELT (Extract, Load & Transform), Master Data Management, and so on. However, when we start talking about social media data such as tweets and Facebook posts, how much faith can or should we put in the data? Such data can be counted towards sentiment, but it should not be counted towards total sales and reported on that basis (see the sketch after this list).
• Value: Google's model of value creation out of Big Data has generated curiosity in the retail industries about maximizing value from these technologies. The fact that the corporate world is spending a lot of money on this technology indicates that "Value" will play an important role in the use of Big Data and predictive analytics.
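The veracity point can be made concrete with a small sketch: social signals count toward sentiment but never toward revenue, while transactional records feed the sales figures. The record structure and field names below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Record:
    source: str      # "pos", "tweet", "facebook_post", ...
    amount: float    # sale value for transactional records, 0 otherwise
    sentiment: int   # +1 positive, -1 negative, 0 neutral/unknown

def summarise(records):
    trusted_sources = {"pos"}  # only point-of-sale records count toward sales
    total_sales = sum(r.amount for r in records if r.source in trusted_sources)
    net_sentiment = sum(r.sentiment for r in records)  # social data counts here only
    return {"total_sales": total_sales, "net_sentiment": net_sentiment}

print(summarise([Record("pos", 49.99, 0),
                 Record("tweet", 0.0, 1),
                 Record("facebook_post", 0.0, -1)]))
```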
EARLY RESEARCH
Big Data
Big data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization (Cecere, 2012).
If we look at the evolution of big data, we find that data became big many years before the current buzz around big data (Press, 2013). Initial attempts to quantify the growth rate in the volume of data were made as early as 1941, when the term "information explosion" was used for the first time, according to the Oxford English Dictionary. In 1975 the Ministry of Posts and Telecommunications in Japan conducted the Information Flow Census. The aim of the census was to track the volume of information circulating in Japan. In 1978 it reported that "the demand for information provided by mass media, which are one-way communication, has become stagnant, and the demand for information provided by personal telecommunications media, which are characterized by two-way communications, has drastically increased. Our society is moving toward a new stage in which more priority is placed on segmented, more detailed information to meet individual needs, instead of conventional mass-reproduced conformed information."
Greater acceptance of big data's potential came when companies such as Google and Amazon successfully leveraged big data (Kiron, Shockley, Kruschwitz, Finch & Haydock, 2011). The potential of big data is now so enormous that the World Economic Forum considers it a new class of economic asset. StackIQ (2013) examined why big data is becoming so important for CFOs and for organizational strategies, especially in the retail industry.
Big Data Analytics in Supply Chain
Today's supply chain encompasses everything from sourcing to manufacturing to the final delivery of products or services to customers. Since it encompasses a wide range of activities, which most of the time transcend factory or national boundaries, complex interdependencies are built into it. As the power base continues to shift from companies towards customers, customer demands have become more complex. Companies are looking at Big Data analytics to revamp their supply chains, thereby using Big Data Analytics as a strategic lever.
Companies are collecting vast amounts of supply chain related data with the help of technologies such as sensors, barcodes and GPS (House, 2014). Big Data Analytics offers companies the ability to leverage the enormous amounts of information driving their global supply chains (Harvard Business Review, 2013). Companies are aware that Big Data can be leveraged at various levels of a business. This holds true for supply chain management as well. The combination of large, high-velocity and varied-structure big data with advanced analytics tools and techniques represents the next frontier of supply chain innovation (Kotlik, Greiser & Brocca, 2015).
Companies can leverage Big Data for both supply-side and demand-side activities (Cecere, 2013). Companies can leverage Big Data to optimize supply-side activities, such as new product introduction, production planning, inventory management and product distribution. This helps companies to maximize revenue and customer value. On the demand side, companies are using real-time information to sense changes in demand patterns and respond quickly to them. With the help of Big Data, companies can confidently cut inventory without affecting the customer service level, thereby reducing working capital requirements (Chase, 2013).
Big Data Analytics in Retail
Businesses that barely touched big data in the recent past are now very open to fully embracing these assets. They are all eager to take advantage of what these solutions may bring. This is true for the retail sector as well. Big data analytics has the ability to provide the actionable insights the retail sector has been looking for for a long time. These insights can be utilized as a guide for internal decision-making in a wide variety of capacities. Big data analytics' influence is both growing and expanding into various sectors. We can clearly see this trend in the retail sector, where a few companies are now relying heavily on these resources.
In order to leverage the maximum value from a Big Data initiative, companies in the retail sector are likely to develop big data specific strategies. These strategies can be leveraged for understanding customer needs, since one of the most important drivers of organizational success is that the organization must understand the needs and wants of its customers (LaValle, Lesser, Shockley, Hopkins & Kruschwitz, 2010). According to Robak, Franczyk, & Robak (2013), an organization can gather privileged information about customer wants by making use of its purchase and loyalty card data to provide more tailored offerings than its competitors.
PROBLEM DEFINITION
Most retailers have struggled to maintain profitability in a fiercely competitive environment. Targeted customer segments vary by retailer and can have different sub-segments within a given segment. For example, a person may be more willing to spend on buying a smartphone than on buying a new television set. Such trends drive retailer strategy and marketing. In the last few years, retailers have faced plenty of challenges, most of which were because of fierce competition, changing patterns of customer choice, or the need to provide better service at lesser cost. Retailers are finding it difficult to cope with the requirement of providing better service levels at lower costs while maintaining their bottom line.
In today's competitive environment, retailers are trying to find new touch points to interact with customers. Some retailers have tried and succeeded in achieving this. But most of them are still either unable to identify the right touch points or unable to leverage these touch points to influence the customer to purchase their products. Retailers need to understand that they cannot treat every customer the same way anymore. Customers are now looking for more personalized products and services. The theme of retailing has moved from satisfying customers to being customer-driven, and thus increasing sales.
Large volumes of highly detailed data from the various strands of a business are expected to provide the opportunity to deliver significant benefits to retailers. The advent of predictive analytics in recent years has made it easier to capitalize on the wealth of historic and real-time data generated through supply chains, production processes and customer behaviors.
However, existing capabilities are not enough. We need innovation in the form of new solutions and leading practices for handling Big Data. We need a breakthrough for large-scale adoption of Big Data throughout the retail industry. Our goal has been to explore how and why companies should make investments to develop a mature big data analytics capability. For this we need to look for commonalities among the other industry verticals that have been early adopters of Big Data initiatives and have been able to generate higher returns from their investments in big data analytics. At the same time, we need to explore the path taken by such companies and how it has helped them to differentiate themselves from the rest of the pack. Finally, we explore how retail organizations can build competitive advantage using big data analytics.
METHODOLOGY
The objective of our research has been to identify the challenges as well as the opportunities brought by big data and predictive analytics to a retail organization's supply chain. We have adopted a mixed approach in our research, wherein we have laid the same emphasis on both primary and secondary sets of data. Our methodology has been to address specific questions related to Big Data based on earlier research and, at the same time, to create our own survey to get more detailed information relevant to our topic. For our secondary research we have studied several published papers available online. We have tried to take into account the views expressed by various pioneers in the big data field. We have also tried to incorporate the views expressed by various organizations on Big Data. These organizations are either early adopters of Big Data or are planning to adopt Big Data in the near future.
For our primary research we used the survey method. We created a questionnaire with a set of nine questions. Our target respondents for this survey were different executives, consultants and technical experts, both from within and outside our organization. Questions were created keeping in mind maximum coverage of the different facets of Big Data and predictive analytics. Once the responses from respondents were collected, we did a detailed analysis of the data points in order to find patterns or trends relevant to our topic. We also tried to combine the information gathered through analysis of the survey with the information gathered from our secondary research in order to provide a more holistic picture of Big Data in the retail supply chain.
Table 1 lists the set of steps followed to arrive at our findings and gives the approach adopted for the preparation of the chapter.
Table 1. Analytical steps
Steps | Description
Figure 2 shows the snapshot of the questions that we asked in our survey.
Figure 2. Snapshot of survey
FINDINGS AND DISCUSSION
The Role of Big Data Analytics in Retail Supply Chain
Use Social Data in Retail Supply Chain
Our first attempt was to establish a basic assumption in terms of the retail industry leveraging the information generated by social media platforms such as Facebook, YouTube and LinkedIn. We found that around 60% of companies (Figure 3) currently rely on them to gauge changes in demand patterns, whereas around 55% of companies use them to evaluate customer response. 45% of companies are of the opinion that they could be a platform for a digital marketing strategy. Since there is not much difference in the scores of the three key elements, it appears that social media is going to play a balanced role in all three areas.
Figure 3. Retail supply chain use social data for the following
Let us take a look at the use of social media to evaluate customer response. We find that social data is a very important source for gauging customer response. In today's world, online reviews by experts and friends can have a great influence on the buying behavior of consumers. This affirms the fact that websites with online reviews will continue to gain importance as an influencer.
Retail companies need to focus on synergizing their big data initiatives with their core areas to get the maximum value for their business. This might mean leveraging big data analytics to offer better, more customized offerings to consumers based on better insight into consumers' needs and purchase history. Reviews provided on various social platforms can be used to enhance the product, which in turn can help companies in brand building and customer loyalty. Today, retailers need to communicate effectively with consumers across all touch points. Social platforms are one such place where retail companies can easily connect with their consumers in much more valuable ways. This value to consumers could be in terms of tailored customer service, prompt resolution of issues, or providing required information to the consumer on the social platform with the help of big data analytics.
One of the big hurdles in leveraging social data for retail companies has been the trust that consumers have in retail companies to maintain the confidentiality of personal data. This trust aspect restricts the kind of data available across various platforms which could be leveraged by retail companies. If we ask people what level of trust they have in retail companies to keep their personal data secure and at the same time provide some value out of it, not many will have a great level of confidence.
However, this does not mean that people are totally averse to retail companies using their personal data, either through social sites or through companies' own records. People are quite receptive when it comes to getting personalized offers via email. It is just that the retail industries need to bridge the trust deficit. For this reason, many organizations that are adopting a big data framework also plan to put a robust security and governance model in place for data security.
Organizational Ability to Leverage Big Data
When we asked organizations about their current practices in leveraging big data analysis (Figure 4), we found that they are not able to leverage the enormous amount of data captured. For example, retailers capture data from POS (point-of-sale), customer loyalty programs and demographics, but they do very little with the information captured. Our survey indicates that 58% of organizations (Figure 4) are not able to unlock the power of their data. Only 8% of organizations feel comfortable with their endeavors in Big Data and predictive analytics.
Current Trends of Data Analytics in Retail
The next set of questions aimed to understand the application of predictive analytics tools from a retail supply chain management perspective (Figure 5). Based on the responses we got from the survey, we found that the major thrust areas for predictive analytics have been forecasting (82%), followed by inventory optimization (50%). Predictive analytics tools are helping organizations in their endeavor to meet the twin objectives of having an optimal level of inventory in a multi-echelon network to minimize cost and, at the same time, ensuring the availability of vital stock. Organizations are also trying to discern changes in customer buying patterns with the help of predictive analytics (Infosys Labs Briefings, 2013). Predictive analytics provides the opportunity to look into the future by identifying patterns based on point-of-sale data. With greater adoption of RFID and other tracking devices, predictive analytics tools are frequently being used to gain visibility of goods throughout the retail supply chain and also to identify bottleneck areas in the chain.
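As a minimal sketch of the forecasting use case described above, the snippet below applies simple exponential smoothing to a hypothetical weekly sales history. It is a stand-in under assumed data and a fixed smoothing factor, not the forecasting engine any particular retailer deploys.

```python
def simple_exponential_smoothing(history, alpha=0.3):
    """Return a one-step-ahead forecast from a list of past demand values."""
    forecast = history[0]
    for observed in history[1:]:
        # Blend the latest observation with the running forecast.
        forecast = alpha * observed + (1 - alpha) * forecast
    return forecast

weekly_units_sold = [120, 135, 128, 150, 142, 160]   # hypothetical POS history
print(round(simple_exponential_smoothing(weekly_units_sold), 1))
```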
Important Big Data Source
Our survey shows that "Product/Component Traceability" and "Barcode/RFID" are the major sources of Big Data. While "Product/Component Traceability" helps an organization trace a product or component back to its point of origin, Barcode/RFID provides visibility of the product from a replenishment perspective (Figure 6).
Figure 6. Important Big Data source for predictive analysis
Since these sources of data have been part of supply chain activity for quite some time now, that might be one of the reasons they are the most preferred. "Data from quality audits" may not only be used in the verification process, but could also be used in predictive audits to identify possible exceptions in the future.
Opportunities Predictive Analytics and Big Data Bring to the Retail Industry
We know that data and technology play an important role in the consumer decision-making process, and this trend is likely to accelerate further (see Figure 7).
Figure 7. Opportunities predictive analytics and Big Data will bring to the retail supply chain
As technology adoption by consumers gains momentum, analyzing and leveraging data across various channels becomes increasingly critical. For example, a buyer might start researching a product online and read the reviews with the help of a smartphone. Once the consumer makes up his or her mind to buy the product, he or she might buy it online or pick it up at a store, based on convenience or discounts provided. Retailers require means to coordinate these interactions across various channels, and this is where big data can be of great help. Business now needs to collect, analyse and leverage data from various touch points and at high velocity. It is not just about gathering data: how organizations can get the right information at every step to create a seamless interaction for consumers, and at the same time manage their demand, logistics, inventory and fulfillment operations, is also important.
1. Buying Pattern: With the help of big data analytics, retailers can now analyse millions of transactions and correlate buying patterns based on gender, demographics, location and so on. This might help identify different trends which could in turn provide new opportunities for increasing sales or improving customer satisfaction. Retailers can build more customized offerings with the help of big data analytics based on profiles and trends. One of the most quoted examples of big data analytics is how a retailer was able to predict the pregnancy of customers based on the purchases made by women consumers.
2. Customer Segmentation: Today's retailers generally use very broad customer segmentation from a marketing perspective. With the help of Big Data analytics, companies can simultaneously analyze each customer's online shopping information, the reviews he or she has given, and the actual store purchases they made. This helps retailers to have more focused segmentation, which in turn can provide more sales opportunities by creating more segment-focused offerings, coupons or discounts. For example, some retail shops use such information to mail special offers on baby products to pregnant women.
3. Cross Selling: Big data analytics helps in the collection and analysis of data across multiple channels, including online, social platforms and other sources, which in turn are used by retailers to suggest additional products to consumers that match their requirements and budget limitations (see the sketch after this list).
4. Location-Based Marketing: With the recent explosion of smartphone usage in the market, geographic location data has become a new tool for retailers to leverage. With the help of location data, retailers can target consumer needs based on their location within or outside the stores. For example, while a customer is in the vicinity of a store, the retailer can provide customized discount offers based on the customer's purchase history.
5. Optimization: Real-time data can be used by retailers to optimize their supply chain for each category, channel and location. Information from radio frequency ID (RFID) tags and sensors in delivery trucks can help retailers know accurately the location of products in transit. This helps in optimizing logistics operations, both warehouse management and transportation management.
6. Fraud Detection: Big data analytics can be used to analyse high volumes of transaction data. With its help, retailers can identify patterns that point to fraud, such as the use of stolen credit cards or items actually missing from the store.
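As an illustrative sketch of the cross-selling idea in item 3 above, the snippet below counts how often pairs of products appear in the same basket and suggests the most frequent co-purchases. The basket contents are hypothetical, and co-occurrence counting is only the simplest possible stand-in for a full market-basket analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets (sets of products bought together).
baskets = [
    {"nappies", "baby wipes", "beer"},
    {"nappies", "baby wipes"},
    {"beer", "crisps"},
    {"nappies", "baby lotion", "baby wipes"},
]

# Count how often each pair of products is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def suggest(product, top_n=2):
    """Products most often bought together with `product`."""
    related = Counter()
    for (a, b), count in pair_counts.items():
        if product == a:
            related[b] += count
        elif product == b:
            related[a] += count
    return related.most_common(top_n)

print(suggest("nappies"))   # e.g. [('baby wipes', 3), ('baby lotion', 1)]
```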
The purpose of this question was to find out how Big Data and Predictive Analytics can be aligned with each other in creating new opportunities for retail supply chain management. We also tried to track the areas where companies are trying to leverage big data the most. We see that Big Data and Predictive Analytics can play a major role in improving forecasting (Figure 6) and, at the same time, provide much better business intelligence by factoring in both unstructured and structured data (Sethuraman, 2012). Other opportunities lie in providing business agility or spotting demand shifts by leveraging data from multiple sources. Predictive Analytics is already playing a major role in supply and operations planning and optimization (Manyika et al., 2011).
The Challenges Ahead
Analytics: The Real-World Use of Big Data
Based on a survey done by IBM and Saïd Business School (Figure 8), we observe that most retail organizations are currently in the early stages of big data planning and implementation. The survey also points out that the retail industry's development effort lags behind other industries. We also see from this survey that a greater percentage of retail organizations are now focused on understanding the nuances of big data, but when it comes to implementing big data initiatives, other industries are ahead of retail.
Figure 8. Big Data activities
We found that the main hurdle in utilizing the vast amount of data generated throughout the supply chain is how to leverage data from various sources (some of which is unstructured data obtained from platforms like Facebook or YouTube). The next hurdle is getting the right tools to capture and analyse the data. Since Big Data is still at a nascent stage, arriving at the right adoption strategy can be a daunting task (Figure 9). Leveraging unstructured data in a structured format is the key challenge ahead from the retail industry perspective. In addition, the cost of data and data security are important factors given the large volumes involved (Economist Intelligence Unit, 2012). Maintaining such huge amounts of data for business intelligence purposes can be a big challenge, not to forget the security concerns that come with it.
While assessing requirements for big data analytics adoption, companies should look at two aspects. First, they should be aware of how ready they are to adopt a big data initiative. Second, they should understand what value they would get from it. Significant, measurable business value from big data analytics can only be realized if organizations are ready to leverage it. For this to happen, they need an infrastructure that can support the growing volume, velocity, variety and veracity of data.
For a big data initiative to succeed, retail companies need an information management infrastructure that is scalable and has high capacity. This would help them support the rapid growth of current and future data coming into the organization. They should also have a strong security and governance model in place.
Evaluation Criteria for Predictive Analytics and Big Data
The dilemma the retail industry faces today with respect to Big Data and Predictive Analytics is how to evaluate them (see Figure 10). Since both are relatively new technologies, there are no well-established evaluation criteria. Organizations are looking at how well these tools integrate with their current systems. At the same time, it is important that the new system leverages both existing data and new data sets in an integrated manner, so that organizations do not have to let go of one data set in order to utilize another (Troeste, 2012). The top three challenges that came up in our survey (Figure 8) are financial benefits (55%), data integration (52%) and interoperability (50%). Proponents pitching the adoption of Big Data and Predictive Analytics tools therefore need to provide justification from a cost and performance perspective.
Figure 10. Evaluation criteria companies should use for
predictive analytics and Big Data adoption
Impact of Big Data Analytics on the Design of Future Retail Supply Chains
Importance of Supply Chain Analytic Tools
There is a growing realization among organizations that Predictive Analytics will play a major role in shaping future strategies for Retail Supply Chain Management (Haddad, 2014). Based on the survey, we find that nearly 79% of respondents feel that predictive analytics tools will play an important role in meeting company goals (Figure 11). An important observation is that none of the respondents gave a firm "No" about these new tools.
Figure 11. Will supply chain analytic tools be important to
meeting company goals?
Retail companies need a more pragmatic approach to Big Data analytics. Adoption should be driven only by the company's business needs. Before embarking on any big data related initiative, companies need to prepare a business case justifying it. These needs should then be aligned with a proper implementation strategy in which big data infrastructure and resources are leveraged for the required objective. This focused approach helps organizations extract new insights from existing sources of information. A big data implementation strategy should be a step-by-step process in which, based on earlier gains and lessons learned, the sources of data and the infrastructure are incrementally extended.
For a big data implementation, retail companies need to follow an approach similar to the one they follow for any new strategy. They need to focus on data gathering and market tracking, and start developing a strategy with well-defined milestones based on business needs and available resources. A pilot implementation can validate that the initiative provides the value they were looking for. Once the lessons of the pilot phase have been incorporated, a larger-scale deployment of the big data initiative can take place. With the big data initiative in place, it can be tied to data analytics to extract even more value from it.
Critical Success Factors for Big Data Implementation
In the last question, we asked respondents what they think are the key success factors for a successful Big Data and Predictive Analytics implementation. The results show (Figure 12) that seamless integration of domain expertise with Predictive Analytics tools was the key factor, with about 70% of respondents citing it. The second success factor was leveraging every set of data generated, including structured as well as unstructured data. Interestingly, the organization's understanding of the importance of big data was another success factor. We find that companies know Big Data is important for them but are still figuring out how to leverage it. Big Data and Predictive Analytics cannot be a panacea for all their issues; these tools have limitations of their own, and those limitations should be taken into consideration when adopting them.
Characteristics of the 2016 Future Retail Supply Chain
From our secondary research, we found the following about the next-generation Retail Supply Chain:
• The future model will be based on multi-partner information sharing among stakeholders such as consumers (end users), suppliers, manufacturers, logistics service providers and retailers.
• Collaborative transport from collaborative warehouses will deliver to city hubs and to regional consolidation centers.
• Warehouse locations on the edge of cities will be reshaped to function as hubs where cross-docking will take place for final distribution.
Data plays a vital role in the retail industry: most retail planning is based on information gathered from data. With big data, we can go a step further and drive retail innovation and value creation for years to come. For this to happen, consumers must be willing to share more and more of their data with retail companies. Retail companies need to earn consumers' trust regarding their personal data. They should be able to convince consumers that the information captured will be used only in permissible ways, that its usage will be non-intrusive, and that it is secure within the organization. Finally, consumers willing to share their data with companies need to see tangible benefits for themselves.
The suggested implementation process is a set of small steps taken to achieve a bigger organizational goal.
• Start with Existing Data to Achieve Near-Term Results: As with the implementation of any new concept or process, the aim should be to target the low-hanging fruit in order to gain wider acceptance and build momentum. This helps companies sustain the big data initiative. Retail companies should first look at their present data and resources and then create a big data initiative that delivers better near-term value. This also provides a good learning curve for the people associated with the process; lessons learned can then be leveraged when the big data initiative is expanded into more challenging areas. Retail companies need to invest time and resources to get more out of the big data initiative, and they will also need to incorporate larger volumes and a greater variety of data. However, by being selective in the initial phase, one can extract insight from data in a shorter time span and with less investment. As the comfort level with big data increases, the company should be able to process a variety of data at an accelerated pace; at that point the big data initiative can be expanded for more detailed insights and patterns.
• Sustained Effort: Like any other change management effort, a Big Data initiative requires sustained effort from both business and IT. The initiative's focus is to support the identified business case; therefore, the infrastructure and processes should be driven by alignment with that business case on a sustained basis.
Retail companies need to create a framework based on their strategy and requirements for big data within the organization. The framework should clearly define the scope of the initiative, its milestones and its resource allocation in a pragmatic way. It should also provide actionable information on how to tackle key business challenges. A big data analytics framework should define the business process requirements that need to be followed; this helps in understanding data, analytics tools and hardware needs. The framework should make the task of big data implementation more realistic, and it should provide insight into how the initiative will be sustained and built upon. All stakeholders should be part of the development and creation of the framework from the early stages.
Based on the findings of the Accenture Global Operations Megatrends Study (2014), we propose a framework that will suit the workings of the Future Retail Supply Chain, as shown in Figure 11. We propose a three-phase approach. These three phases are expected to transform a company from using descriptive analysis to the much more mature level of using predictive analysis. The phases will be supported by four processes, so that a retail company implementing big data in supply chain management climbs up the complexity levels shown in Table 2.
Table 2. Framework process
The phases, with their associated processes and complexity levels, are described below:
Phase I – Pick Your Spot: This is the first step in big data implementation. Retail companies need to select the KPIs they want to target with the help of big data. A very broad scope can affect the timeline, cost and complexity, while a very narrow scope limits the benefit that can be leveraged from the implementation.
o Processes – Analysis: The company needs to understand that Big Data does not arise out of a vacuum. The data exists because it has been captured at some point in time, in some form or other. Much of this data may be of no value and can be discarded or compressed; herein lies the challenge of segregating bad data from good data. The data needs to be analyzed properly and intelligently to make sure that none of the data required for big data analytics to be effective is missing.
o Complexity – Descriptive Analytics: The company works mostly in the realm of descriptive analytics, where the effort is to understand the "What" of the process.
Phase II – Prove the Value: In this phase the effort should be on targeting the low-hanging fruit with the help of big data analytics. Retail companies need to identify areas where they think the implementation can provide a much quicker return. This helps the organization in two ways. First, on the investment side, they can justify the costs incurred with the results achieved in a relatively short span of time. Second, from a change management perspective, it helps retail companies gain broader acceptance within the organization. This can be very important for the future expansion of the big data project: as with the implementation of any other new initiative, the people working with the new system must have confidence in it to get the maximum benefit.
o Processes – Assert: This process provides results that retail companies can leverage for their previously identified issues. Companies begin their big data implementation to address their core issues. Once the big data framework is up and running (even on a small scale), companies should be in a position to get the information they were looking for from the data. It could be in the form of better segmentation information, a better product mix and real-time customer engagement across the various sales channels.
o Complexity – Diagnostic Analytics: The main emphasis is on finding information that helps retail companies understand the "Why" of the business. Retail companies are aware of what is happening around them, but they need information that helps them understand why it is happening. Big data analytics is needed to obtain such information, as it can analyse data from various sources and in various forms. For example, big data can help review comments posted on social networking sites or views expressed on YouTube by customers.
o Processes – Anticipate: The next step for retail companies is to leverage analytics not for reactive purposes but for predictive purposes. Big data can give companies an understanding of the changing dimensions of their business. Big data analytics can help analyse the systemic changes under way in how customers gather information about products, buy them, review them, and what factors made them buy company A's product instead of company B's. These changes may take the form of shifting customer buying patterns, the way a new product has been successfully introduced, or even new ways of managing supply chain logistics. This is where the true value of big data analytics lies: it helps the company move away from a reactive way of doing business to a more predictive one.
o Complexity – Predictive Analytics: Here Big Data needs to act as a prophet and predict what is going to happen. As we all know from day-to-day experience, this is a difficult task, but with big data analytics companies can skim through all possible forms of information to obtain guidance about the direction the business is moving in.
Phase III – Continuous Value Delivery: It is quite common for inertia to set in once a new initiative has achieved its primary milestone. Retail companies will need to keep constant vigil to avoid this trap. The more big data analytics becomes part of day-to-day business, the more value companies can get out of it. Also, as mentioned in Phase I (Pick Your Spot), the milestones of the initial phase are generally the low-hanging fruit; if companies do not build upon the success of their previous phases, they will not be leveraging the true value of big data analytics.
o Processes – Act: Retail companies need to act on the information provided by the analytics. It may take some time for people to have faith in the system, but by the time they reach Phase III, the company as a whole should trust the system enough to act on the information it provides.
o Complexity – Prescriptive Analytics: Without proper information, companies are lurching in the dark, and it is difficult for retail companies to act with limited knowledge. Big data analytics can take them to a position where they have most of the information they have long been looking for; the difficulty then is how best to leverage it. This is where proper alignment with business needs helps, as the company will be looking at the information from a business optimization point of view. The advantage at this stage is that the direction has already been set, so once the required information is available one can act fast and get the most out of it.
We further describe how the framework will support the top five areas of the future supply chain in Table 3.
Table 3. Retail supply chain areas and challenges
CONCLUSION
International Data Corporation (IDC) has predicted that the big data market will grow from $3.2 billion in 2010 to $16.9 billion by 2015, a compound annual growth rate of about 40%. Supply chains of retail companies are becoming more complex, with a greater number of suppliers, distributors and logistics providers. Because of this, supply chain managers need to manage and prevent risks that can crop up at any place and at any time in dozens of countries. Companies are used to leveraging complex data sets to plan manufacturing to meet customer demand, but firms are now looking to combine data from various sources to better predict future risks.
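As a quick check on the quoted growth figure, the implied compound annual growth rate can be computed directly; the short snippet below is illustrative arithmetic only and is not part of the IDC study.

```python
# Implied CAGR for a market growing from $3.2B (2010) to $16.9B (2015).
start, end, years = 3.2, 16.9, 5
cagr = (end / start) ** (1 / years) - 1
print(f"CAGR = {cagr:.1%}")  # roughly 39.5% per year, i.e. about 40%
```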
We are moving into an era in which both the scope and the scale of supply chain operations for retail companies will expand. This will be the era of Big Data. Various forms of unstructured data, from social media to real-time geolocation data, will provide great opportunities for better Supply Chain Management. The caveat is that organizations must be willing to embrace Big Data analytics to overcome the hurdles along the way. We think that Big Data analytics is here to stay, and by incorporating big data in their supply chains organizations can truly revolutionize themselves to meet the challenges of the future (LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011). It is not going to be an easy task. Retail companies will need a more structured approach to leverage unstructured data. They need to be aware of current trends in big data and predictive analytics and understand the opportunities they provide, and they need to look for the right tools driven by people with the right domain skill sets. For retail companies that want to march ahead of their competitors, it is time to embrace big data embedded with Predictive Analytics.
This work was previously published in the Handbook of Research on Strategic Supply Chain Management in the Retail Industry edited by
Narasimha Kamath and Swapnil Saurav, pages 269-289, copyright 2016 by Business Science Reference (an imprint of IGI Global).
REFERENCES
Big Data Analytics in Supply Chain: Hype or Here to Stay. (2014). Accenture Global Operations Megatrends Study. Retrieved from https://acnprod.accenture.com/_acnmedia/Accenture/Conversion-Assets/DotCom/Documents/Global/PDF/Dualpub_2/Accenture-Global-Operations-Megatrends-Study-Big-Data-Analytics.pdf
Bughin, J., Livingston, J., & Marwaha, S. (2011). Seizing the potential of 'big data'. The McKinsey Quarterly.
Economist Intelligence Unit. (2012). Big data: Lessons from the leaders. The Economist.
Issa, N. (2013). Supply chain: Improving performance in pricing, planning, and sourcing. Opera Solutions, LLC.
Kiron, D., Shockley, R., Kruschwitz, N., Finch, G., & Haydock, M. (2011). Analytics: The widening divide. MIT Sloan Management Review, 53(3), 1–22.
Kotlik, L., Greiser, C., & Brocca, M. (2015). Making big data work: Supply chain management. Boston Consulting Group.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2010). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 21.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition and productivity. McKinsey Global Institute.
Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1), 51–59. doi:10.1089/big.2013.1508
Robak, S., Franczyk, B., & Robak, M. (2013). Applying big data and linked data concepts in supply chains management. Proceedings of the 2013 Federated Conference on Computer Science and Information Systems (FedCSIS) (pp. 1215-1221). IEEE.
Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive analytics, and big data: A revolution that will transform supply chain design and management. Journal of Business Logistics, 34(2), 77–84. doi:10.1111/jbl.12010
KEY TERMS AND DEFINITIONS
Big Data: Very large data sets which can be analyzed computationally to find patterns, trends, and associations, based on human behavior and
interactions.
Clustering: The process of organizing objects into groups whose members are similar in some way.
Cost Containment: The process of controlling the expenses required to operate an organization or perform a project within pre-planned
budgetary constraints.
Globalization: The process of interaction and integration among the people, companies, and governments of different nations, a process driven
by international trade and investment and aided by information technology.
POS: Point of sale; can refer to the point-of-sale location or cash register itself, or to point-of-sale materials such as brochures and signs.
Predictive: Denoting or relating to a system for using data already stored to generate information for the future.
Retail: The functions and activities involved in the selling of commodities directly to consumers.
Risk Management: A process of forecasting and evaluation of financial risks together with the identification of procedures to avoid or
minimize their impact.
Segmentation: Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways
relevant to marketing, such as age, gender, interests, spending habits, and so on.
Service Level: Measures the performance of a system. Certain goals are defined and the service level gives the percentage to which those goals
should be achieved.
Structured Data: Data that resides in a fixed field within a record or file is called structured data. This includes data contained in relational
databases and spreadsheets.
Supply Chain Planning: Oversight of materials, information, and finances as they move in a process from supplier to manufacturer to
wholesaler to retailer to consumer.
Unstructured Data: Refers to information that doesn't reside in a traditional row-column database. Examples include e-mail messages, word
processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents.
CHAPTER 68
The Impact of Big Data on Security
Mohammad Alaa Hussain AlHamami
Applied Science University, Bahrain
ABSTRACT
Organizations are adapting their systems to remain competitive by using the techniques emerging due to Big Data. Big Data includes structured, semi-structured and unstructured data. Structured data are data formatted for use in a database management system. Semi-structured and unstructured data include all types of unformatted data, including multimedia and social media content. Among practitioners and applied researchers, the reaction to data available through blogs, Twitter, Facebook, or other social media can be described as a "data rush" promising new insights about consumers' choices and behavior and many other issues. In the past, Big Data has been used mainly by very large organizations, governments and large enterprises that have the ability to create their own infrastructure for hosting and mining large amounts of data. This chapter will show the requirements for Big Data environments to be protected using the same rigorous security strategies applied to traditional database systems.
INTRODUCTION
The term Big Data has a relative meaning and tends to denote bigger and bigger data sets over time. In computer science, it refers to data sets
that are too big to be handled by regular storage and processing infrastructures. It is evident that large datasets have to be handled differently
than small ones; they require different means of discovering patterns, or sometimes allow analyses that would be impossible on a small scale.
In the social sciences and humanities as well as applied fields in business, the size of data sets thus tends to challenge researchers as well as
software or hardware. This may be especially an issue for disciplines or applied fields that are more or less unfamiliar with quantitative analysis.
(Manovich, 2012) sees knowledge of computer science and quantitative data analysis as a determinant for what a group of researchers will be
able to study. He fears a “data analysis divide” between those equipped with the necessary analytical training and tools to actually make use of the
new data sets and those who will inevitably only be able to scratch the surface of this data.
New analytical tools tend to shape and direct scholars’ ways of thinking and approaching their data. The focus on data analysis in the study of Big
Data has even led some to the assumption that advanced analytical techniques make theories obsolete in the research process. Research interest
could thus be almost completely steered by the data itself. But being driven by what is possible with the data may cause a researcher to disregard
vital aspects of a given research object. (Boyd and Crawford, 2012) underline the importance of the (social) context of data that has to be taken
into account in its analysis. The scholars illustrate how analyses of large numbers of messages ("tweets") from the micro-blogging service Twitter are currently used to describe aggregated moods or trending topics, without researchers really discussing what and particularly who these tweets represent: only parts of a given population even use Twitter, often in very different ways. As Boyd and Crawford point out, these
contexts are typically unknown to researchers who work with samples of messages captured through Twitter. In addition, Big Data analyses tend
only to show what users do, but not why they do it. In his discussion of tools for Big Data analysis, (Manovich, 2012) questions the significance of
the subsequent results in terms of their relevance for the individual or society. The issue of meaning of the observed data and/or analyses is thus
of vital importance to the debate around Big Data.
Many organizations rely on multiple tools to produce the necessary security data. This leads to huge and complex data feeds that must be analyzed, normalized, and prioritized. The volume of security data that needs analysis has simply become too big and complex to handle, and the analysis takes a very long time. According to the Verizon 2013 Data Breach Investigations Report, 69% of breaches were discovered by a third party and not through internal resources; this is one example of why the role of Big Data is important in security.
“BIG DATA” TERM
“Big Data” is a term that has quickly achieved widespread use among technologists, researchers, the media and politicians. Perhaps due to the
speed of dissemination the use of the term has been rather nebulous in nature. The concept of Big Data can be framed by one of three
perspectives. The first is a response to the technology problems associated with storing, securing and analyzing the ever-increasing volumes of
data being gathered by organizations. This includes a range of technical innovations, such as new types of database and ‘cloud’ storage that
enable forms of analysis that would not previously have been cost effective. The second perspective focuses on the commercial value that can be
added to organizations through generating more effective insights from this data. This has emerged through a combination of better technology
and greater willingness by consumers to share personal information through web services. The third perspective considers the wider societal
impacts of Big Data, particularly the implications for individual privacy, and the effect on regulation and guidelines for ethical commercial use of
this data. We now consider each of these perspectives on Big Data in more detail.
BIG DATA AND TECHNOLOGY INNOVATION
In its original form, Big Data referred to technical issues relating to the large volumes of data being created. While the rate at which data has been
generated by information technology has always been increasing, recent growth produces some startling statistics (Qin and Li, 2013). Take the
following two examples:
1. Ninety percent of all the data in the world has been produced in the past two years.
2. The sum of all information ever produced by humans up to 1999, estimated at 16 exabytes (16 trillion megabytes), will be generated every nine weeks by the world's largest telescope, the Square Kilometer Array, once it opens later this decade.
At the same time as data volumes have increased, the cost of storing this information has reduced drastically. For example, in 2011, $600 would
buy a disk drive with the capacity to store the entire world’s recorded music. In providing these statistics we seek to highlight that the Big Data
problem is not the volume of data itself, but the issues arising with analyzing and storing this data in a way that can easily be accessed. The costs
of storing large volumes of data mean that, until recently, it has been common practice to discard information not strictly required for legal,
regulatory or immediate business use. For example, hospitals and health care providers discard more than 90% of the data they generate,
including nearly all real-time video data generated from operations.
In addition, two other factors, velocity and variety, are significant in Big Data. Velocity refers to the challenge of accessing stored data quickly enough for it to be useful. For most real-world uses, data needs to be accessible in something close to real time. Offering
fast access to massive amounts of data at a reasonable cost is a key limitation of existing technologies, both in terms of commonly used relational
database software and the use of cheaper ‘offline’ tape storage devices. Variety refers to the type of information being stored. Previously, data
stored tended to be highly structured in nature. By contrast, the types of data that tend to dominate modern data stores are unstructured, such as
streams of data gathered from social media sites, audio, video, organizational memoranda, internal documents, email, organizational Web Pages,
and comments from customers.
From a technology perspective, the solution to the Big Data problem has occurred through the intersection of several innovations. These include
flash-based disk drives that allow much faster access to high volumes of information, and a new generation of non-relational database
technologies that make it practical to store and access massive amounts of unstructured data. Fittingly, much of this new database technology has
emerged from inside companies that run social media networks, including Google, Facebook, LinkedIn and Twitter.
BIG DATA AND COMMERCIAL VALUE
While technology has served as the enabler of Big Data services, the broader interest in Big Data has been driven by thoughts of the potential
commercial value that it may bring. This is derived from the ability to generate value from data in ways that were not previously possible. For
example, financial services are using high-performance computing to identify complex patterns of fraud within unstructured data that were not
previously apparent. This has enabled the cost-effective provision of financial services in areas that would previously have been regarded as too
risky to be sustainable. Another example is the use of personal location data, gained from a combination of smart phones, cell-tower tracking and
GPS navigation data within vehicles. Such information is already being used to calculate fuel-efficient smart routing, report diagnostic
information back to manufacturers or tracking applications to locate family members (Manyika et al. 2011).
In practical terms this gives organizations the ability to generate insights in minutes that might once have taken days or weeks – for example, by
using diagnostic information to predict quality issues or develop new understandings of consumer behavior. The need to derive commercial
value, and insight, from data is not new. Indeed, providing information to help support management insights can be considered a foundation of
the market research sector. However, the key difference with Big Data strategies is not simply the provision of high-quality and more timely data
into the decision-making process, but the enablement of continuous autonomous decision making via the use of automation. For example, the
use of remote monitors for health conditions such as heart disease or diabetes, or ‘chip-on-pill’ technologies, could enable the automation of
health decisions (Manyika et al. 2011).
THE ‘DATA RUSH’
Since the 1990s and early 2000s, social scientists from various fields have relied more and more on digitized methods in empirical research.
Surveys can be administered via Web sites instead of paper or via telephone; digital recordings of interviews or experimental settings make the
analysis of content or observed behavior more convenient; and coding of material is supported by more and more sophisticated software. But in
addition to using research tools that gather or handle data in digital form, scholars have also started using digital material that was not
specifically created for research purposes. The introduction of digital technology in, for example, telephone systems and cash registers, as well as
the diffusion of the Internet to large parts of a given population, have created huge quantities of digital data of unknown size and structure. While
phone and retail companies usually do not share their clients’ data with the academic community, scholars have concentrated on the massive
amounts of publicly available data about Internet users, often giving insight into previously inaccessible subject matters. Subsequently,
methodological literature began discussing research practices, opportunities, and drawbacks of online research.
Methodological Issues in the Study of Digital Media Data
(Christians and Chen, 2004) discuss technological advantages of Internet research, but urge their readers to also consider its inherent disadvantages. Having huge amounts of data available that is "naturally" created by Internet users also has a significant limitation: the material is not indexed in any meaningful way, so no comprehensive overview is possible. Thus, there may be great material for many different research interests, but the question of how to access and select it cannot be easily answered. Sampling is therefore probably the issue most often and consistently raised in the literature on Internet methodology. Sampling online content poses technical and practical challenges.
Data may be created through digital media use, but it is currently impossible to collect a sample in a way that adheres to the conventions of sample quality established in the social sciences. This is partly due to the vastness of the Internet, but the issue is further complicated by the fact that online content often changes over time. On Web sites, content is not as stable or clearly delineated as in most traditional media, which can make sampling and defining units of analysis challenging. It seems most common to combine purposive and random sampling techniques.
The problems related to sampling illustrate that tried and tested methods and standards of social science research will not always be applicable to digital media research. But scholars take opposing sides on whether to stick with traditional methods or adopt new ones: some suggest applying well-established methods to ensure the quality of Internet research, while others question whether conventional methods are applicable to large data sets of digital media at all. Communication scholars trained in conventional content analysis will find they need to adapt their methodological toolbox to digital media, at least to some degree. The incorporation of methods from other disciplines may be needed to adequately study the structure of Web sites, blogs, or social network sites; scholars may find more appropriate or complementary methods in, for instance, linguistics or discourse analysis. But methodological adaptation and innovation have their drawbacks, and scholars of new phenomena or data structures find themselves in an area of conflict between old and new methods and issues. While some scholars argue for a certain level of restraint toward experimenting with new methods and tools, researchers caught in the "data rush" seem to have thrown caution to the wind, allowing themselves to be seduced by the appeal of Big Data.
BIG DATA MEANING
Researchers have already written about the belief or hope of scholars involved in Internet research that data collected through Web sites, online games, e-mail, chat, or other modes of Internet usage "represent … well, something, some semblance of reality, perhaps, or some 'slice of life' on-line". Others illustrate that even the comparatively simple phenomenon of a hyperlink between two Web sites is not easily interpreted. What does the existence of just the connection, as such, between two sites tell a researcher about why it was implemented and what it means for the creators of the two sites or their users? Studying online behavior through large data sets strongly emphasizes the technological aspect of the behavior and relies on categories or features of the platforms that generated the data.
Yet the behaviors or relationships thus expressed online may only seem similar to their offline counterparts. (Boyd and Crawford, 2012) illustrate that, for instance, network relationships between cell phone users may give an account of who calls whom, how often, and for how long. Yet whether frequent conversation partners also find their relationship personally important, or what relevance they attribute to it, cannot be derived from their connection data without further context (Christians and Chen, 2004). In addition, many authors advise researchers to be somewhat wary of data collected from social media, because they may be the result of active design efforts by users who purposefully shape their online identities.
It is unclear how close to or removed from their online personas users actually are. This means that scholars should be careful to take data for
what it truly represents (e.g., traces of behavior), but not infer too much about possible attitudes, emotions, or motivations of those whose
behavior created the data—although some seem happy to make such inferences. Some researchers urge scholars to scrutinize whether a study can
rely on data collected online alone or whether it should be complemented by offline contextual data.
OPPORTUNITIES OF BIG DATA RESEARCH
Data collected through use of online media is obviously attractive to many different research branches, both academic and commercial. We will
briefly summarize key advantages in this section and subsequently discuss critical aspects of Big Data in more depth. Focusing on the social
sciences, advantages and opportunities include the fact that digital media data are often a by-product of the everyday behavior of users, ensuring
a certain degree of ecological validity. Such behavior can be studied through the traces it automatically leaves, providing a means to study human behavior without having to observe or record human subjects first. This can also allow examination of aspects of human interaction that could be distorted by more obtrusive methods or more artificial settings, due to observer effects or the subjects' awareness of participating in a study. Such
observational data shares similarities with material used in content analysis since it can be stored or already exists in document form. Thus,
content analysis methodology well-established in communication or other research fields can be applied to new research questions. When
content posted on a platform is analyzed in combination with contextual data, such as time of a series of postings, geographic origin of posters, or
relationships between different users of the same platform or profile, digital media data can be used to explore and discover patterns in human
behavior, e.g., through visualization.
For some equally explorative research questions, the sheer amount of information accessible online seems to fascinate researchers because it
provides (or at least seems to provide) ample opportunities for new research questions. Lastly, the collection of Big Data can also serve as a first
step in a study, which can be followed by analyses of sub-samples on a much smaller scale. Groups hard to reach in the real world, or rare and scattered phenomena, can be filtered out of huge data sets: the proverbial needle in the digital haystack. This can be much more efficient than
drawing, for instance, a huge sample of people via a traditional method, such as random dialing or random walking, when attempting to identify
those who engage in comparatively rare activities (Nunan and Marialaura, 2013).
CHALLENGES OF BIG DATA RESEARCH
Although Big Data seems to be promising a golden future, especially to commercial researchers, the term is viewed much more critically in the
academic literature. (Boyd and Crawford, 2012) discuss issues related to the use of Big Data in digital media research, some of which have been
summarized above. In addition to more general political aspects of ownership of platforms and “new digital divides” in terms of data access or
questions about the meaning of Big Data, its analysis also poses concrete challenges for researchers in the social sciences. One recurring theme in
many studies that make use of Big Data is what we call its availability bias: rather than theoretically defining units of analysis and measurement strategies, researchers tend to use whatever data is available and then try to provide an ex-post justification or even theorization for its use. This research strategy is in stark contrast to traditional theory-driven research and raises concerns about the validity and generalizability of the results.
SAMPLING AND DATA COLLECTION
The problem of sampling in Internet research has already been addressed previously and is mentioned in almost every publication on online
research. While there are some promising approaches for applying techniques such as capture-recapture or adaptive cluster sampling to online
research, the problem of proper random sampling, on which all statistical inference is based, remains largely unsolved. Most Big Data research is
based on nonrandom sampling, such as using snowball techniques or simply by using any data that is technically and legally accessible. Another
problem with many Big Data projects is that even with a large sample or complete data from a specific site, there is often little or no variance at the level of platforms or sites. If researchers are interested in social network sites, multiplayer games, or online news in general, it is problematic to
include only data from Facebook and Twitter, World of Warcraft and Everquest II, or a handful of newspaper and broadcast news sites.
From a platform perspective, the sample size of these studies is tiny, even with millions of observations per site. This has consequences not only
for the inferences that can be drawn from analyses, but also from a validity perspective: expanding and testing the generalizability of the findings (Raghvendra, Langone et al., 2013) would not require more data from the same source, but information from many different sources. In this
respect, the hardest challenge of digital media research might not be to obtain Big Data from a few, although certainly important, Websites or
user groups, but from many different platforms and persons. Given the effort required to sample, collect, and analyze data from even a single
source, and the fact that this can rarely be automated or outsourced, this 'horizontal' expansion of online research remains a difficult task.
A third important aspect of Big Data collection is the development of ethical standards and procedures for using public or semi-public data.
Researchers provide an excellent account of the problems facing them when making seemingly public data available to the research community.
The possibility of effective de-anonymization of large data sets has made it difficult for researchers to obtain and subsequently publish data from
social media networks such as YouTube, Facebook, or Twitter. Moreover, the risk of inadvertently revealing sensitive user information has also decreased the willingness of companies to provide third parties with anonymized data sets, even if these companies are generally interested in cooperation with the research community. Researchers who collect their data from publicly available sources are at risk as well because the
content providers or individual users may object to the publication of this data for further research, especially after the data has successfully been
de-anonymized. The post-hoc withdrawal of research data, in turn, makes replications of the findings impossible and therefore violates a core
principle of empirical research.
Finally, basically all Big Data research is based on the assumption that users implicitly consent to the collection and analysis of their data by
posting them online. In light of current research on privacy in online communication, it is questionable whether users can effectively distinguish
private from public messages and behavior. But even if they can, since it is technically possible to recover private information even from limited
public profiles, Big Data research has to solve the problem of guaranteeing privacy and ethical standards while also being replicable and open to
scholarly debate.
BIG DATA MEASUREMENT
Concerns about the reliability and validity of measurement have been raised in various critical papers on Big Data research, most recently by
(Boyd and Crawford, 2012). Among the most frequently discussed issues are the reliability and validity of such measures.
Clearly, these concerns and their causes are related to an implicit or explicit tendency toward data-driven rather than theory-driven
operationalization strategies. In addition to the possible “availability bias” mentioned above, many prominent Big Data studies seem to either
accept the information accessible via digital media as face-valid, e.g., by treating Facebook friendship relations as similar to actual friendships, or
reduce established concepts in communication such as topic or discourse to simple counts of hashtags or retweets.
While we do not argue that deriving measurement concepts from data rather than theory is problematic, per se, researchers should be aware that
the most easily available measure may not be the most valid one and they should discuss to what degree its validity converges with that of
established instruments. For example, both communication research and linguistics have a long tradition of content-analytic techniques that are,
at least in principle, easily applicable to digital media content. Of course, it is not possible to manually annotate millions of comments, tweets, or
blog posts. However, any scholar who analyzes digital media can and should provide evidence for the validity of measures used, especially if they
rely on previously unavailable or untested methods.
The use of shallow, "available" measures often coincides with an implicit preference for automatic coding instruments over human judgment.
There are several explanations for this phenomenon: First, many Big Data analyses are conducted by scholars who have a computer science or
engineering background and may simply be unfamiliar with standard social science methods such as content analysis. Moreover, these
researchers often have easier access to advanced computing machinery than trained research assistants who are traditionally employed as coders
or raters. Second, Big Data proponents often point out that automatic approaches are highly reliable, at least in the technical sense of not
making random mistakes, and better suited for larger sample sizes. However, this argument is valid only if there is an inherent advantage to
coding thousands of messages rather than a smaller sample, and if this advantage outweighs the decrease of validity in automatic coding that has
been established in many domains of content analysis research.
Some researchers find that supervised text classification is on average 20 percent less reliable than manual topic coding. Despite the vast amount of scholarship on these methods, this trade-off is rarely examined explicitly, although it is central to the question of whether and when, for example, we accept shallow lexical measures that are easy to implement and technically reliable as substitutes for established content-analytic categories and human coding.
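To illustrate how such validity evidence might be produced, the following minimal sketch compares a deliberately shallow automatic classifier against a small, hypothetically human-coded sample; the messages, labels and coding scheme are invented for illustration.

```python
# Minimal sketch: checking an automatic text classifier against human-coded labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

messages = [
    "love the new phone, battery lasts forever",
    "delivery was late and support never answered",
    "great price, arrived on time",
    "screen cracked after two days, very disappointed",
    "excellent service and friendly staff",
    "refund still not processed after three weeks",
]
human_labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

# Bag-of-words features plus naive Bayes: a deliberately shallow, "available" measure.
X = CountVectorizer().fit_transform(messages)
clf = MultinomialNB()

# Cross-validated agreement with the human coding is one simple piece of validity evidence.
scores = cross_val_score(clf, X, human_labels, cv=3)
print("Mean agreement with human coding:", scores.mean())
```

Reporting this kind of agreement figure alongside the automatic coding is one simple way to address the validity concern raised above.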
DATA ANALYSIS AND INFERENCES
In addition to sampling, data collection, and measurement, the analysis of large data sets is one of the central issues around the Big Data
phenomenon. If a researcher deals with Big Data in the original technical sense, meaning that data sets cannot be analyzed on a desktop
computer using conventional tools such as Statistical Product and Service Solutions (SPSS) or Statistical Analysis System (SAS), he or she can
investigate the possibilities of distributed algorithms and software that can run analyses on multiple processors or computing nodes.
An alternative approach would be to take a step back and ask whether an analysis of a subset of the data could provide enough information to test
a hypothesis or make a prediction. Although, in general, a larger sample size means more precise estimates, and a larger number of indicators or
repeated observations lead to less measurement error, most social science theories do not require that much precision. If the sampling procedure
is valid, the laws of probability and the central limit theorem also apply to online research, and even analyses that require much statistical power
can still be run on a single machine. In this way, Big Data can safely be reduced to medium-size data and still yield valid and reliable results. The
requirement of larger or smaller data sets is also linked to the question of what inferences one might like to draw from the analysis: Are we
interested in aggregate or individual effects, causal explanation or prediction? Predicting individual user behavior, for example on a Web site,
requires both reliable and valid measurement of past behavior as well as many observations.
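To illustrate the point that a well-drawn subsample is often sufficient, here is a minimal simulation, with purely synthetic data, showing that an estimate computed from a random sample of a large data set closely tracks the full-data value.

```python
# Minimal sketch: a random subsample of "big" data gives nearly the same estimate.
import numpy as np

rng = np.random.default_rng(0)

# Simulate one million observations, e.g. session durations in minutes.
population = rng.exponential(scale=8.0, size=1_000_000)

# Estimate the mean from a 10,000-observation random sample drawn without replacement.
sample = rng.choice(population, size=10_000, replace=False)

print("Full-data mean:", population.mean())
print("Sample mean:   ", sample.mean())
# For most social-science questions, the tiny difference between the two is irrelevant.
```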
Longitudinal analyses of aggregate data, e.g., using search queries or large collections of tweets, do not necessarily require perfectly reliable
coding or large sample sizes. If a blunt coding scheme based on a simple word list has only 50 percent accuracy, it is still possible to analyze
correlations between time series of media content and user behavior as long as the amount of measurement error is the same over time.
Moreover, whether a time series is based on hundreds or thousands of observations rarely affects the inferences that can be drawn on the
aggregate level, at least if the observations are representative of the same population. If, on the other hand, a researcher is interested in analyzing the specific content of a set of messages or the behavior of a pre-defined group of online users, the demands on measurement and sampling rise considerably. As in other disciplines such as psychology, education, or medicine, individual diagnostics and inferences require far more precision than the detection of aggregate trends.
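The claim that stable measurement error leaves aggregate correlations largely intact can be checked with a short simulation; the series below are synthetic, and the 50 percent detection rate simply mirrors the example in the text.

```python
# Minimal simulation: a coding scheme that detects only ~50% of relevant items
# barely changes the correlation between two aggregate time series.
import numpy as np

rng = np.random.default_rng(1)
days = 365

# True daily volume of relevant media content, and a user-behavior series driven by it.
true_content = rng.poisson(lam=200, size=days).astype(float)
behavior = 0.5 * true_content + rng.normal(0, 10, size=days)

# Blunt coding: each item is detected with only 50% probability, day after day.
measured_content = rng.binomial(true_content.astype(int), 0.5).astype(float)

print("Correlation with true counts:    ", np.corrcoef(true_content, behavior)[0, 1])
print("Correlation with measured counts:", np.corrcoef(measured_content, behavior)[0, 1])
```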
Finally, one should ask how generalizable the findings of a study can or should be: In-depth analysis, both qualitative and quantitative, might
allow for accurate predictions and understanding of a single individual, but it often cannot be generalized for larger samples or the general
population. Observing a handful of Internet users in a computer lab can rarely lead to valid inferences about Internet users in general, simply
because there is often too little information about individual differences or, more technically, between-person variance. Correlations on the
aggregate level, on the other hand, cannot simply be applied to the individual level without the risk of an ecological fallacy, i.e., observing
something in aggregate data that never actually occurs on the individual level.
INTERPRETATION AND THEORETICAL IMPLICATIONS
If researchers have undertaken analyses of Big Data, they need, of course, to interpret their results in light of the decisions they have made along
the research process and the consequences of each of these decisions. The core question should be: What is the theoretical validity and
significance of the data? Large samples of digital media are limited in some respects, so scholars have to be careful about what inferences are
drawn from them. The problem of determining the meaning of some types of digital media data has already been alluded to above. The number of
times a message gets forwarded ("retweeted") on Twitter, for instance, may show a certain degree of interest by users, but without looking at the content and/or style of the tweet, "interest" could stand for popularity and support, revulsion and outrage, or simply the thoughtless routines of
Twitter usage behavior.
As (Boyd and Crawford, 2012) point out: No matter how easily available Facebook, YouTube, or Twitter data is, it is based on a small and certainly
nonrandom subset of Internet users, and this is even more true when investigating specific Web sites, discussion boards, online games, or
devices. If less than 5 percent of Internet users in a given country are active on Twitter, as in Germany, an analysis of trending topics on the micro
blogging service can hardly represent the general population’s current concerns.
In addition, a platform’s interfaces (or ethical constraints) may not allow researchers to access information that would be most interesting to
them, confining them to descriptive exploration of artificial categories. Visualizations based on such categories, for example connections between
social media users, may allow the discovery of patterns, but without cases to compare them to, these patterns may not lead to insight. Likewise,
we have already underlined that the mere occurrence of certain keywords in a set of social media messages does not constitute “discourse,” per
se. Such theoretical constructs should not be tweaked beyond recognition to fit the data structure of a given platform.
In sum, researchers should not compromise their original research interests simply because they cannot be as easily approached as others. If
after careful scrutiny of the possibilities a certain platform or type of analysis really offers, the scholar decides that a Big Data approach is not
advisable, a thorough analysis of smaller data sets may well produce more meaningful results. While similar problems exist in all empirical
studies, such issues seem especially pressing in Big Data research.
THE BIG DATA IN CLOUD COMPUTING
With the development of the Internet of Things, its technologies have been widely used in various fields and have accumulated massive amounts of data. Since the emergence of Cloud Computing, and with the continuous development of science and technology driven by academia and industry, the applications of Cloud Computing continue to develop; Cloud Computing is moving from theory to practice. With the development of Cloud Computing, the data center has also evolved. Nowadays, a data center is not only a site where servers are managed and repaired, but also a center of many high-performance computers that can compute and store huge amounts of data.
Many Big Data cleaning algorithms have been proposed; they are mainly divided into regional object cleaning algorithms, object cleaning algorithms based on information theory, algorithms based on the discernibility matrix, and improved object cleaning algorithms. Many scholars mainly study how to deal with inconsistent decision tables and how to improve the efficiency of Big Data algorithms. Big Data cleaning is an important way to address the massive data mining problem, and a Big Data cleaning algorithm that combines a parallel genetic algorithm with a co-evolutionary algorithm to decompose the object cleaning task can improve the efficiency of Big Data processing (Zhang, Xue, et al., 2013).
However, most such Big Data cleaning algorithms assume that all the data can be loaded into main memory at once, which is infeasible for Big
Data. Cloud computing is a business computing model proposed in recent years; it is the outgrowth of distributed computing, parallel computing,
and grid computing. A pioneer of cloud computing, Google Inc., proposed GFS (the Google File System), a large distributed file system with
massive data storage and access capacity, together with MapReduce, a parallel programming model for handling massive data, which provides a
feasible foundation for massive data mining. Cloud computing technology has been applied in the field of machine learning, but it has not yet
been applied in earnest to Big Data cleaning algorithms.
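As a rough illustration of the MapReduce programming model mentioned above, the following minimal sketch (plain Python, no Hadoop dependency; the record set and the consistency rule are hypothetical) counts how many records fail a simple cleaning check using map and reduce phases that could each be distributed across many machines.

```python
from collections import defaultdict
from typing import Iterable, Tuple

# Hypothetical records: (record_id, age) pairs; the "cleaning rule" flags
# implausible ages. In a real MapReduce job each mapper would see one input split.
RECORDS = [("p1", 34), ("p2", -5), ("p3", 212), ("p4", 61)]

def map_phase(records: Iterable[Tuple[str, int]]):
    """Emit (key, 1) for every record, keyed by whether it passes the rule."""
    for _, age in records:
        status = "invalid" if age < 0 or age > 130 else "valid"
        yield (status, 1)

def reduce_phase(pairs):
    """Sum the counts per key, as a MapReduce reducer would."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

if __name__ == "__main__":
    print(reduce_phase(map_phase(RECORDS)))  # e.g. {'valid': 2, 'invalid': 2}
```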
BIG DATA IMPACTS BUSINESS ENTERPRISES
Data are generated in a growing number of ways. Use of traditional transactional databases has been supplemented by multimedia content, social
media, sensors, and myriad types of new data collection methodologies. Advances in information technology allow users to capture,
communicate, aggregate, store and analyze enormous pools of data, known as “Big Data”. The emergence of Big Data poses a dilemma for the
businesses that have depended upon database technology to store and process data. Big Data derives its name from the fact that the datasets are
large enough that typical database systems are unable to capture, save, and analyze these datasets (Manyika et al., 2011). The actual size of Big
Data varies by business sector, software tools available in the sector, and average dataset sizes within the sector. Best estimates of size range from
a few dozen terabytes to many petabytes. In order to benefit from Big Data, new storage technologies and analysis methods need to be adopted;
business executives must determine the new technologies and methodologies best suited to their information needs. Business executives ignoring
the growing field of Big Data will eventually become non-competitive (Manyika et al., 2011).
TYPES AND SOURCES OF BIG DATA
Executives need to be cognizant of the types of data they need to deal with, whether or not a company is using Big Data. There are three main
types of data: unstructured data, structured data, and semi-structured data. Unstructured data are stored in the format in which they were
collected; no formatting is used (Coronel, Morris, & Rob, 2013). Examples of unstructured data are PDFs, e-mails, and documents. Structured
data are formatted to allow storage, use, and generation of information (Coronel, Morris, & Rob, 2013). Traditional transactional databases store
structured data. Semi-structured data have been processed to some extent; XML- or HTML-tagged text are examples of semi-structured data
(Manyika et al., 2011). Businesses with traditional database management systems need to broaden their data horizons to include collection,
storage, and processing of unstructured and semi-structured data. Much of the collection of unstructured and semi-structured data is done
through sensor-based technologies. Researchers describe sensors providing Big Data as being part of the Internet of Things. The Internet of
Things is described as sensors and actuators that are embedded in physical objects. Some industries that are creating and using Big Data are
those that have recently begun digitization of their data content; these industries include entertainment, healthcare, life sciences, video
surveillance, transportation, logistics, retail, utilities, and telecommunications.
THE ROLE OF BIG DATA IN SECURITY
Many organizations rely on multiple tools to produce the necessary security data. This leads to a huge and complex set of data feeds that must be
analyzed, normalized, and prioritized. The scale of security data that needs analysis has simply become too big and complex to handle, and
analyzing it takes a very long time. According to the Verizon 2013 Data Breach Investigations Report, 69% of breaches were discovered by a third
party and not through internal resources; this is one example of why the role of Big Data is important in security (Tankard, 2012).
Big Data security needs to be correlated with its business criticality or risk to the organization. Without a risk-based approach to security,
organizations can waste valuable IT resources used for vulnerabilities that will cause in reality little or no threat to the business. Furthermore, big
security data needs to be filtered to just the information that is relevant to specific stakeholders’ roles and responsibilities. Not everyone has the
same needs and objectives when it comes to leveraging Big Data.
To deal with big security data and achieve continuous diagnostics, progressive organizations are leveraging Big Data Risk Management systems
to automate many manual, labor-intensive tasks. These systems take a preventive, pro-active approach by interconnecting otherwise silo-based
security and IT tools and continuously correlating and assessing the data they generate. In turn, this enables organizations to achieve a closed-
loop, automated remediation process, which is based on risk. This results in tremendous time and cost savings, increased accuracy, shortened
remediation cycles, and overall improved operational efficiency.
Big Data Risk Management systems empower organizations to make threats and vulnerabilities visible and actionable, while enabling them to
prioritize and address high risk security exposures before breaches occur. Ultimately, they can protect against and minimize the consequences of
cyber-attacks.
SECURITY AWARENESS IN BIG DATA ENVIRONMENTS
Data security can be addressed in an efficient and effective manner that satisfies all parties. Starting Big Data security planning immediately and
building security into Big Data environments from the outset will reduce costs, risks, and deployment pain.
Many organizations deploy Big Data environments alongside their existing database systems, allowing them to combine traditional structured
data and new unstructured data sets in powerful ways. Big Data environments consist of reliable data storage built on a different infrastructure: a
distributed file system, a column-oriented database management system, and a high-performance parallel data processing technique.
Big Data environments need to be protected using the same rigorous security strategies applied to traditional database systems, such as
databases and data warehouses, to support compliance requirements and prevent breaches.
Security strategies which should be implemented for Big Data environments include (Manovich, 2012):
• Sensitive Data Discovery and Classification: Discover and understand sensitive data and relationships before the data is moved to
Big Data environments so that the right security policies can be established downstream.
• Data Access and Change Controls: Establish policies regarding which users and applications can access or change data in Big Data
environments.
• Real-Time Data Activity Monitoring and Auditing: Understand the who, what, when, how, and where of access to Big Data
environments, and report on it for compliance purposes.
• Compliance Management: Build a compliance reporting framework into Big Data environments to manage report generation,
distribution and sign off.
FUNDAMENTALS TO IMPROVE SECURITY IN BIG DATA ENVIRONMENTS
According to the Future of Data Security and Privacy, organizations can control and secure the extreme volumes of data in Big Data
environments by following a three-step framework (as shown in Figure 1) (Zhang, Xue et al., 2013):
1. Define: The planning phase presents the perfect opportunity to start a dialog across data security, legal, business and IT teams about sensitive data
understanding, discovery and classification. A cross-functional team should identify where data exists, decide on common definitions for
sensitive data, and decide what types of data will move into Big Data environments. Also, organizations should establish a life-cycle approach to
continuously discover data across the enterprise.
2. Dissect: Big Data environments are highly valuable to the business. However, data security professionals also benefit because Big Data
repositories can store security information. Data security professionals can leverage Big Data environments to more efficiently prioritize
security intelligence initiatives and more effectively place the proper security controls.
3. Defend: Aggregating data by nature increases the risk that an attacker can compromise sensitive information. Therefore, organizations
should strictly limit the number of people who can access repositories.
Big Data environments should include basic security and controls as a way to defend and protect data. First, access control ensures that the right
user gets access to the right data at the right time. Second, continuously monitoring user and application access is highly important, especially as
individuals change roles or leave the organization. Monitoring data access and usage patterns can flag policy violations, such as an administrator
altering log files. Typically, attackers will leave clues or artifacts about their breach attempts that can be detected through careful monitoring.
Monitoring helps ensure security policies are enforced and effective.
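As a small, hypothetical illustration of the access-control and monitoring ideas above (the role names, policy rules, and event format are assumptions for this sketch, not a specific product's API), the following Python snippet checks each access request against a role policy and flags suspicious events such as modification of audit logs.

```python
from datetime import datetime

# Hypothetical role policy: which (action, resource) pairs each role may perform.
POLICY = {
    "analyst": {("read", "clickstream"), ("read", "sales")},
    "admin": {("read", "audit_log"), ("read", "clickstream"), ("read", "sales")},
}

def is_allowed(role: str, action: str, resource: str) -> bool:
    """Return True only if the role's policy explicitly grants the action."""
    return (action, resource) in POLICY.get(role, set())

def monitor(event: dict) -> None:
    """Alert on policy violations, e.g. anyone (even an admin) altering log files."""
    allowed = is_allowed(event["role"], event["action"], event["resource"])
    tampering = event["resource"] == "audit_log" and event["action"] != "read"
    if not allowed or tampering:
        print(f"[ALERT {datetime.utcnow().isoformat()}] "
              f"{event['user']} ({event['role']}) tried to "
              f"{event['action']} {event['resource']}")

monitor({"user": "u42", "role": "admin", "action": "write", "resource": "audit_log"})
monitor({"user": "u17", "role": "analyst", "action": "read", "resource": "sales"})
```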
Organizations can secure data using data abstraction techniques such as encryption or masking. Generally, attackers cannot easily decrypt or
recover data after it has been encrypted or masked. The unfortunate reality is that organizations need to adopt a zero trust policy to ensure
complete protection.
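As a small illustration of the data abstraction techniques mentioned above (a minimal sketch, assuming hypothetical field names and masking rules, not a specific product's implementation), the code below masks and pseudonymizes direct identifiers before records are loaded into a shared Big Data repository, while leaving analytically useful fields intact.

```python
import hashlib

# Hypothetical customer record; field names are illustrative only.
record = {"name": "Jane Doe", "email": "jane@example.com",
          "ssn": "123-45-6789", "age": 42, "region": "EU"}

def mask(value: str, keep: int = 2) -> str:
    """Replace all but the last `keep` characters with '*'."""
    return "*" * max(len(value) - keep, 0) + value[-keep:]

def pseudonymize(value: str, salt: str = "static-salt") -> str:
    """One-way hash so the same input always maps to the same token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def sanitize(rec: dict) -> dict:
    """Return a copy of the record that is safer to load into the analytics store."""
    return {
        "customer_token": pseudonymize(rec["email"]),  # joinable, not reversible
        "ssn": mask(rec["ssn"]),                       # keep last 2 digits only
        "age": rec["age"],                             # retained for analysis
        "region": rec["region"],
    }

print(sanitize(record))
```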
There are several actions organizations can take today to better secure Big Data environments: move your controls closer to the data itself,
leverage existing technologies to control and protect Big Data, ask legal to define clear policies for data archiving and data disposal, diligently
control access to Big Data resources and watch user behavior.
BIG DATA SECURITY AND PRIVACY CHALLENGES
It is reasonable to expect that the huge amount of data being collected creates new security challenges. Big Data is very important because most
organizations are now accessing and using it. In the past, Big Data was used only by very large organizations, governments and large enterprises
that had the ability to create their own infrastructure for hosting and mining large amounts of data. These infrastructures were typically private
and proprietary, and were isolated from access by outsiders. Nowadays, Big Data is cheap and available to all kinds of organizations through the
public cloud infrastructure, and this has led to new security challenges.
The following list contains the top challenges for Big Data security and privacy (Steve, 2013):
1. Secure computations in distributed programming frameworks.
2. Security best practices for non-relational data stores.
3. Secure data storage and transaction logs.
4. End-point input validation and filtering.
5. Real-time security monitoring.
6. Privacy-preserving data mining and analytics.
7. Cryptographically enforced data-centric security.
8. Granular access control.
9. Granular audits.
10. Data provenance.
The challenges may be organized into four aspects of the Big Data ecosystem, as follows (Cloud Security Alliance, 2013):
1. Infrastructure Security: This includes secure computations in distributed programming frameworks and security best practices for
non-relational data stores.
2. Data Privacy: This includes privacy-preserving data mining and analytics, cryptographically enforced data-centric security, and granular
access control.
3. Data Management: This includes secure data storage and transaction logs, granular audits, and data provenance.
4. Integrity and Reactive Security: This includes end-point input validation and filtering, and real-time security monitoring.
SECURITY ISSUES IN BIG DATA
Big Data offers a large store of data, which makes it an attractive target for intruders and attackers seeking to violate people’s privacy and to
misuse this treasure of data.
1. Big Data Privacy
As the collection of unstructured data becomes more economically viable, and shifts in consumer usage of technology make a much wider range
of data available, there is an incentive for organizations to collect as much data as possible. Yet, just because consumers are willing to provide
data this does not mean that its use is free from privacy implications (Boyd, 2010). Four examples of these privacy challenges follow. The first
arises from different sets of data that would not previously have been considered as having privacy implications being combined in ways
that threaten privacy. One example, albeit experimental, was demonstrated by researchers who used publicly available
information and photographs from Facebook and, through application of facial recognition software, matched this information to identify
previously anonymous individuals on a major dating site. In another example, anonymous ‘de-identified’ health information distributed between
US health providers was found to be traceable back to individuals when modern analytical tools were applied. This creates an unintended use
paradox. How can consumers trust an organization with information when the organization does not yet know how the information might be
used in the future? The second challenge comes from security – specifically the issue around hacking or other forms of unauthorized access.
Despite increasing awareness of the need to maintain physical security, computer systems are only as strong as their weakest point, and for
databases the weakest point is usually human. For all the advanced technical security used to protect the US diplomatic network, the WikiLeaks
scandal was caused by a low-level employee copying data onto a fake ‘Lady Gaga’ CD. For Big Data stores to be useful there needs to be a certain
amount of regular access, often by a range of employees in different locations. While treating data like gold bullion and storing them in a vault
may guarantee security, this is not a practical solution for most use cases; but what of security breaches? When a credit card is stolen it is
relatively straightforward, if time consuming, to cancel the card and be issued a new one. Yet a comprehensive set of information about one’s
online activities, friends or any other type of Big Data set is more difficult to replace.
In a sense, these are not simply items of data but a comprehensive picture of a person’s identity over time. The third privacy challenge is that data
are increasingly being collected autonomously, independent of human activity. Previously, there was a natural limit on the volume of data
collected related to the number of humans on the planet, and the number of variables we are interested in on each individual is considerably
fewer than the number of people on the planet. The emergence of network-enabled sensors on everything from electricity and water supplies
through to airplanes and cars changes this dimension. Combining these sensors with nanotechnology it becomes possible to embed large
numbers in new buildings to provide early warnings of dangers relating to the structural integrity of the building.
The volume of data, and the speed with which the data must be analyzed, means that there is the requirement for data to be collected and
autonomously analyzed without an individual providing specific consent. This raises ethical concerns relating to the extent to which
organizations can control the collection and analysis of data when there is limited human involvement. The final privacy challenge relates to the
contextual significance of the data. Currently the ability of organizations to collect and store data runs far ahead of their ability to make use of it.
Because organizations store any and all unstructured data regardless of potential use cases, combinations of data that there is currently no
capability to analyze could become subject to privacy breaches in the future.
2. Using Steganography with Big Data
With the advance of computer and Internet technologies, protecting personal information has become an important issue. Traditional
cryptography, through symmetric and asymmetric encryption schemes, can provide high-level data security, but it is not flexible enough for all
kinds of media. Steganography, also called “information hiding,” is covert communication that imperceptibly embeds secret information into a
meaningful medium so that only authorized users can extract the hidden data. The meaningful medium used to hide the secret message is called
the cover media, and the encoding result is called the stego media. Generally, an information hiding scheme should satisfy the following two
requirements. First, the hiding capacity should be as large as possible. Second, the visual quality of the embedding result should not be
distinguishable from the cover media (Chuang and Chen, 2012).
Digital images are the most popular camouflage media, mainly because image pixels tolerate small distortions: it is very easy to make
imperceptible modifications to an image. When an image pixel is slightly modified, the difference is not easily noticed by the human eye.
Recently, some research has addressed information hiding schemes for text documents. The embedding payload of a text document is smaller
than that of a digital image because it is not easy to find redundant information in a text document. In general, text hiding schemes can be
classified into two types: content format and language semantics. Content format methods adjust the width of tracking, the height of leading, the
number of white spaces, and so on. Language semantic methods change the meaning of a phrase or a sentence in a text document.
Traditional text hiding schemes embed secret information between words and between characters by adding tabs or white spaces. However, the
adjusted white spaces between words may look strange. Therefore, we intend to design a text hiding scheme using the Big-5 code. The secret is
first converted into binary and then embedded into the whitespaces between words and between characters of a cover text by placing a Big-5
code of either 20 or 7F.
As an example of using steganography in text, to embed a secret into a cover text we adjust the content of the cover text: we add a white-space in
each gap between words and between characters. Secret messages are sequentially converted into a binary stream of 0’s and 1’s. Each white-space
between words or between characters in the cover text is used to hide one secret bit. If we want to embed a secret bit 0, the Big-5 white-space
code 20 is applied; if we want to embed a secret bit 1, the Big-5 blank character code 7F is applied. After finishing the secret embedding, we add
an end-of-message code 7F to indicate the end of the secret input. The hiding capacity of a cover text can be determined before data embedding
by counting the total number of white-spaces in the cover text. Assuming a cover text contains w characters, the embedding payload of the cover
text is (w-1) bits.
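The following minimal Python sketch illustrates the whitespace-coding idea described above, using the ordinary ASCII space (0x20) and the 0x7F character as stand-ins for the two Big-5 blank codes; it hides one bit per gap between words and is an illustration of the scheme's logic, not the published implementation.

```python
BIT0, BIT1, END = "\x20", "\x7f", "\x7f"  # stand-ins for the two Big-5 blank codes

def embed(cover_words, secret_bits):
    """Hide one bit in each gap between words by choosing which blank code to insert."""
    if len(secret_bits) > len(cover_words) - 1:
        raise ValueError("payload exceeds (w - 1) bits for w words")
    out = [cover_words[0]]
    for i, word in enumerate(cover_words[1:]):
        gap = (BIT1 if secret_bits[i] == "1" else BIT0) if i < len(secret_bits) else BIT0
        out.append(gap + word)
    return "".join(out) + END  # end-of-message marker

def extract(stego_text, n_bits):
    """Recover the hidden bits by inspecting which blank code separates the words."""
    gaps = [c for c in stego_text[:-1] if c in (BIT0, BIT1)]
    return "".join("1" if g == BIT1 else "0" for g in gaps[:n_bits])

words = "the quick brown fox jumps over the lazy dog".split(" ")
stego = embed(words, "10110")
print(extract(stego, 5))  # -> '10110'
```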
3. The Ethics of Big Data
Never before in the history of mankind have we had access to so much information so quickly and so easily. Concerns over privacy, and the
collection and use of personal information, have been closely associated with the growing influence of technology in society. As long ago as 1890,
legal scholars raised concerns over the commercial application of new photographic technologies in the newspaper industry. On the potential
impacts of the future use of information technologies in commerce, the prescient observation has been made: ‘what is whispered in the closet
shall be proclaimed from the house-tops’. These concerns have grown significantly since commercial use of the internet first became widespread
in the mid-1990s.
For many people, the question of how personal data are used for marketing purposes has become a defining social feature of the internet. Yet the
same technology has also created significant new opportunities for market researchers to collect and analyze information to generate more timely
and relevant insights. To date, these two perspectives have existed side by side, albeit sometimes uneasily, with market researchers able to
leverage the internet as an important research tool within the framework of existing ethical approaches.
However, the trend towards Big Data presents a number of challenges in terms of both the ways that personal information is collected and
consumer relationships with this information. Because of its key role in collecting, analyzing and interpreting data, many of the problems, and
opportunities, of Big Data are also those of market research. For market research to prosper it requires the continuing cooperation of
respondents, both in terms of providing data for research studies and in giving permission for these data to be analyzed. In an environment
where there are issues around increasing non-cooperation by respondents, it is essential for market researchers to be at the forefront of
understanding emergent ethical and privacy issues; this is especially critical where regulatory change poses a potential threat to market
researchers’ ability to collect data in the future.
On the other hand, the advances offered by Big Data present significant opportunities for generating new insights into consumer behavior. The
ability to triangulate multiple data sources and perform analyses of these massive data sets in real time enables market researchers to gather a
range of insights that may not be possible using existing market research techniques.
USING BIG DATA FOR BUSINESS RESEARCH
The following areas represent opportunities as sources of new insight for business and market researchers using Big Data, but they also have the
potential to create significant challenges in terms of privacy and the use of personal data.
1. The Social Graph
Much of the growth in data is driven by the voluntary sharing of information between members of social networks. Rather than focus on
individual responses, Big Data allows a picture to be built of group-level interactions and the nature of the bonds that bring these people
together – a concept that has been labeled the ‘social graph’. The relationship is symbiotic: in order to create value
in their social graph, users need to contribute information about their lives, but in doing so they also increase the digital exhaust of information
that is available about them.
Yet, the boundaries of this social graph are imprecise. The challenge of continuously identifying and labeling 'friends’, particularly those where
there are weak social ties, creates the potential for social uncertainty. The labels for these virtual world connections such as ‘followers’ or ‘friends’,
may not be analogous to their physical-world meanings. It is this source of ambiguity that presents ethical challenges. Understanding how an
individual’s online social graph relates to real-world meaning is thus likely to be essential in effectively leveraging it.
2. Data Ownership
With Big Data the nature of the organizations that collect the largest stores of personal information is changing. In general, it is not central
governments or traditional large corporations that are storing information, but rather a breed of smaller high technology firms such as Facebook,
Twitter, LinkedIn, Google and others. On the one hand, this provides researchers access to sources of data that may not previously have been
available. While there is little incentive for governments to monetize their data commercially, the business models of the majority of consumer-
facing web services are built around, to put it simply, driving commercial value from customer data. For example, Twitter will now make a feed of
several years’ historical content available to anyone wishing to use it for research or analysis purposes. This raises the question of the long-term
ownership of personal data that consumers make available online. Even those companies that do not currently sell access to their data stores
could themselves be potentially sold in the future, and policies for the use of data changed.
3. Big Data Memory
The capability for Big Data technology to enable the storage, and recall, of large volumes of information gives a temporal dimension to the storage
of personal information. Information recorded today, even if not public now, can be recalled instantly in decades’ time. For example, the
emerging focus of Facebook on a ‘timeline’ has created concerns that activities people partake in while at college may reflect badly on them
when they enter the world of work. While analyzing data and building effective models of consumer behavior has always been a part of market
research, Big Data provides the promise of more accurate and far-reaching models. Thus Big Data enables the ability to rewind and fast-forward
people’s lives, but in doing so may remove the ability for individuals to forget and be forgotten.
4. Passive Data Collection
Much information collection is now automatic and passive. Existing approaches to market research are typically reliant on some form of active
opt-in. Big Data makes use of passive technologies, such as location-based information from mobile phones, data from autonomous sensors, or
facial recognition technology in retail stores. This creates the potential for powerful new variables to be included in consumer research. At the
same time the individual may no longer have specific knowledge and awareness that data are currently being collected about them. Even if
permission has been given initially, these services are not asking for permission every time such contextual data are gathered.
5. Respecting Privacy in a Public World
While privacy concerns have been raised over the use and creation of Big Data, these have been outpaced by individuals’ use of social networks.
The value inherent in the social graph provides some form of counterbalance to the potential privacy issues. Put another way, for all the privacy
implications, people derive great benefit from services such as mobile applications and social networks – many of which are available at no
charge. Beyond this, for many social groups, contributing to Big Data stores becomes a socially necessary form of communication in a world
where avoiding social networking sites has the potential to exclude people from their communities. This creates a
paradox in that, while individuals can opt out of having their personal data collected, to do so may result in increasing their exclusion from the
digitally connected world in which they reside.
6. Personal Data and Business Research
For many sectors the ability to collect data and turn it into insight has a key role in developing more innovative and successful products and
services. However, for market research, the importance of access is instrumental to the ability to deliver the product. The history of marketing
activity provides us with many examples of situations where regulators have responded reactively to public perceptions of over-zealous, or
unethical, marketing activity. From the promotion of ineffective ‘patent’ medicines in the 19th century through to tobacco and alcohol in the 20th
century, in sectors that generate negative externalities regulatory pressure is never far behind. Given the criticality of online data collection to
market research, and the potential for personal data to become a similarly hot topic of the 21st century, the successful realization of the potential
of Big Data in market research requires being proactive in responding to potential privacy issues, even if these have yet to reach the public
imagination.
BUSINESS INTELLIGENCE AND ANALYTICS: FROM BIG DATA TO BIG IMPACT
Business Intelligence and Analytics (BI&A) has emerged as an important area of study for both practitioners and researchers, reflecting the
magnitude and impact of data-related problems to be solved in contemporary business organizations. This introduction to the Management
Information System (MIS) Quarterly Special Issue on Business Intelligence Research first provides a framework that identifies the evolution,
applications, and emerging research areas of BI&A. BI&A 1.0, BI&A 2.0, and BI&A 3.0 are defined and described in terms of their key
characteristics and capabilities. Current research in BI&A is analyzed and challenges and opportunities associated with BI&A research and
education are identified. We also report a bibliometric study of critical BI&A publications, researchers, and research topics based on more than a
decade of related academic and industry publications. Finally, the six articles that comprise this special issue are introduced and characterized in
terms of the proposed BI&A research framework.
Business Intelligence and Analytics (BI&A) and the related field of Big Data analytics have become increasingly important in both the academic
and the business communities over the past two decades. Industry studies have highlighted this significant development. For example, based on a
survey of over 4,000 Information Technology (IT) professionals from 93 countries and 25 industries, the IBM Tech Trends Report (2011)
identified business analytics as one of the four major technology trends in the 2010s. In a survey of the state of business analytics by Bloomberg
Businessweek (2011), 97 percent of companies with revenues exceeding $100 million were found to use some form of business analytics. A report
by the McKinsey Global Institute (Manyika et al. 2011) predicted that by 2018, the United States alone will face a shortage of 140,000 to 190,000
people with deep analytical skills, as well as a shortfall of 1.5 million data-savvy managers with the know-how to analyze Big Data to make
effective decisions. Hal Varian, Chief Economist at Google and emeritus professor at the University of California, Berkeley, commented on the
emerging opportunities for IT professionals and students in data analysis as follows:
BIG DATA AND MOBILE PHONES
By analyzing patterns from mobile phone usage, a team of researchers in San Francisco is able to predict the magnitude of a disease outbreak half
way around the world. Similarly, an aid agency sees early warning signs of a drought condition in a remote Sub-Saharan region, allowing the
agency to get a head start on mobilizing its resources and save many more lives.
Much attention is paid to the vital services that mobile phone technology has brought to billions of people in the developing world. But now many
policy-makers, corporate leaders and development experts are realizing the potential applications, like the examples above, for the enormous
amounts of data created by and about the individuals who use these services.
Sources such as online or mobile financial transactions, social media traffic, and GPS coordinates now generate over 2.5 quintillion bytes of so-
called “Big Data” every day. And the growth of mobile data traffic from subscribers in emerging markets is expected to exceed 100% annually
through 2015.
The data emanating from mobile phones holds particular promise, in part because for many low-income people it is their only form of interactive
technology, and in part because it is easier to link mobile-generated data to individuals. This data can paint a picture about the needs and behavior of
individual users rather than simply the population as a whole.
To turn mobile-generated data into an economic development tool, a number of ecosystem elements must be in place. For those individuals who
generate the data, mechanisms must be developed to ensure adequate user privacy and security. At the same time, business models must be
created to provide the appropriate incentives for private-sector actors to share and use data for the benefit of the society. Such models already
exist in the Internet environment. Companies in search and social networking profit from products they offer at no charge to end users because
the usage data these products generate is valuable to other ecosystem actors. Similar models could be created in the mobile data sphere, and the
data generated through them could maximize the impact of scarce public sector resources by indicating where resources are most needed. We can
see that data collected through mobile device usage can spur effective action in two primary ways: by reducing the time lag between the start of a
trend and when governments and other authorities are able to respond to them, and by reducing the knowledge gap about how people respond to
these trends.
Ecosystem actors have much to gain from the creation of an open data commons. Yet the sharing of such data especially that tied to individuals
raises legitimate concerns that must be addressed to achieve this cross-sector collaboration.
As ecosystem players look to use mobile-generated data, they face concerns about violating user trust, rights of expression, and confidentiality.
Privacy and security concerns must be addressed before ecosystem players can be convinced to share data more openly.
When individuals have multiple SIM cards, it is impossible to aggregate data from each SIM back to the same individual. This data is most useful
if it can be attached to demographic indicators, which allow the data to tell a story about the habits of a segment of the population. Improved
methods of tying subscriptions to demographic information are needed to ensure data generated by mobile devices is as individualized as
possible.
Individuals, facing security and privacy concerns, often resist sharing personal data. In addition, many private-sector firms do not see an
incentive to share data they regard as proprietary. Governments often cannot force contractors to share data collected in the execution of public
contracts or make all government data available for use by academia, development organizations, and companies. All players must see material
benefits and incentives in data sharing that outweigh the risks.
CONCLUSION
The opportunities for large-scale digital media research are obvious—as are its pitfalls and downsides. Thus, researchers should differentiate
between alternative research approaches carefully and be cautious about the application of unfamiliar tools, analytical techniques, or
methodological innovation. With no or few references to compare one’s results to, findings will be difficult to interpret, and online researchers
should “hold themselves to high standards of conceptual clarity, systematicity of sampling and data analysis, and awareness of limitations in
interpreting their results.” Several scholars assert that methodological training should be part of the answer to the challenges of digital media research. Manovich
argues for advanced statistics and computer science methods, which could likely help in furthering an understanding of the underlying
algorithms of online platforms as well as analytical tools. Yet, a reflection on and an understanding of what comes before the first data is collected
or analyzed is equally or possibly even more important.
New data structures and research opportunities should not be ignored by media and communication scholars, and there are many relevant and
interesting research questions that are well suited to Big Data analysis. On the other hand, established practices of empirical research should not
be discarded as they ensure the coherence and quality of a study. After all, this is one of the key contributions that social scientists can bring to
the table in interdisciplinary research. It should go without saying that a strong focus on theoretically relevant questions always increases the
scientific significance of the research and its results. Yet, some developments in digital media research, particularly those related to Big Data,
seem to warrant affirmation of this fundamental principle.
This work was previously published in the Handbook of Research on Threat Detection and Countermeasures in Network Security edited by
Alaa Hussein AlHamami and Ghossoon M. Waleed alSaadoon, pages 276-298, copyright year 2015 by Information Science Reference (an
imprint of IGI Global).
REFERENCES
Agrawal, D., Das, S., & El Abbadi, A. (2010, September). Big data and cloud computing: New wine or just new bottles . Proceedings of the VLDB
Endowment , 3(2), 1647–1648. doi:10.14778/1920841.1921063
Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired. Retrieved from
http://www.wired.com/science/discoveries/magazine/16-07/pb_theory/
Bailenson, J. N. (2012). Contribution to the ICA Phoenix closing plenary: The Internet is the end of communication theory as we know it. 62nd
annual convention of the International Communication Association, Phoenix, AZ. Retrieved from http://www.icahdq.org/conf/2012/closing.asp
Barnes, S. (2006). A privacy paradox: Social networking in the United States. First Monday, 11(9). Retrieved from
http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1394/1312 doi:10.5210/fm.v11i9.1394
Batinic, B., Reips, U.-D., & Bosnjak, M. (Eds.). (2002). Online social sciences. Seattle, WA: Hogrefe & Huber.
Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technical, and scholarly phenomenon. Information,
Communication & Society, 15(5), 665–679. doi:10.1080/1369118X.2012.678878
Busemann, K., & Gscheidle, C. (2012). Web 2.0: Habitualisierung der Social Communities [Web 2.0: Habitualization of social community
use]. Media Perspektiven, (7–8), 380–390.
Cheng, H. (2012, April). Identity based encryption and biometric authentication scheme for secure data access in cloud computing.Chinese
Journal of Electronics , 21(2), 254–259.
Chew, C., & Eysenbach, G. (2010). Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE , 5(11),
e14118. doi:10.1371/journal.pone.0014118
Christians, C. G., & Chen, S.-L. S. (2003). Introduction: Technological environments and the evolution of social research methods. Johns, M. D.,
Chen, S.-L. S., & Hall, G. J. (Eds.), Online social research: Methods, issues, & ethics (pp. 15–23). New York, NY: Peter Lang.
Chuang, J. C., & Chen, H. Y. (2012). Data hiding on text using big-5 code. International Journal of Security and Its Applications, 6(2).
Cloud Security Alliance. (2013). Expanded top ten Big Data security and privacy challenges. Retrieved January 15, 2014, from
https://cloudsecurityalliance.org/research/big-data/
Colin, T. (2012).Big Data Security, Digital Pathways . Network Security , (July): 2012.
Coronel, C., Morris, S., & Rob, P. (2013). Database Systems: Design, Implementation, and Management (10th ed.). Boston: Cengage Learning.
Daniel, N., & Di Domenico, M. (2013). Market research and the ethics of Big Data. International Journal of Market Research, 55(4).
Dodge, M. (2005). The role of maps in virtual research methods . In Hine, C. (Ed.), Virtual Methods: Issues in social research on the Internet (pp.
113–127). Oxford, UK: Berg.
Feng, Z., Hui-Feng, X., Dung-Sheng, X., Yong-Heng, Z., & Fei, Y. (2013). Big Data cleaning algorithms in Cloud Computing.
doi:10.3991/ijoe.v9i3.2765
Mall, R., Langone, R., & Suykens, J. A. K. (2013). Kernel spectral clustering for Big Data networks. Entropy, 15, 1567–1586. doi:10.3390/e15051567
Manovich, L. (2012). Trending: The promises and the challenges of big social data. In Gold, M. K. (Ed.), Debates in the Digital Humanities (pp.
460–475). Minneapolis, MN: University of Minnesota Press.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big Data: The next frontier for Innovation,
Competition, and Productivity. Retrieved from http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/.
Qin, H. F., & Li, Z. H. (2013). Research on the method of Big Data analysis. Information Technology Journal, 12(10), 1974–1980.
Van Till, S. (2013, April). Will Big Data change security? Security Magazine.
Wang, H. (2012). Virtual machine-based intrusion detection system framework in cloud computing environment. Journal of Computers , 7(10),
2397–2403.
Yang, T. (2012). Mass data analysis and forecasting based on cloud computing. Journal of Software , 7(10), 2189–2195.
KEY TERMS AND DEFINITIONS
Big Data: Refers to data sets that are too big to be handled by regular storage and processing infrastructures.
Unstructured Data: Refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
Variety: Refers to the type of information being stored. Previously, data stored tended to be highly structured in nature.
Velocity: Refers to the challenges in accessing stored data quickly enough for them to be useful.
CHAPTER 69
The Role of Big Data in Radiation Oncology:
Challenges and Potentials
Issam El Naqa
McGill University, Canada
ABSTRACT
More than half of cancer patients receive ionizing radiation as part of their treatment and it is the main modality at advanced stages of disease.
Treatment outcomes in radiotherapy are determined by complex interactions between cancer genetics, treatment regimens, and patient-related
variables. A typical radiotherapy treatment scenario can generate a large pool of data, “Big data,” that is comprised of patient demographics,
dosimetry, imaging features, and biological markers. Radiotherapy data constitutes a unique interface between physical and biological data
interactions. In this chapter, the authors review recent advances and discuss current challenges to interrogate big data in radiotherapy using top-
bottom and bottom-top approaches. They describe the specific nature of big data in radiotherapy and discuss issues related to bioinformatics
tools for data aggregation, sharing, and confidentiality. The authors also highlight the potential opportunities in this field for big data research
from bioinformaticians as well as clinical decision-makers’ perspectives.
INTRODUCTION
Cancer is a leading cause of mortality in the United States and worldwide. It remains the second most common cause of death in the United
States, accounting for nearly 1 of every 4 deaths. It is projected that a total of 1,665,540 new cancer cases and 585,720 cancer deaths are to occur
in the United States in 2014 (Siegel, Ma, Zou, & Jemal, 2014). Radiation therapy (radiotherapy) is one of three major treatment modalities of
cancer beside surgery and chemotherapy and it remains the main option at locally advanced stages of the disease. More than half of all cancer
patients receive radiotherapy as a part of their treatment. Despite radiotherapy’s proven benefits, it comes with a Damocles’ sword of benefits and
risks to the exposed patient. A key goal of modern radiation oncology research is to predict, at the time of radiation treatment planning, the
probability of tumour response benefits and normal tissue risks for the type of treatment being considered. Although recent years have witnessed
tremendous technological advances in radiotherapy treatment planning, image-guidance and delivery, efforts to individualize radiotherapy
treatment doses based on in vitro assays of various biological endpoints have not been clinically successful (IAEA, 2002; C. M. West, 1995).
Conversely, several groups have shown that dose-volume factors play an important role in determining treatment outcomes (Bentzen et al., 2010;
Blanco et al., 2005; J. Bradley, Deasy, Bentzen, & El-Naqa, 2004; J. O. Deasy & El Naqa, 2008; I. El Naqa et al., 2006; Hope et al., 2005; Jackson
et al., 2010; Marks, 2002a; Tucker et al., 2004), but these methods may suffer from limited predictive power when applied prospectively. The
lack of progress or major breakthroughs in radiotherapy outcomes over the past two decades demands fundamentally new insights into the
methodologies used to better exploit the power of this unique non-invasive high-energy source, as well as a new vision to guide the analysis of
radiotherapy response and the design of new therapeutic strategies.
Outcomes in radiotherapy are usually characterized by the tumour control probability (TCP) and the normal tissue complication probability
(NTCP) of the surrounding normal tissues (Steel, 2002; Webb, 2001). Traditionally, these outcomes are modeled using information about the dose distribution and the
fractionation scheme (Moissenko, Deasy, & Van Dyk, 2005). However, it is recognized that radiation response may also be affected by multiple
clinical prognostic factors (Marks, 2002b) and more recently, inherited genetic variations have been suggested as playing an important role in
radiation sensitivity (J. Alsner, C. N. Andreassen, & J. Overgaard, 2008; C. M. L. West, Elliott, & Burnet, 2007). Moreover, evolution in imaging
and biotechnology has provided extraordinary new opportunities for visualizing tumours in vivo and for applying new molecular techniques for
biomarker discovery of radiotherapy response, respectively (Bentzen, 2008; Jain, 2007; Nagaraj, 2009; C. M. L. West et al., 2007). However,
biological assays, which can be performed on either tumour or normal tissues, may not be the only determinant of tumour control or risk of
radiotherapy adverse reactions. Therefore, recent approaches have utilized data driven models using advanced informatics tools in which dose-
volume metrics are mixed with other patient or disease-based prognostic factors in order to improve outcomes prediction (Issam El Naqa, 2012).
Accurate prediction of treatment outcomes would provide clinicians with better tools for informed decision-making about expected benefits
versus anticipated risks.
In this chapter, we will provide an overview of recent advances in radiotherapy informatics and discuss current challenges to interrogate big data
as it appears in radiotherapy using top-bottom and bottom-top approaches. We will describe the specific nature of big data in radiotherapy, and
the role of the emerging field of systems radiobiology for outcomes modeling. We will provide examples based on our own and others’ experiences.
Finally, we will discuss issues related to bioinformatics tools for data aggregation, sharing, and confidentiality.
BACKGROUND
Radiotherapy of Cancer
Radiotherapy is targeted localized treatment using ablative high-energy radiation beams to kill cancer cells. More than half of all cancer patients,
particularly patients with solid tumours such as in the brain, lung, breast, head and neck, and the pelvic area receive radiotherapy as part of their
curative or palliative treatment. A typical radiotherapy planning process would involve the acquisition of patient image data (typically fully 3-D
computed tomography (CT) scans and other diagnostic imaging modalities such as positron emission tomography (PET) or magnetic resonance
imaging (MRI)). Then, the physician would outline the tumor and important normal organ structures on a computer, based on the CT scan.
Treatment radiation dose distributions are simulated with prescribed doses (energy per unit mass, Gray (Gy)). The treatment itself could be
delivered externally using linear accelerators (Linacs) or internally using sealed radioisotopes (Brachytherapy) (Halperin, Perez, & Brady, 2008).
Radiotherapy Response
The biology of radiation effect has been classically defined by “The Four R’s” (repair of sublethal cellular damage,reassortment/redistribution of
cells into radiosensitive phases of the cell-cycle between dose fractions, reoxygenation over a course of therapy, and
cellular repopulation/division over a course of therapy) (Hall & Giaccia, 2006). It is believed that radiation-induced cellular lethality is primarily
caused by DNA damage in the targeted cells. Two types of cell death have been linked to radiation effect: apoptosis and post-mitotic cell death.
However, tumour cell radiosensitivity is controlled via many factors (known and unknown) related to tumor DNA repair efficiency (e.g.,
homologous recombination or nonhomologous endjoining), cell cycle control, oxygen concentration, and the radiation dose rate (Hall & Giaccia,
2006; Joiner & Kogel, 2009; Lehnert, 2008). A rather simplistic understanding of these experimentally observed irradiation effects using in
vitro assays constituted the basis for developing analytical (sometimes referred to as mechanistic) models of radiotherapy response. These
models have been applied widely for predicting TCP and NTCP and designing radiotherapy clinical trials over the past century. However, due to
the inherent complexity and heterogeneity of radiation physics and biological processes, these traditional modeling methods have fallen short of
providing sufficient predictive power when applied prospectively to personalize treatment regimens. Therefore, more modern approaches are
investigating more advanced informatics and systems engineering techniques that would be able to integrate physical and biological information
to adapt intra-radiotherapy changes and optimize post-radiotherapy treatment outcomes using top-bottom approaches based on complex
systems analyses or bottom-top approaches based on first principles, as further discussed below.
Top-Bottom Approaches for Modeling Radiotherapy Response
These are typically phenomenological models and depend on parameters available from the collected clinical, dosimetric and/or biological data
(J. O. Deasy & El Naqa, 2008). In the context of data-driven and multi-variable modeling of outcomes, the observed treatment outcome (e.g.,
TCP or NTCP) is considered as the result of a functional mapping of several input variables (I. El Naqa et al., 2006). Mathematically, this is
expressed as f(x; w*): X → Y, where xi ∈ ℝ^N is composed of the input metrics (dose-volume metrics, patient disease-specific prognostic factors,
or biological markers) and yi ∈ Y is the corresponding observed treatment outcome. The variable w* comprises the optimal parameters of
the model f(·), obtained by learning a certain objective functional. Learning is defined in this context of outcome modeling as estimating
dependencies from data (Hastie, Tibshirani, & Friedman, 2001). There are two common types of learning: supervised and unsupervised.
Supervised learning is used when the endpoints of the treatments such as tumour control or toxicity grades are known; these endpoints are
provided by experienced oncologists following Radiation Therapy Oncology Group (RTOG) or National Cancer Institute (NCI) criteria and it is
the most commonly used learning method in outcomes modeling. Nevertheless, unsupervised methods such as principal component analysis
(PCA) are also used to reduce the learning problem dimensionality and to aid in the visualization of multivariate data and the selection of the
optimal learning method parameters.
The selection of the functional form of the model f(·) is closely related to the prior knowledge of the problem. In mechanistic models, the shape of
the functional form is selected based on the clinical or biological process at hand; in data-driven models, however, the objective is usually to find
a functional form that best fits the data (I. El Naqa, J. Bradley, P. E. Lindsay, A. Hope, & J. O. Deasy, 2009). A depiction of this top-bottom
approach is shown in Figure 1. A detailed review of this methodology is presented in our previous work (I. El Naqa, 2013). Below we will highlight
this approach using logistic regression and artificial intelligence methods.
In radiation outcomes modeling, the response will usually follow an S-shaped curve. This suggests that models with sigmoidal shapes are the
most appropriate to use (Blanco et al., 2005; J. Bradley et al., 2004; J. D. Bradley et al., 2007; Hope et al., 2005; Huang, Bradley, et al., 2011;
Huang, Hope, et al., 2011; Marks, 2002b; Tucker et al., 2004). A commonly used sigmoidal form is the logistic regression model, which also has
nice numerical stability properties. The results of this type of approach are not expressed in a closed form as in the case of analytical models but
instead, the model parameters are chosen in a stepwise fashion to define the abscissa of the regression model f(·). However, it is the user's
responsibility to determine whether interaction terms or higher order variables should be added. A solution to ameliorate this problem could be
offered by applying artificial intelligence methods.
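As a hedged illustration of the data-driven, sigmoidal outcome modeling described above (a toy sketch with made-up numbers, not the authors' DREES implementation), the following Python snippet fits a logistic regression NTCP-style model to hypothetical dose-volume and clinical covariates using scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical training data: each row is one patient
# [mean lung dose (Gy), V20 (% volume >= 20 Gy), age (years)]
X = np.array([[12.1, 18.0, 61], [21.4, 33.5, 70], [9.8, 14.2, 55],
              [25.0, 40.1, 66], [17.3, 27.9, 73], [14.6, 22.4, 58]])
# 1 = complication observed, 0 = no complication (illustrative labels only)
y = np.array([0, 1, 0, 1, 1, 0])

# Standardize covariates, then fit the sigmoidal (logistic) dose-response model
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Predicted complication probability for a new, hypothetical treatment plan
new_plan = np.array([[19.0, 30.0, 65]])
print("Predicted NTCP-like probability:", model.predict_proba(new_plan)[0, 1])
```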
Artificial Intelligence Methods
Artificial intelligence techniques (e.g., neural networks, decision trees, support vector machines), which are able to emulate human intelligence
by learning the surrounding environment from the given input data, have also been utilized because of their ability to detect nonlinear complex
patterns in the data. In particular, neural networks were extensively investigated to model post-radiation treatment outcomes for cases of lung
injury (Munley et al., 1999; Su et al., 2005) and biochemical failure and rectal bleeding in prostate cancer (Gulliford, Webb, Rowbottom, Corne, &
Dearnaley, 2004; Tomatis et al., 2012). A rather more robust approach of machine learning methods is support vector machines (SVMs), which
are universal constructive learning procedures based on the statistical learning theory (Vapnik, 1998). For discrimination between patients who
are at low risk versus patients who are at high risk of radiation therapy, the main idea of SVM would be to separate these two classes with ‘hyper-
planes’ that maximize the margin between them in the nonlinear feature space defined by an implicit kernel mapping. Examples of applying these
methods are discussed in our previous work (Issam El Naqa, 2012; I. El Naqa, J. D. Bradley, P. E. Lindsay, A. J. Hope, & J. O. Deasy, 2009; I. El
Naqa et al., 2010).
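A comparable sketch for the SVM approach described above (again a toy example under assumed features and labels, not the published models): patients are separated into low-risk and high-risk classes with a maximum-margin boundary in a kernel-induced nonlinear feature space.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical features per patient: [max dose to organ (Gy), tumour volume (cc)]
X = np.array([[58, 12], [74, 45], [62, 20], [80, 60], [55, 10], [77, 52]])
y = np.array([0, 1, 0, 1, 0, 1])  # 0 = low risk, 1 = high risk (illustrative)

# The RBF kernel implicitly maps inputs to a nonlinear feature space, where the
# SVM finds the maximum-margin separating hyperplane between the two classes.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)

print(clf.predict([[70, 40]]))            # predicted risk class for a new patient
print(clf.decision_function([[70, 40]]))  # signed distance to the margin
```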
Bottom-Top Approaches for Modeling Radiotherapy Response
These approaches utilize first principles of radiation physics and biology to model cellular damage temporally and spatially. Typically, they would
apply advanced numerical methods such as Monte-Carlo (MC) techniques to estimate the molecular spectrum of damage in clustered and not-
clustered DNA lesions (Gbp-1 Gy-1) (Nikjoo, Uehara, Emfietzoglou, & Cucinotta, 2006). The temporal and spatial evolution of the effects from
ionizing radiation can be divided into three phases: physical, chemical, and biological following the multiscale representation in time and space
shown in Figure 2. Different available MC codes aim to emulate these phases to varying extents. A detailed review of many current MC particle
track codes and their potential use for radiobiological outcome modeling is provided in our previous work (I. El Naqa, Pater, & Seuntjens, 2012).
Figure 2. Bottom-top outcome modeling approach representation showing multiscale modeling of tissue (tumour) radiation response along the time and space axes (El Naqa et al., PMB, 2012)
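By way of a toy illustration only (a deliberately simplified stand-in for the dedicated MC track-structure codes cited above), the sketch below samples DNA lesion counts per cell from a Poisson distribution whose mean scales with dose and estimates the fraction of cells left with no unrepaired lesion; the lesion yield and repair probability are assumed, illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def surviving_fraction(dose_gy, n_cells=100_000, lesions_per_gy=35.0, p_repair=0.97):
    """Toy Monte Carlo estimate of the fraction of cells with no unrepaired lesion.

    Assumptions (illustrative only): lesion counts are Poisson with mean
    proportional to dose, and each lesion is repaired independently with
    probability p_repair.
    """
    lesions = rng.poisson(lesions_per_gy * dose_gy, size=n_cells)
    unrepaired = rng.binomial(lesions, 1.0 - p_repair)
    return float(np.mean(unrepaired == 0))

for d in (0.5, 1.0, 2.0):
    print(f"dose {d} Gy -> estimated surviving fraction ~ {surviving_fraction(d):.3f}")
```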
BIG DATA IN RADIOTHERAPY
Constituents of Big Data in Radiotherapy
Big data in radiotherapy could be divided based on its nature into four categories: Clinical, dosimetric, imaging, and biological. These four
categories of radiotherapy big data are described in the following.
Clinical Data
Clinical data in radiotherapy typically refers to cancer diagnostic information (site, histology, stage, grade, etc) and patient-related characteristics
(age, gender, co-morbidities, etc). In some instances, other treatment modalities information (surgery, chemotherapy, hormonal treatment, etc)
would be also classified under this category. The mining of such data could be challenging particularly if the data was in unstructured storage
format, however, this lends new opportunities for applying natural language processing (NLP) techniques to assist in the organization of such
data (Shivade et al., 2013).
Dosimetric Data
This type of data is related to the treatment planning process in radiotherapy, which involves simulating radiation dose distributions using computed tomography imaging; specifically, dose-volume metrics derived from dose-volume histogram (DVH) graphs are frequently extracted to summarize these data. Dose-volume metrics have been extensively studied in the radiation oncology literature for outcomes modeling (Blanco et al., 2005; J. Bradley et al., 2004; Hope et al., 2006; Hope et al., 2005; Levegrun et al., 2001; Marks, 2002b). Typical metrics extracted from the DVH include the volume receiving at least a certain dose (Vx), the minimum dose to the hottest x% of the volume (Dx), and the mean, maximum, and minimum dose. More details are given in our review chapter (J. O. Deasy & El Naqa, 2008). Moreover, we have developed a dedicated software tool called ‘DREES’ for automatically deriving these metrics and modeling radiotherapy response (I. El Naqa et al., 2006).
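To make the Vx and Dx definitions concrete, the following minimal sketch (not the DREES implementation; the voxel doses are random placeholders) computes the two metrics from a flat array of per-voxel doses, assuming equal voxel volumes:

import numpy as np

def vx(voxel_doses_gy, x_gy):
    """V_x: fraction of the structure volume receiving at least x Gy."""
    voxel_doses_gy = np.asarray(voxel_doses_gy, dtype=float)
    return float(np.mean(voxel_doses_gy >= x_gy))

def dx(voxel_doses_gy, x_percent):
    """D_x: minimum dose received by the hottest x% of the structure volume."""
    voxel_doses_gy = np.asarray(voxel_doses_gy, dtype=float)
    return float(np.percentile(voxel_doses_gy, 100.0 - x_percent))

# Hypothetical per-voxel lung doses (Gy) from a treatment plan.
doses = np.random.default_rng(2).gamma(shape=2.0, scale=8.0, size=50000)
print("V20 =", vx(doses, 20.0), "D5 =", dx(doses, 5.0), "mean =", doses.mean())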
Radiomics (Imaging Features)
kV x-ray computed tomography (kV-CT) has been historically considered the standard modality for treatment planning in radiotherapy because
of its ability to provide electron density information for target and normal structure definitions as well as heterogeneous dose calculations (Khan,
2007; Webb, 2001). However, additional information from other imaging modalities could be used to improve treatment monitoring and
prognosis in different cancer types. For example, physiological information (tumour metabolism, proliferation, necrosis, hypoxic regions, etc.)
can be collected directly from nuclear imaging modalities such as SPECT and PET or indirectly from MRI (Condeelis & Weissleder, 2010;
Willmann, van Bruggen, Dinkelborg, & Gambhir, 2008). The complementary nature of these different imaging modalities has led to efforts
toward combining their information to achieve better treatment outcomes. For instance, PET/CT has been utilized for staging, planning, and
assessment of response to radiation therapy (Bussink, Kaanders, van der Graaf, & Oyen, 2011; Zaidi & El Naqa, 2010). Similarly, MRI has been
applied in tumour delineation and for assessing toxicities in head and neck cancers (Newbold et al., 2006; Piet et al., 2008). Moreover,
quantitative information from hybrid-imaging modalities could be related to biological and clinical endpoints; this is a new emerging field
referred to as ‘radiomics’ (Kumar et al., 2012; P. Lambin et al., 2012). In our previous work, we demonstrated the potential of this new field to
monitor and predict response to radiotherapy in head and neck (I. El Naqa, Grigsby, et al., 2009), cervix (I. El Naqa, Grigsby, et al., 2009; Kidd,
El Naqa, Siegel, Dehdashti, & Grigsby, 2012), and lung (Vaidya et al., 2012) cancers, in turn allowing for adapting and individualizing treatment.
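As one simple, hedged illustration of the radiomics idea (first-order intensity features only; real radiomic pipelines also include texture, shape, and wavelet features, and the uptake values below are synthetic placeholders):

import numpy as np

def first_order_features(roi_intensities, bins=64):
    """A few first-order 'radiomic' intensity features of a delineated region of interest."""
    x = np.asarray(roi_intensities, dtype=float).ravel()
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()           # normalized intensity histogram
    return {
        "mean": float(x.mean()),
        "std": float(x.std()),
        "skewness": float(((x - x.mean()) ** 3).mean() / x.std() ** 3),
        "entropy": float(-(p * np.log2(p)).sum()),  # Shannon entropy of the histogram
    }

# Hypothetical PET standardized-uptake values inside a tumour volume (placeholder data).
suv = np.random.default_rng(3).lognormal(mean=1.0, sigma=0.4, size=20000)
print(first_order_features(suv))

Such features, extracted per patient, then become predictor columns in the outcome models discussed earlier.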
Biological Markers
A biomarker is defined as “a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathological
processes, or pharmacological responses to a therapeutic intervention (Group, 2001).” Biomarkers can be categorized based on the biochemical
source of the marker into exogenous or endogenous biomarkers.
Exogenous biomarkers are based on introducing a foreign substance into the patient’s body such as those used in molecular imaging as discussed
above. Conversely, endogenous biomarkers can further be classified as (1) ‘expression biomarkers,’ measuring changes in gene expression or
protein levels, or (2) ‘genetic biomarkers,’ based on variations in the underlying DNA genetic code of tumours or normal tissues. Biomarker
measurements are typically based on tissue or fluid specimens, which are analyzed using molecular biology laboratory techniques (I. El Naqa,
Craft, Oh, & Deasy, 2011).
Expression Biomarkers
Expression biomarkers are the result of gene expression changes in tissues or bodily fluids due to the disease or normal tissues’ response to
treatment (Mayeux, 2004). These biomarkers can be further divided into single-parameter (e.g., prostate-specific antigen (PSA) levels in blood
serum) versus bio-arrays. These can be based on disease pathophysiology or pharmacogenetics studies or they can be extracted from several
methods, such as high-throughput gene expression (aka genomics or transcriptomics) (Nuyten & van de Vijver, 2008; Ogawa, Murayama, &
Mori, 2007; Svensson et al., 2006), resulting protein expressions (aka proteomics) (Alaiya, Al-Mohanna, & Linder, 2005; Wouters, 2008), or
metabolites (aka metabolomics) (Spratlin, Serkova, & Eckhardt, 2009; Tyburski et al., 2008). We will discuss examples from high-throughput
gene expression (genomics) using RNA microarrays and protein expression (proteomics) analysis using mass spectroscopy.
An RNA microarray is a multiplex technology that allows analyzing thousands of gene expressions from multiple samples at the same time, in
which short nucleotides in the array hybridize to the sample, which is subsequently quantified by fluorescence (Schena, Shalon, Davis, & Brown,
1995). Klopp et al. used microarray profiling of pretreatment biopsy samples to identify a set of 58 genes that were differentially expressed between cervix cancer patients with and without recurrence (Klopp et al., 2008). Another interesting set of RNA markers includes microRNAs
(miRNAs), which are a family of small non-coding RNA molecules (~22 nucleotides), each of which can suppress the expression of hundreds of
protein-coding genes (‘targets’). Specific miRNAs can control related groups of pathways. MicroRNAs thereby comprise a particularly powerful part of the cellular gene expression control system and are particularly attractive as potential biomarkers. We have developed a machine learning
algorithm for detecting miRNA targets (X. Wang & El Naqa, 2008) and showed that the miR-200 miRNA cluster could be used as a prognostic marker in advanced ovarian cancer (Hu et al., 2009). In another example of the value of miRNAs as biomarkers, the expression levels of several miRNAs (miR-137, miR-32, miR-155, let-7a) have been correlated with poor survival and relapse in non-small cell lung cancer (Yu et al., 2008).
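A common analysis step behind such microarray findings is a gene-by-gene differential expression screen with multiple-testing correction. The following minimal sketch (synthetic data and group sizes; not the analysis of the cited studies) applies a per-gene t-test followed by a Benjamini-Hochberg adjustment, assuming NumPy and SciPy are available:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Hypothetical expression matrix: 500 genes x 40 samples (20 with recurrence, 20 without).
expr = rng.normal(size=(500, 40))
labels = np.array([1] * 20 + [0] * 20)

# Per-gene two-sample t-test between the two patient groups.
t_stat, p = stats.ttest_ind(expr[:, labels == 1], expr[:, labels == 0], axis=1)

# Benjamini-Hochberg adjustment, a common choice for controlling the false discovery rate.
m = len(p)
order = np.argsort(p)
bh = p[order] * m / np.arange(1, m + 1)
q = np.empty(m)
q[order] = np.minimum(1.0, np.minimum.accumulate(bh[::-1])[::-1])
print("genes with q < 0.05:", int((q < 0.05).sum()))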
Mass spectroscopy is an analytical technique for the determination of the molecular composition of a sample (Sparkman, 2000). This tool is the
main vehicle for large scale protein profiling also known as proteomics (Twyman, 2004). Allal et al. applied proteomics to study radioresistance
in rectal cancer. The study identified tropomodulin, heat shock protein 42, beta-tubulin, annexin V, and calsenilin as biomarkers of radioresistance, and keratin type I, notch 2 protein homolog, and DNA repair protein RAD51L3 as biomarkers of radiosensitivity (Allal et al., 2004). Zhu et al. applied
proteomics to study tumour response in cervical cancer. They found that increased expression of S100A9 and galectin-7, and decreased
expression of NMP-238 and HSP-70 were associated with significantly increased local response to concurrent chemoradiotherapy in cervical
cancer (Zhu et al., 2009). In our previous work, we demonstrated the feasibility of applying bioinformatics methods for proteomics analysis of
limited data in radiation-induced lung injury post- radiotherapy in lung cancer patients (J. H. Oh, Craft, Townsend, et al., 2011).
Genetic Variant Markers
The inherent genetic variability of the human genome is an emerging resource for studying predisposition to cancer and the variability of patient response to therapeutic agents. These variations in the DNA sequences of humans, in particular single-nucleotide polymorphisms (SNPs), have been shown to elucidate complex disease onset and response in cancer (Erichsen & Chanock, 2004). Methods based on the candidate gene approach and high-throughput genome-wide association studies (GWAS) are currently being heavily investigated to analyze the functional effect of
SNPs in predicting response to radiotherapy (Jan Alsner, Christian Nicolaj Andreassen, & Jens Overgaard, 2008; Andreassen & Alsner, 2009; C.
M. L. West et al., 2007). There are several ongoing SNP genotyping initiatives in radiation oncology, including the pan-European GENEPI
project (Baumann, Hölscher, & Begg, 2003), the British RAPPER project (Burnet, Elliott, Dunning, & West, 2006), the Japanese RadGenomics
project (Iwakawa et al., 2006), and the US Gene-PARE project (Ho et al., 2006). An international consortium has also been established to
coordinate and lead efforts in this area (C. West & Rosenstein, 2010). Examples of this effort include the identification of SNPs related to
radiation toxicity in prostate cancer treatment (S. L. Kerns et al., 2010; Sarah L. Kerns et al.).
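At its simplest, a single-SNP association test compares genotype counts between patients with and without an endpoint; a GWAS repeats this genome-wide with stringent multiple-testing correction. A minimal sketch with invented genotype counts, assuming SciPy is available:

from scipy.stats import chi2_contingency

# Hypothetical genotype counts (AA, Aa, aa) for patients without and with late toxicity.
table = [
    [120, 60, 20],  # no toxicity
    [40, 45, 15],   # toxicity
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3g}")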
Systems Radiobiology
Engineering-inspired systems approaches hold great potential for integrating heterogeneous big data within radiotherapy. Systems biology has emerged as a new field in the life sciences that applies the systematic study of complex interactions to biological systems (Alon, 2007), but its application to radiation oncology, despite this noted potential, has unfortunately been limited to date (Issam El Naqa, 2012;
Feinendegen, Hahnfeldt, Schadt, Stumpf, & Voit, 2008). Recently, Eschrich et al. presented a systems biology approach for identifying
biomarkers related to radiosensitivity in different cancer cell lines using linear regression to correlate gene expression with survival fraction
measurements (Eschrich et al., 2009). However, such a linear regression model may lack the ability to account for higher order interactions
among the different genes and may neglect the expected hierarchical relationships in the signal transduction of the highly complex radiation response. It has been noted in the literature that molecular interactions could be modeled using graphs of network connections, as in power line grids. In this case, radiobiological data can be represented as a graph (network) where the nodes represent genes or proteins and the edges may represent similarities or interactions between these nodes. We have utilized such an approach based on Bayesian networks for modeling
dosimetric radiation pneumonitis relationships (Jung Hun Oh & El Naqa, 2009) and more recently in predicting local control from biological and
dosimetric data as presented in the example of Figure 3 (J. H. Oh, Craft, Al Lozi, et al., 2011).
Figure 3. A systems-based radiobiology approach. Top: A
Bayesian network with probability tables for combined
biomarker proteins and physical variables for modelling local
tumour control in lung cancer. Bottom: The binning
boundaries for each variable. (Oh et al., PMB, 2011)
In the more general realm of informatics, this systems approach could be represented as part of a feedback treatment planning system, as was shown in Figure 1, in which an informatics-based understanding of heterogeneous variable interactions could be used as an adaptive learning process to improve outcomes modeling and the personalization of radiotherapy regimens.
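As a small, hedged sketch of the graph representation described above (a thresholded co-expression network rather than the Bayesian network of the cited work; gene names, expression values, and the threshold are all invented, and the networkx package is assumed to be installed):

import numpy as np
import networkx as nx

rng = np.random.default_rng(5)
genes = [f"gene_{i}" for i in range(8)]            # hypothetical node names
expr = rng.normal(size=(8, 30))                    # hypothetical expression across 30 patients
corr = np.corrcoef(expr)

# Build a network whose nodes are genes/proteins and whose edges encode strong co-expression.
G = nx.Graph()
G.add_nodes_from(genes)
for i in range(len(genes)):
    for j in range(i + 1, len(genes)):
        if abs(corr[i, j]) > 0.5:                  # arbitrary illustrative threshold
            G.add_edge(genes[i], genes[j], weight=float(corr[i, j]))

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

Dosimetric and clinical variables could be added as additional nodes, which is the spirit of the Bayesian network shown in Figure 3.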
Radiotherapy Warehouses and Databases
The recent progress in imaging and biotechnology techniques has provided new opportunities to reshape our understanding of radiation physics and biology and potentially to improve the quality of care for radiation oncology patients (Bentzen, 2008). However, this is also accompanied by new challenges in archiving, visualizing, and analyzing tremendous heterogeneous datasets of clinical characteristics, dosimetry, imaging, and molecular data in a clinical setting.
The group at Johns Hopkins developed a web-based infrastructure for integrating outcomes with treatment planning information called Oncospace (McNutt, 2013). Moreover, the group at Maastricht has developed, in collaboration with Siemens, a data warehouse with automated tools for feature extraction to integrate different medical data sources for radiotherapy clinical trials. They tested the performance of this warehouse against a manual collection process for non-small cell lung cancer and rectal cancer, using 27 patients per disease. They found that the average time per case to collect the data manually was almost double that of using the warehouse tools (Roelofs et al., 2013). At our institution, we developed a clinical protocol for generating big data for lung cancer patients who receive radiotherapy, which we denoted the “Lung Cancer Jamboree.” The protocol aims to collect imaging scans and blood samples at five different time points: (1) pre-treatment, (2) mid-treatment, (3) end-treatment, (4) at 3-months follow-up, and (5) at 6-months follow-up. The patients are assessed for the onset of radiation-induced lung injury according to the NCI Common Terminology Criteria for Adverse Events (CTCAE) v4.0 as the designated clinical endpoint for outcome modeling. The different data types in this protocol, including clinical data, imaging, and biological markers, are collected and analyzed as
represented in the schematics of Figure 4.
Figure 4. Description showing the lung jamboree protocol for
investigating radiation-induced lung injury
(pneumonitis/fibrosis) by using a big data approach of
clinical, dosimetric, imaging, and biomarkers. The top image
shows the protocol longitudinal schema and the bottom one
shows sample imaging and biomarkers data.
Multi-institutional groups in the USA and elsewhere, such as the Radiation Therapy Oncology Group (RTOG), have assembled bioinformatics
groups to facilitate the development of personalized predictive models for radiation therapy guidance. The data is collected from specific
characteristics of patients and treatments and is integrated with clinical trial databases, which include clinical, dosimetric, as well as biological
information through its biospecimen resource. Similarly, European groups have launched an initiative called the Euregional Computer Assisted
Theragnostics project (EuroCAT) in the Meuse-Rhine region to address the same problem with specific goals of developing a shared database
among participating European institutions that would include medical characteristics of cancer patients and identify suitable candidates for
multi-institutional clinical trials.
ISSUES AND RECOMMENDATIONS
Data Sharing
Data sharing remains an issue for both technical and non-technical reasons (Sullivan et al., 2011). Therefore, the Quantitative Analyses of Normal Tissue Effects in the Clinic (QUANTEC) consortium has suggested that cooperative groups adopt a policy of anonymizing clinical trials data and making these data publicly accessible after a reasonable delay. This delay would enable publication of all the investigator-driven, planned studies while encouraging the establishment of key databanks of linked treatment planning, imaging, and outcomes data (Joseph O. Deasy et al., 2010). An alternative approach is to apply rapid learning as suggested by the group at Maastricht, in which innovative information technologies are developed that support semantic interoperability and enable distributed learning and data sharing without the need for the data to leave the hospital (Philippe Lambin et al., 2013). An example of multi-institutional data sharing was developed by the group together with the Policlinico Universitario Agostino Gemelli in Rome, Italy (Gemelli) (Roelofs et al., 2014).
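The distributed-learning idea can be illustrated schematically: each institution fits a model locally and shares only model parameters, never patient records. The sketch below is a simplified federated averaging of logistic-regression coefficients on synthetic local data, assuming scikit-learn is available; it is not the rapid-learning implementation of the cited work:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)

def local_fit(n_patients):
    """Each hospital fits a model on its own (synthetic) data and shares only coefficients."""
    X = rng.normal(size=(n_patients, 4))
    y = (X @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(scale=0.5, size=n_patients) > 0).astype(int)
    model = LogisticRegression().fit(X, y)
    return model.coef_.ravel(), model.intercept_[0], n_patients

sites = [local_fit(n) for n in (120, 80, 200)]      # three hypothetical institutions
weights = np.array([n for _, _, n in sites], dtype=float)
coef = np.average([c for c, _, _ in sites], axis=0, weights=weights)
intercept = np.average([b for _, b, _ in sites], weights=weights)
print("pooled coefficients:", coef.round(2), "intercept:", round(float(intercept), 2))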
Imaging Transfer and Retrieval Technologies
DICOM (Digital Imaging and Communications in Medicine) version 3.0 is currently the main standard adopted by radiotherapy vendors as part
of the adherence to the “Integrating the Healthcare Enterprise” in Radiation Oncology (IHE-RO). A supplement has been added to incorporate
specific structures or objects related to radiotherapy called DICOM-RT, which includes organ contours and radiation dosimetry information. CT
images used for simulation and patient setup are typically stored in vendors’ local databases using DICOM-RT formats. Additional diagnostic
imaging data is stored in picture archiving and communication system (PACS), which provides tools to manage the storage and distribution of
images. The use of compression techniques is necessary to accommodate the continuously evolving size of stored images including 4D multi-slice
CT, PET, and different MRI pulse sequences (Strintzis, 1998). Despite the fact that storage cost has dropped over the years, the question of using
lossy compression methods has been re-opened given the exponential increase in image use in oncology (Koff & Shulman, 2006). A rather
interesting technology for retrieval of medical images is the use of content-based image retrieval (CBIR) technologies, which offer computerized
solutions that aim to query images for diagnostic or therapeutic information based on the content or extracted features of the images rather than
their textual annotation (Issam El Naqa & Yang, 2014).
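As a brief, hedged sketch of working with DICOM-RT objects programmatically (assuming the pydicom package is installed; the file names below are placeholders for files exported from a planning system or PACS):

import pydicom

# Hypothetical file paths; real exports would come from the treatment planning system or PACS.
dose = pydicom.dcmread("plan_rtdose.dcm")
struct = pydicom.dcmread("plan_rtstruct.dcm")

if dose.Modality == "RTDOSE":
    # Dose grid in Gy: stored integer values times the DICOM-RT Dose Grid Scaling factor.
    dose_gy = dose.pixel_array * float(dose.DoseGridScaling)
    print("dose grid shape:", dose_gy.shape, "max dose:", dose_gy.max())

if struct.Modality == "RTSTRUCT":
    # Organ and target contours are named in the Structure Set ROI Sequence.
    print([roi.ROIName for roi in struct.StructureSetROISequence])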
Information Technology
Information technologies, whether hardware (e.g., disk space, memory, processors, networking, etc.) or software (e.g., database management systems, indexing, compression, querying, visualization, etc.), are at the heart of any successful utilization of big data for outcome modeling in radiotherapy. The field of radiation oncology is founded on using advanced treatment delivery and imaging technologies; however, its informatics infrastructure has largely been limited to meeting the clinical needs of record and verification. The presence of such infrastructure could be utilized as a starting step towards building a big data platform in radiation oncology, as described in the Oncospace or EuroCAT examples
mentioned above. However, additional information may need to be accessed from radiology (radiomics) and/or pathology
(genomics/proteomics). This would indicate the need for a comprehensive approach with multiple collaborative partners at the medical
institution to play an active role in sharing resources or access to resources using standards such as Health Level Seven (HL7).
Web Resources for Radiobiology
As of today, there are no dedicated web resources for big data studies in radiation oncology. Nevertheless, radiotherapy biological markers
studies can still benefit from existing bioinformatics resources for pharmacogenomic studies that contain databases and tools for genomic,
proteomic, and functional analysis as reviewed by Yan (Yan, 2005). For example, the National Center for Biotechnology Information (NCBI) site
hosts databases such as GenBank, dbSNP, Online Mendelian Inheritance in Man (OMIM), and genetic search tools such as BLAST. In addition,
the Protein Data Bank (PDB) and the program CPHmodels are useful for protein structure three-dimensional modeling. The Human Genome
Variation Database (HGVbase) contains information on physical and functional relationships between sequence variations and neighboring
genes. Pattern analysis using PROSITE and Pfam databases can help correlate sequence structures to functional motifs such as phosphorylation
(Yan, 2005). Biological pathway construction and analysis is an emerging field in computational biology that aims to bridge the gap between biomarker findings in clinical studies and the underlying biological processes. Several public databases and tools are being established for annotating and storing known pathways, such as the KEGG and Reactome projects, or commercial ones such as IPA or MetaCore (Viswanathan,
Seto, Patil, Nudelman, & Sealfon, 2008). Statistical tools are used to properly map data from gene/protein differential experiments into the
different pathways such as mixed effect models (L. Wang, Zhang, Wolfinger, & Chen, 2008) or enrichment analysis (Subramanian et al., 2005).
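The enrichment idea can be made concrete with a simple over-representation test: given a list of differentially expressed genes, ask whether a pathway is hit more often than expected by chance. The numbers below are purely illustrative, and SciPy is assumed to be available:

from scipy.stats import hypergeom

# Hypothetical numbers: 12 of our 150 differentially expressed genes fall in a 300-gene pathway,
# out of 20000 genes measured on the array.
N_genes, pathway_size, hits_in_list, list_size = 20000, 300, 12, 150

# P(observing at least hits_in_list pathway genes in a random list of list_size genes).
p_enrich = hypergeom.sf(hits_in_list - 1, N_genes, pathway_size, list_size)
print(f"enrichment p-value = {p_enrich:.3g}")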
Protecting the Confidentiality and Privacy of Clinical Phenotype Data
Issues of confidentiality and patient privacy are important to any clinical research, including big data. They can emerge in single-institutional or multi-institutional studies, nationally or internationally. Confidentiality requirements also apply in scenarios using cloud online storage systems or when sharing data across borders (e.g., teleconferencing, telemedicine, etc.), whether for consultation, quality assurance, or research purposes.
In the case of big data research, QUANTEC offered a solution to radiotherapy digital data (treatment planning, imaging, and outcomes data)
accessibility by asking cooperative groups to adopt a policy of anonymizing clinical trials data and making the data publicly accessible after a
reasonable delay (Joseph O. Deasy et al., 2010). With regard to blood or tissue samples, no recommendations were made; however, by extending the same notion, gene or protein expression assay measurements could be made available under the same umbrella, while raw specimen data could be accessed from the biospecimen resource. For example, in the RTOG biospecimen standard operating procedure (SOP), it is highlighted
that biospecimens received by the RTOG Biospecimen Resource are de-identified of all patient health identifiers when enrolled in an approved
RTOG study. Each patient being enrolled by an institution has to qualify and consent to be part of a particular study before being assigned a case
and study ID by the RTOG Statistical Center. No information containing specific patient health identifiers is maintained by the Resource
Freezerworks database, which is primarily an inventory and tracking system. In addition, information related to medical identifiers and any code
lists could be removed completely from the dataset after a certain period, say 10 years or so. Moreover, it has been argued that the current measures of the Health Insurance Portability and Accountability Act (HIPAA), with its 18 data elements, may not be sufficient, and that techniques based on research in privacy-preserving data mining, disclosure risk assessment, data de-identification, obfuscation, and protection may need to be adopted to achieve better protection of confidentiality (Krishna, Kelleher, & Stahlberg, 2007).
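As a minimal illustration of the de-identification step (the field list below is a small invented subset, not the full HIPAA Safe Harbor list, and the record and salt are placeholders):

import hashlib

# Illustrative subset of direct identifiers (the HIPAA Safe Harbor list is longer).
IDENTIFIER_FIELDS = {"name", "mrn", "date_of_birth", "address", "phone"}

def deidentify(record, salt="local-secret"):
    """Drop direct identifiers and replace the record key with a salted one-way hash."""
    clean = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    clean["study_id"] = hashlib.sha256((salt + str(record["mrn"])).encode()).hexdigest()[:12]
    return clean

patient = {"mrn": "12345", "name": "J. Doe", "date_of_birth": "1950-01-01",
           "address": "...", "phone": "...", "stage": "IIIA", "mean_lung_dose_gy": 17.2}
print(deidentify(patient))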
FUTURE RESEARCH DIRECTIONS
The ability to maintain high-fidelity big data for radiotherapy studies remains a major challenge despite the high volume of clinically generated data on an almost daily basis. As discussed above, there have been several ongoing institutional and multi-institutional initiatives, such as the RTOG, the radiogenomics consortium, and EuroCAT, to develop such infrastructure; however, there is plenty of work to be done to overcome data-sharing hurdles and patient confidentiality issues, to address the lack of signaling pathway databases of radiation response, and to develop cost-effective multi-center communication systems that allow the transmission, storage, and querying of large datasets such as images, dosimetry, and biomarker information. The use of NLP techniques is a promising approach for organizing unstructured clinical data. Dosimetry and imaging data can benefit from existing infrastructure for Picture Archiving and Communication Systems (PACS) or other medical image databases. Methods based on the new emerging field of systems radiobiology will continue to grow at a rapid pace, but they could also benefit immensely from the development of specialized radiation response signalling pathway databases analogous to the currently existing pharmacogenomics databases. Data sharing among different institutions is a major hurdle, which could be addressed through cooperative groups or distributed databases by developing, in a cost-effective manner, the necessary informatics and communication infrastructure using open-access resources and through partnership with industry.
CONCLUSION
The recent evolution in radiotherapy imaging and biotechnology has generated an enormous amount of big data that spans clinical, dosimetric, imaging, and biological markers. These data provide new opportunities for reshaping our understanding of radiotherapy response and outcomes modeling. However, the complexity of the data and the variability of tumour and normal tissue responses render advanced informatics and data-mining methods indispensable tools for better delineating the complex interaction mechanisms of radiation, and essentially a cornerstone of “making data dream come true” (Nature Editorial, 2004). At the same time, this evolution poses new challenges for data aggregation, sharing, confidentiality, and analysis. Moreover, radiotherapy data constitutes a unique interface between physics and biology that can benefit from general advances in biomedical informatics research, such as systems biology and available web resources, while still requiring the development of its own technologies to address specific issues related to this interface. Successful application and development of advanced data communication and bioinformatics tools for radiation oncology big data is essential to better predict radiotherapy response, to accompany the other aforementioned technologies, and to usher in significant progress towards the goal of personalized treatment planning and improved quality of life for radiotherapy cancer patients.
This work was previously published in Big Data Analytics in Bioinformatics and Healthcare edited by Baoying Wang, Ruowang Li, and
William Perrizo, pages 164-185, copyright year 2015 by Medical Information Science Reference (an imprint of IGI Global).
REFERENCES
Alaiya, A., Al-Mohanna, M., & Linder, S. (2005). Clinical cancer proteomics: Promises and pitfalls. Journal of Proteome Research ,4(4), 1213–
1222. doi:10.1021/pr050149f
Allal, A. S., Kähne, T., Reverdin, A. K., Lippert, H., Schlegel, W., & Reymond, M.-A. (2004). Radioresistance-related proteins in rectal
cancer. Proteomics , 4(8), 2261–2269. doi:10.1002/pmic.200300854
Alon, U. (2007). An introduction to systems biology: design principles of biological circuits . Boca Raton, FL: Chapman & Hall/CRC.
Alsner, J., Andreassen, C. N., & Overgaard, J. (2008). Genetic markers for prediction of normal tissue toxicity after radiotherapy. Seminars in Radiation Oncology, 18(2), 126–135. doi:10.1016/j.semradonc.2007.10.004
Andreassen, C. N., & Alsner, J. (2009). Genetic variants and normal tissue toxicity after radiotherapy: A systematic review.Radiotherapy and
Oncology , 92(3), 299–309. doi:10.1016/j.radonc.2009.06.015
Baumann, M., Hölscher, T., & Begg, A. C. (2003). Towards genetic prediction of radiation responses: ESTRO's GENEPI project.Radiotherapy and
Oncology , 69(2), 121–125. doi:10.1016/j.radonc.2003.08.006
Bentzen, S. M. (2008). From cellular to high-throughput predictive assays in radiation oncology: Challenges and opportunities. Seminars in
Radiation Oncology , 18(2), 75–88. doi:10.1016/j.semradonc.2007.10.003
Bentzen, S. M., Constine, L. S., Deasy, J. O., Eisbruch, A., Jackson, A., & Marks, L. B. (2010). Quantitative Analyses of Normal Tissue Effects in
the Clinic (QUANTEC): An introduction to the scientific issues. International Journal of Radiation Oncology, Biology, Physics , 76(3Suppl), S3–
S9. doi:10.1016/j.ijrobp.2009.09.040
Blanco, A. I., Chao, K. S., El Naqa, I., Franklin, G. E., Zakarian, K., Vicic, M., & Deasy, J. O. (2005). Dose-volume modeling of salivary function in
patients with head-and-neck cancer receiving radiotherapy. International Journal of Radiation Oncology, Biology, Physics , 62(4), 1055–1069.
doi:10.1016/j.ijrobp.2004.12.076
Bradley, J., Deasy, J. O., Bentzen, S., & El-Naqa, I. (2004). Dosimetric correlates for acute esophagitis in patients treated with radiotherapy for
lung carcinoma. [pii]. International Journal of Radiation Oncology, Biology, Physics , 58(4), 1106–1113. doi:10.1016/j.ijrobp.2003.09.080
Bradley, J. D., Hope, A., El Naqa, I., Apte, A., Lindsay, P. E., Bosch, W., . . . Deasy, J. O. (2007). A nomogram to predict radiation pneumonitis,
derived from a combined analysis of RTOG 9311 and institutional data. Int J Radiat Oncol Biol Phys, 69(4), 985-992. doi:
10.1016/j.ijrobp.2007.04.077
Burnet, N. G., Elliott, R. M., Dunning, A., & West, C. M. L. (2006). Radiosensitivity, Radiogenomics and RAPPER. Clinical Oncology ,18(7), 525–
528. doi:10.1016/j.clon.2006.05.007
Bussink, J., Kaanders, J. H. A. M., van der Graaf, W. T. A., & Oyen, W. J. G. (2011). PET-CT for radiotherapy treatment planning and response
monitoring in solid tumors. Nat Rev Clin Oncol , 8(4), 233–242. doi:10.1038/nrclinonc.2010.218
Condeelis, J., & Weissleder, R. (2010). In vivo imaging in cancer.Cold Spring Harb Perspect Biol, 2(12), a003848. doi:
10.1101/cshperspect.a003848
Deasy, J. O., Bentzen, S. M., Jackson, A., Ten Haken, R. K., Yorke, E. D., Constine, L. S., . . . Marks, L. B. (2010). Improving Normal Tissue
Complication Probability Models: The Need to Adopt a “Data-Pooling” Culture. International Journal of Radiation
Oncology*Biology*Physics, 76(3, Supplement), S151-S154. doi:10.1016/j.ijrobp.2009.06.094
Deasy, J. O., & El Naqa, I. (2008). Image-based modeling of normal tissue complication probability for radiation therapy.Cancer Treatment and
Research , 139, 215–256. doi:10.1007/978-0-387-36744-6_11
El Naqa, I. (2013). Outcomes Modeling . In Starkschall, G., & Siochi, C. (Eds.), Informatics in Radiation Oncology (pp. 257–275). Boca Raton, FL:
CRC Press, Taylor and Francis.
El Naqa, I., Bradley, J., Lindsay, P. E., Hope, A., & Deasy, J. O. (2009). Predicting Radiotherapy Outcomes using Statistical Learning
Techniques. Physics in Medicine and Biology , 54(18), S9–S30. doi:10.1088/0031-9155/54/18/S02
El Naqa, I., Bradley, J. D., Lindsay, P. E., Blanco, A. I., Vicic, M., Hope, A. J., & Deasy, J. O. (2006). Multi-variable modeling of radiotherapy
outcomes including dose-volume and clinical factors.International Journal of Radiation Oncology, Biology, Physics ,64(4), 1275–1286.
doi:10.1016/j.ijrobp.2005.11.022
El Naqa, I., Bradley, J. D., Lindsay, P. E., Hope, A. J., & Deasy, J. O. (2009). Predicting radiotherapy outcomes using statistical learning
techniques. Phys Med Biol, 54(18), S9-S30. doi: 10.1088/0031-9155/54/18/S02
El Naqa, I., Craft, J., Oh, J., & Deasy, J. (2011). Biomarkers for Early Radiation Response for Adaptive Radiation Therapy . In Li, X. A.
(Ed.), Adaptive Radiation Therapy (pp. 53–68). Boca Baton, FL: Taylor & Francis.
El Naqa, I., Deasy, J. O., Mu, Y., Huang, E., Hope, A. J., & Lindsay, P. E. (2010). Datamining approaches for modeling tumor control
probability. Acta Oncologica (Stockholm, Sweden) , 49(8), 1363–1373. doi:10.3109/02841861003649224
El Naqa, I., Grigsby, P., Apte, A., Kidd, E., Donnelly, E., & Khullar, D. (2009). Exploring feature-based approaches in PET images for predicting
cancer treatment outcomes. Pattern Recognition ,42(6), 1162–1171. doi:10.1016/j.patcog.2008.08.011
El Naqa, I., Pater, P., & Seuntjens, J. (2012). Monte Carlo role in radiobiological modelling of radiotherapy outcomes. Physics in Medicine and
Biology , 57(11), R75–R97. doi:10.1088/0031-9155/57/11/R75
El Naqa, I., Suneja, G., Lindsay, P. E., Hope, A. J., Alaly, J. R., & Vicic, M. (2006). Dose response explorer: An integrated open-source tool for
exploring and modelling radiotherapy dose-volume outcome relationships. Physics in Medicine and Biology , 51(22), 5719–5735.
doi:10.1088/0031-9155/51/22/001
El Naqa, I., & Yang, Y. (2014). The Role of Content-Based Image Retrieval in Mammography CAD . In Suzuki, K. (Ed.),Computational
Intelligence in Biomedical Imaging (pp. 33–53). Springer New York. doi:10.1007/978-1-4614-7245-2_2
Erichsen, H. C., & Chanock, S. J. (2004). SNPs in cancer research and treatment. British Journal of Cancer , 90(4), 747–751.
doi:10.1038/sj.bjc.6601574
Eschrich, S., Zhang, H., Zhao, H., Boulware, D., Lee, J.-H., Bloom, G., & Torres-Roca, J. F. (2009). Systems Biology Modeling of the Radiation
Sensitivity Network: A Biomarker Discovery Platform.International Journal of Radiation Oncology*Biology*Physics, 75(2), 497-505.
Feinendegen, L., Hahnfeldt, P., Schadt, E. E., Stumpf, M., & Voit, E. O. (2008). Systems biology and its potential role in radiobiology. Radiation
and Environmental Biophysics , 47(1), 5–23. doi:10.1007/s00411-007-0146-8
Group, B. D. W. (2001). Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Ther, 69(3), 89-
95. doi: 10.1067/mcp.2001.113989
Gulliford, S. L., Webb, S., Rowbottom, C. G., Corne, D. W., & Dearnaley, D. P. (2004). Use of artificial neural networks to predict biological
outcomes for patients receiving radical radiotherapy of the prostate. Radiotherapy and Oncology , 71(1), 3–12. doi:10.1016/j.radonc.2003.03.001
Hall, E. J., & Giaccia, A. J. (2006). Radiobiology for the radiologist(6th ed.). Philadelphia: Lippincott Williams & Wilkins.
Halperin, E. C., Perez, C. A., & Brady, L. W. (2008). Perez and Brady's principles and practice of radiation oncology (5th ed.). Philadelphia:
Wolters Kluwer Health/Lippincott Williams & Wilkins.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: data mining, inference, and prediction: with 200 full-
color illustrations . New York: Springer.
Ho, A. Y., Atencio, D. P., Peters, S., Stock, R. G., Formenti, S. C., Cesaretti, J. A., . . . Rosenstein, B. S. (2006). Genetic Predictors of Adverse
Radiotherapy Effects: The Gene-PARE project.International Journal of Radiation Oncology*Biology*Physics, 65(3), 646-655.
Hope, A. J., Lindsay, P. E., El Naqa, I., Alaly, J. R., Vicic, M., Bradley, J. D., & Deasy, J. O. (2006). Modeling radiation pneumonitis risk with
clinical, dosimetric, and spatial parameters.Int J Radiat Oncol Biol Phys, 65(1), 112-124. doi: 10.1016/j.ijrobp.2005.11.046
Hope, A. J., Lindsay, P. E., El Naqa, I., Bradley, J. D., Vicic, M., & Deasy, J. O. (2005). Clinical, Dosimetric, and Location-Related Factors to Predict Local Control in Non-Small Cell Lung Cancer. Paper presented at the ASTRO 47th Annual Meeting. Denver, CO.
10.1016/j.ijrobp.2005.07.394
Hu, X., Macdonald, D. M., Huettner, P. C., Feng, Z., El Naqa, I. M., Schwarz, J. K., . . . Wang, X. (2009). A miR-200 microRNA cluster as
prognostic marker in advanced ovarian cancer. Gynecol Oncol, 114(3), 457-464. doi: 10.1016/j.ygyno.2009.05.022
Huang, E. X., Bradley, J. D., Naqa, I. E., Hope, A. J., Lindsay, P. E., Bosch, W. R., . . . Deasy, J. O. (2011). Modeling the Risk of Radiation-Induced
Acute Esophagitis for Combined Washington University and RTOG Trial 93-11 Lung Cancer Patients. Int J Radiat Oncol Biol Phys. doi:
10.1016/j.ijrobp.2011.02.052
Huang, E. X., Hope, A. J., Lindsay, P. E., Trovo, M., El Naqa, I., Deasy, J. O., & Bradley, J. D. (2011). Heart irradiation as a risk factor for
radiation pneumonitis. Acta Oncologica (Stockholm, Sweden) , 50(1), 51–60. doi:10.3109/0284186X.2010.521192
IAEA. (2002). Predictive assays and their role in selection of radiation as the therapeutic modality . IAEA.
Iwakawa, M., Noda, S., Yamada, S., Yamamoto, N., Miyazawa, Y., & Yamazaki, H. (2006). Analysis of non-genetic risk factors for adverse skin
reactions to radiotherapy among 284 Breast Cancer patients. Breast Cancer (Tokyo, Japan) , 13(3), 300–307. doi:10.2325/jbcs.13.300
Jackson, A., Marks, L. B., Bentzen, S. M., Eisbruch, A., Yorke, E. D., Ten Haken, R. K., . . . Deasy, J. O. (2010). The lessons of QUANTEC:
recommendations for reporting and gathering data on dose-volume dependencies of treatment outcome. Int J Radiat Oncol Biol Phys, 76(3
Suppl), S155-160. doi: 10.1016/j.ijrobp.2009.08.074
Jain, K. K. (2007). Cancer biomarkers: Current issues and future directions. Current Opinion in Molecular Therapeutics , 9(6), 563–571.
Joiner, M., & Kogel, A. d. (2009). Basic clinical radiobiology (4th ed.). London: Hodder Arnold.
Kerns, S. L., Ostrer, H., Stock, R., Li, W., Moore, J., Pearlman, A., . . . Rosenstein, B. S. (2010). Genome-wide association study to identify single
nucleotide polymorphisms (SNPs) associated with the development of erectile dysfunction in African-American men after radiotherapy for
prostate cancer. Int J Radiat Oncol Biol Phys, 78(5), 1292-1300. doi: 10.1016/j.ijrobp.2010.07.036
Kerns, S. L., Stock, R., Stone, N., Buckstein, M., Shao, Y., Campbell, C., . . . Rosenstein, B. S. (n.d.). A 2-Stage Genome-Wide Association Study to
Identify Single Nucleotide Polymorphisms Associated With Development of Erectile Dysfunction Following Radiation Therapy for Prostate
Cancer. International Journal of Radiation Oncology*Biology*Physics, (0).
Khan, F. M. (2007). Treatment planning in radiation oncology(2nd ed.). Philadelphia: Lippincott Williams & Wilkins.
Kidd, E. A., El Naqa, I., Siegel, B. A., Dehdashti, F., & Grigsby, P. W. (2012). FDG-PET-based prognostic nomograms for locally advanced cervical
cancer. Gynecol Oncol, 127(1), 136-140. doi: 10.1016/j.ygyno.2012.06.027
Klopp, A. H., Jhingran, A., Ramdas, L., Story, M. D., Broadus, R. R., & Lu, K. H. (2008). Gene expression changes in cervical squamous cell
carcinoma after initiation of chemoradiation and correlation with clinical outcome. International Journal of Radiation Oncology, Biology,
Physics , 71(1), 226–236. doi:10.1016/j.ijrobp.2007.10.068
Koff, D. A., & Shulman, H. (2006). An overview of digital compression of medical images: Can we use lossy image compression in
radiology? Canadian Association of Radiologists Journal , 57(4), 211–217.
Krishna, R., Kelleher, K., & Stahlberg, E. (2007). Patient confidentiality in the research use of clinical medical databases.American Journal of
Public Health , 97(4), 654–658. doi:10.2105/AJPH.2006.090902
Kumar, V., Gu, Y., Basu, S., Berglund, A., Eschrich, S. A., Schabath, M. B., . . . Gillies, R. J. (2012). Radiomics: the process and the
challenges. Magn Reson Imaging, 30(9), 1234-1248. doi: 10.1016/j.mri.2012.06.010
Lambin, P., Rios-Velazquez, E., Leijenaar, R., Carvalho, S., van Stiphout, R. G., Granton, P., . . . Aerts, H. J. (2012). Radiomics: extracting more
information from medical images using advanced feature analysis. Eur J Cancer, 48(4), 441-446. doi: 10.1016/j.ejca.2011.11.036
Lambin, P., Roelofs, E., Reymen, B., Velazquez, E. R., Buijsen, J., & Zegers, C. M. L. (2013). ‘Rapid Learning health care in oncology’ – An
approach towards decision support systems enabling customised radiotherapy. Radiotherapy and Oncology, 109(1), 159–164.
doi:10.1016/j.radonc.2013.07.007
Lehnert, S. (2008). Biomolecular action of ionizing radiation . New York: Taylor & Francis.
Levegrun, S., Jackson, A., Zelefsky, M. J., Skwarchuk, M. W., Venkatraman, E. S., & Schlegel, W. (2001). Fitting tumor control probability models
to biopsy outcome after three-dimensional conformal radiation therapy of prostate cancer: Pitfalls in deducing radiobiologic parameters for
tumors from clinical data.International Journal of Radiation Oncology, Biology, Physics ,51(4), 1064–1080. doi:10.1016/S0360-3016(01)01731-X
Lu, J. J., & Brady, L. W. (2011). Decision making in radiation oncology . Heidelberg, Germany: Springer.
Marks, L. B. (2002). Dosimetric predictors of radiation-induced lung injury. International Journal of Radiation Oncology, Biology,
Physics , 54(2), 313–316. doi:10.1016/S0360-3016(02)02928-0
Moissenko, V., Deasy, J. O., & Van Dyk, J. (2005). Radiobiological Modeling for Treatment Planning . In Van Dyk, J. (Ed.), The Modern
Technology of Radiation Oncology: A Compendium for Medical Physicists and Radiation Oncologists (Vol. 2, pp. 185–220). Madison, WI:
Medical Physics Publishing.
Munley, M. T., Lo, J. Y., Sibley, G. S., Bentel, G. C., Anscher, M. S., & Marks, L. B. (1999). A neural network to predict symptomatic lung
injury. Physics in Medicine and Biology , 44(9), 2241–2249. doi:10.1088/0031-9155/44/9/311
Nagaraj, N. S. (2009). Evolving 'omics' technologies for diagnostics of head and neck cancer. Brief Funct Genomic Proteomic, 8(1), 49-59. doi:
10.1093/bfgp/elp004
Newbold, K., Partridge, M., Cook, G., Sohaib, S. A., Charles-Edwards, E., & Rhys-Evans, P. (2006). Advanced imaging applied to radiotherapy
planning in head and neck cancer: A clinical review. The British Journal of Radiology , 79(943), 554–561. doi:10.1259/bjr/48822193
Nikjoo, H., Uehara, S., Emfietzoglou, D., & Cucinotta, F. A. (2006). Track-structure codes in radiation research. Radiation Measurements , 41(9-
10), 1052–1074. doi:10.1016/j.radmeas.2006.02.001
Nuyten, D. S., & van de Vijver, M. J. (2008). Using microarray analysis as a prognostic and predictive tool in oncology: Focus on breast cancer
and normal tissue toxicity. Seminars in Radiation Oncology , 18(2), 105–114. doi:10.1016/j.semradonc.2007.10.007
Ogawa, K., Murayama, S., & Mori, M. (2007). Predicting the tumor response to radiotherapy using microarray analysis [Review].Oncology
Reports , 18(5), 1243–1248.
Oh, J. H., Craft, J., Al Lozi, R., Vaidya, M., Meng, Y., Deasy, J. O., . . . El Naqa, I. (2011). A Bayesian network approach for modeling local failure
in lung cancer. Phys Med Biol, 56(6), 1635-1651. doi: 10.1088/0031-9155/56/6/008
Oh, J. H., Craft, J. M., Townsend, R. R., Deasy, J. O., Bradley, J. D., & El Naqa, I. (2011). A Bioinformatics Approach for Biomarker Identification
in Radiation-Induced Lung Inflammation from Limited Proteomics Data. Journal of Proteome Research , 10(3), 1406–1415.
doi:10.1021/pr101226q
Piet, D., Frederik De, K., Vincent, V., Sigrid, S., Robert, H., & Sandra, N. (2008). Diffusion-Weighted Magnetic Resonance Imaging to Evaluate
Major Salivary Gland Function Before and After Radiotherapy. International Journal of Radiation Oncology, Biology, Physics .
Roelofs, E., Dekker, A., Meldolesi, E., van Stiphout, R. G. P. M., Valentini, V., & Lambin, P. (2014). International data-sharing for radiotherapy
research: An open-source based infrastructure for multicentric clinical data mining. Radiotherapy and Oncology ,110(2), 370–374.
doi:10.1016/j.radonc.2013.11.001
Roelofs, E., Persoon, L., Nijsten, S., Wiessler, W., Dekker, A., & Lambin, P. (2013). Benefits of a clinical data warehouse with data mining tools to
collect data for a radiotherapy trial. Radiotherapy and Oncology , 108(1), 174–179. doi:10.1016/j.radonc.2012.09.019
Schena, M., Shalon, D., Davis, R. W., & Brown, P. O. (1995). Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA
Microarray. Science , 270(5235), 467–470. doi:10.1126/science.270.5235.467
Shivade, C., Raghavan, P., Fosler-Lussier, E., Embi, P. J., Elhadad, N., Johnson, S. B., & Lai, A. M. (2013). A review of approaches to identifying
patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association. doi:doi:10.1136/amiajnl-
2013-001935
Siegel, R., Ma, J., Zou, Z., & Jemal, A. (2014). Cancer statistics, 2014. CA: a Cancer Journal for Clinicians , 64(1), 9–29. doi:10.3322/caac.21208
Siegel, R., Naishadham, D., & Jemal, A. (2013). Cancer statistics, 2013. CA: a Cancer Journal for Clinicians , 63(1), 11–30. doi:10.3322/caac.21166
Sparkman, O. D. (2000). Mass spectrometry desk reference (1st ed.). Pittsburgh, Pa.: Global View Pub.
Spratlin, J. L., Serkova, N. J., & Eckhardt, S. G. (2009). Clinical applications of metabolomics in oncology: a review. Clin Cancer Res, 15(2), 431-
440. doi: 10.1158/1078-0432.CCR-08-1059
Steel, G. G. (2002). Basic clinical radiobiology (3rd ed.). London: Oxford University Press.
Strintzis, M. G. (1998). A review of compression methods for medical images in PACS. International Journal of Medical Informatics , 52(1-3),
159–165. doi:10.1016/S1386-5056(98)00135-X
Su, M., Miftena, M., Whiddon, C., Sun, X., Light, K., & Marks, L. (2005). An artificial neural network for predicting the incidence of radiation
pneumonitis. Medical Physics , 32(2), 318–325. doi:10.1118/1.1835611
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., & Gillette, M. A. (2005). Gene set enrichment analysis: A knowledge-
based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of
America , 102(43), 15545–15550. doi:10.1073/pnas.0506580102
Sullivan, R., Peppercorn, J., Sikora, K., Zalcberg, J., Meropol, N. J., & Amir, E. (2011). Delivering affordable cancer care in high-income
countries. The Lancet Oncology , 12(10), 933–980. doi:10.1016/S1470-2045(11)70141-3
Svensson, J. P., Stalpers, L. J., Esveldt-van Lange, R. E., Franken, N. A., Haveman, J., Klein, B., . . . Giphart-Gassler, M. (2006). Analysis of gene
expression using gene sets discriminates cancer patients with and without late radiation toxicity. PLoS Med, 3(10), e422. doi:
10.1371/journal.pmed.0030422
Tomatis, S., Rancati, T., Fiorino, C., Vavassori, V., Fellin, G., & Cagna, E. (2012). Late rectal bleeding after 3D-CRT for prostate cancer:
Development of a neural-network-based predictive model.Physics in Medicine and Biology , 57(5), 1399–1412. doi:10.1088/0031-
9155/57/5/1399
Tucker, S. L., Cheung, R., Dong, L., Liu, H. H., Thames, H. D., & Huang, E. H. (2004). Dose-volume response analyses of late rectal bleeding after
radiotherapy for prostate cancer. International Journal of Radiation Oncology, Biology, Physics , 59(2), 353–365.
doi:10.1016/j.ijrobp.2003.12.033
Tyburski, J. B., Patterson, A. D., Krausz, K. W., Slavik, J., Fornace, A. J., Jr., Gonzalez, F. J., & Idle, J. R. (2008). Radiation metabolomics. 1.
Identification of minimally invasive urine biomarkers for gamma-radiation exposure in mice. Radiat Res, 170(1), 1-14. doi: 10.1667/RR1265.1
Vaidya, M., Creach, K. M., Frye, J., Dehdashti, F., Bradley, J. D., & El Naqa, I. (2012). Combined PET/CT image characteristics for radiotherapy
tumor response in lung cancer. Radiother Oncol, 102(2), 239-245. doi: 10.1016/j.radonc.2011.10.014
Viswanathan, G. A., Seto, J., Patil, S., Nudelman, G., & Sealfon, S. C. (2008). Getting Started in Biological Pathway Construction and
Analysis. PLoS Computational Biology , 4(2), e16. doi:10.1371/journal.pcbi.0040016
Wang, L., Zhang, B., Wolfinger, R. D., & Chen, X. (2008). An Integrated Approach for the Analysis of Biological Pathways using Mixed
Models. PLOS Genetics , 4(7), e1000115. doi:10.1371/journal.pgen.1000115
Wang, X., & El Naqa, I. M. (2008). Prediction of both conserved and nonconserved microRNA targets in animals. Bioinformatics, 24(3), 325-
332. doi: 10.1093/bioinformatics/btm595
Webb, S. (2001). The physics of three-dimensional radiation therapy: conformal radiotherapy, radiosurgery, and treatment planning . Bristol,
UK: Institute of Physics Pub.
West, C. M. (1995). Invited review: Intrinsic radiosensitivity as a predictor of patient response to radiotherapy. The British Journal of
Radiology , 68(812), 827–837. doi:10.1259/0007-1285-68-812-827
West, C. M. L., Elliott, R. M., & Burnet, N. G. (2007). The Genomics Revolution and Radiotherapy. Clinical Oncology , 19(6), 470–480.
doi:10.1016/j.clon.2007.02.016
Willmann, J. K., van Bruggen, N., Dinkelborg, L. M., & Gambhir, S. S. (2008). Molecular imaging in drug development. Nat Rev Drug Discov,
7(7), 591-607. doi: 10.1038/nrd2290
Wouters, B. G. (2008). Proteomics: Methodologies and Applications in Oncology. Seminars in Radiation Oncology , 18(2), 115–125.
doi:10.1016/j.semradonc.2007.10.008
Yu, S. L., Chen, H. Y., Chang, G. C., Chen, C. Y., Chen, H. W., & Singh, S. (2008). MicroRNA signature predicts survival and relapse in lung
cancer. Cancer Cell , 13(1), 48–57. doi:10.1016/j.ccr.2007.12.008
Zaidi, H., & El Naqa, I. (2010). PET-guided delineation of radiation therapy treatment volumes: A survey of image segmentation
techniques. European Journal of Nuclear Medicine and Molecular Imaging , 37(11), 2165–2187. doi:10.1007/s00259-010-1423-3
Zhu, H., Pei, H.-, Zeng, S., Chen, J., Shen, L.-, & Zhong, M.- (2009). Profiling Protein Markers Associated with the Sensitivity to Concurrent
Chemoradiotherapy in Human Cervical Carcinoma.Journal of Proteome Research , 8(8), 3969–3976. doi:10.1021/pr900287a
ADDITIONAL READING
Alam, M., Muley, A., Joshi, A., & Kadaru, C. (2014). Oracle NoSQL database: real-time big data management for the enterprise . New York:
McGraw-Hill Education.
Berman, J. J. (2013). Principles of big data: preparing, sharing, and analyzing complex information . Amsterdam: Elsevier, Morgan Kaufmann.
Brady, L. W., & Yaeger, T. E. (2013). Encyclopedia of radiation oncology . Heidelberg: Springer. doi:10.1007/978-3-540-85516-3
Cox, J. D., & Ang, K. K. (2010). Radiation oncology: rationale, technique, results (9th ed.). Philadelphia: Mosby.
Crossborder challenges in informatics with a focus on disease surveillance and utilising big data: Proceedings of the EFMI Special Topic Conference, 27-29 April 2014, Budapest, Hungary. (2014). Washington, DC: IOS Press.
Davis, K., & Patterson, D. (2012). Ethics of big data . Sebastopol, CA: O'Reilly.
Halperin, E. C., Perez, C. A., & Brady, L. W. (2008). Perez and Brady's principles and practice of radiation oncology (5th ed.). Philadelphia:
Wolters Kluwer Health/Lippincott Williams & Wilkins.
Hansen, E. K., & Roach, M. (2010). Handbook of evidence-based radiation oncology (2nd ed.). New York: Springer. doi:10.1007/978-0-387-
92988-0
Holzinger, A. (2014). Biomedical informatics: discovering knowledge in big data . New York: Springer. doi:10.1007/978-3-319-04528-3
Jorgensen, A. (2014). Microsoft big data solutions (1st edition. ed.). Indianapolis, IN: John Wiley and Sons.
Kagadis, G. C., & Langer, S. G. (2012). Informatics in medical imaging . Boca Raton, FL: CRC Press.
Kosaka, M., & Shirahada, K. (2014). Progressive trends in knowledge and system-based science for service innovation . Hershey: Business
Science Reference.
Kudyba, S. (2014). Big data, mining, and analytics: components of strategic decision making . Boca Raton: Taylor & Francis. doi:10.1201/b16666
Kutz, J. N. (2013). Data-driven modeling & scientific computation: methods for complex systems & big data (First edition. ed.). Oxford: Oxford
University Press.
Lane, J. I. (2015). Privacy, big data, and the public good: frameworks for engagement . New York, NY: Cambridge University Press.
Leszczynski, D. (2013). Radiation proteomics: the effects of ionizing and non-ionizing radiation on cells and tissues . New York: Springer.
doi:10.1007/978-94-007-5896-4
Lu, J. J., & Brady, L. W. (2011). Decision making in radiation oncology . Heidelberg: Springer.
Mehta, M. P., Paliwal, B., & Bentzen, S. M. (2005). Physical, chemical, and biological targeting in radiation oncology . Madison, WI: Medical
Physics Pub.
Ratner, B. (2012). Statistical and machine-learning data mining: techniques for better predictive modeling and analysis of big data (2nd ed.). Boca Raton, FL: Taylor & Francis.
Schlegel, W., Bortfeld, T., & Grosu, A. (2006). New technologies in radiation oncology . Berlin, London: Springer. doi:10.1007/3-540-29999-8
Starkschall, G., & Siochi, R. A. C. (2014). Informatics in radiation oncology . Boca Raton: CRC Press.
KEY TERMS AND DEFINITIONS
Genomics: The study of high-throughput gene expression and variants.
Normal Tissue Complications: Side effects due to irradiation.
Systems Radiobiology: The use of engineering inspired network analysis techniques to model radiation response.
Niharika Garg
The NorthCap University, India
ABSTRACT
This chapter describes the relation between automated inventory control and the big data generated by the process. Conversion from a manual to an automated inventory process leads to the generation and management of a very large amount of data. The possible boons and banes of converting an inventory control system to an automated one are discussed in detail. The initial sections explain inventory control and the benefits of automating it. Then the overall architecture of big data and its management is discussed. Finally, the tradeoff between the benefits of using an automated inventory control system and the generation and handling of so much data is discussed.
INVENTORY CONTROL: THE CONCEPT
Inventory Control is the system that involves processing the requisition, managing the inventory, purchasing, and physical inventory reconciliation. The following key objectives
define the design of Inventory Control (Board of Trade of Metropolitan Montreal, 2009):
• Informing about the availability of stocked items and the status of requisition in stock.
• Automated tools that assist in servicing, purchasing, and management of the inventory.
• Improvement in the financial control of the inventory through timely and regular check of the inventory balances with the physical counts.
A set of master tables (user- as well as system-maintained), transaction document types, and offline programs are used to meet the above-mentioned objectives. Reports are also created for the same purposes.
Inventory Control is used to show how much stock is on hand at any one time and to keep track of it. It applies to every item, from raw materials to finished goods. It keeps a check on the stock at all stages of the production process, from purchase through delivery and re-ordering of the stock. An efficient stock control system ensures the right amount of stock at the right time and in the right place.
Manually counting the number of orders and ensuring accurate delivery is highly prone to error because of the sheer number of orders. An automated inventory control system helps to minimize the risk of error. When the system is automated, however, a large amount of continuously flowing data in various formats also enters the picture.
MANUAL INVENTORY CONTROL SYSTEMS
The stock-taking process consists of making an inventory and recording its location and value. It is often an annual exercise - a kind of audit to work out the value of the stock as part of the accounting process.
For any stock control system, the following operations are a must:
1. Tracking Stock Levels: This means tracking the levels of stock items for ordering and re-ordering as demand requires.
Manual log books are maintained when only a few stock items need to be managed and controlled; they record the stock received and the stock issued. When the number of stock items becomes too large, a simple log book no longer works and a more complex system is needed. In this case, each type of stock has a card associated with it that contains a detailed description of the stock, its value, its location, its re-order levels and quantities required, supplier details, and information about the previous stock history.
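A minimal sketch of such a stock card in code (field names and quantities are illustrative placeholders, not a description of any particular system):

from dataclasses import dataclass

@dataclass
class StockCard:
    """One card per stock type: description, value, location, re-order level and quantity."""
    item: str
    unit_value: float
    location: str
    reorder_level: int
    reorder_quantity: int
    on_hand: int = 0

    def receive(self, qty):
        self.on_hand += qty

    def issue(self, qty):
        self.on_hand -= qty

    def needs_reorder(self):
        return self.on_hand <= self.reorder_level

card = StockCard("printer paper", 4.5, "aisle 3", reorder_level=20, reorder_quantity=100)
card.receive(50)
card.issue(35)
print(card.on_hand, "on hand; reorder?", card.needs_reorder())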
Keeping track of and maintaining all the details related to stock or inventory management manually is in fact a daunting task. To reduce the tedium of these tasks and simplify them, manual inventory control systems should be avoided and automated inventory control systems should be encouraged. There are many advantages of automated inventory control systems, as discussed next.
AUTOMATED INVENTORY CONTROL SYSTEM
An automated inventory control system (MIDCOM Data Technologies, Inc., 2014) can be defined as the combination of hardware- and software-based tools for tracking and managing the inventory. The inventory can be of any type or kind, including any good that can be quantified, such as food, clothing, grain, books, equipment, and any kind of item purchased by consumers, retailers, or wholesalers. Modern inventory control systems are based on barcode technology and radio frequency identification (RFID) systems for
the automatic identification of inventory items. Inventory control systems work in real-time using wireless technology to transmit information to a central computer system as
transactions occur.
Automated stock control systems use manual systems as their base, because the process involves functions similar to those in manual systems, but they are much more flexible and allow easy retrieval of information.
As mentioned earlier, barcodes and RFID are the two main ways through which an automated inventory control system can be set up for reading and tracking inventory data. Because RFID, unlike barcode readers, requires no line of sight, RFID technology has an edge over barcodes. Let us discuss the RFID system in more detail for collecting and storing data about inventory items.
An RFID tag is a very small microchip, plus a small antenna, which contains digitally stored information about the particular item. Tags are fixed to the inventory item or to a van or delivery truck. The tag information is collected by an RFID reader, which transmits and receives radio signals to and from the tag. Readers come in many types: they can be fixed or portable, like a handheld device, and vary in size and range. The information that the reader collects is collated and processed using special computer software. Readers can be placed at different locations, such as the manufacturer’s site or a warehouse, to keep continuous track of goods in inventory control.
The use of RFID technology has many advantages over barcode technology, such as a longer reading range, multiple tags read at once, unique identification codes for individual products or groups of products, and writable tags so that information can be updated, for example when the destination location changes. It supports stock maintenance and prevents over-stocking and under-stocking of inventory items by maintaining complete stock control data. It can also check for any security intrusion, such as theft or fake data, and can thus raise an alarm well in time (Mark Austin and John Baras, 2003). For items with a limited shelf life, quality control can also be carried out very effectively by using RFID technology. Figure 1 illustrates the complete process of tracking and managing inventory items using RFID technology.
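As a hedged sketch of what the special computer software might do with the collected reads (the tag identifiers and reader locations below are invented, and real systems would add timestamps, filtering of duplicate reads, and database storage):

from collections import Counter

# Hypothetical stream of (tag_id, reader_location) events from fixed RFID readers.
reads = [
    ("TAG-001", "warehouse"), ("TAG-002", "warehouse"),
    ("TAG-001", "loading_dock"), ("TAG-003", "warehouse"),
]

def current_locations(events):
    """The last location reported for each tag wins; counts give per-location stock levels."""
    last_seen = {}
    for tag, location in events:
        last_seen[tag] = location
    return last_seen, Counter(last_seen.values())

locations, stock_by_location = current_locations(reads)
print(locations)
print(stock_by_location)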
Benefits of Automated Inventory Control Systems
2. Time Savings: Time saving is the biggest advantage, since everything can be handled in an automated, streamlined process with just a few mouse clicks. The time taken in a manual system for managing receipts of product deliveries, removing redundant information, and handling complex invoices is greatly reduced when everything is done digitally with automatic information generation. The process is much faster than the manual one. An automated inventory and procurement system creates a more efficient business model, eliminating unnecessary and time-wasting activities and increasing profitability.
3. Increased Accuracy: With reduced manual intervention, the possibility of errors is also much lower. Changes are reflected in the inventory instantly, so an up-to-date inventory count is always available to clients or customers. Moreover, historical data helps in making strategic decisions for procuring standardized products and in making purchasing decisions efficiently.
4. Enhanced Negotiations: An automated inventory and procurement solution can enhance negotiations by providing a better platform for comparing prices and discount information and obtaining better payment terms. With instant availability of data on customers' spending patterns, purchasing staff can make better fact-based decisions and maintain more accurate inventory counts.
5. Increased Compliance: An automated inventory control system standardizes the procurement process. Because the process is standardized, multiple tasks are carried out smoothly and efficiently. Data records also follow a fixed standard format, which makes them easier to understand and use.
6. Gain a Competitive Edge: For firms of any size, large or small, with multiple tasks performed many times a day, the automated process provides a competitive edge over the manual one.
In today's technology-savvy world, a new trend in inventory management is labelling each item with a unique QR code (quick response code), a two-dimensional, machine-readable code containing information about the item to which it is attached. Smartphones can be used to track and read this kind of inventory label and to keep information about the count and movement of items. QR codes offer fast readability and greater storage capacity compared with standard UPC barcodes.
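As an illustration of QR-code labelling, the sketch below uses the third-party Python qrcode package (an assumption; any QR library would do) to encode a hypothetical item record into a printable label image.

import json
import qrcode  # third-party package: pip install qrcode[pil]

# Hypothetical item record to embed in the label.
item = {"sku": "SKU-42", "desc": "Packaged rice 5kg",
        "batch": "B2016-07", "reorder_level": 50}

# Encode the record as JSON and render it as a QR image.
img = qrcode.make(json.dumps(item))
img.save("SKU-42_label.png")  # print and attach to the item or carton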
Considering the advantages mentioned above, overall efficiency will certainly increase; at the same time, we have to find solutions for the very large volume of data generated during the complete inventory control process. With so many flexible ways and instruments available to read and track information about inventory items, the generation and management of this much data also needs to be considered.
INVENTORY MANAGEMENT SOFTWARE AND GENERATION OF BIG DATA
All inventory management software packages (Moskowitz, 2010) have common modules used to handle inventory. Their functions (Harris, 2005) include maintaining a balance between overstocked and understocked inventory; keeping track of inventory from the manufacturer's site to the customer via data warehouses and distribution centers; reordering to replace spoiled and obsolete products; keeping track of product sales and inventory levels; avoiding missed sales due to out-of-stock situations; and keeping inventory up to date.
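As a minimal illustration of the reorder logic such modules implement, the following Python sketch (with hypothetical item fields and thresholds) checks stock against a reorder level and returns a replenishment quantity; it is not taken from any particular product.

def check_reorder(on_hand, reorder_level, reorder_qty, on_order=0):
    """Return the quantity to reorder, or 0 if stock is sufficient.

    on_hand       -- units currently in stock
    reorder_level -- threshold below which a new order should be placed
    reorder_qty   -- standard replenishment lot size
    on_order      -- units already ordered but not yet received
    """
    if on_hand + on_order <= reorder_level:
        return reorder_qty
    return 0

# Hypothetical example: 30 units in stock, reorder level 50, lot size 100.
print(check_reorder(on_hand=30, reorder_level=50, reorder_qty=100))  # 100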
Many popular software products are available that give a complete solution for automated inventory control management, for example Crib Master Client, Data Enterprises' ATICTS, and simpler applications such as Stitch Labs, Veeqo, intraKR Inventory, and many more.
The complete solution provided by both hardware and software also generates a lot of inventory management data. This leads to the continuous generation of a pool of data in various formats, and with it come the challenges and advantages of using Big Data. In the next section, we give a brief introduction to Big Data and its importance in inventory control within the supply chain process.
INTRODUCTION OF BIG DATA
Data that has a volume greater than a petabyte and continues to grow in variety is known as Big Data. A petabyte is 1024 terabytes, and one terabyte is 1024 gigabytes (Sutardja, 2014). According to surveys, organizations find that large data sets are not always complex and, similarly, small data sets are not always simple. So it is not only size that decides whether data is "big"; rather, the complexity of the data set is the deciding factor. The three V's associated with defining the properties of big data are Volume, Velocity, and Variety (NYU School of Law, 2014). Smartphones, RFID readers, social networking websites, and sensor networks are all adding to a huge number of autonomous data sources. These kinds of devices continuously generate huge amounts of data without human intervention, which in turn increases the velocity of data aggregation and processing, and a high variety of data types is being generated. As a result of this massive generation of data, a fourth property, Veracity, is also included to describe big data; it refers to the trust and uncertainty in the data. The data is growing exponentially (Dumbill, 2012).
Big Data means extremely large data sets that enable companies to sense, analyze, and better respond to market change. These data sets are usually larger than a petabyte and involve data from disparate sources: structured and unstructured data, sensor data, image data, and other forms of visualization.
• Barcode systems were also responsible for generating data, but RFID (Radio Frequency Identification) is producing 1000 times more (Doug, February 2001).
• Walmart handled around 10 million cash register transactions, equivalent to about 5000 items per second, on Black Friday in 2012 (BENTONVILLE, November 2012).
• United Parcel Service receives an average of 39.5 million tracking requests from customers per day (Thomas and Jill, May 2013).
• VISA processes more than 172,800,000 card transactions each day (Bitcoinwiki 2015) .
• 500 million tweets are sent per day. That's more than 5,700 tweets per second (Craig Smith, 2015).
• Facebook has more than 1.15 billion active users generating social interaction data (Jose Martinez, October 2012).
• More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones (ITU, 2015).
• Hospitals are using big data to analyze patient medical records so that diseases can be predicted more accurately and in good time.
• Insurance companies are also using big data to analyze the many home insurance applications they receive, which can then be processed immediately, benefiting the individuals concerned.
Big Data as a Problem
Today, most companies feel that they do not have a big data (dirty data) problem. Figure 2 shows the reported familiarity with big data: 12% of industry respondents consider big data a problem, with the figure at 23% among IT respondents. While data volumes are growing, the velocity of data is accelerating, and the variety of data is increasing, the larger top-of-mind issue is that companies cannot use the data that they have today. While they are intrigued about how to use new forms of data to solve the issues of demand and supply volatility, few know how to get started (Manyika, Chui, Brown, Bughin, Dobbs, Roxburgh, & Byers, 2011).
Source: Supply Chain Insights LLC, Big Data Survey (May–June 2013).
Root Causes
o Unstructured Data: This type of data comes from unorganized sources and cannot be easily interpreted by traditional data models and databases. Sources of this data include social media and social networking sites such as Twitter and Facebook.
o Multi-Structured Data: This type of data consists of a variety of data types and formats. Its main source is the interaction between humans and machines, such as web log data, which is a combination of text and visual images. Social networks and web applications are the major sources of this data.
• Once the problem was solved, nobody hit the stop button on the data collection and unused and unnecessary data continued to accumulate.
Traditional database architectures are not suitable for supporting new unstructured data. As new formats of data, such as unstructured and multi-structured data, are now being generated, current software built on relational databases will become obsolete. Different forms of data offer a realm of new business challenges and opportunities. For companies new to analytics, these data forms are often called "dirty" when they are really just "different": they do not fit into conventional relational database applications, and the data often has different context and attributes. So while people want to use new data forms, they struggle with how to get started. As shown in Figure 3, the ability to use big or dirty data, at 43%, is the top challenging factor in supply chains.
Figure 3. Top level elements of supply chain
Source: Supply Chain Insights LLC, Big Data Survey (May–June 2013).
Big Data as an Opportunity
New formats of unstructured and multi-structured data are generated from various sources: social media, sensor transmissions, GPS (global positioning systems), emails, blogs, reviews and ratings, and the Internet of Things. All of this data offers new opportunities to markets once the requirements are analyzed. Big data, i.e. clean data, represents a new opportunity, but seizing it requires a new way of handling data and new leadership, as well as new techniques and technologies. It can start new models of business and drive new channel opportunities (Agrawal & Abbadi, 2011). However, big data is not valuable for its own sake; its utility should be aligned to the objectives of the business, and the focus should be on small, iterative projects rather than big ones. It requires innovation. To move forward, companies need to embrace new technologies and redesign processes; it is not a case of stuffing new forms of data into old processes. Companies focused on clean data today are in a better position. The benefits of using analytical techniques on big data are as follows:
1. The root causes of failures, issues and defects can be easily determined in real time by analyzing the big data and thus billions of dollars can be saved.
2. The courier and package delivery services are using the analysis on big data for better optimization of the routes to get shortest paths.
3. Analyze millions of SKUs to determine prices that maximize profit and clear inventory.
4. At the shopping centers and retail shops, the data is used to generate coupons based on the analysis of customers' previous and current purchases.
5. Mobile communication companies are also analyzing the data to understand the need and data usage requirements of the users so that best offers can be offered to them.
6. Recalculate entire risk portfolios in minutes.
7. Fraud behaviors can be detected quickly by using the stream analysis and data mining algorithms.
8. Employee data is analyzed to better understand engagement, support better decisions, optimize processes, detect errors and faults, and prevent threats.
Tools for Big Data
A major concern these days is performing business analysis and reporting on huge data. A common example is POS (Point of Sale) data, which generates more data volume than any ERP system. The data is stored in a simple and organized way, such as relational database tables, but the major problem is that there are thousands of products and hundreds of days, which effectively makes it "big data". The size of the data generated annually is large even when compared with the index of the World Wide Web.
In these cases conventional databases can handle the data only up to a limit, and speed is also affected. Even basic data manipulation operations such as joins, aggregation, and filtering run into problems. Only by using the right tools to store and process the data can the analytic processes be made more intelligent and useful for the business.
The implementation of big data is centered around writing complex programs in tools such as Hive and Pig, and then processing and executing these programs using the Hadoop MapReduce framework on huge volumes of data across different nodes. Hadoop is a framework for distributed processing of large data sets. Hadoop uses its own distributed file system, which makes the data available to the various computing nodes.
Figure 4 shows the step-by-step procedure of how Big Data is processed using the Hadoop framework.
Source: Infosys Research.
The Process
Step 1: Loading of data to HDFS. Loading the data to HDFS is a two-step procedure: first, the data is extracted from multiple sources; second, it is uploaded to HDFS. Data such as web logs or transactional data is extracted from the sources using tools like Sqoop and, for uploading, is distributed into multiple files.
Step 2: Processing of the input files. This step involves processing the input files received from Step 1. Map and Reduce operations are performed in this step to process the input files and generate the required output (a minimal mapper/reducer sketch in Python follows after these steps).
Step 3: Extraction of the output. The output results produced in Step 2 are extracted from HDFS. This extracted data is loaded into downstream systems, i.e. enterprise data warehouses. The data is then used for generating reports or for transactional processing, which can be further consumed by BI (business intelligence) tools.
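To make Step 2 concrete, the following is a minimal Hadoop Streaming job written in Python. It is only a sketch: the tab-separated record layout and file name are hypothetical, and it simply totals the quantity sold per item in the style of the classic word count rather than reproducing any specific job from this chapter.

#!/usr/bin/env python
# pos_job.py -- a minimal Hadoop Streaming mapper/reducer pair (illustrative only).
# A typical invocation would pass "pos_job.py map" as the mapper command and
# "pos_job.py reduce" as the reducer command of the Hadoop Streaming jar.
import sys

def mapper():
    # Emit (item_code, quantity) pairs from tab-separated POS records.
    # Assumed layout per line: timestamp <TAB> item_code <TAB> quantity
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                      # skip malformed records
        _, item_code, qty = fields[:3]
        print(f"{item_code}\t{qty}")

def reducer():
    # Hadoop delivers mapper output sorted by key, so equal keys are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()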
Inventory Control in Supply Chain Process Trends by 2020
Big data initiatives are in their infancy. The big data opportunity is seen in “demand” as shown in Figure 5. The areas of demand management, order management, price management
and channel sensing rate the highest in terms of areas to benefit from big data. However, the highest importance is placed on “supply-based initiatives.” One of the barriers to
capturing the big data opportunity is the “definition of supply chain.” When the supply chain is seen as a “function”—versus an end-to-end process from the customer’s customer to
the supplier’s supplier—the organization is more likely to move forward with big data projects as a functional program (Weiss & Zgorski, 2012; Robak, Franczyk, & Robak, 2013; Sethuraman & Kundharaju, 2013).
Source: Supply Chain Insights LLC, Big Data Survey (May–June 2013).
CONCLUSION
Automated inventory control and generation of big data, go hand in hand. In our chapter we have discussed all the aspects of big data generated through Inventory control systems
.There are many advantages of doing automated inventory management, many user friendly tools are also available for the same. Still comes along with a big challenge of handling
enormous amount of data generated through .We have to see and consider that aspect too, which needs to have a tradeoff between usages of inventory management hardware and
software systems and handling of big data generated from it.
This work was previously published in Optimal Inventory Control and Management Techniques edited by Mandeep Mittal and Nita H. Shah, pages 222-235, copyright year 2016
by Business Science Reference (an imprint of IGI Global).
REFERENCES
Harris, A. D. (2005). Vendor-Managed Inventory Growing. Air Conditioning, Heating & Refrigeration News.
Manyika, J., Chui, M. B., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey
Global Institute. Retrieved from
http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx
Robak, S., Franczyk, B., & Robak, M. (2013). Applying Big Data and Linked Data concepts in supply chain management. In Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS). Kraków: IEEE Conference Publications.
Tridib Mukherjee
Xerox Research Center India, India
ABSTRACT
In the modern information era, the amount of data has exploded. Current trends further indicate exponential growth of data in the future. This
prevalent humungous amount of data—referred to as big data—has given rise to the problem of finding the “needle in the haystack” (i.e.,
extracting meaningful information from big data). Many researchers and practitioners are focusing on big data analytics to address the problem.
One of the major issues in this regard is the computation requirement of big data analytics. In recent years, the proliferation of many loosely coupled distributed computing infrastructures (e.g., modern public, private, and hybrid clouds, high performance computing clusters, and grids) has enabled high computing capability to be offered for large-scale computation. This has allowed the execution of big data analytics to
gather pace in recent years across organizations and enterprises. However, even with the high computing capability, it is a big challenge to
efficiently extract valuable information from vast astronomical data. Hence, we require unforeseen scalability of performance to deal with the
execution of big data analytics. A big question in this regard is how to maximally leverage the high computing capabilities from the
aforementioned loosely coupled distributed infrastructure to ensure fast and accurate execution of big data analytics. In this regard, this chapter
focuses on synchronous parallelization of big data analytics over a distributed system environment to optimize performance.
INTRODUCTION
Dealing with the execution of big data analytics is more than just a buzzword or a trend. The data is being rapidly generated from many different
sources such as sensors, social media, click-streams, log files, and mobile devices. The collected data can now exceed hundreds of terabytes and, moreover, is continuously generated from these sources. Such big data represents data sets that can no longer be easily analyzed with
traditional data management methods and infrastructures (Jacobs, 2009; White, 2009; Kusnetzky, 2010). In order to promptly derive insight
from big data, enterprises have to deploy big data analytics into an extraordinarily scalable delivery platform and infrastructure. The advent of
on-demand use of vast computing infrastructure (e.g., clouds and computing grids) has been enabling enterprises to analyze such big data with low resource usage cost.
A major challenge in this regard is figuring out how to effectively use the vast computing resources to maximize the performance of big data
analytics. Using loosely coupled distributed systems (e.g., clusters in a data center or across data centers; a public cloud combined with internal clusters in a hybrid cloud formation) is often a better choice for parallelizing the execution of big data analytics than using local centralized resources.
Big data can be distributed over a set of loosely-coupled computing nodes. In each node, big data analytics can be performed on the portion of the
data transferred to the node. This paradigm can be more flexible and has obvious cost benefits (Rozsnyai, 2011; Chen, 2011). It enables
enterprises to maximally utilize their own computing resources and effectively utilize external computing resources that are further optimized for
the big data processing.
However, contrary to common intuition, there is an inherent tradeoff between the level of parallelism and performance of big data analytics. This
tradeoff is primarily caused by the significant delay for big data to be transferred to computing nodes. For example, when big data analytics is run on a pool of inter-connected computing nodes in a hybrid cloud (i.e., a mix of private and public clouds), it is often observed that the data transfer delay is comparable to, or even higher than, the time required for the data computation itself. Additionally, the heterogeneity of computing nodes in computation time and data transfer delay can further complicate the tradeoff. The data
transfer delay mostly depends on the location and network overhead of each computing node. A fast transfer of data chunks to a relatively slow
computing node can cause data overflow, whereas a slow transfer of data chunks to a relatively fast computing node can lead to underflow
causing the computing node to be idle (hence, leading to low resource utilization of the computing node).
This chapter focuses on optimally parallelizing big data analytics over such distributed heterogeneous computing nodes. Specifically, this chapter
will discuss how to improve the advantage of parallelization by considering the time overlap across computing nodes as well as between data
transfer delay and data computation time in each computing node. It should be noted here that the data transfer delay may be reduced by using
data compression techniques (Plattner, 2009; Seibold, 2012). However, even with such reduction, overlapping the data transfer delay with the
execution can reap benefits in the overall turnaround of the big data analytics. Ideally, the parallel execution should be designed in such a way
that the execution of big data analytics at each computing node, including such data transfer and data computation, completes at nearly the same time as at the other computing nodes.
This chapter will 1) discuss the performance issue of big data analytics in loosely-coupled distributed systems; 2) describe some solution
approaches to address the issue; and 3) introduce a case study to demonstrate the effectiveness of the solution approaches using a real world big
data analytics application on hybrid cloud environments. Readers of this chapter will have a clear picture of the importance of the issue, various
approaches to address the issue, and a practical application of parallelism in the state-of-the-art management and processing of big data analytics
over distributed systems. Specific emphasis of this chapter is on addressing the tradeoff between the performance and the level of parallelism of
big data analytics in distributed and heterogeneous computing nodes separated by relatively high latencies. In order to optimize the performance
of big data analytics, this chapter addresses the following concerns:
• Synchronized completion: How to opportunistically apportion big data to those computing nodes to ensure fast and synchronized completion (i.e., finishing the execution of all workload portions at nearly the same time) for best-effort performance;
• Autonomous determination for data serialization: How to determine a sequence of data chunks (from the ones apportioned to each computing node) so that the transfer of one chunk overlaps as much as possible with the computation of previous chunks.
While unequal loads (i.e., data chunks) may be apportioned to the parallel computing nodes, it is important to make sure that outputs are produced at nearly the same time. Any single slow computing node should not act as a bottleneck for the overall execution of big data analytics. This chapter will demonstrate the synchronous parallelization with the aid of real experiences from a case study using a parallel frequent pattern mining task as a specific type of big data analytics. The input to the frequent pattern mining is huge but the outputs are far smaller. Parallel data
mining typically consumes a lot of computing resources to analyze large amounts of unstructured data, especially when they are executed with a
time constraint. The case study involves a deployment of the analytics on multiple small Hadoop MapReduce clusters in multiple heterogeneous computing nodes in clouds, where we consider each cluster and cloud as a computing node. Hadoop MapReduce is a typical and the most commonly
used programming framework to host big data analytics tasks over clouds. The case study would help readers further understand the concept,
definitions, and approaches presented in this chapter.
The remainder of this chapter is organized as follows. First, the chapter will introduce some background for readers on loosely coupled
distributed systems, scheduling and load balancing methodologies to apportion workload in distributed computing nodes, and handling big data
over distributed systems. Second, the chapter will discuss the performance issues and key tradeoffs tackled in this chapter. Third, the chapter will
describe how load balancing methodologies can be used toward synchronous parallelization. Fourth, the case study will be described to illustrate
the approach. Finally, future research directions in this area will be outlined before the conclusion.
BACKGROUND
Loosely Coupled Distributed Systems
Loosely coupled distributed systems is a general term for collections of computers (a.k.a. nodes) interconnected by some network. A node can be
a single server, a cluster of servers (e.g., a Hadoop MapReduce cluster) in a data center, a data center with multiple clusters, or an entire cloud
infrastructure (with multiple data centers), depending on the scale of the system. There are various examples of loosely coupled distributed systems; we list a few of them here to provide the proper context:
• Computing grids: A federation of whole computers (with CPU, disk, memory, power supplies, network interfaces, etc.) used toward performing a common goal to solve a single task. Such systems are a special type of parallel computation system, where the computing units can be heterogeneous, geographically far apart, and/or can transcend a single network domain. Grids can be used for parallelized processing of large-scale tasks as well as for scavenging CPU cycles from resources shared as part of the grid for large scientific tasks;
• Computing clusters: Sets of loosely connected or, in many cases, tightly connected computers working together such that a cluster can itself be viewed as a single computing node. Computing clusters are more tightly coupled than computing grids. In most cases, a cluster is designed in a single rack (or across a set of racks) in a single data center;
• Computing clouds: Federations of computing servers hosted in a single data center or across multiple data centers that are offered to
users based on virtualization middleware for remote access. Clouds can be private (i.e., the infrastructure is inside an enterprise and the offerings are to internal users), public (i.e., the infrastructure is available to all users through pay-per-use service offerings), or hybrid
(i.e., a combination of private and public clouds);
• Parallel computing systems: Allow simultaneous execution of multiple tasks. All the aforementioned systems can be a form of parallel computing system. However, by definition, parallel computing systems are typically tightly coupled (e.g., multi-core or multi-processor systems), where computing units (processors) may not be inter-connected through a networking interface. The most prominent examples of parallel computing systems can be found in High Performance Computing (HPC) systems.
The major advantage of the aforementioned systems is the amount of computation capability they can provide for any large-scale computation such as big data analytics. This chapter focuses on how to effectively use the computation capability of loosely coupled distributed systems to optimize the performance of big data analytics.
Loosely coupled distributed systems can be further homogeneous (i.e., all the inter-connected nodes having same configuration and capabilities)
or heterogeneous (i.e., the inter-connected nodes can differ in configuration and/or capabilities). An example of a homogeneous system is a Hadoop MapReduce cluster comprising the same type of servers with the same processor, memory, disk, and network configurations. An
HPC system or a computing cloud with a set of such clusters can also be considered as a homogeneous system, since each cluster will be identical
in their configuration and capabilities. The cluster can become heterogeneous if it does not consist of the same type of servers. Similarly, a cloud
can be heterogeneous with different underlying clusters in it.
Many enterprises are recently focusing on building their cloud-based IT infrastructures by integrating multiple heterogeneous clouds including
their own private clouds into a hybrid cloud, rather than relying on a single cloud. This hybrid cloud can be one of the most practical approaches
for enterprises, since it can reduce the inherent risk of losing service availability and further improve cost-efficiency. In order to optimize the
performance in such parallel heterogeneous environments, task or job schedulers with load balancing techniques have indeed become a recent
norm (Maheswaran, 1999; Kailasam, 2010; Kim, 2011). Such techniques will be further discussed in the following sections. This chapter will then
discuss how load balancing can be used for big data analytics. For general discussions, from now on we will assume heterogeneous systems over
loosely coupled distributed environments. However, all the problems, issues, and approaches discussed can also be applicable to homogeneous
systems over either loosely coupled or tightly coupled environments.
Task Scheduling in Distributed Systems
Distributing workload among nodes and executing big data analytics in parallel in loosely coupled distributed systems can obviously reap
performance benefits. However, such benefits are contingent upon how effectively the big data workload is distributed across the nodes. This
section reviews scheduling of workload (i.e., deciding in which order the incoming tasks need to be served) and in the following section, the
assignment of load (i.e., deciding in which node tasks would be assigned) will be discussed.
Workload scheduling has been an on-going research topic in parallel and distributed systems for many decades. There are many scheduling algorithms for distributed computing systems. It is not feasible to enumerate all of them, but the main focus of this chapter is to provide an account of the main operating principles and intuitions used. Many variations of the algorithms exist for different types
of execution environments. However, the basic principle of assigning tasks to suitable nodes in suitable order remains similar. According to a
simple classification, scheduling algorithms can be categorized into two main groups: batch scheduling algorithms and online algorithms. It should be noted that scheduling and allocation problems are generally NP-hard in nature, and heuristic solutions are the norm in practical system settings.
In either batch or online scheduling algorithms, tasks are queued and collected into a set when they arrive in the system, and the scheduling algorithm executes at fixed intervals. Examples of such algorithms include First-Come-First-Served (FCFS), Round Robin (RR), the Min–Min algorithm, the Max–Min algorithm, and priority-based scheduling. In FCFS, the task that arrives in the queue first is served first; this algorithm is simple and fast. In round robin scheduling, processes are dispatched in a FIFO manner but are given a limited amount of time called a time-slice or a quantum. If a process does not complete before its time-slice expires, the task is preempted and the node is given to the next process waiting in the queue; the preempted task is then placed at the back of the queue. The Min–Min algorithm chooses smaller tasks to be executed first, which may however incur long delays for large tasks. The Max–Min algorithm goes to the other extreme and chooses larger tasks to be executed first, thus delaying the smaller tasks.
Most of these algorithms can be generalized as priority-based scheduling: the main intuition is to assign priorities to tasks and order them according to those priorities. In FCFS, the job arriving first gets higher priority; similarly, in the Max–Min algorithm, the larger tasks have higher priority. Big data scheduling can be thought of as a specialized batch scheduling problem, where the batch is the set of data (i.e., the big data) and the scheduling problem boils down to which data chunk should be acted upon first to optimize performance. For online scheduling, the decision making boils down to which nodes the tasks (or data) should be assigned to, and there are various algorithms in the literature (e.g., first-fit, which assigns to the first available node, and best-fit, which assigns to the best available node based on some performance criterion).
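As an illustration of the Min–Min heuristic mentioned above, the following is a minimal Python sketch, under the assumption that completion times for every (task, node) pair have already been estimated; the matrix values and function name are hypothetical.

def min_min_schedule(etc):
    """Min-Min heuristic.

    etc[t][n] is the estimated time to complete task t on node n.
    Returns a list of (task, node) assignments in scheduling order.
    """
    ready = {n: 0.0 for n in range(len(etc[0]))}   # time at which each node is free
    unscheduled = set(range(len(etc)))
    schedule = []
    while unscheduled:
        # For each task, find its earliest possible completion time over all nodes,
        # then pick the task whose minimum completion time is smallest (Min-Min).
        best = min(
            ((t, n, ready[n] + etc[t][n]) for t in unscheduled for n in ready),
            key=lambda x: x[2],
        )
        task, node, finish = best
        schedule.append((task, node))
        ready[node] = finish
        unscheduled.remove(task)
    return schedule

# Hypothetical estimated completion times for 4 tasks on 2 nodes.
etc = [[3.0, 5.0], [1.0, 2.0], [6.0, 4.0], [2.0, 2.5]]
print(min_min_schedule(etc))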
In some distributed systems (e.g., computing grids), a hierarchical scheduling structure is employed with two main schedulers, local and grid
schedulers. The local schedulers work on computational nodes that mostly have homogeneous environments (Sharma, 2010), whereas grid
schedulers (a.k.a. meta-schedulers) reside at the top level to orchestrate nodes managed by their respective local schedulers (Huedo, 2005).
Scheduling can further be static or dynamic. In static scheduling, tasks are typically executed without interruption in the nodes where they are
assigned. On the other hand, tasks may get rescheduled in dynamic scheduling; tasks’ executions can be interrupted and they can be migrated to
different nodes based on dynamic information on the workload and the nodes (Chtepen, 2005).
Load Balancing
Distribution of load across nodes, normally referred to as load balancing, has been explored for all the systems mentioned in the previous section.
Indeed, load balancing is a common research problem for parallel processing of workload in loosely coupled distributed systems (Li, 2005;
Miyoshi, 2010; Patni, 2011; Tsafrir, 2007; Liu, 2011).
Load balancing generally involves distribution of service loads (e.g., application tasks or workload) to computing nodes to maximize performance
(e.g., task throughput). Various algorithms have been designed and implemented in a number of studies (Casavant, 1994; Xu, 1997; Zaki, 1996).
Load balancing decisions can be made a-priori when resource requirements for tasks are estimated. On the other hand, a multi-core computer
with dynamic load balancing allocates resources at runtime based on no a-priori information. (Brunner, 1999) proposed a load balancing scheme
to deal with interfering tasks in the context of clusters of workstations.
(Grosu, 2005) and (Penmatsa, 2005) considered static load balancing in a system with servers and computers where servers balance load among
all computers in a round robin fashion. (Kamarunisha, 2011) discussed load balancing algorithm types and policies in computing grids. A
good description of customized load balancing strategies for a network of workstations can be found in (Zaki, 1996). (Houle, 2002) focuses on
algorithms for static load balancing assuming that the total load is fixed. (Yagoubi, 2006) and (Kumar, 2011) presented dynamic load balancing
strategy for computing grids. (Patni 2011) provided an overview of static and dynamic load balancing mechanisms in computing grids.
The problem of both load balancing and scheduling in distributed systems can be formulated into a bin-packing problem. Bin-packing is a well-
known combinatorial optimization problem that has been applied to many similar problems in computer science. The main idea of the bin-
packing problem is to pack multiple objects of different volumes into a finite number of bins (or buckets) of same or different capacity in a way
that minimizes the number of bins used. The bin-packing problem is known to be NP-hard, and there is no known polynomial-time algorithm that solves it optimally. A task allocation problem can generally be reduced to the bin-packing problem, and usually heuristic
algorithms are developed to address the problem in various settings. Depending on the objective to be optimized, bins (or buckets) along with
their respective capacities (i.e., bucket sizes) and objects to be packed to bins need to be designed carefully to cater to the specific problem
settings. We will discuss how the data apportioning problem can be reduced to the bin-packing problem later in this chapter.
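To illustrate the kind of heuristic typically used for bin-packing, the following is a minimal Python sketch of the classic first-fit-decreasing heuristic. It is a generic textbook heuristic, not the specific approach developed later in this chapter, and the object sizes and bin capacity are hypothetical.

def first_fit_decreasing(sizes, capacity):
    """Pack objects into as few bins of equal capacity as possible (heuristic)."""
    bins = []  # each bin is a list of object sizes
    for size in sorted(sizes, reverse=True):      # largest objects first
        for b in bins:
            if sum(b) + size <= capacity:         # first bin with enough room
                b.append(size)
                break
        else:
            bins.append([size])                   # open a new bin
    return bins

# Hypothetical data chunk sizes (GB) and a 10 GB bin capacity.
print(first_fit_decreasing([7, 5, 4, 3, 2, 2, 1], capacity=10))
# [[7, 3], [5, 4, 1], [2, 2]]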
(Buyya, 1999) described load balancing for HPC clusters. Cloud friendly load balancing of HPC applications has been explored in (Sarood, 2012).
Although the term load balancing may suggest that an equal amount of workload is distributed across the computing nodes, this is often not the case. (Sarood, 2012) used different parameters to distribute workload. This chapter focuses on distributing big data workload among distributed
nodes based on performance and network parameters. Such distribution may incur cost and delay overheads in many systems.
Impact of Data Communication
Impact of the data transfer delay on performance, when loads are redirected to multiple computing nodes, has been considered to be an
important problem recently (Fan, 2011; Andreolini, 2008). (Foster, 2008) compared cloud and grid computing in light of how data transfer
can become an issue for loosely coupled distributed systems because of the communication overhead. The problem further gets exacerbated when
loads involve big data. This chapter provides an account on the distribution of big data over distributed computing nodes that can be separated
far apart. In this setting, the overhead of the data transfer can be significant, since network latencies may be high between computing nodes due
to the amount of data being transferred. Hence, the conventional notion of performance improvement through parallelization may not hold once
the data transfer becomes a bottleneck.
It is beneficial if the data transfer to a computing node can be serialized, when some previous data is being processed in the computing node, to
improve performance. Serialization of data transfers with instruction execution is an important feature for many-core architectures (Miyoshi,
2010). However, such serialization is restricted by a pre-defined order of instruction sets. Meanwhile, overlapping of data transfer and data
computation has been identified as an important requirement for load balancing in computing clusters (Reid, 2000). Data compression
techniques (Plattner, 2009; Seibold, 2012) can be used to reduce data transfer and associated overhead. However, the overlapping of data
transfer and computation can still be beneficial in order to reduce the impact of any transfer delay on the end-to-end execution or turn-around
time.
Platforms for Parallel Processing of Big Data
There are various existing platforms to execute big data applications on distributed systems. Google introduced the MapReduce programming
model (Dean, 2004) for processing big data. MapReduce enables splitting of big data across multiple parallel map tasks, and the outputs are later
merged by reduce tasks. Often map and reduce tasks are executed across massively distributed nodes for performance benefits (Chen, 2011;
Huang, 2011). Apache implemented the MapReduce model in Hadoop, an open source and free distribution. Another related framework is Apache GoldenOrb, which allows massive-scale graph analysis and is built upon Hadoop. The open source Storm framework has also been developed at Twitter. It is often referred to as the real-time Hadoop, since it is a real-time, distributed, fault-tolerant computation system, whereas Hadoop generally relies on batch processing.
The Hadoop implementation operates on a distributed file system that stores the big data and its intermediate data during big data processing. Many scheduling strategies for MapReduce clusters have been integrated (e.g., FIFO scheduler (Bryant, 2007), LATE scheduler (Zaharia, 2008),
Capacity scheduler and Fair scheduler). Fair scheduling is a method of assigning resources to tasks such that all tasks get, on average, an equal
share of resources over time. Fair-scheduling has also been used in Hadoop. Unlike the default scheduler of Hadoop, which forms a queue of
tasks, fair scheduling lets short tasks complete within a reasonable time without starving long-running tasks. A major issue in all these
frameworks is how to effectively reduce the impact of large data transfer delay (caused by communication overhead of big data across nodes) in
the overall execution time of tasks.
One way to address the high data transfer latency is to use only a local cluster with high computation power and dynamic resource provisioning
(Alves, 2011). However, distribution among federated private and public clouds can be more flexible and has obvious cost benefits (Rozsnyai,
2011; Chen, 2011). This chapter discusses how load balancing can be precisely used for the distributed parallel applications dealing with
potentially big data such as the continuous data stream analysis (Chen, 2011) and the life pattern extraction for healthcare applications in
federated clouds (Huang, 2011). To achieve the optimal performance of big data analytics, it is imperative to determine “how many” and “which
computing nodes” are required, and then, how to apportion given big data to those chosen computing nodes.
PROBLEMS IN DEALING WITH BIG DATA OVER DISTRIBUTED SYSTEMS
The input big data to an analytics algorithm (e.g., a log file containing users’ web transactions) is typically collected in a central place over a
certain period (e.g., daily, weekly, monthly, or yearly), and is processed by the analytics algorithm to generate an output (e.g., frequent user
behavior patterns). To execute analytics algorithm in multiple computing nodes for given big data, the big data needs to be first divided into a
certain number of data chunks (e.g., log files of individual user groups), and those data chunks are transferred to computing nodes separately.
Before discussing how to select computing nodes and how to apportion big data to these computing nodes, we first discuss some fundamental
problems in this section.
Performance and Parallelism Tradeoff
Intuitively, as the number of computing nodes increases, the overall data computation time can decrease. However, as the amount of data chunks
to be transferred to computing nodes increases, the overall data transfer delay can increase. As shown in Figure 1, the overall execution time,
which consists of the data computation time and the data transfer delay, can start to increase if we use more than a certain number of computing
nodes. This is because the delay taken to transfer data chunks starts to dominate the overall execution time.
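The tradeoff can be sketched with a toy cost model in Python; the constants below are hypothetical and only illustrate why total execution time can start rising beyond a certain number of nodes once per-node transfer overhead grows, mirroring the curve described for Figure 1.

def total_execution_time(n_nodes, data_size, compute_rate, transfer_overhead):
    """Toy model: compute time shrinks with parallelism, transfer overhead grows.

    data_size         -- total data volume (e.g., GB)
    compute_rate      -- GB processed per unit time on one node
    transfer_overhead -- per-node transfer/setup cost added for each extra node
    """
    compute_time = data_size / (compute_rate * n_nodes)
    transfer_delay = transfer_overhead * n_nodes       # grows with parallelism
    return compute_time + transfer_delay

# Hypothetical parameters: 1000 GB of data, 2 GB per time unit per node.
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(total_execution_time(n, 1000, 2.0, 5.0), 1))
# The printed times first fall and then rise as n increases.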
To minimize the impact of data synchronization toward optimizing performance, it is needed to understand the characteristics of the input data
before designing the parallel process. For example, in case of mining frequent user behavior patterns from web transaction log files, the input
data is usually in temporal order and mixed with many users’ web transaction activities. To generate frequent behavior patterns of each
individual user (i.e., extract personalized information from the big data), the given big data can be divided into user groups as a set of data
chunks. Distributing and executing individual user data in different computing nodes, which have different computing and network capacities,
can reduce the data synchronization, since these data chunks are usually independent of each other. Then, the problem can be cast into a
typical load balancing problem as shown in Figure 2 that maximizes the time overlap across computing nodes and thereby, completes the
execution at the near same time (i.e., synchronization).
Capabilities of network channels and computing nodes determine if there is any performance bottleneck while parallelizing big data analytics. It
should be noted that a fast transfer of data chunks to a relatively slow computing node can cause data overflow, whereas a slow transfer of data
chunks to a relatively fast computing node can lead to data underflow causing the computing node to be idle over the time.
Overlapping between the data transfer delay and the data computation time as much as possible is thus imperative while distributing data chunks
to computing nodes. As shown in Figure 3, ideally, the goal should be to select a data chunk that takes the same amount of delay to be transferred
to a computing node with the computation time of the computing node with the previous data chunk.
Practically, this overlap can be maximized by ensuring that the difference between the data computation time and the data transfer time (of
following data chunk) is as low as possible. This chapter discusses how the performance can be optimized by maximizing the time overlap not
only across computing nodes via a load balancing technique, but also between the data transfer delay and the data computation time in each
computing node, simultaneously.
SYNCHRONOUS PARALLELIZATION THROUGH LOAD BALANCING
In the problem of the big data load distribution to parallel computing nodes as shown in Figure 4, each individual data chunk (i.e., Li) is
considered as the object to be packed, and the capacity of each computing node (i.e., ni) including both computation and network capacities is
considered as the bin.
Figure 4. Bin-packing with different sizes of data chunks and computing nodes
More specifically, the principal objective to maximize the performance using a federation of heterogeneous computing nodes would be to
minimize the execution time difference among computing nodes by optimally apportioning given data chunks into these computing nodes. In
other words, the time overlaps across these computing nodes need to be maximized when a big data analytics is performed. A heuristic bin-
packing algorithm can be used for data apportioning in federated computing nodes by incorporating: weighted load distribution, which involves
loading to a computing node based on its overall delay (i.e., the computing node with lower delay gets more data to handle); and delay-based
computing node preference to ensure larger data chunks are assigned to a computing node with larger bucket size (i.e., with lower overall delay)
so that individual data chunks get fairness in their overall delay.
It is imperative to have good estimates of data transfer delay and data computation time in each computing node to efficiently perform the
apportioning. Many estimation techniques have been introduced for the data computation time and the data transfer delay such as response
surface model (Kailasam, 2010) and queuing model (Jung, 2009). The data computation time of each computing node can be profiled to create
an initial model, which can then be tuned based on observations over the time. The data transfer delay between different computing nodes can be
more dynamic than the data computation time because of various factors such as network congestions and re-routing due to network failures.
Estimating data transfer delay can be performed using auto-regressive moving average (ARMA) filter (Jung, 2009; Box, 1994) by periodically
profiling the network latency. Periodic injection of a small size of unit data to the target computing node and recording the corresponding delay
can allow for reasonable and up-to-date profiling of the network latency. In this chapter we also assume that a larger data chunk requires more computation time and incurs a higher data transfer delay. As mentioned in (Tian, 2011), this assumption is applicable to many big data analytics applications. For some special cases, these estimates must be carefully modeled to improve the accuracy of the overall approach.
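As a simple stand-in for the ARMA-based estimation mentioned above, the sketch below keeps an exponentially weighted moving average of per-unit probe delays for each node. This is a deliberately simpler smoother than an ARMA filter, and the probe mechanism, class name, and parameter values are hypothetical.

class LatencyProfiler:
    """Track a smoothed per-unit data transfer delay estimate for each node."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha          # weight given to the newest observation
        self.estimates = {}         # node -> smoothed delay per data unit

    def record_probe(self, node, delay_per_unit):
        """Fold one probe measurement (seconds per data unit) into the estimate."""
        old = self.estimates.get(node)
        if old is None:
            self.estimates[node] = delay_per_unit
        else:
            self.estimates[node] = self.alpha * delay_per_unit + (1 - self.alpha) * old

    def estimate(self, node):
        return self.estimates.get(node)

# Hypothetical periodic probes: send a unit of data to node "MRW" and time it.
profiler = LatencyProfiler()
for measured in (2.1, 2.4, 1.9, 2.8):     # seconds per unit, made-up values
    profiler.record_probe("MRW", measured)
print(round(profiler.estimate("MRW"), 2))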
Based on the estimates of the data transfer delay and the data computation time in each computing node, there are two steps for maximizing
performance through parallelization over federated computing nodes:
• Maximal overlapping of data processing: Involves decision makings on “how many” and “which” computing nodes are used for data
apportioning; and
Maximal Overlapping of Data Processing
A set of parallel computing nodes are chosen that have shorter execution time d than other candidate computing nodes. Figure 5 shows how those
computing nodes are selected from a set of candidate computing nodes, based on estimates of the data transfer delay t to each computing node
and the computing node's computation time e for the data assigned. The estimation is assumed to be performed periodically using a unit of data (i.e., a fixed size of data). The candidate computing nodes are sorted by the total estimate (i.e., t + e), and each computing node is added to the
execution pool N if required. The procedure starts by considering the execution time with a single computing node nmaster, which can be the
central computing node where the big data is initially submitted for processing. One extreme case can be using only nmaster, if the estimated execution time meets the exit condition.
If a Service Level Agreement (SLA) exists that specifies a goal such as a bound on execution time, the SLA can be used for the exit condition.
Otherwise, an optimal execution time can be used for the condition. More specifically, if the SLA cannot be met using the current set of parallel computing nodes (i.e., rtp > SLA, where rtp denotes the estimated overall execution time with p parallel computing nodes), or the overall execution time can be further reduced by utilizing more nodes (i.e., rtp < rtp−1), then the level of
parallelization is increased by adding one computing node into the pool N. The newly added computing node is the best computing node that has
the minimum delay (i.e., t + e which is estimated periodically) among all candidate computing nodes. Once the set of parallel computing nodes is
determined from the above step, one of load balancing techniques mentioned in the prior section can be applied to the distribution of given big
data. In this chapter, a specific load balancing technique using a bin-packing algorithm, referred to as Maximally Overlapped Bin-packing (MOB),
is performed that attempts to maximize the time overlap across these computing nodes and the time overlap between data transfer delay and
data computation time in each computing node.
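A minimal Python sketch of this node-selection loop is shown below. It assumes that per-node estimates of transfer delay t and computation time e for a unit of data are available, and it uses the simplified full-parallelization estimate given later by Equation 2 as the pooled execution time; the function names and example delays are hypothetical.

def estimated_pool_time(total_data, pool_delays):
    """Estimated turnaround under full parallelization (cf. Equation 2):
    r = s / (1/d1 + ... + 1/dn), where di = ti + ei per unit of data."""
    return total_data / sum(1.0 / d for d in pool_delays)

def select_nodes(total_data, master, candidates, sla=None):
    """Greedy selection: start from the master node and keep adding the
    candidate with the smallest per-unit delay while the SLA is unmet or
    the estimated execution time still improves.

    master     -- (name, per-unit delay) of the central node
    candidates -- dict of other node names -> per-unit delay (t + e)
    """
    pool = [master]
    rt = estimated_pool_time(total_data, [d for _, d in pool])
    for name, delay in sorted(candidates.items(), key=lambda kv: kv[1]):
        trial = estimated_pool_time(total_data, [d for _, d in pool] + [delay])
        if (sla is not None and rt > sla) or trial < rt:
            pool.append((name, delay))
            rt = trial
        else:
            break
    return [name for name, _ in pool], rt

# Hypothetical per-unit delays (seconds per unit of data).
print(select_nodes(1000, ("LLC", 4.0), {"HLW": 1.5, "MRW": 3.5, "LLW": 4.2}, sla=600))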
Figure 6 illustrates two main steps for data distribution to computing nodes:
• Data Apportioning: This involves: (a) the determination of bucket size for each computing node; (b) sorting of data chunks in
descending order of their sizes; and (c) sorting computing node buckets in descending order of their sizes. The bucket sizes are determined in such a way that a computing node with higher delay gets a smaller bucket. Sorting the buckets then essentially boils down to giving higher preference to computing nodes that have lower delay. The sorted list of data chunks is assigned to the computing node buckets in such a way that larger data chunks are handled by computing nodes with lower delay (i.e., higher preference). Any fragmentation of the buckets is handled in this step;
• Serialization of Apportioned Data: After the data chunks are assigned to the buckets, this step organizes the sequence of data chunks
for each bucket such that the data transfer delay and data computation time are overlapped maximally.
Specifically, in the data apportioning step, this approach parallelizes the given task by dividing the input big data among multiple computing nodes. If the delay of the task for a unit of data on a computing node i is denoted as di, then the overall delay of the data si (that is provided to node i for the given task) would be the delay per unit data of the node multiplied by the amount of data assigned to the node (i.e., sidi). Ideally, in order to ensure parallelization, for the set of computing nodes 1, 2, …, n, the sizes of data to be assigned to each computing node, s1, s2, …, sn, should satisfy s1d1 = s2d2 = s3d3 = … = sndn, where d1, d2, …, dn are the delays per unit of data at each computing node, respectively. The delay per unit data is defined
as the turnaround time (including data transfer and computation) when one data unit is assigned to a computing node. After such assignment, if
the overall turnaround time of the given task (i.e., for all data assigned to a node) is assumed to be r, then the size of data assigned to each
computing node would be as follows:
si = r / di, where 1 ≤ i ≤ n (1)
assuming full parallelization (i.e., the apportioning of data is ideal and all the nodes are synchronous and complete execution together after the time r). Now, let s be the total amount of input data, which is distributed across the n computing nodes (i.e., s = s1 + s2 + … + sn). Then, substituting si = r / di from Equation 1, we get s = r (1/d1 + 1/d2 + … + 1/dn), and:
r = s / (1/d1 + 1/d2 + … + 1/dn) (2)
Equation 2 provides the overall execution time of the given task under full parallelization. This can be achieved if the data assigned to each computing node i is limited by an upper bound si, given by substituting r from Equation 2 into Equation 1 as follows:
si = (s / di) / (1/d1 + 1/d2 + … + 1/dn), where 1 ≤ i ≤ n (3)
Note here that si is higher for a computing node i if the delay di for that computing node is lower (compared to other computing nodes). Hence,
Equation 3 can be used to determine the bucket size for each computing node in a way where higher preference is given to computing nodes with
lower delay. Once the bucket sizes are determined, the next step involves assigning the data chunks to the computing node buckets. A greedy bin-
packing approach can be used, where the largest data chunks are assigned to the computing nodes with the lowest delay (hence reducing the overall delay), as shown in Figure 6. To reduce fragmentation of buckets, the buckets are filled completely one at a time (i.e., the bucket with the lowest delay is exhausted first, followed by the next one, and so on). This approach also fills more data into the computing nodes with lower delay.
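A small worked example (with hypothetical per-unit delays) shows how Equations 1 to 3 translate into bucket sizes in Python:

def bucket_sizes(total_data, delays):
    """Bucket size per node from Equation 3: si = (s / di) / sum_j(1/dj)."""
    inv_sum = sum(1.0 / d for d in delays)
    r = total_data / inv_sum                      # Equation 2: common finish time
    return r, [r / d for d in delays]             # Equation 1: si = r / di

# Hypothetical case: s = 1200 data units over three nodes with d = 2, 3, and 6
# time units per data unit.
r, sizes = bucket_sizes(1200, [2.0, 3.0, 6.0])
print(r, sizes)   # r = 1200 / (1/2 + 1/3 + 1/6) = 1200, sizes = [600, 400, 200]
# Each node finishes at the same time, since si * di = 1200 for all three nodes.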
Once data chunks are apportioned for computing nodes, our approach organizes the sequence of data chunks to each computing node. The
previous step achieves the parallelization of the given task over a large set of data chunks. However, the delay di for a unit of data to run on computing node i can be decomposed into the data transfer delay from the central computing node to computing node i and the actual data computation time on computing node i. Therefore, it is possible to further reduce the overall execution time by transferring data to a computing node in parallel
with execution on a different data chunk. Ideally, the data transfer delay of a data chunk should be exactly equal to the data computation time on
previous data chunk. Otherwise, there can be a delay incurred by queuing (when the data computation time is higher than the data transfer delay)
or unavailability of data for execution (when the data transfer delay is higher than the data computation time). If the data computation time and
the data transfer delay are not exactly equal, it is required to smartly select a sequence of data chunks for the data bucket of each computing node,
so that the difference between the data computation time of each data chunk and the data transfer delay of a data chunk immediately following is
minimized.
Depending on the ratio of data transfer delay and data computation time, a computing node i can be categorized as follows:
• Type 1: For which the data transfer delay ti per a unit of data is higher than the data computation time ei per a unit of data;
• Type 2: For which the data computation time ei per a unit of data is higher than the data transfer delay ti per a unit of data.
It is important to understand the required characteristics of the sequence of data chunks sent to each of these types of computing nodes.
If sij and si(j+1) are the sizes of data chunks j and (j + 1) assigned to computing node i, then for complete parallelization of the data transfer of chunk (j + 1) and the data computation of chunk j, it can be said that si(j+1)ti = sijei. It should be noted here that if ti ≥ ei, then ideally si(j+1) < sij. Thus, data chunks in the bucket for a type 1 computing node should be in descending order of their sizes. Similarly, it can be concluded that for a computing node of type 2, data chunks should be in ascending order, as shown in the last step of Figure 6 and ensured at the end of the approach (where the descending order of data chunks is reversed to make the order ascending in case ti < ei).
The following pseudo code (Algorithm 1) summarizes the above data apportioning and data serialization steps of the heuristic bin-packing approach (i.e., MOB).
Algorithm 1. Maximal Overlapped Binpacking
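Since the pseudo code listing itself is not reproduced here, the following is a minimal Python sketch consistent with the MOB description above (bucket sizing per Equation 3, greedy assignment of the largest chunks to the lowest-delay buckets, and per-node chunk ordering by node type). It is an illustrative reading of the approach, not the authors' exact Algorithm 1, and all inputs are hypothetical.

def mob(chunks, nodes, total_data=None):
    """Maximally Overlapped Bin-packing (sketch).

    chunks -- list of data chunk sizes
    nodes  -- list of dicts with keys 'name', 't' (transfer delay per unit),
              and 'e' (computation time per unit)
    Returns a mapping node name -> ordered list of chunk sizes.
    """
    s = total_data if total_data is not None else sum(chunks)
    delays = [n["t"] + n["e"] for n in nodes]
    inv_sum = sum(1.0 / d for d in delays)

    # Step 1: bucket size per node (Equation 3); lower delay -> larger bucket.
    capacity = {n["name"]: (s / d) / inv_sum for n, d in zip(nodes, delays)}

    # Step 2: greedy assignment -- largest chunks go to the lowest-delay bucket
    # that still has room, so fast nodes are filled first.
    buckets = {n["name"]: [] for n in nodes}
    remaining = dict(capacity)
    order = sorted(nodes, key=lambda n: n["t"] + n["e"])          # fastest first
    for chunk in sorted(chunks, reverse=True):                    # largest first
        for n in order:
            if remaining[n["name"]] >= chunk:
                buckets[n["name"]].append(chunk)
                remaining[n["name"]] -= chunk
                break
        else:
            buckets[order[0]["name"]].append(chunk)               # overflow to fastest

    # Step 3: serialize chunks per node -- descending sizes when transfer delay
    # dominates (type 1), ascending when computation dominates (type 2).
    for n in nodes:
        ascending = n["e"] > n["t"]
        buckets[n["name"]].sort(reverse=not ascending)
    return buckets

# Hypothetical chunk sizes and node characteristics (t and e per unit of data).
chunks = [120, 90, 75, 60, 40, 30, 20, 10]
nodes = [{"name": "HLW", "t": 0.2, "e": 1.0},
         {"name": "MRW", "t": 1.8, "e": 2.0},
         {"name": "LLC", "t": 0.0, "e": 4.0}]
print(mob(chunks, nodes))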
CASE STUDY
This section shows the efficacy of the solution approach (presented in the previous section) using a case study where a frequent pattern mining task
with big data is deployed into a federation of four different computing nodes, three of which are local clusters located in the northeastern part of
US, and one is a remote cluster located in the mid-western part of US. We first describe details of the frequent pattern mining application and the
federated computing nodes followed by some evaluation results in details.
Frequent Pattern Mining Application
Frequent pattern mining (Srikant, 1996) aims to extract frequent patterns from log files. A typical example of the log file is a web server access
log, which contains a history of web page accesses from users. Enterprises need to analyze such web server access logs to discover valuable
information such as web site traffic patterns and user behavior patterns in web sites by time of day, time of week, time of month, or time of year.
These frequent patterns can also be used to generate rules to predict future activity of a certain user within a certain time interval based on the
user’s past patterns.
In the scenario, we further combine a phone call log obtained from a call center with the web access log of a human resource system (HRS) accessed by more than a hundred thousand users. The log contains data for a year, including several million user activities, and the data size is up to 1.2 TB. The frequent pattern mining obtains patterns of each user's activities across the combination of HRS web site exploration and phone calls for HR issues such as 401(k) and retirement plans. As such, the log can first be divided into a set of user log files, each of which is a data chunk representing the log of a single HRS user; these user log files can then be processed in parallel over the federation of heterogeneous computing nodes.
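As a small illustration of this chunking step, the sketch below groups a combined log into per-user chunks. The CSV layout and the user_id field are hypothetical, since the chapter does not specify the log format.

```python
import csv
from collections import defaultdict


def split_by_user(log_path):
    """Group combined HRS web-access and call-center records by user.

    Assumes a hypothetical CSV layout with a 'user_id' column; each group
    becomes one data chunk that can be mined independently of the others.
    """
    chunks = defaultdict(list)
    with open(log_path, newline="") as log_file:
        for record in csv.DictReader(log_file):
            chunks[record["user_id"]].append(record)
    return chunks  # {user_id: [records]} -> one chunk per HRS user
```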
A Federation of Heterogeneous Computing Nodes
One local computing node, referred to as a Low-end Local Central (LLC) node, is used as a central computing node, where big data is submitted
for frequent pattern mining task. This computing node consists of 5 virtual machines (VMs), each of which has two 2.8 GHz CPU cores, 1 GB
memory, and a 1 TB hard drive. The data computation time is higher in LLC than in the other computing nodes, while there is no data transfer delay. Another local computing node, referred to as a Low-end Local Worker (LLW) node, has a similar configuration and data computation time to LLC. We have set up LLW to simulate the data transfer delay between LLC and LLW by intentionally injecting a delay. The third local computing node, referred to as a High-end Local Worker (HLW) node, is a cluster of 6 non-virtualized servers, each of which has 24 2.6 GHz CPU cores, 48 GB memory, and a 10 TB hard drive. HLW is shared with three other data mining tasks besides the frequent pattern mining task. HLW
has the lowest data computation time among all the computing nodes, and the data transfer delay between LLC and HLW is very low. The remote
computing node, referred to as a Mid-end Remote Worker (MRW) node, has 9 VMs, each of which has two 2.8 GHz CPU cores, 4 GB memory,
and 1 TB hard drive. MRW has a lower data computation time than LLC. However, the overall execution time is similar to LLC due to the large
data transfer delay. Hadoop MapReduce is deployed into each of these computing nodes. Figure 7 shows each computing node’s performance
characteristics in the context of data computation time and data transfer delay. We have measured those values by deploying the frequent pattern
mining into each computing node and executing it with the entire log files. For example, to figure out the ratio of data computation time and
transfer delay in MRW, we have measured the data transfer delay by moving the entire log files into MRW, and measured the data computation
time by executing the frequent pattern mining only in MRW. As shown in Figure 7, MRW has higher transfer delay than other computing nodes,
and it is very close to the data computation time of MRW (but slightly lower than the data computation time).
Figure 7. Performance characteristics of different computing nodes
Parallel Processing for Frequent Pattern Mining
Figure 8 outlines the main phases of the parallel pattern mining. For given big data (i.e., log files of HRS web site access log and HRS call center
log), the load balancer divides it into multiple data chunks, each of which represents a log file containing a single user’s site access log and phone
call log. In the parallel data processing phase, data chunks are sent to the MapReduce system in each computing node one by one and executed in the node. Since each data chunk contains a single user's log data, all data chunks can be processed independently in parallel across the computing nodes. Therefore, in this case, we need a single global merging phase to combine all patterns of individual users and generate the overall view of patterns. However, since the number of data chunks is much larger than the number of computing nodes, we can also combine multiple data chunks into a single larger data chunk; the MapReduce system in each computing node then processes these larger chunks in parallel. In this case, we need a local merging phase at the end of each MapReduce process. We omit this local merging in Figure 8.
Figure 8. Overview of the parallel pattern mining phases
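A minimal sketch of the global merging phase is shown below, assuming each computing node returns a dictionary of per-user pattern counts; the function name global_merge, the tuple-based pattern encoding, and the min_support threshold are illustrative assumptions rather than the chapter's implementation.

```python
from collections import Counter


def global_merge(per_chunk_patterns, min_support):
    """Combine per-chunk (per-user) pattern counts into the overall view.

    per_chunk_patterns: iterable of dicts mapping a pattern, e.g. the tuple
    ('web:401k', 'call:retirement'), to its count within one chunk.
    Only patterns whose merged count reaches min_support are kept.
    """
    totals = Counter()
    for counts in per_chunk_patterns:
        totals.update(counts)
    return {pattern: count for pattern, count in totals.items()
            if count >= min_support}
```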
Impact of Data Transfer Delay on Overall Execution Time
As shown in Figure 9, the execution time increases exponentially as the data size increases. This explains the need for parallel execution of such big data analytics to improve performance. When using two computing nodes (i.e., LLC and LLW in this case) without the data transfer delay,
the execution time decreases by almost half. However, when the frequent pattern mining is executed with the data transfer delay (when transfer
delays are injected into LLW), the execution time is higher, and the difference in execution time increases as the data size increases. Therefore, it
is important to deal with the data transfer delay carefully. To control the data size in this experiment, we have varied the amount of data in each individual data chunk (i.e., selected 10% to 100% of each original data chunk). This experiment also shows that the data chunk size can be strongly related to the data transfer delay and data computation time in the pattern mining algorithm.
Figure 9. Impact of data transfer delay on the overall execution time
Effectiveness of Federated Heterogeneous Computing Nodes
As shown in Figure 10, the execution time decreases as computing nodes are added. However, the execution time is not significantly improved by
using all four computing nodes compared to using two computing nodes (i.e., LLC and HLW). This is because the contributions of MRW and
LLW to the performance are small, and the data transfer delays of MRW and LLW start to impact the overall execution time. Therefore, using two computing nodes can be a better choice than using four computing nodes, which incurs a higher resource usage cost.
Figure 10. Impact of using multiple computing nodes on execution time
Comparison of MOB with Other Load Balancing Approaches
We run the frequent pattern mining task using Maximally Overlapped Bin-packing (MOB) and three other methods that have been used in many prior load balancing approaches (Miyoshi, 2010; Fan, 2011; Andreolini, 2008; Reid, 2000; Kim, 2011) and then compare the results. The methods used in this comparison are as follows (a sketch of these baselines follows the list):
• Fair division: This method divides the input big data into data chunks and equally distributes those data chunks to computing nodes. We use this naive method as a baseline;
• Computation-based division: This method only considers the data computation power of each computing node when it performs load balancing (i.e., it distributes data chunks to computing nodes based on the computation time of each computing node), rather than considering both data computation time and data transfer delay;
• Delay-based division: This method considers both data computation time and data transfer delay in load balancing. However, it does not consider the time overlap between the data transfer delay and the data computation time.
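For comparison, the sketch below gives one possible reading of the three baselines (the chapter provides no code for them, so the function names and the per-unit parameters t and e follow the same hypothetical conventions as the earlier sketches). Fair division ignores node characteristics entirely, while the other two differ only in what their cost estimate accounts for.

```python
def fair_division(chunk_sizes, nodes):
    """Fair division: round-robin chunks, ignoring node speed and delay."""
    buckets = {n["name"]: [] for n in nodes}
    for i, size in enumerate(chunk_sizes):
        buckets[nodes[i % len(nodes)]["name"]].append(size)
    return buckets


def weighted_division(chunk_sizes, nodes, cost):
    """Assign each chunk to the node with the lowest estimated finish time;
    cost(node, size) defines what that estimate accounts for."""
    buckets = {n["name"]: [] for n in nodes}
    finish = {n["name"]: 0.0 for n in nodes}
    for size in sorted(chunk_sizes, reverse=True):
        best = min(nodes, key=lambda n: finish[n["name"]] + cost(n, size))
        buckets[best["name"]].append(size)
        finish[best["name"]] += cost(best, size)
    return buckets


def computation_based(chunk_sizes, nodes):
    """Computation-based division: only the computation time e is considered."""
    return weighted_division(chunk_sizes, nodes, lambda n, s: s * n["e"])


def delay_based(chunk_sizes, nodes):
    """Delay-based division: transfer delay t and computation time e are both
    considered, but their overlap is ignored (the two are simply added)."""
    return weighted_division(chunk_sizes, nodes, lambda n, s: s * (n["t"] + n["e"]))
```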
Figure 11 shows the result when we run the frequent pattern mining task in HLW and MRW. As evident from the results in Figure 11, MOB can
achieve at least a 20% (and up to 60%) improvement compared to the other approaches. Since MRW has a large data transfer delay, the execution time of Computation-based division is very close to that of Fair division. Although Computation-based division considers the data computation powers of MRW and HLW for load balancing, MRW becomes a bottleneck due to its large data transfer delay. Delay-based division considers both the data computation time and the data transfer delay. This significantly reduces the bottleneck. However, small data chunks accumulate in the queues of computing nodes until the data computation of the previous large data chunk is completed. Additionally, while large data chunks are transferred to MRW, small data chunks are computed earlier, and MRW is often idle. This may incur a significant extra delay. MOB considers the maximal overlapping of data transfer delay and data computation time and thereby achieves a further reduction in execution time.
We have also conducted another experiment to see the impact of the data chunk size on the overall execution time. The size of data chunk is
determined by the number of combined log files (i.e., the original data chunks) in this experiment. This number can be determined empirically, and the user can then provide it as a parameter of the system. As shown in Figure 12, the execution time slightly increases as the size increases. This is because the effectiveness of load balancing decreases as the load balancing gets coarser grained. Moreover, this coarse granularity makes the impact of overflow and underflow on execution time even worse in the Fair and Delay-based approaches, since the size difference between data chunks increases. However, it does not significantly affect the execution time of MOB, since the size difference in MOB is smaller than in the Fair and Delay-based approaches. When a small data chunk size is used, the execution time slightly increases as well in the MOB and Delay-based approaches. This is because a small amount of delay is incurred each time a data chunk is transferred to the node and the node prepares for the data execution. As the size of the data chunk decreases, the number of data chunks transferred increases.
FUTURE RESEARCH DIRECTIONS
Load balancing techniques such as the Maximally Overlapped Bin-packing (MOB) algorithm can efficiently handle data-intensive tasks (e.g., big data analytics) that typically require special platforms and infrastructures (e.g., a Hadoop MapReduce cluster), in particular when the target task can be divided into a set of independent, identical sub-tasks that can run in parallel. For the frequent pattern mining task used in the case study, the input big data can be divided into a set of data chunks. Each data chunk can then be processed independently on parallel computing nodes without the need for any synchronization until the task is completed. However, if the big data cannot be divided into fully independent data chunks, the task may require an iterative algorithm, where data is transferred just once from a central node to the computing nodes but the data computation is run multiple times on the same data chunk; such tasks may also require multiple rounds of data transfer among computing nodes. Running these tasks across federated computing nodes in clouds may not be practical, since they may require considerable synchronization among computing nodes and thereby incur considerable delay for transferring data before being completed (e.g., merging and redistributing intermediate results iteratively). Extending the performance benefits of parallel processing to such tasks is an open research issue.
The decision making of MOB is based on the current status of network and computation capacities. However, this status can change dynamically due to unexpected events such as computing node failures and network congestion, while MOB is still sorting computing nodes based on the earlier status. One possible solution is for MOB to periodically check the available data computation capacities and network delays of the computing nodes. Another is for distributed monitoring systems to push events to MOB when the status changes significantly. In either case, the status change triggers MOB to re-sort the computing nodes and re-assign the sequence of remaining data chunks to the next available computing nodes.
CONCLUSION
In this chapter, we have described how the performance of big data analytics can be maximized through synchronous parallelization over
federated, heterogeneous computing nodes in clouds, which are a loosely coupled and distributed computing environment. More specifically, this chapter has discussed: (a) how many and which computing nodes in clouds should be used; (b) an approach for the opportunistic apportioning
of big data to these computing nodes in a way to enable synchronized completion; and (c) an approach for the sequence of apportioned data
chunks to be computed in each computing node so that the transfer of a data chunk is overlapped as much as possible with the data computation
of the previous data chunk in the computing node. Then, we have shown the efficacy of the solution approach using a case study, where a
frequent pattern mining task with big data is deployed into a federation of heterogeneous computing nodes in clouds.
This work was previously published in Big Data Management, Technologies, and Applications edited by Wen-Chen Hu and Naima Kaabouch, pages 47-71, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Alves, D., Bizarro, P., & Marques, P. (2011). Deadline queries: Leveraging the cloud to produce on-time results. In Proceedings of International
Conference on Cloud Computing (pp. 171–178). IEEE.
Andreolini, M., Casolari, S., & Colajanni, M. (2008). Autonomic request management algorithms for geographically distributed internet-based
systems. In Proceedings of International Conference on Self-Adaptive and Self-Organizing Systems (pp. 171–180). IEEE.
Box, G., Jenkins, G., & Reinsel, G. (1994). Time series analysis: Forecasting and control . Upper Saddle River, NJ: Prentice Hall.
Brunner, R. K., & Kale, L. V. (1999) Adapting to load on workstation clusters. In Proceedings of the Symposium on the Frontiers of Massively
Parallel Computation (pp. 106–112). IEEE.
Buyya, R. (1999). High performance cluster computing: Architectures and systems (Vol. 1). Upper Saddle River, NJ: Prentice Hall.
Casavant, T. L., & Kuhl, J. G. (1994). A taxonomy of scheduling in general purpose distributed computing systems. Transactions on Software
Engineering , 14(2), 141–153. doi:10.1109/32.4634
Chen, Q., Hsu, M., & Zeller, H. (2011). Experience in continuous analytics as a service. In Proceedings of International Conference on Extending
Database Technology (pp. 509-514). IEEE.
Chtepen, M. (2005). Dynamic scheduling in grid systems. In Proceedings of the PhD Symposium. Ghent, Belgium: Faculty of Engineering, Ghent
University.
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In Proceedings of the Symposium on Operating
System Design and Implementation (pp. 107-113). ACM.
Fan, P., Wang, J., Zheng, Z., & Lyu, M. (2011). Toward optimal deployment of communication-intensive cloud applications. In Proceedings of
International Conference on Cloud Computing (pp. 460–467). IEEE.
Foster, I., Zhao, Y., Raicu, I., & Lu, S. (2008). Cloud computing and grid computing 360-degree compared. In Proceedings of the Grid
Computing Environments Workshop (pp. 1-10). IEEE.
Grosu, D., & Chronopoulos, A. T. (2005). Noncooperative load balancing in distributed systems. Journal of Parallel and Distributed
Computing , 65(9), 1022–1034. doi:10.1016/j.jpdc.2005.05.001
Houle, M., Symnovis, A., & Wood, D. (2002). Dimension-exchange algorithms for load balancing on trees. In Proceedings of International
Colloquium on Structural Information and Communication Complexity (pp. 181–196). IEEE.
Huang, Y., Ho, Y., Lu, C., & Fu, L. (2011). A cloud-based accessible architecture for large-scale ADL analysis services. In Proceedings of
International Conference on Cloud Computing (pp. 646–653). IEEE.
Huedo, E., Montero, R. S., & Llorente, I. M. (2005). The GridWay framework for adaptive scheduling and execution on grids. Scalable Computing:
Practice and Experience , 6, 1–8.
Jung, G., Joshi, K., Hiltunen, M., Schlichting, R., & Pu, C. (2009). A cost sensitive adaptation engine for server consolidation of multi-tier
applications. In Proceedings of International Conference on Middleware (pp. 163–183). ACM/IFIP/USENIX.
Kailasam, S., Gnanasambandam, N., Dharanipragada, J., & Sharma, N. (2010). Optimizing service level agreements for autonomic cloud bursting
schedulers. In Proceedings of International Conference on Parallel Processing Workshops (pp. 285–294). IEEE.
Kamarunisha, M., Ranichandra, S., & Rajagopal, T.K.P. (2011). Recitation of load balancing algorithms in grid computing environment using
policies and strategies an approach. International Journal of Scientific & Engineering Research, 2.
Kim, H., & Parashar, M. (2011). CometCloud: An autonomic cloud engine . In Cloud Computing: Principles and Paradigms . Hoboken, NJ: Wiley.
doi:10.1002/9780470940105.ch10
Kumar, U.K. (2011). A dynamic load balancing algorithm in computational grid using fair scheduling. International Journal of Computer Science
Issues, 8.
Li, Y., & Lan, Z. (2005). A survey of load balancing in grid computing. Computational and Information Science , 3314, 280–285.
doi:10.1007/978-3-540-30497-5_44
Liu, Z., Lin, M., Wierman, A., Low, S., & Andrew, L. (2011). Greening geographical load balancing. In Proceedings of SIGMETRICS Joint
Conference on Measurement and Modelling of Computer Systems (pp. 233–244). ACM.
Maheswaran, M., Ali, S., Siegal, H., Hensgen, D., & Freund, R. (1999). Dynamic matching and scheduling of a class of independent tasks onto
heterogeneous computing systems. In Proceedings of Heterogeneous Computing Workshop (pp. 30–44). IEEE.
Miyoshi, T., Kise, K., Irie, H., & Yoshinaga, T. (2010). Codie: Continuation based overlapping data-transfers with instruction execution.
In Proceedings of International Conference on Networking and Computing (pp. 71-77). IEEE.
Patni, J. C., Aswal, M. S., Pal, O. P., & Gupta, A. (2011). Load balancing strategies for Grid computing. In Proceedings of the International
Conference on Electronics Computer Technology (Vol. 3, pp. 239–243). IEEE.
Penmatsa, S., & Chronopoulos, A. T. (2005). Job allocation schemes in computational Grids based on cost optimization. In Proceedings of the 19th
International Parallel and Distributed Processing Symposium. IEEE.
Plattner, H. (2009). A common database approach for OLTP and OLAP using an in-memory column database. In Proceedings of the SIGMOD
Conference. ACM.
Reid, K., & Stumm, M. (2000). Overlapping data transfer with application execution on clusters. Paper presented at the meeting of Workshop
on Cluster-based Computing. Santa Fe, NM.
Rozsnyai, S., Slominski, A., & Doganata, Y. (2011). Large-scale distributed storage system for business provenance. In Proceedings of
International Conference on Cloud Computing (pp. 516-524). IEEE.
Sarood, O., Gupta, A., & Kale, L. V. (2012). Cloud friendly load balancing for HPC applications: Preliminary work. In Proceedings of
International Conference on Parallel Processing Workshops (pp. 200–205). IEEE.
Seibold, M., Wolke, A., Albutiu, M., Bichler, M., Kemper, A., & Setzer, T. (2012). Efficient deployment of main-memory DBMS in virtualized data
centers. In Proceedings of International Conference on Cloud Computing (pp. 311-318). IEEE.
Srikant, R., & Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In Proceedings of International
Conference on Extending Database Technology (pp. 3–17). IEEE.
Tian, F., & Chen, K. (2011). Towards optimal resource provisioning for running MapReduce programs in public clouds. In Proceedings of
International Conference on Cloud Computing (pp. 155-162). IEEE.
Tsafrir, D., Etsion, Y., & Feitelson, D. (2007). Backfilling using system generated predictions rather than user runtime estimates. Transactions on
Parallel and Distributed Systems , 18, 789–803. doi:10.1109/TPDS.2007.70606
Yagoubi, B., & Slimani, Y. (2006). Dynamic load balancing strategy for grid computing. Transactions on Engineering, Computing and
Technology , 13, 260–265.
Zaharia, M., Konwinski, A., & Joseph, A. D. (2008). Improving MapReduce performance in heterogeneous environments. In Proceedings of
Symposium on Operating Systems Design and Implementation (pp. 29-42). USENIX.
Zaki, M. J., Li, W., & Parthasarathy, S. (1996). Customized dynamic load balancing for a network of workstations. In Proceedings of International
Symposium on High Performance Parallel and Distributed Computing (pp. 282–291). IEEE.
ADDITIONAL READING
Armbrust, M., Fox, A., Grith, R., Joseph, A. D., Katz, R. H., & Konwinski, A. … Zaharia, M. (2009). Above the clouds: A Berkeley view of cloud
computing (Technical Report UCB/EECS-2009-28). Berkeley, CA: EECS Department, University of California, Berkeley.
Bennett, C. (2010). MalStone: Towards a benchmarking for analytics on large data clouds. In Proceedings of Conference on Knowledge,
Discovery, and Data Mining (pp. 145-152). ACM.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization . Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9780511804441
Buyya, R., Broberq, J., & Goscinski, A. M. (Eds.). (2011). Cloud computing: Principles and paradigms . Hoboken, NJ: Wiley.
doi:10.1002/9780470940105
Buyya, R., Yeo, C., & Venugopal, S. (2009). Market-oriented cloud computing: Vision, hype, and reality for delivering IT services as computing
utilities. In Proceedings of International Symposium on Cluster Computing and the Grid (pp. 1-10). ACM.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. In Proceedings of Symposium on Operating Systems
Design and Implementation (pp. 107 - 113). ACM.
Foster, I., Zhao, Y., Raicu, I., & Lu, S. (2008). Cloud computing and grid computing 360-degree compared. In Grid Computing Environments
Workshop, (pp. 1 – 10). IEEE.
Fraser, S., Biddle, R., Jordan, S., Keahey, K., Marcus, B., Maximilien, E. M., & Thomas, D. A. (2009). Cloud computing beyond objects: Seeding
the cloud. In Proceedings of Conference on Object-Oriented Programming Systems, Languages, and Applications (pp. 847-850). ACM.
Gupta, S., Fritz, C., de Kleer, J., & Wittevenn, C. (2012). Diagnosing heterogeneous Hadoop clusters. Paper presented at the Meeting of
International Workshop on Principles of Diagnosis. Great Malvern, UK.
Jiang, D., Ooi, B., Shi, L., & Wu, S. (2010). The performance of MapReduce: An in-depth study. In Proceedings of Very Large Databases
Conference (pp. 472-483). ACM.
Jung, G., Hiltunen, M. A., Joshi, K. R., Schlichting, R. D., & Pu, C. (2010). Mistral: Dynamically managing power, performance, and adaptation
cost in cloud infrastructures. In Proceedings of International Conference on Distributed Computing Systems (pp. 62-73). IEEE.
Kailasam, S., Gnanasambandam, N., Dharanipragada, J., & Sharma, N. (2010). Optimizing service level agreements for autonomic cloud bursting
schedulers. In Proceedings of International Conference on Parallel Processing Workshops (pp. 285-294). IEEE.
Kambatla, K., Pathak, A., & Pucha, H. (2009). Towards optimizing Hadoop provisioning in the cloud. In Proceedings of Workshop on Hot Topics
in Cloud Computing. ACM.
Kayulya, S., Tan, J., Gandhi, R., & Narasimhan, P. (2010). An analysis of traces from a production MapReduce cluster. In Proceedings of
International Conference on Cluster Cloud and Grid Computing (pp. 94-103). IEEE.
Lai, K., Rasmusson, L., Adar, E., Zhang, L., & Huberman, B. A. (2005). Tycoon: an implementation of a distributed, market-based resource
allocation system. Multiagent and Grid Systems , 1(3), 169–182.
Lenk, A., Klems, M., Nimis, J., Tai, S., & Sandholm, T. (2009). What's inside the cloud? An architectural map of the cloud landscape.
In Proceedings of Workshop on Software Engineering Challenges of Cloud Computing (pp. 23-31). IEEE.
Mannila, H., Toivonen, H., & Verkamo, A. I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge
Discovery , 1(3), 259–289. doi:10.1023/A:1009748302351
Morton, K., Friesen, A., Balazinska, M., & Grossman, D. (2010). Estimating the progress of MapReduce pipelines. In Proceedings of
International Conference on Data Engineering (pp. 681-684). IEEE.
Panda, B., Herbach, J. S., Basu, S., & Bayardo, R. J. (2009). Planet: Massively parallel learning of tree ensembles with MapReduce.
In Proceedings of Very Large Databases Conference (pp. 1426-1437). ACM.
Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., & Stonebraker, M. (2009). A comparison of approaches to large-scale
data analysis. In Proceedings of International Conference on Management of Data (pp. 165-178). ACM.
Pei, J., Han, J., Mortazavi-Asl, B., & Pinto, H. (2001). PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth.
In Proceedings of International Conference on Data Engineering (pp. 215-224). IEEE.
Seibold, M., Wolke, A., Albutiu, M., Bichler, M., Kemper, A., & Setzer, T. (2012). Efficient deployment of main memory DBMS in virtualized data
centers. In Proceedings of International Conference on Cloud Computing (pp. 311-318). IEEE.
Sotomayor, B., Montero, R. S., Llorente, I. M., & Foster, I. (2009). Virtual infrastructure management in private and hybrid clouds. IEEE Internet
Computing , 13(5), 14–22. doi:10.1109/MIC.2009.119
Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., & Sen Sarma, J. … Liu, H. (2010). Data warehousing and analytics infrastructure at
Facebook. In Proceedings of International Conference on Management of Data (pp. 1013-1020). ACM.
Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., & Stoica, I. (2008). Improving MapReduce performance in heterogeneous environments.
In Proceedings of Symposium on Operating Systems Design and Implementation (pp. 29-42). ACM.
KEY TERMS AND DEFINITIONS
Big Data: A collection of data that has such a large volume, grows with a high velocity, and includes a wide variety of data types. Due to these characteristics, it is difficult to process using traditional database management tools and traditional data processing approaches.
Data Mining: A computational process to extract information and patterns from a large data set and then, transform information into an
understandable structure for further use. Typically, it involves a wide variety of disciplines in mathematics and computer science such as artificial
intelligence, machine learning, statistics, and database systems.
Federated Cloud: The combination of multiple cloud computing services (i.e., infrastructures, platforms, and software from multiple cloud
providers) to match business needs. Typically, it forms the deployment and management of hybrid cloud that consists of internal and external
clouds.
Load Balancing: A method for improving the performance of computing tasks by distributing workload across multiple computers or a
computer cluster, networks, or other resources. It is designed to achieve optimal resource utilization, maximize throughput, minimize task
execution time, and avoid some overload of computing resources.
MapReduce: A programming model for processing large data sets in parallel. It has been created and implemented by Google. MapReduce is
typically used to deal with big data with distributed computing on clusters of computing nodes. A popular free implementation is Apache
Hadoop.
Optimization: A procedure or a selection of a best case with regard to some criteria from a set of available alternatives. Typically, an
optimization problem can be defined as a mathematical function and then, it aims at maximizing or minimizing the function by systematically
choosing input values from a set and computing the value of the function.
Parallelization: A method of carrying out computing tasks simultaneously on multiple computing nodes. Typically, it divides a large problem into smaller problems and then solves those smaller problems concurrently.
Synchronization: Two distinct but related concepts, synchronization of processes and synchronization of data. Synchronization of processes
indicates that multiple parallel processes are to join at a certain point before going forward. Synchronization of data is for keeping multiple copies
of a data in coherence with one another or maintaining data integrity.
CHAPTER 72
Visualization of Human Behavior Data:
The Quantified Self
Alessandro Marcengo
Telecom Italia, Italy
Amon Rapp
Computer Science Department – University of Torino, Italy
ABSTRACT
Although the Quantified Self (QS) application domain has been growing in recent years, there are still some palpable fundamental problems that keep the QS movement in a phase of low maturity. The first is a technological problem: specifically, a lack of maturity in technologies for data collection, processing, and visualization. This is accompanied by a perhaps more fundamental problem of deficit, bias, and lack of integration of aspects concerning the human side of the QS idea. The step that the authors tried to make in this chapter is to highlight aspects that could lead to a more robust approach in the QS area. This was done, primarily, through a new approach to data visualization and, secondly, through a necessary management of complexity, both in technological terms and, as concerns the human side of the whole issue, in theoretical terms. The authors have gone a little further, stressing how future directions of research could lead to significant impacts at both the individual and social levels.
1. INTRODUCTION
Knowledge of the self is the mother of all knowledge. So it is incumbent on me to know my self, to know it completely, to know its minutiae, its
characteristics, its subtleties, and its very atoms. (Khalil Gibran)
In this chapter our proposal is to outline some aspects of human behavior-generated data in the light of a specific research branch, the so-called Quantified Self (QS) area. We think that the time has come to sketch the current status and new directions of a field that, despite having been explored for years, is only recently finding practical application, thanks in part to technological advances. We will analyze the first attempts to trace and
represent human activity through the description of some experiments in this direction (e.g. LifeLog DARPA Project, MyLifeBits Microsoft
Project, etc.). The chapter will then describe the most important application fields of QS, in order to draw a clear picture of the current situation
and to give an overview of the most promising sectors in which this approach will be developed in the coming years.
In particular, we will describe the distinctive means used to monitor and render human behavior and the applications aimed at persuading people to change their everyday practices, in areas related to health, mood, and fitness, but also with a few references to sports, training, social networking, transportation, consumption, emotions, and communication.
The chapter will then cover two fundamental problems of QS today, the technological and the theoretical, both in terms of data visualization for large datasets and in terms of behaviour change theories that still make it difficult to adopt a non-purely empirical QS approach. We will then draw the directions for handling these problems toward a more robust and credible QS scenario, through new ways of representing data and new directions in the management of complexity, both on the technological side and on the human side of the topic. Finally, we will go further, considering future directions of research with both individual and social impacts.
2. WHAT IS THE QUANTIFIED SELF?
The QS is a school of thought which aims to use increasingly invisible technological means to acquire and collect data on different aspects of people's daily lives. These data can be "inputs" from the outside (such as the calories or the CO2 consumed), or they can be "states" (such as the mood or the oxygen level in the blood) or parametric indicators of performance or activity (such as the kilometers run, the emails sent, or the mp3s listened to). The purpose of collecting these data is self-monitoring and self-reflection oriented toward some kind of change or improvement (behavioral, psychological, medical, etc.).
It is immediately evident how this approach, which we will analyze in more detail in the following pages, raises a series of theoretical problems (for example, what are the foundations of human behaviour change?) but, even before that, a series of technological issues. How can these data be collected (input)? With what kind of sensors? And above all, how can this knowledge be returned to the user (output)? With what kind of data visualization techniques? In this short arc, from doing an action (recorded by a sensor and stored in a database) to having the "image" of that action (displayed on a screen), there are about 200 years of research studies in different fields.
To give a little bit of history of this topic, the whole spectrum of toolsets, applications, and technical approaches related to this type of thinking has taken different names over time. It can be found in the literature as "Personal Informatics," "Personal Analytics," "Self Tracking," and "Living by Numbers," according to what each definition has emphasized.
Considering only the dawn of QS itself, the movement was founded in 2007 by the "Wired" editors Gary Wolf and Kevin Kelly, with the purpose of creating collaboration between users and manufacturers involved in the development of self-knowledge through self-tracking technology. In 2008, Wolf and Kelly opened the site "quantifiedself.com". In 2010, Wolf spoke at the TED Conference, and in May 2011 the movement held its first conference specifically on QS, in Mountain View.
There are some basic points that define the QS movement: data collection, the display of these data, and the cross-linking of these data in order to discover possible correlations. The interest developing around this current in recent years is also evidenced by the proliferation of gadgets appearing in trade shows such as CES (the Consumer Electronics Show) and by conference sponsors, among which it is possible to find names such as Vodafone, Philips, and Intel.
The whole issue seems to revolve around the question “Who am I?” The supporters of the QS think that the answer lies in our daily activities and
hence the need to quantify any behavior, taking photos of everything that is eaten or drunk, recording the distances covered, monitoring the pattern of REM/NREM sleep, noting the "mood of the day", storing blood pressure and heart rate, and so on.
Given this fundamental question, it is perhaps necessary to underline what we mean by "Self", something that is too often left aside.
The notion of “Self” has been a central aspect of many personality theories, including those of Sigmund Freud, Alfred Adler, Carl Jung, Carl
Rogers, and Abraham H. Maslow. In Carl Jung‘s concept the self is a totality consisting of conscious and unconscious contents that dwarfs the
ego in scope and intensity. The maturation of the self is the individuation process, which is the goal of any healthy personality. Many years later
Rogers theorized that a person’s self-concept determines his behaviour and his relation to the world, and that therapeutic improvement occurs
only when the individual changes his own self-concept. Maslow’s theory of self-actualization was based on a hierarchy of needs and emphasized
the highest capacities or gratifications of a person. Nowadays a simple "common sense" definition of "Self" is "one's identity, character, abilities,
and attitudes, especially in relation to persons or things outside oneself or itself” (Definition of Self, 2012a), while amongst the more “academic”
definitions we can find this kind of descriptions: “The individual as the object of his own reflective consciousness; the man viewed by his own
cognition as the subject of all his mental phenomena, the agent in his own activities, the subject of his own feelings, and the possessor of
capacities and character; a person as a distinct individual; a being regarded as having personality” (Definition of Self, 2012b).
The concept of “self” is central in all human history since the emergence of consciousness. The ability to reflect on oneself is central to each of us
as human beings. The fundamental human question “Who am I?” is exactly the research of our “Self”.
3. ROOTS OF QUANTIFIED SELF MOVEMENT
The idea of recording personal physical activity and psychological and emotional "states" through technology (spreadsheets, digital pictures, etc.) has roots that can be traced to the mid-90s: the practice of lifelogging, "a form of pervasive computing consisting of a unified digital record
of the totality of an individual’s experiences, captured multimodally through digital sensors and stored permanently as a personal multimedia
archive” (Dodge & Kitchin, 2007), is the one that comes closest to the idea of QS and finds in people like Steve Mann, professor at the University
of Toronto, its precursors. Mann, a pioneer in the field of wearable devices, decided in 1994 to start streaming his daily life 24 hours a day. EyeTap (2012), the name of his project, consists of a wearable device allowing storage of all that the user sees, making it possible to record a complete photographic memory of everything that happens to him. Furthermore, EyeTap displays information back to the user, altering the visual perception of the wearer and creating in this way a kind of augmented reality (Mann, 2004). In essence, users can "build" reality by altering the
visual perception of the environment and the visual appearance of the sight by adding or modifying what they are seeing: this mediation of reality
can be done in real time or in a following phase, going to retrieve what was recorded and applying to it the desired changes (Nack, 2005). Around
Mann's experience there has developed, over the years, a community of lifeloggers (or globbers, as they call themselves) who aim to visually record everything that happens around them during their daily life: this community has grown to some 200 thousand members.
If EyeTap was one of the first individual projects designed to record all of an individual's perceptions on digital media, "MyLifeBits" (Gemmell et al., 2006) was the first attempt, supported by industry (Microsoft), to aspire to the complete recording of all the experiences of a human being. Conceived in 1998 as a tool to record daily life events, MyLifeBits aims to preserve the whole life of its creator: Gordon Bell. Bell decided to store in digital format everything with which he came into contact during his day: articles, CDs, letters, events, notes, sounds, conversations, and photos. Bell, having begun his digitization effort, still needed a human assistant to catalog and digitize all the items flowing through his life. However, even with all materials digitized, Bell still did not have the chance to use them due to the limitations of the software available at that time (Bell & Gemmell, 2007). MyLifeBits was designed to bring order and create links between the various data collected and stored by
Bell, taking advantage of a metadata system in order to make possible the navigation in these huge amounts of heterogeneous data. For instance
the system is able to generate automatic links, by correlating the GPS location of Bell, continuously recorded throughout the day, with the time
and date of a photo taken by Bell himself. In addition, MyLifeBits records his telephone calls and the programs playing on his radio and television, and automatically stores a copy of every Web page he visits and a transcript of every instant message he sends or receives. With the aid of a "SenseCam", a wearable device developed by Microsoft Research (Hodges et al., 2006) that automatically takes pictures when its sensors indicate that the user may want to take a photograph, Bell preserves images of the surrounding environments in which he moves, the people he meets, and the significant moments of his life. In fact, Gordon Bell realized the pioneering idea of Vannevar Bush, who as early as 1945, in his article "As We May Think" (Bush, 1945), envisioned a tool (the Memex, an abbreviation for Memory Extender) able to store and catalog, using microfilm technology, all the documents that an individual encounters in the course of his life. The Memex should have been able to generate automatic annotations and create links between documents, similarly to the process of human associative memory. Today MyLifeBits is evolving towards a
continuous connection with any kind of sensors, able to automatically generate data regarding the displacements of Bell, his health and his
physiological parameters. However, the limit of Bell's project lies in its strong individual connotation, because it was designed by and for its creator: a system tailored to the constant recording of Bell's daily life that had the virtue of pointing the way rather than seeking large-scale diffusion in the consumer market.
“Total Recall”, instead, is a Lifelog research project of the Internet Media Lab of the University of Southern California, that starts from a different
perspective (Internet Multimedia Lab, 2004). The basic idea of this system is to increase the memory of people through the storing of
experiences, events and knowledge relating to an individual or a multitude of individuals. Using a large amount of sensors, a microphone and a
camera mounted on a pair of glasses or in a necklace, Total Recall aims to record the world from a personal point of view, allowing the collected data to be recovered on demand through customizable searching, analysis, and querying. Total Recall was designed from the beginning to have a wide range of applications, all aimed at gathering information about an individual with the purpose of making evident data and elements that usually escape his attention. For example, in the health care area, the system can monitor the daily diet of a diabetic person, reporting any risk situations or correlating these data with the environments visited by the same individual in order to identify specific allergies: the data, flowing continuously from the patient, can also help clinicians to fine-tune diagnoses and treatments (Cheng et al. 2004).
All these attempts show how, from the first half of the 90s, the idea of recording the life of an individual and returning it in some visual form was spreading in the world of industrial and academic research. Worth noting, however, is "LifeLog", the first government project aimed at monitoring and recording the lives of individuals, with potential military applications as well. LifeLog was a project developed by the Defense Advanced Research Projects Agency (DARPA), conceived as a system capable of collecting information on an individual, on his activities, his states, and his relationships. The goal of LifeLog was to trace the entire life of an individual: his monetary transactions and his purchases, the
contents of his phone calls, sent and received emails and messages, his movements (traced by a GPS signal), and even biomedical data related to
his physical health (obtainable through specific wearable sensors). In the original concept the system could have been used in stand-alone mode
as a sort of journal or recorder of individual memories, allowing the user to search and retrieve information and experiences related to his past in
the form of images, sounds, and videos. The LifeLog project was closed in 2004 due to privacy concerns raised by public opinion. However, during its short life it made people imagine a future in which military commands would be equipped with systems for the continuous recording of their experiences, able to access their own data and to re-trace what had happened to them (Allen, 2008).
All these early research efforts to build applications for continuously tracking parameters of human behavior as a whole ran into obstacles in the availability, complexity, and cost of the technology (O'hara et al., 2006), as well as in issues concerning ethics and privacy (Mayer-Schönberger, 2009). Only recently have technological advances in networks, sensors, search, and storage seemed to change this situation. As we will see, the birth of a number of academic projects and a considerable number of commercial tools capable of tracking individual parameters of people's lives (and then recomposing them into meaningful views) seems to bring a credible QS scenario at least closer.
4. QUANTIFIED SELF PROMISES
Wolf opened his famous TED talk (Wolf, 2010) by reciting a series of numbers. The time he woke up (6:10 am), his average beats
per minute (61), his blood pressure (127/74), how many minutes of exercise he had (0), how many milligrams of caffeine he had been drinking
(600), how many milliliters of alcohol (0), and such other numbers.
Looking at the current QS scenario, the impression is that this knowledge is given little consideration in favor of a "wow effect" or an aesthetic presentation. Hence the suspicion is that nowadays QS is not something oriented toward building knowledge for a purpose, but instead a way to collect data, like collecting butterflies or beer caps: something that is an end in itself. The flavor is that QS supporters are more interested in collecting numbers and putting them into some sort of neat filing cabinets (usually called infographics) than in anything else. The risk is that the data
do not help people to reach a personal goal, but their collection becomes a goal in itself, losing the big picture and the original motivations that
should guide QS applications.
In fact, the interesting point of the whole QS movement is the ability to change something (for the better) in people’s lives. But often this
discussion is simplified into this kind of statement: "by seeing my data I have more information and this allows me to make better choices. Furthermore, if the graph is nice I will like collecting my data more". In fact, the mechanisms that govern behavior change are numerous and complex. In the following sections we will review some key moments in the study of human psychology, seeing how some of these mechanisms are used from time to time in QS applications, often without anyone even realizing it.
Although Gary Wolf has sometimes called QS "the next step in human development", stressing that having greater awareness of one's data is a form of action against the choice standardization imposed by TV, advertising, various media, etc., and a sort of re-appropriation of individual self-determination, QS still looks like something very embryonic. In fact, there are some fundamental problems yet to be solved. On the one hand there are technological problems, but these are not of great concern, because technological evolution is a fairly linear and predictable process. The problems that need to be addressed in a deeper manner appear instead to be of a theoretical order. The huge amount of data that a QS scenario can generate poses major challenges in the manipulation of this information and in the display of these data. The next sections will discuss how far we are from having applied knowledge of data visualization for large amounts of data, and how we have to develop innovative solutions that take into account the ability of our brain to perceive data in a more natural way, as it does with the complexity of the natural world.
A second and even more profound problem is the lack of a true theory of behaviour change, or rather the existence of knowledge modules used in
a more or less conscious way by QS applications but without an actual theoretical corpus that can be used by designers. There is therefore a
theoretical problem that does not allow us, beyond some personal experiments in certain application domains, to say which mechanism of operation, given some data, should produce a change, and consequently what should be considered in a QS application to produce a different personal behaviour.
To provide more clarity, the next section offers a brief review of the elements that seem to play a role in QS as a behaviour change engine, and of how these contribute to a general impression of utility, though each has a different role in different application fields and for different people.
5. CURRENT APPLICATION FIELD OF QUANTIFIED SELF
Today, applications and research projects related to the self-tracking of people's behavior are increasingly widespread. Regardless of the particular application domain (e.g. health, fitness, mood, goals, and time management), most of these tools can be characterized as tracking: a physiological state (e.g. body temperature and breathing), a state of mind (e.g. thinking patterns, mood), a location (e.g. environment, travel), time (e.g. time intervals, performance time), or people (people you are with, interactions). To collect these data, applications that revolve around the QS movement may today resort to direct measurement (using wearable and/or environmental sensors), inference (using semantic reasoning and algorithms: some data can be deduced or inferred from others), or self-reporting (requiring manual data entry). In this section we will review three of the domains in which the QS idea seems to have been realized most successfully in recent years: health, mood, and fitness.
5.1. Healthcare
There is a growing trend in the consumer market to develop sensors and services that support the self-tracking of data concerning health. At the moment, users can access a variety of services and applications that collect, analyze, and display statistics about a diverse range of parameters and
behaviors related to their health. Although there are studies that seek to correlate these parameters with each other or with the context in which
these behaviors are produced (Bentley et al., 2012), typically these tools now detect a single parameter or health-related behavior of the
individual, often relying on a device able to measure it, storing it on a website where the user can view changes over time and compare them with
those of other users. These services are intended to improve the health condition of the patient and help him to live a more salubrious life by changing his behavior in a positive direction. Although it is more and more common to look for new ways to influence current and future health behaviors (e.g. using a "past-self" avatar able to "give birth" to the past behavior of a person, or improving the patient's self-efficacy by presenting "reminders of success" when a failure occurs [Ramirez & Hekler, 2012], or even by using game mechanics related to the world of social media [HealthMonth, 2012]), the more commonly used methods today can be traced back to the display of data about current behaviors and the presentation of information about progress towards a particular short- or long-term goal.
Most of the services are not targeted at specific diseases but rather aim to facilitate the health reporting of individuals suffering from different disorders, trying, with the support of a social network, to facilitate the exchange of information between patients by providing disease progression, drug prescription, and symptom tracking. The most famous is PatientsLikeMe (2012), founded in 2004, which today gathers more than 160 thousand registered patients: it allows them to record the symptoms of their disease in order to find patients in similar conditions and to exchange information about the effectiveness of treatments and the evolution of their medical symptoms. Likewise, CureTogether (2012) allows users to compare the effectiveness of thousands of treatments to find the best solution for their own health needs, and Health Tracking Network (2012) aims to get people to work together to monitor common illnesses and discover factors related to them, using a system of self-reporting and collecting information from third-party tracking tools. HealthEngage (2012) instead appears as an aggregator of health data about individuals, able to trace the general condition of the patient, the medicine he has to take, his daily diet, and any chronic disease he has, importing information from different free tools; HealthVault (2012), a Microsoft service, can arrange in one place all the medical information relating to an individual, providing forms of data visualization that should help patients make more conscious decisions about their health; ReliefInsite (2012), finally, allows users to track pain conditions and store them in a journal that can also help the medical staff reconstruct the patient's condition.
Related to this kind of service, there are more specific applications that can monitor physiological trends and behaviors related to a specific
chronic disease: DiaMedics (2012), DIALOG (2012), and SugarStats (2012) are all services that can monitor the health of a patient with diabetes,
providing statistics, community support and collaborative sharing to motivate and improve health; while Asthmapolis (2012) uses a specific
sensor to track spatially and temporally asthma attacks of a patient and the quantity of medicines he takes.
Specific sensors and dedicated devices are also used by applications that are able to track physiological parameters continuously throughout the day. Glasswing (2012) allows hemoglobin to be measured non-invasively. SenseWear (2012), as well as Fitbit (2012) and Zeo (2012), collect and analyze information related to the physical activity performed by the patient and to sleep behaviors: the tracker monitors how many times and for how long the patient wakes up during the night, providing information on the quantity and quality of sleep patterns. Other research
projects, on the academic side, such as Lullaby (Kay et al. 2011) set more ambitious targets and, using a variety of different sensors (e.g. sensors
for air quality, noise, RGB light, infrared camera) seek to integrate the monitoring of physiological parameters with the environmental factors
that can affect the sleep of the patient. Finally, BAM Labs (2012) uses a non-intrusive system, called Touch-free Life Care (TLC), which, through an under-the-mattress biometric sensor, collects motion, heart rate, and breathing rate without attaching anything to the body.
Taking advantage of a small but growing number of consumers who prefer to skip the costs and delays of specialist medical consultations and order their own laboratory tests, some companies have begun to offer medical lab test services, making it possible not only to order the desired analyses directly from their websites, but also to keep track over time of the variations in the parameters analyzed, such as the level of cholesterol or glucose in the blood. MyMedLab (2012) and Private MD Lab Services (2012) are two examples of services that are part of a more general trend that is bringing individuals to take care of their own health status, tracking, storing, and managing their physiological parameters, in collaboration with health peers and in co-care with physicians. QS applications make this possible, and at the same time are a symptom of a change that is arising in health care, toward a patient-driven health care characterized by "having an increased level of information flow, transparency, customization, collaboration and patient choice and responsibility-taking, as well as quantitative, predictive and preventive aspects" (Swan,
2009).
Finally, companies like 23andMe (2012), Pathway (2012), Knome (2012), Navigenics (2012), and deCODEme (2012) have begun to offer DNA profile tracking to their clients: by tracing the sequence of one million "snips" (snips are the current unit in personal genomics and are the parts of the genome that researchers have noticed vary between individuals), these companies provide a series of numbers and letters that can be related to some aspects of their clients' health and genetic past. These genetic data can then be shared on network sites like the Personal Genome Project (2012) or compared with the information contained in wikis such as SNPedia (2012). As Gary Wolf noted in his article Genomic Openness (Wolf, 2008), the field of Personal Genomics, or the ability to track data about one's genes, is located at the edge of the QS field of action. If the aim of the QS movement is the tracking of the physical and psychological "self" in order to make changes in behavior or take action to improve one's condition (e.g. health), it is not yet clear how knowledge of parts of the genome can lead to this result, even if people find it appealing. Melanie Swan of DIY Genomics (2012) advocates this strategy to develop a form of personalized preventive medicine (Swan, 2011). The complexity of correlating these data with elements relating to personal health or personal phenotype suggests that only an open sharing of raw genetic data and mass public collaboration could lead to some kind of knowledge advancement in the future.
5.2. Mood
Today, the mood tracking area is receiving increasing attention. Applications and services in this field are intended to help users increase their awareness and understanding of all the factors that influence their “mood states” and their mental health. Through different kinds of visualizations, these applications can track changes in mood over time and identify patterns and correlations with environmental and social factors, in order to facilitate the identification of the variables that can affect a person's mental state.
It is possible to find applications such as Track Your Happiness (2012), a research project of Matt Killingsworth at Harvard University: it charts people's happiness by asking them on a regular basis, via email or SMS, what they are doing and how happy they feel at that moment (through questions like: “How happy are you right now?”, “Do you want to do what you're doing?”, “Do you have to do what you're doing?”, “Where are you?”, “Are you alone?”). Reports inform the user about changes in his happiness over time and what factors may have influenced it. Happy Factor (2012) asks the user about his happiness by sending text messages: responses are recorded on a scale from 1 to 10, together with some notes about the activities carried out at that particular time. The application can then return, in visual form, the history, the average happiness for days and months, and a frequency chart of the words used in the notes, from happiest to unhappiest. MoodPanda (2012) is based on the same mechanism: a happiness rating on a 0-10 scale, optionally accompanied by a brief Twitter-like comment on what is influencing the mood.
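As a rough illustration of the kind of aggregation these services describe, the following sketch computes a daily average mood and ranks the words used in the notes from “happiest” to “unhappiest”; the record format and the sample entries are hypothetical, not taken from Happy Factor or MoodPanda.

    from collections import defaultdict
    from datetime import datetime
    from statistics import mean

    # Hypothetical mood log: (ISO timestamp, score on a 1-10 scale, free-text note)
    entries = [
        ("2012-05-01T15:00", 8, "coffee with friends"),
        ("2012-05-01T21:00", 5, "late at work"),
        ("2012-05-02T15:00", 3, "stuck in traffic, late at work"),
    ]

    # Average happiness per day
    by_day = defaultdict(list)
    for ts, score, _ in entries:
        by_day[datetime.fromisoformat(ts).date()].append(score)
    daily_avg = {day: mean(scores) for day, scores in by_day.items()}

    # Average score associated with each word in the notes,
    # ordered from the "happiest" to the "unhappiest" words
    word_scores = defaultdict(list)
    for _, score, note in entries:
        for word in note.lower().replace(",", " ").split():
            word_scores[word].append(score)
    word_ranking = sorted(
        ((word, mean(scores)) for word, scores in word_scores.items()),
        key=lambda pair: pair[1], reverse=True,
    )

    print(daily_avg)
    print(word_ranking)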
Other services, such as Moodscope (2012) and MoodTracker (2012), aim to measure, track and share mood along more than the happiness dimension. The first, for instance, uses a simple card game to track daily mood, brings out what can have a positive or negative influence on it, and allows data to be shared with a list of trusted friends who can support the person in improving his overall mood. Both applications are designed not only for the general public, but especially for those suffering from depression and bipolar disorder. FindingOptimism (2012), too, is a service whose purpose is to increase understanding of the factors that can affect an individual's mental health, helping to identify the “triggers” that affect the patient and the early warning signs of new episodes of mental disorder, and supporting the compilation of a wellness plan with detailed strategies for dealing with events related to the disorder. Less oriented to the therapeutic scope is Gotta Feeling (2012), which aims to track the user's emotions so that they can be shared on his social networks; it uses a different model, asking the user to indicate which emotions he feels by choosing from 10 different categories and connecting them to a more precise list of words that express the feeling: the reports keep track of all the registered feelings and the places and people to which they were linked.
Finally, some services use dedicated devices to track and return the overall mood of the user. This is the case of Rationalizer (2012), a kind of “emotional mirror” in which the user sees the intensity of his feelings, with the purpose of making his financial decisions less emotionally charged and more rationally founded. Rationalizer consists of two dedicated devices. The EmoBracelet measures the intensity of the user's emotion, in the form of an arousal level, through a galvanic skin response sensor. The arousal level is shown as a light pattern both on the EmoBracelet and on the second device, the EmoBowl (a kind of illuminated dish that displays different light patterns). The higher the level of arousal, the more intense the dynamic light pattern, the larger the number of graphical elements in the pattern, the greater their speed, and the more intense their color. BodyMonitor (2012), instead, is a research project of the Leibniz Institute for the Social Sciences which, using a wearable armband, measures heart rate and skin conductance to determine the emotional state of the user. Finally, StressEraser (2012) is a portable biofeedback device designed to reduce the user's stress by synchronizing his heartbeat with his respiratory rate: the device displays heart rate variability on a graphical display, suggesting how to control the breath using visual cues.
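The arousal-to-light mapping described for Rationalizer can be sketched as follows; the parameter names, value ranges and thresholds are illustrative assumptions, not the actual product design.

    # Minimal sketch: map a normalized galvanic-skin-response arousal level
    # (0.0-1.0) to the visual parameters of a dynamic light pattern.

    def light_pattern(arousal: float) -> dict:
        level = max(0.0, min(1.0, arousal))
        return {
            "num_elements": int(5 + level * 45),   # more graphical elements
            "speed": 0.2 + level * 1.8,            # faster animation
            "color_intensity": level,              # more intense color
        }

    print(light_pattern(0.2))   # calm: few, slow, pale elements
    print(light_pattern(0.9))   # aroused: many, fast, intense elements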
5.3. Fitness
Self-tracking wearable devices are increasingly used in the consumer market (especially in the fitness arena) to track calorie consumption and daily physical activity and to support self-awareness and healthy behaviors. These devices automatically recognize positive behaviors (such as walking) and track changes over time: the underlying idea is that always-available displays can increase the individual's awareness of his physical activity level, which can be particularly valuable when people try to change their habits (Consolvo et al., 2008). In this sense, the fitness area is the one that makes the most use of dedicated devices (or smartphones) able to track physical activity and physiological parameters to improve physical well-being. These systems can either monitor overall daily physical activity or be dedicated to tracking specific sports.
The Nike + iPod Sport Kit (2012) is perhaps the most widely used physical-activity tracking system on the consumer market today, and consists of a suite of interconnected devices such as Nike+ running shoes, the Nike+ sensor, and an iPod Nano, iPod touch or iPhone. The Nike+ sensor can be integrated into the Nike+ shoes, transmitting step frequency to the user's Apple device, which displays time, distance, pace and calories burned. The system also allows performance data to be uploaded to the nikeplus.com site, where physical activity statistics can be shared and compared with those of other users. The Adidas miCoach service (2012) is based on the same principles: through the SPEED_CELL™ sensor it tracks running speed, distance run and heart rate, providing real-time digital coaching, interactive training and post-workout analysis of pace, distance and stride rate. Runkeeper (2012) instead uses the GPS built into the user's iPhone to track runs, recording distance, duration, speed and calories, preserving the running history and showing progress toward objectives.
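As a minimal sketch of how such GPS-based services can derive distance and pace from a sequence of timestamped fixes, the following uses the standard haversine formula on made-up sample points; it is not the actual Runkeeper implementation.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometers between two GPS fixes."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))

    # Hypothetical run log: (seconds since start, latitude, longitude)
    fixes = [(0, 45.0640, 7.6600), (60, 45.0655, 7.6610), (120, 45.0670, 7.6622)]

    distance_km = sum(
        haversine_km(fixes[i - 1][1], fixes[i - 1][2], fixes[i][1], fixes[i][2])
        for i in range(1, len(fixes))
    )
    duration_min = (fixes[-1][0] - fixes[0][0]) / 60.0
    pace_min_per_km = duration_min / distance_km

    print(f"distance: {distance_km:.2f} km, pace: {pace_min_per_km:.1f} min/km")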
Other solutions are able to track several parameters and behaviors simultaneously, correlating them. The Jawbone UP System (2012) uses a wrist-worn motion detector interfaced with an iPhone application, permitting the user to track daily physical activity and sleep behaviors, reminding him when he should exercise and waking him up at the right time. Fitbit (2012), too, is a device that can be worn throughout the day, allowing the user to track physical activity and sleep patterns. Once the data are collected, the user can view them on a website or mobile application. Fitbit has also been used in university research projects to identify unhealthy behaviors and intervene at the appropriate time to correct them: Fitbit+ uses Fitbit technology to identify sedentary moments in the day and prompt users to take walking breaks (Pina et al., 2012). BodyBugg (2012) uses a series of sensors, such as an accelerometer, a skin-temperature sensor, a galvanic skin response sensor and a heat sensor, to measure physical exercise and track how many calories the body has used during an activity. Many QS applications have been developed precisely around eating behavior: they can track daily weight changes and monitor the user's diet. MyFitnessPal (2012) allows users to keep track of their daily diet by simply adding the foods eaten during the day, drawn from the system's database, to a personal food journal. The display of eating habits and calorie consumption, combined with the support of a social network of users, should provide motivation to adopt strategies to reduce body weight. Loseit (2012) and My Calorie Counter (2012) are based on the same principle of tracking eating habits through self-reporting and viewing statistics that could make people more aware of how they behave every day. The Withings scales (2012), instead, track daily weight, body mass index and fat mass index: the user's smartphone can then display statistics showing significant changes over the selected time period.
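Body mass index, as tracked by connected scales, is simply weight in kilograms divided by the square of height in meters; the short sketch below applies the formula to a made-up series of readings.

    def bmi(weight_kg: float, height_m: float) -> float:
        """Body mass index: weight (kg) divided by height (m) squared."""
        return weight_kg / (height_m ** 2)

    # Hypothetical scale readings: (date, weight in kg) for a user 1.78 m tall
    readings = [("2012-03-01", 82.0), ("2012-04-01", 80.5), ("2012-05-01", 79.2)]
    height_m = 1.78

    for date, weight in readings:
        print(date, f"weight={weight:.1f} kg", f"BMI={bmi(weight, height_m):.1f}")

    # Change over the selected period
    print("weight change:", f"{readings[-1][1] - readings[0][1]:+.1f} kg")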
6. QUANTIFIED SELF FUNDAMENTAL PROBLEMS
6.1. Technological Limits for Collection, Processing and Visualization of Large Dataset
At present, the increasing miniaturization of electronic modules and the processing power of microprocessors, the developments in sensor technologies, the new possibilities offered by displays and micro-displays, the advances in mobile networks and the new models of dynamic data visualization all aim to make large amounts of data accessible in fewer pixels and with less attentional effort from the user.
On the data collection side, wearable computing, body sensor networks and RFID tags now allow information to be gathered automatically, making it possible to envision constant monitoring of individual behavior. The data collected by these technologies can today also be stored and structured (through semantic reasoning techniques) into almost tangible “knowledge” that can be manipulated and used for different purposes. The appearance on the market of very large screens and highly innovative technologies, such as 3D printing, enables the exploration of new ways of viewing data and, in some cases, their physical materialization.
Sensors available on the market today make it possible to monitor a variety of physical, biochemical and physiological parameters of individuals, as well as of the environments in which they move, at a very affordable price and with very low power consumption. For example, from the perspective of physiological and biochemical parameters, sensors for measuring heart rate, blood pressure, respiratory rate, temperature, and muscle and brain activity are increasingly widespread (e.g. Shaltis et al., 2006; Corbishley & Rodriguez-Villegas, 2008), while the development of flexible circuits that can be inserted within tissues is allowing these sensors to be integrated also in wearable device scenarios (Barbaro et al., 2010). In addition, accelerometers, gyroscopes and magnetometers integrated in wearable devices are now used to track people's movements, and the data available from these sensors can be combined with information from other ambient sensors, such as motion sensors placed in a domestic environment, in order to determine, for instance, the type of activity performed by an individual (Bonato et al., 2012). All these sensors are often integrated in a sensor network that relies on modern wireless communication networks. In recent years numerous standards have emerged for wireless communication networks that fulfill the requirements of miniaturization and low cost / low power consumption of transmitters and receivers. Not only the development of IEEE 802.15.4/ZigBee (ZigBee Alliance, 2012) and Bluetooth, but also of the IEEE 802.15.4a standard based on Ultra-Wide-Band (UWB) impulse radio, makes it possible to foresee a set of sensor network applications with a high data rate (Zhang et al., 2009), easily deployable in many kinds of environments. In addition, the need to transmit the data gathered from the sensors to a terminal that can process them, such as a mobile phone or a personal computer, can now be easily met by GSM or 3G mobile communication networks and soon by new LTE networks. Current and future smartphones look like the ideal platform for applications that have to continuously monitor individual data. They have large built-in computing capacity and excellent graphic displays, are equipped with motion sensors and GPS systems capable of tracking the movements of individuals, and have networking technologies able to connect with the surrounding environment, allowing them to become hubs for body area networks (BANs). Moreover, smartphones are always carried by their users, eliminating the disadvantage of having to carry dedicated devices for recording specific daily activities and behaviors (Rawassizadeh et al., 2012).
On the data storage and processing side, the steady increase in capacity now makes it possible to store and process an amount of data simply unthinkable a few years ago. Today a one-terabyte hard drive is available on the market for less than $100 and can contain all the written information we come into contact with in the course of a lifetime (mail, books, web pages). Twenty years from now, we will be able to buy at the same price 250 terabytes of storage, sufficient to hold ten thousand hours of video and tens of millions of photographs: a capacity able to meet the recording needs of one hundred years of life (Bell & Gemmell, 2007). If the increasing miniaturization of digital media continues to proceed according to Moore's Law, it will double around 47 times in 70 years, and by 2072 the physical storage space needed to contain all the experiences of a lifetime will be the size of a grain of sand (Dix, 2002). In addition, all collected and stored data can now be structured semantically, using reasoning rules (made feasible by the increasing computational power of today's processors) to extract relevant information in response to complex queries. Ontologies and systems that can automatically annotate data with metadata can provide a structure for the information collected (e.g. Gruber, 1995; Guarino, 1998) and also extract knowledge from large stores of unstructured data, allowing relationships, patterns and connections to surface. The extreme evolution of this approach makes it possible to envision retrieving links to specific items in a way similar to the retrieval mechanisms of human associative memory (O’Hara et al., 2006).
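The figure of about 47 doublings follows directly if one assumes, as a common reading of Moore's Law, one doubling roughly every eighteen months:

\[ \frac{70\ \text{years}}{1.5\ \text{years per doubling}} \approx 46.7 \quad\Longrightarrow\quad 2^{47} \approx 1.4 \times 10^{14}, \]

so the same lifetime archive would occupy roughly fourteen orders of magnitude less physical volume, which is what shrinks it to the size of a grain of sand in the Dix (2002) estimate.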
On the data visualization side, technological advances are moving in two opposite directions. On the one hand, very large screens capable of giving access to large amounts of data, and of allowing rapid interaction with them in order to gain understanding and take decisions faster, are available at ever lower cost; on the other hand, there is a growing effort to find new ways to maximize the amount of information visible in a limited space, due to the growing popularity of smartphones and tablets as first-choice devices for human behavior tracking applications (Few, 2009). Many configurations of large displays are available today. The Cave Automatic Virtual Environment (CAVE™) is a projection-based display system (e.g. Mechdyne CAVE™ Virtual Reality, 2012) that surrounds the viewer with an immersive environment, commonly composed of four large wall displays arranged as a cube (Cruz-Neira et al., 1993), with resolutions of 100 megapixels or higher; La Cueva Grande, for example, is a five-sided CAVE with 33 projectors and a total resolution of 43 megapixels (Canada et al., 2006). There are also monitor-based wall displays that, by combining the resolution of LCD monitors, can reach an overall resolution of hundreds of megapixels (e.g. LambdaVision, a wall of 5 x 11 LCD panels with about 100 megapixels spread over a width of 5 meters [Renambot et al., 2012], and the hyperwall, composed of 49 LCD panels tiled in a 7x7 array [Sandstrom, 2003]): however, these configurations have the problem of not providing a continuous display, being interrupted by screen bezels (Thelen, 2010). There are also projector-based systems that combine multiple computer projectors arranged in a grid in order to project on very large screens (e.g. the ten Sony SXRD 4K SRX-S110 projectors combined to create an image of approximately 88 megapixels in ITC's Michigan Control Room (2012), and Visblock™ (2012)). Projector resolution is increasing, and there is steady progress in optimizing calibration, for example in color gamut matching, luminance matching and image blending (Ni et al., 2006). Finally, stereoscopic displays that can show 3D images using special glasses, and even autostereoscopic displays that eliminate the need to wear glasses to obtain the 3D effect (e.g. the AU Optronics autostereoscopic prototype [Information Displays, 2012]), are increasingly widespread in the consumer market. Although the costs of these devices are progressively going down, so that it is possible to imagine their massive spread in the near future beyond research contexts and into the consumer market (both for very large LCD monitors and for autostereoscopic 3D displays), commercial tools that aim to track people's behavior and return meaningful views are today looking mainly to the mobile device market to deliver their services. The advent of high-resolution micro-displays, such as the 4” Retina Display of the iPhone 5 (1136x640 resolution) or the 9.7” display of the new iPad (2048x1536 resolution), pushes research toward new visualization solutions, such as interactive dashboards able to cope with the relative narrowness of the display surface of these devices.
The spread of human behavior tracking applications on mobile devices poses serious technical issues that are currently not fully resolved. Mobile devices, even if their ability to collect, process and display data is constantly increasing, suffer both from the small size of their displays and from limited processing power, which restricts the amount of data that can be managed locally and prevents the use of computationally expensive algorithms; the extreme segmentation of the smartphone and tablet market also raises the problem of severe variability between different models in terms of performance and input peripherals (Burigat & Chittaro, 2007). Moreover, the mobile context introduces a number of problems compared with fixed devices: on the one hand, the physical environment can affect data visualization on the display (e.g. mobile devices can be used in lighting conditions that vary from bright sunlight to total darkness, affecting the perception of colors); on the other hand, the mobile context makes it difficult to keep the user's attention focused on the device, because of the activities that are often performed at the same time and that turn the use of the device into a secondary task (Chittaro, 2007).
However, other technological problems are shared by applications in both fixed and mobile environments. One of the central issues is that users and systems often have to deal with digital data coming from various sources in heterogeneous forms and formats (Whittaker et al., 2012). In particular, this problem relates to the capacity to store large amounts of information that is not limited to textual data but also includes multimedia. Search and retrieval of textual information is now relatively simple, but other media present major problems in framing queries and organizing memory stores, and adding text annotations requires a great effort that often does not cover the entire range of possible meanings that an image or a sound can convey; furthermore, the future possibility of integrating new types of information (such as olfactory and haptic data) poses new problems concerning how these heterogeneous “media” can be integrated, indexed, searched and retrieved (O’Hara et al., 2006). Last but not least, there is the central problem of how to handle this immense amount of information, because the possibility of capturing vast arrays of data is not yet balanced by the possibility of efficient and really meaningful search, and this can overwhelm users in their effort to retrieve valuable information from large clusters of data (Abigail & Whittaker, 2010).
As we have seen, the technological problem is quite relevant, even if the trend line makes it likely that it will matter less and less in the future. The most worrying aspects of the QS field appear instead to lie at a higher level. What is missing is a convincing position on how to return the data people collect to them with sufficient clarity and sense, and how to structure a model of human behavior change that, given these data, makes it possible to use them in the desired direction.
6.2. Theoretical Limits in Perception and Cognition
6.2.1. Data Visualization
Data visualization is the way in which all the collected data are rendered and made available to the user. How the data “appear” is the first step the data take into the user's brain, so it is important to consider the research on perception, chart reading and the organization of complex data, because this first step is crucial in the pathway that data follow into cognition. How data visualization is realized can modify and affect the subsequent processing of the same data. The manner in which data are presented is therefore hugely important, and is strongly influenced by the model of knowledge that is embraced: this is the reason why we will treat this aspect first and the cognition aspect later.
The first theoretical foundations available to help designers organize information from a perceptual point of view are the Gestalt principles. Associationist psychology considered perception as the sum of simpler stimuli linked directly to the physiological substrate of the sensory systems. With the development and consolidation of Gestalt psychology, the focus of investigation on perceptual processes moved from this earlier elementaristic conception to a more complex notion of perception as the result of the interaction and global organization of its various components. Gestalt psychology, using a phenomenological approach to perception, canonized a series of perceptual laws independent of external experience (hence not connected with learning processes) and present from birth.
These laws analyze figural organization, taking into account the separation of figure from background (by color, density, texture, contour). Wolfgang Köhler, one of the founders of Gestalt psychology, suggested the following laws:
1. Law of Superposition: The forms on top are seen as figures. In order to perceive an overlap there must be evidence of depth.
2. Law of the Occupied Area: The separate zone that occupies the smaller area tends to be seen as the figure, while the wider one is seen as the background. This mechanism for identifying objects against the background works even if the closure is incomplete.
Further studies related to Gestalt psychology aimed to postulate general laws for synthesizing multiple items into a single global perception:
1. The Simplicity or “good form” law, which summarizes the whole logic of perception: data are organized in the simplest and most consistent way possible, according to previous experience.
2. The Similarity law, which states that when elements are arranged in a disorderly way, those that are similar tend to be perceived as a form, separate from the background; the perception of the figure is stronger the greater the similarity of the items.
3. The Continuity law, which states that, other features being equal, a perceptual unit emerges among the elements that offer the least number of irregularities or interruptions.
Other studies have instead explored the figural elements used for the perception of the third dimension, which is in fact linked to the perception of motion. Among the main indicators identified:
• Brightness
The laws of perception are considered innate because they are not the result of learning, although it has been demonstrated that there is a developmental progression in the formation of perception. From the first months the baby is able to recognize colors and shapes (especially the human figure), but only later does he acquire “perceptual constancy”, the ability to link a shape or figure already known with a different one in which he recognizes similar characteristics (e.g. a statue associated with a person).
These laws are easily testable in everyday life and direct evidence is easy to obtain with optical effects. But they are only the basis of the knowledge useful for improving data visualization for human perception. The purpose of data visualization as a discipline is to display parametric information with essentially a twofold perspective: on the one hand to better understand the data, on the other to extract evidence not extractable otherwise. There are many hidden meanings behind each set of data, and to discover them the visualization choices for that data set must be functional to the emergence of those meanings. When we say that “a picture is worth a thousand words”, we mean precisely that the image is effective in conveying at a single glance all the meaning hidden in a thousand words or a thousand numbers. But how do we choose the best visualization? Again we are dealing with a problem that has its basis in the theoretical research on perception and cognition, that is, in how humans see, filter, store, learn and retrieve complex content. For the moment let's stop at the first step; we will see the basic theories of cognition in the following paragraphs. Speaking of perception, we have already mentioned the fundamental Gestalt laws, but what other theoretical elements do we have for providing a good visualization to users? Much of what data visualization is today we owe to Descartes, who in the 17th century invented the coordinate axes, to some extent the first and most basic form of visualization of a two-dimensional data set. In the modern sense, the first significant contributions are due to William Playfair, who as early as the 18th century began to develop almost all the types of charts still in use (bar charts, pie charts, etc.). There are no other significant contributions until Jacques Bertin's “Semiologie Graphique” in 1967, in which for the first time we find a complete reflection on the best ways to represent different kinds of information. Then there are the fundamental works of Tukey (1977), Tufte (1983) and Card et al. (1999), the most advanced in defining the best forms of representation. Last but not least, for a complete and comprehensive treatment of data visualization and its foundations, see Colin Ware's Visual Thinking for Design (2008). All this attention to the way we represent data has only one purpose: that our eyes can distinguish good information and our brain can understand it. All this can be achieved simply by drawing on the results of 100 years of experimental psychology. Although this knowledge has always been available, it has not been used very much. In the last 10-15 years a number of products have been developed that were supposedly designed to give meaning to data, but that in fact simply cover data with shiny graphics of purely aesthetic value, ignoring even the basic requirements that a visualization must satisfy to be considered a good one. These requirements about what a good visualization should do can be summarized in a few statements (Few, 2010).
These simple requirements inform us about the quality of a visualization and show us, for example, how pie charts are not suitable for many types of data, or how the old bar charts are sometimes much more efficient. If a visualization does not meet these criteria, it is not suitable for that particular type of data.
The motivation for making better visualizations is to enable a primary processing already at the perceptual level, without even having to engage the cognitive processing level. It is a form of cognitive economy that allows us to obtain all the information we need with fewer resources. The Gestalt laws mentioned at the beginning of the paragraph are based precisely on this economy principle. Today, however, there are interesting studies coming from the neurosciences that provide even more interesting material (Guidano, 1983). The first research area is pre-attentive visual processing, a kind of processing that occurs before the data reach the level of consciousness. It is carried out by particular neuronal structures able to perceive length and orientation, but also more complex properties such as shading, grouping and three-dimensional orientation.
The second, broader research area concerns the mechanisms that govern attention and memory. With regard to attention in particular, recent studies have shown that attentional processes focus only on some parts of the scene. Data visualization techniques should therefore be able to govern their points of attentional salience, organizing the data to be conveyed almost like a “real world” (Rensink, 2002). These aspects, together with the dynamic views required to manipulate large data sets (nowadays we are limited to organizing levels starting from an overview and providing filtering and zooming features), suggest that the entire discipline of data visualization, especially when linked with large QS datasets, should become a discipline of data interaction rather than of simple data visualization. In the next paragraphs we will see how storytelling to some extent represents a solution that includes both the aspect of interaction with the real world (fostering pre-attentive elaboration) and the complexity of managing large amounts of data. Roambi (2012), a company specialized in data visualization, is going in this direction with its latest product, Roambi Flow, which allows data to be “told” to the intended audience.
After this theoretical analysis of data visualization, let's move on to the theoretical foundations of behavior change theory that are implicitly (sometimes without awareness) used by many QS applications.
6.2.2. Behavior Change Theories
As mentioned previously, one of the fundamental problems of QS is that it does not have a well-structured and well-established theoretical basis. QS is nowadays mostly a series of more or less significant experiences in which an application, in a certain domain, worked at a certain time for someone in particular. Basically it is not a generalizable or falsifiable approach according to scientific criteria (Popper & Eccles, 1977). It is a pragmatic and empirical approach, which carries with it the limitation of not progressing beyond a few single anecdotal cases and runs the risk of disappearing or becoming irrelevant. On the other hand, while today's technology makes QS look very modern, in a variety of fields the habit of recording data in order to find relationships between variables and to drive change is quite an old idea, dating back at least to the birth of the industrial age. In the medical, financial and industrial fields it has been customary for decades to fill graphs and spreadsheets, and also on a personal level many biographies reveal peculiar ways of organizing self-knowledge through the manual annotation of notes and tables and the filling of summary sheets.
The most innovative aspect is perhaps, more than anything else, that QS, taking advantage of new technological devices, frees the individual from the burden of personally recording every piece of data. But there is another aspect of substantive novelty. As long as we remain in the domain of finance or political science, the change process driven by the new knowledge emerging from data seems easy to implement according to a rationalistic model. When we move into the domain of personal and individual change, things get complicated. What are the rules that govern individual change? Are they so clear? Can we put them into a structured system that can predictably drive change? In part yes, and in part not yet. The various paradigms that developed over the 20th century, and that are still evolving thanks to the neurosciences, have tried precisely to answer these questions. Some pieces of this knowledge are now part of the simplified change theories that are implicitly implemented (often misused and distorted) within many QS applications. The following paragraphs give a brief summary of the paradigms borrowed by this “naïve change theory”.
6.2.2.1. Behaviorism and the Positive/Negative Reinforcement
Anyone who has tried to do fitness with the Nintendo Wii Fit will have noticed that, according to his performance, the system criticizes him if the performance is poor or, on the contrary, praises him if it is good. The system simply seeks to introduce positive or negative reinforcement, a paradigm derived from behaviorism. Behaviorism was originally developed by the psychologist John Watson at the beginning of the twentieth century, based on the assumption that explicit behavior is the only unit of analysis that psychology can study scientifically, because it is directly observable by the researcher. The mind is thus seen as a black box whose inner workings are unknowable and, in some respects, irrelevant: what really matters for behaviorists is to have a thorough understanding of the empirical and experimental relationships between certain types of stimuli (environment) and certain types of responses (behavior). Within this broad approach, emphasis is placed on particular aspects. One of the major assumptions is the mechanism of conditioning, according to which the repeated association of a stimulus, called the neutral stimulus, with a response not directly related to it will ensure that, after a period of time, that stimulus is followed by the conditioned response. The famous experiment of Pavlov's dog, which everyone knows, refers to this type of mechanism.
With his writings “The Behavior of Organisms” (1938) and “Science and Human Behavior” (1953), Skinner laid the foundation for the discovery of the laws and most important paradigms of the field, giving rise to a new way of conceiving the causes of behavior and thus considerably enlarging the possibilities of influencing observable behavior. His great merit is in fact to have found that human behavior is predictable and controllable through an appropriate management of two classes of stimuli from the physical environment: “antecedent” stimuli, which the organism receives before implementing a behavior, and “result” stimuli, which the organism receives immediately after the behavior has been put in place. After Skinner's discoveries, a growing number of researchers progressively developed many techniques for behavioral change in almost all areas of application and, from the mid-seventies, even within organizations and in the specific field of work safety.
In North America, the birthplace of the QS movement, this behaviorist perspective strongly permeates the environment, especially in common-sense psychology. In QS terms, this paradigm translates into the use of positive or negative reinforcement depending on the behavior to be driven. Examples are badge gaining and the collection of awards as elements of positive reinforcement (e.g. Foursquare, Nike+) or, conversely, blame and criticism when the behavior is not consistent with the purposes of the application. This mechanism is well documented but, as the latest generation of behaviorists themselves have stressed, it is not rigidly deterministic. There are in fact a number of intervening variables that change, or at least modulate, the Stimulus-Response (S-R) arc. For example, it is necessary for the user to perceive the reinforcement correctly, to understand it, to weigh it, to check whether it is relevant for him, and so on. In the case of Nike+, the praise that arrives from the system or from the community can have a totally different weight for two different people, or even for the same person at different times of his life or even of his day.
Recognizing the excessive simplicity and rigidity of the behaviorist paradigm, some behavioral psychologists, called “neo-behaviorists”, proposed some corrective premises (the so-called “intervening variables of the S-R process”), opening the way for the further development of cognitive psychology. The evident role of internal and external variables in determining behavior, demonstrated by many experiments, paved the way for at least two other types of theoretical contribution to the behavior change topic: cognitivism, introducing the internal variables, and social psychology, introducing the external variables influencing behavior.
6.2.2.2. Cognitivism: The Role of Mind
Cognitivism is sometimes considered an evolution of behaviorism because it introduces more complexity into the S-R arc, recovering the concept of mind (the black box originally excluded by behaviorism). The cognitivist focus is on the mind as an intermediate element between behavior and purely neurophysiological brain activity. The operations of the mind are metaphorically compared to those of software that processes information (input) coming from the outside and returns information (output) in the form of knowledge representations and semantic and cognitive networks. Perception, learning, reasoning, problem solving, memory, attention, language and emotions are the mental processes studied by cognitive psychology (Neisser, 1967). In the early cognitivist models, processing was conceived as occurring in successive stages: once one step was finished, the “system” moved to the next, and so on. In the '70s new models were presented that highlighted both the possibility of feedback from a processing stage to the previous ones, and the possibility of activating operations of the next stage before the previous ones had finished processing their information.
Another important aspect was the emphasis on the specific goals targeted by mental processes. Behavior was now conceived as a series of acts guided by cognitive processes toward the solution of a problem, with constant adjustments to ensure the best solution. The notion of “feedback”, developed by cybernetics, became central in this conception of behavior directed toward a goal. The experimental psychologist of language George Miller brought about a real turning point in the representation of behavior: behavior was now seen as the product of a data-processing system, driven by the development of a plan helpful for solving a problem, in a certain sense like a computer (Miller et al., 1960). In this new model, behavior was therefore not an epiphenomenon of a reflex arc (sensory input, processing, motor output), but the result of a process of continuous retroactive monitoring of the behavior plan according to the TOTE unit (test, operate, test, exit). The final act (exit) does not follow directly from a sensory input or a motor command, but is the result of a verification of the previous environmental conditions (test), intermediate operations (operate) and new tests (test). In the feedback model, completing a specific goal involves some verification stages and then an exit from the plan (if the goal is reached) or new operations (if the goal is not reached). This model is also implicitly and widely used in QS applications: if there is a goal that the application advocates, constant quantification (test) can show the user how far he is from it and so drive him to focus and continue the action (operate) toward the goal, or to stop when the goal is reached (exit), as sketched below.
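A minimal sketch of such a TOTE loop applied to a hypothetical daily step-count goal (the goal value, the simulated measurement and the reminder are placeholders, not taken from any cited application):

    import random

    # Hypothetical daily step goal and a stand-in for whatever sensor feeds the app.
    GOAL_STEPS = 10_000

    def measure_steps(current: int) -> int:
        """Test phase: read the quantified value (here simulated)."""
        return current + random.randint(500, 2000)

    def nudge_user() -> None:
        """Operate phase: prompt the user toward the goal (e.g. a reminder)."""
        print("Reminder: take a short walk.")

    steps = 0
    while True:
        steps = measure_steps(steps)       # test
        print(f"steps so far: {steps}")
        if steps >= GOAL_STEPS:            # test against the goal
            print("Goal reached.")         # exit
            break
        nudge_user()                       # operate, then test again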
The concept of the mind as a feedback-driven computing engine underlies many QS applications and to a certain extent is also one of their main limitations. Knowing how far we are from a goal is certainly informative in itself; however, it is not enough to motivate behavior in a generalizable way (for instance, knowing what my sleep patterns are hardly makes me change the time I go to bed). Other variables can intervene: for instance, external social variables.
6.2.2.3. Social Psychology: The Role of External Variables
Many QS applications refer to a social and interpersonal dimension, both in competitive terms, as in the cases where there is a gaming dimension and, for example, the most virtuous climb a ranking (many energy-saving applications are based on this model), and in cooperative terms, where many people work together to reach an objective valuable for all (as in many healthcare QS applications).
Social psychologists typically explain human behavior in terms of the interaction between mental states and social situations. In Kurt Lewin's famous heuristic formula (1951), behavior (B) is seen as a function (f) of the interaction between the person (P) and the environment (E), B = f(P, E). Social psychology is an interdisciplinary domain that bridges the gap between psychology and sociology. During the years immediately following the Second World War there was frequent collaboration between psychologists and sociologists; in recent years the two disciplines have become increasingly specialized and isolated from each other, with sociologists focusing on “macro variables” (social structure). Nevertheless, sociological approaches to social psychology remain an important counterpart to psychological research in this area. Cognitive strategies are influenced by our relationships with others, by our expectations of their reactions, and by our belonging to one group or another, a membership that contributes to defining who we are and our social identity. In particular, the group is a unit with its own overall social identity, which determines what each member expects from the others in terms of behavior. This social identity is then linked to the social identity of the group members: our identity is, in fact, largely a function of our belonging to different social groups.
7. TOWARD A MORE ROBUST APPROACH IN QUANTIFIED SELF
7.1. New Model of Data Visualization: Telling Stories
In recent years, much research has focused on the role that storytelling can play in data visualization. The great similarities between good data visualization and the ability to tell engaging stories through images have often been highlighted. As early as 2001, Gershon & Page (2001) predicted that technology could use different genres and media to convey information in a story-like manner. However, since what makes data visualization different from other types of visual storytelling is the complexity of the content to be communicated (Wojtkowski & Wojtkowski, 2002), using storytelling effectively requires skills like those familiar to movie directors, beyond a technical expert's knowledge of computer engineering and computer science (Gershon & Page, 2001). At present, the applications that have attempted to implement narrative elements within data visualization are very few (e.g. Heer et al., 2008; Eccles et al., 2007). Moreover, since none of these seems to go beyond incorporating some superficial narrative mechanisms into the flow of data visualization, the research carried out also appears limited to enumerating stylistic and narrative mechanics decontextualized from their original media. Edward Segel and Jeffrey Heer (2008), for example, analyzing several case studies of narrative visualizations, identify three classes of features (genres, visual narrative tactics, and narrative structures) that can be considered patterns for narrative visualization. Genres identify established visual structures that can be used to communicate data, such as comic strips, slide shows, and film/video/animation; visual narrative tactics are visual devices that assist and facilitate the narrative; narrative structures are mechanisms that support the narrative. Nevertheless, the analogy with stories seems strained. As Zach Gemignani (2012) underlines, many of the key elements of stories are not present in data visualization: characters, a plot, a beginning and an end. On the other hand, data visualizations have characteristics that are missing from traditional storytelling: as interactive means, they allow users to explore the data actively and dynamically, finding insights by themselves. For these reasons Gemignani notes that data visualization today has less to do with telling a story than with accompanying the audience in a guided conversation.
However, the QS field seems to offer new opportunities for the use of storytelling in data visualization. Visualizing human behavior means, to a certain extent, putting the subject as an individual at the center of the visualization. Finding new ways to bring these data alive, to make them meaningful in the eyes of the person they belong to, seems to be the real challenge that QS must face. In this sense, one of the key elements of storytelling, the character, through which the story takes a perspective and a point of view, can acquire huge importance. Aggregating data about personal behavior in the form of a fictional character can be seen as a way to give sense in two ways: as meaning and as direction. The direction is firstly temporal, from the past to the future, and, through the overcoming of continuous tests and objectives, key elements in narratological theory (e.g. Greimas, 1987; Propp, 1927), it sets in motion a narrative development that can really bring data visualization, no longer superficially but essentially, to a new level closer to storytelling. From this point of view, video games today seem the narrative media form best suited to inspire the creation of new ways to display behavioral data. Video games have in fact managed to create a form of interactive narrative where hypertext narrative had failed, deeply involving the audience while leaving the user the power to influence the story told. Although they do not offer a completely open narrative, which, as noted by Jesse Schell in his book (2008), is difficult to achieve both in technical and in design terms, as well as difficult to use, video games, especially in the form of MMORPGs (Massively Multiplayer Online Role-Playing Games), leave their users broad room to build a mirror of themselves in dynamic ways. Growing their own virtual avatars into different forms and sizes depending on the choices made, users can, within these worlds, see themselves from different perspectives in a logic that encourages reflection on the actions taken, the objectives achieved, and the changes these have produced on their own subjectivity and identity. From these media products, therefore, QS can draw design strategies, tactics, forms of representation and temporal evolution that in the near future could revolutionize the display of human behavior data.
7.2. Managing Complexity in Technology and People
We can understand the management of complexity in two directions, one more technological and one more human. Let's consider them separately.
7.2.1. Technological Side
What is missing today is a platform that integrates all the data a person collects during his daily activities into a complete, adequate, integrated picture of the individual. Nowadays there are only a small number of recorders that manage certain parameters and render them in different, separate applications. In this sense, the technological complexity that QS will have to face in the coming years is the tracking of many different kinds of data and their integration into meaningful visualizations. In a certain sense it is the rediscovery of the original ambitions of lifelogging described in the first paragraphs, but on a more massive, pervasive and seamless scale. We could call it the “ultimate lifelogging scenario”.
Some examples in this direction already exist: for instance, Capzles (2012) uses social storytelling to allow users to create chronological slideshows containing photos, videos and slide decks placed on a timeline. LifeLapse (2012) is an iPhone app that takes a photo every 30 seconds while the iPhone hangs from the user's neck. The preserved images can then be viewed one by one or mounted into a video that evokes the recorded life experience. While in these two applications building the lifelogging experience from scratch requires the user's active intervention to continuously record his life experience, other applications, such as your.flowingdata (2012) and Daytum (2012), exploit the popular micro-blogging site Twitter as an input mechanism to trace users' daily lives. Daytum allows you to collect, categorize and communicate your everyday data through the storage and display of personal statistics related to daily events, providing various forms of statistical display, such as pie charts, bar charts, timelines, and so forth. Your.flowingdata captures users' lives using data from Twitter: by following @YFD on Twitter, the user can begin to record his experiences simply by sending a tweet to @YFD. Data about what the user eats, sees and more generally experiences are recorded by the system in order to be displayed on the application site. Display modes range from timelines to charts, also integrating experimental visualizations that allow the user to find cross-correlations between data, to explore the durations between start and stop actions, and to use calendar visualizations, which display the frequency of an action on a given day through the intensity of color (Yau & Hansen, 2010).
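The calendar view described by Yau & Hansen can be illustrated with a small sketch; the tweet format and the linear color scale below are assumptions for illustration, not the actual your.flowingdata syntax.

    from collections import Counter
    from datetime import datetime

    # Hypothetical logging tweets sent to the tracking account: "<action> at <ISO time>"
    tweets = [
        "coffee at 2012-06-01T08:10",
        "coffee at 2012-06-01T15:30",
        "run at 2012-06-02T18:00",
        "coffee at 2012-06-02T08:05",
        "coffee at 2012-06-02T16:45",
    ]

    # Count how many times each action happened on each day
    counts = Counter()
    for tweet in tweets:
        action, _, timestamp = tweet.rpartition(" at ")
        day = datetime.fromisoformat(timestamp).date()
        counts[(action, day)] += 1

    # Map daily frequency to a color intensity between 0 and 1 for a calendar cell
    max_count = max(counts.values())
    for (action, day), n in sorted(counts.items()):
        intensity = n / max_count
        print(f"{day} {action}: count={n}, color intensity={intensity:.2f}")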
However, lifelogging systems still have difficulty spreading in the consumer market, and the prospect of an application that all users could adopt to really record every aspect of a person's life, as the first pioneering lifelogging research projects hoped, seems quite far away. Although the miniaturization of sensors and chips, such as audio recorders and cameras, now makes it possible to integrate them into common smartphones, these applications seem destined to remain for some time the prerogative of academic research projects. This is partly due to the excessive involvement currently required of the user (through self-reporting or manual annotation of the recorded media) in order to record all the experiences of his life. Research projects also try to address another technological problem: how to cope with the difficulty of recovering significant data from the enormous amount of information that can be recorded in an extensive lifelogging scenario. For instance, Poyozo is an automatic journal able to generate summaries and statistical displays from heterogeneous activity data, integrating this information into a calendar and trying to create a meaningful life narrative for the user (Moore et al., 2010). An interesting research project of the University of Aizu tries to overcome the problem of information overload by storing only the significant highlights of a lifetime, without requiring active intervention by the user: the system automatically saves significant events for a group of people when their emotional arousal (detected by a heart rate monitor and compared through a peer-to-peer network) exceeds a certain threshold (Gyorbiro et al., 2010).
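A minimal sketch of this kind of threshold-based capture, assuming a stream of heart-rate samples and a relative arousal threshold (both the baseline model and the threshold value are illustrative, not the parameters used by Gyorbiro et al.):

    # Save a lifelog "highlight" whenever heart-rate-derived arousal exceeds a threshold.
    BASELINE_BPM = 70.0
    AROUSAL_THRESHOLD = 0.30   # 30% above the personal baseline (illustrative value)

    def arousal(bpm: float) -> float:
        """Relative deviation of the current heart rate from the baseline."""
        return (bpm - BASELINE_BPM) / BASELINE_BPM

    # Hypothetical stream of (seconds, beats per minute) samples
    stream = [(0, 72), (30, 75), (60, 95), (90, 110), (120, 78)]

    highlights = []
    for t, bpm in stream:
        if arousal(bpm) > AROUSAL_THRESHOLD:
            highlights.append((t, bpm))    # in a real system: save photo/audio/context

    print("saved highlights:", highlights)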
7.2.2. Human Side
Another level of complexity we must address concerns the complexity that drives behaviors (Guidano, 1987). We have to accept that human action is not simply driven by rationality but also by the countless variables that affect the final behavior. We all know that looking at the weight on the scale (awareness) is not enough to make us want to do more sport or eat less. We have to take human complexity into account. An attempt in this direction is made by second cognitivism, which introduces greater complexity compared with the first theoretical assumptions. This development suggests that the reading of “reality”, of the “world”, is in fact very personal and can lead to different meanings depending on who is “reading”. There is a shift from a realistic/objective paradigm to a subjective one. Human behavior thus becomes the result of an articulated and variously structured cognitive process of information processing.
The most recent results on the analysis of cognitive processes place these dynamics in the social and interpersonal contexts in which thought develops. This approach based on cognitivism, known as social cognitive theory, studies the interaction between cognition and social context. Great importance within this theoretical core is given to the reflections of Albert Bandura (1986) on cognitive-emotional processes, which he sees expressed through behaviors. There is essentially a complete re-evaluation of interpersonal and emotional components as drivers of human action.
Let's consider just one of the many examples of QS interpreted as “augmented awareness”. Asthmapolis (2012) links sensors attached to the inhalers used by patients with asthma to smartphones. These sensors gather data on where and when the inhalers are used, and recording this information helps patients identify the triggers that make their condition worse. The concept seems simple: more data lead to greater awareness and make anticipation possible. In fact, looking at this example, the impression is that the driver of those who decide to rely on Asthmapolis is more emotional, shaped by concern, by fear and by the attempt to gain control over it. One of the main strategies for mastering fear is to seek information in order to improve the situation, prevent damage to ourselves, or even prevent the stress of uncertainty. It is what psychology calls a coping strategy: looking frantically for information, for new data. Collecting personal information is an effective coping strategy that manages the emotional stress of the disease. The real driver of those who use this form of QS is not rational data collection in order to develop more knowledge; at least, not only that: there is also, and perhaps even more crucially, in the adoption of the application, the coping strategy of decreasing to some extent the fear produced by the disease condition. Fear, instead, is not what guides those who collect their own fitness data: in this case the motivational mechanism underlying the emotional matrix is linked rather to the positive emotions related to competitive motivational systems. In this sense we speak of taking human complexity into account, to really address the full spectrum of human behavior drivers.
8. FUTURE RESEARCH DIRECTIONS
Future research directions should start from what is stated in the preceding paragraph: a different management of technological and human complexity, and new ways to address this complexity, keeping in mind that the real and final purpose of QS is personal and social improvement.
8.1. Individual Impact
Let's consider this example. A QS supporter claims to have lost 20 kg of his 100 by writing either the word “lethargic” or “energised” on a flash card at 3 pm every day for 18 months, depending on how he felt: “I gradually noticed that my perception of some foods shifted from thinking they were delicious to starting to feel their heaviness and the effects they were going to have on me. The act of paying greater attention has an effect on your behaviour” (Guardian, 2011). What we see in this example is again not an example of augmented awareness; it is rather a gradual reorganization of the “Self” through focusing on the connection between behavior and mood, on the emotional component (“lethargic” or “energised”) linked to the behavior (food consumption) (Damasio, 1999). The effect of this focusing is well known as an instrument of change in psychology and therapy under different names, and has its theoretical bases in Cognitive Behavioral Therapy (CBT) and even more in the post-rationalist paradigm. The studies that helped to define the post-rationalist paradigm have pointed out that it is not possible to have a unique idea of an objective reality. Echoing comments made by Ricoeur (1992) and Morin (2003), reality is instead captured through a reading that is in large part subjective and in which the emotional and affective components are critical, even more than the rational ones. Reality is then reconstructed on the basis of the emotional and cognitive tools available to the person at a given point in his life cycle. The focus thus shifts from what is valid or common for all individuals to subjective experience.
As Guidano (1991) showed, the varied and changeable flow of incoming experience is compared with pre-existing mental configurations that act as frames of reference. Any knowledge of external reality, including data from QS, triggers subjective modes of experience, from which personal knowledge and a vision of the world are then extracted. There is thus a dialogic dimension of continuous mutual influence between the “Self” and the narration of incoming data, which continuously restructures that same “Self”. This means that the data I collect, and the way they are presented to me, will on the one hand be seen differently depending on my current emotional and cognitive configuration, but on the other hand will also contribute to defining it. So we are in a scenario where data really do have a role in building the “Self” and personal identity (Guidano & Liotti, 1987). This is especially true if we speak about data that reflect our mental activity in terms of thought, action and emotional aspects. There are already some examples in this regard. DreamBoard (2012) is a platform that collects dreams and renders a graphic story of dream activity, quantifying characters, mood, recurring figures, etc. Because the dream, as many studies have shown, is a mental activity among others (Fosshage, 1997), this kind of restitution contributes to developing new meanings, to some cognitive restructuring and to the actual construction of the “self”. This is greater self-knowledge, not in the trivial sense of knowing more “things” in a quantitative way, but rather knowing more, qualitatively, in the sense expressed by Gibran in the sentence quoted at the beginning of the chapter.
8.2. Social Impacts
On a social level, an example is the social dimension of the management of our data and the return of favorable options. It is possible in this case to extend the QS concept toward the Quantified Family, the Quantified City, the Quantified Country. Take, for example, our travel habits, consumption trends, and all the personal health conditions or needs that we express individually and that are today totally disconnected from those of other individuals. The aggregate use of these data for statistical analysis, in order to provide solutions based on those analyses, would be essential to optimize, in a social sense, many resources that are now wasted. Of course, this opens a hot topic for QS: privacy and ethical issues. In fact, the described scenario is already partly happening for our consumption behaviors: our whole “Consumption Self” is already recorded by third parties, not for social purposes but for profiling and customization intended to encourage further consumption. Much attention should therefore be paid to this type of use of information, even when it is done for social purposes. An example of this was a UCLA research project (2011) that built a stress app for young mothers, oriented toward personalized health care and based on the phone's GPS system. The purpose of the research was noble: to develop a pilot program based on Android smartphone technology to monitor, assess, and treat participants. The device was able (thanks to the on-board accelerometer) to track location and movement in detail. All these data clearly began to raise concerns, to the point of including the use of a “personal data vault”, a digital lock box run by a third-party intermediary. Beyond the privacy issue, some QS companies are already going in this “social” direction. For instance, the Zeo database now contains more than 400,000 nights of sleep. CureTogether is a site where thousands of patients can post and compare their own symptoms and treatments for more than 500 diseases. For example, thanks to the postings on CureTogether, the aggregation of the data showed that people who had experienced vertigo in association with migraine were four times more likely to have side effects when using Imitrex (a migraine medication) than those who had not had vertigo. Critics of this approach obviously point out that the many intervening variables make these data unreliable, reducing this exchange to the value of casual chat rather than a form of medical research. Nonetheless, the enormous amount of data collected retains great value.
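To make the kind of aggregate analysis described above concrete, the following is a minimal sketch in Python of how a relative risk can be computed from pooled self-reports. The counts are entirely hypothetical placeholders chosen for illustration; they are not CureTogether's actual figures.

# Minimal sketch of the aggregate analysis described above.
# The counts below are hypothetical placeholders, not CureTogether's actual data.

def relative_risk(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Ratio of the event rate in the exposed group to the rate in the unexposed group."""
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    return risk_exposed / risk_unexposed

# Hypothetical pooled self-reports: migraine patients taking Imitrex,
# split by whether they also reported vertigo.
with_vertigo_side_effects, with_vertigo_total = 80, 200        # 40% report side effects
without_vertigo_side_effects, without_vertigo_total = 60, 600  # 10% report side effects

rr = relative_risk(with_vertigo_side_effects, with_vertigo_total,
                   without_vertigo_side_effects, without_vertigo_total)
print(f"Relative risk of side effects (vertigo vs. no vertigo): {rr:.1f}x")
# With these placeholder numbers the ratio is 4.0, the kind of "four times
# more likely" signal the text attributes to the aggregated postings.

The sketch also makes the critics' point visible: nothing in such a ratio controls for the intervening variables mentioned above, which is why self-reported aggregates are suggestive rather than conclusive.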
9. CONCLUSION
In this chapter we set ourselves the aim of defining the growing QS movement in light of the two words that compose its name. We have seen how, under other names, a similar technological ambition has existed at least since the mid-1990s. Many experimental projects have sought to rely on technology to quantify the “Self”, or at least human behavior in the world. In recent years this original ambition has been pursued for various purposes, some of which are still unclear, unrealistic, or naïve. Although QS has many application domains (in paragraph 5 we outlined those related to healthcare, mood management, and fitness), some fundamental problems are still palpable and keep the QS movement in a phase of low maturity. The first is a technological problem, specifically a lack of maturity in the technologies for data collection, processing, and rendering. This is accompanied by a perhaps more fundamental problem of deficits, biases, and lack of integration concerning the human side of the QS idea. There are in fact theoretical aspects still to be understood about how to render meaningfully the large amounts of data that an integrated QS scenario produces (some theoretical elements are already available, but they are often not applied, in favor of the banality of a simple “wow effect”). But there are also, and above all, theoretical limits on the most fundamental goals of QS: change, improvement, transformation. It is precisely the difficulty of having an effective theory of human behaviour change, now fragmented among modules from behaviorism, cognitivism, and social psychology, that makes deployments of QS applications so unexpectedly ineffective, unsatisfactory, or irrelevant. The step we have tried to make in this chapter is to highlight aspects that could lead to a more robust approach in the QS area: primarily, modes of data representation that take into account the limitations and potential of human pre-attentional processing, identifying in studies on the storytelling approach a possible way to reach a new form of QS data visualization; secondly, a necessary management of complexity, both technological (toward “ultimate lifelogging” scenarios that integrate and create value across data from areas that today are addressed vertically) and theoretical, concerning the human side of the whole issue (thereby integrating social and emotional components into an all-embracing theory of human behavior change). We have gone a little further, stressing how future research directions could lead to significant impacts at both the individual and the social level: on the one hand, by configuring QS as a tool to support the constant personal construction of the individual “Self” (a kind of mirror in constant communication with our identity); on the other, by driving the aggregation of QS data to a level sufficient to extract knowledge and new social service propositions based on statistical analyses (for instance in the healthcare, mobility, and energy sectors). In conclusion, some considerations are necessary about the possible risks that QS scenarios can generate, especially considering their possible massive use. Privacy risks, combined with ethical issues related to social control scenarios, offer a glimpse of a dystopia in which, in the name of a “superior” aggregate knowledge, the individual is excluded from any kind of decision, even minimal, concerning him or her.
This work was previously published in Innovative Approaches of Data Visualization and Visual Analytics edited by Mao Lin Huang and Weidong Huang, pages 236-265, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Abigail, S., & Whittaker, S. (2010). Beyond total capture: A constructive critique of lifelogging. Communications of the ACM, 53(5), 70–77.
doi:10.1145/1735223.1735243
Allen, A. L. (2008). Dredging-up the past: Lifelogging, memory, and surveillance. The University of Chicago Law Review, 75(1), 47–74.
Bandura, A. (1986). Social foundations of thought and action . Englewood Cliffs, NJ: Prentice-Hall.
Barbaro, M. (2010). Active devices based on organic semiconductors for wearable applications. IEEE Transactions on Information Technology in
Biomedicine , 14(3), 758–766. doi:10.1109/TITB.2010.2044798
Bell, G., & Gemmell, J. (18 February 2007). A digital life. Scientific American Magazine. Retrieved from
http://www.scientificamerican.com/article.cfm?id=a-digital-life.
Bentley, F. (2012). Personal health mashups: Mining significant observation from wellbeing data and context. In Proceedings of CHI2012
Workshop on Personal Informatics in Practice: Improving Quality of Life Through Data. New York: ACM Press.
Bonato, P. (2012). A review of wearable sensors and systems with application in rehabilitation. Journal of Neuroengineering and
Rehabilitation , 9(21).
Burigat, S., & Chittaro, L. (2007). Geographic data visualization on mobile devices for user's navigation and decision support activities. In Belussi (Ed.), Spatial Data on the Web: Modeling and Management. Berlin: Springer. doi:10.1007/978-3-540-69878-4_12
Canada, C. (2006). La cueva grande: A 43-megapixel immersive system. In Proceedings of Virtual Reality Conference 2006. New Brunswick, NJ:
IEEE Press.
Card, S. K. (Ed.). (1999). Readings in information visualization: Using vision to think . San Francisco: Morgan Kaufmann.
Consolvo, S. (2008). Flowers or a robot army? Encouraging awareness & activity with personal, mobile displays. In Proceedings of the 10th International Conference on Ubiquitous Computing: UbiComp 08, 54-63. New York: ACM Press.
Corbishley, P., & Rodriguez-Villegas, E. (2008). Breathing detection: Towards a miniaturized, wearable, battery-operated monitoring
system. IEEE Transactions on Bio-Medical Engineering , 55(1), 196–204. doi:10.1109/TBME.2007.910679
Cruz-Neira, C. (1993). Surround-screen projection-based virtual reality: The design and implementation of the cave. In Proceedings of ACM SIGGRAPH. New York: ACM Press.
Damasio, A. R. (1999). The feeling of what happens: Body and emotion in the making of consciousness . London: Harcourt Inc.
Daytum . (2012). Retrieved from http://www.daytum.com/.
Dodge, M., & Kitchin, R. (2007). Outlines of a world coming into existence: Pervasive computing and the ethics of forgetting. Environment and Planning B: Planning & Design, 24, 431–445. doi:10.1068/b32041t
Few, S. (2009). Data visualization past, present and future. Innovation in Action Series. Armonk, NY: IBM.
Few, S. (2010). Data visualization for human perception. In Mads & Friis (Eds.), Encyclopedia of Human-Computer Interaction. Aarhus, Denmark: The Interaction Design Foundation. Retrieved from http://www.interaction-design.org/encyclopedia/data_visualization_for_human_perception.html.
Gemmell, J. (2006). MyLifeBits: A personal database for everything. Communications of the ACM , 49(1), 88–95. doi:10.1145/1107458.1107460
Gershon, N. D., & Page, W. (2001). What storytelling can do for information visualization. Communications of the ACM , 44(8), 31–37.
doi:10.1145/381641.381653
Greimas, A.-J. (1987). On meaning: Selected writings in semiotic theory. Minneapolis, MN: University of Minnesota Press.
Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer
Studies , 43(5-6), 907–928. doi:10.1006/ijhc.1995.1081
Guarino, N. (1998), Formal ontology in information systems. InProceedings of FOIS ‘98. Amsterdam, IOS Press.
Guidano, V. F., & Liotti, G. (1986). Cognitive processes and emotional disorders . New York: Guilford.
Heer, J. (2008). Graphical histories for visualization: Supporting analysis, communication, and evaluation. IEEE Transactions on Visualization
and Computer Graphics , 14(6), 1189–1196. doi:10.1109/TVCG.2008.137
Hodges, S. (2006). SenseCam: A retrospective memory aid. In Proceedings of UbiComp 2006, 81–90. Berlin: Springer.
ITC's Michigan Control Room. (2012). Retrieved from http://pro.sony.com/bbsccms/assets/files/micro/sxrd/articles/ITC_IGI_SXRD_CASE_STUDY.pdf.
Kay, M. (2011). Lulaby: Environmental sensing for sleep self-improvement. In Proceedings of CHI2011 Workshop on Personal Informatics. New
York: ACM Press.
Lewin, K. (1951). Field theory in social science; Selected theoretical papers . New York: Harper & Row.
Mann, S. (2004) Continuous lifelong capture of personal experiences with eyetap. In Proceedings of 1st ACM Workshop on Continuous Archival
and Retrieval of Personal Experiences. New York: ACM Press.
Mayer-Schönberger, V. (2009). Delete: The virtue of forgetting in the digital age . Princeton, NJ: Princeton University Press.
Miller, G. A. (1960). Plans and the structure of behavior . New York: Holt, Rhinehart, & Winston. doi:10.1037/10039-000
Moore, B. (2010). Assisted self reflection: Combining lifetracking, sensemaking, & personal information management. In Proceedings of CHI 2010 Workshop: Know Thyself: Monitoring and Reflecting on Facets of One's Life. New York: ACM Press.
O’Hara, K. (2006). Memories for life: A review of the science and technology. Journal of the Royal Society, Interface , 3, 351–365.
doi:10.1098/rsif.2006.0125
Pina, L. R. (2012). Fitbit+: A behavior-based intervention system to reduce sedentary behavior. In Popper & Eccles (Eds.), The Self and the
Brain. London: Springer International.
Propp, V. (1927). Morphology of the folktale (L. Scott, Trans.; 2nd ed.). Austin, TX: University of Texas Press.
Ramirez, E. R., & Hekler, E. (2012). Digital histories for future health. In Proceedings of CHI2012 Workshop on Personal Informatics in Practice: Improving Quality of Life Through Data. New York: ACM Press.
Rawassizadeh, R. (2012). UbiqLog: A generic mobile phone based life-log framework. Personal and Ubiquitous Computing, 2012 . London, UK:
Springer.
Schell, J. (2008). The art of game design: A book of lenses . Burlington, MA: Elsevier.
Segel, E., & Heer, J. (2010). Narrative visualization: Telling stories with data. IEEE Transactions on Visualization and Computer Graphics , 16(6),
1139–1148. doi:10.1109/TVCG.2010.179
Shaltis, P. A. (2006). Wearable, cuff-less PPG-based blood pressure monitor with novel height sensor. In Proceedings of IEEE Engineering in Medicine and Biology Society. New Brunswick, NJ: IEEE Press. doi:10.1109/IEMBS.2006.260027
Swan, M. (2009). Emerging patient-driven health care models: An examination of health social networks, consumer personalized medicine, and
quantified self-tracking. International Journal of Environmental Research and Public Health , 6(2), 492–525. doi:10.3390/ijerph6020492
Swan, M. (2011). Genomic selfhacking. Retrieved from http://quantifiedself.com/2011/11/melanie-swan-on-genomic-self-hacking/.
Thelen, S. (2010). Advanced visualization and interaction techniques for large high-resolution displays. In Middel, A. (Ed.), Visualization of Large and Unstructured Data Sets: Applications in Geospatial Planning, Modeling, and Engineering, 19, 73-81. Berlin: DFG.
Tufte, E. R. (1983). The visual display of quantitative information . Cheshire, CT: Graphics Press.
Whittaker, S. (2012). Socio-technical lifelogging: Deriving design principles for a future proof digital past. Human-Computer Interaction , 27, 37–
62.
Wojtkowski, W., & Wojtkowski, W. G. (2002). Storytelling: Its role in information visualization. In Proceedings of European Systems Science
Congress. New Brunswick, NJ: IEEE Press.
Yau, N., & Hansen, M. (2010). your.flowingdata: Personal data collection via Twitter. In Proceedings of CHI 2010 Workshop: Know Thyself: Monitoring and Reflecting on Facets of One's Life. New York: ACM Press.
KEY TERMS AND DEFINITIONS
Gestalt: It means “form”, and in Gestalt psychology it indicates that perception results from the interaction and global organization of various components, emphasizing that the whole is greater than the sum of its parts.
Massively Multiplayer Online Role-Playing Games: It is a class of role-playing video games in which a huge number of users play together in a shared virtual world.
Personal Informatics: It is another name for the Quantified Self, a class of tools that support people in collecting personal information for the purpose of self-monitoring.
Reinforcement: It is a term in the behaviorist paradigm for a process of strengthening the probability of a specific response.
CHAPTER 73
Generational Differences Relative to Data-Based Wisdom
Gino Maxi
Minot State University, USA
Deanna Klein
Minot State University, USA
ABSTRACT
The purpose of this chapter is to present research findings and address Generational Differences Relative to Data-Based Wisdom. Data-Based Wisdom is defined as the use of technology, leadership, and culture to create, transfer, and preserve the organizational knowledge embedded in its data, with a view to achieving the organizational vision. So what will comparing generational differences effectively do to help achieve organizational vision? If you don't know your history, you are doomed to repeat it; therefore, with the accumulation of ever-growing data, understanding the necessary steps to store them properly and the ability to retrieve them efficiently involve both explicit and tacit knowledge that lie outside the scope of the conventional multi-disciplined approach to achieving organizational objectives. With time, technology, leadership, and culture have transformed into more than tangible items, social leadership concepts, and learned behavioral patterns. These three ideas have evolved along with the technological advances infused into society as we know it today. Therefore, the value of, and emphasis on, developing and maintaining intricate and efficient knowledge management databases suitable for creating, transferring, and preserving the organizational knowledge embedded in data has never been more vital. Their importance will continue to grow as changes in technology, leadership concepts, and culture continue to inundate society.
INTRODUCTION
This chapter lists the different uses of Data-Based Wisdom and their purpose; the use of technology, leadership, and culture to create, transfer,
and preserve organizational knowledge embedded in its data. Because each use has such broad “organizational visions,” this chapter will focus on
the comparison of generational differences of each use and their purpose both in the past and in today’s society. Technology, leadership, and
culture are essential within the big data era. Technology plays a part in setting the framework where big data lies; technology not only sets the
framework for big data, but it also allows for the creation, transfer and preservation of such data due to its purpose within society. Leadership’s
role within the big data era is designating an individual whose sole purpose is to manage how organizational goals and visions are being met; and
when dealing with big data, approaching such complex situations in an organizationally-friendly method, to successfully implement an
information system where data is created, transferred, and preserved within an organization. Leadership styles are in abundance, but the style
used within an organization when dealing with big data is determined by the leader and what kind of environmental factors he/she is dealing
with i.e. high risk, high stress, Community of Practice, etc. In the theory of cultural impact within the big data era, culture ties in with big data by
explaining the changes in how big data is used and viewed by the culture of today’s generation, relative to the culture of past generations.
TECHNOLOGY: AND EVERYTHING THAT MAKES IT WORK
Approximately 23 years ago, the quantity and quality of technological devices infused into our society represented great advances for the generation of the 90s. When the World Wide Web (WWW), the Pentium processor, WebTV, and a whole variety of great inventions were introduced and revitalized, the underlying goal of ephemeralization was in mind.
According to the article “Ephemeralization” (Funch, 1995), the term ephemeralization refers to the concept of “doing more with less.” We can all agree that new advances in technology bring convenience and efficiency. Unfortunately, convenience and efficiency also create a sense of lackadaisical behavior by making things easier and more accessible. Technologies of the 90s were innovations that redefined the concept of communication. Imagine if Paul Revere had been able to phone the Colonial militia and inform them that the British forces were fast approaching. That is food for thought as we expand on the advances made by the technological innovations of the 90s.
The growth of the Internet is the common denominator in the equation of the exponential growth of technology. The Internet has minimized the distance between international communities; communication has been revolutionized by interconnecting nations and individuals through the vast expansion of this intricate web of copper lines and fiber tubes we know as the Internet. By comparison, the Internet began as a study completed by the US government in conjunction with several prestigious private companies and universities. By the 90s, its use had far exceeded its original intent, giving individuals access to a myriad of information at the click of a button. With time, the Internet's intent and use have evolved to where it is now the fountain of knowledge for organizations, individuals, and society as a whole. As stated before, ephemeralization is a concept based on “doing more with less,” and the Internet has become the archetype of this concept. From the e-commerce boom of the 90s to the social media hype of the 21st century, access to a combination of computer data including graphics, sounds, text, video, multimedia, and interactive content has made the accumulation of tangible items possible without having to set foot outside the comfort of your home into the reality of the world. Virtualization had never been more evident than when advances in the technology used to transport data from one node to another made the concept applicable and practical.
Exchange of information is the fundamental objective of the Internet. With the vast use of email in the early 90s, the exchange of information from one author to another, or to multiple recipients, was the epitome of the creation, transmission, and preservation of data. Dependence on brick-and-mortar mail delivery services was no longer absolutely necessary; the expeditious transfer of one's thoughts and ideas, or of an organization's knowledge embedded in its data, via the Internet revitalized the way individuals communicated. Email is a portal whose purpose is to allow authors to send electronic messages to one or more recipients. The caveat for some email systems in the 90s was that, in order to deliver the message, the recipient had to be online at the same time. Email was essential during the early stages of the Internet; the basic text we see today when writing a plain word-processing document parallels the format of email sent during its peak in the 90s.
One of the more noticeable technological innovations delivered via the Internet during the 90s was WebTV, which essentially allowed watching television programming online. Although data transmission speeds were still archaic, WebTV was pioneered to operate within those limitations, giving new meaning to Internet usage. Technologically laggard individuals may not have fathomed this new entry into the virtualization world, for the concept of being able to watch television programs offered by anything but a cable TV provider may have seemed like a Jetsons fanatic's visionary dream. Geared toward personal use, WebTV was positioned as a consumer device for Web-accessed television. Although pricey, consumers were willing to pay a hefty price for the WebTV set-top box, just as consumers excited for the roll-out of new products and services are always willing to do. According to the article A Websurfing Box for Peanuts (Wildstrom, 1996), WebTV set-top boxes ranged from $329 to $349 at launch, which definitely made an impact on the average household budget. The monthly service fee for WebTV was initially $19.95 for unlimited Web surfing and integrated email capabilities. WebTV became what we now know as ‘watching videos online;’ it was the pathway to countless technological advances that essentially allowed for the sharing of information across the Internet in a different format. Yet at its inception, WebTV did not offer as much content as you would find today on the Internet. Rather, WebTV was created for the purpose of airing a tiny selection of online soap-opera-type TV series, which did not last long. When it comes to accumulating data and embedding them in a vast knowledge management database, ready for the creation, retrieval, and addition of data within the database (usually via the Internet), therein lies one great generational change from the early 90s to the present.
According to the survey “Social Media User Demographics” performed by the Pew Research Center (2013), 90% of Internet users aged 18-29 use social media sites.
Figure 1. Pew research center’s internet project library survey
Email is a messaging utility that played a great part in the inception of the Internet. The Internet was based solely on the ability to connect individuals using a web of network devices, wires, and nodes. The email application was the instrument (at the time) that allowed messages sent via copper wires to be created, received, and read in an adequate format that was legible and easily understood by the end user.
With time, the focus shifted from protocols developed to ensure the proper transmission of messages over the Internet to programming environments where people are able to communicate with multiple individuals, at the same time if needed, using graphical interfaces and tools to get their points across. Social media sites such as Facebook and YouTube differ in their intended purposes. Although one can argue that they ultimately accomplish the same goal [to bring everyone together via an engine that provides users with tools to communicate with one another, no matter where you are, as long as you have an Internet connection], Facebook falls within the social networking category of social media, whereas YouTube is in the social photo and video sharing category. There is a huge jump from sending a message in plain text format to posting pictures and videos in a framework where multitudes of individuals are able to interact, converse, view, and share the objects posted within the social media site.
Relative to change, email is to Facebook as WebTV is to Netflix. WebTV went from being a delicacy offered to a target group of consumers, those able to easily afford the pricey service and the pricey device it came with, to the everyday kind of service, accessible to consumers who can afford eight dollars a month for unlimited access to movies and shows at the click of a button. An American provider of on-demand Internet streaming media, Netflix is offered within North America, South America, and many countries across the globe. Compared to the early days of online TV series streaming, when WebTV served only the United States, Netflix has been able to expand across the globe as Internet protocols evolved to allow vast amounts of data to be transmitted simultaneously to remote sites that may be located across, well… the globe. Once again, this illuminates the crucial generational difference in technology: the ability to access vast amounts of data, stored in a large knowledge management database, at the tips of one's fingers, any time and anywhere in the world. The fact is, the brick-and-mortar concept is slowly being stripped of its popularity and is giving way to digitalization and virtualization. With social uncertainty comes withdrawal behavior, which impedes knowledge creation and sharing. Rapid change indisputably brings a loss of predictability, whereas it was very predictable that someone who decided to send an email in the early 90s had no intention other than to get his or her message across to another person. Today, with the exponential growth and change in technological innovations, a fear factor is created, both in the sense of unpredictability and of too much information.
As already stated, email is to social media as WebTV is to Netflix. However, there are huge differences behind this analogy, and the generational differences are implied within them; but where do we as a society go from here? Where do we go from having vast amounts of tacit knowledge to exponential amounts of explicit knowledge ready to be created, retrieved, and deleted at the click of a button? How do we as a society manage the substantial amount of information that already exists and is being created every second? These questions have become reasons to drive change: change in technological advances, their purpose, and their use. Such change has driven the exponential growth of technological advances in network devices, applications, and knowledge management bases. Applications such as Facebook, Twitter, and YouTube generate substantial amounts of data every day. According to the article “Netflix Remains King of Bandwidth Usage, While YouTube Declines,” the video-on-demand service accounted for 34.2% of all downstream usage during primetime hours, up from 31.6% in the second half of 2013 (Spangler, 2014). With such an increase in Internet usage comes the creation, management, and acquisition of data. This is where new innovations in databases, and changes to privacy and security settings, play a big part in the technological advances of the big data era. Individuals view the extent, rate, and desirability of change, and how it affects us, differently. So with increased uncertainty comes incomplete knowledge; and with incomplete knowledge, innovations such as knowledge management systems were, and still are, necessary to allow for the storage and retrieval of data by organizations and individuals. This is when leadership becomes important.
LEADERSHIP: THE STYLES AND ROLES THEY PLAY
As a noun, leadership has several definitions, which ultimately define it as the ability, position, and function of a leader. Leadership is a position of high value because of the responsibilities attached to it. Leadership holds value because the individual who acquired such a role was looked up to, admired, and held responsible by others who pledged loyalty to follow the steps and requests of their leader. This was more apparent during the early stages of human society, in what are referred to as old-fashioned views of leaders, meaning that leaders were chosen to be leaders from an early age. Even if you were not a good leader, there was not much that could be done for you to become a better one. If we think back to the days of Egyptian rule or the Ottoman Empire, most can agree that the leadership style resembled what we know today as coercive, where the leader demands immediate compliance from the individuals who submit to their leadership role. Coercive leadership styles were not limited to antiquity; they were, and are, also used within the last century and, more than likely, in some respects today. The coercive leadership style parallels the emotional intelligence of self-control and the drive to achieve certain goals, whether for the good of others or for oneself. Dictatorship and royal rule were the governing norm in the past, where demands were made by the one in charge and others had to obey. Failure to obey could result in negative, sometimes arduous, consequences.
According to the article “6 Leadership Styles, and When You Should Use Them,” coercive leadership must be used with great caution and discretion (Benincasa, 2012). Losing the faith and loyalty of your adherents and supporters is a failure of leadership and should be taken personally, with great disappointment. In the early days of empires and dictators, however, how followers viewed their leader's skills was not taken into consideration, owing to the leader's power and position and the lack of respect given to those under him or her. Today, though coercive leadership styles are harder to find, they are still employed when a group or organization faces a dilemma or impasse, where a leader has no choice but to step up and grab the reins of the situation. Orders may be given, but with great discernment regarding the individuals the leader is working with and the severity of the situation at hand. What situations may be so drastic as to call for a leader to step up and be assertive using the coercive leadership style? Within an organization, if company standard operating procedures or policies are not being met or are being ignored by workers, the leader or upper management will need to step in and use coercive leadership skills to take control of the situation and bring the company to order. For instance, an entrusted employee breaking the company ethics code by acquiring sensitive customer personal information from the company's internal database is grounds for coercive leadership intervention. The coercive leadership style has its pros and cons: the pros stem from the leader having firm control of the situation at hand, while the cons arise when followers, or those who submit to the leader's authority, take less responsibility for their actions as long as they are following orders; coercive leadership also decreases flexibility, with little reward reciprocated for good work.
The coercive leadership style still faintly mirrors the basic concepts of dictatorship, in which a governing body is ruled by either an individual or a small group of people. Such positions were often assigned to certain individuals in times of uncertainty or at great crossroads, much as a leader within an organization has to step in to restore order during a high-stakes situation that may put the company in jeopardy of losing profits or of infringing on company policy and values.
What did traditional leadership styles focus on within an organization in the past, and what did they accomplish? Let's use the coercive leadership style as the previous or traditional style, and agile management as the leadership style used today, for comparison purposes. Organizations are created for two reasons: to sell a product or to provide a service. The drivers of a company's vision and goals vary between companies, yet the ultimate goal is to provide something to a consumer who is unable to provide it for themselves. The process of accomplishing a company's vision and goals depends strictly on management's capacity and competency to lead the team of workers they oversee, to work towards increasing shareholder wealth and expanding the company's presence within the economy. The leader's knowledge of the goods and services that their company, or the departments they oversee, provide is essential in accomplishing goals and meeting outcomes based on reason and incentive. The use of the traditional coercive leadership style will get the job done. Yes, concerns have been expressed about the use of coercive leadership because of the lack of input from team players or the individuals receiving orders, and because of the “my way or the highway” character of this style. But when a company needs to accomplish a vital goal, such as correcting a major outage within its database, which in turn is essential for growth and accomplishing its vision, a leader who takes charge and invites no contrary opinions may be necessary. It is far from the most attractive of the many different types of leadership styles, but we must remember, it was the way things were done in the past. With time, everything gets better, whether through the realization that input from individuals under the rule of a coercive leader is vital to improving workplace morale, thereby creating a new type of leadership style, or through the leader shifting to a more democratic or agile style once an issue is resolved and the goal is accomplished. As the shape of technology shifted and advanced within society, due to time and knowledge gained, leaders and those who were being led also changed the way they viewed teamwork.
One cannot expect that a particular way of handling things will always be the best. With time, progression, and knowledge gained, what was once viewed as acceptable might be tweaked to conform to changes and the norms of today's society. A coercive leader who is able to articulate the goal of implementing a massive knowledge management base, and to accomplish it by persuading his or her employees to focus solely on the clear-cut mission at hand without forfeiting the faith and trust of the workers, understands the proper way to use such a leadership style. Traditionally, a leader is considered a person of importance, more important than, say, their workers. That traditional view of how leaders and workers are to coexist within a government or organization is a dangerous point of view. To assume that holding a higher position makes an individual more important as a leader is to also assume that, inevitably, the employees, or the ones being led, are insignificant and hold no true value within the government or organization. This can lead to a loss of confidence and loyalty, leaving the organizational body on the verge of disaster, which in turn reflects a failure to lead and/or a failure of leadership. Arguments have been made that a leader is there to help workers and followers, who have a certain goal to exceed or a faith to uphold, be as important as they can be in maintaining the operations of a company or the purpose of the governing body. That approach is more effective for accomplishing a goal than a leader simply taking the coercive direction. In all, the coercive leadership style, although still used in critical moments or emergency situations such as a loss of network availability, is more of a traditional approach and is one of the many traditional leadership styles in use. The coercive style's effectiveness is questioned when it comes to the effects it can cause by alienating individuals and diminishing agility, intuitiveness, and inventiveness. These are attributes that are essential for increasing productivity in order to accomplish the company's or organization's vision and goals, and for the camaraderie that grows the relationship between leaders and the individuals receiving guidance and instruction from them.
If we fast forward to leadership styles often used today, they tend more towards the inclusion of all parties within the company, governing body, or organization. For example, the democratic leadership style includes in decisions everyone who is part of the association the leader belongs to, enabling the leader to make decisions based on a winning consensus. It takes a strong leader to use democratic leadership, because it requires a large amount of trust in your employees, constituents, or whatever association the followers have within the body. Trust is a major factor and plays a major role within a democratic body, because the leader is counting on his or her followers' knowledge, discretion, discernment, and intuition to make sometimes crucial decisions. This can be hair-raising, because if the outcome of the decision is bad, the responsibility falls on the leader: while the leader did listen to the followers, he or she neglected his or her own intuitive leadership skills in discerning whether it was a wise choice or not. The democratic leadership style overall is a healthy leadership theory to adopt when leading large groups of individuals. It is a currently popular choice used in the United States and other countries to govern the multitudes of people within various bodies. With the “what do you think” approach of democracy, leaders essentially persuade followers or team members to buy into what they are trying to get accomplished.
One facet of the democratic leadership style within a company, when dealing with big data as a factor, is called agile management. According to the paper “Agile Project Management” (Maxi, 2011), agile management is a popular management style used today within organizations where flexibility is the main focus. Agile management is an iterative method of determining requirements for development projects in a highly flexible and interactive manner. It is an organizational leadership style that conforms to the fast rate of change within the information technology realm; technology is changing at a rate unimagined and unforeseen by experts and analysts. Management theories based on the new science of complexity, which exploit the understanding of autonomous human behavior gained from the study of living systems in nature, brought complex adaptive systems into management assumptions and practices. So, within a community of practice, agile management is the best form of leadership when managers or leaders in charge of a certain organizational project need a set of plain guiding practices that provide a framework in which to manage, instead of a set of inflexible instructions. With these guidelines in place, a manager can learn to become an adaptive leader, which includes stating the direction in which he or she feels the organizational executives want the information system ultimately to go. This involves setting the rules, expectations, and limitations of the information system, and encouraging feedback, adaptation, and collaboration for each project.
Notice the mention of the community of practice (CoP); what is a community of practice? For the purposes of this subject, a community of practice means working within an environment or organization that uses agile management to accomplish company goals and vision. A community of practice may be used to collect knowledge, whether tacit or explicit, from the various individuals participating in the practice. According to the website Business Dictionary (2014), a community of practice is an organized network of individuals with diverse skills and experience in an area of practice or profession. Its use is best suited for members with the desire to help other individuals by sharing information, and with the need to accumulate new knowledge themselves by learning from others. The agile management leadership style is well suited to coordinating a community of practice, for the flexibility of knowledge acquisition, idea sharing, and input allows the leader to take a step back and observe rather than step in and lead by giving orders. The leader may now consider him or herself part of the group, rather than an outlier who feels the need to bark orders. Input within a community of practice is essential for projects, especially projects that deal with big data, to advance through their life cycle.
Big data indicates that the company may be looking to manage a vast amount of knowledge, whether tacit or explicit. Therefore, requirements such as how data are collected, stored, managed, and acquired, and the many security and privacy concerns looming around such data collection and storage, must be brought to the table and discussed within the CoP, whose members have various experiences and backgrounds, during the systems analysis and systems design stages. With the leader becoming adaptive and interactive within the community, rather than assertive and reactive, the leader then begins to state the direction in which he or she wants the goal and vision to lead. If the company or organization decides to adopt the complex adaptive systems approach derived from the flexible or agile project management methodology, it gains an intrinsic ability to cope with change and to look at the organization as a flowing adaptive system made up of intelligent human beings. Many established leadership styles still apply to agile development projects; with some adaptation and a strong dose of leadership, incorporating a certain leadership style within a community of practice should not be difficult. With the ability to inform workers and have them fully understand the end goal, a talented leader should be able to convince his or her followers, or in this case workers, to get on board with what the project charter has stated. The best leaders involved in agile project management environments are not just organizers. Good leaders combine business vision, communication skills, soft management skills, and technical savvy, along with the ability to plan, coordinate, and execute.
Using an agile management environment, for instance, allows software programmers to create and monitor their own iteration plans as well as collaborate with customers and focus on customer requirements. The customer provides answers to inquiries and suggests requirements for a certain product; the programmers then divide the tasks among themselves as they work and measure progress for every iteration, adjusting plans with the customer as necessary. This essentially frees the leader from the toil of being a taskmaster, enabling the leader to focus on, well, being a leader. A leader who adheres to the company's goals and vision, a leader who inspires the team to focus and continue working hard on their responsibilities, a leader who promotes teamwork and collaboration, a leader who expedites the project and removes the obstacles impeding the team's progress, is a successful leader. Meanwhile, this style of leadership still requires the leader to define and initiate the project, create a project charter, execute the plan, create a work breakdown structure, and control the outcome, while dealing with overriding diversions such as cost estimation, fast tracking, or crashing.
Coercive leadership styles are becoming less used within organizations because the realization of the importance of feedback from one's workers has transformed the working environment across the globe. The paradigm shift towards more democratic leadership styles evolved over time and has changed things for the better. We must recognize that the extinction of the coercive leadership style is not plausible, for it still has its place and importance within a governing body or organization; it is effective when necessary, yet widely unpopular. The change away from leadership styles so widely used during their respective periods of popularity throughout human civilization shows the difference in today's society and the cultures that cultivate themselves within it.
CULTURE: THEN AND NOW
Today, the word culture carries two meanings. According to the article “What is Culture?” it is the evolved human capacity to classify and represent experiences with symbols, and to act imaginatively and creatively (Kohls, 2013). Culture is also the distinct ways that people, who live differently, classify and represent their experiences and act creatively.
Stemming from the term used by the ancient Roman orator Cicero (Online Etymology Dictionary, n.d.), culture initially meant the cultivation of the soul or mind. In the 18th century, German thinkers who were, on various levels, developing Rousseau's criticism of “modern liberalism and enlightenment” helped shape the meaning of culture into what we know today: the arts and other manifestations of human intellectual achievement regarded collectively. Culture encompasses the beliefs, customs, arts, etc. of a particular society, group, place, or time. Culture is a part of life as life is a part of us; culture is more an idea, a noun, that explains the unseen and the seen and helps with speculation about the future.
As much as culture is a part of life, it goes without saying that culture is part of generational change. Social generations are groups of individuals who existed during a certain period of time, practiced similar cultural practices, and shared similar experiences. As the terms progressed over time, a new idea of culture and generations sprouted; the terms began describing a society separated into categories of individuals based on age. This stemmed from the processes of modernization, industrialization, or westernization. Each generation presents its own set of values, beliefs, life experiences, and attitudes. This can also present issues within society, for the way individuals understand others' ideas, thoughts, and data, with age and generational differences as a factor, can at times differ significantly within an organization. Generational and cultural differences pose obstacles for companies; they create hiring issues and a lack of communication. Not only do generational differences affect the cultural environment within an organization, but they also affect the efficiency of implementing new technology such as devices, operating systems, or information systems.
According to the website Generation Culture (2014), there are four different generational types: Mature, Baby Boomer, Generation X, and Nexter. For the purpose of this chapter, which discusses generational differences in technology, leadership, and culture, we will discuss the differences between Generation X and the Nexters.
Generation X is known for coming of age during the 80s and 90s, for valuing education, and for loyalty to an individual rather than to a company. Its members were also known to give up a job without having a contingency plan or another job lined up. Generation X grew up during the Watergate scandal, the AIDS epidemic, the crack cocaine epidemic, and of course the rapid advancement of the Internet. Growing up in this time period meant exploring new possibilities, with the average individual becoming more involved: more involved in, but not limited to, technology, big data, and politics. As mentioned above, the Internet came to life in the early 90s, allowing individuals to connect with one another without distance being a factor. With the 90s primarily known as the virtual age, when e-commerce and data transmission spiked higher than ever before, users were able to communicate via email, spend money, and conduct business online. Life got a little easier for us in the 90s; technology was booming, confidence within the nation was at an all-time high with unemployment at around 4.2%, and the high school graduation rate was astounding compared to the 60s (kclibrary.lonestar.edu, n.d.).
The implementation of the Elementary and Secondary Education Act (the No Child Left Behind Act) helped students with limited proficiency in English, math, and science by raising the standards expected of future generations as they graduated and headed into the workforce. This went along with political moves by the country to act as the “world policeman,” as the US intervened in Kuwait and Somalia and sent troops to Haiti, Bosnia, and Yugoslavia, to name a few. The U.S. showed strength in becoming the arbitrator, enforcer, and peacekeeper of the world, a standard-setter for the knowledge children obtain in public schools, and a leader in technological advances. With its ups and downs, individuals growing up in the 80s and 90s faced situations that played a big part in shaping the future of the Nexters.
Needless to say, individuals who delved into big data and took advantage of the era of the Internet boom had the opportunity to watch their capital gains grow before their eyes, and also had the opportunity to control their gains and losses at the click of a button. The Internet gave individuals the ability to buy and trade stocks and securities, view and manage bank accounts, transfer money electronically, and communicate with financial advisors online. The new and upcoming generation was full of excitement, hope, diversity, and chance.
Fast forward to the present: the generation known by many names, such as Millennials, Generation Y, the Net generation, or, as referred to here, the ‘Nexters’, is an interesting bunch. Known for their assimilation of computers, technology, big data, and communication, this generation is expected to make great impacts on business, culture, social status, monetary status, and consumption. The Nexters grew up while technology advanced at exponential rates, faster than previously hypothesized, and therefore information was always a click away. We can expect Nexters to be inclined to question everything, while taking little of what is told or presented to them at face value. Having access to tools like Google has changed the way this generation conceives of what is presented to it, thereby changing the culture that surrounds us today. Today's culture is based on big data and on accessing information at a moment's notice; access to devices that browse the Internet has never been easier or more efficient. This fosters a culture that delves into selfishness: selfishness because individuals in today's culture come to rely on themselves or on the information spewed by their mobile devices. Some readers may ask, is that bad? Not that it is bad, but it marks a transition from how individuals once communicated with one another and were sociable in past generations to choosing to be self-sufficient, and often lacking interpersonal skills, today. Something very noticeable about today's culture is the Nexters' acceptance of change. Much as leadership styles moved from a more dictatorial style to a more democratic one, culture has transitioned through the generations from a more conservative, patient, disciplined, and conformist type to a more materialistic, liberal, impatient, and cynical one. Nexters are a group willing to compromise the privacy and security of their personal information for the ability to network with peers, acquire data at a moment's notice, and market products and/or themselves.
There are many reasons why the characteristics and behaviors of a culture change across generations. One reason is the influx of data; past cultures struggled when high volumes of data creation, entry, and retrieval became too much to handle. Explanations were formed as to why such vast data collection was an issue, and the solutions derived by later generations in dealing with the situation changed certain characteristics and behaviors of the culture within those generations. For instance, Netflix was conceived because a customer was unhappy with the late fees charged when borrowing movies from brick-and-mortar movie rental companies; the way that generation approached and solved such situations was based on its own understanding of how things in life, or in the corporate world, functioned. By understanding what was known at the time (tacit knowledge), generations would prepare for and resolve issues; cultural differences all point back to generational change and the idea of ephemeralization, whether in technology or in the way we interact with one another.
Big data plays a big role in today's culture. As mentioned, the ability to access information whenever and wherever has never been greater. This allows the culture of today's generation to be more achievement oriented. Mobile devices have opened doors for new ideas and methods of showcasing talent. For instance, social media apps such as Vine, Instagram, and Snapchat, to name a few, have turned the tables on the bottom line for entertainment and communication. Promoting and showcasing oneself is proving to be the next big thing within our culture. Where past generations were more inclined to privacy and conservativeness, people in today's culture seek attention and approval from their peers. This is not necessarily a bad thing, for as much as the culture of social media tends to give today's generation a sense of limitlessness and confidence, it brings an influx of potential leaders who are well aware of the world and what is going on in it, whether political, social, or technical. It also brings an influx of innovators who use social media, new technology, and so on to their advantage by understanding the importance of how to store and maintain the large amounts of data that reside behind knowledge management systems and tools. Surely the interrelation of technology, leadership, and culture creates the conditions for exploring different ideas and facets of social technology and of technology used for business advancement.
To access the information readily available to today’s generation, knowledge management is essential for housing big data, allowing one to acquire, transfer, and manage data efficiently, securely, and effortlessly. The knowledge base improves the creation, housing, sharing, and acquisition of knowledge, fostering a culture in which explicit knowledge competes well with tacit knowledge. Earlier, data from the Pew Research Center showed that 90% of individuals aged 18 to 29 use the Internet. Such figures underscore the importance of innovation in how organizations acquire, sort, and house information in databases built to handle large amounts of traffic.
Take, for instance, a database that houses information for an organization in the retail business. Once the company is operating, its main goal is to build its brand, make it unique, and appeal to customers so that they consistently come back to purchase and/or use its products and services. With an influx of customers returning to do business with the company, it would be wise to create a management system for those customers in order to store crucial information about each individual, along with data the company can use for purposes such as advertising. This allows the organization to concentrate its resources on the best opportunities, with the goals of increasing sales and achieving a sustainable competitive advantage. According to the paper “Knowledge Management” (Maxi, 2011b), a Customer Relationship Management (CRM) system grants the company the opportunity to create and share knowledge: employees can enter customer or sales information for later use by co-workers, the company, and often the customers themselves.
Essentially, complete visibility across the entire organization is established, a sense of its cultural habits and norms is realized, and a means of accessing knowledge becomes available to anyone who can utilize it. CRMs are not limited to nodes within a brick-and-mortar company; they are also available via portable devices, creating limitless access to information anywhere in the world as long as there is an Internet connection.
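As a minimal illustration of the kind of record such a CRM might keep, the following sketch stores basic customer and purchase information in memory so it can later be queried for marketing follow-up. The field names and the CustomerStore class are hypothetical, chosen only for this example; a production CRM would sit on a database with access controls and privacy safeguards.

# Hypothetical sketch of a minimal CRM-style customer store (illustration only).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Customer:
    # Hypothetical minimal schema for a retail customer record.
    customer_id: str
    name: str
    email: str
    purchases: List[str] = field(default_factory=list)  # product identifiers
    marketing_opt_in: bool = False                       # consent to be contacted

class CustomerStore:
    """In-memory stand-in for the shared customer knowledge base described above."""
    def __init__(self) -> None:
        self._customers: Dict[str, Customer] = {}

    def add(self, customer: Customer) -> None:
        self._customers[customer.customer_id] = customer

    def record_purchase(self, customer_id: str, product_id: str) -> None:
        self._customers[customer_id].purchases.append(product_id)

    def advertising_targets(self) -> List[Customer]:
        # Only customers who gave consent are candidates for advertising.
        return [c for c in self._customers.values() if c.marketing_opt_in]

store = CustomerStore()
store.add(Customer("C001", "A. Buyer", "buyer@example.com", marketing_opt_in=True))
store.record_purchase("C001", "SKU-42")
print([c.name for c in store.advertising_targets()])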
The culture of accessing information now, wherever one happens to be, also comes with its cons; many business failures occur because of poor management. Poor management does not always mean that managers are unqualified or inexperienced, or even that they are not good at managing their staff; it can mean that they have done an insufficient job of managing the relationship between employees, whether managers or sales representatives, and customers. A manager who deals directly with a customer can likewise manage that relationship poorly. In today’s culture, when a business failure occurs in a sale-to-customer situation, the customer relationship has generally faltered somewhere along the line. With the abundance of technology available to today’s generation, messages can be lost in the high volume of communication or obscured by the massive amounts of data being produced. Mobile devices were supposed to make communication easier by allowing one to communicate in multiple ways on a single device. In a culture of accessing information anywhere and anytime, however, great leaders who can analyze, design, and implement secure, non-redundant, and well-structured knowledge management systems are in high demand in the big data era. Experts vouch for knowledge management databases that house vast amounts of information, whether personal or public, because such systems are designed to reduce clutter while making the acquisition, transfer, and management of pertinent data easily accessible to users who need reference material at a moment’s notice, wherever there is an Internet connection.
FUTURE RESEARCH DIRECTIONS
Generational differences vary vastly in topic, whether technology, culture, leadership, or something else. Ultimately, they bear on the necessities, styles, and theories we depend on to operate daily and to solve issues to the best of our knowledge. As a society, we share dominant cultural expectations because we also share the same geographical and social territory, and we benefit tremendously and collectively in ways that would be impossible on an individual basis. Over time we have collectively advanced our use of technology and changed the way we lead, and these changes have naturally and unintentionally changed today’s culture. Culture is based on how individuals within our society approach various facets of life within the scope of the understanding they have gained through experience. Advances in technology have created the drive to make life easier and more efficient, and the shift from a more dictatorial leadership style to a more democratic one has allowed different opinions and expertise to be included within an organization’s CoP in regard to big data. Likewise, the transition from a more conservative culture to one open to change has produced a culture of new trends and ideas for using technology that allows for the creation and retrieval of data. One cannot conclude that the changes within these facets of life are detrimental to the generations to come, for we are living within the scope of our own understanding. What we do understand today is that with time come change and more data. According to the article “Privacy and Big Data,” big data analysis often benefits the organizations that collect and harness the data (Polonetsky & Tene, 2013). The emergence, expansion, and widespread use of innovative products and services at decreasing marginal costs have therefore revolutionized global economies and societal structures, facilitating [leadership] access to technology and knowledge [technology] and fomenting social change [culture]. Future research opportunities on generational differences relative to data-based wisdom are inevitable, for change and the acquisition of data are bound to continue, and contrasting how different generations have handled change is important for moving forward.
CONCLUSION
In regard to big data, as the saying goes, “if you don’t know your history, you are doomed to repeat it.” With the accumulation of ever-growing data, understanding the steps needed to store those data properly and the ability to retrieve them securely and efficiently constitute both explicit and tacit knowledge that lie outside the scope of the conventional multidisciplinary approach to achieving organizational objectives. Over time, technology, leadership, and culture have become more than tangible items, social leadership concepts, and learned behavioral patterns; all three have evolved along with the technological advances infused into society as we know it today. The value of developing and maintaining intricate and efficient knowledge management databases suited to creating, transferring, and preserving the organizational knowledge embedded in data has therefore never been more vital, and its importance will continue to grow as changes in technology, leadership concepts, and culture continue to inundate our society.
This work was previously published in Strategic Data-Based Wisdom in the Big Data Era edited by John Girard, Deanna Klein, and Kristi Berg, pages 126-140, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Benincasa, R. (2012). 6 leadership styles, and when you should use them. Fast Company. Retrieved January 5, 2014, from
http://www.fastcompany.com/1838481/6-leadership-styles-and-when-you-should-use-them
Business Dictionary. (2014). Community of practice. Business Dictionary. Retrieved January 24, 2014, from
http://www.businessdictionary.com/definition/community-of-practice.html
Generation Culture. (2014). Generation X, and Nexter. Generation Culture. Retrieved February 3, 2014, from http://www.generationmodel.eu/
Kohls, R. L. (2013). What is culture? Systran Publications. Retrieved January 31, 2014, from http://www.bodylanguagecards.com/culture/31-
what-is-culture
Maxi, G. (2011a, September). Agile project management. Paper presented in Project Management class, Minot State University, Minot, ND.
Maxi, G. (2011b, October). Knowledge management. Paper presented in Knowledge Management class, Minot State University, Minot, ND.
Online Etymology Dictionary. (2014). Culture. Online Etymology Dictionary. Retrieved January 31, 2014, from
http://www.etymonline.com/index.php?term=culture
Pew Research. (2014). Internet user demographics. Pew Research Internet Project. Retrieved January 2, 2014, from
http://www.pewInternet.org/data-trend/Internet-use/latest-stats/
Polonetsky, J., & Tene, O. (2013). Privacy and big data: Making ends meet. Stanford Law Review. Retrieved July 3, 2014, from
http://www.stanfordlawreview.org/online/privacy-and-big-data/privacy-and-big-data
Spangler, T. (2014, May). Netflix remains king of bandwidth usage, while YouTube declines. Variety. Retrieved June 20, 2014, from
http://variety.com/2014/digital/news/netflix-youtube-bandwidth-usage-1201179643/
Wildstrom, S. H. (1996). A web-surfing box for peanuts. Business Week. Retrieved November 25, 2013, from
http://www.businessweek.com/1996/38/b349363.htm
KEY TERMS AND DEFINITIONS
Community of Practice (CoP): An organized network of individuals with diverse skills and experience in an area of practice or profession.
Culture: A stage or period of time that illustrates the way individuals interacted with one another and solved problems.
Customer Relationship Management (CRM): A system used by companies to manage data driven by consumer transactions within the
organization.
Technology: An instrument used by an individual to make things more convenient and efficient.
ABSTRACT
The goal of this chapter is to explore the practice of big data sharing among academics and issues related to this sharing. The first part of the
chapter reviews literature on big data sharing practices using current technology. The second part presents case studies on disciplinary data
repositories in terms of their requirements and policies. It describes and compares such requirements and policies at disciplinary repositories in
three areas: Dryad for life science, the Inter-university Consortium for Political and Social Research (ICPSR) for social science, and the National
Oceanographic Data Center (NODC) for physical science.
INTRODUCTION
The September 2009 issue of Nature included an interesting special section on data sharing. An opinion article in the section discussed the
Toronto International Data Release Workshop, where attendees “[recommended] extending the practice to other biological data sets” (Birney et
al., 2009, p. 168) and developing a set of suggested best practices for funding agencies, scientists, and journal editors. The February 2011 issue
of Science compiled several interesting articles to provide a broad look at the challenges and opportunities posed by the data deluge in various
areas of research, including neuroscience, ecology, health, and social science, where there is a demand for the acquisition, integration, and
exchange of vast amounts of research data.
The term big data is a current buzzword. It is a loosely defined term to describe massive and complex data sets largely generated from recent and
unprecedented advancements in data recording and storage technology (Diebold, 2003). Explosive growth means that revolutionary measures
are needed for data management, analysis, and accessibility. Along with this growth, the emergence of a new “fourth paradigm” (Gray, 2009) for
scientific research, where “all of the science literature is online, all of the science data is online, and they interoperate with each other” (Howe et
al., 2008, p. 47), has created many opportunities. Therefore, the activity of organizing, representing, and making data accessible to both humans
and computers has become an essential part of research and discovery.
Given the significance of this context, data sharing has become a hot topic in the scientific community. Data is a classic example of a public good
in that shared data do not diminish in value. In particular, scientific data have long underpinned the cycle of discovery and are the dominant
vehicles by which scientists earn credit for their work. So shared data have served as a benchmark that allows others to study and refine methods
of analysis, and once collected, they can be creatively repurposed indefinitely by many hands and in many ways (Vision, 2010). Sharing data not
only reinforces open scientific inquiry but also promotes new research and expedites further discovery (Fienberg, 1994). As science has become
more data intensive and collaborative, data sharing has become more important.
Promoting the effective sharing of data is an increasing part of national and international scientific discourse and essential to the future of
science (National Science and Technology Council, 2009). Today, many U.S. government agencies recognize that scientific, biomedical, and
engineering research communities are undergoing a profound transformation in regard to access to and reuse of large-scale and diverse data sets;
as such, these agencies have developed policies that mandate and/or encourage data sharing. For instance, the National Science Foundation
(NSF) expects grantees to share their primary data, samples, physical collections, and other supporting materials created or gathered in the
course of work under the grant.1 The National Institutes of Health (NIH) has had a data-sharing policy since 2003; the policy states that any
investigator submitting a grant application seeking direct costs of $500,000 or more in any single year is expected to include a plan to address
data sharing in the application or state why data sharing is not possible.2
To support these needs, infrastructure is being built to store and share data for researchers as well as educators and the general public. In 2008,
the NSF awarded nearly $100 million over 5 years to data preservation and infrastructure development projects under the DataNet
initiative.3 DataONE4 is one of the awards, which is dedicated to large-scale preservation and access to multiscale, multidiscipline, and
multinational data in biology, ecology, and the environmental sciences. Recently, the White House announced a $200 million initiative to create
tools to improve scientific research by making sense of the huge amount of data now available. Programs like these are needed to improve the
technology required to work with large and complex sets of digital data.5
Researchers and scientists in academia, industry, and government may choose to store and share their data in a number of ways. Among the
various means, data repositories often appear to offer the best method of ensuring that data are preserved and presented in a high-quality
manner and made available to the largest number of people. Data repositories are constructed with the chief goals of storage and preservation and an emphasis on use and reuse. In other words, the implementation of data repositories is constrained not only by the needs of data sharing but also by those of concurrent data access. They have data as their primary focus and are often shared by a scientific community.
The goal of this chapter is to explore the practice of big data sharing among academics and issues related to this sharing. The background section
of this chapter reviews literature on researchers’ practices and trends with regard to data sharing and access. The main section reviews
disciplinary data repositories in the areas of social science, life science, and physical science, and describes and compares the requirements and
policies at disciplinary repositories. It also examines recommended and accepted file formats and data structure repositories, metadata, and
specifications and guidelines on data access and sharing.
DATA SHARING PRACTICES AND TRENDS
Data sharing has been a critical issue in scientific research for some time. Since the National Academy of Sciences published the book Sharing
Research Data in 1985, the benefits of sharing data have been discussed widely and data sharing has been a regular practice in many academic
disciplines. While researchers may share a relatively low volume of data via emails or disks, the rate of data produced in many academic
disciplines has now exceeded the growth of computational power predicted by Moore’s law; the enormous amount of data that scholars and
researchers generate now can easily overwhelm their computers. As the size of data sets grows, managing heterogeneous data sets that contain
different formats, data types, and descriptions can be burdensome. This leads to difficulties sharing and reusing data as well. The explosion in the
amount of data available means researchers need better tools and methods for handling, analyzing, and storing data.
As a result, cloud computing has become a popular solution for storing, processing, and sharing data because it integrates networks, servers,
storage, applications, and services, enabling convenient and on-demand access to a shared pool of configurable computing resources.
Additionally, cloud computing offers other benefits: cost efficiency as it reduces investment in generic hardware systems; faster implementation
of features of new products and systems; flexible provisioning and resource scalability of systems; and a pay-as-you-go service model (Agrawal,
Das, & Abbadi, 2011). Currently, cloud computing services are provided by commercial vendors, such as Amazon and Microsoft, as well as
academic centers and government agencies. Such cloud computing services may be suitable for sharing certain types of data and offer immediate
scalability of storage resources necessary to successfully facilitate a big data sharing project, but they are not recommended for data that may be
confidential because of the issue of individual privacy (Wang, 2010). In addition, users need to be aware that they do not control where data are
ultimately stored.
As scientific research endeavors are increasingly carried out by researchers collaborating across disciplines, laboratories, institutions, and
geographic boundaries, additional tools and services are needed. Because collaboration involves geographically distributed and heterogeneous
resources such as computational systems, scientific instruments, databases, sensors, software components, and networks, information and
computing technology, popularly called “eScience,” plays a vital role in large-scale and enhanced scientific ventures; as such, grid computing has
become an emerging infrastructure for eScience applications by integrating large-scale, distributed, and heterogeneous resources. In their
foundational paper “The anatomy of the grid,” Foster, Kesselman, and Tuecke (2001) distinguished grid computing from conventional distributed
computing or cloud computing by its focus on large-scale resource sharing, innovative applications, and in some cases, high-performance
orientation. Scientific communities, such as high-energy physics, gravitational-wave physics, geophysics, astronomy, and bioinformatics, are
utilizing grids to share, manage, and process large data sets (Taylor, Deelman, Gannon, & Shields, 2006).
A Virtual Research Environment (VRE), a platform which allows multiple researchers in different locations to work together in real time without
restrictions, has been developed from this context (e.g., De Roure, Goble, & Stevens, 2009; Neuroth, Lohmeier, & Smith, 2011; Avila-Garcia,
Xiong, Trefethen, Crichton, Tsui, & Hu, 2011). VREs include access to data repositories and grid computation services, but also collaboration
tools, such as e-mail, wikis, virtual meeting rooms, tools for sharing data, and the ability to search for related information. The key issue of a VRE
is the development and implementation of an information and data-sharing concept. It should be noted that such VREs require significant setup
and maintenance costs, and are usually mono-institutional. Consequently, many researchers are not comfortable with their features and may still
resort to sharing data via email and/or online file-sharing services.
The following section reviews the impact of big data in four different academic disciplines and changes in their data-sharing practices.
Arts and Humanities
Researchers in the arts and humanities often prefer to publish considered works in monographs. There has been a slow adoption of digital
publishing in the field, because there has been a general reluctance to experiment with new technologies and a distrust of online dissemination.
Nevertheless, networked access to information sources in the field has increased since the early 1990s (American Council of Learned Society,
2006). Examples include Project MUSE,6 which provides online access to 500 full-text journals and 15,000 full-text e-books from 200 nonprofit
publishers in the humanities; JSTOR,7 a full-text archive of journal articles in many academic fields, including arts, literature, and humanities;
and ARTStor,8 which holds hundreds of thousands of images contributed by museums, archaeological teams, and photo archives.
Blanke, Dunn, and Dunning (2006) foresee that research in the humanities is becoming data-centric, with a large amount of data available in
digital formats and that this trend will change the landscape of humanities research. The increasing availability of massive collections of digitized
books, newspapers, images, and audio, combined with the development of accessible tools for analyzing those materials, means computationally
based approaches are growing in the field. Recently, a number of initiatives have been taken; for instance, Digging Into Data9 is an initiative that
shows the importance of big data in the humanities and social sciences.
Compared with other disciplines, researchers in the arts and humanities do not produce a great deal of research data. However, a critical mass of
information is often necessary for understanding both the context and the specifics of an artifact or event. Thus, the field has used primary
sources and data from manuscripts, early printed editions of classical texts, ancient inscriptions, excavated artifacts, and images of classical art
objects, among many other types of sources. Nevertheless, some disciplines produce data that carry complex interoperability and semantic
challenges. For instance, researchers in the field of archaeology, epigraphy, and art history produce lexica, edited catalogs, and statistical data.
They have long been proponents of data sharing and reuse because of the unrepeatable nature of the work. Open access collections of primary
archaeological data, such as the Archaeology Data Service10 and the Archaeobotanical Database,11 bring large quantities of raw data directly to the
researcher’s fingertips.
Social Sciences
The social sciences tend to be more dependent on technology than the arts and humanities. In particular, quantitative social science researchers have long used mainframes
and personal computers for statistical analysis and other types of data processing even though they have dealt with smaller samples. Social
scientists have also expressed interest in using technology to improve access to conference papers, unpublished research, and technical reports
(American Council of Learned Society, 2006). Data have often been shared in many different ways, ranging from informal dissemination with
known peers to formal archives. However, as Crosas (2011) asserted, traditional approaches to storing and sharing data sets in social sciences
have been either inadequate or unattractive to many researchers.
There have been few clear guidelines on data sharing in the social sciences as data collection and use are bound by rules or agreements relating to
confidentiality and legal and ethical considerations. These factors have been significant barriers to the sharing and reuse of research data. Yet, the
value of data sharing has become apparent to the social science field because of massive increases in the availability of informative social science
data (King, 2011). The urgent need to understand complex, global phenomena leveraging the deluge of data arising from new technologies is
driving an agenda in the social sciences.
The social science research community, in fact, was among the first to recognize the benefits of archiving digital data for use and reuse. Since the
advent of survey research in the 1930s, many data archives, most of them nationally funded, have been established around the world to preserve
social science data resources (Vardigan & Whiteman, 2007). Data archive and data library development in the field has been discussed since the
late 1960s (Heim, 1982). In the United States, a set of archives with ties to major research universities has emerged, including the Roper Center
for Public Opinion Research at the University of Connecticut;12 the Howard W. Odum Institute for Research in Social Science at the University of
North Carolina at Chapel Hill;13 the Henry A. Murray Research Archive14 and the Harvard-MIT Data Center at the Institute for Quantitative Social Science,15 Harvard University; and the ICPSR at the University of Michigan.16
Life Sciences
The research process in the life sciences often involves the use of data produced from a range of sources, with data generated in the laboratory
being complemented by imported data. The quantity of data created in the life sciences is certainly growing at an exponential rate and the size of
individual data sets is increasing massively. Large-scale data sets from genomics, physiology, population genetics, and neuroimaging are rapidly
driving research. Today, genomics technologies enable individual labs to generate terabyte- or even petabyte-scale data. At the same time,
computational infrastructure is required to maintain and process such large-scale data sets and to integrate them with other large-scale sets.
Furthermore, interdisciplinary collaborations among experimental biologists, theorists, statisticians, and computer scientists have become key to
making effective use of data sets (Stein, 2008). However, existing data storage systems and data analysis tools are not adapted to handle large
data sets and have not been implemented on platforms that can support such big data sets.
It should be noted that there is no single data culture for the life sciences because the field ranges in scope and scale from the field biologist
whose data are captured in short-lived notebooks as a prelude to a narrative explanation of observations to the molecular biologist whose data
are born digital in near terabyte quantities and are widely shared through data repositories (Thessen & Patternson, 2011). Nevertheless, the life
sciences have had a stronger culture of open data publication and sharing than other disciplines. Researchers in the field have participated in
collaborative environments that allow data to be annotated, such as EcoliWiki17 or DNA Subway.18 Some journal publications have data-sharing
policies that encourage their authors to archive primary data sets in an “appropriate” public data repository. In the biosciences, the mandatory
deposits of sequence data to GenBank19 or the Protein Data Bank20 is well established; these repositories have highly structured data, rich
metadata, and analytical capabilities uniquely tailored to their contents (Scherle et al., 2008). All large funding bodies, such as NIH, now make
data sharing a requirement of support for all projects and have created data repositories for the funded research data. For instance, the National
Database for Autism Research,21 an NIH-funded research data repository, aims to accelerate progress in autism research through data sharing,
data harmonization, and the reporting of research results.
Physical Sciences
The physical sciences, which are an aggregation of astronomy, astrophysics, chemistry, computer sciences, mathematics, and physics, deal with
more data than other disciplines. The field is also experiencing an unprecedented data avalanche due to the fast advance and evolution of information
technology that enables capture, analysis, and storage of huge quantities of data. Astronomy, for example, has a long history of acquiring,
systematizing, and interpreting large quantities of data but has become a very data-rich science, driven by the advances in telescope, detector,
and computer technology (Brunner, Djorgovski, Prince, & Szalay, 2002). It was one of the first disciplines to embrace data-intensive science with
the Virtual Observatory,22 enabling highly efficient access to data and analysis tools at a centralized site.
Physicists have long enjoyed the tradition of sharing their research ideas and results with their peers in the form of preprints through an online
archive database, arXiv,23 operated by Cornell University. The astronomy community also has a well-established culture of data sharing that was
pioneered by the National Aeronautics and Space Administration (NASA) space missions. Many astronomical observatories and institutes have
important data archives and databases that contain large amounts of data. Data discovery, access, and reuse are common in astronomy; the
Space Telescope Science Institute (STScI) reports that more papers are published with archived data sets than with newly acquired data (Space
Telescope Science Institute, 2012). Climate and environmental science is another field that benefits from the existence of centrally funded data
archives, such as National Climate Data Center (NCDC)24 and GEONGrid Portal.25
CASE STUDIES: DISCIPLINARY REPOSITORIES
A repository is defined as “a networked system that provides services pertaining to a collection of digital objects” (Bekaert & Sompel, 2006, p.4).
A repository, in general, provides services for the management and dissemination of data by making it discoverable, providing access, protecting
its integrity, ensuring long-term preservation, and migrating data to new technologies (Lynch, 2003).
Disciplinary repositories, often called domain repositories or subject-based repositories, are thematically defined to serve specific community
users and store and provide access to the scholarly output of a particular academic domain, field, or specialty. Their scope is often specialized;
they are often community endorsed, conform to established standards, serve the needs of researchers in the discipline, and bring together
research from multiple institutions and/or funders. In contrast to institutional repositories, these repositories accept work from scholars across
institutions. They have specialized knowledge of approaches to data in a specific scientific field, such as domain-specific metadata standards, and
have the ability to give high-impact exposure to research products. These disciplinary repositories can also act as stores of research data sets
related to a particular discipline. Financial support for such repositories may come from a variety of sources; grant funding often covers the start-
up costs, institutions provide pro bono services, and volunteers often serve as editors. Universities are sometimes motivated to host and manage
a disciplinary repository to reflect and build upon a center of excellence. There are a number of repositories run by federal agencies, such as
NASA.
As the preservation and reuse of data through data sharing has become a strategic issue, various discussions on disciplinary repositories have
been presented in several publications. Green and Gutmann (2007) differentiated institutional repositories and disciplinary repositories; they
asserted that institutional repositories do not fully support the scientific research lifecycle as they often focus on capturing final or near-final
forms of scholarly productivity and partner less with researchers during the initial process and phase of a typical research project. In a survey of
data repositories, Marcial and Hemminger (2010) found that a significant majority of the data repositories they identified on the open Web are
funded or directly affiliated with individual universities. In addition, most of those repositories are described as highly domain specific. Johnson
and Eschenfelder (2011), in their preliminary report of the study on access and use control in data repositories, reported some interesting
differences among repositories: repositories in biology and the social sciences cited privacy as a reason for restricting access to data because they deal with human subject data, while social science repositories also mentioned intellectual property as a concern.
The following section reviews disciplinary repositories from three disciplines: Dryad for life science, ICPSR for social science, and the National
Oceanographic Data Center (NODC) for physical science. Each repository’s requirements and policies are explored. In addition, recommended
and accepted file formats and data structure repositories, metadata, and policies regarding data access and sharing are examined.
Dryad
Dryad26 is an international, open, cost-effective data repository for the preservation, access, and reuse of scientific data and objects underlying
published research in the field of evolutionary biology, ecology, and related disciplines. The repository is designed specifically to enable authors
to archive data upon publication and to promote the reuse of that data.
Dryad, named after tree nymphs in Greek mythology, was initially designed in 2007 as a “response to a ‘crisis of data attrition’ in the field of
evolutionary biology” (Greenberg, 2009, p. 386) by the National Evolutionary Synthesis Center and Metadata Research Center at the School of
Information and Library Science, University of North Carolina at Chapel Hill. Today it is operated as a nonprofit organization, governed by its
member organizations, including journals, publishers, scientific societies, funding agencies, and other stakeholders.27 In particular, it has been
supported by a number of grants, including those from the Institute of Museum and Library Services (USA), the Joint Information Systems Committee
(UK), and the National Science Foundation (USA); these grants and others have allowed continued development of the repository and ensured its
sustainability. The repository was initially developed to help support the coordinated adoption, by a number of leading ecology and evolution journals, of a policy that would require all authors to archive their data at the time of publication.28
The scientific, educational, and charitable mission of Dryad is to promote the availability of data underlying findings in the scientific literature for
research and educational reuse. It welcomes data files associated with any published article in the biosciences, as well as software scripts and
other files important to the article. The repository software is based on DSpace, which allows Dryad to leverage a technology platform being used
by hundreds of organizations, and is maintained by a large and active open-source software community. Dryad accepts data in any format, from
spreadsheets or other tables, images, and alignments to video files. All data files in Dryad are available for download and reuse, except those
under a temporary embargo period, as permitted by editors of the relevant journals. Primary access to Dryad is through its Web interface, where
users most commonly search on authors, titles, subjects, and other metadata elements. As of May 23, 2013, Dryad contains 3,287 data packages
and 9,446 data files, associated with articles in 223 journals (Dryad, n.d.).
For submissions, Dryad recommends authors use nonproprietary file formats wherever possible and use descriptive file names to reflect the
contents of the file. Authors are also asked to provide documentation to help ensure proper data reuse and additional keywords to make the data
easier to discover. The documentation, in the form of ReadMe files, consists of a file name, a short description of what data it includes, how data
are collected, who collected the data, and whom to contact with questions. Once data are submitted, the data files are given a digital object
identifier (DOI), which is a permanent, unique, and secure identifier that should be used whenever referring to data in Dryad. All data deposited
to Dryad is released to the public domain under Creative Commons Zero (CC0), which reduces legal and technical impediments to the reuse of
data by waiving copyright and related rights to the extent permitted by law.
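The documentation fields named above can be captured in a short plain-text ReadMe. The helper below is a hypothetical sketch, not part of Dryad’s own tooling; it simply assembles the recommended elements, namely a file name, a short description, how and by whom the data were collected, and a contact for questions, into a README.txt that could accompany a deposit.

# Hypothetical sketch of a ReadMe generator for a data deposit; Dryad itself
# accepts ReadMe documentation in any plain-text form.
def build_readme(file_name, description, collection_method, collector, contact):
    lines = [
        "File: " + file_name,
        "Description: " + description,
        "How the data were collected: " + collection_method,
        "Who collected the data: " + collector,
        "Contact for questions: " + contact,
    ]
    return "\n".join(lines)

readme = build_readme(
    file_name="beak_measurements.csv",  # example values are invented
    description="Beak length and depth for 120 individuals, 2012-2013",
    collection_method="Field calipers; each bird measured twice and averaged",
    collector="J. Researcher, Example University",
    contact="j.researcher@example.edu",
)
with open("README.txt", "w") as handle:
    handle.write(readme + "\n")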
Specifically, Dryad focuses on providing thorough metadata to allow new access to the unique data types generated by its constituents
(Greenberg, White, Carrier, & Scherle, 2009; Vision, 2010). Dryad’s metadata application profile supports basic resource and data discovery,
with the goal of being interoperable with other data repositories used by evolutionary biologists. The application profile has been developed in
compliance with the Dublin Core Metadata Initiative’s guidelines, including the Singapore Framework. The application profile Version 3.0
consists of three modules: 1) a publication module for representing an article associated with content in Dryad; 2) a data package module for
representing a group of data files associated with a given publication; and 3) a data file module for representing a deposited bitstream (Dryad
Development Team, 2012). The profile (Table 1) includes 19 properties.
Table 1. Dryad application profile: Property names and definitions
ICPSR
As one of the oldest and largest archives of digital social science data in the world, ICPSR29 was originally started as a partnership among 21
universities in 1962. It has served as the long-term steward and a primary channel for sharing a vast archive of social science data.
As a part of the Institute of Social Research at the University of Michigan, ICPSR is a consortium of 700 academic institutions and research
organizations worldwide.30 This consortium has served the social science community’s need for capturing data and has evolved its practices
through the technology transitions from punch cards, floppy disks, and compact discs (CDs) to today’s electronic submissions (Rockwell, 1994;
Vardigan & Whiteman, 2007). Each entity provides representation to a council that manages “administrative, budgetary, and organizational
policies and procedures” (Beattie, 1979, p. 354) of the consortium. The member institutions of the consortium pay annual dues that give their faculty, staff, and students free and direct access to the full range of data resources and services provided by ICPSR.
The mission of ICPSR is to “provide leadership and training in data access, curation, and methods of analysis for a diverse and expanding social
science research community” (ICPSR, n.d.). ICPSR has collected and made available the data sets of many major government studies in their
entirety, along with polls and surveys conducted by organizations and individual researchers. The repository data span many disciplines,
including sociology, political science, criminology, history, education, demography, gerontology, international relations, public health,
economics, and psychology; it maintains a data archive of more than 500,000 files of research in such disciplines. Most of the data sets in the
ICPSR are raw data from surveys, censuses, and administrative records. Furthermore, ICPSR has collaborated with a number of funders,
including U.S. statistical agencies and foundations, to widen its vast archive of social science data for research and instruction; as a result, it hosts
23 specialized collections of data thematically arranged by topics, including education, aging, criminal justice, substance abuse, and terrorism.
Data submissions at ICPSR are initiated in various ways (Vardigan & Whiteman, 2007). The data deposit may be voluntary and unsolicited; for
instance, a researcher who understands the importance of long-term preservation of digital data may decide to deposit his or her data for future
generations of scholars to use. In other cases, data are submitted as a requirement of a grant or sponsoring agency agreement with ICPSR.
Deposits are made using a secure data deposit form to describe the data collection and upload content. ICPSR accepts data related to the social
sciences for research and instructional activities, but selectively accepts data that fit within the scope of its collection and would be of potential
future interest to its members, based on the appraisal criteria in its collection development policy (ICPSR, 2012). ICPSR accepts both quantitative
data for standard statistical software packages and qualitative data, including transcripts and audiovisual media for preservation and
dissemination. Increasingly, however, ICPSR seeks out researchers and research agencies to identify content for acquisition through a wide
variety of means, including press releases and published reports announcing results of a study, and papers presented at professional meetings
and scholarly conferences (Gutmann, Schürer, Donakowski, & Beedham, 2004).
Archival work at ICPSR begins by teaming staff members with the researcher to ensure that ICPSR understands the data that the investigator
wishes to deposit and to identify any constraints on future access to data (Albright & Lyle, 2010). Once the data are received, they are reviewed
for confidentiality risks, errors, and internal consistency issues; direct and/or indirect identifiers, such as name, Social Security number, or
telephone number, are removed and level of access is determined. At this point in the process, data with a proprietary data format, such as SPSS,
are transformed into a more appropriate software-independent archival format, specifically raw ASCII text data with SPSS “setup” files that
enable a user to read in the raw data to re-create the proprietary SPSS format. Additionally, ICPSR creates high-level metadata about the study
using Data Documentation Initiative (DDI) markup, an international standard for documenting social science research, drawing on information provided by data depositors and other sources (Table 2). A DOI is assigned to each study held, and ICPSR encourages the use of DOIs in journal publications and other articles to make it easier for researchers to find relevant work.
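As a rough illustration of the software-independent archival format just described, the sketch below reads a fixed-width raw ASCII data file using the column positions, variable names, and value labels that an SPSS-style setup file records. The file name, layout, and labels are hypothetical; an actual ICPSR study supplies its own setup files with the real specifications.

# Sketch: reading archived raw ASCII data from the column specifications that
# an SPSS "setup" file would encode (hypothetical layout, for illustration).
import pandas as pd

colspecs = [(0, 6), (6, 8), (8, 9)]      # start/end positions of each field
names = ["CASEID", "AGE", "VOTE"]        # variable names from the setup file

# study_0001.txt stands in for the raw ASCII data file delivered with a study.
data = pd.read_fwf("study_0001.txt", colspecs=colspecs, names=names)

# Value labels, also carried in the setup file, can then be applied.
vote_labels = {1: "Yes", 2: "No", 9: "Missing"}
data["VOTE_LABEL"] = data["VOTE"].map(vote_labels)
print(data.head())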
Table 2. Important metadata elements for ICPSR
NODC
The NODC31 is one of three national environmental data centers operated by the National Oceanic and Atmospheric Administration (NOAA) of
the U.S. Department of Commerce.32 As the world’s largest collection of publicly available oceanographic data, it provides scientific and public
stewardship for national and international marine environmental and ecosystem data and information.
Established in 1961, the NODC was originally an interagency facility administered by the U.S. Naval Hydrographic (later Oceanographic) Office.
The NODC was transferred to NOAA in 1970 when NOAA was created by Executive Order of the President of the United States. The mission of
the NODC is to enhance oceanographic services and promote further marine research by making ocean data and products available in real and
non-real time to policymakers and marine communities for the efficient management and sustainable development of coastal and marine
resources. Since May 2011, the NODC has served as the NOAA Ocean Acidification Program (OAP) data management focal point through its
Ocean Acidification Data Stewardship (OADS) project (NODC, n.d.).
NODC holdings include in situ and remotely sensed physical, chemical, and biological oceanographic data from coastal and deep ocean areas;
they were originally collected for a variety of operational and research missions by federal, state, and local organizations, including the
Department of Defense, universities and research institutions, international data exchange partners, and industry. It also offers climatology
products, ocean profile data, fisheries closure data, coastal ecosystem maps, ocean currents data, satellite data, as well as a selected bibliography
(Collins & Rutz, 2005). Most digital data in the NODC are available to the general public in their original format at no cost or on customized media
for the cost of distribution. For some types of digital data, a specialized product is available to subset and select or retrieve specific data from
multiple sources. For example, the NODC World Ocean Database contains a collection of millions of temperature, salinity, and other parameter
profiles that have been reformatted to a common format (Collins & Rutz, 2005).
The Federal Ocean Data Policy requires that appropriate ocean data and related information collected under federal sponsorship be submitted to
and archived by designated national data centers. For data submission, depositors are required to provide proper data documentation, which
includes complete descriptions of what parameters/observations were measured; how they were measured/collected; where and when they were
collected (latitude, longitude, Greenwich mean time [GMT], depth[s], altitude[s]), and other geographic descriptions; the data collector or
principal investigator; collecting institution/agency and platforms; collecting/measuring instrumentation; data processing and analyses
methodologies; descriptions of units, precisions, and accuracies of measured parameters; descriptions of the data format; and the computer-compatible media submitted. The NODC also solicits references to literature pertinent to the data, both published and gray.
Upon receipt and acceptance by the NODC, a unique accession number is generated for each data submission. Files are often converted into
ASCII so they are readable for the long term. A copy of the original data and metadata files, as well as any relevant additional information about
the original data, is placed in the archive. There are also “deep archive processes” that include the creation and validation of off-site copies
intended for use in disaster recovery situations or when the local working archive copy is rendered temporarily unavailable due to equipment
malfunction or other reasons (Collins, 2004). The system exports metadata into XML files. The metadata format follows the Federal Geographic
Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM), which uses a controlled vocabulary for data set
descriptions (Collins et al., 2003). Such data sets and metadata are periodically reviewed for completeness and correctness.
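The sketch below shows, in outline, how such an export might write a minimal metadata record to XML. The element names follow the general shape of the FGDC CSDGM standard (identification information, abstract, bounding coordinates), but the record is abbreviated and illustrative; a real CSDGM record contains many more mandatory elements and would be validated against the standard, and the accession number and values shown here are invented.

# Sketch: exporting a minimal, FGDC-CSDGM-flavored metadata record as XML.
import xml.etree.ElementTree as ET

metadata = ET.Element("metadata")
idinfo = ET.SubElement(metadata, "idinfo")          # identification information

citation = ET.SubElement(idinfo, "citation")
ET.SubElement(citation, "title").text = "Temperature and salinity profiles, 2013 cruise"

descript = ET.SubElement(idinfo, "descript")
ET.SubElement(descript, "abstract").text = (
    "CTD profiles collected in the Gulf of Mexico, converted to ASCII for archiving.")

spdom = ET.SubElement(idinfo, "spdom")              # spatial domain
bounding = ET.SubElement(spdom, "bounding")
for tag, value in [("westbc", "-97.0"), ("eastbc", "-82.0"),
                   ("northbc", "30.5"), ("southbc", "18.0")]:
    ET.SubElement(bounding, tag).text = value

ET.ElementTree(metadata).write("accession_0001234_metadata.xml",
                               encoding="utf-8", xml_declaration=True)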
Table 3. Metadata used in the NODC
FUTURE RESEARCH DIRECTIONS AND CONCLUSION
The need for big data sharing in academic disciplines via publicly accessible repositories has been emphasized in this chapter. The remainder of
this chapter discusses four issues that future study should address in terms of sharing data via disciplinary repositories.
First, as discussed, most disciplinary repositories exist to serve specific community users; they store and provide access to the scholarly output of
their identified community. Community-based approaches to the challenge of high volume within the data domain have proven to be the most
effective and efficient in the long term. In particular, community standards for data description and exchange are important because they facilitate
data reuse by making it easier to import, export, and combine data (Lynch, 2008). Recognizing the importance of such standards, the NSF issued
a call for proposals to support community efforts to provide for broad interoperability through the development of mechanisms, such as robust
data and metadata conventions, ontologies, and taxonomies. Metadata is a critical factor in this area. It is important to provide rich metadata,
which includes information about the context, content, quality, provenance, and/or accessibility of a set of data, and make it openly available to
enable other researchers to understand the potential for further research and reuse of the data (Griffiths, 2009). Metadata, using appropriate
standards, needs to be used to ensure adequate description and control over the long term. As a matter of fact, many academic disciplines have
supported initiatives to formalize the metadata standards the community deems to be required for data reuse; they already have established
metadata standards for describing and sharing data sets within the discipline. In disciplinary repositories, metadata standards generally are most
usefully considered within the limits of their user communities’ standard practices. For instance, the geospatial field has long utilized the Federal
Geographic Data Committee’s (FGDC) Content Standard for Digital Geospatial Metadata; the NODC’s metadata requirements are based on this
standard. Within the social science data community, a standard for describing the content of data files has been established through the Data
Documentation Initiative (DDI); ICPSR uses the DDI metadata specification in documenting its data holdings.
Second, the long-term sustainability of disciplinary repositories is contingent upon various aspects of management, such as maintaining the
repository services, managing the repository budget, and coordinating activities of repository personnel. One way to promote such sustainability
is to recognize and provide funding to support them. Funding is essential for the physical facilities, viable technological solutions for conducting
data stewardship processes, programs to assess services, and programs to educate and train a skilled workforce. It is exemplary for the government to commit long-term funding to a national data repository where researchers can deposit publicly funded research data, as such repositories can accept data sets that are too big to store locally in institutional repositories and can provide convenient public access.
number of repositories hosted and managed by federal agencies, such as GenBank, Protein Data Bank, and National Nuclear Data Center. On the
other hand, disciplinary repositories owned and managed by other agencies, such as libraries and research institutions, still need federal support
and sponsorship. Green and Gutmann (2007) asserted, “[T]he next step in the evolution of digital repository strategies should be an explicit
development of partnerships between researchers, institutional repositories, and domain-specific repositories” (p. 50). In a similar vein, Lynch
(2008) took the view that the challenges of big data in science can be overcome with a focused effort and collaboration among funders, institutions, and
researchers in academia.
Third, a robust and scalable technical infrastructure is essential to support the collection, storage, retrieval, and sharing of data. Repositories
have grown at a rapid pace over the past decade with open-source software, including EPrints, DSpace, and Fedora. In particular, DSpace and
Fedora are two of the largest open-source software platforms for managing and providing access to digital content. DSpace33 is a turnkey
application for building digital repositories; its built-in structure consists of a pre-determined hierarchy that allows users to organize content
easily. It provides an internal metadata schema based on Dublin Core for describing the content. It also includes a variety of preservation and
management tools and a simple workflow for uploading, approving, and making content available via the web. The Dryad repository reviewed in
this chapter is based on DSpace. Fedora (Flexible Extensible Digital Object Repository Architecture) Commons34 is a flexible framework to
manage, preserve, and link data of any format with corresponding metadata. It enables users to customize a pre-existing application designed to
work with Fedora, such as Islandora or Libra, a variant of Hydra. It permits users to construct simple to complex object models representing any
number of unique use cases for data preservation and archiving. Recently, considerable attention has been paid to iRODS, the integrated Rule-Oriented Data System, a data grid software system developed by the Data Intensive Cyber Environments research group and
collaborators. It supports collaborative research and, more broadly, the management, sharing, publication, and long-term preservation of data
that are distributed. It also enables the management of large sets of computer files, which can range in size from moderate to a hundred million
files totaling petabytes of data. A particular feature of iRODS is the ability to represent data management policies in terms of rules.
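To make the DSpace point above concrete, a deposited item’s description can be pictured as a small set of qualified Dublin Core field-value pairs, in the general style of DSpace’s default metadata registry. The item, values, and handle below are hypothetical, and the field list is abbreviated.

# Hypothetical item described with qualified Dublin Core fields of the kind
# DSpace uses internally (abbreviated; values are invented).
item_metadata = {
    "dc.title": "Survey of data-sharing practices, 2013",
    "dc.contributor.author": ["Researcher, J.", "Collaborator, K."],
    "dc.date.issued": "2013-11-01",
    "dc.description.abstract": "Responses from 312 researchers on sharing research data.",
    "dc.identifier.uri": "http://hdl.handle.net/12345/6789",  # placeholder handle
    "dc.rights": "CC0 1.0 Universal",
}

# Flatten repeating fields the way a simple export might.
for field_name, value in item_metadata.items():
    for v in (value if isinstance(value, list) else [value]):
        print(field_name + ": " + v)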
Last, implementing appropriate regulatory and legal frameworks is important. To this end, it may be ideal to mandate digital data deposits into
disciplinary repositories. Even though several funding agencies have their own data-sharing requirements, some policies are ambiguous with
respect to what must be released (Borgman, 2012). Furthermore, federal agencies can account for inherent differences between scientific
disciplines and different types of digital data when developing data management policies by adopting a relatively general mandate for data
sharing while requiring more specificity for the practices within each discipline. However, it should be noted that not all data is reusable or can be
repurposed. Therefore, funding agencies should customize requirements based on the type of research proposed. Additionally, federal policy
should give clear direction as to what data may be shared publicly to increase legal certainty for data users and producers. Such direction should
be regulated by legal standards that ensure and promote free public access, discovery, and reuse.
This work was previously published in Big Data Management, Technologies, and Applications edited by Wen-Chen Hu and Naima Kaabouch, pages 177-194, copyright year 2014 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Agrawal, D., Das, S., & Abbadi, A. E. (2011). Big data and cloud computing: Current state and future opportunities. In Proceedings of the
14th International Conference on Extending Database Technology (pp. 530-533). IEEE.
Albright, J. J., & Lyle, J. A. (2010). Data preservation through data archives. PS: Political Science & Politics , 43(1), 17–21.
doi:10.1017/S1049096510990768
Avila-Garcia, M. S., Xiong, X., Trefethen, A. E., Crichton, C., Tsui, A., & Hu, P. (2011). A virtual research environment for cancer imaging
research. In Proceedings of the IEEE Seventh International Conference on eScience (pp. 1-6). IEEE.
Beattie, R. (1979). ICPSR: Resources for the study of conflict resolution: The inter-university consortium for political and social research. The
Journal of Conflict Resolution , 23(2), 337–345. doi:10.1177/002200277902300207
Birney, E., Hudson, T. J., Green, E. D., Gunter, C., Eddy, S., & Rogers, J. (2009). Prepublication of data sharing. Nature , 461, 168–170.
doi:10.1038/461168a
Blanke, T., Dunn, S., & Dunning, A. (2006). Digital libraries in the arts and humanities – Current practices and future possibilities.
In Proceedings of the 2006 International Conference on Multidisciplinary Information Sciences and Technologies (INSciT 2006). INSciT.
Borgman, C. L. (2012). The conundrum of sharing research data.Journal of the American Society for Information Science and Technology , 63(6),
1059–1078. doi:10.1002/asi.22634
Brunner, R. J., Djorgovski, S. G., Prince, T. A., & Szalay, A. S. (2002). Massive datasets in astronomy . In Abello, J., Pardalos, P., & Resende, M.
(Eds.), Handbook of massive data sets . Norwell, MA: Kluwer Academic Publishers.
Collins, D. W. (2004). US national oceanographic data center: Archival management practices and the open archival information system
reference model. In Proceedings of the 21st IEEE Conference on Mass Storage Systems and Technologies. IEEE. Retrieved from
http://storageconference.org/2004/Papers/39-Collins-a.pdf
Collins, D. W., & Rutz, S. B. (2005). The NODC archive management system: Archiving marine data for ocean exploration and beyond.
In Proceedings of MTS/IEEE Data of Conference. IEEE. doi:10.1109/OCEANS.2005.1640202
Collins, D. W., Rutz, S. B., Dantzler, H. L., Ogata, E. J., Mitchell, F. J., Shirley, J., & Thailambal, T. (2003). Introducing the U.S. NODC archive
management system: Stewardship of the nation’s oceanographic data archive. Earth System Monitor , 14(1). Retrieved from
http://www.nodc.noaa.gov/media/pdf/esm/ESM_SEP2003vol14no1.pdf
Crosas, M. (2011). The dataverse network: An open-source application for sharing, discovering and preserving data. D-Lib Magazine , 17(1/2).
Retrieved from http://www.dlib.org doi:10.1045/january2011-crosas
De Roure, D., Goble, C., & Stevens, R. (2009). The design and realization of the myExperiment virtual research environment for social sharing of
workflows. Future Generation Computer Systems, 25, 561–567. doi:10.1016/j.future.2008.06.010
Diebold, F. X. (2003). Big data dynamic factor models for macroeconomic measurement and forecasting: A discussion of the papers by Reichlin
and Watson . In Dewatripont, M., Hansen, L. P., & Turnovsky, S. (Eds.), Advances in Economics and Econometrics: Theory and applications .
Cambridge, UK: Cambridge Press. doi:10.1017/CBO9780511610264.005
Fienberg, S. E. (1994). Sharing statistical data in the biomedical and health sciences: Ethical, institutional, legal, and professional
dimensions. Annual Review of Public Health , 15, 1–18. doi:10.1146/annurev.pu.15.050194.000245
Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. International Journal of High
Performance Computing Applications , 15(3), 200–222. doi:10.1177/109434200101500302
Gray, J. (2009). Jim Gray on eScience: A transformed scientific method . In Hey, T., Tansley, S., & Tolle, K. (Eds.), The fourth paradigm: Data-
intensive scientific discovery . Redmond, WA: Microsoft Research.
Green, A. G., & Gutmann, M. P. (2007). Building partnerships among social science researchers, institution-based repositories and domain specific data archives. OCLC Systems & Services, 23(1), 35–53. doi:10.1108/10650750710720757
Greenberg, J. (2009). Theoretical considerations of lifecycle modeling: An analysis of the dryad repository demonstrating automatic metadata
propagation, inheritance, and value system adoption. Cataloging & Classification Quarterly , 47(3-4), 380–402.
doi:10.1080/01639370902737547
Greenberg, J., White, H., Carrier, S., & Scherle, R. (2009). A metadata best practice for a scientific data repository. Journal of Library
Metadata , 9(3/4), 194–212. doi:10.1080/19386380903405090
Griffiths, A. (2009). The publication of research data: Researcher attitudes and behaviors. International Journal of Digital Curation ,4(1), 46–56.
doi:10.2218/ijdc.v4i1.77
Gutmann, M., Schürer, K., Donakowski, D., & Beedham, H. (2004). The selection, appraisal, and retention of digital social science data. Data
Science Journal , 3, 209–221. doi:10.2481/dsj.3.209
Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., & Hide, W. (2008). Big data: The future of biocuration. Nature ,455(7209), 47–50.
doi:10.1038/455047a
King, G. (2011). Ensuring the data-rich future of the social sciences. Science, 331(6018), 719–721. doi:10.1126/science.1197872
Lynch, C. A. (2003). Institutional repositories: Essential infrastructure for scholarship in the digital age. ARL: A Bimonthly Report, 226.
Retrieved from http://www.arl.org
Marcial, L., & Hemminger, B. (2010). Scientific data repositories on the web: An initial survey. Journal of the American Society for Information
Science and Technology , 61(10), 2029–2048. doi:10.1002/asi.21339
Neuroth, H., Lohmeier, F., & Smith, K. M. (2011). TextGrid – Virtual research environment for the humanities. The International Journal of
Digital Curation , 2(6), 222–231.
Rockwell, R. C. (1994). An integrated network interface between the researcher and social science data resources: In search of a practical
vision. Social Science Computer Review , 12(2), 202–214. doi:10.1177/089443939401200205
Scherle, R., Carrier, S., Greenberg, J., Lapp, H., Thompson, A., Vision, T., & White, H. (2008). Building support for a discipline-based data
repository. In Proceedings of the 2008 International Conference on Open Repositories. Retrieved from
http://pubs.or08.ecs.soton.ac.uk/35/1/submission_177.pdf
Stein, L. D. (2008). Towards a cyberinfrastructure for the biological sciences: Progress, visions and challenges. Nature Reviews. Genetics , 9,
677–688. doi:10.1038/nrg2414
Taylor, I., Deelman, E., Gannon, D., & Shields, M. (2006). Workflows for e-science: Scientific workflows for grids. London, UK: Springer-Verlag.
Thessen, A. E., & Patterson, D. J. (2011). Data issues in the life sciences. Zookeys , 150, 15–51. doi:10.3897/zookeys.150.1766
Vardigan, M., & Whiteman, C. (2007). ICPSR meets OAIS: Applying the OAIS reference model to the social science archive context. Archival
Science , 7(1), 73–87. doi:10.1007/s10502-006-9037-z
Vision, T. J. (2010). Open data and the social contract of scientific publishing. Bioscience , 60(5), 330. doi:10.1525/bio.2010.60.5.2
Wang, H. (2010). Privacy-preserving data sharing in cloud computing. Journal of Computer Science and Technology , 25(3), 401–414.
doi:10.1007/s11390-010-9333-1
ADDITIONAL READING
Kowalczyk, S., & Shankar, K. (2011). Data sharing in the sciences. Annual Review of Information Science & Technology, 45, 247–294.
doi:10.1002/aris.2011.1440450113
KEY TERMS AND DEFINITIONS
Cloud Computing: A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service
provider interaction.
Digital Object Identifier (DOI): A unique persistent identifier for a published digital object, such as an article or a study.
Grid Computing: A form of networking which, unlike conventional networks that focus on communication among devices, harnesses unused
processing cycles of all computers in a network for solving problems too intensive for any standalone machine.
Metadata: Data about data, i.e., data about the content, quality, condition, and other characteristics of data.
Repository: A place where electronic data, databases, or digital files have been deposited, usually with the intention of enabling their access and
dissemination over a network.
Virtual Research Environment (VRE): A platform dedicated to support collaboration whether in the management of a research activity, the
discovery, analysis and curation of data or information, or in the communication and dissemination of research outputs.
ENDNOTES
3 http://www.nsf.gov/pubs/2007/nsf07601/nsf07601.htm
4 http://www.dataone.org/
5 http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
6 http://muse.jhu.edu/
7 http://www.jstor.org/
8 http://www.artstor.org/
9 http://www.diggingintodata.org/
10 http://archaeologydataservice.ac.uk/
11 http://www.ademnes.de/
12 http://www.ropercenter.uconn.edu/
13 http://www.irss.unc.edu/odum/
14 http://www.murray.harvard.edu/
15 http://hmdc.harvard.edu/
16 http://www.icpsr.umich.edu/
17 http://ecoliwiki.net
18 http://dnasubway.iplantcollaborative.org/
19 http://www.ncbi.nlm.nih.gov/genbank/
20 http://www.rcsb.org/pdb/home/home.do
21 http://ndar.nih.gov/
23 http://arxiv.org/
24 http://www.ncdc.noaa.gov/
25 http://www.geongrid.org/
26 http://datadryad.org
29 http://www.icpsr.umich.edu/
32 NOAA also operates two other data centers: National Climatic Data Center (NCDC) at http://www.ncdc.noaa.gov and National Geophysical
Data Center (NGDC) at http://www.ngdc.noaa.gov/.
33 http://www.dspace.org
34 http://www.fedora-commons.org
CHAPTER 75
Bringing the Arts as Data to Visualize How Knowledge Works
Lihua Xu
University of Central Florida, USA
Read Diket
William Carey University, USA
Thomas Brewer
University of Central Florida, USA
ABSTRACT
Professional audiences, scholars, and researchers bring varied experiences and expertise to the acquisition of new understandings and to problem
solving in visual art and literary contexts. The same breadth of experience and learning capability was found for students at eighth grade,
sampled from the national population of students in the United States who were queried in the National Assessment of Educational Progress
(NAEP) about formal knowledge, technical skills, and abstract reasoning in visual art and in language arts. This chapter explores statistical data
relating to the presence of art specialists in the sampled eighth grade classrooms. In particular, schools with specialists in place varied in density
across the country as is demonstrated through geographic mapping. Secondary analysis of NAEP restricted data showed that students in schools
with art specialists performed significantly better than students in schools with other types of teachers, or no teacher. The authors surmise that
art specialists conveyed something fundamental to NAEP 2008 Response scores. An aspirational model of assessment assumes broad audience
clarity through knowledge visualization technology, via thematic mapping. The authors explore, by analogy, Deleuze and Guattari’s double articulation of signs in natural and programming languages and demonstrate through knowledge representation the means by which complex
primary and secondary statistical data can be understood in a discipline and articulated across disciplines. This chapter considers NAEP data that
might substantiate a general model of aspirational learning and associates patterns in perception discussed by researchers and philosophers.
INTRODUCTION
In 2008, the National Assessment of Educational Progress (NAEP) in visual art captured the very thought processes by which some students
successfully approached visual analysis and interpretation of meaning. Though the process was verified through statistical analysis of restricted
data from NAEP, vetted by the Institute of Educational Sciences (IES), and released for public scrutiny through peer reviewed channels at the
American Educational Research Association (AERA) annual meetings and through the National Art Education Association (NAEA), it was
apparent to us that a disconnect existed between end users' understanding of the statistical path, the descriptive information about the sample,
and the avowed achievement aims of stakeholders in the arts. Our team of Diket, Brewer, and Xu stayed with the problem of audience
understanding, employing various visual aids and interactive discussions to draw in the audience at conferences and published widely in the
field. A breakthrough was accomplished in communication to broader audiences in 2014 when the team began using data visualization
techniques to explain NAEP statistical path analyses.
Assessment concerns an individual learner’s or a population’s ability to grapple with field specific and domain-centered batteries of tests which
are intended to serve as indicators of achievement levels in knowledge and skill, especially stemming from school-based education or formal
education, and from life experiences. At a theoretical level, fields of endeavor have the potential to inform one another by providing cases that
contextualize and express complex information. Our work with NAEP data, presented through knowledge visualization techniques, falls at the
current edges of secondary analysis. We are seeking common elements in learning that cross subject areas and transcend school-based
achievement, while drawing upon specific cases from populations known to be capable of grappling with formal thinking. This chapter considers
NAEP data that might substantiate a general model of aspirational learning, and associates patterns in perception discussed by researchers and
philosophers.
BACKGROUND
Elizabeth Kowalchuk (1996) considers the essentials of substantive thinking and dispositions for lifelong learning in art in her NAEA Translations publication “Promoting higher order teaching and understanding in art education”, where she cites a thorough, well-thought-out list of references from the assessment development era. As Kowalchuk explains in her treatment of higher order thinking in art: to develop
higher order understanding students must gather, process and apply information in settings that may be academic or personal. National
achievement tests provide a “dipstick” for substantive thinking and dispositions for lifelong learning. The 2008 NAEP continued most of the test
blocks designed for and used in the 1997 NAEP Arts assessment, exercises which were generated from national standards developed in the 1990s.
NAEP subject area examinations have additionally queried student knowledge, aptitudes, and attitudes that include an array of facts, principles,
and discipline concepts.
Engagement, continues Kowalchuk (1996), is essential in higher order processing, and generative topics in a content area likely connect to
students’ lives. Generative topics are pivotal, accessible, and connectable to other knowledge students bring to new learning. Kowalchuk
anticipates the current national focus on a common core of expectation that links goals for education to generative topics. She concludes by
asserting that to understand achievement/performance, we must discover what students do and how they understand their process. She
maintains that a reliable teacher strategy for finding generative topics and developing goals necessitates asking what students ought to internalize
from instruction—specifying the fundamental issues, methodologies, and ideas of a content area.
While some may proffer instrumental claims that art education transfers to general literacy, Ellen Winner and Lois Hetland (2001) in their
NAEA Translations publication argue the need for a theory driven experiment in which the arts serve as motivational points of entry to learning,
especially beneficial to at risk students. As the main thrust of the monograph, Winner and Hetland compute “effect size” between variables in 10
meta-analyses that addressed related research areas. While two causal links were found for music that relate to spatial reasoning, and one found
for classroom drama and a variety of verbal areas, no reliable causal link was found for visual arts to reading achievement. Based on 4 reports, a
medium size relationship was found between integrated visual arts/reading instructions and reading outcomes, but the results could not be
generalized in replications. Instead, Winner and Hetland maintain, art education needs to justify and establish scientifically the valuable benefits
within the discipline of art. With study and attention to how thinking in art works as a “frame” for cognitive achievement and general
understanding of artistic structure, principles of investigation and societal values (as in Multiple Intelligences proposed in the 1980s by Howard
Gardner), art could contribute importantly to current school curriculum goals, and do so through images that investigate culture and societal
values. Our team is exploring the interdisciplinary properties of language forms by extending our statistical analyses work into NAEP reading
blocks. From the NAEP “cases” that are translated here into graphic arrays, we hope to establish data as a form of knowledge that can inform the
educational domain and use knowledge visualization as a means of transmitting the form and substance of active learning.
PRESENTING FINDINGS FROM NAEP DATA EXPLORER, REPORTS, AND SECONDARY USE DATA
We began using the 2008 NAEP visual arts data when data became available through public access software Data Explorer online through the
National Center for Education Statistics website. Tom Brewer led an art education graduate class through the use of public access software and
his student, Kathy Arndt (2012), found a statistical significance on achievement for “full-time art specialists” over part-time art specialists,
teachers of other subjects and grades, and schools offering no program in the arts. This statistic lay hidden in the data and, when presented using statistical notation, did not convey to general audiences the importance of subject area credentials in the classroom and the likely
contribution of full-time teachers of art to school infrastructure and cognitive culture (Diket, Xu, Brewer & Sutters, 2015).
NAEP secondary analysts use restricted data coded from survey information reported directly from schools and by individual students in the
sample. The data can be explored descriptively and used for explanation of variance. Along with statistical investigation, secondary analysts can
organize findings for specific and general audiences using geographic visualization. The content of the question blocks, item maps, and the
framework for NAEP Arts used in 1997 and 2008 is available through the internet at http://nces.ed.gov/nationsreportcard/ to inform thematic
mapping of the Mother/Child (M/C) test block, and to enable visual comparisons to the NAEP Reading problem Ellis Island from 2005.
Lihua Xu verified Arndt’s finding for specialists’ efficacy in student achievement using NAEP 2008 Data Explorer (see Table 1). Using restricted
data, Xu isolated teacher credentials for the schools sampled, recoded the schools for type of instructor, and developed maps for the distribution
and density of the sample by states. In Xu’s geographic representation, density gives the viewer some cues for states likely to have larger numbers of Hispanic students, who were oversampled in 2008. NAEP corrects for oversampling through statistical means, but consumers of the data still want to know what areas of the country were sampled and how deeply. Most specifically, arts advocates want to
know how important a teacher’s professional preparation is to achievement. A visual allows viewers to form a sense of the expertise of teachers
associated with achievement data in NAEP 2008 pilot level examination of visual art. Though a several-thousand-student sample (constituting a
pilot level) is large for visual art data sets, reading’s full-scale assessments embrace huge sample sizes and thousands of schools instead of
hundreds. The data represented through maps generated in the Google Fusion Tables program is descriptive, yet the numbers substantiate the origin
(and sampling logic) of the positive finding for art specialist as instructors of greatest impact (see Table 1). The comparisons were conducted on
NAEP Data Explorer between visual arts taught and not taught by full-time specialist at the levels of national, national public schools, and
national private schools. Independent t-tests were conducted in the primary analysis to generate the significance results.
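For readers who want to reproduce this style of comparison on extracted group scores, the sketch below runs an independent-samples (Welch's) t-test with SciPy on hypothetical score arrays; NAEP's own significance testing accounts for its complex sampling design, so this is a simplified illustration rather than the procedure used in the primary analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical responding scale scores: schools where visual arts was taught by
# a full-time specialist versus all other instructional arrangements.
specialist = rng.normal(loc=155, scale=30, size=400)
non_specialist = rng.normal(loc=148, scale=30, size=400)

# Welch's t-test (does not assume equal variances across the two groups).
t_stat, p_value = stats.ttest_ind(specialist, non_specialist, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```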
Table 1. Lihua Xu, Read Diket, & Thomas Brewer, Difference in average scale scores of visual arts taught and not taught by full-time
specialist. (© 2015, L. Xu, R. Diket, & T. Brewer. Used with permission).
Data Visualization Application
Data Visualization, as applied here, associates geographic maps to explore sample clustering and works with tables to provide insight into the
distribution of effects. The maps convey general spatial information and support comparisons if maps are read as overlays
(http://civicinfographics.ahref.eu/en/ahref-log/civic-infographics). Maps emphasize spatial relationships and density/scale and are an example
of common language aesthetics; thus, maps embrace the common properties of visual perception. The visual image becomes the medium for
conveying descriptive information. Awareness of an intended audience assumes equal importance in preparing data as image. Through thematic
mapping, a reader is expected to engage with the presentation of a path for learning and to discern the message of its increments. The specific
names given to data visualization types vary across disciplines; here we use geographic and thematic as terms with general utility to distinguish type.
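Because the Google Fusion Tables service used for the original maps has since been retired, an analogous state-level map can be sketched with other tools. The snippet below uses plotly (an assumption, not the authors' workflow), with placeholder counts rather than the actual NAEP sample values.

```python
import pandas as pd
import plotly.express as px

# Placeholder counts of full-time 8th-grade visual arts specialists per sampled
# state (two-letter postal codes); substitute the recoded NAEP counts here.
df = pd.DataFrame({
    "state": ["CA", "TX", "NY", "GA", "FL", "OH"],
    "full_time_specialists": [18, 16, 12, 11, 10, 4],
})

fig = px.choropleth(
    df,
    locations="state",
    locationmode="USA-states",
    color="full_time_specialists",
    scope="usa",
    title="Full-time visual arts specialists in sample (illustrative counts)",
)
fig.show()
```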
Following are three U.S. maps. A marker is associated with each state participating in the 2008 NAEP visual arts assessment. Those states that
were not part of the NAEP study in 2008 do not contain markers.
The first map (Figure 1) presents the distribution of full-time specialists serving as 8th-grade visual arts teachers in schools participating in the
2008 NAEP. According to the data, most of the states were sampled at 5 or fewer full-time specialists. The number of full-time specialists in New
York, Georgia and Florida teaching 8th-grade visual arts in 2008 ranged from 10-15. California and Texas had the greatest number (15-20) of
full-time specialist teachers included in the sample.
The second map (Figure 2) presents the distribution of other types of teachers of visual arts by state in the 2008 NAEP data. This collapsed category includes part-time specialists, elementary teachers, artists-in-residence, volunteers, etc. The number of other types of teachers in Utah, Arizona, Michigan, Illinois, Ohio, and New York ranged from 3-6. California had 6-9 part-time visual art teachers in the sample in this
category.
Figure 2. Lihua Xu, Read Diket, & Thomas Brewer. Other
teachers in sample, locations and density in 2008 NAEP Arts
(© 2015, L. Xu, R. Diket, & T. Brewer. Used with permission).
The third map (Figure 3) presents the distribution by state of schools where no teachers are teaching 8th grade visual arts. Among the schools that
participated in the 2008 Visual Arts NAEP assessment, 3-6 schools in California and Alabama reported no visual arts instruction. Florida topped
the school sample in this category. Between 6 and 9 schools sampled in Florida did not have teachers teaching 8th-grade visual arts in 2008.
Figure 3. Lihua Xu, Read Diket, & Thomas Brewer. No teacher
for arts, school locations and density in 2008 NAEP Arts (©
2015, L. Xu, R. Diket, & T. Brewer. Used with permission).
Our other finding, about the thematic path by which students undertake a question block and what they knew and were able to answer successfully in 2008, uses thematic mapping. Diket, Xu, and Brewer (2014) confirmed an achievement path in the 2008 visual arts data using LISREL
some two years ago, but found that general audiences experienced difficulty with understanding the derivation of concepts involved in discerning
the path. When data was computed according to the perceived design, the statistical path suggested that the problem of stylistic determination, as a cultural manifestation or as markers that could be revealed by critical analysis, was not intelligible to most students taking the NAEP Arts
Assessment in 2008. General audiences of educators and qualitative researchers ex post facto evidenced confusion about what the statistical
“path” represented as an example, and how the block could represent achievement goals. Thus, it appeared unlikely that the released questions
would help educators in fostering achievement in visual art.
Sweeny (2011) sought to involve students in the construction of meaning and to use network analysis to encourage conversation. Sweeny explains
his image construction that uses scale, links, and clustering to impose order on complex systems. He uses network theory and network
construction to look at relationships between art works. He, too, begins with a case. He assumed a clear distinction between visual art and visual
culture. In his article, Sweeny poses the possibility that patterns of behavior, presented through new forms of communication, and exploration of
interrelationships may shift networks (with aspects of “materiality, interaction, and power”) from physical, to social, to digital forms in a
rhizomatic manner to inform “art educational practice” (Sweeny, 2011, p. 222). He is particularly interested in the conditions through which
relationships occur in practice.
Brief History and Explication of the LISREL Path Analyses
In anticipation of presenting a paper session during the AERA annual meeting in 2012, the three authors for this paper convened in late July of
2011 to verify the feasibility of path analysis for exercise blocks from NAEP 2008. Diket and Brewer charted the likely organization of the items
and posited constructs, and Xu recoded variables in Statistical Package for the Social Sciences (SPSS) and wrote syntax for the structural
equation modeling analysis, using Linear Structural Relations (LISREL). In recoding the variables, unacceptable constructed response was coded
as 0, partial response as 1, acceptable response as 2, and failure to answer (off task/illegible/non-ratable/omitted/not reached) was coded as
missing. The Mother/Child exercise was foremost in the achievement booklets (Figure 4) and, thus, contributed heavily to achievement. We
specified a linear impact from Art Knowledge to Technical Properties to Aesthetics to Meaning, with an additional linear path from Art
Knowledge to Meaning. Diagonally weighted least squares estimation method and asymptotic covariance matrix were used in the analysis of
polychoric correlation matrix (a technique for estimating the correlation between two normally distributed continuous, latent variables, from two
observed ordinal variables). The first run was statistically significant (i.e., the model chi-square was significant), which is not desirable in model testing, possibly due to the large sample (N = 1,648). More importantly, the model was defensible theoretically and conformed to known practices in the field. However, the expected path
from Art Knowledge to Meaning was negative and not statistically significant. The second model dropped the Art Knowledge to Meaning path,
and was again significant. But an unexpected change occurred in the items defining the constructs for Aesthetic Properties and Meaning that are
noted in the NAEP item map as higher order/more difficult questions. The path between Aesthetic Properties and Meaning was reduced in
impact due to a lack of student understanding for a general connection between items that queried the underlying aesthetic-system and informed
meaning making. For optimal performance, students needed to expect that their answers to items would ultimately provide insights that would
help in completing the question block. In comparing the two statistical paths, the investigative team reasoned that students must sense the
limitation of their general art knowledge when answering questions about specific referents/artworks. Once sensed, successful students address
the technical issues in the problem set, proceed on an investigative path using the images provided, though very few could go on to grasp
historical thinking (using artwork as a primary source, perspective-taking, chronological awareness, and conceptual evaluation) made possible by
explicit study of features of artworks.
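The recoding step described above can be expressed compactly in a scripting language. The sketch below uses pandas rather than the SPSS syntax the team actually wrote, and the raw rating labels are hypothetical (the NAEP codebook values differ); it simply maps constructed-response ratings to the 0/1/2 scheme and treats non-ratable responses as missing.

```python
import pandas as pd

# Hypothetical raw ratings for one constructed-response item; the actual NAEP
# codebook values differ, so treat this mapping as illustrative only.
raw = pd.Series(["unacceptable", "partial", "acceptable", "omitted", "illegible"])

RECODE = {"unacceptable": 0, "partial": 1, "acceptable": 2}  # scored levels
recoded = raw.map(RECODE).astype("float")  # unmapped labels become NaN (missing)

print(recoded.tolist())  # [0.0, 1.0, 2.0, nan, nan]
```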
Figure 4. Lihua Xu, Read Diket, & Thomas Brewer. Path
diagram for mother/child questions associated to major
constructs in 2008 NAEP Arts (© 2015, L. Xu, R. Diket, & T.
Brewer. Used with permission).
We took extensive time to determine what level of response each Mother/Child question elicited from students (see Table 2). Three questions in
the M/C block (labeled VD00002, VD00003, VD00004, VD00011 in the data set) were used to define Art Knowledge. Three questions (VD000A5, VD000A6, BD000B6) defined Technical Knowledge; four questions (VD000D5, VD00007, VD00009, VD00010) defined Aesthetic Properties; and two questions bracketing the beginning and ending of the block (VD00001, VD00008) were associated
with Meaning. The last two digits of those question numbers indicate the order of presence in the block, with the smallest number occurring first.
In general, the questions were ordered in a somewhat linear fashion as the constructs they represented could, by design, take students logically
through the problem set.
Table 2. Lihua Xu, Read Diket, & Thomas Brewer. List of coded mother/child questions by construct (Refer to Figure 4) (© 2015, L. Xu, R.
Diket, & T. Brewer. Used with permission).
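To make the model specification concrete, here is a minimal sketch in semopy, a Python SEM package standing in for the SPSS/LISREL workflow the chapter describes. It uses the item codes listed above; the data frame of recoded item scores is assumed to exist, and matching the original estimation details (diagonally weighted least squares on a polychoric correlation matrix) would require additional options not shown here.

```python
import pandas as pd
from semopy import Model

# First model tested: Art Knowledge -> Technical Properties -> Aesthetics ->
# Meaning, plus a direct path from Art Knowledge to Meaning (dropped later).
MODEL_DESC = """
ArtKnowledge =~ VD00002 + VD00003 + VD00004 + VD00011
Technical    =~ VD000A5 + VD000A6 + BD000B6
Aesthetics   =~ VD000D5 + VD00007 + VD00009 + VD00010
Meaning      =~ VD00001 + VD00008

Technical  ~ ArtKnowledge
Aesthetics ~ Technical
Meaning    ~ Aesthetics + ArtKnowledge
"""

def fit_path_model(items: pd.DataFrame) -> pd.DataFrame:
    """Fit the path model to a data frame of recoded item scores (0/1/2)."""
    model = Model(MODEL_DESC)
    model.fit(items)        # default estimator; the chapter's analysis used DWLS
    return model.inspect()  # parameter estimates, standard errors, p-values
```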
A second model was tested with the path from art knowledge to meaning deleted due to its statistical insignificance in the first model. The
strengths of path coefficients in the second model changed and the model appears more stable judging from the standard error and significance
level associated with each path. Increased art knowledge still predicted increased technical knowledge (unstandardized coefficient = .95, p <
.001). Students’ increased technical knowledge significantly predicted increased knowledge of aesthetic properties (unstandardized coefficient =
1.03, p < .001), which in turn predicted increased ability to make meaning sense from artists’ works (unstandardized coefficient = .97, p < .001).
The significance of the intervening variables in the second model was evaluated using tests of indirect effects. Students’ technical knowledge
served as an intervening variable between art knowledge and knowledge of aesthetic properties (unstandardized coefficient = .98, p < .001).
Students’ aesthetic properties served as a significant intervening variable between technical knowledge and students’ meaning making from art
works (unstandardized coefficient = 1.00, p < .001).
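Assuming the standard product-of-coefficients definition of an indirect effect, these estimates are consistent with the direct paths reported above: .95 × 1.03 ≈ .98 for the effect of art knowledge on aesthetic properties through technical knowledge, and 1.03 × .97 ≈ 1.00 for the effect of technical knowledge on meaning making through aesthetic properties.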
We still needed to relate the data and organize a means for multiple comparisons. The problem now was devising a static presentation that could
effectively convey conceptual differences in the aspirational (or difficulty) level among blocks. We knew that NAEP constructed response
question blocks could be “solved” as paths; and, we called the path “aspirational” because constructed response NAEP problem blocks require
dedicated attention to accruing a working body of knowledge, recording growing understanding through detailed critical analysis, comparing
back to aesthetic expectations of a field, with an aim of establishing a clear meaning for the exercise by the final answer.
Diket formulated the graphic representation for the Aspirational Model found in the art data. She took the idea from a knowledge representation
in Time that appeared February 17 (Grossman, 2014, p. 32-33). Here, the X-Axis represents the four categories of the Aspirational Model:
(engagement/theme cue to social values revealed by stylistic elements in an art work) art knowledge, technical knowledge, aesthetic principles,
and meaning making. The X-Axis acts as a landscape of cognitive and affective features. The Y-Axis contains the item map scale, representing
probable achievement values associated with answering questions that discriminated most in the NAEP 2008 M/C exercise block. The swath of color (and the red data points) indicates the mapping of item order to the achievement scale, which increases in complexity as the problem builds within the block.
Diket incorporated photos of the paintings and African sculpture (objects referred to in the items) and provided abbreviated text for the
individual assessment items. This block was previously released by NAEP as an exercise; the NAEP 2008 visual art report and its item map in
particular connected those released items to an estimated achievement value. Thus, viewers can discern the progression in question difficulty and
scaffolding of items, and use the example to understand something about art achievement. These properties can be discerned in the tri-colored
swath from question one’s initial engagement with theme to the most complex question concerning the meaning conveyed by stylistic
conventions that concludes the block. Relatively few students answered the final question. The images and questions used in the assessment
block, here placed on the vertical scale taken from the item map, give audiences the information in a recognizable configuration. The explanation for what happens mentally when viewing the visualization of data derives from the uncertainty principle of quantum physics and from quantum annealing, where it is the couplings that link together suppositions: possibilities both for failure in thought and for advancement further
in the problem. The coloration of the swath provides evidence of more advanced thinking that students are accomplishing in the case
of aspirational NAEP problem blocks. The graphic can also be understood as an example of double articulation.
Ursyn (2014) remarks in Perceptions of Knowledge Visualization that the notion of articulation may have disparate meanings and may lack a
common definition. In a general sense, articulation happens “when we assign a form (in words, notes, or algorithms) to an idea, information, or
feeling” (p. 2). In formal language contexts, we find strings of symbols that are defined variably by structural patterns. Deleuze and Guattari
(Bogue, 1989) reason that expression-form distinctions “are not reducible to words, but to a group of statements which arise in the social field
considered as a stratum…[where] content-form is not reducible to a thing, but to a complex state of things as formation of power” (translated
from Thousand Plateaus by Bogue, 1989, p. 130). The discussion of a prison and delinquency serves as the focus of the theoretical discussion. As
analog, we can see the relationship of the Deleuzian/Guattarian example to school curriculum delivery and NAEP achievement blocks. Outlining
their argument, Bogue posits that neither prison nor delinquency functions as an “immanent common cause” rather the immanent cause is “’an
abstract machine’” (p. 131). In our analogy, schools and mediocre achievement do not have an immanent common cause either, but school
viability and achievement can be associated with a design for learning. Improved social functioning, including education, could follow from
designing for formal understanding and sustainable learning across the lifespan.
Path diagrams represent different question formats and aesthetic and ethical strategies that are associated with meaningful higher order thinking
(formal operations), and with disparate, desirable, and aspirational goals for learning at 8th grade and above. Interest in question blocks and path
analyses dates back to 2000, when Diket and Thorpe analyzed the NAEP 1997 Visual Arts path responses for the portrait block, using Analysis of Moment Structures (AMOS) software, and reported significant findings at the AERA annual meeting. Diket and Thorpe found that
8th graders who answered cuing questions and connected the exemplar portraits to their own self-portraits proceeded on a more successful path
through the question block. In other words, those students who could use expressive marks and other indicators in creating a self-portrait
perceived the aesthetic concepts for composition and expressive values driving the task to draw their own portrait. A replication study by a team
with Siegesmund, Diket, and McCulloch (2001) was funded by the National Art Education Foundation. Siegesmund et al., found that students
failing to recognize the exemplars’ expressionist features followed truncated or divergent paths that limited the scoring for their portrait. Several
explanations were posited by Siegesmund and Diket, who reviewed student-made portraits, interviews, video footage, and school curriculum
information: (1) Student did not have sufficient skill or studio understanding to draw self beyond a circle with eyes, mouth, hair, and thus had
nothing to discuss; (2) Student reconstituted the drawing problem to a non-objective format that was not scored as a problem solution; and (3)
Student reworked a problem from the school art class, using symbols surrounding the face instead of employing expressive movements that
showed media to advantage and evidenced consideration of the proportions, features and clothing preferences reflected in their hand held
mirrors provided for the exercise. In the portrait task the aesthetic issue concerned variation in expression as revealed by gesture, color, and
mark making in the two artist self-portraits given as examples. Many students in the replication failed to recognize the dual focus of the task
block as mark maker and as meaning maker.
The Mother/Child block functions better as a theoretical model for good curriculum. The responding block asks 8th graders what they think is a common feature of the five images (the subject). The next stage initiates with a question about the identity of a 20th-century art example that
must be selected from images representing several historical periods. Throughout the block students are comparing and garnering features of
Renaissance stylistic conventions which must be further related to the historical understanding of secular values and religious concerns in the
early 1500s. Students must bring to the problem what they have gleaned in society as external, interactive, visual components; all operations then
must be focused to the task of demonstrating growth in knowledge of the problem. To conclude the block, students must embrace complexity of
image and societal context that resides at the core of artistic literacy.
In all the blocks, strands of questions lead to more difficult thinking operations. It is the upper levels of difficulty that illuminate the theoretical
challenge in teaching visual art. NAEP blocks likely have the capacity to work as asynchronous learning models (after Piaget), and facilitate and
stimulate interaction with the problem block—and encourage students to work through the block in pursuit of new knowledge from which they
can derive meaning. Further, if using the language of Deleuze, we might want to seek as “matter” the learners, the school, the teacher, and the
“substances” through which matters enter via a “diagram of power” (p. 131) which just might be a curriculum. The abstract machine,
achievement, likely lies beyond any diagram and is coextensive with the social field. But the diagram “plays the role of common, non-unifying
immanent cause” (p. 131). The strata of the diagram present the first and second articulation. As a first articulation the data points/measurable
units for each question and the Y-Axis scale for achievement are plotted as a graph, presenting the statistical form and substance of numeric data.
The second articulation appears with the swath that expresses the knowledge-level strata associated with key images of artworks that are
implicated in the correct answer to each question. It is the second articulation, imposed on the vertical and horizontals of the graph, its direction
and color association, which enables novice, emergent, and expert viewers alike to grasp the essentials of the statistical solution. Thus, knowledge
visualization techniques, explained through double articulation theory, show connections to the numeric data extracted from the item response
table and plots successions (as in a path) as the first articulation. The second articulation establishes functional forms that relate to the
assessment block as a micro determination of what students know and can (or might) do in a subject area. Our intent is to “make the unseen
visible” (see Ursyn, 2014, p. 16) and to enable viewers to explore the messages embedded in NAEP assessment.
Explanation of the Problem of Representing Achievement between Subject Area Test Cycles and Academic Subjects
We have a crude analogy at this point, and one example. An early 2014 issue of Educational Researcher suggests that hard science and learning
research attend to similar principles. Each research area can inform the other field’s understanding. Wieman (2014) makes an argument on how
science and learning research might work together that we interpret further using the NAEP blocks. At any item junction, the student taking a
NAEP examination can answer correctly or not, but until the next layer of questioning is observed, it is not possible to determine the subsequent
or, especially, the ultimate achievement of the student. It is a combinatorial optimization problem that the student must solve, using what they
know and logical exploration of specific referents, sequence and timing, and general understanding of phenomena. Our team had an answer with
the statistical path within the Aspirational Model shown earlier in the paper, but we needed the scientific terminology and visualization by
graphic array to see how something (an item response) could potentially be both 0 and 1 at the same time. That happened with the Reading 2005
Ellis Island exercise. As can be seen in Figure 6, question 13 is deemed easy to answer. However, the associated level of achievement lies in the
swath as broaching upon formal thinking. How could the question be easy and hard at the same time? Seventy-four percent of students answered
question 13 correctly, but most of those did not answer the second order thinking section very well. Thus, it was the pattern of correct answering that could be associated with Question 13 that most informed its placement on the scale, rather than the percentage of correct responses.
Figure 6. Lihua Xu, Read Diket, & Thomas Brewer, data
visualization of released items for mother/child block, with y-
axis showing item relationship to overall student achievement
score, and x-axis indicating the building of artistic
understanding through four constructs (© 2015, L. Xu, R.
Diket, & T. Brewer. Used with permission).
The complexity of NAEP administration in schools, task restraints, and fewer blocks in 2008 for the arts prohibit direct comparisons of visual
arts Responding achievement data back to NAEP Arts 1997 when art blocks were first used to test achievement in Responding and Creating.
When subject areas differ, direct comparisons are similarly prohibited with NAEP. Frameworks are also variable for testing years, especially as
the Common Core Curriculum (CCC) operations and STEM (Science, Technology, Engineering, and Mathematics) learning expectations became
visible incentives embedded in NAEP. CCC incentives and STEM were gaining strength in American dialogues during the period of the selected
block example for the 2005 Reading assessment.
Simple statistical comparisons can be made using the NAEP data tool on the Internet at http://www.ed.gov, though our project officers use
NAEP restricted data (under license to our respective universities) and the LISREL program from SSI scientific software. Cognitive
understandings are illuminated by examination of item maps from the 2008 visual arts NAEP and NAEP Reading 2005. Both examples present
levels of proficiency in mental operations, and students can use known approaches to answer the block of questions. The two structural equation
modeling analyses we have conducted thus far provide statistical verification of strong curricular constructs in NAEP and support a general
theory of how these thinking exercises are structured. Now researchers can bind theory to NAEP blocks in other subject areas (notably Reading), and
posit how best practice might look for instruction and assessment of learning in an open system of problem based learning in education.
Open Systems Approach. With the NAEP Arts 2008, investigators have determined the statistical significance of self-reports regarding how hard
the student tried and how important it was to the student to achieve on the assessment. Across DRACE variables (how school or student
categorizes his or her racial identity) the patterns are quite revealing and suggest ways in which curriculum could be differentiated for diverse
students in subtle, affirmative ways—ultimately closing gaps in art achievement. We hypothesize art achievement reflects how art experience is
valued by individual students in their very different lives and cultural settings (Diket, Xu, Brewer, & Davis, 2012).
How is achievement in the visual arts related to other subject areas? A connection can be made between visual arts and reading at the test block
level, using student and school questionnaires: (1) NAEP question blocks at grade 8 use both text and image (direct influence of education—
teachers, curriculum, school culture, cognitive environment, and trend to interdisciplinary learning), reference to the young person for the
reasoning behind their answer (integration and purpose); (2) The student questionnaire explores conative experiences outside of school
(achievement messages at home, competitions attempted, self-respect issues, perception of choice and volition for action). Patricia Hollingsworth
(2003) configures achievement as a continuum of systems, with interrelated parts (similar to Gruber’s evolving systems approach, Gruber &
Davis, 1988); and (3) The school questionnaire conveys the structure of instructional delivery in the classroom. Thus, NAEP captures information
about these overlapping cognitive and affective systems with test blocks, student questionnaires, and school-based questionnaires.
Referencing the content of the assessment. As a matter of protocol both reading and visual arts require constructed responses predicated on
specifics of exemplar “readings” that can be decoded in terms of language, voice, context, and message. The constructed responses can reveal the
ecosystem of achievement as understood within the greater culture. Our team is currently using the regression capabilities of AM data software,
to explore the ecosystem of achievement in reading. Responsive writing is part of an open system of component parts that are taught at school,
embraced by students to various degrees, and supported outside of formal school by the popular domain. Responding to art or literature for
NAEP requires writing answers and justifications, thus readings of text and images can be captured through multiple-choice items and in
constructed responses.
Restricted Data from Reading
Which structures more strongly as a path to achievement, Reading or Visual Art? Figure 5 presents the path diagram from the Ellis Island block in
the NAEP 2005 Reading restricted data, which tested the second measurement and structural model found for visual art. In the reading
model Knowledge (R012703, R012704, R012707, R012712) predicted Technical Analysis/Technicality (R012709, R012714), which in turn predicted Aesthetic Properties (R012702, R012713), which in turn predicted Meaning (R012701, R012705, R012706, R012710, R012711). The
model fit the data well (χ2 = 270.35, df = 62, p < .001, RMSEA = .014, CFI = 1.00, SRMR = .019) and χ2 was significant, which could be due to the
extremely large sample size (N = 18,411).
All the path coefficients were statistically significant at α = .001 level. The three paths were strong with the effect of Knowledge on Technical
Properties being 1.00, the effect of Technical Properties on Aesthetics being 1.02, and the effect of Aesthetics on Meaning being 0.99.
As was the case for the visual arts NAEP data, the model was considered well fit (χ2 = 108.92, df = 62, p < .001, RMSEA = .021, CFI = .99, SRMR
= .033). Figure 5 is the path diagram. All the path coefficients were statistically significant at α = .05 level.
Examination of the paths to achievement. The path coefficients from Knowledge to Technical Properties, from Technical to Aesthetics, from
Aesthetics to Meaning were subjected to comparison between 8th grade visual arts and reading. None of the differences between the two subject
areas were significant. If eighth-grade achievement is not conditioned in the model on prior knowledge as a path to meaning making, then the two
subjects operate similarly as the test taker works to answer questions by constructing knowledge from reading words or images.
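The chapter does not spell out the comparison procedure; one common approach for corresponding coefficients from independent samples is a z-test on their difference, sketched below with hypothetical standard errors.

```python
from math import sqrt
from scipy.stats import norm

def coef_difference_z(b1: float, se1: float, b2: float, se2: float) -> tuple:
    """Two-tailed z-test for the difference between two path coefficients
    estimated in independent samples (e.g., visual arts vs. reading)."""
    z = (b1 - b2) / sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p

# Illustrative values only: the coefficients echo those reported in the text,
# but the standard errors are hypothetical.
print(coef_difference_z(b1=1.00, se1=0.05, b2=0.95, se2=0.04))
```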
Eighth-graders who have not had much exposure to arts education, as might be taught by a specialist in the subject, likely do not have the
essential knowledge and technical skills coming into the problem set (which in Mother/Child dealt with style as dependent on the intellectual
environment and social contexts of societies). The arts do afford, however, through technical analysis (art criticism methodology), a means of
gleaning information about style directly from the images. Without a critical strategy, the interpretation of art as Responding does not happen.
Figure 6 presents the data visualization graph of art achievement information with the Y-Axis providing the scale for Mother/Child question
difficulty, released item content as data points, and the x axis showing the four constructs found previously in the path analysis of M/C. In M/C
the swath indicates the broad path of the released items and, in yellow, the aspirational status of questions not released in the item map and held
back in the released questions and images. Only the top ten percent of 8th graders could answer correctly the last five questions from the M/C
block. The implications of our data visualization, and the location of the concluding question, suggest that the goal lies beyond the contents of the
item as stated, that the question understood at its universal level of meaning could guide students to a new level of insight based in critical
reflection, a technical skill highly valued in visual art.
Figure 7, the data visualization for NAEP 2005 Reading, the Ellis Island block, shows previous knowledge appearing to have a greater influence
on subsequent achievement in meaning making than in the art model. The graphic suggests that reading entry questions help set up the problem
type for technical examination of the text narrative, leading to discerning author aesthetic intent and locating meaning derived from the inclusion
of authentic voices. In contrast, the sign system upon which the Mother/Child Art block depends does its work via a generative system (from the
student’s own experience) as was remarked upon by Kowalchuk (1996). The Mother/Child block uses a system of spatial cues based on images of
paintings and sculpture that can lead to discovering the end point of the question block that places one of the paintings in its historical period. In
contrast, the Ellis Island works forward from historical accounts to general principles, while the Mother/Child works backwards from the
student’s own experience to historical understanding of the relationship of art exemplars to valued ideas and aesthetic goals. Both languages
(written text and visual text), clearly, have a place in curriculum that emphasizes thinking and each one can lead and support the other. However,
when the student can scaffold experientially from personal experience and context familiar beginnings, we see thinking that is less dependent on
prior knowledge.
Figure 7. Lihua Xu, Read Diket, & Thomas Brewer, data
visualization of released items for Ellis Island reading block,
with y-axis showing scale scores, and x-axis indicating the four
constructs (© 2015, L. Xu, R. Diket, & T. Brewer. Used with
permission).
Examination of the shapes of data swaths (as expressive coding) suggests that the aspiration for the Ellis Island block was to capture performance
in the mid-range of difficulty (technical issues), while still providing some evidence for achievement in formal thinking (understanding intent and
finding universal meaning). Our team called this variation a “bullet” model for its trajectory and focus on a procedural target. This interpretation
aligns with literature addressing critical awareness at the middle level (Johnson & Freedman, 2005). Johnson and Freedman note in their preface that middle school learners are developing a sense of who they are and what they believe in as general principles. In the case of NAEP, 8th graders can
progress from the academic understanding associated with 1st order propositions to the empathy required in 2nd order thinking if they have some
background knowledge. To accomplish formal thinking, students must assume a historical perspective that is made personal as a choice to be
recorded in writing and defended. As Johnson and Freedman posit, “these students are attempting to confront adults’ rules” (p. vii). The Ellis
Island block follows the goal that Johnson and Freedman pose so well: the block works by “presenting and exploring ways to engage students’
emotions, criticism, and pleasure in being part of…the ‘literacy club’” (p. viii). As the authors point out, few young teens are unconcerned about
issues of social justice. The Ellis Island block explores issues of power, oppression, identity, and critical consciousness. The primary point made
by Johnson and Freedman is that active questioning is primary to becoming critically conscious. In another book from 2005, Taylor and Nolan's treatment of classroom assessment addresses the need to achieve standards and to use tests and other assessments to do so. Taylor and Nolan (2005) see assessments as formative tools that guide teachers, who must teach how to make sense of events, human history, settlement issues,
and so forth. Thus, NAEP 2005 aligned with reading theory and practice of its time.
The aspirational shape swoops upward toward a specific goal, and is clearer in the Mother/Child exemplar. Through image analysis, 8th graders
can glean knowledge within the problem, provided they have some sense of critical methodology. In art class, especially those taught by
specialists, students are taught to describe, relate, interpret, and evaluate information from images. These skills, while made particular to
analyzing art images, are based in the same theories informing literary criticism. However, the language carrying the meaning differs, and
students’ ultimate ability to decode meaning depends in art on understanding the aesthetics of critical analysis.
Which is stronger as a learning exemplar? For at risk students, visual arts may offer a clearer path; and, it can be argued that for visually oriented
students, an arts frame makes more sense. Knowing about art criticism allows all learners to decode visual information to some extent. This is
why the art specialist mattered in visual arts achievement, over teachers from other backgrounds. Teaching critical methodology, and the
meaning of mark making as expression, and addressing personal and social justice issues in art making provides critical approaches that can help
solve other problems and creative needs of society. However, neither mode exists in isolation. Text and image appear together in reading and in
art NAEP test blocks. When text is primary and the image is symbolic or narrative in content, teachers will be wise to attend to engagement
protocols and social justice themes so that the aims expressed for Reading in 8th grade Advanced achievement in the 2013 Reading first look
report card (see http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2014451) can be placed in focus. Eighth grade students should be able to
make connections across text sources and explain relationships. They should be skilled in evaluating and supporting their responses to an
author’s or an artist’s creative work; and, they need to be able to manage the processing demands of formal thinking.
SOLUTIONS AND RECOMMENDATIONS
Our solution was to develop a theory of aspirational modeling that we supported through structural equation modeling analysis of NAEP 2008 data in visual art. We further tested the aspirational model components with NAEP 2005 Reading data. But statistical language and
data arrays do not communicate well to broad audiences.
Limitations of Statistical Analyses
Structural equation modeling (SEM) was used in this study to test the theoretical model for both the visual arts and reading data. The aspirational model was generated from the Mother/Child block in the NAEP 2008 visual arts data. One limitation of SEM is its data-driven approach: the directions of the cause-effect relationships depicted in the path diagram are tentative and should be subjected to scrutiny. This study, however, combines theory with a statistical analysis approach. The statistical models were underpinned by theory and, when used in data visualization, further develop and strengthen that theory.
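The path model itself is compact, and for readers who want to experiment with a comparable specification, a minimal sketch is given below. It assumes the open-source Python package semopy and uses hypothetical construct names and an invented input file; it is not the authors' actual NAEP analysis, which relied on restricted-use data.

    # Illustrative sketch only: a sequential path model of the aspirational type,
    # specified and estimated with the semopy package. The variable names
    # (art_knowledge, technical, aesthetic, meaning) and the CSV file are
    # hypothetical stand-ins for the NAEP-derived constructs discussed here.
    import pandas as pd
    from semopy import Model

    # lavaan-style description: art knowledge -> technical -> aesthetic -> meaning
    MODEL_DESC = """
    technical ~ art_knowledge
    aesthetic ~ technical
    meaning   ~ aesthetic
    """

    data = pd.read_csv("naep_block_scores.csv")  # hypothetical student-level scores

    model = Model(MODEL_DESC)
    model.fit(data)              # estimate the path coefficients
    print(model.inspect())       # parameter estimates, standard errors, p-values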
Moreover, data visualization that is based in the statistical data, and its expressivity, can transcend the limitations of experience and expertise, broaden audience recognition of essential features, relationships, and aesthetic expectations, and inform meaning making. With data graphics based on statistics, teachers can better understand the fundamental reasoning that guides assessment in art and reading, and employ questioning as a major strategy in art and reading classrooms. Rather than teaching to the NAEP questions and other similar achievement measures, teachers can aspire to initiate their students into how beliefs, procedures, and understandings in art and literature can inform educated people.
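Purely as an illustration (not from the chapter and not NAEP results), a teacher-facing data graphic of this kind can be produced with a few lines of Python using matplotlib; the achievement-level percentages below are invented placeholders.

    # Illustrative only: a simple grouped bar chart of hypothetical (invented)
    # achievement-level percentages, of the kind a teacher-facing data graphic
    # might use. These numbers are placeholders, not actual NAEP findings.
    import matplotlib.pyplot as plt

    levels = ["Below Basic", "Basic", "Proficient", "Advanced"]
    reading_pct = [24, 42, 30, 4]      # hypothetical values
    visual_arts_pct = [30, 40, 26, 4]  # hypothetical values

    x = range(len(levels))
    width = 0.4
    plt.bar([i - width / 2 for i in x], reading_pct, width, label="Reading (hypothetical)")
    plt.bar([i + width / 2 for i in x], visual_arts_pct, width, label="Visual arts (hypothetical)")
    plt.xticks(list(x), levels)
    plt.ylabel("Percentage of 8th graders")
    plt.legend()
    plt.tight_layout()
    plt.show()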
FUTURE RESEARCH DIRECTIONS
As we continued to work, a group formed within the NAEA with project leader Christopher Grodoski to initiate collaborative work with a focus on
visualization of numeric data as a means of representing quantitative research findings in the discipline. Our team members, along with other
members of a consortium of university researchers dedicated to NAEP research, were contacted to submit visualization projects based on our
work with NAEP 2008 and the earlier NAEP 1997 findings.
We will continue to look for possible universal structural models for curriculum and assessment that may have positive aspirational learning applications across varied disciplines. In June 2014 the research team amended the restricted data license to gain access to the 2009 and 2011 Reading, Math, and Science data for further comparison and analysis.
CONCLUSION
This work was previously published in the Handbook of Research on Maximizing Cognitive Learning through Knowledge Visualization edited by Anna Ursyn, pages 515-534, copyright year 2015 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Brewer, T., Xu, L., Diket, R., & Davis, D. (2012, March). Discussing educational policy and implications of efficacy in the arts as school subject
areas. Paper presentation at American Educational Research Association, Vancouver, Canada.
Diket, R., Xu, L., & Brewer, T. (2014). Toward an aspirational learning model gleaned from large-scale assessment. Studies in Art
Education , 56(1), 397–410.
Diket, R., Xu, L., Brewer, T., & Sutters, J. (2015). Making data visible: Using demographic data for research and advocacy. Paper presented to
National Arts Education Association. New Orleans, LA.
Gruber, H. E., & Davis, S. N. (1988). Inching our way up Mount Olympus: The evolving-systems approach to creative thinking. In R. J. Sternberg (Ed.), The nature of creativity (pp. 243-270). New York: Cambridge University Press.
Hollingsworth, P. (2003). The ecosystem of giftedness and creativity. In Ambrose, D., Cohen, L., & Tannenbaum, A. (Eds.), Creative intelligence: Toward theoretic integration (pp. 113–129). Cresskill, NJ: Hampton Press.
Johnson, H., & Freedman, L. (2005). Developing critical awareness at the middle level. Newark, DE: International Reading Association.
Kowalchuk, E. (1996). Promoting higher order teaching and understanding in art education. Translations: From Theory to Practice, 5(1).
Retrieved January 21, 2014, from: http://nces.ed.gov/nationsreportcard/naepdata/
Siegesmund, R., Diket, R., & McCulloch, S. (2001). Revisioning NAEP: Amending a performance assessment for middle school art
students. Studies in Art Education , 43(1), 45–56. doi:10.2307/1320991
Taylor, C. S., & Nolen, S. B. (2005). Classroom assessment: Supporting teaching and learning in real classrooms . Upper Saddle River, NJ:
Pearson.
Ursyn, A. (2014). Perceptions of knowledge visualization: Explaining concepts through meaningful images . IGI Global. doi:10.4018/978-1-4666-
4703-9
Wieman, C. E. (2014). The similarities between research in education and research in the hard sciences. Educational Researcher , 43(1), 12–14.
doi:10.3102/0013189X13520294
Winner, E., & Hetland, L. (2001). The arts and academic improvement: What the evidence shows. Translations: Theory into Practice, 10(1).
ADDITIONAL READING
Brewer, T. (2008). Developing a Bundled Visual Arts Assessment Model. Visual Arts Research , 34(1), 63–74.
Brewer, T. (2011). Lessons learned from the Bundled Visual Arts Assessment. Visual Arts Research , 37(1), 79–95.
doi:10.5406/visuartsrese.37.1.0079
Diket, R. M. (2001). A factor analytic model of eighth-grade art learning: Secondary analysis of NAEP arts data. Studies in Art Education , 43(1),
5–17. doi:10.2307/1320988
Diket, R. M., & Brewer, T. (2011). NAEP and policy: Chasing the tail of the assessment tiger. Arts Education Policy Review , 112(1), 35–47.
doi:10.1080/10632913.2011.518126
Diket, R. M., Burton, D., McCollister, S., & Sabol, F. R. (2000). Taking another look: Secondary analysis of the NAEP Report Card in the visual
arts. Studies in Art Education, 41(3), 202–207. doi:10.2307/1320377
Diket, R. M., Sabol, F. R., & Burton, D. (2001). Implications of the 1997 NAEP visual arts data for policies concerning artistic development in
America’s schools and communities. CFDA Number 84.902B/Award Number 902B000006 . Hattiesburg, MS: William Carey University.
Diket, R. M., Xu, L., & Brewer, T. (2012). Discussing educational policy and implications of efficacy in the arts as school subject areas: NAEP > teacher-made assessment > subject area growth. (AERA Symposium Presentation).
Keiper, S., Sandene, B. A., Persky, H. R., & Kuang, M. (2009). The Nation's report card: Arts 2008 music and visual arts (NCES 2009-488). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.
Persky, H. R., Sandene, B. A., & Askew, J. M. (1999). The NAEP 1997 Arts Report Card (NCES 1999-486). Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.
Portal to NAEP Data Explorer in 2008 Art Assessment. (n.d.). Retrieved January 21, 2014 from
http://nces.ed.gov/nationsreportcard/naepdata/dataset.aspx
Sample Questions for 2008 Visual Arts Test. (n.d.). Retrieved January 21, 2014 from http://nces.ed.gov/nationsreportcard/itmrlsx/search.aspx?
subject=arts
Siegesmund, R., Diket, R., & McColloch, S. (2001). Re-visioning NAEP: Amending a performance assessment for middle school students. Studies
in Art Education , 43(1), 45–56. doi:10.2307/1320991
Stankiewicz, M. A. (1999). Spinning the arts NAEP. Arts Education Policy Review, 100(1), 29–32. doi:10.1080/10632919909600233
The NAEP 1997 Arts Report Card: Eighth-Grade. (n.d.). Retrieved January 21, 2014 from http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=1999486
The NAEP 2008 Arts Report Card findings for Eighth-Grade. (n.d.). Retrieved January 21, 2014 from http://nationsreportcard.gov/arts_2008/
KEY TERMS AND DEFINITIONS
Aspirational Learning Theory: Theory derived from statistical analyses using the NAEP arts data. It moves from art knowledge to technical knowledge to aesthetic knowledge to meaning in a sequential manner. Technical knowledge appears requisite to developing aesthetic understanding and meaning. We called the path "aspirational" because constructed-response NAEP problem blocks require dedicated attention to accruing a working body of knowledge, recording growing understanding through detailed critical analysis, and comparing back to the aesthetic expectations of a field, with the aim of establishing a clear meaning for the exercise by the final answer. The authors reason that an aspirational model generated from the Mother/Child block might be used in curriculum planning and implementation.
Bullet Model: A trajectory and focus on a procedural target. It is a pictorial depiction of the relationship between categories of reading questions in the NAEP reading data and students' performance, color coded according to levels of thinking.
Data Visualization: Associates geographic maps to explore clustering and distribution of effects; and, through thematic maps, reveals specific information about locations, conveys general spatial information, and makes comparisons as maps are internalized visually as overlays. It involves the creation and study of the visual representation of data through tables and charts.
Generative Topics: Are pivotal, accessible, and connectable to other knowledge students bring to the table. They are issues, concepts and ideas
that provide sufficient depth, significance, and variety of perspectives and support students’ development of powerful understandings.
NAEP Data Explorer: NCES web-based system that offers users a selection of variables and provides detailed tabular results from the National Assessment of Educational Progress (NAEP) national and state assessments for a given subject in a particular test administration year (http://nces.ed.gov/nationsreportcard/naepdata/dataset.aspx). Users can use the Data Explorer to compare a state's results to those of the nation, the region, and other participating states.
NAEP Secondary-Use Data Files: Restricted-use data files containing student-level cognitive, demographic, and background data and
school-level data. They are available for use by researchers who have obtained a license from NCES and wish to perform analyses of NAEP data.
CHAPTER 76
The Role of Knowledge Management (KM) in Aged Care Informatics:
Crafting the Knowledge Organization
Margee Hume
University of Southern Queensland, Australia
Craig Hume
Griffith University, Australia
Paul Johnston
Care Systems Pty Ltd, Australia
Jeffrey Soar
University of Southern Queensland, Australia
Jon Whitty
University of Southern Queensland, Australia
ABSTRACT
Aged care is projected to be the fastest-growing sector within the health and community care industries (Reynolds, 2009). Strengthening the
care-giving workforce, compliance, delivery, and technology is not only vital to our social infrastructure and improving the quality of care, but
also has the potential to drive long-term economic growth and contribute to the Gross Domestic Product (GDP). This chapter examines the role
of Knowledge Management (KM) in aged care organizations to assist in the delivery of aged care. With limited research related to KM in aged
care, this chapter advances knowledge and offers a unique view of KM from the perspective of 22 aged care stakeholders. Using in-depth
interviewing, this chapter explores the definition of knowledge in aged care facilities, the importance of knowledge planning, capture, and
diffusion for accreditation purposes, and offers recommendations for the development of sustainable knowledge management practice and
development.
INTRODUCTION
Key to responding to this pressure is increased empowerment and capability of leadership and management within the aged care workforce, and offsetting practices through advanced technological developments and knowledge creation. Aged care is becoming more diverse and complex, advancing from residential care to incorporate community directed care. As a result, "aged care knowledge" is becoming increasingly heterogeneous, which puts more emphasis on the need for better Knowledge Management (KM), including its creation, access, and diffusion, to ensure an appropriate, fit-for-purpose level of care. In other words, one size does not fit all, and while there might be some commonalities there will also be substantial differences.
Health informatics is a field of growing interest, popularity, and research. It deals with the resources, ICT (information and communication technology), and methods required to facilitate the acquisition, storage, retrieval, and use of information in the health sector. Tools include computers, formal medical terminologies, and information and communication systems, with knowledge management systems at the forefront of thought in health (Murray & Carter, 2005). This chapter embraces the important area of knowledge generation and informatics in aged care healthcare. The chapter focuses on informing the development of an analytics-driven operational system and advanced KM hub for aged care management and patient care services. Analytics is focused on communication and decision-making based on meaningful patterns in data gained from a methodological analysis. The chapter introduces the concepts of knowledge management, decision support systems, and big data management in aged care and focuses on the importance of diffusion of knowledge to those in need.
This chapter focuses on the important area of aged care services, a national priority in Australia and for many countries worldwide (Cartwright, Sankaran, & Kelly, 2008), and adopts a case study approach with the Australian aged care sector as the basis of analysis. The Australian aged care system is seen as innovative globally and provides the benchmark for many countries developing reforms and strategies for aged care. Many countries, including Australia, are burdened with an ageing population (Venturato & Drew, 2010). This burden has created the need for policy reform and the introduction of new programs to improve the quality of life of senior citizens (Department of Health and Ageing, 2013). The changing industry needs are driven by a combination of changing demographics, changing care needs, increased funding for community care, and restructuring by service providers to meet government reforms and initiatives. The reform and accreditation process has created the need to exploit new information and knowledge to ensure innovative delivery. This need, and the increased complexity of the information required, encourages the need to be innovative in the management of knowledge (Bailey & Clarke, 2001; Binney, 2000; Blair, 2002; Wiig, 1997). There is no doubt that the sector manages some types of knowledge efficiently, such as patient medical records, funding reporting, and basic accreditation records; however, there is much data available that could enable better work practice but is not being accessed (Venturato & Drew, 2010; Sankaran, Cartwright, Kelly, Shaw, & Soar, 2010).
THE AUSTRALIAN AGED CARE SECTOR
The aged care sector's needs are driven by a combination of demographics, changing care needs, increased funding for community care, and restructuring by service providers to meet government reforms and initiatives. With 84% of community care packages and approximately 60% of residential aged care services provided by not-for-profit (NFP) organizations (Productivity Commission, 2011), it is vital to assist and inform leadership, decision making, and productivity improvements through advanced leadership techniques and decision-making support such as KM (Bailey & Clarke, 2001; Binney, 2000; Blair, 2002). The notion of effective leadership warrants continued investigation, with a focus on skilling governance bodies of NFPs to solve complex business challenges including financial, legal, property, service delivery, ethical, and management issues. Previous work (Jeon, Merlyn, & Chenoweth, 2010; Cartwright, Sankaran, & Kelly, 2008) identified that NFP aged care leaders require improved and supported decision making, and knowledge support for leadership and performance was identified as an essential part of this (Riege, 2005). Knowledge supporting the accreditation process and the ability to meet compliance expectations and standards was the primary focus of this knowledge. While some KM systems and knowledge hubs have been developed in health, they have failed to meet the needs of the sector related to knowledge creation, information management, diffusion, and preparation for the leadership needed in the sector (Pinnington, 2011; Hume & Hume, 2008; Hume, Pope, & Hume, 2012; Hume, Clarke, & Hume, 2012). In particular, existing systems:
• Are built on business cases for for-profit firms and are insufficiently flexible for use in NFP organizations, in particular health and faith-based firms;
• Fail to embrace the aged care sector's diversity, complexities, and requirements of accreditation and quality of care; and
• Do not encompass the emergence of increased reporting requirements, client demand, and the flexibility required to deliver end-to-end aged care and community directed care.
Examining the role of knowledge management and advancing the understanding of the requirements will assist in developing rigorous corporate systems to support the sector.
The Australian Bureau of Statistics (ABS) has projected the proportion of Australians aged 65 years or over to increase from the current 15 per cent to 21 per cent by 2026 and 28 per cent by 2056 (ABS, 2012). This trend towards an ageing population poses social and economic challenges which are being responded to by the Australian Government through its Living Longer Living Better reform package (DOHA: Department of Health and Ageing, 2013). The 2011 Productivity Commission report, Caring for Older Australians, and the Commonwealth Department of Health and Ageing's 2012 Living Longer, Living Better aged care reform package will be implemented between 2013 and 2022. The NFP providers in the aged care sector employ nearly 900,000 staff, with support from 4.6 million volunteers (Productivity Commission, 2010), and face continued funding and regulatory constraints, workforce shortages (Palmer, 2012), increasing frailty of clients, and a rapidly increasing demand for complex professional and health services. These have been heralded as major changes in the aged care sector (Comondore, Devereaux, Zhou, Stone, Busse, Ravindran, Burns, Haines, Stringer, Cook, Walter, Sullivan, Berwanger, Bhandari, Banglawala, Lavis, Petrisor, Schünemann, Walsh, Bhatnagar, & Guyatt, 2009). These changes include an expansion of home care services and the introduction of a dementia supplement to support people with dementia receiving care at home and in residential care. In addition, recent changes to the Aged Care Funding Instrument (ACFI) for residential care aim to embed consumer-directed care principles into mainstream aged care program delivery and ensure the sustainability of facilities in regional, rural, and remote areas. The reforms combine income and assets tests into a means-testing arrangement and introduce a lifetime cap on care fees (AIHW, 2012). This reform package aims to deliver systems that provide older Australians with more choice, control, and easier access to a full range of services, when and where they require them.
With these reforms come advanced management, leadership, and business application requirements for providers to meet the challenges and requirements of the future and comply with government legislation. The management of knowledge in the aged care sector is currently erratic and inconsistent, and firms are suggested to be low on the capability maturity index. KM considers the requirements and current practice of knowledge capture, storage, retrieval, and diffusion (Davenport & Prusak, 2000), and such a system assists current aged care providers in meeting the compliance and reporting needs of reforms and government legislation. A KM system also assists in creating flexibility, catering for these and other imminent reforms and changes. Better management of knowledge and business intelligence will enable aged care providers to improve and streamline patient management and service delivery (Lettieri, Borga, & Savoldelli, 2004). The aged care sector is growing in most OECD countries; cost pressures and challenges of access and affordability are commonly reported. With baby-boomers now entering their late 60s, it is critical to incorporate technology-enabled solutions to facilitate the delivery of cost-effective aged care and address many of the challenges of managing this sector.
AGED CARE AND KM
KM is a foundation practice in the organization and informs what is valued as knowledge and how it is captured and diffused. Embracing knowledge management and decision support mechanisms ensures improved productivity and efficiency (Vestal, 2005). Knowledge management (KM) is increasingly being discussed as a cornerstone of an organisation's ability to compete (Treleaven & Sykes, 2005). Despite the increasing recognition of the benefits of KM, a number of significant implementation challenges exist for consideration by both practitioners and academics, especially in not-for-profits (NFPs) and sectors with low capability maturity for technological innovations and practices such as KM. Identifying the current KM needs and practices to inform the development of KM for future implementation and compliance is vital. This chapter presents and collates current knowledge capture, storage, retrieval, and diffusion practices and builds the business case for the data needed to advance decision-making in the aged care industry. Vichitvanichphong, Kerr, Talaei-Khoei, and Hossein Ghapanchi (2013) have explored and collated the extant literature examining the role of technologies in aged care. Although this work focused primarily on assistive technologies adopted by the elderly, it strongly highlights the limited amount of research conducted on general purpose information technology and online information services in this sector. This supports the need for research into knowledge management in this sector.
Many aged care organisations are being driven to adopt more commercial practices in order to improve their ability to effectively provide end-to-end care and deliver quality patient outcomes. Knowledge Management (KM) is one such "corporate" practice being explored to address the increasingly competitive environment, competitor intelligence, and strategic intelligence (Haggie & Kingston, 2003). Although the concept of knowledge management may be basically understood, researchers and managers are yet to explore and fully understand the complex inter-relationships of big data management, informatics, organizational culture, ICT, internal marketing, employee engagement, and performance management as collective enablers of the capture, co-ordination, diffusion, and renewal of knowledge in the aged care environment. This chapter will present research into the relationship of KM with those enabling elements and will offer a conceptual framework and implementation model to assist in planning and sustaining KM activity from integrated organizational and knowledge worker perspectives in aged care. The chapter will emphasize an enduring, integrated approach to aged care industry KM to drive and sustain the knowledge capture and renewal continuum. The chapter will provide an important contribution on how to do KM in aged care and propose the need for interactive and Web 2.0 enabled knowledge hubs.
Aged care providers require leadership and innovation in Informatics and KM for Ageing and Aged-care to ensure productivity, innovation and
growth and the ability to manage the imminent growth in the sector. Moreover, industry and consumer peak bodies have become more
structured and planned in raising their concerns about aged care services (Reynolds 2009) and supporting and advising aged care providers.
Aged care services are currently provided to more than 1 million people in Australia every year, through residential, community and flexible care
services (Kane, 2003; Jeong & Keatinge, 2004). By 2050, it is expected that the number of Australians needing aged care services will increase to
3.5 million (Productivity Commission, 2011). As Australia's population ages, with a predicted increase in retirees by 2020 and the number of people reaching retirement doubling by 2030 and tripling by 2050, the aged care sector is under increasing pressure to ensure quality aged care. Supporting providers, industry, and consumer peak bodies with knowledge and data management frameworks will assist in moving them successfully into a new era of aged care. Many reports on ageing and the aged (AIHW, 2012) identify the significant shortage in the current workforce trained to care for the needs of our nation's older adults, community care services, and residential places. These places are predicted to grow explosively as the Baby Boomers retire, and residential aged services will need to grow, evolve, and innovate their service provision. Capital investment and allocation of capital is becoming increasingly hard for aged care providers, and any measure which will assist in cost savings, improved service provision, and efficiency, such as KM, is welcomed.
Managing the new advent of big data, for both known and sought knowledge, and also having the ability to search for relevant but unknown sources, will assist with industry transformation to improve productivity and enhance consumer directed care in the aged care services sector (AIHW, 2012). Management of big data, knowledge, analytics, and informatics is suggested to build service efficiency, improve quality of service, and improve business profitability in many sectors, especially knowledge-intensive sectors like aged care and health (Hillmer, Wodchis, Gill, Anderson, & Rochon, 2005; Tsai & Chang, 2005; Vestal, 2005). It is widely believed that the use of information technology can reduce the cost of healthcare while improving quality and delivery (Manyika, Chui, Brown, Bughin, Dobbs, Roxburgh, & Hung Byers, 2011). The main issue with informatics and data management in this sector is the embryonic understanding of the data needed, the ability to access the data, the storage of data, technology efficacy, and how to use and integrate data and knowledge into service delivery to improve service provision and patient outcomes. These issues are reinforced by the case of the Australian aged care sector.
DATA AND METHOD
The purpose of this research is to examine how KM and knowledge are operating in the aged care sector in Australia, what practices are required and preferred, and to provide a validation of the need for KM in an aged care NFP setting. This research reports qualitative findings from 22 in-depth interviews with NFP managers and senior full-time staff. See Table 1 for a breakdown of respondents' roles.
Table 1. Distribution of subjects
CEO 5
Facility manager 5
Operational staff 6
Professional contractor 2
TOTAL 22
A qualitative exploration and orientation study (Patton, 1990) was adopted for this research, using in-depth interviews with each of the aged care employees to reflect the aged care sector. The advantage of using qualitative methods is the ability to extract rich and thick data and use inductive strategies for theory development (Patton, 1990). As recommended by Eisenhardt (1989), the qualitative research sample consisted of more than six (6) case subjects, with sampling continuing until theoretical saturation. The study comprised 22 interview subjects to ensure a consistent and complete set of responses. The sample provided good coverage of all role types, as depicted in Table 1.
Interview subjects ranged from NFP senior managers to operational staff. This reflected the leadership structure in the facilities accessed. These facilities were accessed using a snowballing technique. Interestingly, in many facilities it was operational staff who were leading the facility on a regular basis. For the interviews, all NFP aged care managers, senior and operational, were purposively selected to ensure full assessment of the current behavior and needs of line staff in aged care facilities and also to explore the opinions of managerial-level workers in the aged care sector. The interviews were conducted onsite and took approximately 2 hours to complete. Each interview was recorded and transcribed, with emergent themes highlighted. The scope of questions followed previous knowledge management research to focus the topic areas. Probing questions were used to gather thick and deep descriptions of opinions and attitudes.
Open-ended questions were used and allowed for free-flowing comments from subjects related to knowledge needs and the aged care sector. An open-ended approach provides a forum for data variation from subjects, the collection of in-depth scripts, and ultimate theoretical saturation, as recommended by Glaser and Strauss (1967), Eisenhardt (1989), Eisenhardt and Graebner (2007), and Perry (1998). The questions, based on previous KM research, were used to lead the interview and bring focus to KM and the aged care sector. They focused on the following areas:
• Do aged care NFP employees think KM assists in improving the understanding and management of information and knowledge in an aged
care setting?
• Do aged care NFP organizations currently try to manage the capture, collection, and diffusion of knowledge effectively using a KM system?
• How do you define knowledge? What is data in your sector? What is known, and what is recognized but not known?
A sample of 22 created a usable set of answer scripts enabling rigorous inductive analysis. As the sampling method was non-random, generalizability of the findings to the overall population is restricted, making the findings indicative of the population tested. However, these indicative findings contribute to the development of our understanding of KM in aged care NFPs. A set of transcripts was created verbatim from each of the respondents and was coded and organized using sequential incident analysis. A content analysis was then undertaken, resulting in the findings, with this process consistent with the method outlined by Eisenhardt (1989). These findings identified the emergent themes and behaviors of KM in an aged care context. Further inductive analysis was undertaken based on understanding of the extant literature, and narratives were drawn from the scripts. Each of the question areas led to specific responses, with some leading to advanced discussions of the topic areas.
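As a rough illustration of this kind of coding step (not the study's actual codebook or software), the sketch below counts hits for analyst-defined theme keywords across transcript text; the theme names, keywords, and excerpts are hypothetical.

    # Illustrative sketch: counting occurrences of analyst-defined theme keywords
    # across interview transcripts. Themes and keywords are invented examples.
    from collections import Counter

    THEMES = {
        "leadership": ["leader", "leadership", "manage", "decision"],
        "technology_capability": ["computer", "software", "system", "ict"],
        "knowledge_capture": ["record", "document", "paper", "capture"],
    }

    def code_transcript(text: str) -> Counter:
        """Return how many theme-keyword hits appear in one transcript."""
        words = text.lower().split()
        counts = Counter()
        for theme, keywords in THEMES.items():
            counts[theme] = sum(words.count(k) for k in keywords)
        return counts

    # Usage with two toy transcript excerpts:
    transcripts = [
        "We record everything on paper and the system is hard to use",
        "Our leader makes every decision without the software",
    ]
    total = Counter()
    for t in transcripts:
        total.update(code_transcript(t))
    print(total)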
PROBLEMS IDENTIFIED
The interview responses were coded and analyzed through logical deduction by experts in the field. This analysis identified the following problems.
Problem 1: Lack of leadership and expertise in the management of knowledge: Current leadership in the sector is suggested to have little expertise in how to implement efficient KM strategies. This is compounded by a high turnover of staff, low technical efficacy, and low capability maturity, with these specifically hindering the adoption and implementation of KM strategies. The majority of subjects suggested that understanding the current capabilities and developing training strategies would assist in better knowledge definition, capture, storage, and use, with this offering a real benefit to the sector.
Problem 2: A lack of capability of current KM systems to meet the need for efficient knowledge systems: Current KM/information support systems rely heavily on informal socialization and paper-based systems, with limited consistent document capture and storage, and limited formal engagement of staff. Many ICT systems lack the ability to manage knowledge and intelligence, with very little understanding to date of Cloud technology and other virtual storage mediums. Improving the capability of accessing and storing massive amounts of data in its various forms would aid in providing efficiency and cost reductions for delivery. It was also identified that staff had relatively low technological efficacy, especially with more complex administration technology. Coupled with the informal nature of interactions and paper-based recording, KM was at a very embryonic stage, with much tacit knowledge lost and much repetition in collection.
Problem 3: Limited comprehensive understanding of what knowledge is, what needs to be stored, and who the key players in the knowledge process are: Many subjects knew that they needed knowledge and that it was required for accreditation purposes; however, the specifics of who, what, where, when, and why were overlooked.
Problem 4: The subjects identified some specific types of data that they collected and knew were relevant. These have been identified as MUST data. They were related to patient data and medical practice. The understanding of knowledge related to organizational systems, continuous improvement (an essential component of accreditation), and specific needs for funding was found to be rudimentary. The foundation factors were known, yet more complex and advanced knowledge was unknown and unrecognized. The areas have been clustered based on current standards requirements (ACSAA, http://www.accreditation.org.au/accreditation/accreditationstandards/).
Table 2 depicts the data identified by the subjects as known; with further questioning and discussion, subjects discussed data they were not considering and were not aware of. Interestingly, the most unknown data was management and service data, especially ICT tools and data management factors that may assist in accreditation and continuous improvement capability. There was little patient data that the subjects were not aware of or did not recognize. This supports the need for stronger managerial training and leadership in the sector, with clinical training for registered staff considered satisfactory.
Table 2. Data known and unknown
RECOMMENDATIONS AND CONTRIBUTIONS
Consideration 1: Recognize knowledge and discover knowledge leaders. A key element of this project is that there is currently a wide gap between the potential of managing known information, acquiring information using big data and accessible data, and analyzing that knowledge, and its realization into business service provision and profits in aged care services. One of the hardest decisions is to know what data and information is of value to decision making and should be acquired and stored, and what information should be discarded. Data and information can come from many sources, as identified in Table 2, and can be tacit and/or explicit; as such, the volume of data in the knowledge era is extensive. Figure 1 identifies many of the areas where data can be gathered and proposes that, as KM is better understood, each of these areas should have a specific strategy for the capture and creation of knowledge in aged care.
The problems of knowledge acquisition commence immediately with deciding what data to keep and what to discard. The next challenge is how to store what we keep reliably, with the right metadata or filing system. Metadata refers to a set of data that describes and gives information about other data (www.techterms.com/definition/metadata). Exploiting the vast new flows of big data captured from all paths of the internet can radically improve a company's performance, if managers can locate the information easily and use it to measure and inform problem solving. This forms the basis of what we call analytics. Analytics is the science of logical analysis, including identifying patterns in data that can inform decisions. Firms that capture and store data well know radically more about their businesses and directly translate that knowledge into improved decision making and performance. These firms have strategies and practices, such as KM, to transform big data, and have a decision-making culture constructed on evidence-based data. Clever KM systems have the ability to embrace the velocity and speed at which new data and information enter the market, and there is no question that technology enablement assists in the collection and storage of data and knowledge. Less consideration has been given to technology enablement for diffusion and the push of knowledge. A push knowledge strategy is the solution for unfamiliarity with knowledge, especially known-unknown data and unknown-unknown data. This strategy needs knowledge champions (Jones, Herschel, & Moesel, 2003) and knowledge leaders (Wenger & Snyder, 2000). Knowledge leaders in the sector need to create and disseminate knowledge through hubs and identify the key areas of core business, external legislation and accreditation, industry metrics and practices, consumer metrics like satisfaction and quality, and organizational successes and failures. Knowledge champions disseminate this in the organization and encourage others to engage with knowledge creation. They promote that essential data is available, where to find it, and what to do with it. Currently, in the aged care sector, this is limited.
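As a small illustration of the metadata idea (a sketch only, with hypothetical field names rather than a prescribed aged care schema), each stored document can carry a lightweight metadata record so that it can later be located and reused by tag:

    # Illustrative sketch: a minimal metadata catalog so documents can be found
    # and reused. Field names and example records are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class DocumentMetadata:
        doc_id: str
        title: str
        source: str            # e.g. "care plan", "accreditation audit"
        created: str           # ISO date string
        tags: list = field(default_factory=list)

    catalog = [
        DocumentMetadata("D001", "Falls prevention audit", "accreditation audit",
                         "2014-03-02", ["compliance", "continuous improvement"]),
        DocumentMetadata("D002", "Resident care plan review", "care plan",
                         "2014-03-05", ["patient data"]),
    ]

    def find_by_tag(tag: str):
        """Retrieve all documents whose metadata carries the given tag."""
        return [d for d in catalog if tag in d.tags]

    print([d.doc_id for d in find_by_tag("compliance")])   # -> ['D001']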
Figure 1 is not conclusive for all areas of data required in the sector; however, it introduces conceptually the areas identified by the subjects and shows how capture, creation, and diffusion form a feedback link for improved organizational performance (the data flower). These are in essence the known areas of data, with the unknown needs requiring push dissemination. Knowledge champions work to support technology enablement and the practice of KM in the firm whilst also championing the role of knowledge in lessons learnt (Zuber-Skerritt, 2001) and improving performance.
Consideration 2: The future of knowledge: interactive hubs. A feature of globally competitive knowledge-based economies is that governments, universities, consumers, and industry work together in these economies to create regional 'knowledge hubs'. Recent advances in the role of big data are topical and have proved very effective. Interactive big data capture and diffusion is now moving to real-time and constant sourcing, and knowledge hubs are being developed with real-time collaboration via "subscribed" knowledge groups with constant pull and push. Creating knowledge networks is a solution to support the leadership and decision-making issues in the aged care sector (Venturato & Drew, 2010). Many facilities are managed by clinically trained staff who lack managerial and administrative training, and knowledge support systems would address this deficit (Salipante & Aram, 2003; O'Sullivan & McKimm, 2011). To create world best practice and position the Australian aged care system as the benchmark for global aged care, knowledge sharing and diffusion of knowledge through a collaborative, inclusive aged care knowledge hub is vital. Improving diffusion through social media and interactive Web 2.0 platforms will improve access to known and unknown data.
Currently, knowledge hubs are static and focused. They have been suggested to have three major functions: to generate knowledge; to transfer and apply knowledge; and to transmit knowledge to others in the community through education and training. Knowledge hubs have generated new and basic knowledge of relevance to many industries, both old and new, and the knowledge users provide a focus for knowledge generation, transmission, and diffusion. The early generations of knowledge hubs have captured and stored knowledge systematically, with this knowledge accessed by a pull strategy, i.e., the knowledge must be found and accessed by the user. Ideally, a well-designed hub will link and portal all available knowledge foci. Unfortunately, in many sectors, and particularly in the aged care sector, competing knowledge hubs have been created from different stakeholder perspectives, duplicating knowledge and creating competing knowledge hub "silos". These hubs are often disconnected from those in need, i.e., "the knowledge user", and are difficult to find and navigate.
Exploring the notion of interactivity and Web 2.0 enabled knowledge hubs supporting a diffusion push strategy would radically revolutionize current KM in the aged care sector. The notion of a push strategy would channel the knowledge to those in need through push web strategies of email, social media, webinars, forums, blog discussions, alerts, pokes, and RSS feeds. Specifically, leadership, compliance, and decision-making in the aged care sector would be supported by these next generation knowledge hubs. Irrespective of the existence of current portals and knowledge hubs, problems still exist based on the nature of the knowledge, tacit or explicit. Explicit knowledge is knowledge that has been articulated, codified, and stored in a certain retrieval medium. It can be readily transmitted to others. Conversely, tacit knowledge is knowledge that people are often not aware they possess, nor how it can be valuable to others. Effective transfer of tacit knowledge generally requires extensive personal contact, regular interaction, and trust in its creation.
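A minimal sketch of such a push mechanism is shown below, assuming a simple in-memory subscription registry; the topic names and addresses are hypothetical, and the print call stands in for an email, RSS, or social media channel.

    # Illustrative sketch: users subscribe to knowledge topics and new items are
    # pushed to them, rather than waiting to be found (pull).
    from collections import defaultdict

    subscriptions = defaultdict(set)   # topic -> set of subscriber addresses

    def subscribe(address: str, topic: str) -> None:
        subscriptions[topic].add(address)

    def publish(topic: str, item: str) -> None:
        """Push a new knowledge item to every subscriber of the topic."""
        for address in subscriptions[topic]:
            # In a real hub this would send an email, RSS entry, or social media alert.
            print(f"push to {address}: [{topic}] {item}")

    subscribe("facility.manager@example.org", "accreditation")
    subscribe("care.lead@example.org", "accreditation")
    publish("accreditation", "Updated continuous improvement standard released")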
In the traditional setting, the transmission function of a knowledge hub takes place through educational institutions such as universities and
schools but also through life-long learning processes that involve firms, community based institutions and a variety of government agencies and
services including hospitals, clinics and professional associations. Users must be aware of these channels. Social media is underutilised and
provides a stronger diffusive and viral mechanism for knowledge distribution. The aims are to maximise cooperation, coordination, collaboration,
and conservation. Identifying the factors that influence the knowledge needs of aged care providers, advocates and associations, when
considering and providing care, and the extent to which these preferences are met by current knowledge based services, is essential to capturing
tacit and explicit knowledge.
Consideration 3: A conceptual model for the learning organization in aged care. The model proposed for KM in aged care services informs and innovates an aged care KM implementation framework of knowledge identification, acquisition, capture, retrieval, and diffusion (Nonaka & Takeuchi, 1995) and decision-making, and supports the development of a learning organization for the aged care sector. Figure 2 suggests that adopting a lessons-learnt philosophy will develop change behaviour and organisational rewiring. The importance of this is the involvement of people and the role that people play in generating and creating knowledge. Figure 2 depicts the development of learning organisations and improved performance through informed workforce leadership and efficiency in decision making, organisational knowledge capture, knowledge leaders, knowledge champions, and the application of aged care health informatics (Senge, 1990; Senge, Roberts, Ross, Smith, & Kleiner, 1994). The figure proposes the use of technology-enabled analytics (KM) and infrastructure in aged care services, improving leadership, social interactions, and innovation in processes, availability, and delivery through knowledge and lessons learnt. This framework suggests that improved access to knowledge, analytics, and use of data [for reporting compliance and funding] is required, and that developing a feedback system will manifestly improve decision making, resulting in a learning organisation. The framework also considers aspects of culture, performance management, and education as vital inputs for success. Finally, the recommendations for management include the need to address the four management challenges of KM implementation.
Companies won't reap the full benefits of a transition to using big data, KM, and knowledge hubs unless they're able to manage change effectively and embrace a lessons-learnt philosophy. Four areas are particularly important in that process.
1. Leadership: Companies succeed in the KM and big data era due to leadership teams that set clear goals, define what success looks like, and identify the correct needs. These leaders have vision, recognize opportunity, and understand how the sector as a whole is developing. They also maintain a customer orientation and internal marketing (Martinsons & Hosley, 1993; Ballantyne, 2000); this includes employees and other stakeholders. The successful participants in the aged care sector will be those with the ability to be flexible in changing the way their organizations make many decisions and develop a decision-making culture.
2. Knowledge Leaders, Data Experts, and Champions: As data becomes easily accessible and of low cost, data experts and knowledge workers are in greater demand (Chong, 2005). Experts skilled in working with data are needed if the role of data and analytics is going to be incorporated in the business. The most valuable data experts will be those who can link the needs of the sector, the firm, and the data, and design ways in which data can be used. These will be our knowledge leaders and champions.
3. Information and Communications Technology (ICT): The tools available to handle the volume, velocity, and variety of big data have advanced substantially in recent times (Chong, 2005). There are many inexpensive, open source libraries that can aid the technologically minded in setting up a system (an illustrative sketch follows this list). However, these technologies do require a skill set that is new to most IT departments, which will need to work hard to integrate all the relevant internal and external sources of data. In the aged care sector this skill set is at a lower capability, so sourcing expert staff will be a priority.
4. Evidence Based Decision Making Culture: Effective and performance oriented firms use evidence to make decisions. People who
understand the problems need to be brought together with the right data, but also with the people who have problem solving techniques that
can effectively exploit them.
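To make the evidence-based decision-making point concrete, the short sketch below uses the open-source pandas library to turn raw facility records into a simple indicator a leadership team could track; the column names and figures are invented illustrations, not data from this study.

    # Illustrative sketch: aggregating hypothetical facility records with pandas
    # into an evidence-based indicator (falls per 100 occupied beds).
    import pandas as pd

    records = pd.DataFrame({
        "facility": ["A", "A", "B", "B", "C", "C"],
        "month": ["2014-01", "2014-02"] * 3,
        "falls": [4, 6, 2, 1, 5, 7],
        "occupied_beds": [60, 62, 40, 41, 75, 74],
    })

    # A simple rate a leadership team could monitor instead of relying on anecdote.
    records["falls_per_100_beds"] = 100 * records["falls"] / records["occupied_beds"]
    summary = records.groupby("facility")["falls_per_100_beds"].mean().round(1)
    print(summary)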
CONCLUSION, FUTURE RESEARCH, AND IMPLEMENTATION CONSIDERATIONS
This chapter has introduced the concepts of KM in aged care. The chapter has reported the findings of qualitative work that highlighted the
current problems and the need for KM, KM champions and an easy and effective way to access knowledge. The chapter has offered the following
considerations and recommendations and these are vital for the Australian aged care system to be globally innovative and be the benchmark for
many countries developing reforms and strategies for aged care.
The chapter suggested that the theoretical understanding of knowledge creation needs to be advanced in the ACC sector with particular reference
to knowledge storage, retrieval and diffusion needs for decision-making for accreditation and compliance. This will better manage information
and knowledge and equip leaders with KM support for informed decisions, service innovations, staffing and training for current and future
change and intensifying pressures.
Finally, testing and refining the elements of a dynamic, interactive KM framework and satisfying the knowledge and information needs of the sector are essential for productivity. Developing specific business cases for organisations and testing these in organizational settings using action research interventions and reflection will confirm the need for, and specifications of, each system. For the sector to advance into the future, further empirical work would advance understanding of the knowledge needs and their creation, and of the successful development and adoption of knowledge hubs in the aged care sector. The benefits of building the knowledge hub, including improved quality of care, increased staff satisfaction and reduced staff turnover (for staff), and improved consistency and quality and decreased complaints (for residents), should all be examined to support the business case.
This work was previously published in Healthcare Informatics and Analytics edited by Madjid Tavana, Amir Hossein Ghapanchi, and Amir Talaei-Khoei, pages 284-302, copyright year 2015 by Medical Information Science Reference (an imprint of IGI Global).
REFERENCES
Australian Institute of Health and Welfare (AIHW). (2012).Residential aged care in Australia 2010-11: a statistical overview, Aged care statistics
series no. 36, Cat. No. AGE 68 . Canberra: AIHW.
Bailey, C., & Clarke, M. (2001). Managing knowledge for personal and organizational benefit . Journal of Knowledge Management ,5(1), 58.
doi:10.1108/13673270110384400
Ballantyne, D. (2000). Internal relationship marketing: A strategy for knowledge renewal . International Journal of Bank Marketing ,18(6), 274.
doi:10.1108/02652320010358698
Binney, D. (2000). The knowledge management spectrum-understanding the KM landscape . Journal of Knowledge Management , 5(1), 21–32.
Blair, D. C. (2002). Knowledge Management: Hype, Hope or Help? Journal of the American Society for Information Science and
Technology , 53(12), 1019. doi:10.1002/asi.10113
Cartwright, C., Sankaran, S., & Kelly, J. (2008). Developing a New Leadership Framework for Not-For-Profit Health and Community Care
Organisations in Australia . Lismore, Australia: Southern Cross University.
Chong, C. S. (2005, June). Critical factors in the successful implementation of Knowledge Management . Journal of Knowledge Management ,
21–42.
Comondore, V. R., Devereaux, P. J., Zhou, Q., Stone, S. B., Busse, J. W., & Ravindran, N. C. (2009). Quality of care in for-profit and not-for-profit
nursing homes: systematic review and meta-analysis . BMJ (Clinical Research Ed.) , 339, b2732. doi:10.1136/bmj.b2732
Eisenhardt, K., & Graebner, M. E. (2007). Theory building from cases: Opportunities and Challenges . Academy of Management Journal , 50(1),
25–32. doi:10.5465/AMJ.2007.24160888
Glaser, B., & Strauss, A. (1967). The Discovery of Grounded Theory: Strategies for Qualitative Research . Chicago, IL: Aldine Publishing
Company.
Haggie, K., & Kingston, J. (2003). Choosing your knowledge management strategy. Journal of Knowledge Management Practice, 1 – 20.
Hall, R. (2003). Knowledge Management in the New Business Environment (A report prepared for the Australian Business Foundation) . Sydney:
ACIRRT, University of Sydney.
Hillmer, M. P., Wodchis, W. P., Gill, S. S., Anderson, G. M., & Rochon, P. A. (2005). Nursing home profit status and quality of care: Is there any
evidence of an association? Medical Care Research and Review , 62(2), 139–166. doi:10.1177/1077558704273769
Hume, C., Clarke, P., & Hume, M. (2012). The role of knowledge management in the large non profit firm: building a framework for KM success
. International Journal of Organisational Behaviour ,17(3), 82–104.
Hume, C., & Hume, M. (2008). The strategic role of knowledge management in nonprofit organizations. International Journal of Nonprofit and
Voluntary Sector Marketing , 13(2), 129–140.
Hume, C., Pope, N., & Hume, M. (2012). KM 100: Introductory knowledge management for not-for-profit organizations .International Journal of
Organisational Behaviour , 17(2), 56.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011, May). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
Jeon, Y. H., Merlyn, T., & Chenoweth, L. (2010). Leadership and management in the aged care sector: a narrative synthesis.Australasian Journal
on Ageing , 29(2), 54–60. doi:10.1111/j.1741-6612.2010.00426.x
Jeong, S., & Keatinge, D. (2004). Innovative leadership and management in a nursing home . Journal of Nursing Management, 12, 445–451.
doi:10.1111/j.1365-2834.2004.00451.x
Jones, N. B., Herschel, R. T., & Moesel, D. D. (2003). Using knowledge champions to facilitate knowledge management .Journal of Knowledge
Management , 7(1), 49. doi:10.1108/13673270310463617
Kane, R. (2003). Definition, Measurement, and Correlates of Quality of Life in Nursing Homes: Toward a Reasonable Practice, Research, and
Policy Agenda . The Gerontologist , 43, 28–36. doi:10.1093/geront/43.suppl_2.28
Lettieri, E., Borga, F., & Savoldelli, A. (2004). Knowledge Management in non-profit organizations . Journal of Knowledge Management , 8(6),
16–30. doi:10.1108/13673270410567602
Martinsons, M., & Hosley, S. (1993). Planning a strategic information system for a market-oriented non-profit organization .Journal of Systems
Management , 44(2), 14.
Miles, M. B., & Huberman, A. M. (1994). Qualitative Data Analysis: An Expanded Sourcebook (2nd ed.). Sage Publications.
Murray, P., & Carter, L. (2005). Improving marketing intelligence through learning systems and knowledge communities in not-for-profit
workplaces . Journal of Workplace Learning , 17(7), 421–435. doi:10.1108/13665620510620016
O’Sullivan, H., & McKimm, J. (2011). Doctor as professional and doctor as leader: same attributes, attitudes and values? British Journal of
Hospital Medicine , 72, 463–466. doi:10.12968/hmed.2011.72.8.463
Oliver, S., & Kandadi, K. R. (2006). How to develop knowledge culture in organizations? A multiple case study of large distributed organizations
. Journal of Knowledge Management , 10(4), 6–24. doi:10.1108/13673270610679336
Palmer, E., & Eveline, J. (2012). Sustaining low pay in aged care work. Gender, Work and Organization , 19, 254–275.
Pinnington, A. (2011). Leadership Development: Applying the same leadership theories and development practices to different
contexts? Leadership , 7, 335. doi:10.1177/1742715011407388
Raymond, L. (1985). Organizational characteristics and MIS success in the context of small business. Management Information Systems
Quarterly , 9(1), 37–52.
Reynolds, A. (2009). The Myer Foundation 2020: A Vision for Aged Care in Australia . Fitzroy, Australia: Outcomes Review, Brotherhood of St
Laurence.
Riege, A. (2005). Three dozen knowledge sharing barriers managers must consider . Journal of Knowledge Management ,9(3), 18–35.
doi:10.1108/13673270510602746
Salipante, P., & Aram, J. D. (2003). Managers as knowledgeable generators: The nature of practitioner-scholar research in the non-profit sector
. Nonprofit Management & Leadership , 14(2), 129–150. doi:10.1002/nml.26
Sankaran, S., Cartwright, C., Kelly, J., Shaw, K., & Soar, J. (2010). Leadership of non-profit organisations in the aged care sector in Australia.
In Proceedings of the 54th Meeting of the International Society for the Systems Sciences. Academic Press.
Senge, P. M. (1990). The Fifth Discipline: The Art and Practice of the Learning Organisation. New York: Doubleday.
Senge, P. M., Roberts, C., Ross, R. B., Smith, B. J., & Kleiner, A. (1994). The Fifth Discipline Fieldbook: Strategies and Tools for Building a Learning Organisation. New York: Doubleday.
Treleaven, L., & Sykes, C. (2005). Loss of organizational knowledge: From supporting clients to serving head office .Journal of Organizational
Change Management , 18(4), 353–368. doi:10.1108/09534810510607056
Tsai, C.-T., & Chang, P.-L. (2005). An integration framework of innovation assessment for the knowledge-intensive service
industry. International Journal of Technology Management, 30(1-2), 85-104.
Vasconcelos, J., Seixas, P., Kimble, C., & Lemos, P. (2005). Knowledge management in non-government organisations: A partnership for the
future. In Proceedings of the 7th International Conference on Enterprise Information Systems (ICEIS 2005). ICEIS.
Venturato, L., & Drew, L. (2010). Beyond'doing': Supporting clinical leadership and nursing practice in aged care through innovative models of
care. Contemporary Nurse , 35(2), 157–170. doi:10.5172/conu.2010.35.2.157
Wenger, E. C., & Snyder, W. M. (2000). Communities of practice: The organizational frontier . Harvard Business Review , 78(1), 139–145.
Wiig, K. M. (1997). Integrating intellectual capital and knowledge management. Long Range Planning , 30(3), 399–405. doi:10.1016/S0024-
6301(97)90256-9
Zuber-Skerritt, O. (2001). Action Learning and action research: paradigm, praxis and programs. Effective change management using action
research and action learning: Concepts, frameworks, process and applications . Lismore, Australia: Southern Cross University Press.
KEY TERMS AND DEFINITIONS
Aged Care: Elderly care, or simply eldercare, is the fulfilment of the special needs and requirements that are unique to the elderly. This broad term encompasses such services as assisted living, adult day care, long term care, nursing homes, hospice care, and home care.
Data Marts: Represent specific database systems on a much smaller scale representing a structured, searchable database system, which is
organised according to the user’s needs.
Data Repository: A database used primarily as an information storage facility, with minimal analysis or querying functionality.
Data Warehouses: The main component of KM infrastructure, storing data in a number of databases. A data warehouse organizes data and provides meaningful knowledge to the business, which can be accessed for future reference.
Health Informatics: A discipline at the intersection of information science, computer science, and health care. It deals with the resources,
devices, and methods required to optimize the acquisition, storage, retrieval, and use of information in health and biomedicine.
Knowledge Capture: A variety of techniques used to elicit facets of an individual's technical knowledge, such as insights, experiences, and social networks.
Knowledge Storage: Includes data warehouses, data marts, content management systems and data repositories.
APPENDIX
List of Abbreviations
KM: Knowledge Management.
NFP: Not-for-profit organization.
Mohamed Riduan Abid
Al Akhawayn University, Morocco
ABSTRACT
The ongoing pervasiveness of Internet access is intensively increasing Big Data production. This, in turn, increases the demand for compute power to process this massive data, rendering High Performance Computing (HPC) a highly solicited service. Based on the paradigm of
providing computing as a utility, the Cloud is offering user-friendly infrastructures for processing Big Data, e.g., High Performance Computing as
a Service (HPCaaS). Still, HPCaaS performance is tightly coupled with the underlying virtualization technique since the latter is responsible for
the creation of virtual machines that carry out data processing jobs. In this paper, the authors evaluate the impact of virtualization on HPCaaS.
They track HPC performance under different Cloud virtualization platforms, namely KVM and VMware-ESXi, and compare it against physical
clusters. Each tested cluster provided different performance trends. Yet, the overall analysis of the findings proved that the selection of
virtualization technology can lead to significant improvements when handling HPCaaS.
INTRODUCTION
Big Data and Cloud computing are emerging as promising IT fields that are substantially changing the way humans deal with data. During the last decade, data generation grew exponentially. IBM estimated the data generation rate at 2.5 quintillion bytes per day, and that 90% of the data in the world today has been generated during the last two years (Manish et al., 2013).
The latest advances in Internet access (e.g. WiFi, WiMax, Bluetooth, 3G, and 4G) have substantially contributed to the massive generation of Big
Data. Besides, the quick proliferation of the WSNs (Wireless Sensors Networks) technology did further boost the data capture levels.
Indeed, as Big Data grows in terms of volume, velocity and value, the current technologies for storing, processing and analyzing data have become inefficient and insufficient. A Gartner survey (2013) identified data growth as the largest challenge for organizations. Given this, HPC has started to be widely integrated into the processing of Big Data problems that require high computation capabilities, high
bandwidth, and low latency network (Chee et al., 2005). HPC, by itself, has been integrated with new and evolving technologies, including Cloud
computing platforms (e.g. OpenStack (The OpenStack Cloud Software)) and distributed and parallel systems (e.g. MapReduce and Hadoop).
Merging HPC with these new technologies has led to a new HPC model, named HPC as a Service (HPCaaS). The latter is an emerging computing model in which end users have on-demand access to pre-existing technologies that provide a high-performance and scalable HPC computing environment (Ye et al., 2010). HPCaaS offers substantial benefits because of its improved quality of service, including (1) high scalability, (2) low cost, and (3) low latency (Umakishore and Venkateswaran, 1994).
Cloud computing is promising, in this context, as it provides organizations with the ability to analyze and store data economically and efficiently.
Cloud computing is defined by the National Institute of Standards and Technology (NIST) (2011) as a model for providing on-demand access to shared resources with minimal management effort. NIST (2011) sets out five characteristics that define Cloud computing: on-demand
self-service, broad network access, resource pooling, rapid elasticity, and measured service. Furthermore, based on NIST definition, Cloud
computing provides the following basic services: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS).
Virtualization is deemed as the core enabling technology behind Cloud computing. When a user requests a Cloud service (e.g., SaaS, PaaS, or
IaaS), the Cloud computing platform “forks” the corresponding virtual machines. The latter are created instantly, upon service request, and are
“destroyed” once the user releases the relevant services. This fact leverages the “pay-per-use” feature of the Cloud. Since Cloud computing platforms use different virtualization techniques, varying in architecture and design, the choice of technique ought to impact the overall performance of the Cloud services.
Parallel and distributed systems also play a significant role in enhancing the performance of HPC. One of the best-known and most widely adopted parallel programming models is the MapReduce paradigm (Jeffrey and Sanjay, 2004), developed by Google to meet the growth of its web search indexing. MapReduce computations are performed with the support of a data storage system known as the Google File System (GFS). The success of both MapReduce and GFS inspired the development of Hadoop (Apache Hadoop), which implements both MapReduce and the Hadoop Distributed File System (HDFS) to distribute Big Data across HPC clusters (Molina-Estolano et al., 2009; Cranor et al., 2012). Nowadays, Hadoop is widely adopted by big players in the market because of its scalability, reliability and low cost of implementation.
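To make the MapReduce model concrete, the sketch below shows the classic word-count job written for Hadoop Streaming in Python. It is only an illustrative example, not code from the deployment described later in this paper; the script name and the way it is split into a mapper and a reducer are assumptions about a typical streaming setup.

```python
#!/usr/bin/env python3
"""Minimal word-count mapper/reducer for Hadoop Streaming (illustrative sketch).

Assumes the streaming convention: the mapper reads raw text lines on stdin and
emits key<TAB>value pairs; the reducer receives those pairs sorted by key.
"""
import sys

def mapper():
    # Emit (word, 1) for every word on every input line.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so equal words are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, _, value = line.rstrip("\n").partition("\t")
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Invoke as "wordcount.py map" for the map phase, otherwise run as the reducer.
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```

On a typical installation such a script would be submitted through the Hadoop Streaming jar (whose exact path depends on the distribution), with HDFS providing the distributed input and output directories.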
At present, the use of HPC in the Cloud is still limited. A first step towards this research was taken by the Department of Energy (DOE) National Laboratories, which started exploring the use of Cloud services for scientific computing (Xiaotao et al., 2010). That said, HPCaaS still needs more investigation to identify appropriate environments that can better fit Big Data requirements.
In this paper, we shed further light on the coupling between Cloud computing and virtualization, and study its impact using real-world experimentation. We propose an HPCaaS architecture for Big Data processing, which consists of building a Hadoop Virtualized Cluster (HVC) in a private Cloud using OpenStack. The HVC was used, first, to investigate the added value of adopting a virtualized cluster and, second, to evaluate the impact of virtualization techniques, namely KVM and VMware ESXi, on handling HPCaaS. We used modest hardware (eight machines) to deploy the proposed HPCaaS testbed. Nevertheless, it serves as a proof of concept and as a blueprint for further deployments using more powerful hardware settings.
The rest of the paper is organized as follows: Section 2 bridges the gap between the contribution of this paper and previous related work. Section
3 discusses the relationship between Cloud computing and virtualization technology. In Section 4, we describe the HPCaaS private Cloud platform adopted to evaluate the impact of virtualization. Section 5 describes the experimental setup. Section 6 presents and discusses the
findings of the paper, and finally we conclude and present future work in Section 7.
RELATED WORK
There have been several studies evaluating the performance of HPC in the Cloud. Most of these studies used Amazon EC2 as a Cloud
environment (Qiming et al. 2010; Keith et al. 2010; Edward, 2008; Jaliya and Geoffrey, 2009; Yunhong and Robert, 2008; Abhishek and Dejan,
2011). However, few of these studies integrated Cloud platforms with new emerging technologies for parallel and distributed systems (e.g.
Hadoop) (Constantinos and Chris, 2008).
Qiming et al. (2010) have evaluated the performance of HPC using three different Cloud platforms: Amazon EC2, GoGrid Cloud, and IBM Cloud.
For each Cloud platform, they ran HPC on Linux virtual machines, and came to the conclusion that the tested public Clouds do not seem
to be optimized for running HPC applications. This was explained by the fact that public Cloud platforms have slow network connections between
virtual machines. Furthermore, Keith et al. (2010) evaluated the performance of HPC applications in today's Cloud environments (Amazon EC2)
to understand the tradeoffs in migrating to the Cloud. Overall results indicate that running HPC on Amazon EC2 Cloud platform limits
performance and causes significant variability. Abhishek and Dejan (2011) evaluated the performance-cost tradeoffs of running HPC applications
on three different platforms. The first and second platforms are physical clusters, and the third is a Eucalyptus Cloud with KVM virtualization. Running HPC on these platforms led the authors to conclude that the Cloud is more cost-effective for applications with low communication intensity.
In order to understand the performance implications on HPC using virtualized resources and distributed paradigms, Constantinos and Chris
performed an extensive analysis of HPC using Eucalyptus (16 nodes) and other technologies, e.g., Hadoop, Dryad and DryadLINQ (Yuan, et al.,
2008) and MapReduce (2004). The conclusion of this research suggested that most parallel applications can be handled fairly easily when using Cloud technologies. However, scientific applications, which require complex communication patterns, still require more efficient runtime support.
HPC performance under different virtualization technologies has also been evaluated in several research papers (Fragni et al., 2010; Naveed,
2012; Deshan et al., 2008; Hwang et al., 2013). Fragni et al. (2010) performed an analysis of virtualization techniques (VMWare, Xen, and
OpenVZ) and HPC. Their findings show that none of the techniques match the performance of the base system perfectly; yet, OpenVZ
demonstrates high performance in both file system performance and industry-standard benchmarks. Naveed (2012) compared the performance
of KVM and VMware. Overall findings show that VMware performs better than KVM. Still, in a few cases, KVM provides better results than
VMWare. Deshan et al. (2008) conducted a quantitative analysis of two leading open source hypervisors, Xen and KVM. Their study evaluated
the performance isolation, overall performance and scalability of virtual machines for each virtualization technology. In short, their findings
show that KVM has substantial problems with guests crashing (when increasing the number of guests); however, KVM still has better
performance isolation than Xen. Finally, Hwang et al. (2013) have extensively compared four hypervisors: Hyper-V, KVM, VMWare, and Xen.
Their results demonstrated that there is no perfect hypervisor.
To understand the trade-offs of virtualization, we further investigate, in this work, the impact of virtualization on HPCaaS. We propose an
architecture that evaluates the impact of virtualization techniques (KVM and VMware ESXi) on HPCaaS in a private Cloud, and discuss the
performance of running big data on both virtualized and physical cluster.
VIRTUALIZATION AND CLOUD COMPUTING: THE RELATIONSHIP
Virtualization has been known for many years as a powerful technology in computer science. It was first introduced in the late 1960s with the IBM System/360 mainframe, which allowed virtual sharing of resources (Robert, 2008). Virtualization is broadly defined as an abstraction layer between physical resources and their logical representation (Matthew, 2012).
Virtualization is the key technology enabler in Cloud computing. When requesting a Cloud service (e.g., IaaS, PaaS, and SaaS), all needed
resources are allocated via the instantiation of one or multiple virtual machines (VMs). The latter are mere system files that are created,
migrated, and destroyed when needed. This enables the paramount “pay-per-use” feature of the Cloud since no permanent physical resources are
reserved.
Indeed, in the Cloud, when a user requests a service, a corresponding VM is instantiated. This consists of allocating the corresponding vCPUs (using virtual time intervals), memory (using shadow page tables), and virtual I/O devices (using virtual drivers). In fact, the implementation varies from one virtualization technology to another, and this depends mainly on whether the hypervisor is bare-metal/type-1 (i.e., running on top of the hardware) or type-2 (i.e., running on top of an existing operating system). Besides, it also depends on whether the implementation is full or para-virtualization. In para-virtualization, the guest operating systems (i.e., those running on VMs) are altered to become aware of the existence of a hypervisor and thus issue hypercalls instead of basic ISA (Instruction Set Architecture) instructions. In full-virtualization, the guest operating systems are kept intact and are “persuaded” that they still own the hardware, and thus generate ISA instructions. The latter are complicated to translate when the ISAs of the VMs differ from the native ISA (i.e., that of the physical machine).
In this context, we can clearly discern the different trends in implementing virtualization. Any Cloud computing platform has an underlying virtualization technology; some support more than one (e.g., OpenStack supports KVM, Xen, and VMware). Furthermore, we can assert that there is a strong coupling between virtualization and Cloud computing. In fact, Cloud computing platforms (e.g., OpenStack, Eucalyptus, and Nimbus) serve merely as user-friendly interfaces, with special accounting and security services, for “instantiating” virtual machines to carry out the requested Cloud services.
The performance of Cloud computing platforms depends mainly on the underlying virtualization technique, and we summarize this dependence along the following main dimensions: (1) whether it uses emulation or native execution; (2) whether it is a type-1 or type-2 hypervisor; (3) whether it implements full or para-virtualization; and (4) whether it implements I/O virtualization at the system-call interface, at the driver level, or at the instruction level.
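As a rough illustration of the first of these dimensions, the following Linux-only sketch probes whether a host offers hardware-assisted (native-execution) virtualization and whether it is itself already running as a guest. Reading /proc/cpuinfo and treating the vmx/svm flags as a proxy for hardware assistance are assumptions about a typical Linux host; the check is indicative only.

```python
"""Rough Linux-only probe of the virtualization capabilities discussed above."""
from pathlib import Path

def cpu_flags():
    # The "flags" line of /proc/cpuinfo lists the CPU feature flags.
    for line in Path("/proc/cpuinfo").read_text().splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
hw_assisted = bool(flags & {"vmx", "svm"})   # Intel VT-x / AMD-V present?
inside_vm = "hypervisor" in flags            # this flag is set when running as a guest

print(f"hardware-assisted virtualization: {hw_assisted}")
print(f"running inside a virtual machine: {inside_vm}")
```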
Little work has been done so far in studying this strong coupling between virtualization and Cloud services, using real-world Cloud computing
platforms. In this paper, we prove the existence of this tight coupling by evaluating the impact of virtualization on HPCaaS using different virtualization techniques, namely KVM (Linux Inc.) and VMware ESXi (VMware Inc.).
HPCAAS PRIVATE CLOUD DEPLOYMENT
In this section we delineate the blueprint of a real-world HPCaaS private Cloud deployment. The proposed architecture consists of building a Hadoop Virtualized Cluster (HVC) in a private Cloud using OpenStack, which is deemed the most promising open source Cloud platform. The HVC was used, first, to investigate the added value of adopting a virtualized cluster and, second, to evaluate the impact of virtualization techniques, namely KVM and VMware ESXi, on handling HPCaaS.
We built two virtualized clusters of 8 virtual machines (VMs). The first cluster uses KVM as the underlying Hypervisor (Figure 1), and the second
uses the VMware ESXi Hypervisor (Figure 2). Besides, we deployed a physical cluster of 8 machines in order to investigate the added value of
adopting virtualization in the Cloud.
Figure 1. Hadoop virtualized cluster: KVM
Figure 2. Hadoop virtualized cluster: VMware ESXi
OpenStack Deployment
To deploy the HPCaaS private Cloud, we started by installing OpenStack. Given our small experimental setup (in terms of storage and processing),
it was sufficient to deploy the following OpenStack components: Keystone, Glance, Nova and Horizon. These components provide both data
storage and data processing to implement HPCaaS.
After installing and configuring the virtualization hypervisor (KVM/VMware), the first OpenStack component we installed was Keystone, which manages authentication, e.g., by creating the relevant tenants (OpenStack projects), associated users, and roles. The second OpenStack component we installed was Glance, which creates and manages the different formats of VM images. The Glance package includes glance-api, which accepts incoming API requests; the glance database, which stores all information about images; and glance-registry, which is responsible for retrieving and storing metadata about images. The third component we installed was the Nova package. This contains nova-compute, nova-scheduler, nova-network, nova-objectstore, nova-api, rabbitmq-server, novnc and nova-consoleauth. All these components collaborate and communicate with each other to create and manage VMs (Ken, 2011).
Finally, to access the instantiated VMs, a user-friendly interface was deployed by configuring the OpenStack dashboard. When logging into the dashboard, the user specifies the number of vCPUs, the disk space, and the total RAM to allocate to a given VM. Once the VM instances are accessible, the next step is to install Hadoop and configure the instances to form a computing cluster.
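For readers who prefer to script this step rather than use the dashboard, the sketch below shows roughly how the eight VMs could be created programmatically. It uses the modern openstacksdk client rather than the OpenStack release used in the original deployment, and the cloud name, image, flavor and network names are placeholders, not values from this testbed.

```python
"""Sketch of scripting the VM-creation step with the OpenStack SDK (openstacksdk).

Assumes a reachable cloud defined in clouds.yaml under the name "hpcaas", and an
image/flavor/network already registered ("ubuntu-hadoop", "m1.medium", "private"
are placeholders).
"""
import openstack

conn = openstack.connect(cloud="hpcaas")  # credentials come from clouds.yaml

# Boot eight identically sized instances to form the Hadoop virtual cluster.
image = conn.compute.find_image("ubuntu-hadoop")
flavor = conn.compute.find_flavor("m1.medium")
network = conn.network.find_network("private")

servers = []
for i in range(8):
    server = conn.compute.create_server(
        name=f"hadoop-node-{i}",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    servers.append(conn.compute.wait_for_server(server))

for s in servers:
    print(s.name, s.status)
```

Booting every node from the same image and flavor keeps the virtual cluster homogeneous, which simplifies the Hadoop configuration in the next step.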
Hadoop Deployment
To deploy the Hadoop cluster, we started by identifying the master and slave nodes. For the master node, there are six files that need to be configured: the core-site, hadoop-env, hdfs-site, mapred-site, masters and slaves files. For the slave nodes, the only files that need to be configured are the hadoop-env, core-site, hdfs-site and mapred-site files. These files set environment variables, define common properties (e.g. HDFS and MapReduce properties), specify the master and slave nodes, set the number of replicas, etc.
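A minimal sketch of how these configuration files could be generated is shown below. It assumes the classic Hadoop 1.x property names (fs.default.name, mapred.job.tracker, dfs.replication) in use at the time of this study; the host names and file paths are placeholders, not the authors' actual settings.

```python
"""Generate minimal Hadoop 1.x configuration files (illustrative sketch)."""
from xml.etree.ElementTree import Element, SubElement, ElementTree

def write_site(path, properties):
    # Each *-site.xml file is a <configuration> of <property> name/value pairs.
    root = Element("configuration")
    for name, value in properties.items():
        prop = SubElement(root, "property")
        SubElement(prop, "name").text = name
        SubElement(prop, "value").text = value
    ElementTree(root).write(path, xml_declaration=True, encoding="utf-8")

master = "hadoop-node-0"  # placeholder master host name

write_site("core-site.xml",   {"fs.default.name": f"hdfs://{master}:9000"})
write_site("hdfs-site.xml",   {"dfs.replication": "3"})       # default replication factor used in the experiments
write_site("mapred-site.xml", {"mapred.job.tracker": f"{master}:9001"})

# The masters/slaves files are plain host lists, one host per line.
with open("masters", "w") as f:
    f.write(master + "\n")
with open("slaves", "w") as f:
    f.writelines(f"hadoop-node-{i}\n" for i in range(8))
```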
After configuring all needed files, nodes have to communicate with each other via the SSH protocol (Secure SHell). Afterwards, we formatted the
HDFS namenode. This cleans the filesystem and creates storage directories. Finally, the Hadoop cluster can be launched to run jobs after starting
the HDFS and MapReduce daemons. Hadoop documentation is provided in (Michael, 2011).
EXPERIMENTATION
Settings
For both Hadoop virtualized clusters (HVC), the KVM-based and the VMware-based, we used a single 8-core server (Dell PowerEdge 2950 with 6GB of RAM). On top of this server, we installed OpenStack to create eight VMs, first using KVM and then VMware ESXi. The Hadoop Physical Cluster (HPhC) consists of 8 physical machines (Dell OptiPlex 755 with 975MB of RAM). One machine is selected to serve as both master and slave node; the remaining machines are configured as slaves.
Admittedly, the underlying hardware is not very powerful, and the hardware models used for the HVC and HPhC are not the same. However, we stress that the purpose of the following experimentation is to prove the concept and assess the platform's functionality.
Data Sets
We opted for two well-known benchmarks which are TeraSort and TestDFSIO (Tests for Distributed File System I/O) (Michael, 2011).
TeraSort was developed by Owen O’Malley and Arun Murthy at Yahoo Inc. (Thomas, 2013). It won the annual general purpose terabyte sort
benchmark in 2008 and 2009. It does considerable computation, networking, and storage I/O.
The TestDFSIO benchmark is used to check the I/O rate of a Hadoop cluster with write and read operations. Such a benchmark is helpful for testing HDFS, checking network performance, and validating the hardware, OS and Hadoop setup.
Table 1. Experimental performance metrics
For TeraSort, we used 100 MB, 1 GB, 10 GB and 30 GB datasets; for TestDFSIO, we used 100 MB, 1 GB, 10 GB and 100 GB datasets.
We started the experimentation by gradually scaling the cluster from three to eight machines. We started with 3 machines because this number is the default replication factor in HDFS. For each benchmark, we ran three tests for each dataset size and calculated the mean to smooth out outliers and provide more accurate results.
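This measurement protocol (three runs per dataset size, averaged) can be expressed as a small driver script like the sketch below. The jar name and the HDFS input/output paths are placeholders, and the exact TeraSort/TestDFSIO invocations differ slightly between Hadoop distributions; the same loop applies to TestDFSIO by swapping the command.

```python
"""Sketch of the measurement loop: three runs per dataset size, averaged.

Assumes teragen has already produced the input directory for each size and that
the examples jar is reachable under the placeholder name below.
"""
import subprocess
import time
from statistics import mean

EXAMPLES_JAR = "hadoop-examples.jar"  # placeholder; the exact name depends on the distribution

def run_terasort(input_dir, output_dir):
    start = time.time()
    subprocess.run(
        ["hadoop", "jar", EXAMPLES_JAR, "terasort", input_dir, output_dir],
        check=True,
    )
    elapsed = time.time() - start
    # Remove the output directory so the next run starts clean
    # (older Hadoop releases use "fs -rmr" instead of "fs -rm -r").
    subprocess.run(["hadoop", "fs", "-rm", "-r", output_dir], check=True)
    return elapsed

def benchmark(sizes=("100MB", "1GB", "10GB", "30GB"), runs=3):
    results = {}
    for size in sizes:
        times = [run_terasort(f"/terasort/in-{size}", f"/terasort/out-{size}")
                 for _ in range(runs)]
        results[size] = mean(times)
    return results

if __name__ == "__main__":
    for size, avg in benchmark().items():
        print(f"TeraSort {size}: mean of 3 runs = {avg:.1f} s")
```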
RESULTS AND ANALYSIS
TeraSort Performance
Figures 3, 4, 5, and 6 depict the performance of TeraSort using 100MB, 1GB, 10GB and 30GB file sizes, respectively. By varying the number of VMs,
we got different performance trends.
In Figure 3, where the data size is 100MB, TeraSort shows better performance in the Hadoop Virtualized Cluster (HVC). For instance, for a 3
nodes cluster, the performance of sorting 100MB using HVC-KVM is 30% better than using the Hadoop Physical Cluster (HPhC), and 25% better
than using HVC-VMware.
The performance of sorting 100MB remains stable as the number of machines increases to 4, 5 and 6. Afterwards, with a cluster of 7 and 8
machines, the performance decreases by 33% when using HVC-VMware. Still, HVC-KVM proves to be faster than HPhC by 29.5% and 27% for a
cluster of 7 and 8 nodes respectively.
When increasing the file size, the performance of TeraSort changes in each cluster. In Figure 4, sorting 1GB using HVC proves again to be better
than HPhC. For a cluster of 3-5 nodes, the average execution time of sorting 1GB on virtualized clusters is 87-90 seconds, while it is 182-187
seconds when sorting the same file size on HPhC. However, the performance of sorting 1GB on HVC-KVM decreases sharply when increasing the number of VMs from 5 up to 8. In this case, the performance of the 1GB TeraSort using the HPhC cluster is 89% better than using an HVC-KVM of 8 VMs.
Still, the performance of sorting the same dataset on HVC-VMware is faster than HPhC.
Regarding the 10-GB TeraSort, and as Figure 5 depicts, the TeraSort performance on HVC-VMware is better than HPhC as well. For instance, for
a cluster of 5 nodes, sorting 10GB using HVC-VMware is faster by 60% than using HPhC. Sorting the same file size on a cluster of 5 nodes using
KVM is also faster by 51% than using HPhC.
Using 30 GB datasets, see Figure 6, TeraSort performance on HVC-VMware proves to be the best. For instance, for a cluster of 5 nodes, sorting
30GB on HVC-VMware is faster by 28% than HVC-KVM, and faster by 61% than HPhC. HVC-KVM performs better than HPhC as well. Yet, this
observation holds only for clusters of 3 to 5 nodes. When increasing the cluster size to 6-8 nodes, the performance of sorting 30GB using HVC-
KVM decreases and becomes slower than HPhC.
For all dataset sizes, we observe that the performance of TeraSort on HVC-VMware declines when opting for a cluster of 8 nodes; for example,
the performance of sorting 10 GB decreased by 51% (compared to 10GB on 3 nodes). Yet, HVC-VMware keeps performing better than the other
clusters when sorting large datasets.
Figure 6. Average time for sorting 30 GB on HPhC, HVC with
KVM and VMware
TestDFSIOWrite Performance
Figures 7, 8, 9 and 10 depict the performance of TestDFSIO-Write using 100MB, 1GB, 10GB and 100GB respectively. The performance of this benchmark differs slightly from that of TeraSort.
Figure 7 shows that HPhC is better than the HVCs when writing small datasets (100MB). Yet, when increasing the dataset size (see Figures 8-10), we observe that the performance of writing 1GB, 10GB and 100GB is better when opting for HVC-KVM. Besides, we observe that the performance of TestDFSIO-Write using HVC-VMware is lower than when using HPhC. For instance, in Figure 8, when writing 1 GB using 5 nodes, HVC-KVM proves to
be 31% faster than HPhC and 48% faster than HVC-VMware.
When scaling the cluster from 5 to 8 machines, the performance of writing different datasets on HVC-KVM decreases sharply.
Figure 7. Average time for writing 100 MB on HPhC, HVC
with KVM and VMware
Figure 8. Average time for writing 1 GB on HPhC, HVC with
KVM and VMware
Figure 9. Average time for writing 10 GB on HPhC, HVC with
KVM and VMware
Figure 10. Average time for writing 100 GB on HPhC, HVC
with KVM and VMware
TestDFSIORead Performance
Figures 11, 12, 13 and 14 show the performance of TestDFSIO-Read using 100MB, 1GB, 10GB and 100GB respectively. The performance behavior
of this benchmark is similar to TestDFSIO-Write.
As illustrated in Figures 11 and 12, reading small datasets with HVC is faster than with HPhC. Yet, this observation holds only for HVC-KVM with 3 to 5 nodes. When scaling HVC-KVM to 6, 7 and 8 nodes, the performance of reading 100MB and 1GB declines. The performance of reading small datasets using HVC-VMware is lower than both HVC-KVM and HPhC.
Afterwards, when increasing the dataset size to 10GB and 100GB (see Figures 13 and 14), we can observe different performance trends. For clusters
of 3-5 nodes, HVC-KVM proves to be better than HVC-VMware and HPhC. For instance, when reading 100 GB in a cluster of 3 nodes, HVC-KVM
is faster by 12% than HVC-VMware and 44% faster than HPhC. Still, the performance of reading 10GB and 100GB on HVC-KVM starts
decreasing when scaling the cluster to 6-8 nodes.
In contrast to the TestDFSIO-Write results, HVC-VMware is faster than HPhC when reading large datasets (10GB and 100GB), see Figures 13 and 14. For instance, when reading 100GB, HVC-VMware performs better than HPhC by 36% and 55.5% for clusters of 7 and 8 nodes respectively.
Figure 11. Average time for reading 100 MB on HPhC, HVC with KVM and VMware
Figure 12. Average time for reading 1 GB on HPhC, HVC with KVM and VMware
Figure 13. Average time for reading 10 GB on HPhC, HVC with
KVM and VMware
Figure 14. Average time for reading 100 GB on HPhC, HVC with KVM and VMware
Discussion
When running the TeraSort benchmark, HVC-VMware proves to be the fastest at sorting large datasets (1GB, 10GB and 30GB). For
instance, sorting 30GB using a cluster of 5 nodes shows that HVC-VMware is faster than HVC-KVM by 64% and faster than HPhC by 84%. HVC-
KVM proves to be faster than HPhC as well.
However, for HVC-KVM, when increasing the number of nodes to 6, 7 and 8, the overall performance of TeraSort degrades. We attribute this to a scalability issue, as KVM was shaped for academic and research purposes rather than for real-world deployments. A study by Fayruz et al. (2013) states that KVM has substantial problems with guests crashing once it reaches a certain number of VMs (e.g., 4 VMs). Therefore, the overall system performance is highly affected by this scalability issue when opting for the KVM virtualization technique.
Regarding HVC-VMware, the performance of running TeraSort decreases only when scaling the cluster to 8 VMs. We attribute this to overall system overhead rather than to a scalability issue (VMware ESXi is known to be scalable (Raghu et al., 2009)). To identify the factors that led to the system overhead, we tracked the performance of sorting the 30GB dataset on 8 VMware VMs using the VMware vSphere Client. We found that, at some point, the memory required to sort the 30GB dataset exceeds the memory available in the cluster (see Figure 15, which shows the active memory exceeding the granted memory between minutes 50 and 55).
Concerning TestDFSIO benchmark, HVC-KVM proves to have better performance than other clusters when performing both write and read
operations. However, HVC-VMware shows the lowest performance when being compared to HVC-KVM and HPhC.
In fact, the significant results obtained from running TestDFSIO on HVC-KVM are explained by the Virtio API, which is integrated with the KVM hypervisor to provide an efficient abstraction for I/O operations (IBM Corp., 2012). A relevant study tested the I/O performance of KVM (with the Virtio API) and compared it with that of VMware vSphere 5.1 (Khoa et al., 2013). The study showed that KVM with the Virtio API achieves I/O rates that are 49% higher than VMware vSphere 5.1.
In short, the overall performance of TeraSort and TestDFSIO proves that, first, HVC has better performance than HPhC, and, second, the
selection of the underlying virtualization technology can lead to significant improvements when performing HPCaaS. In this research,
HVC-VMware proves to have the best performance especially when running computational jobs (TeraSort).
CONCLUSION AND FUTURE WORK
In this paper, we evaluated the impact of virtualization technologies on handling HPCaaS. We delineated relevant opportunities and challenges related to Big Data. To overcome the latter, we suggest adopting the private Cloud deployment model instead of the public one. We proposed an architecture for deploying an HPC private Cloud, and highlighted its main components, e.g., Hadoop, MapReduce and OpenStack. The findings demonstrate that virtualized clusters can perform much better than a physical cluster when processing and handling HPC workloads. The analysis also shows that the Hadoop VMware ESXi cluster performs better at sorting big datasets (with more computation), while the Hadoop KVM cluster performs better at I/O operations. As future work, we plan to scale up our deployment using more powerful hardware settings. Once scaled up, the testbed can be used by scientific communities in need of Big Data processing and mining capabilities.
This work was previously published in the International Journal of Distributed Systems and Technologies (IJDST), 6(4); edited by Nik Bessis,
pages 65-81, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
REFERENCES
Abhishek, G., & Dejan, M. (2011). Evaluation of HPC Applications on Cloud. Hewlett-Packard Development Company.
Chee, S., et al. (2005). Cluster Computing: High-Performance, High-Availability, and High-Throughput Processing on a Network of Computers. In Handbook of Innovative Computing. Springer Verlag.
Constantinos, E., & Chris, N. (2008). Cloud Computing for Parallel Scientific HPC Applications: Feasibility of Running Coupled Atmosphere-Ocean Climate Models on Amazon's EC2. Proceedings of CCA.
Cranor, C. (2012). HPC Computation on Hadoop Storage with PLFS. Data Laboratory at Carnegie Mellon University.
Fragni, et al. (2010). Evaluating Xen, VMware, and OpenVZ Virtualization Platforms for Network Virtualization. Federal University of Rio de Janeiro.
Hwang, et al. (2013). A Component-Based Performance Comparison of Four Hypervisors. Proceedings of the IFIP/IEEE Integrated Network Management Symposium (IM 2013).
Jeffrey, D., & Sanjay, G. (2004). MapReduce: Simplified data processing on large clusters. Proceedings of the 6th USENIX OSDI, 137–150.
John, G., & David, R. (2010). The Digital Universe Decade - Are You Ready? EMC Corporation.
Keith, et al. (2010). Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud. Cloud Computing Technology and Science, 159-168.
Manish, K. (2013). Research Issues in Big Data Analytics. International Journal of Application or Innovation in Engineering & Management (IJAIEM), 2(8).
Matthew, P. (2012). Virtualization Essentials. John Wiley & Sons.
Molina-Estolano, et al. (2009). Mixing Hadoop and HPC Workloads on Parallel Filesystems. Proceedings of the 4th Annual Workshop on Petascale Data Storage, 1-5. doi:10.1145/1713072.1713074
National Institute of Standards and Technology. (2011). The NIST Definition of Cloud Computing. Gaithersburg, MD.
Naveed, Y. (2012). Comparison of Virtualization Performance: VMware and KVM (Master's thesis). Oslo University College, 30-44.
Qiming, et al. (2010). Case Study for Running HPC Applications in Public Clouds. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 395-401.
Umakishore, R., & Venkateswaran, H. (1994). Towards Realizing Scalable High Performance Parallel Systems. In Proceedings of ACM: Suggesting Computer Science Agenda(s) for High-Performance Computing.
Xiaotao, et al. (2010). Research of High Performance Computing with Clouds. Proceedings of the 3rd International Symposium on Computer Science and Computational Technology (ISCSCT '10), 289-293.
Ye, X., et al. (2010). Research of High Performance Computing with Clouds. Proceedings of the 10th International Symposium on Computer Science and Computational Technology, 289–293.
Yuan, Y., et al. (2008). DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation.
Rita Paulino
Federal University of Santa Catarina (UFSC), Brazil
ABSTRACT
The participation of people in social networks is undeniably a contemporary phenomenon, characterized not only by the flow of explicit information in the form of data, both natural and complex, but also by information (data) arising from the movement of the network itself. It is in this context that this article aims to reveal information that is implicit in the participatory movements of sociotechnical networks. For this, it draws on the theoretical contribution of Actor-Network Theory (ANT), by Bruno Latour (2012): “follow things through the networks they carry”. It is believed that by following the movements of social networks, one can view information that reflects feelings and actions implied in the connections around facts and events. In this article, the author analyzes and monitors social networks during the games of the 2014 FIFA World Cup Brazil. This approach constitutes applied and experimental research.
INTRODUCTION
This article aims to demystify the process of sentiment analysis in social networks, presenting the theoretical background on the subject and its interface with digital journalism. To achieve this goal, we test and survey freely accessible sentiment analysis and information visualization tools. As a case study, we analyze the sentiment and the emphasis of issues that arose during the matches of the Brazilian national team in the 2014 World Cup. This systemic approach is underpinned by Bruno Latour's (2012) Actor-Network Theory (ANT). To register the researched interconnections of information, sometimes present but implicit in social networks, we use freely accessible data visualization tools to display information and generate a memory of the striking facts of an event.
Social networks are configured as a large sociotechnical system, defined in the view of Mario Bunge (2003) as structured objects of complex shape whose components each relate to at least one other component. More specifically, a system can be modeled as a composite quadruple comprising its composition (the components of the system), its environment (items that are not part of the system but act on, or are acted on by, some component), its structure (the collection of links between components and between these items and the environment), and its mechanism (the collection of processes that generate qualitative novelty) (Bunge, 2003). Thus a semantic relationship is essential to the understanding of any system.
According to Leticia Luna Freire (2013), Latour's approach to networks refers to flows, circulations and alliances, in which the actors involved constantly interfere and suffer interference. A network is a logical connection, defined by its internal assemblages and not by its external boundary. A sociotechnical system, in the view of Bunge (2003) and Latour (2013), refers to a structure of links or connections between peers and environments and can withstand external influences.
From a more structural point of view, the actors in social networks are mapped by their relations. There is a direct relationship between two actors when there is transmission, in the general sense of the term, from one to the other, whether of information, goods, services or control. When there are no unilateral transfers, the relationship is not directed (Lemieux, 2004). But the fact that a network is not directed does not mean that its set of actors has no connection or meaning for the network they belong to. According to Latour, as cited by Freire (2013), it is necessary to differentiate “actor” in the traditional sense conferred by sociology from the actor in Actor-Network Theory (ANT), where the actor is everything that acts and leaves a trace, and may refer to individuals, institutions, animals, machines, etc. It refers not only to humans but also to non-humans, for which Latour further suggests the term actant.
Network communication is carried by data, which may be semantic, pictorial, textual or media-based, and which can reveal feelings. Online social networking has become an important communication platform that brings together various kinds of information, including opinions and sentiments expressed by its users in simple conversations or messages (M. Araujo, 2014).
The ease of storing and retrieving information from the routine monitoring of actors' actions is a characteristic of informational societies (Gandy, 2002; Bruno, 2014).
Several studies in the context of social networks focus on identifying and monitoring the polarity of shared messages, assuming that, of the significant amount of data posted, a considerable portion relates to the moods and emotions expressed by users (M. Araujo, 2014). The author believes that such analyses have numerous applications, particularly in the development of systems able to capture, in real time, public opinion about social events and product launches. This article also considers that criteria of newsworthiness, and audience interest in them, can be perceived through the analysis of feedback from users of social networks.
Against this backdrop, this article intends to conduct analyses of feelings about facts or events through the collection, extraction and visualization of data in sociotechnical networks.
METHODOLOGY
As described in Freire (2013), from a methodological point of view Latour argues that the only way to understand the reality of scientific studies is to follow scientists in action, since science is founded on practice, not on ideas.
Following Latour's ideas, we believe that the exercise of practice makes us better able to see and understand the phenomenon taking place. Beyond its pedagogical character, immersion in the problem leads the researcher to identify different ways of conducting the research. Freire (2013) believes that Latour's methodological approach recognizes that the effective actions of scientists, in close combination with the objects with which they interact, should no longer be seen as mere background in the production of scientific facts, but as part of the researchers' primary plane of observation and description.
Marcondes Filho (2014) indicates that communication research investigates exactly how a phenomenon affects us, how something resonates with us, what we experience through it and which changes result.
In order to identify and follow tracks in search of trends or feelings, we adopted case studies as our methodological tool. This article aims to document and cite the major events that occurred during Brazil's matches through the reflections of Internet users' participation in social networks (Twitter and Facebook). We follow the theoretical studies of Latour and of researchers who study users' engagement with social media. In addition, we examine the intersection of this study with further research on data journalism; it is not in the scope of this project to demystify the black box of the algorithms these systems use, but rather to use publicly accessible systems and samples of the reports they produce from a search on a particular hashtag.
To assist in the analysis and reading of the data, we used some heuristics identified while testing the algorithms and data visualization software. After reading the data, the generated information was presented as representing a feeling, or clues, about a given fact. According to Bueno F. (2009), the use of heuristic methods often facilitates finding the best possible solutions to problems, though these are not exact, perfect or definitive solutions. This subjectivity, or lack of precision, of heuristic methods is not a defect but a characteristic analogous to human intelligence: in everyday life we often solve many problems without knowing them accurately. The heuristics adopted in this study are the following:
• Heuristic 1: A key hashtag identifies the context of the information about a given fact or event;
• Heuristic 2: One can outline the emphasis of a matter in social networks by the graphical representation of the volume of that term, word or mention;
• Heuristic 3: One can check the feeling about a fact or key hashtag through the positivity or negativity of the related terms and their emphasis in the graphical representation;
• Heuristic 4: The graphic record of a visualization of data collected at a given time provides a graphical memory of a particular fact.
ANALYSIS OF FEELINGS THROUGH SOCIAL NETWORKS
The analysis of feelings through social networks is not a new area on the web. In 2009 the media began to publicize this field, until then little known within the social and computational sciences. The rise of social networks and the participation of Internet users with their comments and opinions caught the attention of many companies, which started to monitor comments on their products.
An emerging field known as sentiment analysis is developing around one of the unexplored frontiers of the computing world: to translate the
uncertainties of human emotion in the form of solid data.1
According to Araújo M. (2012), studies monitoring social networks start from the hypothesis that, of the significant amount of data posted, a considerable portion relates to the moods and emotions expressed by users.
Identifying feelings in textual or numerical data is not an easy task, even for advanced algorithms, and companies that have developed software in this segment consider an accuracy of around 70% to 80% acceptable. The simplest algorithms work by analyzing keywords, and more recently a new analysis variable has been added to sentiment analysis systems: the hashtag.
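A toy illustration of such a keyword-based approach is sketched below. It is emphatically not SocialMention's algorithm: the lexicon is a tiny made-up sample, and production systems rely on much larger word lists and additional signals, which is why they still only reach the 70-80% accuracy mentioned above.

```python
"""Toy keyword-based sentiment scorer (illustrative only; the word lists are made up)."""
POSITIVE = {"win", "victory", "goal", "beautiful", "great"}
NEGATIVE = {"defeat", "vexation", "worst", "sad", "anger"}

def score(post: str) -> int:
    """Return +1 for a mostly positive post, -1 for negative, 0 for neutral."""
    words = post.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos > neg) - (neg > pos)

posts = [
    "what a beautiful goal from Brazil #WorldCup2014Brazil",
    "worst defeat ever, such a vexation #brasilnacopa",
]
for p in posts:
    print(score(p), p)
```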
It is in this context that this article seeks to provide a method of sentiment analysis, using publicly accessible sentiment analysis systems during Brazil's games in the 2014 World Cup. The feelings of social network users are mapped from the hashtags used in comments on Twitter, Facebook, YouTube and Flickr during Brazil's games.
The use of hashtags for the coverage of events, according to Bruns and Burgess (2011) and Zago (2014), may signal a conversation between individuals and the creation of a community of interest around a topic. Hashtags can be created in an ad hoc manner, i.e. there is no central authority or set of rules governing how to create and use them. In the same line of thought, D'Andréa (2014) considers that the multiple intermedia connections that can be leveraged in the rich media ecosystem (Scolari, 2010) brought about by the Internet and connected digital technologies make room for a growing sociotechnical networking that expands and reframes, for example, the live stream of a highly publicized event such as a football game.
Let us consider that the samples identified by the systems adopted for this research reveal information that is implied in the movement of social networks. This is implicit information that is not embodied in the published posts themselves, but in the relationships and emphases among the terms of these posts. The movement of social networks during Brazil's games can be viewed through the reading of data, revealing the feeling of netizens about a particular fact or key hashtag that circulated during the games.
HOW IS THE FEELING CHECKED IN THIS METHOD?
Initial attempts to test various publicly accessible (or trial-version) sentiment analysis tools included Twitter Sentiment, SocialMention, TweetFeel and iFeel. In this survey we found discontinued programs and many broken links. Faced with this difficulty, we adopted SocialMention for the analysis and extraction of data, as it presented the most significant samples of the volume of keywords and hashtags mentioned over a given time span. This scenario is represented by an emphasis that emerges from Internet users' activity, especially around key hashtags on Twitter.
These relations around a hashtag provided us with information, or feelings, often implicit in a mass of comments but visible when materialized in a chart. An example of this materialization of feelings can be seen in Figures 1 and 2. In the first analysis, shown in Figure 1, the key hashtag used was #WorldCup2014Brazil for Brazil's first game; this hashtag identifies the context of the information, in this case the World Cup as viewed by Internet users before, during and after Brazil's first game (Heuristic 1). To display the entries on a particular subject or fact, the key hashtag is combined with the emphasis variables that outline the volume of entries, and with the words or terms that identify the name of each symbol. In the visualization we can identify that, at a certain moment of the analysis, the most common topics related to the World Cup on social networks were the World Cup itself, the Opening, Brazil, Croatia, “Come on” and “Cheering”. These words had larger circles, emphasizing a greater mobilization and a sentiment, which we venture to call positive, among netizens about the beginning of Brazil's campaign in the World Cup.
Figure 1. Emphasis of topics or post tags during the first game, Brazil vs. Croatia, at the opening of the 2014 World Cup, on social networks. Key hashtag used: #WorldCup2014Brazil. Source: Author
Other information that can be seen from the emphasis of entries on social networks is the names of the players who stood out during the game. At the moment the Brazilian player Marcelo scored an own goal, many comments were posted on social networks. The larger circle around the name Marcelo reflects this emphasis in the feedback on the player (Heuristic 2). The SocialMention algorithm can provide a data table with the emphasis of the most cited terms in a given time span of the research, which later serves to build the visualization of the collected data. One can also identify the players who had more entries in comments about Brazil's first game. In this case, the volume of the circles for the players Neymar and Oscar (Heuristic 2) indicated that they were the most cited in the game, and as no derogatory words appeared in the context of these circles, it suggests that the players had a greater volume of circles due to an excellent performance in the game (Heuristic 3).
In contrast to Figure 1, which outlined evidence of positivity, Figure 2 presents indications of negativity (Heuristic 3). This negative feeling can be seen by plotting the emphasis of topics related to #brasilnacopa on social networks after Brazil's defeat by Germany in the 2014 World Cup. The tags “vexation”, “worst”, “defeat”, “goals” and “Germany” are related to the tags with greater emphasis, such as “Brazil”, “Cup” and “World”. This analysis was made on July 10, three days after Brazil's defeat by the German team, and one realizes that there is a negative feeling about the defeat.
Figure 2. Emphasis of topics or tags posted after Brazil's defeat by Germany in the 2014 World Cup, on social networks. Key hashtag used: #brasilnacopa. Source: Author
PROCESS OF ANALYSIS, COLLECTION AND DISPLAY OF DATA
Data journalism is a term that, in my view, encompasses a growing set of tools, techniques and approaches to storytelling. It can include anything from computer-assisted reporting (CAR, which uses data as a “source”) to the most advanced data visualizations and news applications. The common goal is journalistic: to provide information and analysis to help inform us better about the important issues of the day (Aron Pilhofer, New York Times2).
What we cover in the next section is the process that runs from data collection to the graphic display, as adopted in the case study of the 2014 World Cup games in Brazil (Figures 1 and 2). Following the view of Aron Pilhofer, the New York Times journalist quoted above, this process of abstraction and data visualization requires that the journalist have knowledge of data extraction and visualization tools. More important than understanding the process is realizing that these techniques allow the journalist to answer questions and, through the charts or infographics generated from the data, to tell a story or a fact.
The steps and results of this research are presented concurrently, describing each stage of the process.
Step 1: Data Collection
In the case study of Brazil's games, the first step was to identify which hashtags were most used at the time of each Brazil game. These hashtags vary in intensity; Twitter was used to identify the hashtags representing comments about Brazil's games. In the first game, the hashtags representing the context of that moment were #WorldCup2014Brazil, #WorldCup2014, #naovaitercopa, #vaibrasil and #vaitercopa; from the game against Chile onwards, the intensity of mentions and comments became focused on match-specific tags such as #bravschi, #bravscol and #BrasilvsAlemanha. Once a specific tag is identified, it can be submitted to a system's algorithm to identify the particular feeling associated with that hashtag. We used the SocialMention system to check the feelings on social networks at game time, along with the other information that the program provides. Several measurements with the hashtag #naovaitercopa were made before and after Brazil's games; according to the SocialMention measurements, a negative sentiment towards Brazil in the World Cup appeared before the beginning of the games (first image of Figure 3). But after Brazil's game this feeling changed, as can be seen in the second image of Figure 3.
Table 1. Analysis units (Unidades de Análise) of the SocialMention program, with a description of each unit
Every search result on a key hashtag is accompanied by the results of the analysis units from the perspective of the researched hashtag; i.e., a query to identify the sentiment of the hashtags #Brasil2014 and #WorldCup2014 comes with the following data reports: retweets, urls_cited, hashtags (related), references, top_users, top_hashtags and top_keywords (see Figure 4).
Step 2: Filtering
At this stage we have collected data on the #Hashtags key, but this data does not always appear in a “clean” way. There is need for correction of
possible errors in the data, and you can also refer to it as a complement. This phase corresponds to journalistic editing, in which are selected,
processed and verified the information effectively part of the final report (see Figure 5).
Figure 5. Report on hashtags related to the key hashtags #Brasil2014 and #WorldCup2014, delivered as an Excel spreadsheet
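This filtering step can be scripted, for example with pandas, as in the sketch below. The file name and column names (keyword, mentions) are assumptions standing in for the actual report layout shown in Figure 5.

```python
"""Sketch of the filtering step applied to the exported SocialMention report."""
import pandas as pd

report = pd.read_excel("socialmention_brasil2014.xlsx")   # placeholder file name

# Normalise the term column: lower-case, strip whitespace and stray '#' signs.
report["keyword"] = (report["keyword"].astype(str)
                     .str.strip().str.lower().str.lstrip("#"))

# Make sure the mention counts are numeric before filtering on them.
report["mentions"] = pd.to_numeric(report["mentions"], errors="coerce")

# Drop obvious noise: empty terms, duplicates, and very rare mentions.
clean = (report.dropna(subset=["keyword", "mentions"])
               .drop_duplicates(subset="keyword")
               .query("mentions >= 3")
               .sort_values("mentions", ascending=False))

clean.to_csv("keywords_clean.csv", index=False)  # input for the visualization step
print(clean.head(10))
```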
Step 3: Visualization
Visualization (DataViz), or the data-driven narrative, is the stage of preparing the final product, when the help of a designer, or prior knowledge of the system that will be used to visualize the data, may be necessary. In this phase one needs to consider the type of product being developed (infographic, dynamic visualization, application, etc.), as well as usability, accessibility, interactivity, responsiveness and other human and technical aspects. The data need to tell a story.
Every data visualization requires prior data collection, which can be a standard table or a “csv” text file exported from Excel. The visualization can be static or dynamic; this project used RAW, an open tool based on the D3.js vector library, to create custom views of the analysis units related to the key hashtags (see Figure 6).
Figure 6. The data visualization process, involving the steps of collection, filtering and display. Source: Author
It is noteworthy that D3.js offers a library of various display forms, and therefore requires the journalist or designer to analyze which chart type is best suited to the data they want to show. It is important to realize that the visualization should assist in identifying and reading a story or fact (Heuristic 4), although this is not always possible, depending on the relationships and variables in the data. The charts in this study were generally intended to identify the emphasis of publications around a hashtag or keyword, so the variables used in the charts were Hierarchy, Size, Color and Name. A circle shape was the most effective way to show these variables, as can be seen in the charts.
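As a small sketch of how the cleaned keyword counts could be prepared for RAW's circle-based layouts, the script below writes a delimited file whose columns can be mapped onto the Hierarchy, Size, Color and Name variables mentioned above. The input file name and the grouping rule (a mention-count cut-off) are illustrative assumptions, not the project's actual pipeline.

```python
"""Prepare a delimited file for RAW (D3.js-based), mapping columns to chart variables."""
import csv

# keywords_clean.csv is assumed to be the output of the filtering step
# (columns: keyword, mentions); both names are placeholders.
rows = list(csv.DictReader(open("keywords_clean.csv", newline="", encoding="utf-8")))

with open("raw_input.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["Hierarchy", "Name", "Size", "Color"])
    for row in rows:
        count = int(float(row["mentions"]))
        group = "high emphasis" if count >= 50 else "low emphasis"  # illustrative cut-off
        writer.writerow([group, row["keyword"], count, group])
```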
CONCLUSION
Brazil played seven matches in the 2014 World Cup, and the process of collecting, filtering and visualizing data was applied to identify the sentiment of the netizens who were interacting on social networks at the time of the games. The proposed heuristics were observed and verified in every game, so that the visualization could tell the story of the game (Heuristic 4).
In the game Brazil vs. Mexico (#brasilvsmexico, see Figure 7), the participation of goalkeeper Ochoa is referenced in the game's data visualization by the emphasis of the posts (circle volume) on his good performance in the match against Brazil (Heuristic 2).
The dramatic match against Chile, in which the key hashtag #BRAvsCHI was frequently mentioned on social networks, included mentions of the highlights of the game, such as the player Neymar and the drama of the penalty shoot-out (Heuristic 1).
The game against Germany was the most striking, as can be seen in Figure 2. The revolt of the Brazilian people materialized in posts mentioning #BrazilvsGermany, revealing sadness and anger at the disproportionate 7-1 score in Germany's favour (Heuristic 3).
In the last game, #BrasilvsHolanda, disinterest and frustration with the Brazilian team were stamped by the lack of mentions of the game. Whereas in the other games the related hashtags were evident and voluminous, in Brazil's final match disinterest seemed to take over the Internet; it was almost impossible to generate charts due to the lack of entries during the game.
The implicit information in the movements of social networks can be detected with data visualization tools. But before any investigation, the question we should ask is: what do we want to know from the sample data in relation to the matter under investigation? Much information can be revealed or confirmed by analyzing data with an adequate visualization.
This work was previously published in the International Journal of ActorNetwork Theory and Technological Innovation (IJANTTI), 7(2);
edited by Arthur Tatnall and Ivan Tchalakov, pages 41-51, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
ACKNOWLEDGMENT
This research is part of the Journalism research network projects and digital technologies (JorTec) (http://tecjor.net/index.php?
title=P%C3%A1gina_principal).
REFERENCES
Araújo, M., Gonçalves, P., Cha, M., & Benevenuto, F. (2014). iFeel: A system that compares and combines sentiment analysis methods. Proceedings of the companion publication of the 23rd International Conference on World Wide Web, 75-78.
Bruno, F. (2013). Máquinas de ver, modos de ser: vigilância, tecnologia e subjetividade. Porto Alegre: Ed. Sulina.
Bruns, A., & Burgess, J. (2011). The Use of Twitter Hashtags in the Formation of Ad Hoc Publics. European Consortium for Political Research Conference, Reykjavik.
D'Andréa, C. (2014). Conexões Intermidiáticas entre Transmissões Audiovisuais e Redes Sociais Online: Possibilidades e Tensionamentos. XXIII Encontro Anual da Compós, Associação Nacional dos Programas de Pós-Graduação em Comunicação, Universidade Federal do Pará, 27-30 May.
Freire, L. L. (2013). A ciência em ação de Bruno Latour. Editora Humanitas Unisinos, ano II, n. 192.
Gray, J., Bounegru, L., & Chambers, L. (Eds.). (2014). Manual de Jornalismo de Dados. Translation of the Data Journalism Handbook. Available at: http://datajournalismhandbook.org/pt/index.html (accessed 6 February 2014).
Latour, B. (2013). Reassembling the Social: An Introduction to Actor-Network-Theory. Journal of Economic Sociology, 14(2), 73–87.
Lemieux, V., & Ouimet, M. (2008). Análise Estrutural das Redes Sociais. Epistemologia e Sociedade. Instituto Piaget.
Zago, G., Recuero, R., & Bastos, M. T. (2014). Quem Retuita Quem? Papéis de ativistas, celebridades e imprensa durante os #protestosbr no Twitter. XXIII Encontro Anual da Compós, Associação Nacional dos Programas de Pós-Graduação em Comunicação, Universidade Federal do Pará, 27-30 May.
ENDNOTES
2 Gray, J. (2014). Manual de Jornalismo de Dados. Available at: http://datajournalismhandbook.org/pt/introducao_2.html (accessed 12 July 2014).
CHAPTER 79
Here Be Dragons:
Mapping Student Responsibility in Learning Analytics
Paul Prinsloo
Unisa, South Africa
Sharon Slade
The Open University, UK
ABSTRACT
Learning analytics is an emerging but rapidly growing field seen as offering unquestionable benefit to higher education institutions and students
alike. Indeed, given its huge potential to transform the student experience, it could be argued that higher education has a duty to use learning
analytics. In the flurry of excitement and eagerness to develop ever slicker predictive systems, few pause to consider whether the increasing use of
student data also leads to increasing concerns. This chapter argues that the issue is not whether higher education should use student data, but
under which conditions, for what purpose, for whose benefit, and in ways in which students may be actively involved. The authors explore issues
including the constructs of general data and student data, and the scope for student responsibility in the collection, analysis and use of their data.
An example of student engagement in practice reviews the policy created by the Open University in 2014. The chapter concludes with an
exploration of general principles for a new deal on student data in learning analytics.
INTRODUCTION
It is easy to be seduced by the lure of our ever-increasing access to student data to address and mitigate the myriad of challenges facing
higher education institutions (HEIs) (Greenwood, Stopczynski, Sweat, Hardjono & Pentland, 2015; Stiles, 2012; Watters, 2013; Wishon & Rome,
2012). Challenges include, inter alia, changes in funding regimes and regulatory frameworks necessitating greater accountability to a widening
range of stakeholders such as national governments, accreditation and quality assurance bodies, employers and students (Altbach, Reisberg, &
Rumbley, 2009) (also see Bowen & Lack, 2013; Carr, 2012; Christensen, 2008; Hillman, Tandberg, & Fryar, 2015; New Media Consortium, 2015;
Shirky, 2014). Though anything but a recent development (see e.g., Hartley, 1995), funding increasingly follows performance rather than
preceding it (Hillman et al., 2015). The continuous decrease of public funding for higher education increases the pressures on higher education
institutions to not only be accountable to an increasing number of stakeholders, but also to ensure the effectiveness of their teaching and student
support strategies. There are also increasing concerns that HEIs have not solved, nor done enough to attempt to solve, the ‘revolving door’
syndrome whereby many students either fail to complete their courses or programmes or take much longer than planned (Subotzky & Prinsloo,
2011; Tait, 2015).
As teaching and learning increasingly move online, the amount of digital data available for harvesting, analysis and use increases.
HEIs’ access to and use of student data is thought to have the potential to revolutionise learning (Van Rijmenam, 2013) with the expectation that
it will change ‘everything’ (Wagner & Ice, 2012), that student data is the new black (Booth, 2012) and the new oil (Watters, 2013). The current emphasis on the ‘potential’ of learning analytics without (as of yet) definitive evidence that learning analytics does indeed provide appropriate and actionable evidence (Clow, 2013a, 2013b; Essa, 2013; Feldstein, 2013; Selwyn, 2014) can produce and sustain a number of ‘blind spots’
(Selwyn & Facer, 2013).
In a climate of expectation, then, that the increased collection and analysis of student data can provide much-needed intelligence, both to increase our understanding of the challenges and issues facing HEIs and to assist in formulating more effective responses, there are also concerns that data1, and increasingly Big Data, is not an unqualified good (Boyd and Crawford, 2012, 2013; Kitchen, 2014a). The harvesting,
analysis and use of student data must also be seriously considered amidst the discourses surrounding privacy, student surveillance, the nature of
evidence in education, and so forth (Biesta, 2007, 2010; Eynon, 2013; Prinsloo & Slade, 2013; Selwyn & Facer, 2013; Wagner & Ice, 2012).
In much of the current discussion around learning analytics, the emphases are on the institution, the potential of data, modelling and algorithms, and on students as producers of data. Though student data is central in learning analytics, the role of students is mostly limited to the production of intelligence for more effective teaching and resource allocation. Students are seen as (merely) generators of data, objects of surveillance, customers and recipients of services (Kruse & Pongsajapan, 2012).
A further concern is a view that for most sites involving the use of personal data, the Terms and Conditions (TOC) of use are generally considered
to be ineffective in providing users with informed control over their own data. More seriously though, many users simply do not take the time, nor do they have the necessary technical or legal expertise, to engage with those TOCs and make informed and rational decisions (Antón & Earp, 2004; Bellman, Johnson, & Lohse, 2001; Earp, Antón, Aiman-Smith, & Stufflebeam, 2005; Lane, Stodden, Bender, & Nissenbaum, 2014; Miyazaki & Fernandez, 2000). Higher education is no exception to this dire state of affairs. Analyses of the Terms and Conditions (TOCs) for three major providers of Massive Open Online Courses (MOOCs) found that students’ role in the data exchange is severely limited to the sole responsibility of ensuring that the information provided by them is correct and current (Prinsloo & Slade, 2015a). Once students accept such TOCs, they have very
little control over what data is collected, used and shared; the persons or entities with whom their data is shared; the governance and storage of
their data; and even access to their own digital profiles.
In the light of the asymmetrical power relationship between students and HEIs, where students have little choice but to accept the TOCs, there is
a need to think differently with regard to the ethical issues in the collection, analysis and use of student data (Slade & Prinsloo, 2013). If one
accepts that higher education is, amongst other things, a moral practice based on a social contract between students and their institutions, it is
impossible to ignore the fiduciary duty of HEIs to re-think the student role in the value chain of data exchange.
The rationale for rethinking the role and responsibilities of students in this context is found in an awareness of the ethical implications and
considerations in learning analytics. Although there are promising signs of increasing attention paid to issues surrounding privacy and ethics in
learning analytics (Eynon, 2013; Pardo & Siemens, 2014; Siemens, 2013), the practical challenges associated with adopting institutional policy
and frameworks to address those issues are complex (Prinsloo & Slade, 2013). It is believed that the Open University (UK) policy on the Ethical
use of student data (Open University, 2014) is the first of its type within the context of higher education.
This chapter argues that the role of students in the context of the value exchange of their data is seriously underestimated and under-theorised. We propose that students’ responsibility in the data exchange is much more than just ensuring that the information provided is correct
and current. Students can and should actively collaborate in the institution’s collection, analysis and use of that data. Students’ agency is much
more than just opting in or out of having data collected, analysed and used. We therefore position students as agents in a student-centric
approach (e.g., Kruse & Pongsajapan, 2012; Slade & Prinsloo, 2013; Prinsloo & Slade, 2015a) to learning analytics.
In times past, ancient maps were used to record what was already known and to guide exploration into areas which were not yet familiar. Such
maps were informed and shaped by the cartographers’ skills, knowledge and understanding of the areas they were trying to map. Areas unknown
to the cartographer were often accompanied by warnings, such as ‘here be dragons.’
In the new world of learning analytics, this chapter attempts to map some of the current approaches to student data and more specifically the
potential of student-centric learning analytics. The authors are aware that the simple binary of opting in or out does not provide an effective or sufficiently nuanced approach (Prinsloo & Slade, 2015a). In exploring and mapping the unknown areas of student responsibility in the
context of learning analytics, it is perhaps pertinent to suggest that ‘here be dragons.’
ENGAGING WITH DATA AS CONSTRUCT
Data in a variety of formats has always been essential to scientific research, commercial enterprises and higher education, but it is also fair to say
that ‘[d]ata have traditionally been time-consuming and costly to generate, analyse and interpret, and generally provided static, often coarse, snapshots of phenomena’ (Kitchen, 2014a, p. xv). The production and availability of data have overtaken our understanding of the complexities and ethical challenges of data as phenomenon (Barnes, 2013; Diakopoulos, 2014a; Eubanks, 2015; Koponen, 2015; Mehta, 2015). While many analysts may accept data at face value, and treat them as if they are ‘neutral, objective, and pre-analytic in nature’, data are in fact ‘framed technically, economically, ethically, temporally, spatially and philosophically. Data do not exist independently of the ideas, instruments, practices, contexts and knowledges used to generate, process and analyse them’ (Kitchen, 2014a, p. 2) (Also see Boellstorff, 2013; De Zwart, 2014; Pasquale, 2015; Uprichard, 2015).
This is illustrated by the fact that unless a dataset can be considered complete, data remains ‘inherently partial, selective and representative, and the distinguishing criteria used in their capture has consequence’ (Kitchen, 2014a, p. 3). While data, in the social imaginary, has come to be understood as representative of the total sample, objective and neutral, data is ‘different in nature to facts, evidence, information and knowledge’ (Kitchen, 2014a, p. 3). Beyond the fact that data can be framed technically and ethically, data also needs to be framed politically and economically, as data are used to discriminate against individuals, classes of people and geopolitical entities (Kitchen, 2014a, p. 15). (Also see Crawford, 2013; Henman, 2004; Selwyn & Facer, 2013; Uprichard, 2015).
DATA, BLACK BOXES, AND ALGORITHMS
At the intersection of the deluge of data shared and collected by a variety of stakeholders and a growing reliance on predictive models are growing
concerns about the ways in which algorithms now appear to shape our existence. As the reach and downstream impacts of those algorithms become clearer (Eubanks, 2015; Henman, 2004; Kalhan, 2013; Pasquale, 2015), so too does the unease around accuracy, reliability and inherent biases (Bozdag, 2013; Diakopoulos, 2014a, 2014b; Friedman & Nissenbaum, 1996; Kitchen, 2014b; Pariser, 2011; Pasquale, 2011). While it falls outside of the scope of this chapter to explore this more fully, it is crucial that we again warn that ‘here be dragons.’
Amidst the concerns regarding ‘algorithmic regulation’ (Morozov, 2013, par. 15), the ‘algorithmic turn’ (Napoli, 2013, p. 1), and the ‘threat of algocracy’ (Danaher, 2014), scholars, researchers and policymakers are beginning to engage in debate around how best to control and regulate ‘automated authority’ (Diakopoulos, 2014b, p. 12). Current strategies include (but are not limited to) data and algorithmic transparency, accountability and due process (Crawford & Schultz, 2014; Diakopoulos, 2014a; Koponen, 2015; Lohr, 2015; Seaver, 2013).
Given that many higher education institutions increasingly rely on algorithms to determine access, inform policy, personalise learning and
support, and determine at-risk student populations (e.g., Cabral, Castolo, & Gallardo, 2015; Daniel, 2015; Prinsloo & Slade, 2013), we suggest a
growing need to engage with the construct of student data.
ENGAGING WITH STUDENT DATA AS CONSTRUCT
In the light of the accountability regimes and the cost of funding for higher education as a public good, [student] data can thus be understood as ‘an agent of capital interests’ (Kitchen, 2014a, p. 16). How then does this influence and shape our practices of selecting and analysing student data? If this data does not represent the whole reality of students’ life and learning worlds, what does this mean for our analyses and interventions? If we accept that student data is not ‘benign’ (Shah as cited in Kitchen, 2014a, p. 21) but should be understood as ‘framed and framing’ (Gitelman & Jackson, as cited in Kitchen, 2014a, p. 21) – why are we so complacent about the analyses and the predictive models which flow from those analyses? The construct of student data therefore raises, inter alia, two important issues - namely our assumptions regarding student digital
(non)activity and the ‘situatedness’ of that data.
Within higher education, students’ digital footprints combined with their demographic details collected through registration and funding
applications processes are often believed to represent the total picture of student potential and risk. Further, there is a belief that more data will,
necessarily, provide a more holistic picture and result in better and more effective interventions (Prinsloo, Archer, Barnes, Chetty, & Van Zyl,
2015).
Indeed, it would seem that many theoretical frameworks and practices relating to learning analytics assume that the learning management
system (LMS) provides a sufficient picture of student learning, and that measures of activity such as the number of clicks, time-on-task, or page views correlate directly with student retention and success (see Guillaume & Khachikian, 2011; Kovanović, Gašević, Dawson, Joksimović, Baker, & Hatala, 2015; Tempelaar, Rienties, & Giesbers, 2014). However, Kruse and Pongsajapan (2012) question this assumption
that evidence of student activity on the LMS is a valid proxy for learning. (See also Godwin-Jones, 2012; Pardo & Kloos, 2011; Slade & Prinsloo,
2013).
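To make concrete the kind of proxy-based modelling being questioned here, the sketch below fits a simple classifier that predicts a retention label from LMS activity counts. It is a minimal illustration only: the scikit-learn pipeline, the feature names and the synthetic data are all assumptions introduced for this example and do not reflect any of the studies cited above. The point is that such a model can appear to perform well on its own proxies while saying nothing about whether learning has taken place.

```python
# Minimal, hypothetical sketch of a proxy-based retention model built from
# LMS activity counts (clicks, time-on-task, page views). Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical per-student LMS activity features
clicks = rng.poisson(40, n)
time_on_task = rng.gamma(2.0, 60.0, n)   # minutes
page_views = rng.poisson(25, n)
X = np.column_stack([clicks, time_on_task, page_views])

# Synthetic 'retained' label loosely tied to activity, purely for illustration
y = (0.02 * clicks + 0.005 * time_on_task + rng.normal(0, 1, n) > 2.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Held-out AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```

A high score on held-out data of this kind only shows that the proxies predict themselves; it does not validate the proxies as evidence of learning, which is the assumption under question.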
Moreover, the notion that student digital data represents a total or holistic picture of the complexities of students’ learning continues to gain
acceptance, although surely questionable in the context of education as an open and recursive ecosystem. Not only is such data abstracted but
also decontextualised - resulting in a loss of contextual integrity. Considering that ‘[c]ontexts are structured social settings characterised by canonical activities, roles, relationships, power structures, norms (or rules), and internal values (goals, ends, purposes)’ (Nissenbaum, 2010, p.
132), there is a need to seriously question assumptions around student digital data as a reliable proxy or as fully representative. In a world
seeking the elusive n=all data set, we should not accept that data showing student (dis)engagement on the LMS is sufficiently complete (Mayer-
Schönberger & Cukier, 2013). (Also see Campbell, Chang, & Hosseinian-Far, 2015; Harford, 2014).
When we look at students’ trajectories in higher education as ‘heterotopic spaces’ (Foucault, 1984, p. 1), the notion of the LMS as representative of
the total student journey becomes even more problematic. For researchers, the clear boundaries between public and private spaces are
disappearing as we inhabit various spaces simultaneously (Rymarczuk & Derksen, 2014). Students concurrently live in physical worlds and
digital worlds where their private (digital and non-digital) and public (digital and non-digital) lives intersect and become mutually constitutive.
Our notions of synchronous and asynchronous, past and present lose their heuristic value and fundamentally contradict our assumptions and
analyses. The assumption that student digital data collected from the LMS represents ‘reality’, and the allocation of value to time-on-task, number of clicks, etc., disregard the LMS as a socially constructed space with its own norms, rules and activities (see Nissenbaum, 2010).
STUDENT DATA AND STUDENT RESPONSIBILITY
The potential benefits of learning analytics for increasing the effectiveness of learning are well-documented (Diaz & Brown, 2012; McKeown &
Ayson, 2013; Oblinger, 2012; Siemens & Long, 2011). Durall and Gros (2014) state that analytics in education can be used to identify students at risk, to provide recommendations to students to help personalise resources, reading and support, to detect the need for, and measure the results of, pedagogic improvements, to identify teachers who are performing well and teachers who need assistance with teaching methods, and to assist in the student recruitment process (p. 380). (For a discussion on the difference between academic and learning analytics and their different potential, see Siemens & Long, 2011). We should remember, and keep reminding ourselves, that learning analytics is primarily about learning
(Gašević, Dawson, & Siemens, 2015).
Central to a great deal of the literature highlighting the potential benefits of learning analytics is the notion of students and their learning
trajectories as data objects. Students are portrayed in the main as producers of data and recipients of services and personalised learning journeys
based on the availability of their digital data, the assumptions made and understanding of the complexities of learning, and analyses conducted.
In addition, there is often also a worrying absence of concern regarding students’ (lack of) knowledge of the collection of their data and any
possible impact on their learning journeys. The only apparent requirement of students to engage in their role as data objects is to ensure that the
demographic data which they provide is correct and current.
The current thinking about the scope of student responsibility may be influenced by the complexities of conceptualising student retention and
success. Many models place the responsibility for student success on the ability of students to fit into the processes, epistemologies and
pedagogical strategies of higher education (Prinsloo, 2009). This view of student involvement and responsibility stands in stark contrast to the
conceptual model proposed by Subotzky and Prinsloo (2011) who suggest that students are not mere recipients of services, but active agents in a
reciprocal social contract with higher education.
The underlying notion of students as passive recipients of services and as objects of interventions is amplified when the constant language of ‘intervention’ perpetuates ‘an institutional culture of students as passive subjects – the targets of a flow of information – rather than self-reflective learners given the cognitive tools to evaluate their own learning processes’ (Kruse & Pongsajapan, 2012, p. 2) (Also see Gašević et al.,
2015). If instead we see student success as the outcome of the mutually influential activities, behaviours, attitudes, and responsibilities of
students and the institution, which are viewed in the sociological perspective of situated agents (Subotzky & Prinsloo, 2011, p. 184), it is clear
that any assumptions of students as (just) producers of data must be revised. (Also see Slade & Prinsloo, 2013; Prinsloo & Slade 2015a, 2015b).
STUDENT-CENTRED LEARNING ANALYTICS
While the notion of student-centred analytics has not gained wide traction in the discourses on educational data mining and learning analytics, a number of authors (Kruse & Pongsajapan, 2012; Slade & Prinsloo, 2014; Prinsloo & Slade, 2015a, 2015b) point not only to students making more informed choices regarding the information that they share, but also to their holding their higher education institutions accountable for any subsequent analysis and interpretation.
Engaging with the possibility and potential of students in learning analytics, Kruse and Pongsajapan (2012) propose that higher education needs
to move from an intervention-centric approach to learning analytics and to reimagine analytics ‘in the service of learning [and] transform it into a practice characterised by a spirit of questioning and inquiry’. In this way, students become ‘participants in the identification and gathering of their data as well as co-interpreters of their own data’ (p. 4). Durall and Gros (2014) support this view, stating that ‘very rarely are students considered the main receivers of the learning analytics data or given the opportunity to use the information to reflect on their learning activity and self-regulate their learning efficiently’ (p. 382). Indeed, not only should students be given the possibility to verify their digital dossiers, but HEIs should ‘[also] provide mechanisms for learners to interact with these systems explicitly’ (Durall & Gros, 2014, p. 382). (Also see Kump, Seifert, Beham, Lindstaedt, & Ley, 2012). Slade and Prinsloo (2013) go on to suggest that valuing students ‘as agents, making choices and collaborating with the institution in constructing their identities is a positive approach in the context of the impact of skewed power relations, monitoring, and surveillance’ (p. 1520).
STUDENT DATA, AGENCY, AND PRIVACY SELF-MANAGEMENT
It falls outside of the scope of this chapter to engage with the various definitions and regulatory frameworks which set out the protection of
privacy. See for example, Gurses (2014), Marx (2001), Nissenbaum (2010), Tene and Polonetsky (2012) and Solove (2001, 2004, 2013) for
discussion of the general context and specifically geopolitical and institutional contexts. However, it is possible to surmise that due to, inter alia,
technological developments and changing social and cultural norms, the notion of privacy is fluid (Solove, 2013, p. 61).
Our discussion so far has suggested that users are in an asymmetrical power relationship to those different entities who collect, analyse and
increasingly combine different sources of information. It would seem then that one of the few remaining ways for users to exercise their
admittedly limited control over the collection and use of their data is by engaging with an organisation’s Terms and Conditions (TOCs). Even so, it is broadly accepted that those TOCs are not particularly effective in ensuring either transparency or user control of that data, as those TOCs may be overly complex and/or lengthy (see, e.g., Bellman, Johnson, & Lohse, 2001; Prinsloo & Slade, 2015a). Furthermore, users must engage with separate TOCs on a case-by-case basis, to the extent that the engagement becomes anything but rational (Solove, 2013). The result is that users will
often exchange their personal data for limited benefits.
Despite increasing sensitivity regarding surveillance and the use and sharing of personal data (PewResearchCenter, 2014), many users have become digitally promiscuous (Brian, 2015; Murphy, 2014). Against this backdrop, Prinsloo and Slade (2015a) present consent as a fragile concept. Despite this, Solove (2013) states that ‘[p]roviding people with notice, access, and the ability to control their data is key to facilitating some autonomy in a world where decisions are increasingly made about them with the use of personal data, automated processes, and clandestine rationales, and where people have minimal abilities to do anything about such decisions’ (p. 1899; emphasis added).
While we may think of consent options as a straight binary choice between opting in or opting out, Miyazaki and Fernandez (2000) suggest that there is a broader range of choice. Possibilities of disclosure range ‘from never collecting data or identifying customers when they access a site; customers opting in by explicitly agreeing to having their data collected, used and shared; customers explicitly opting out; the constant collection of data without consumers having a choice (but with their knowledge); to the collection, use and sharing of personal data without the user’s knowledge’ (Prinsloo & Slade, 2015a, par. 13). Despite this, it is also clear that once consent is provided in a particular context, however
consensual and well understood, current legal frameworks do not protect user data when it is repurposed and reused in other contexts (Ohm,
2010, 2015).
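One way to see that this disclosure spectrum is richer than a binary switch is to encode it as an explicit data structure. The sketch below is purely illustrative: the enum values, the record fields and the permits() check are hypothetical design choices for this example, not a description of any existing institutional system or of the authors' proposals.

```python
# Hypothetical encoding of the disclosure spectrum summarised above; not an
# actual institutional schema. Requires Python 3.9+ for the set[str] annotation.
from dataclasses import dataclass
from enum import Enum, auto

class ConsentLevel(Enum):
    NO_COLLECTION = auto()              # data never collected or identified
    OPT_IN = auto()                     # explicit agreement to collect, use, share
    OPT_OUT = auto()                    # collected unless the student declines
    COLLECTED_WITH_NOTICE = auto()      # no choice, but collected with knowledge
    COLLECTED_WITHOUT_NOTICE = auto()   # collected without the student's knowledge

@dataclass
class StudentDataConsent:
    student_id: str
    level: ConsentLevel
    scope: set[str]                     # e.g. {"lms_activity", "demographics"}

    def permits(self, purpose: str) -> bool:
        """Naive check: use is allowed only for purposes within the agreed scope
        and only at levels that involve some student awareness."""
        aware = self.level in {
            ConsentLevel.OPT_IN,
            ConsentLevel.OPT_OUT,
            ConsentLevel.COLLECTED_WITH_NOTICE,
        }
        return aware and purpose in self.scope

consent = StudentDataConsent("s001", ConsentLevel.OPT_IN, {"lms_activity"})
print(consent.permits("lms_activity"), consent.permits("recruitment"))  # True False
```

Even an encoding like this, however, cannot express the repurposing problem noted above: a purpose agreed in one context says nothing about downstream reuse in another.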
Taking into account the fluidness of the notion of privacy and the fragility of the notion of consent, Greenwood et al. (2015) propose a ‘new deal on data’ (p. 192), while Gurses (2014) suggests a palette of ‘privacy solutions’ to allow individuals the agency to make more informed choices (Kerr & Barrigar, 2012).
A CASE STUDY OF OPEN UNIVERSITY POLICY
While the ethical considerations of research on students are covered extensively in research policies, with ample consideration of ‘doing no harm’, there is very little or no consideration of the ethical or harm considerations in the use of learning analytics within higher education (Prinsloo & Slade, 2013).
Many learning analytics strategies focus on informing students of the uses of their data, but as Solove (2013) indicates, privacy policies are not
always read or well understood. Indeed, there is little formal evidence that students are explicitly consulted or even loosely aware of the broader
uses of their data beyond research. The Open University in the UK is an open, distance learning university supporting over 200,000 students
each year. Like many other HEIs, the Open University is making increasing use of its student data and became aware that this position was not
adequately described within existing policy. To that end, the development of a new institutional policy was commissioned to address ethical
issues relating to its approach to learning analytics.
A thorough review of the literature and of existing external practice within the higher education sector was conducted which indicated little
evidence of other new policy development beyond that required under national legislation. Themes and issues from the review were used to
develop a set of eight broad guiding principles, describing key tenets and values of the University’s approach to student support and of the use of
data to guide and shape that support.
The creation of any new institutional policy, of necessity, requires considerable consultation combined with a formal and iterative approval
process. A small working group of staff with expertise and authority in relevant areas met frequently taking shared ownership of tasks. Key
stakeholders were consulted at an early stage on the eight principles. This initial consultation specifically included a few elected student
representatives to ameliorate concerns that broader inclusion of students might introduce disquiet at a point when issues had not been
thoroughly appraised. Once a coherent draft was agreed, the policy was circulated to a larger group of students via a formal consultation process
to uncover student attitudes to their data, and specific issues such as privacy and scope.
The student perspective was crucial, offering both direct insight and understanding of the need for transparency and informed consent.
Capturing the student voice clarified key issues and emphasised a need to consider seriously the issue of opt-out. However, this particular issue
was not easily resolved. Opt-out introduces significant systems issues, but also unearths a specific moral conflict. Some staff felt strongly that opt-
out would render the University unable to act in students’ best interests (considered unethical in itself), whilst others felt that students, as active
and adult learners, should have rights to determine their own support. As a result, the policy was approved in July 2014, with the condition that
the specific issue of consent be reviewed within a year.
The new policy has been clearly communicated to all stakeholder groups. For many students, the emergence of the policy has highlighted for the
first time the uses being made of their data. The formal consultation yielded insight that many were unaware that any of their data was collected
and used for monitoring or tracking purposes, and it has been clear that there is significant unease regarding the ways in which personal data
might be used. Managing the communication process has required a sensitive balance - with the need to both reveal longstanding and ongoing
practices, and, at the same time, to clarify and reassure students of the purposes, boundaries and potential benefits. Alongside the publication
and communication of the policy, further staff resources were provided to ensure a practical understanding of how existing and planned activities
might be impacted.
In retrospect, the emergence of the new policy has been both more straightforward and more complex than anticipated. Key to its creation and
implementation has been the emergence of a high level champion, the input of a fully representative set of stakeholders and the awareness that
the very creation of a policy flagging the activities undertaken is akin to opening a Pandora’s box. Whilst the existence of the policy offers some
sense of moral comfort for the University, it is clear that ongoing work will be needed to ensure that the changing legislative environment is
adhered to and that the student voice continues to be heard and reflected.
GENERAL PRINCIPLES FOR A NEW DEAL ON STUDENT DATA IN LEARNING ANALYTICS
So far we have attempted to open up the constructs of student data, student responsibility and their involvement in learning analytics and to look
at a case study of an institutional response to the complexities and practicalities of the use of student data in teaching and learning contexts.
If HEIs are to successfully rethink students’ roles and participation in learning analytics, they must accept a new dispensation where students are
no longer (only) the producers of data and data objects, but participating agents. Prinsloo and Slade (2015a) propose a number of principles for
HEIs to move beyond a paternalistic approach to the collection, analysis and use of student data, to a discursive-disclosive approach (Stoddart,
2012). In discussing Prinsloo and Slade’s (2015a) suggested principles, we specifically focus on the agency and responsibility of students.
The Duty of Reciprocal Care
The asymmetrical power relationship between students and HEIs necessitates that we seriously consider the ethical dimensions and impact of
this asymmetry. With HEIs currently cast in the role of provider, it is clear that the imperative for an ethical approach to the use of student data
lies within the locus of control of HEIs. As students become increasingly aware of and concerned about the collection and use of their data (Slade
& Prinsloo, 2013), we argue that HEIs have a fiduciary duty to respond.
HEIs can, amongst other things, make their TOCs more accessible and understandable; make clear the scope, timing, use, governance and
conditions of sharing student data and information; and provide students with recourse in case of privacy breaches. The TOCs should, however,
also reimagine the role and responsibility of students to not only empower students with actionable information, but also allow them to verify
data, provide context, and provide additional information.
The Contextual Integrity of Student Data
Earlier we established the importance of context in considering the collection, analysis and use of student data. Student engagement on an
institutional LMS does not represent a holistic picture of students’ learning and/or progress. Any measure of LMS activity such as students’
sharing of data and their time-on-task, number of clicks and page views should also consider the embedded rules, conventions and norms of
engagement.
Should data from students’ activity outside of the LMS be collected and analysed, it is crucial to remember that information shared in one context and for a specific purpose may not be transferable to other contexts (see Nissenbaum, 2010). While there are increasing claims
regarding the potential of combining different sources and databases, and for the repurposing of data sets (Long, 2014), we cannot and should
not disregard the social contract between higher education and its students.
Rethinking Student Agency and Consent
In seeking to move toward students as active agents, we should first rethink the ways in which students are more practically able to become
involved in their HEIs’ approaches to learning analytics. One perspective on this might be to examine how students are able to grant consent to
their involvement. We suggest that effective privacy self-management requires a more nuanced approach than that of the simple binary of opting in
or out. Students may, for example, agree to share data in specific contexts or may agree to share specific aspects of their learning experiences if
they are given a clear understanding of the details of data collection, e.g., the scope of data collected, who will have access, under which
conditions, for how long and for what purpose (Prinsloo & Slade, 2015a). (Also see Solove, 2013). Prinsloo and Slade (2015a) suggest that, despite
some residual concerns, nudging students to share their data with clear guidance on the purpose and benefits of such sharing would be an
improvement on current practice in learning analytics where students largely don’t know and have no choice.
Adjusting Privacy’s Timing and Focus
Accepting the importance of contextual integrity and guarding against context-collapse (Nissenbaum, 2010), we suggest that students should be
informed regarding aspects such as how long their data will be kept in personalised forms, and for what purposes their data may be used and shared downstream. While students may provide consent at the time of the collection of their data, they may have legitimate concerns regarding other uses of their data outside of the original context and timing of the initial collection. Prinsloo and Slade (2015a) therefore suggest that HEIs should provide students with ‘a range of options such as outright restrictions, partial consent which may depend on the scope, context and timing, and permission to harvest and use data, with an option to later revoke consent or change the scope of consent depending on the context or circumstances’ (p. 90).
Moving Toward Substance Over Neutrality
Amidst tensions between paternalistic or rule-based considerations of user data privacy and discursive-disclosive approaches (Stoddart, 2012),
there is a need for clear and enforceable legislation and regulatory frameworks to prevent gross misuse and unfair practices. Alongside this, it
might be argued, there is also a need for flexible, case-by-case and context-appropriate guidelines. Given that ordinary users are unlikely to have
the resources or access to legal advice to enforce due process if feeling aggrieved or if rights to privacy and data protection are breached, the
introduction of institutional guidelines offers some protection against the potential for harm.
From Quantified Selves to Qualified Selves
In the broader context of a quantification fetish (Prinsloo, 2014) in higher education, where there is ever greater potential to collect, analyse and use student data, it is crucial that we keep in mind that students are more than their data (e.g., Carney, 2013). The danger is that students become quantified
selves based on the number of their log-ins, clicks, downloads, time-on-task and various other data points (Prinsloo & Slade, 2015b). Ideally we
should aim to move from quantified selves to qualified selves (e.g., Boam & Webb, 2015; Davies, 2013; Li, Dey, & Forlizzi, 2011; Lupton, 2014a,
2014b; Prinsloo & Slade, 2015b). ‘Where the quantified self gives us the raw numbers, the qualified self completes our understanding of those numbers’ (Boam & Webb, 2015, par. 8). Our students are therefore much more than just conglomerates of quantifiable data (e.g., Lupton, 2014b) and it is important that we ‘take into account the contexts in which numbers are created’ (Lupton, 2014b, p. 6).
CONCLUSION
There is a danger that amidst the hype regarding Big Data in general, and specifically learning analytics in the context of higher education, we
don’t acknowledge a number of blind spots (Selwyn & Facer, 2013). In this chapter we have attempted to highlight some of these and to warn that
‘here be dragons.’ Despite those dragons, the blind spots and a myriad of as yet unresolved issues with regard to privacy and the ethical
implications of learning analytics, we are of the firm opinion that higher education cannot afford (often literally) not to collect, analyse and use
student data (Slade & Prinsloo, 2014).
Given that higher education is a moral endeavour and that higher education has a fiduciary duty towards caring for students, it is clear that the
issue is not whether higher education should collect, analyse and use student data, but under what conditions.
Although we would suggest that there are some clear principles which may be considered in involving students in the collection, analysis and use
of their data, we would also acknowledge that there are a number of remaining uncertainties and practicalities that deserve exploration and
further investigation. Some of these include consideration of the implications and scalability of allowing students to opt in to some data collection
and analyses, while opting out of others. What are the implications when students opt in, and then after a period of time, decide to opt out?
Students’ digital dossiers are only a part of their learning trajectories and provide but glimpses of their activities in an increasingly open and
recursive system where their engagement or disengagement flows from mutually constitutive and interdependent variables. It is clear that our
assumptions about time-on-task, number of clicks and page views are nothing more than peeks into the unknown and that there are still many uncharted territories in student learning.
This work was previously published in Developing Effective Educational Experiences through Learning Analytics edited by Mark Anderson
and Collette Gavan, pages 170-188, copyright year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Altbach, P. G., Reisberg, L., & Rumbley, L. E. (2009). Trends in global higher education: tracking an academic revolution. A report prepared for
the UNESCO World Conference on Higher Education. Paris: UNESCO. Retrieved from
http://atepie.cep.edu.rs/public/Altbach,_Reisberg,_Rumbley_Tracking_an_Academic_Revolution,_UNESCO_2009.pdf
Antón, A. I., & Earp, J. B. (2004). A requirements taxonomy for reducing web site privacy vulnerabilities. Requirements Engineering , 9(3), 169–
185. doi:10.1007/s00766-003-0183-z
Barnes, T. J. (2013). Big Data, little history. Dialogues in Human Geography , 3(3), 297–302. doi:10.1177/2043820613514323
Bellman, S., Johnson, E. J., & Lohse, G. L. (2001). On site: to opt-in or opt-out?: it depends on the question. Communications of the ACM , 44(2),
25–27. Retrieved from http://dl.acm.org/citation.cfm?id=359241 doi:10.1145/359205.359241
Biesta, G. (2007). Why “what works” won’t work: Evidence-based practice and the democratic deficit in educational research.Educational
Theory , 57(1), 1–22. doi:10.1111/j.1741-5446.2006.00241.x
Biesta, G. (2010). Why ‘what works’ still won’t work: From evidence-based education to value-based education. Studies in Philosophy and
Education , 29(5), 491–503. doi:10.1007/s11217-010-9191-x
Boam, E., & Webb, J. (2015). The qualified self: going beyond quantification [Blog comment]. Retrieved from
http://designmind.frogdesign.com/articles/the-qualified-self-going-beyond-quantification.html
Bowen, E. G., & Lack, K. A. (2013). Higher education in the digital age . Princeton, N.J.: Princeton University Press. doi:10.1515/9781400847204
Boyd, D., & Crawford, K. (2012). Critical questions for Big Data.Information Communication and Society , 15(5), 662–679.
doi:10.1080/1369118X.2012.678878
Boyd, D., & Crawford, K. (2013). Six provocations for Big Data. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431
Bozdag, E. (2013). Bias in algorithmic filtering and personalization. Ethics and Information Technology , 15(3), 209–227. doi:10.1007/s10676-
013-9321-6
Brian, S. (2015). The unexamined life in the era of big data: toward a UDAAP for data. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?
abstract_id=2533068
Cabral, S. R., Castolo, J. C. G., & Gallardo, S. C. H. (2015, January). Algorithm to optimize a fuzzy model of students’ academic achievement in
higher education. In Congreso Virtual sobre Tecnología. Education et Sociétés, 1(4).
Campbell, J., Chang, V., & Hosseinian-Far, A. (2015). Philosophising data: A critical reflection on the ‘hidden’ issues. [IJOCI]. International
Journal of Organizational and Collective Intelligence , 5(1), 1–15. doi:10.4018/IJOCI.2015010101
Carney, M. (2013). You are your data: the scary future of the quantified self movement [Blog comment]. Retrieved from
http://pando.com/2013/05/20/you-are-your-data-the-scary-future-of-the-quantified-self-movement/
Christensen, C. (2008). Disruptive innovation and catalytic change in higher education. Forum for the Future of Higher Education.EDUCAUSE.
Retrieved from https://net.educause.edu/ir/library/pdf/ff0810s.pdf
Clow, D. (2013b, November 13). Looking harder at Course Signals [Blog comment]. Retrieved from http://dougclow.org/2013/11/13/looking-
harder-at-course-signals/
Crawford, K. (2013, April 1). The hidden biases in big data. [Web log comment]. Harvard Business Review. Retrieved from
http://blogs.hbr.org/cs/2013/04/the_hidden_biases_in_big_data.html
Crawford, K., & Schultz, J. (2014). Big data and due process: Toward a framework to redress predictive privacy harms. BCL Rev. , 55, 93–128.
Danaher, J. (2014). Rule by algorithm? Big data and the threat of algocracy. Institute for Ethics and Emerging Technologies. [Blog comment].
Retrieved from http://philosophicaldisquisitions.blogspot.com/2014/01/rule-by-algorithm-big-data-and-threat.html
Daniel, B. (2015). Big Data and analytics in higher education: Opportunities and challenges. British Journal of Educational Technology , 46(5),
904–920. doi:10.1111/bjet.12230
Davies, J. (2013, March 13). The qualified self [Blog comment]. Retrieved from http://thesocietypages.org/cyborgology/2013/03/13/the-
qualified-self/
De Zwart, H. (2014, May 5). During World War II, we did have something to hide. (Translated by Benjamin van Gaalen). Retrieved from
https://www.bof.nl/2015/04/30/during-world-war-ii-we-did-have-something-to-hide/
Diakopoulos, N. (2014a). Algorithmic Accountability Reporting: On the Investigation of Black Boxes. Tow Center. Retrieved from
http://www.nickdiakopoulos.com/wp-content/uploads/2011/07/Algorithmic-Accountability-Reporting_final.pdf
Diaz, V., & Brown, M. (2012). Learning analytics. A report on the ELI focus session. Retrieved from
http://net.educause.edu/ir/library/PDF/ELI3027.pdf
Durall, E., & Gros, B. (2014). Learning analytics as a metacognitive tool. Retrieved from
https://files.ifi.uzh.ch/stiller/CLOSER_2014/CSEDU/CSEDU/Information_Technologies_Supporting_Learning/Short
Papers/CSEDU_2014_152_CR.pdf
Earp, J. B., Antón, A. I., Aiman-Smith, L., & Stufflebeam, W. H. (2005). Examining Internet privacy policies within the context of user privacy values. IEEE Transactions on Engineering Management, 52(2), 227–237.
Essa, A. (2013). Can we improve retention rates by giving students chocolates? [Blog comment]. Retrieved from
http://alfredessa.com/2013/10/can-we-improve-retention-rates-by-giving-students-chocolates/
Eubanks, V. (2015, April 30). The policy machine. Slate Magazine. Retrieved from
http://www.slate.com/articles/technology/future_tense/2015/04/the_dangers_of_letting_algorithms_enforce_policy.html
Eynon, R. (2013). The rise of Big Data: What does it mean for education, technology, and media research? Learning, Media and
Technology , 38(3), 237–240. doi:10.1080/17439884.2013.771783
Feldstein, M. (2013, November 6). Purdue’s non-answer on Course Signals [Blog comment]. Retrieved from http://mfeldstein.com/purdues-
non-answer-course-signals/
Friedman, B., & Nissenbaum, H. (1996). Bias in computer systems.ACM Transactions on Information Systems , 14(3), 330–347.
doi:10.1145/230538.230561
Gašević, D., Dawson, S., & Siemens, G. (2015). Let’s not forget: Learning analytics are about learning. TechTrends , 59(1), 64–71.
doi:10.1007/s11528-014-0822-x
Greenwood, D., Stopczynski, A., Sweat, B., Hardjono, T., & Pentland, A. (2015). The new deal on data: a framework for institutional controls. In
J. Lane, V. Stodden, S. Bender, & H. Nissenbaum (Eds), Privacy, big data, and the public good (pp. 192-210). New York, NY: Cambridge
University Press.
Guillaume, D. W., & Khachikian, C. S. (2011). The effect of time on task on student grades and grade expectations. Assessment & Evaluation in
Higher Education , 36(3), 251–261. doi:10.1080/02602930903311708
Gurses, S. (2014). Privacy and security. Can you engineer privacy? Viewpoints. Communications of the ACM , 57(8), 20–23. Retrieved from
http://dl.acm.org/citation.cfm?id=2632661.2633029 doi:10.1145/2633029
Harford, T. (2014). Big data: Are we making a big mistake? Financial Times, 28, 1–5.
Hartley, D. (1995). The ‘McDonaldization’ of higher education: Food for thought? Oxford Review of Education , 21(4), 409–423.
doi:10.1080/0305498950210403
Henman, P. (2004). Targeted! Population segmentation, electronic surveillance and governing the unemployed in Australia.International
Sociology , 19(2), 173–191. doi:10.1177/0268580904042899
Hillman, N. W., Tandberg, D. A., & Fryar, A. H. (2015). Evaluating the impacts of “new” performance funding in higher education.Educational
Evaluation and Policy Analysis .
Kalhan, A. (2013). Immigration policing and federalism through the lens of technology, surveillance, and privacy. Ohio State Law Journal , 74,
1105–1165.
Kerr, I., & Barrigar, J. (2012). Privacy, identity and anonymity . In Ball, K., Haggerty, K. D., & Lyon, D. (Eds.), Routledge Handbook of
Surveillance Studies (pp. 386–394). Abingdon, UK: Routledge. doi:10.4324/9780203814949.ch4_1_c
Kitchen, R. (2014a). The data revolution. Big data, open data, data infrastructures and their consequences . London, UK: SAGE.
doi:10.4135/9781473909472
Kitchen, R. (2014b). Thinking critically about and researching algorithms. The Programmable City Working Paper 5. Retrieved from
http://eprints.maynoothuniversity.ie/5715/
Koponen, J. M. (2015, April 18). We need algorithmic angels [Blog comment]. TechCrunch. Retrieved from
http://techcrunch.com/2015/04/18/we-need-algorithmic-angels/
Kovanović, V., Gašević, D., Dawson, S., Joksimović, S., Baker, R. S., & Hatala, M. (2015, March). Penetrating the black box of time-on-task estimation. Proceedings of the Fifth International Conference on Learning Analytics and Knowledge (pp. 184-193). ACM. doi:10.1145/2723576.2723623
Kruse, A., & Pongsajapan, R. (2012). Student-centered learning analytics. Retrieved from
https://cndls.georgetown.edu/m/documents/thoughtpaper-krusepongsajapan.pdf
Kump, B., Seifert, C., Beham, G., Lindstaedt, S. N., & Ley, T. (2012, April). Seeing what the system thinks you know: Visualizing evidence in an open learner model. Proceedings of the 2nd International Conference on Learning Analytics and Knowledge (pp. 153-157). ACM. doi:10.1145/2330601.2330640
Lane, J., Stodden, V., Bender, S., & Nissenbaum, H. (Eds.). (2014).Privacy, big data, and the public good . New York, NY: Cambridge University
Press. doi:10.1017/CBO9781107590205
Li, I., Dey, A. K., & Forlizzi, J. (2011, September). Understanding my data, myself: Supporting self-reflection with ubicomp technologies. Proceedings of the 13th International Conference on Ubiquitous Computing (pp. 405-414). ACM. doi:10.1145/2030112.2030166
Lohr, S. (2015, April). How much should humans intervene with the wisdom of algorithms [Blog comment]. Retrieved from
http://www.afr.com/technology/how-much-should-humans-intervene-with-the-wisdom-of-algorithms-20150407-1mfqiw
Long, S. (2014, February 20). ‘Re-purposing data’ in the digital humanities [Blog comment]. Retrieved from
https://technaverbascripta.wordpress.com/2014/02/20/re-purposing-data-in-the-digital-humanities/
Lupton, D. (2014a, July 28). Beyond the quantified self: the reflexive monitoring self [Blog comment]. Retrieved from
https://simplysociology.wordpress.com/2014/07/28/beyond-the-quantified-self-the-reflexive-monitoring-self/
Lupton, D. (2014b). You are your data: self-tracking practices and concepts of data. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?
abstract_id=2534211
Marx, G. T. (2001). Murky conceptual waters: The public and the private. Ethics and Information Technology , 3(3), 157–169.
doi:10.1023/A:1012456832336
Mayer-Schönberger, V., & Cukier, K. (2013). Big data. A revolution that will transform how we live, work, and think . New York, N.Y.: Houghton
Mifflin Harcourt Publishing Company.
Mehta, P. (2015, March 12). Big Data’s radical potential. Jacobin Magazine. Retrieved from https://www.jacobinmag.com/2015/03/big-data-
drones-privacy-workers/
Miyazaki, D., & Fernandez, A. (2000). Internet privacy and security: An examination of online retailer disclosures. Journal of Public Policy &
Marketing , 19(1), 54–61. doi:10.1509/jppm.19.1.54.16942
Morozov, E. (2013a, October 23). The real privacy problem. MIT Technology Review. Retrieved from
http://www.technologyreview.com/featuredstory/520426/the-real-privacy-problem/
Murphy, K. (2014, October 4). We want privacy, but can’t stop sharing. The New York Times [Blog comment]. Retrieved from
http://www.nytimes.com/2014/10/05/sunday-review/we-want-privacy-but-cant-stop-sharing.html
Napoli, P. (2013). The algorithm as institution: Toward a theoretical framework for automated media production and consumption. Proceedings
of the Media in Transition Conference (pp. 1–36). doi:10.2139/ssrn.2260923
New Media Consortium. (2015). Horizon report higher education edition. Retrieved from https://net.educause.edu/ir/library/pdf/HR2015.pdf
Nissenbaum, H. (2010). Privacy in context. Technology, policy, and the integrity of social life . Stanford, CA: Stanford Law Books.
Ohm, P. (2010). Broken promises of privacy: Responding to the surprising failure of anonymisation. UCLA Law Review. University of California,
Los Angeles. School of Law , 57, 1701–1777. Retrieved from http://heinonline.org/HOL/Page?
handle=hein.journals/uclalr57&div=48&g_sent=1&collection=journals
Ohm, P. (2015). Changing the rules: general principles for data use and analysis . In Lane, J., Stodden, V., Bender, S., & Nissenbaum, H.
(Eds.), Privacy, big data, and the public good (pp. 96–111). New York, NY: Cambridge University Press.
Open University. (2014). Policy on ethical use of student data for learning analytics. Retrieved from
http://www.open.ac.uk/students/charter/essential-documents/ethical-use-student-data-learning-analytics-policy
Pardo, A., & Kloos, C. D. (2011, February). Stepping out of the box: Towards analytics outside the learning management system. Proceedings of the 1st International Conference on Learning Analytics and Knowledge (pp. 163-167). ACM. doi:10.1145/2090116.2090142
Pardo, A., & Siemens, G. (2014). Ethical and privacy principles for learning analytics. British Journal of Educational Technology ,45(3), 438–450.
doi:10.1111/bjet.12152
Pariser, E. (2011). The filter bubble. What the Internet is hiding from you . London, UK: Viking.
Pasquale, F. (2015). The black box society: the secret algorithms that control money and information . London, UK: Harvard University Press.
doi:10.4159/harvard.9780674736061
PewResearchCenter. (2014). Public perceptions of privacy and security in the post-Snowden era. Retrieved from
http://www.pewinternet.org/2014/11/12/public-privacyperceptions/
Prinsloo, P. (2009). Modelling Throughput at Unisa: The key to the successful implementation of ODL. Retrieved from
http://uir.unisa.ac.za/handle/10500/6035
Prinsloo, P., Archer, E., Barnes, G., Chetty, Y., & Van Zyl, D. (2015). Big (ger) data as better data in open distance learning. The International
Review of Research in Open and Distributed Learning , 16(1), 284–306.
Prinsloo, P., & Slade, S. (2013, April). An evaluation of policy frameworks for addressing ethical considerations in learning analytics. Proceedings of the Third International Conference on Learning Analytics and Knowledge (pp. 240-244). ACM. doi:10.1145/2460296.2460344
Prinsloo, P., & Slade, S. (2014). Educational triage in open distance learning: Walking a moral tightrope. The International Review of Research in
Open and Distributed Learning , 15(4), 306–331.
Prinsloo, P., & Slade, S. (2015b). Student vulnerability, agency and learning analytics: an exploration. Presented at the EP4LA workshop
during the Fifth International Conference on Learning Analytics and Knowledge.
Prinsloo, P., & Slade, S. (2015a, March). Student privacy self-management: Implications for learning analytics. Proceedings of the Fifth International Conference on Learning Analytics and Knowledge (pp. 83-92). ACM. doi:10.1145/2723576.2723585
Rymarczuk, R., & Derksen, M. (2014). Different spaces: Exploring Facebook as heterotopia. First Monday , 19(6). Retrieved from
http://firstmonday.org/ojs/index.php/fm/article/view/5006/4091doi:10.5210/fm.v19i6.5006
Selwyn, N. (2015). Data entry: Towards the critical study of digital data and education. Learning, Media and Technology, 40(1).
Selwyn, N., & Facer, K. (Eds.). (2013). The politics of education and technology . New York: Palgrave Macmillan. doi:10.1057/9781137031983
Shirky, C. (2014, January 29). The end of higher education’s golden age [Blog comment]. Retrieved from
http://www.shirky.com/weblog/2014/01/there-isnt-enough-money-to-keep-educating-adults-the-way-were-doing-it/
Siemens, G. (2013). Learning analytics: The emergence of a discipline. The American Behavioral Scientist , 57(10), 1380–1400.
doi:10.1177/0002764213498851
Siemens, G., & Long, P. (2011). Penetrating the fog: Analytics in learning and education. EDUCAUSE Review . Retrieved from
http://www.educause.edu/ero/article/penetrating-fog-analytics-learning-and-education
Slade, S., & Prinsloo, P. (2013). Learning analytics: Ethical issues and dilemmas. The American Behavioral Scientist , 57(10), 1510–1529.
doi:10.1177/0002764213479366
Solove, D. J. (2001). Privacy and power: Computer databases and metaphors for information privacy. Stanford Law Review , 53(6), 1393–1462.
doi:10.2307/1229546
Solove, D. J. (2004). The digital person. Technology and privacy in the information age . New York, NY: New York University Press.
Solove, D. J. (2013). Introduction: Privacy self-management and the consent dilemma. Harvard Law Review, 1880. GWU Legal Studies Research
Paper No. 2012-141. Retrieved from http://ssrn.com/abstract=2171018
Stiles, R. J. (2012). Understanding and managing the risks of analytics in higher education: A guide. EDUCAUSE. Retrieved from
https://net.educause.edu/ir/library/pdf/EPUB1201.pdf
Stoddart, E. (2012). A surveillance of care. Evaluating surveillance ethically . In Ball, K., Haggerty, K. D., & Lyon, D. (Eds.), Routledge Handbook
of Surveillance Studies (pp. 369–376). Abingdon, UK: Routledge. doi:10.4324/9780203814949.ch4_1_a
Subotzky, G., & Prinsloo, P. (2011). Turning the tide: A socio-critical model and framework for improving student success in open distance
learning at the University of South Africa. Distance Education , 32(2), 177–193. doi:10.1080/01587919.2011.584846
Tait, A. (2015). Student success in open, distance and e-learning. ICDE Report Series. Retrieved from
http://icde.org/admin/filestore/Resources/Studentsuccess.pdf
Tempelaar, D. T., Rienties, B., & Giesbers, B. (2014). In search for the most informative data for feedback generation: Learning Analytics in a
data-rich context. Computers in Human Behavior. Retrieved from http://www.sciencedirect.com/science/article/pii/S0747563214003240
Tene, O., & Polonetsky, J. (2012). Big data for all: Privacy and user control in the age of analytics. Northwestern Journal of Technology and
Intellectual Property, 239, 1–36. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2149364
Uprichard, E. (2015, February 12). Philosophy of data science - Most big data is social data – the analytics need serious interrogation. Impact of
Social Sciences. Retrieved from http://blogs.lse.ac.uk/impactofsocialsciences/2015/02/12/philosophy-of-data-science-emma-uprichard/
Van Rijmenam, M. (2013, April 30). Big data will revolutionise learning [Blog comment]. Retrieved from
http://smartdatacollective.com/bigdatastartups/121261/big-data-will-revolutionize-learning
Wagner, E., & Ice, P. (2012, July 18). Data changes everything: delivering on the promise of learning analytics in higher
education. EDUCAUSEreview. Retrieved from http://www.educause.edu/ero/article/data-changes-everything-delivering-promise-learning-
analytics-higher-education
Watters, A. (2013, October 13). Student data is the new oil: MOOCs, metaphor, and money [Blog comment]. Retrieved from
http://www.hackeducation.com/2013/10/17/student-data-is-the-new-oil/
Wishon, G. D., & Rome, J. (2012, 13 August). Enabling a data-driven university. EDUCAUSE review. Retrieved from
http://www.educause.edu/ero/article/enabling-data-driven-university
ENDNOTES
1 Though data, as the plural of datum, has traditionally been treated as plural, data is increasingly used to refer to a single phenomenon and is therefore treated as singular. In this chapter we therefore refer to data as singular unless it is used as plural in quotations.
Section 6
Managerial Impact
The 7 chapters within this section present contemporary coverage of the social implications of Big Data, more specifically related to the corporate
and managerial utilization of information sharing technologies and applications, and how these technologies can be extrapolated to be used in
Big Data. Equally as crucial, chapters within this section discuss how leaders can utilize Big Data applications to get the best outcomes from their
governors and their citizens.
CHAPTER 80
Business Process Improvement through Data Mining Techniques:
An Experimental Approach
Loukas K. Tsironis
University of Macedonia, Greece
ABSTRACT
The chapter proposes a general methodology on how to use data mining techniques to support total quality management, especially in relation to the quality tools. The effectiveness of the proposed general methodology is demonstrated through its application. The goal of this chapter is to build the 7 new quality tools based on the rules that are “hidden” in the raw data of a database and, by evaluating the results, to propose solutions and actions that will lead the organization under study to improve its business processes. Four popular data-mining approaches (rough sets, association rules, classification rules and Bayesian networks) were applied to a set of 12,477 case records concerning vehicle damages. The set of rules and patterns produced by each algorithm was used as input in order to dynamically form each of the quality tools. This enables the quality tools to be created starting from the raw data and passing through the stage of data mining, using automatic software.
INTRODUCTION
Nowadays, a rapid increase in the amount of data stored in electronic form within organisations can be observed. This data constitutes the “historical file” of any process or activity that has taken place in the past and has been digitally recorded, and it prompts each interested analyst to extract the useful information that is “hidden” inside. The application of the mined knowledge to theoretical practices can lead analysts to a set of actions that will bring about the optimization of the organization’s processes.
Within the scope of this chapter, the mined knowledge is represented by the data mining techniques, while the theoretical practices are represented by the quality tools, as main components of Total Quality Management (T.Q.M.).
The present work intends to demonstrate the applicability of data mining techniques in the formation of the quality tools. More specifically, the aim is the implementation of an automatic application that feeds the quality tools with a specific type of information derived from the results of data mining techniques applied to raw data. The final goal is to reveal the sources of problems and to provide the likely solutions that will lead to the improvement of the business processes.
The quality tools are chosen as a guide for process improvement because they are powerful, easy to use and simple to construct dynamically. Furthermore, they offer a better framework for quality management than other tools. Finally, they are suitable for pointing out the sources of a problem and its possible solutions (Kolarik, 1995).
In brief, the 7 new quality tools and their main functions are:
• Affinity Diagram: It concerns the systematization of large quantities of data into groups, according to some form of affinity (Kanji and Asher, 1996). The regrouping adds structure to a big and complicated subject, categorises it and leads to the determination of a problem (Dahlgaard, et al, 1998).
• Relationship Diagram: Its aim is the recognition, comprehension and simplification of complex relations (Dale, 1994).
• Systematic Diagram: The systematic diagram is a hierarchical graphic representation of the requisite steps towards the achievement of a
goal or project (tree diagram) (Dale, 1994). Its aim is the development of a sequence of steps, which compose the resolution of a problem
(Mizuno, 1988). Also, it has the ability to deconstruct a general problem into more specific ones, helping to understand their causes.
• Matrix Diagram: The matrix diagram aims to seek the clarification of relations between causes and effects (Dale, 1994). Moreover, it
detects the reasons behind problems during a productive process (Mizuno, 1988).
• Arrow Diagram: The arrow diagram is used to improve the planning of a development project and to maintain suitable control so that its goals will be achieved (Kanji and Asher 1996). Furthermore, the arrow diagram visualises the sequence of tasks that should be done until the final goal is reached (Lindsay and Petrick 1997).
• Process Decision Program Chart (PDPC): The process decision program chart helps to focus on the likely solutions that will lead to
the solution of a problem (Kanji and Asher 1996). It is mainly used for the planning of new or renewed actions which are complicated and it
determines the processes which should be used, taking into account the succession of the events and the likely consequences (Lindsay and
Petrick 1997).
• Matrix Data Analysis: The aim of matrix data analysis is the quantification of the data of the matrix diagram using methodologies of
data analysis.
On the other hand, the data mining techniques were selected as the most suitable solution to the problem when a vast amount of data has to be dealt with in a database. The main result of the data mining techniques is the creation of rules and patterns based on the raw data. These rules will
dynamically form the new quality tools.
Introducing the data mining techniques, it is worth pointing out that, apart from human resources, information constitutes the most precious resource of a modern organisation. Although information is a decisive factor for the achievement of operational objectives, it often remains stored in databases without being analysed (Witten and Frank, 2005). The process of recognising valid, novel, potentially useful and comprehensible patterns in data is called Knowledge Discovery in Databases (KDD) (Piatetsky-Shapiro et al, 1991).
Knowledge Discovery in Databases deals with the process of producing functional knowledge that is hidden in the data and cannot easily be extracted by an analyst. Although data mining substantially coincides with knowledge discovery in databases, it differs in that it concerns more the technical “part” of the discovery of knowledge. Consequently, data mining refers to the stage of knowledge discovery that consists of applied computational techniques which, under acceptable conditions and restrictions, produce a number of patterns and models from the data (Fayyad et al, 1996).
The detection of patterns in organisational data is not a new phenomenon. It was traditionally the responsibility of analysts, who generally used statistical methods. In the past few years, however, with the spread of computers and networks, enormous databases have been created whose data require new techniques for their analysis. Data mining covers precisely this field (Bose and Mahapatra, 2001). It is necessary to mention that most algorithms used in data mining intend to discover information that would be useful to human decision makers.
Au and Choi (1999) propose an information system architecture which couples total quality management principles with statistical process control and expert rules to support continuous monitoring of quality. The work recommends the use of machine learning to induce knowledge from data. Tsironis, et al (2005) demonstrated the applicability of machine learning tools in quality management by applying two popular machine learning tools (decision trees and association rules) to a set of production case records concerning the manufacture of an ISDN modem. Sung and Sang (2006) used conventional statistical and data mining techniques to identify customer voice patterns from data collected through a web-based voice of customer (VOC) process. Using this data, the system detected problematic areas where complaints had occurred. Cunha, Agard and Kusiak (2006) applied a data-mining approach to production data to determine the sequence of assemblies that minimises the risk of producing faulty products.
Our work is organised in sections. Section 1 presents the theoretical topics which explain why data mining techniques and the new quality tools can be successfully bound together, and presents a logical sequence of actions that leads from the raw data to the discovery of the sources of problems and their possible solutions. In section 2, the data of our case study is analysed: the data mining techniques (and the criteria for their selection) are described, together with the data transformation that was necessary so that the data could function as input to the chosen data mining techniques. Section 3 comprises a brief description of the developed application and the tools used to materialise it. Section 4 demonstrates the results of the application of the new quality tools and the solutions they “suggest”, and presents an evaluation of the data mining techniques. Section 5 concludes by underlining the importance of adopting data mining techniques in quality management, especially in organisations which possess a large amount of data in their digital databases, and presents a variety of data mining techniques which could be used for this purpose in relation to the type of data to be analysed.
RATIONALE
There is one specific reason that led to the choice of these quality tools: the compatibility of the results of data mining techniques (rules and patterns) with the type of information that is needed as input in order to create them. In particular, it is the “qualitative” orientation of the quality tools that makes their combination with the data mining techniques compatible.
More specifically, data mining algorithms produce a set of rules from the raw data. Having in mind which of the new quality tools needs to be constructed, an algorithm can be applied to these rules to dynamically create the desired quality tool.
The steps necessary to achieve the dynamic construction of the 7 new quality tools are:
2. Define the goal (what you are hoping to discover) of the analysis (“concept”, in terms of data mining).
3. Choose which fields you desire to associate (“attributes”, in terms of data mining).
4. Define the number of records that need to be processed (“instances”, in terms of data mining).
5. Select the appropriate algorithms that you want to use as data mining techniques.
6. Find out which types of data you have to deal with. This is compulsory because a lot of data mining techniques can work only with specific
types of data. Those data types can be nominal (description or name), ordinal (e.g. high, medium, low), interval (e.g. year) or ratio (e.g.
temperature).
7. Define the type of learning that you want to apply on the selected dataset (classification, association, clustering, numeric prediction).
8. Decide which type of pre-processing or transformation of the selected data is needed before the main process. This decision has to be taken
with regard to the type of the algorithm (data mining technique) that is chosen.
10. Via an automatic software application, read and obtain the generated rules and store them in a database.
11. Having chosen which of the seven quality tools you want to dynamically create, use the generated rules from the database as an input for
the quality tool and generate the selected quality tool with the use of a software script.
14. Based upon the “instructions”, take actions that will lead you to the desirable business improvement.
METHODOLOGY
The data under study concerns a construction company that also produces and distributes reinforced concrete. From 2005 onwards the company has maintained an electronic database in which it records problems occurring in the company's plant. Included are all types of vehicles that belong to and are used by the company, as well as those used in the production of reinforced concrete. There are a total of 12,477 recorded incidents for the time period 03-01-2005 until 08-11-2007.
The database of the machines’ problems is called “Machine History” and a brief description follows. For each fault in the machines, the following are recorded: the characteristics of the damage (machine name, machine type, machine operator, date of the damage, origin of the damage, solutions etc.); the workers of the organisation who fixed the damage (full name, hours of work etc.); the parts that were used for the repair (part type, part name, cost, items etc.); and the external collaborators who contributed to the repair, either by offering parts that did not exist in the organisation’s stores or by offering specialised personnel (collaborator name, cost of the work, cost of the parts).
For the data registration, a database was created in MySQL (a Database Management System, DBMS) with a basic table named basic and three secondary tables (workers, parts, outsource) for the storage of information concerning the workers, the parts and the external collaborators respectively. The secondary tables are connected relationally with the basic table through many-to-many (M-M) relations.
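The chapter does not list the exact table definitions, so the following is only a rough sketch of a schema of this shape, written in Python with the built-in sqlite3 module as a lightweight stand-in for the MySQL database actually used; the column names and the junction tables that realise the many-to-many relations are illustrative assumptions.

```python
# Rough sketch of the "Machine History" schema described above, using SQLite
# as a stand-in for MySQL. Column names and junction tables are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE basic (
    damage_id        INTEGER PRIMARY KEY,
    machine_name     TEXT,
    machine_type     TEXT,
    machine_operator TEXT,
    damage_date      TEXT,
    damage_origin    TEXT,
    solution         TEXT
);
CREATE TABLE workers   (worker_id  INTEGER PRIMARY KEY, full_name TEXT, hours_of_work REAL);
CREATE TABLE parts     (part_id    INTEGER PRIMARY KEY, part_type TEXT, part_name TEXT, cost REAL, items INTEGER);
CREATE TABLE outsource (partner_id INTEGER PRIMARY KEY, collaborator_name TEXT, work_cost REAL, parts_cost REAL);

-- junction tables realise the many-to-many relations to the basic table
CREATE TABLE basic_workers   (damage_id INTEGER, worker_id  INTEGER);
CREATE TABLE basic_parts     (damage_id INTEGER, part_id    INTEGER);
CREATE TABLE basic_outsource (damage_id INTEGER, partner_id INTEGER);
""")
conn.commit()
```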
We cope with nominal data, for instance a description or a name. Also, the number of attributes associated as input to the data mining algorithms is in most cases equal to 2. This can be achieved either “naturally”, when we want to associate only 2 attributes, or “artificially”, when we have more than 2 attributes to associate. In the latter case, the dataset is transformed by dividing an instance into more instances, until each new instance associates exactly 2 attributes. For example, if we have one instance with 4 attributes (a, b, c, d), then we create 3 equivalent instances: a - b, b - c, c - d. An essential condition for this transformation is the preservation of the meaning and sense of the associated data; consequently, the transformation is applied so that the exact meaning of the data is always retained. This is possible because each of our attributes depends on only one other attribute and/or only one influence. Apart from the aforementioned transformation, an additional conversion of the attribute values from nominal type to interval-ratio type is used only in the cases required by the data mining algorithms.
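As an illustration only (the chapter gives no code for this step), the sketch below implements the two transformations in Python, treating an instance as a simple tuple of attribute values; the function names and sample values are ours, not the authors’.

```python
# Minimal sketch of the two transformations described above: splitting an
# instance with more than two attributes into consecutive 2-attribute
# instances, and mapping nominal values to integer codes where an algorithm
# requires interval/ratio input. Names and sample data are illustrative.

def split_into_pairs(instance):
    """(a, b, c, d) -> [(a, b), (b, c), (c, d)]"""
    return [(instance[i], instance[i + 1]) for i in range(len(instance) - 1)]

def nominal_to_codes(values):
    """Map each distinct nominal value to an integer code."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

if __name__ == "__main__":
    print(split_into_pairs(("operator_misuse", "engine", "mechanic_A", "part_X")))
    print(nominal_to_codes(["high", "low", "medium", "low"]))
```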
Concerning the criteria on which the data mining algorithms were selected, it is underlined that the single, necessary condition for selecting an algorithm was that it would produce all the distinct rules contained in the dataset. More analytically, if, within the 12,477 records (cases of damage) registered in the database, an algorithm is unable to find even one rule that represents a single registration among the 12,477, then this algorithm is excluded from the process. Therefore, the selected algorithms produce all the rules contained in the selected data, even when a rule concerns only a few cases. The reason this condition was set was our ambition to orientate the selection of algorithms towards the total discovery of the rules “hidden” in the data, hoping to apply those data mining algorithms on future occasions where there is no tolerance for error.
Moreover, a positive factor in the selection of an algorithm was the clarity of the rules in the “output” of each algorithm, as well as the extra benefit of importance indexes accompanying the produced rules.
The Tertius algorithm produces association rules and is hosted in the Weka suite. Tertius performs rule exploration in the data based on an unsupervised technique (the consequent of each rule is not determined by the user) and on a first-order method (a method of inference in which variables are used instead of real values, aiming to find resemblances), relying on assumptions (Flach and Lachiche, 1999a).
The Prism algorithm produces classification rules and is hosted in the Weka suite. Cendrowska (1987) first presented the Prism algorithm. It is considered an evolution of the well-known decision tree algorithm ID3 (Quinlan, 1986). The basic difference from ID3 is that PRISM produces classification rules directly, rather than decision trees that are subsequently transformed into classification rules.
The Naive Bayes algorithm relies on the Bayesian networks framework and is hosted in the BayesiaLab software. Naive Bayes is a simple probabilistic classification algorithm. As used in BayesiaLab, it requires the designation of one node as the target node, which is considered the “parent” node of all the other nodes (child nodes). The advantages of Naive Bayes are its robustness and its fast response time. Its drawback is that it does not produce the rules directly at its output; they have to be created using one of the values the algorithm produces. This value, called the modal value, indicates the most probable value of a child node and the probability of its appearance given the specific value of the parent node. Knowing these probabilities, rules that encode them can be produced indirectly.
Moreover, the use of Naive Bayes requires an extra transformation of our data. This transformation becomes obvious from the following example: let us consider 2 attributes that we want to associate, a and b, with attribute b as the target node. Then, for each distinct value of attribute a, a child node is created which is named after that value. Hence, if attribute a contains 3 distinct values in total (e.g. a_value1, a_value2 and a_value3), then three child nodes (a_value1, a_value2 and a_value3) are created instead of one (a). For each of the new nodes, the value of an instance takes 0 or 1 depending on whether the value of the “former” node coincides with the name of the new node.
Table 1. The dataset after the required transformation for BayesiaLab
              a_value1   a_value2   a_value3   b
Instance 1    1          0          0          B_value1
Instance 2    0          1          0          B_value2
Instance 3    0          0          1          B_value2
Instance 4    1          0          0          B_value1
Instance 5    0          1          0          B_value1
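The expansion can be sketched as follows, assuming each record is a plain Python dictionary; the function name and the sample rows are illustrative and not taken from the chapter.

```python
# Minimal sketch of the BayesiaLab-oriented transformation described above:
# every distinct value of attribute "a" becomes its own 0/1 child-node column,
# while the target attribute "b" is kept unchanged. Sample data is invented.

def expand_attribute(rows, attr="a", target="b"):
    values = sorted({r[attr] for r in rows})
    expanded = []
    for r in rows:
        # one 0/1 column per distinct value, named after the value itself
        new_row = {v: int(r[attr] == v) for v in values}
        new_row[target] = r[target]
        expanded.append(new_row)
    return expanded

if __name__ == "__main__":
    data = [{"a": "a_value1", "b": "B_value1"},
            {"a": "a_value2", "b": "B_value2"},
            {"a": "a_value3", "b": "B_value2"}]
    for row in expand_attribute(data):
        print(row)
```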
The Holte’s 1R algorithm, from the Rosetta software, relies on Rough Set theory. Holte's 1R is a classification algorithm based on a variant of the 1R algorithm as presented by Robert C. Holte (Holte, 1993).
A covering algorithm, from the RSES2 software (RSES version 2), also relies on Rough Set theory; the covering algorithm is a classification algorithm (Bazan et al., 2000; Bazan et al., 2003). In general, a covering algorithm develops a “cover” over the entire data that is connected with each rule.
APPLICATION
We implement our approach as a Web database application. Communication with the database is accomplished via a web browser. In our application, the Apache HTTP server is used as the HTTP server, MySQL as the database management system (DBMS) and PHP as the server-side scripting language.
Figure 3. The interaction between MySQL, the Apache HTTP server and PHP
The utilization of MySQL, Apache and PHP is widespread because they are all free and have gained great popularity for their functionality. Moreover, their wide acceptance has created a large community of users that guarantees the continued growth of these applications. Meanwhile, improved versions of all three are continuously being developed.
Apart from the previous software, Flash technology has also been used (http://www.adobe.com/products/flash/) for better visualization of the results, together with the JavaScript programming language (a client-side scripting language) to make our web pages dynamic. The application offers the following forms:
1. The search form, which gives the user the opportunity to find a set of records or an individual record in the database and to analyse the specific data for that entry. Moreover, the user is able to create a multi-criteria query that combines the most important attributes of our database and to view the query's results; the record of interest then has to be selected to view its details.
2. The affinity diagram form, which is the start page for the dynamic creation of the most common of the seven new quality tools, the affinity diagram. On the front page, the user can select one of the five offered data mining techniques in order to mine the data and discover the hidden patterns. The available algorithms are Tertius, Prism, Naive Bayes, Holte's 1R (from Rosetta) and the covering algorithm (from RSES). It is worth noting at this point that the fields to be associated have already been selected (damage code and damage category), have been transformed, and are ready to be inserted as input to the selected data mining algorithms (Figure 4).
After the user has selected the algorithm of his choice, he is directed to a web page where the results of the data mining are presented (Figure 5). These results include the entire set of rules that the algorithm extracted from the selected raw data, together with some extra information (importance indexes and extra information about the dataset). At the end of the page there is a link button that, when clicked, activates a script. The script scans the whole page, isolates the rules from the other information and stores them in the database. The database table in which the rules are stored has three attributes: parent, child and number. The attributes are named parent and child because, if a rule is deconstructed into its basic elements, those elements can be reconnected using a parent-child relation. The attribute number stores how many times the specific rule is satisfied in our dataset.
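A minimal sketch of this storage step is given below. The chapter does not specify the textual format in which each mining tool prints its rules, so the "IF parent THEN child (count)" pattern and the sample rules are invented for illustration; only the parent/child/number schema follows the description above, and SQLite again stands in for MySQL.

```python
# Sketch of the rule-storage step: parse rules out of a result page and store
# them as (parent, child, number). The rule format here is a hypothetical
# stand-in for whatever each mining tool actually prints.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (parent TEXT, child TEXT, number INTEGER)")

RULE_RE = re.compile(r"IF\s+(.+?)\s+THEN\s+(.+?)\s+\((\d+)\)")

def store_rules(page_text):
    for parent, child, number in RULE_RE.findall(page_text):
        conn.execute("INSERT INTO rules VALUES (?, ?, ?)", (parent, child, int(number)))
    conn.commit()

store_rules("""
IF operator_misuse THEN engine_damage (523)
IF defective_part THEN engine_damage (48)
""")
print(conn.execute("SELECT * FROM rules").fetchall())
```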
Figure 5. The result page of the RSES software concerning the affinity diagram
When the script has accomplished its task, the user is directed to the web page of the affinity diagram production (Figure 6), where the dynamically created affinity diagram can be seen (its creation is based on the rules that have just been inserted into the database). The script which implements the affinity diagram works as follows: knowing that the affinity diagram concerns the grouping and categorisation of data, the script selects the distinct values of the parent field from the database and creates as many columns as there are distinct values. Then, each column is filled with the children that are connected with the specific parent. With this technique, the affinity diagram is created.
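The authors' script is written in PHP; the following Python sketch only illustrates the grouping logic described above, using invented rule tuples.

```python
# Sketch of the affinity-diagram logic: one column per distinct parent,
# filled with the children connected to that parent. Rule data is invented.
from collections import defaultdict

def affinity_diagram(rules):
    """rules: iterable of (parent, child, number) tuples."""
    columns = defaultdict(list)
    for parent, child, _ in rules:
        columns[parent].append(child)
    return dict(columns)

if __name__ == "__main__":
    rules = [("operator_misuse", "engine_damage", 523),
             ("operator_misuse", "gearbox_damage", 95),
             ("defective_part", "engine_damage", 48)]
    for parent, children in affinity_diagram(rules).items():
        print(parent, "->", children)
```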
Apart from producing the affinity diagram, the application gives the opportunity to validate the results by executing a fixed query directly against the raw data, which yields the same results (rules) as our data-mining algorithm. In addition, the application demonstrates a prediction on a new dataset, based on the rules currently stored.
The above procedure is the main procedure we follow in constructing the rest of the quality tools. The only difference, as far as the development of the application is concerned, has to do with the scripts used to create each quality tool. All the sub-procedures remain the same.
3. The relationship diagram form. Following the same methodology as for the affinity diagram, the user selects one of the 5 available algorithms to perform the data mining. After the algorithm is selected, the user is directed to a web page where he can see the rules that are produced. If he proceeds further, he is directed to the web page where the constructed relationship diagram is presented. Here, it is important to underline that the application produces a relationship diagram that is equivalent to the relationship diagram in its classic form. In addition, the user can use the prediction and confirmation features just as with the affinity diagram.
The script which dynamically creates the relationship diagram works as follows. From the produced rules stored in the database, the script selects all the values of the field “child” that do not exist in the field “father” and builds the first nodes, since no relations lead from these nodes to others; these nodes create the first column. Then, the values of the field “father” whose “children” are the nodes of the first column are selected (except for nodes that have already been used), and they create the new nodes in the second column. With this technique, the script continues until it has examined all the “father”-“child” relations stored in the database.
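Again as an illustration rather than the authors' code, the column-by-column construction can be sketched as follows, reusing the (parent, child, number) tuples from the storage step above (the text calls the parent field "father"; the rule data is invented).

```python
# Sketch of the relationship-diagram construction: the first column holds
# child values that never occur as parents; each following column holds the
# not-yet-used parents of the previous column's nodes.

def relationship_columns(rules):
    parents = {p for p, _, _ in rules}
    children = {c for _, c, _ in rules}
    current = sorted(children - parents)        # nodes no relation leads from
    used, columns = set(current), [current]
    while True:
        nxt = sorted({p for p, c, _ in rules if c in current and p not in used})
        if not nxt:
            break
        used.update(nxt)
        columns.append(nxt)
        current = nxt
    return columns

if __name__ == "__main__":
    rules = [("worn_brakes", "road_accident", 12),
             ("operator_misuse", "worn_brakes", 30),
             ("operator_misuse", "engine_damage", 523)]
    print(relationship_columns(rules))
```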
4. The systematic diagram form. The application follows the same methodology for the creation of the systematic diagram. The only difference has to do with the script that automatically forms the systematic diagram, which works as follows: it selects the distinct values from the field “father”, places them in a list, and then selects the related “children” for each “father”.
5. The matrix diagram form. The only extra information needed to describe this part of the application is that, in the cases where the data-mining algorithm offers statistical data for the rules, the matrix diagram can also be considered a matrix data analysis, since it provides quantitative data.
As far as the script which dynamically creates the matrix diagram (an L-diagram, which associates 2 attributes) is concerned, it works as follows: it selects all the distinct values from the field “father” and the field “child” and places the former along the horizontal axis and the latter along the vertical axis. Thus, a two-dimensional table is created, where each element is marked or not with a tick (✓), depending on whether or not a connection between the particular pair of “father”-“child” values exists in the database.
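A sketch of the tick-table construction, with fathers along the horizontal axis and children along the vertical axis as described (illustrative Python with invented rule tuples, not the authors' PHP script):

```python
# Sketch of the L-type matrix diagram: a tick wherever a father-child pair
# occurs among the stored rules. Rule data is invented.

def matrix_diagram(rules):
    fathers = sorted({p for p, _, _ in rules})
    children = sorted({c for _, c, _ in rules})
    pairs = {(p, c) for p, c, _ in rules}
    lines = ["\t" + "\t".join(fathers)]           # fathers along the horizontal axis
    for c in children:                            # children along the vertical axis
        cells = ["✓" if (f, c) in pairs else "" for f in fathers]
        lines.append(c + "\t" + "\t".join(cells))
    return "\n".join(lines)

if __name__ == "__main__":
    rules = [("operator_misuse", "engine_damage", 523),
             ("defective_part", "engine_damage", 48),
             ("operator_misuse", "gearbox_damage", 95)]
    print(matrix_diagram(rules))
```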
6. The arrow diagram form. In this case, 2 types of arrow diagram are produced: one total arrow diagram, in which all the probable connections between the associated fields appear, and a set of distinct arrow diagrams, in which each distinct connection of the associated fields is placed in its own arrow diagram.
7. The process decision program chart form. As in the case of the relationship diagram, an equivalent process decision program chart is created, whose only difference from the classic form of the chart has to do with the visualization.
The production algorithm of the Process Decision Program Chart works as follows: it selects the distinct values of the field “child” that are not present in the field “father” and places them on the first line. Afterwards, it finds the “fathers” of those values and places them on a new line above, in a sequence that makes the connections between “fathers” and “children” obvious. The same process is followed until all the stored rules have been examined.
RESULTS
1. Affinity Diagram:
a. The majority of the damage cases happen due to the misuse of the machines by their operators.
b. Fewer damage cases are attributed to “defective” products from suppliers, to weather conditions unfavourable for the machines, and to bad road surfaces.
c. Consequently, the affinity diagram “suggests” a more careful use of the machines, in order to reduce the damages.
5. Arrow Diagram: The attribute “workers” concerns the personnel who worked on the repair of the “damage” (the collaborators’ workers or not) and the type of spare parts that were used for the repair (the organisation’s, the collaborators’, a combination of both, or even no spare parts). Figure 12 presents the set of distinct arrow diagrams that are produced, while Figure 13 presents the total arrow diagram. The arrow diagram presents the process that is followed, as far as the correlation between the origin of the spare parts used for the repair of the damage and the engineers (the organisation’s or the collaborators’) who worked on the damage is concerned.
The previous results relate to the actions that have to be taken by the organisation in order to move towards business process improvement. Apart from these results, conclusions can be drawn about the performance of the data mining algorithms that were employed. Firstly, all the data mining algorithms used satisfied the main condition, which was the discovery of all the potential rules that existed within the raw data.
Apart from this, the algorithms’ evaluation was based on the criteria shown in Table 3.
Table 3. The correlation of the algorithms with the selected criteria
Total of rules: ✓ ✓ ✓ ✓ ✓
Quantitative information: ✓ ✓ ✓
Rejection of unnecessary rules: ✓ ✓ ✓
Facility of exporting rules: ✓ ✓ ✓ ✓
Importance indexes: ✓ ✓ ✓ ✓
Low cost of transformation: ✓ ✓ ✓ ✓
Based on the criteria in Table 3, the following classification index could be formed (Table 4).
Table 4. Classification of the algorithms
1. Holte's 1R (Rosetta)
2. Covering (RSES2)
DISCUSSION
The seven (7) new quality tools produce observations and conclusions which could lead the organisation to the improvement of its business
processes and finally to its further growth and monitoring.
At this point, it has to be underlined that the five selected data mining algorithms are not the only suitable ones; many others surely exist which, with the appropriate transformation of the data and the right parameter settings, could produce the total set of rules “hiding” inside the raw data.
To summarise, it is worth advocating that, in organisations which possess an automatic production management system, the techniques of data mining and the tools of Total Quality Management can be combined into a total dynamic system of data analysis, which will produce, in real time, conclusions and proposals for the improvement of the business processes. This methodology will reduce the cost and the time required for data analysis. Especially for databases with vast amounts of data, data mining can be considered the best solution for data analysis.
This work was previously published in Automated Enterprise Systems for Maximizing Business Performance edited by Petraq Papajorgji,
François Pinet, Alaine Margarete Guimarães, and Jason Papathanasiou, pages 150-169, copyright year 2016 by Business Science Reference
(an imprint of IGI Global).
REFERENCES
Au, G., & Choi, I. (1999). Facilitating implementation of quality management through information technology. Information & Management, 36(6), 287–299. doi:10.1016/S0378-7206(99)00030-0
Bazan, J., Nguyen, H. S., Skowron, A., & Szczuka, M. (2003). A View on Rough Set Concept Approximations. In Proceedings of R.S.F.D.G.r.C., China. Springer. doi:10.1007/3-540-39205-X_23
Bazan, J. G., Nguyen, H. S., Nguyen, S. H., Synak, P., & Wróblewski, J. (2000). Rough set algorithms in classification problem. In Polkowski, L., Tsumoto, S., & Lin, T. (Eds.), Rough Set Methods and Applications (pp. 49–88). Heidelberg, Germany: Physica-Verlag. doi:10.1007/978-3-7908-1840-6_3
Bose, I., & Mahapatra, K. R. (2001). Business data mining – A machine learning perspective. Information & Management, 39(3), 211–225. doi:10.1016/S0378-7206(01)00091-X
Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349–370. doi:10.1016/S0020-7373(87)80003-2
Cunha, C. D. A., Agard, E., & Kusiak, A. (2006). Data mining for improvement of product quality. International Journal of Production Research, 44(18-19), 18–19. doi:10.1080/00207540600678904
Dahlgaard, J. J., Kristensen, K., & Kanji, K. (1998). Fundamentals of total quality management. London: Chapman & Hall. doi:10.1007/978-1-4899-7110-4
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Ramasami, U. (1996). Advances in Knowledge Discovery and Data Mining. Cambridge, MA: MIT Press.
Flach, P. A., & Lachiche, N. (1999a). Confirmation-Guided Discovery of first-order rules with Tertius. Machine Learning, 42(1-2), 61–95.
Holte, R. C. (1993). Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine Learning, 11(1), 63–90. doi:10.1023/A:1022631118932
Kanji, K. G., & Asher, M. (1996). 100 methods for total quality management. London: Sage Publications. doi:10.4135/9781446280164
Kolarik, J. W. (1995). Creating Quality: Concepts, Systems, Strategies and Tools. New York: McGraw-Hill.
Lindsay, M. W., & Petrick, A. J. (1997). Total quality and organization development. Florida: St. Lucie Press.
Mizuno, S. (1988). Management for quality improvement: The seven new QC tools. Cambridge, MA: Productivity Press.
Piatetsky-Shapiro, G., Frawley, W. J., & Matheus, C. (1991). Knowledge Discovery in Databases. A.A.A.I./MIT Press.
Sung, H. H., & Sang, C. P. (2006). Service quality improvement through business process management based on data mining. ACM SIGKDD Explorations Newsletter, 8(1), 49–56. doi:10.1145/1147234.1147242
Tsironis, L., Bilalis, N., & Moustakis, V. (2005). Using machine learning to support quality management: Framework and experimental investigation. The TQM Magazine, 17(3), 237–248. doi:10.1108/09544780510594207
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
CHAPTER 81
Big Collusion:
Corporations, Consumers, and the Digital Surveillance State
Garry Robson
Jagiellonian University, Poland
C. M. Olavarria
Université d’Avignon, France
ABSTRACT
In the post-Snowden digital surveillance era, insufficient attention has been paid to the role of corporations and consumers in the onslaught on
digital privacy by the largest surveillance state – the U.S. The distinction between corporations and the government is increasingly difficult to
pinpoint, and there exists an exclusive arrangement of data sharing and financial benefits that tends towards the annihilation of individual
privacy. Here the role of consumers in facilitating this alliance is examined, with consideration given to the “social” performances treated as free
and exploitable data-creating labor. While consumers of the digital economy often assume that everything should be free, the widespread
tendency to gratify desires online inevitably leads to hidden costs and consequences. The permanent data extracted from consumer behavior
helps agencies sort and profile individuals for their own agendas. This trilateral relationship of ‘Big Collusion’ seems to have gained an
irreversibly anti-democratic momentum, producing new transgressions of privacy without proper consent.
INTRODUCTION
The earliest correspondence between NSA contractor and whistleblower Edward Snowden and his initial primary source, documentary
filmmaker Laura Poitras, reads like something out of a work of spy fiction. One email in particular just prior to their now famed rendezvous at
that hotel lobby in Hong Kong clarifies with chilling effect the very nature of unfettered global digital surveillance, which no citizen (or head of
state) of any nation can escape and for which no official of the perpetrating governments has been held accountable. These words of Mr. Snowden
stand alone, without rival, as a description of the totality of undemocratic transgressions that will define this new era of mass digital surveillance
by states, and the collusive, unlawful offensives of their corporate partners:
From now, know that every border you cross, every purchase you make, every call you dial, every cell phone tower you pass, friend you keep,
article you write, site you visit, subject line you type, and packet you route, is in the hands of a system whose reach is unlimited but whose
safeguards are not. Your victimization by the NSA system means that you are well aware of the threat that unrestricted, secret abilities pose
for democracies. This is a story that few but you can tell (Poitras, 2014).
Opinion regarding Snowden the man, as hero or traitor (those simplistic divisions themselves reflecting simplistic and divided Anglo-American
political cultures) is irrelevant, though in his own words he claims “I am neither traitor nor hero; I’m an American” (Madison, 2014, p. 72).
Ultimately it is not the man who is on trial, but the secrets and threats he revealed. Many of those who seek to put the man on trial over his
revelations have proven to be at best state surveillance apologists, a group including corporate media interests, and at worst elected officials who
betray their electorates in favor of consolidating state power to protect the state and corporate enemies of democratic principles – foremost
privacy, individual autonomy and transparency.
While the greatest threats to U.S. democracy appear to be internal rather than external, as is often claimed in justification of the ethically
questionable and unlawful policies that the state continues to update to serve its own agenda (Shield, 2006, p. 23), the obvious reason for the
implementation of those policies is to undermine individuals or groups who question the very power they continue to consolidate: “At this
historical juncture there is a merging of violence and governance along with the systemic disinvestment in and breakdown of institutions and
public spheres that have provided the minimal conditions for democracy” (Giroux, 2014b, p. 41).
At one time the relationship between state and corporations in breaching the privacy of citizens was somewhat opaque, partly because of an
excess of public trust, partly because of persistent denial and obfuscation of facts by these unchecked powers. The two were even distinguished by
otherwise implicit Orwellian terms, with private industry playing “little brother” to the state’s “big brother” (Tambini, Leonardi & Marsden,
2008). Today there is no doubt, no question regarding the extent of their data-sharing and privacy-obliterating alliance. As Price (2014) observes,
“Snowden’s revelations reveal a world where the NSA is dependent on private corporate services for the outsourced collection of data, and where
the NSA is increasingly reliant on corporate owned data farms where the storage and analysis of the data occurs.” Following the wellsprings of big
data to their source leads to the data dissemination points of corporate communication towers and Internet Service Provider (ISP) warehouses or
collection points, with only “25 ISPs carrying around 80% of the global internet traffic” (Hathaway, 2014, p. 4). From there big data’s origins can
be traced back to the hard drives and smartphones of the average consumer; but this process must be taken even one step further to consider the
billions of terabytes being collected and stored by the corporate-state surveillance apparatus on a daily basis (Gellman & Poitras, 2014).
It is consumers who are, inherently, the producers of the digital content that forms the third and perhaps most important and least discussed
component of Big Collusion’s digital surveillance trifecta. The psychosocial desires of consumers motivate their digital behavior. Every search
query, social media performance, email, instant message, skype call, online purchase, application download and data packet sent from a digital
device is created from the behavioral motivations of digital consumers. But it would be unfair to consumers to ignore the role corporations play in
creating the conditions that motivate or control this behavior. From Facebook, Twitter, Google and smaller players of the ‘social web’, consumers
are inundated with claims about life enhancing tools that will enable them to ‘connect’, ‘share’ and ‘control’ their own user experience. All of this
rhetoric is there to support program design, which means that consumers are often being controlled unawares by the platforms they engage with.
A close examination of the construct of any Social Networking Service (SNS) or application would confound most users with its complex
assemblage of web code such as CSS, Javascript, PHP, XHTML, and arrays of syntax and highly developed algorithms that monitor habits,
searching for patterns to guide behavior and control the present and future actions of the user. The ultimate goal for the unseen armies of
engineers and developers is to extract from users the greatest and perhaps last remaining commodity of the so-called digital economy – attention
(Bauman, 2009). Consumers are more than willing to offer up endless screen hours of attention each day, knowing that the social web will bestow
upon them what they desire in return: the attention of others. This is the foundation of the growing wealth of the profiteers and exploiters of Big
Data, who are essentially in the business of buying and selling privacy (Jerome, 2013). Coincidentally, new enabling digital technologies and
platforms also engender a culture of surveillance between users, a kind of digital equivalent of the famed Hitchcock film “Rear Window”.
Consumers put their public and private lives on display to be incessantly consumed by others, as if intent on fulfilling the desired prophecy of
Facebook CEO Mark Zuckerberg that “privacy is dead”. This should sound an alarm for consumers concerned with privacy, but constructive
concern in this area requires an understanding of how data works and the fact that agreeing to corporate terms and conditions and privacy
policies under the lure of using “free” services ultimately means that “…someone else will determine how you live” (Lanier, 2013). In the case of
digital platforms, ‘always-on’ devices and the social web, absolutely everything users “agree” to comes at a costly price.
How then do collective social attitudes toward common valuation of privacy emerge in this technological era, and the newly engrained behavior
of digital consumers? Are users cognizant of the fact that their digital behavior is being exploited and collected into data server permanence, and
that this is sure to have delayed and unforeseen ramifications? Perhaps an exploration of the broader structures of state-corporate ‘Big Collusion’
can provide some insight into what ethical measures digital consumers should consider in the post-Snowden surveillance society, given that
states and corporations seem to have no legal incentive or principled desire to rein in their systemic transgressions.
BIG COLLUSION
In the pre-digital era of electronic surveillance, consumer profiling materialized through more dated technological means than exist today but the
underlying capitalist motivations of corporate entities were no different (Lyon, 1994, pp. 142-144). Companies like ACNielsen tracked consumer
behavior, and marketers used telephone-polling data, free postage-returned fliers or coupons, magazine subscriptions, credit card usage, sales
tracking and trend watching data to try and predict future consumer desires (Terence & Ludloff, 2011, p. 5). There existed minimal laws to
protect citizens from state surveillance, requiring authorities to seek warranted permission through legal measures that have since consistently
been eroded, without legislative or public consent, to the point of complete nonexistence (Boghosian, 2014, pp. 47-50).
In the post-9/11 digital surveillance world, politicians enlisted by the fourth estate to serve their democracies and protect their undeniable rights
have, to the contrary, continually made “vigorous attempts to train citizens to disdain their own privacy” (Greenwald, 2014, p. 125). This
pervasive culture of intrusion has been enabled and greatly expanded by new technologies, and “…the emergence of a surveillance state in which
social media not only become new platforms for invasion of privacy but further legitimate a culture in which monitoring functions are viewed as
both necessary and benign” (Giroux, 2014b, p. 41); in this context attitudes have been conditioned by both government interests and corporate
profiteers:
Those state authorities have been assisted in their assault on privacy by a chorus of Internet moguls—the government’s seemingly
indispensable partners in surveillance. When Google CEO Eric Schmidt was asked in a 2009 CNBC interview about concerns over his
company’s retention of user data, he infamously replied: “If you have something that you don’t want anyone to know, maybe you shouldn’t be
doing it in the first place.” With equal dismissiveness, Facebook founder and CEO Mark Zuckerberg said…“people have really gotten
comfortable not only sharing more information and different kinds, but more openly and with more people.” Privacy…is no longer a “social
norm,” he claimed, a notion that handily serves the interests of a tech company trading on personal information (Greenwald, 2014, p. 125).
According to Price (2014) both private and public sectors have the same agenda: “These two surveillance tracks developed with separate
motivations, one for security and the other for commerce, but both desire to make individuals and groups legible for reasons of anticipation and
control.” On the one side the state seeks to legitimize blanket surveillance of its own citizens, under the guise of shielding and protecting them
from danger (Greenwald, 2014, p.120), while corporations are in the business of digital behavior exploitation through targeted data tracking and
monitoring that serves to increase profits and satisfy shareholders seeking to maximize returns. The myopic motivations of both state and
corporate surveillance apparatuses undermine the rights of individuals and the protection of the fundamental democratic principles of privacy
and transparency (Roberts, 2014). The long-term consequences of dragnet surveillance, of mass data collection under the rubric of ‘security’, are given no consideration by the unchecked powers of the surveillance state and their corporate collaborators (Angwin, 2014). Internet
behemoths such as Google and Facebook adamantly deny intentional collusion with state surveillance agencies, often claiming to be victims of
“backdoor” intrusion while providing only court-warranted “front door” access when officially requested through legal channels.
Figure 1. Big collusion: the digital surveillance trifecta
Surveillance measures employed by the state continue to expand, enhanced by the technological innovation often invested in by the state, and
researched and developed in government and university laboratories at the taxpayers’ expense; but they are eventually coopted by the private
sector, into which the profits are wholly absorbed (Giroux, 2014a). These surveillance systems are often attached to the dubious word ‘security’.
Simple spatial security systems such as airport checkpoints and urban CCTV cameras have their data collection limitations (Schneier, 2003; Lyon
2007). Securing spaces and areas means allowing the flow of people into checkpoints and attempting to trap undesirable and suspicious persons,
it relies on dated methods of x-ray scanning and post-event tracking, such as cameras which can record suspicious activity to be reviewed later
(Lyon, 1994). But more recent virtual surveillance methods make tracking individuals easy via digital platforms and devices that can be easily
hacked and monitored. The next generation of surveillance follows an evolved pattern of invasive measures at the expense of individual
autonomy and privacy.
The introduction of digital devices allows for mobility tracking via satellites (Aas, Gundhus, & Lomell, 2009) and voice recording: so long as an
individual has an appropriate device on their person their movements can be tracked. The next generation of methods will become one step
more intrusive, as biometric technologies (Nissenbaum, 2010, p. 48) seek to bypass the ability to forge individual-identifying documents like
passports and overcome the limitations of mere spatial surveillance to get inside persons, to collect and log the data uniquely attributable to every
human individual in the world. Retinal scanners, facial recognition software, fingerprint scanners and oral DNA swabs are already used not just
on criminals but as “preventive policing” (Joh, 2014, pp. 48-51), a prerequisite for passing through those checkpoints of “security” which can
include sporting events, government buildings, airports and border crossings and to collect information to track ‘future criminals’. These
methods will likely soon be adopted by the private sector to access corporate buildings, shops and stores, or to open car doors, personal
residences and operate a multitude of domestic devices beyond the latest iPhone that already incorporates fingerprint touch to unlock. All these
devices will be communicating with one network and assigned to specific individuals, monitoring their every action at home, in their car, in a
retail store and tracking their movements through time. To technologists eager to implement this way of life this has been called the “Internet of
things” and it merely serves as a system of enhanced data collection and surveillance that corporations and the state can exploit to further
consolidate their power and control over consumer actions and behavior. Through the Internet of things, “…big data technology can capture,
analyze, and make predictions based on the digital trails we leave. The end result is all seeing and all knowing, which can be illuminating or
frightening, depending on your perspective” (Terence, 2011, p. 49). To be sure it’s illuminating for the digital surveillance state and corporate
advertisers, and frightening for those who value autonomy and privacy. According to Nissenbaum (2010, pp. 31-45), these collusive powers
essentially seek to know us better than we know ourselves. We, as in digital consumers, aren’t making it difficult: “In 2015, twenty-five billion
devices are projected to be connected to the Internet; this number could double to fifty billion devices by the end of the decade. Simply going
about our everyday lives creates a vast trail of ‘digital exhaust’ that can reveal much about us” (Jerome, 2014, p. 215).
The revelations of NSA secrets from Snowden’s leak prove that ‘Big Collusion’ is real and is happening. The NSA’s PRISM program ‘piggybacks’
on the real time data collection and mining of nine corporations’ customers: Google, Yahoo, Facebook, Skype, Microsoft, AOL, YouTube (also
Google) and Apple (Greenwald, 2014, p. 20). While many of the companies publicly deny their collusive role in government surveillance, they
have done little to show that anything concrete has changed since the revelation of the PRISM program in June of 2013. One reason may be the
benefits that come with providing access to the warrantless surveillance the government seeks. Most of these companies elude taxes, exploiting loopholes
to avoid paying and in some cases reducing to zero corporate taxes on billions of dollars in profits (Giroux, 2014a). Another reason is more
obvious; these corporations rely on data collection and exploitation to justify their market capitalization to shareholders. On the other end elected
officials in government have no incentive to protect their electorate’s privacy: “For these companies, stringent privacy regulations would curb
their ability to make money and in their words, ‘deprive consumers from advertisers’ abilities to serve up more relevant ads.’ This is certainly the case lobbyists are making for Google (spending $5.2 million in 2010), Yahoo ($2.2 million), Apple ($1.2 million) and Facebook ($350,000)” (Terence, 2011, p. 66; Opensecrets.org, 2014). Millions of dollars move from corporations, to lobbyists, to the coffers of elected officials and their
Super Political Action Committees (PACS). In exchange, the government gains access to dragnet surveillance and provides easy tax loopholes so
those same corporations can keep their billions in profits. This financial foundation of big collusion systematically undermines any notion that
the system of government in place that encourages this relationship is “democratic” in nature:
This post-democratic form of political order paradoxically consists of the combination of fragmented special interests that punish anyone
daring enough to challenge their desires and a central government that is consolidating its power to monitor, control and intimidate its
citizens. It also includes an insatiable set of information gathering businesses that are functioning as “enablers” by amassing an inconceivable
amount of data on Americans and everyone else for that matter (Barnheizer, 2013, p. 24).
This post-democratic state, relying on digital surveillance to thrive in the shadows and control its own citizens, must, it follows, have
complete disdain for their privacy and undermine it using any obtainable methods. Meanwhile American citizens who elect representatives to the
legislative and executive branches of government, in the increasingly circus-like biennial and quadrennial pageants where billions of dollars are
hoarded through shamefully legal means to ‘win’ the favor of voters, are also required to foot the bill for their own surveillance. Snowden’s
documents reveal that the NSA has a “black budget”, or unknown amounts of money, at its disposal to spy on those who pay their salaries and
nearly 70% goes to private sector contractors with top secret government clearance (Greenwald, 2014, p. 76).
DIGITAL SURVEILLANCE STATE
In the immediate aftermath of the first Snowden related media reports, President Obama continued to insist that the surveillance state he kept
intact almost immediately upon assuming office in 2009 could not without probable cause and a court warrant “listen to your telephone calls” if
you are a “U.S. person”, and claimed it was “the same way it’s always been” (Greenwald, 2014, p. 93). These brazen lies came on several occasions
in late June 2013, the weeks following published and documented reports of intentional collusion between wireless telecommunications giant
Verizon and the NSA, to which it provided unfettered access to millions of citizens’ telephone calls and records on a daily basis.
The boldness of these lies is further underscored by revelations as early as 2005 that the CIA was paying $10 million a year to
telecommunications giant AT&T for unconstrained blanket surveillance of citizen phone and Internet communications (Angwin, 2014, p.24), or
more specifically to “make copies of all emails, Web browsing, and other Internet traffic to and from AT&T customers, and provide those copies
to the NSA” (Lyon, 2014, p. 8; EFF, 2013; Harris, 2010 p. 268). If anything the Verizon revelations did in fact prove Obama’s claim that it was
“the same way it’s always been” and that AT&T, according to Boghosian (2014, p. 41), had fulfilled its promised advertising slogan “Your World
Delivered”.
As news of further programs leaked through Snowden’s sources the world became familiar with names like Xkeyscore, Turmoil, Muscular and
Tempora, all comprehensive NSA and GCHQ (its British equivalent) data collection programs or database storage systems for gigantic global
digital surveillance missions that had been long underway (Greenwald, 2014). Despite vast troves of evidence, the surveillance state continued to
deny the obvious, obfuscate, condemn the source of the leaks - Edward Snowden - and even hinted at threats against the journalists entrusted
with exposing the details of these unlawful programs and the hypocrisy and disgrace of those supporting them.
In May 2014, President Obama’s ‘Council of Advisors on Science and Technology’, a mix of academics and corporate executives, submitted a report entitled “Big Data and Privacy: A Technological Perspective.” The report to the president claims that the
… policy framework should accelerate the development and commercialization of technologies that can help to contain adverse impacts on
privacy, including research into new technological options. By using technology more effectively, the Nation can lead internationally in
making the most of big data’s benefits while limiting the concerns it poses for privacy (PCAST, 2014).
It is clear that the recommendations of this council, with its simultaneous if not conflicting support for both ‘Big Data’ and ‘Privacy’, are in direct
contrast with present day reality, and that the entire report serves as a publicity stunt for a nation with policies directly opposed to any rights of
privacy or data protection for digital consumers. If the report reeks of hypocrisy and conflict, perhaps a closer examination of the council’s
constitution can help explain why. “Google Executive Chairman Eric Schmidt was on the select Big Data working group…” and “…given Google’s
continual problems with privacy and antitrust regulators, and that his company is subject to a 20 year oversight for its bad behavior by the
Federal Trade Commission, Mr. Schmidt should have recused himself” (CDD, 2014). Others include Craig Mundie, a senior advisor to the CEO of Microsoft, and Mark Gorenberg, the managing director of a major venture capital firm, who said “As the volume and velocity of data continues to proliferate, so does the need and opportunity to find value in it through analytics” (CDD, 2014). There are several others exposed by the Center for Digital Democracy with major corporate conflicts of interest, in no position to give an honest assessment of big data’s effect on consumer privacy.
These presidential commissions are par for the course when public outrage demands political posturing to massage appearances and present the impression of actual “change on the horizon”. The rhetoric of unreality only reveals itself when the curtain covering the backstage area is lifted, as recently occurred with the disturbing data revelations regarding the Obamacare website.
Until as recently as January 2015, when it became public knowledge, the government’s ‘Obamacare’ website healthcare.gov was sharing user information with Google’s chief advertising subsidiary DoubleClick, along with other companies like Yahoo and Twitter (Pagliery, 2015). This collusion likely began in December 2013, when President Obama invited Silicon Valley CEOs to the White House, desperate to cure the major fiasco surrounding healthcare.gov’s embarrassing launch failures. While nobody knows for sure what went on behind closed doors, it didn’t take long for the site to be running smoothly, and it wouldn’t have been out of character for those companies to request ‘favors’ in return for fixing Obama’s political debacle. In this case the payoff was likely unfettered access to all new registrants’ personal data. With the government having ‘piggybacked’ corporate collection of personal data for nearly a decade, the collusion contract allowed corporations to now ‘piggyback’ government collection of personal data, even the medical and health records of its most disenfranchised citizens, who were simply following the tenets of the new law by seeking mandatory insurance coverage. This grotesque collusion would likely still continue if it weren’t for advocacy groups like the Electronic Frontier Foundation sounding the alarm (Pagliery, 2015). In another suspicious public maneuver of political posturing, the Obama administration claimed to be concerned with the “data collection practices” of Silicon Valley companies (Sanger, 2014) and with a needed examination of potential legislation to protect consumers, something only a handful of representatives are making a priority.
If the executive and legislative branches of the Digital Surveillance State are indifferent to mass surveillance of citizens, the judicial branch is only marginally better. In the Supreme Court case United States v. Jones, the court recognized the warrantless use of GPS tracking devices on citizens’ vehicles as a clear violation of the U.S. Constitution’s Fourth Amendment (Thompson II, 2014). In joining the ruling majority, Justice Sotomayor, one of the court’s more liberal justices, acknowledged the threats of the new invasive technologies to citizen privacy. “New technologies…permit the government to collect more and more data and cost less and less to implement. The technological invasion of citizens’ privacy was clearly ‘susceptible to abuse’ and over time could ‘alter the relationship between citizen and government in a way that is inimical to democratic society’ (Citron & Gray, 2013, p. 270).” The decision was still considered “puzzling” and “confusing,” leaving “many of the case’s privacy implications unanswered”, while conservative Justice Alito “ominously conceded that a ‘diminution of privacy’ may be ‘inevitable,’ and suggested further that society may find it ‘worthwhile’ to trade convenience and security ‘at the expense of privacy’” (Citron & Gray, 2013, p. 271). Attitudes toward consumer privacy by the state appear to follow those of their collusive partners, the Internet and telecommunications companies.
CORPORATIONS
In the digital economy, euphemistically dubbed the “sharing economy”, it becomes a necessary process for companies to employ mass data collection and exploit user behavior in every way possible to satiate shareholders and fulfill the inherent modus operandi of any corporate entity: constant expansion at any expense. Clever and novel ways of collecting data are engineered and implemented, from implanting tracking cookies on computers and devices, to monitoring mouse movements and calculating, to the hundredth of a second, how long a user engages with particular content online. The continual investment in or acquisition of companies in the business of digital behavior monitoring means that the goal of corporations to know digital consumers better than they know themselves is already well underway:
The race to know as much as possible about you has become the central battle of the era of Internet giants like Google, Facebook, Apple and Microsoft…While Google and Facebook may be helpful, free tools, they are also extremely effective and voracious extraction engines into which we pour the most intimate details of our lives. In the view of the “behavior market” vendors, every click signal you create is a commodity, and every move of your mouse can be auctioned off within microseconds to the highest commercial bidder (Pariser, 2012, p. 26).
The lure of what many consumers believe to be free “tools”, including software, search engines, mobile applications and social media platforms, is strong, but they are anything but free. In the words of virtual reality pioneer turned cautionary technology prophet Jaron Lanier, in digital life “free inevitably means others will be deciding how you live.” For most consumers the costs are invisible, as data, the fingerprint of their digital behavior, is collected from their open and active devices and sent to corporate and government servers.
Surveillance becomes self-generated…circulated through a machinery of consumption that encourages transforming dreams into data bits.
Such bits then move from the sphere of entertainment to the deadly serious and integrated spheres of capital accumulation and policing as they
are collected and sold to business and government agencies who track the populace (Giroux, 2014a, p. 11).
Nearly six months after the initial Snowden revelations made their way to press, The Washington Post uncovered evidence of one of the more obvious connections between corporate data surveillance and the NSA’s exploitation of their methods. With cookies, or small files stored locally on a user’s computer, websites and advertisers are able to track online browsing activity to monitor and profile that user for future enhanced ad targeting (Nissenbaum, 2010, p. 46), which greatly increases the potential for ‘click-thru’ or monetization. As the world’s largest online
advertising company, Google is perhaps the most proficient or worst offender (depending on one’s view of privacy) when it comes to planting
data tracking and data collecting cookies. One PowerPoint slide in one single document provided by Snowden reveals the NSA has been
“piggybacking” Google’s PREF cookie, now aptly called Google’s ‘NSA Cookie’. “This cookie allows the NSA to single out an individual’s
communications among the sea of Internet data in order to send out software that can hack that person’s computer. The slides say the cookies are
used to ‘enable remote exploitation’” (Soltani, Peterson, & Gellman, 2013), which translates to the NSA seizing remote access of a user’s computer
and all its contents without the user’s consent or knowledge.
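The mechanics behind such a cookie are mundane: a third-party response carries a Set-Cookie header holding a long-lived unique identifier, and every later request to that domain echoes the identifier back, so a single browser can be recognized across every site that embeds the third party’s content. The sketch below is purely illustrative, using Python’s standard http.cookies module and an invented ad-network domain; it is not Google’s or the NSA’s actual code.

```python
import secrets
from http.cookies import SimpleCookie

# Illustrative only: a hypothetical third-party ad server mints a long-lived
# identifier cookie the first time a browser loads any page that embeds it.
def mint_tracking_cookie() -> str:
    cookie = SimpleCookie()
    cookie["PREF"] = secrets.token_hex(16)               # unique per browser
    cookie["PREF"]["domain"] = ".example-ads.com"        # invented ad-network domain
    cookie["PREF"]["path"] = "/"
    cookie["PREF"]["max-age"] = 60 * 60 * 24 * 365 * 2   # persists for roughly two years
    return cookie["PREF"].OutputString()

# On every later request the browser echoes the identifier back, so the ad
# server (or anyone observing the traffic) can link page visits to one browser.
def identify_browser(cookie_header: str) -> str:
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    return cookie["PREF"].value if "PREF" in cookie else "unknown browser"

if __name__ == "__main__":
    set_cookie = mint_tracking_cookie()
    print("Set-Cookie:", set_cookie)
    echoed = set_cookie.split(";")[0]   # the name=value pair the browser sends back
    print("Tracked as:", identify_browser(echoed))
```

Once such an identifier rides along on requests to every site embedding the same third party, “piggybacking” it requires nothing more than reading it off the traffic.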
Internet companies cannot survive without sufficiently and silently eroding any and all privacy protections they may have once offered. The
evolution of Google’s privacy policy from 2000 to the present clearly displays a grotesque annihilation of user privacy protections, as does
Facebook’s privacy policy and user agreement, which are agreed to de facto post registration and can be eroded without notice (Waters &
Ackerman, 2011, p. 102). Visiting any website requires that a user already agrees to all the terms and conditions of the site by simply loading the
page in their browser. If a user tampers with cookie settings in their browser preferences, opting to not allow cookies to be stored locally on their
computers, most sites now simply don’t show content for the page and access is blocked. Even if the user opts to have those cookies removed once
they close the browser, sites like Google, YouTube, Facebook, Twitter and Yahoo display a noticeable pop-up insisting that cookies help make
their site experience run more smoothly, and in the case of Google’s help page it is even claimed that cookies protect user privacy. This is tantamount to a subtle form of blackmail designed to monetize user attention and to collect and map that user’s digital behavior for innumerable third-party advertising partners. The default setting for these Internet behemoths is to “collect it all” and tell the users it’s “for their own good”, because by tracking all their digital behavior they can present a “better experience” the next time the service is accessed. The language
even implies consent without the user having to join the site, create a personal account or log in: “You agree that by using our service you accept
our cookie policy.” Even if the user simply closes that window without clicking “I Agree”, they are still agreeing and cookies such as Google’s ‘NSA
PREF Cookie’ have already been sent to their local system (Figure 2).
Figure 2. Google’s locally stored ‘NSA Cookie’ ID: PREF. Source: Google Chrome browser settings, ‘Cookies and site data’; screenshot taken after opening Chrome without being logged in as a Google user.
While corporations require surveillance of user behavior to raise capital and justify their market capitalization, the attitudes towards privacy of
those in corporate power are as disturbingly aloof and indifferent as those of the branches of the digital surveillance state. Greenwald (2014, p.
125) points out the hypocrisy of privacy standards as something that must be bought by those who seek it, observing that Facebook CEO “Mark
Zuckerberg purchased the four homes adjacent to his own in Palo Alto, at a cost of $30 million, to ensure his privacy.” He also quotes the
technology website CNET, which states that “Your personal life is now known as Facebook’s data. Its CEO’s personal life is now known as mind
your own business.” Big Data and its servicers in Big Collusion have amassed the wealth, power and control to ensure their own privacy is not violated. “Metadata is a slow, relentless concentrator of wealth and power for those who run the computers best able to calculate with it. The only form of targeting that is absolutely reliable is distinguishing those who run the biggest computers from everyone else…it has given rise to the 1%” (Lanier, 2014). At the expense of the privacy of those who consume the technologies responsible for data collection and contribute to astonishing concentrations of wealth, creating a class of young Silicon Rockefellers and Carnegies, Big Collusion’s profiteers can afford, through their state collusion contracts, lobbyists and amassed personal wealth, to protect themselves from being victims of their own invasive technological systems. This gives them the freedom to continue to exercise their meritocratic, pseudo-libertarian ideology of technological ‘disruption’ upon society (Wieseltier, 2015), while the other 99% are forced to gauge the costs of “free”.
CONSUMERS
As mentioned, consumers and their digital behavior are the wellspring of all data. Like Google’s ‘NSA Cookie’, one of the great selling points of data collection to consumers is the enhancement of their “personalized experience” (Pariser, 2013). Data helps online retailers like Amazon.com customize their storefront based on the behavioral history of each user. Google’s search results are formulated on the basis of algorithmic reflections of the totality of past searches by that individual user. The ruse of customization has convinced users that they are being presented with information that is of greatest import to them. “If you liked that, then you’re sure to like this” is the simplistic tell of site
customization. In reality the content displayed is fashioned on the algorithmic probability of inducing a commercial transaction, extracting
capital or instigating a sale. At the very least it’s geared toward extracting attention and keeping the user on the site for as long as possible, each
second a ring on the data cash register.
Customization may play on consumptive desires calculated through digital behavior monitoring, but its greatest consequence is narrowing the presentation of what a user might actually need or find surprisingly useful, marginalizing the experience of ‘serendipity’ in the digital sphere. The customization process narrows content to exploit our greatest impulsive desires in the moment. It is up to the user and consumer to discern the difference between their wants and needs, but first this requires an understanding that they are being manipulated by both their past online actions and the site’s use of those actions as calculative data (Pariser, 2013). Most users see utility and efficiency in
customized content, rather than diminished possibilities, exploitative enticement and self-reduction through commodification: “The crucial,
perhaps the decisive purpose of consumption in the society of consumers…is not the satisfaction of needs, desires and wants, but the
commoditization or re-commoditization of the consumer: raising the status of consumers to that of sellable commodities” (Bauman, 2009).
Through social media individuals turn their public and often private lives into commodities. The essential motivation is the yearning to re-
present a version of their physical embodied life via image and textual performances, often as the enhanced copy, or version of their “best self”,
that serves in managing the impressions of others in a self-aggrandizing cycle to reap the neurological rewards provided by attention
gratification. The image we have of ourselves and the image of ourselves we perform through digital platforms are what we want others to think
of us, often attempting to collect “friends”, “followers” and “likes” as badges of digital-social validation. The concept of privacy in the context of
social media has vanished: “The right to privacy is easily given up by hundreds of millions of people for the wonders of social networking and
varied seductions inspired by consumer fantasies” (Giroux, 2014a, p. 5).
Authenticity is always in question, and in essence managing digital impressions serves to reward the great commodity of Bauman’s contemporary
consumer society: attention. The attention or applause of others validates our online performances, and briefly reinforces self-esteem so that the
performer is induced (often addictively) to return to the source of that temporary gratification.
Many social media consumers claim they don’t actively “use” the platforms they are registered with; the word “use” refers to making original
posts, commenting or following others – the source of digital content. However, they “use” these sites in different ways, lured by the desire to survey the lives of other people’s best selves. Perhaps this type of collective consumer surveillance has contributed to a wider culture of accepting corporate-state surveillance. The spectacle of consumer digital services reinforces “…a culture of control in which the most cherished notions of
agency collapse into unabashed narcissistic exhibitions and confessions of the self, serving as willing fodder for the spying state. The self has
become not simply the subject of surveillance but a willing participant and object” (Giroux, 2014a, p. 4).
If everyone is watching one another anytime they choose, what does it matter if the state and corporations are watching too? The shortsighted claim “I have nothing to hide” often serves as a clue to how obtuse some consumers are about the extent of the massive, permanent data record they are creating in their own name. Digital memory is easily recalled, and all it takes is one misstep on the part of digital consumers to become future victims of their own insatiable technological desires and carelessness. Giroux (2014a, p. 7) quotes Ariel Dorfman, who points out that ultimately “Social media users gladly give up their liberty and privacy, invariably for the most benevolent of platitudes and reasons.”
In response to the growing culture of both fear and surveillance among consumers, one industry has emerged to profit from distrust, suspicion
and aggressive voyeurism. Products like “wolftracker” (wolf-tracker.com) offer yearly account subscriptions to use their software and tools for
hacking mobile phones. A person doesn’t even need access to the device; just the phone number is sufficient for the hacking software to implant
malware via the device’s wireless Bluetooth technology to monitor and record all calls, text messages, application usage and Internet browsing. It
even gives the hacker remote access to the device’s storage. These subscriptions and services are not illegal, and the industry targets the fears of parents, who constitute the bulk of the market. Ironically, with these services available to any consumer, an October 2014 Gallup poll revealed that the hacking of smartphones tops the list of crimes Americans fear most (Riffkin, 2014).
Despite all the revelations resulting from the Snowden leaks, one Pew Research poll found that “18% of Americans believe the government can be trusted to do what’s right (with their data, presumably) all or most of the time. Consumers’ faith in companies is even lower: just 12% of those surveyed feel companies can be trusted to do what is right all or most of the time” (Bradshaw, 2014). Another poll found that consumers trust the NSA (18%) with their data more than Google (10%) or Facebook (5%) (Reason.com, 2014). This could reveal the confusion of consumers in failing to see the collusive partnerships and the NSA’s intrusive tactics, or be a testament to their carelessness in not trusting Facebook with their data yet continuing to use the service to perform and consume their digital lives.
BREAKING THE CYCLE
The international community has widely condemned dragnet U.S. digital surveillance, with a major U.N. report issuing a dire warning, supported by legal evidence, on the illegitimacy of its spying programs (St. Vincent, 2014). In step with the exceptionalist attitudes of U.S. policy makers and enforcers, international concerns are met with silence and callous indifference, conveying a stance of superiority to the rules of international law while continuing to act above and beyond their scope (Giroux, 2014b, p. 44).
While the various branches of the U.S. government spin the issue of unfettered, lawless digital surveillance with commission reports and symbolic legislation destined for recycling bins, the European Union has shown far more interest in implementing digital privacy protections, though intentions can be deceiving. The EU’s “right to be forgotten”, adopted in the summer of 2013, allows digital consumers to request the removal of Google search results associated with their name. This can protect individuals from false claims and defamation of character, but it turns Google, an information search engine, into a platform for impression management much like Facebook. People can abuse this privilege, forcing Google to remove truthful information about them without having to prove that it is in fact false. This sanitization of an important information search engine is no substitute for privacy.
Digital privacy advocates like Snowden’s primary contact Laura Poitras and data expert Jacob Appelbaum are actively trying to educate consumers on the breadth and magnitude of state-corporate surveillance. In addition to the encrypted email protections like PGP and GPG that Snowden used to communicate with journalists to facilitate his leak, he has publicly commented in detail on the numerous platforms and programs available to everyday digital consumers. He discusses the increasingly popular “alternate Internet” that can be accessed through the TOR browser, which functions by sending data packets through multiple servers, routing a user’s traffic so it becomes difficult to track their location and impossible to track their digital behavior. The TOR browser accesses an “alternet”, often called the “dark web” by those unfamiliar with its purpose or function, which has .onion domain sites that do not track user behavior, implant cookies, or attempt to collect data or violate the privacy of users on behalf of advertising companies. It may hold a clue to what a truly private Internet might look like in the near future, though TOR must first shake its unfair label as the preferred tool of black market criminals and devious hackers. For those who find TOR too “extreme”, there are Internet browser “add-ons” or extensions such as Ghostery and Disconnect that block third-party trackers and social media surveillance and add extra layers to private browsing beyond blocking cookies in the browser’s preference settings. Ghostery even reveals how many third-party trackers are blocked and provides their corporate names, which the consumer can search and discover are all cogs in Big Data’s treasure machine. The numbers are often no less than a dozen on any average web page.
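Conceptually, what these extensions do is straightforward: before the browser fetches a third-party resource, the request’s hostname is checked against a curated list of known tracker domains and dropped on a match. The minimal Python sketch below illustrates the idea with a tiny, invented blocklist and example URLs; the real extensions maintain far larger, regularly updated lists.

```python
from urllib.parse import urlparse

# Tiny, invented blocklist; real tracker-blocking extensions curate thousands of entries.
TRACKER_DOMAINS = {"doubleclick.net", "scorecardresearch.com", "facebook.net"}

def is_tracker(url: str) -> bool:
    """Return True if the URL's host matches, or is a subdomain of, a known tracker domain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in TRACKER_DOMAINS)

def filter_requests(urls):
    """Split a page's outgoing resource requests into allowed and blocked lists."""
    allowed = [u for u in urls if not is_tracker(u)]
    blocked = [u for u in urls if is_tracker(u)]
    return allowed, blocked

if __name__ == "__main__":
    page_requests = [
        "https://news.example.com/article.html",
        "https://ad.doubleclick.net/pixel?id=123",
        "https://connect.facebook.net/sdk.js",
    ]
    allowed, blocked = filter_requests(page_requests)
    print(f"Blocked {len(blocked)} third-party requests:", blocked)
```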
One problem with application or software add-ons is the most efficient ones rise to the top and cash out. Sooner than later they are vacuumed up
and bought outright by corporate surveillance companies like Google, Facebook or Twitter; Whisper Systems, a digital privacy company acquired by Twitter, is a prime example. The biggest issue consumers face by actively pursuing tools and applications to block or hinder corporate-state
surveillance is that those who do are by default immediately put under suspicion by the state for exercising their right to digital privacy (Angwin, 2014; Terence, 2014; Greenwald, 2014). Those who download the TOR browser immediately appear on the NSA’s radar, as if they are candidates to be
monitored under the auspices of science fiction-like “pre-crime” conditions.
Digital privacy advocates have proposed consumer options including a “Do Not Track” button that keeps websites and third party advertisers
from collecting their online behavior. However, efforts to come to terms with the details of this privacy option were abandoned as a clear
infringement upon corporate monetary interests: “They ground to a halt when the Digital Advertising Alliance (DAA), a trade group representing
online ad companies…abandoned the effort after clashes over the proposed policy” (Soltani, Peterson, & Gellman, 2013). In the end 181 of 200
DAA companies refused to further consider any “opt-out” option for consumers inundated with digital advertisements (Terence, 2013, p. 72).
While browsers like Google’s Chrome offer the option to ‘refuse to be tracked’, Angwin (2014, p. 124) points out the irony of the opt-out process,
“…it would have required me to install cookies on my computer to alert the tracking companies that I didn’t want to be tracked. This seemed
vaguely Orwellian: I had to allow myself to be tracked in order not to be tracked.”
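Technically, ‘Do Not Track’ amounts to a single HTTP request header, DNT: 1, that the browser attaches to outgoing requests; honoring it is entirely voluntary on the server side, which is why the DAA’s refusal mattered. A minimal illustration of what such a request looks like, using Python’s standard urllib and an example URL:

```python
import urllib.request

# Build a request the way a privacy-conscious browser would: with the voluntary
# "DNT: 1" header attached. Whether any site or tracker honors the preference is
# entirely up to the server, which is the crux of the opt-out problem.
req = urllib.request.Request(
    "https://www.example.com/",
    headers={
        "DNT": "1",                          # Do Not Track preference
        "User-Agent": "PrivacyAwareBrowser/1.0",
    },
)

print(req.header_items())  # urllib normalizes keys, e.g. [('Dnt', '1'), ('User-agent', ...)]
# Actually opening the URL is optional for this illustration:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```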
Digital consumers concerned with privacy and the potential threat of the corporate-state surveillance apparatus must ultimately make the conscious effort to become conscientious digital consumers. This is not an easy task. It becomes a full-time job to employ the privacy protections necessary to use digital tools, applications, software, platforms and the Internet without having digital behavior monitored, tracked and collected. Even then there is no guarantee of digital privacy.
There are countless stories that reveal the disturbing precision of corporate data tracking, like the outraged father who learned that the department store Target knew of his teenage daughter’s pregnancy before he did by monitoring her search habits and collecting her browser information (Boghosian, 2013). Then there is the story of Sociology Professor Janet Vertesi, who became pregnant and was intent on keeping that multi-billion dollar “parent-to-be” industry from haranguing her with advertisements. According to her own research, “The average person’s marketing data is worth 10 cents; a pregnant woman’s data skyrockets to $1.50”, and once targeted, advertisers won’t let up (Petronzio, 2014). So Vertesi took to older methods of communication and notified family by telephone and email, requesting that nobody make any mention of it on social media or online. She used the TOR browser to search the Internet and made all baby-related purchases with cash only. These are just the basic measures an average consumer must take to avoid being digitally tracked and targeted, and to keep an intimate and personal life event out of the agitating vacuum of Big Data.
While there are numerous and exhaustive measures the average digital consumer can take to protect themselves, “…if you live in the
industrialized world and desire a convenient and full life, your online privacy future is bleak. You can’t un-ring this bell, but you can reduce your
exposure, keeping in mind...what happens on the Internet stays on the Internet forever” (Terence, 2011, p. 74). The best opportunity at some
semblance of digital privacy “…is to control how much information you put out there, keep yourself informed about how privacy impacts the
technologies you use, and vote with your dollars against companies that abuse your trust” (Terence, 2011, p. 74). In a total surveillance society the
culture of suspicion is passed down from the corporate state to the consumer, who must always be suspicious when using technologies once
thought to be harmless and “free”.
The big challenge presented by big data is that the value may not be clear, the motives let alone the identity of the data collector may be hidden,
and individual expectations may be confused. Moreover, even basic reputation-management and data-privacy tools require either users’ time
or money, which may price out average consumers and the poor (Jerome, 2013, p. 4).
There are several organizations fighting for the rights of digital consumers, “like the Electronic Privacy Information Center (EPIC), American
Civil Liberties Union (ACLU), and Electronic Frontier Foundation. These organizations cover privacy policy…privacy-related court cases, provide
research and analysis, and give guidance on how to prevent privacy violations” (Terence, 2011, p. 72). There is the Center for Digital Democracy
(CDD), which follows the latest revelations of NSA activity,
publishing articles on legislative proposals that will enhance either surveillance or privacy. All these groups and individual advocates, including
the privacy-conscious consumer, must fight an uphill battle against corporate-state powers and groups like the Digital Advertising Alliance that
seek to hack away at any legislation that undermines profit driven and data collection agendas of their corporate subscribers.
As a post-Snowden world comes to terms with the magnitude of these transgressions against privacy, transparency and democracy, it must
collectively prop up those fighting the power structures intent on subverting them:
In short, if we want to prevent ourselves from becoming silenced, compliant subjects in a corporatized society where everything we do passes
through technology connected to a militarized surveillance grid, we should resist by declaring allegiance to all who dare to defend their rights
and freedoms by exercising them. All who resist are the custodians of democracy (Boghosian, 2014).
While supporting ‘custodians of democracy’ is essential for a reexamination and reinvestment in democratic principles like privacy, David Lyon
(2014, p. 9) explains why Big Collusion of surveillance is big business serviced by a state-corporate revolving door:
But a key reason why those commercial and governmental criteria are so imbricated with Big Data is the strong affinity between the two,
particularly in relation to surveillance. Big Data represents a confluence of commercial and governmental interests; its political economy
resonates with neoliberalism. National security is a business goal as much as a political one and there is a revolving door between the two in
the world of surveillance practices.
CONCLUSION
Dataveillance, predictive profiling and transgressions of individual privacy are now endemic in much of the world, with certain of the ‘Anglophone’ democracies apparently leading the way. New technologies, and the practices generated by them, have as usual arrived in our
lives before we have had adequate time to fully understand or publicly debate them. Now, in the post-Snowden era, it is time for a more
extensive, critically engaged discussion of the disbenefits, as well as the benefits, of the digitalization of everything to come to the center of public
attention. Most importantly, there are connections to be made between the rapid naturalization of this new order and the roles played in it by
states (particularly the United States), corporations (upon whose activities and technologies states have largely piggy-backed) and consumers (of
goods, services and reduced ‘avatar’ versions of actual people). It is argued here that simplistic, ‘top down’ or traditionally ‘Orwellian’ accounts of
power, control, autonomy and privacy are inadequate to this new situation, in which corporations are as important as states and individuals
gleefully and carelessly contribute to their own surveillance and management. Things are stirring, and critical studies of the ramifications of the
switch to the digital life are becoming more forceful and numerous; but much more research is still required into the close workings of
this ‘trifecta’ of collusion and mutual reinforcement, which has already annihilated many of the hitherto prevailing social norms and legal
protections around privacy once thought fundamental to the proper functioning of liberal democracies.
This work was previously published in Ethical Issues and Citizen Rights in the Era of Digital Government Surveillance edited by Robert A.
Cropf and Timothy C. Bagwell, pages 127-144, copyright year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Aas, K. F., Gundhus, H. O., & Lomell, H. M. (Eds.). (2009).Technologies of Insecurity: The Surveillance of Everyday Life . New York: Routledge-
Cavendish.
Angwin, J. (2014). Dragnet Nation: A Quest for Privacy, Security, and Freedom in a World of Relentless Surveillance . New York: Henry Holt and
Company.
Barnhizer, D. (2013). Through a PRISM Darkly: Surveillance and Speech Suppression in the Post-Democracy Electronic State. Cleveland-Marshall Legal Studies Paper No. 13-258. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2328744
Boghosian, H. (2014). Spying on Democracy: Government Surveillance, Corporate Power, and Public Resistance . San Francisco: City Lights
Books.
Bradshaw, A. (2014). New Pew Report Finds Consumers Insecure About Protecting Privacy. Center for Democracy and Technology. Retrieved
from https://cdt.org/blog/new-pew-report-finds-consumers-insecure-about-protecting-privacy-post-snowden/
Citron, D. K., & Gray, D. (2013). Addressing the Harm of Total Surveillance: A Reply to Professor Neil Richards. Harvard Law Review , 307(110),
262–274.
Gellman, B., & Poitras, L. (2013, June 7). U. S., British intelligence mining data from nine U. S. Internet companies in broad secret program. The
Washington Post. Retrieved from http://www.washingtonpost.com/investigations/us-intelligence-mining-data-from-nine-us-internet-
companies-in-broad-secret-program/2013/06/06/3a0c0da8-cebf-11e2-8845-d970ccb04497_story.html
Giroux, H. (2014a). Totalitarian Paranoia in the Post-Orwellian Surveillance State. TruthOut.org. Retrieved from http://www.truth-
out.org/opinion/item/21656-totalitarian-paranoia-in-the-post-orwellian-surveillance-state
Giroux, H. (2014b). The Violence of Organized Forgetting: Thinking Beyond America’s Disimagination Machine . San Francisco: City Lights
Books.
Greenwald, G. (2014). No Place to Hide: Edward Snowden, the NSA and the Surveillance State . New York: Penguin Books.
Harris, S. (2010). The Watchers: The Rise of America’s Surveillance State . New York: The Penguin Press.
Hathaway, M. E. (2014). Connected Choices: How the Internet Is Challenging Sovereign Decisions. American Foreign Policy Interests, (36),
300–313. doi:10.1080/10803920.2014.969178
Jerome, J. W. (2013). Buying and Selling Privacy: Big Data’s Different Burdens and Benefits. Stanford Law Review Online, (66), 47–53.
Jerome, J. W. (2014). Big Data: Catalyst for a Privacy Conversation. Indiana Law Review , 48(213), 214–242.
Joh, E. (2014). Policing by Numbers: Big Data and the Fourth Amendment. Washington Law Review (Seattle, Wash.) , 89(35), 35–69.
Lanier, J. (2013, July 8). The meta question. Nation (New York, N.Y.), 20–23. Retrieved from http://www.thenation.com/article/174776/meta-
question
Lyon, D. (1994). The Electronic Eye: The Rise of Surveillance Society . Minneapolis, MN: University of Minnesota Press.
Lyon, D. (2014). Surveillance, Snowden, and Big Data: Capacities, consequences, critique. Big Data & Society , 1(2), 1–13.
doi:10.1177/2053951714541861
Madison, E. (2014). News Narratives, Classified Secrets, Privacy, and Edward Snowden. Electronic News , 8(1), 72–75.
doi:10.1177/1931243114527869
Nissenbaum, H. (2010). Privacy in Context: Technology, Policy and the Integrity of Social Life . Stanford, CA: Stanford University Press.
Pariser, E. (2012). The Filter Bubble: How the New Personalized Web is Changing What We Read and How We Think. New York: Penguin Books.
doi:10.3139/9783446431164
Petronzio, M. (2014, April 26). How One Woman Hid Her Pregnancy from Big Data. Mashable. Retrieved from
http://mashable.com/2014/04/26/big-data-pregnancy/
Price, D. H. (2014). The New Surveillance Normal: NSA and Corporate Surveillance in the Age of Global Capitalism. Monthly Review (New York,
N.Y.) , 66(3), 43–54. doi:10.14452/MR-066-03-2014-07_3
Regan, L. (2014). Electronic Communications Surveillance. Monthly Review (New York, N.Y.), 66(3), 32–43. doi:10.14452/MR-066-03-2014-
07_2
Riffkin, R. (2014, October). Hacking Tops List of Crimes Americans Worry About Most. Gallup.org. Retrieved from
http://www.gallup.com/poll/178856/hacking-tops-list-crimes-americans-worry.aspx
Sanger, D. E. (2014, May 2). In Surveillance Debate, White House Turns Its Focus to Silicon Valley. The New York Times. Retrieved from
http://www.nytimes.com/2014/05/03/us/politics/white-house-shifts-surveillance-debate-to-private-sector.html
Shield, P. (2006). Electronic Networks, Enhanced State Surveillance and the Ironies of Control. Journal of Creative Communications , 1(1), 19–
38. doi:10.1177/097325860500100102
Soltani, A., Peterson, A., & Gellman, B. (2013, December 10). NSA uses Google cookies to pinpoint targets for hacking. Retrieved from
http://www.washingtonpost.com/blogs/the-switch/wp/2013/12/10/nsa-uses-google-cookies-to-pinpoint-targets-for-hacking/
Tambini, D., Leonardi, D., & Marsden, C. (2008). Codifying Cyberspace: Communications Self-Regulation in the Age of Internet Convergence .
New York: Routledge.
Terence, C., & Ludloff, M. E. (2011). Privacy and Big Data . Sebastopol, CA: O’Reilly Media.
Thompson, R. M., II. (2014). The Fourth Amendment Third-Party Doctrine. Congressional Research Service. Retrieved from www.crs.gov
Waters, S., & Ackerman, J. (2011). Exploring Privacy Management on Facebook: Motivations and Perceived Consequences of Voluntary Disclosure. Journal of Computer-Mediated Communication, 17(1).
Wieseltier, L. (2015, January 7). Among the Disrupted. The New York Times. Retrieved from
http://www.nytimes.com/2015/01/18/books/review/among-the-disrupted.html
CHAPTER 82
Business Analytics and Big Data:
Driving Organizational Change
Dennis T. Kennedy
La Salle University, USA
Dennis M. Crossen
La Salle University, USA
Kathryn A. Szabat
La Salle University, USA
ABSTRACT
Big Data Analytics has changed the way organizations make decisions, manage business processes, and create new products and services.
Business analytics is the use of data, information technology, statistical analysis, and quantitative methods and models to support organizational
decision making and problem solving. The main categories of business analytics are descriptive analytics, predictive analytics, and prescriptive
analytics. Big Data is data that exceeds the processing capacity of conventional database systems and is typically defined by three dimensions
known as the Three V’s: Volume, Variety, and Velocity. Big Data brings big challenges. Big Data not only has influenced the analytics that are
utilized but also has affected technologies and the people who use them. While Big Data brings challenges, it also presents opportunities.
Those who embrace Big Data and effective Big Data Analytics as a business imperative can gain competitive advantage.
INTRODUCTION
Generations of technological innovations have evolved since the 1970s. Decision support systems (DSS) emerged as one of the earliest
frameworks intended to assist complex decision making through user-friendly interfaces, rudimentary database relationships, basic visualization
capabilities, and predefined query proficiencies. A typical cycle of activities within a DSS network began with decision makers (Zeleny, 1987)
defining a problem requiring a solution. After defining the problem and exploring possible alternatives, a decision model was developed that
eventually would guide the decision makers toward implementation. This model-building phase of the process was an iterative approach to
resolving organizational problems (Shim, 2002).
As a logical progression, supplementary support systems were being funded within the C-suite. Executive support systems were developed to
obtain timely access to information for competitive advantage. These inter-networking infrastructures became possible because of distributed
computing services, online analytical processing and business intelligence applications.
Today, it is the demand for the application of analytics to Big Data that is driving an expansion of information technology that will continue at an
accelerating rate (Davenport, 2014). Big Data and analytics, now possible because of advances in technology, have changed the way organizations
make decisions, manage business processes, and create new products and services.
Informed Decision Making
In any organization, it is essential that strategic decisions have executive level support. Exploring Big Data using analytical support systems has
strategic as well as tactical importance. This is not a modernistic view, but rather one of historical precedent and contemporary necessity (Bughin, 2010; Ewusi-Mensah, 1997; Jugdev, 2005; Poon, 2001). Furthermore, Vandenbosch (1999) clearly established a relationship between organizational competitiveness and the use of methods and techniques for focusing attention, improving understanding, and scorekeeping.
In recent years, numerous studies have validated the premise that business analytics informs decision making. Davenport, Harris and Morison
(2010) show that business analytics produces smarter decisions. Business analytics has changed the way organizations make decisions.
Organizations are making informed decisions because business analytics enables managers to decide on the basis of evidence rather than
intuition alone. While business analytics does not eliminate the need for intuition and experience, it changes long-standing ideas about the value
of experience, the nature of experience and the practice of management (McAfee & Brynjolfsson, 2012).
Improved Business Processes
Many large organizations are burdened with an array of process models intended to improve the decision-making hierarchy (Dijkman, 2011). If
an organization has been in business for several decades, managing these processes is time-prohibitive and expensive because a team is required
to manage and refine them. As organizations adopt business process management systems to automate key business processes, integration with
business intelligence remains equally important. Making data from business processes available for business intelligence in near real-time allows
organizations to proactively manage business processes through improved insight into performance. Business analytics not only changes the way
organizations evaluate business processes but also how they manage business processes.
Empowering Products and Services
Nothing moves change in the business environment more effectively than competition. Products and services evolve as competitive information is obtained and analyzed. Data has become widely available at historically low cost, allowing organizations to better manage their employees. Data also allows vendors the ability to adjust pricing based on archival and real-time sales. Similarly, considerations for
complementary products and services are based on consumer behavior (Brown, 2011). These activities can take place only if data can be accessed
and analyzed.
If a company makes things, moves things, consumes things, or works with customers, the company has increasing amounts of data about these
activities (Davenport, 2013). Powerful data-gathering and analysis methods can provide an opportunity for developers of products and services.
They can create more valuable products and services from the analysis of data.
The impact of business analytics and Big Data is measurable. The following sections describe: (a) what business analytics is, (b) the methods and
techniques of business analytics and (c) the effect of Big Data on traditional analytics processes, tools, methods and techniques.
BUSINESS ANALYTICS: WHAT IS IT?
In today’s business environment, few would argue against the need for business analytics. It provides facts and information that can be used to
improve decision making, enhance business agility, and provide a competitive edge.
Many academics and practitioners have defined business analytics or analytics in slightly different ways. Davenport and Harris (2007, p. 7)
define analytics as “the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based
management to drive decisions and actions.” Lustig, Dietrich, Johnson, and Dziekan (2010) at IBM proposed that business analytics includes both analytics and what Davenport and Harris (2007, p. 7) define as business intelligence, “a set of technologies and processes that use data to
understand and analyze business performance.” At IBM, the term business analytics applies to:
1. Software Products: Including business intelligence and performance management, predictive analytics, mathematical optimization,
enterprise information management, and enterprise content and collaboration;
2. Analytic Solutions Areas: Such as industry solutions, finance/risk/fraud analytics, customer analytics, human capital analytics, supply
chain analytics;
3. Consulting Services: Outsourced business processes and configured hardware (Lustig et al. 2010).
These authors further refine business analytics into three categories: descriptive analytics, predictive analytics and prescriptive analytics.
Davenport (2010) defined business analytics as the broad use of data and quantitative analysis for decision making within organizations. These
definitions highlight four key aspects of business analytics: data, technology, statistical and quantitative analysis, and decision making support.
The Analytics Section of the Institute for Operations Research and the Management Sciences (INFORMS) is focused on promoting the use of data-driven analytics and fact-based decision making in practice. The Section recognizes that analytics is seen as both
1. The complete, end-to-end analytics process; and
2. A broad set of analytical methodologies that enable the creation of business value.
Consequently, the Section promotes the integration of a wide range of analytical techniques and the end-to-end analytics process. INFORMS
(2012 p. 1) defines analytics as “the scientific process of transforming data into insight for making better decisions.”
A Science and an Art
In our view, business analytics is the use of data, information technology, statistical analysis, and quantitative methods and models to support
organizational decision making and problem solving. The main categories of business analytics are:
1. Descriptive Analytics: The use of data to find out what happened in the past or is currently happening;
2. Predictive Analytics: The use of data to find out what could happen in the future; and
3. Prescriptive Analytics: The use of data to prescribe the best course of action for the future.
Business analytics will not provide decision makers with any business insight without models; that is, the statistical, quantitative, and machine
learning algorithms that extract patterns and relationships from data and express them as mathematical equations (Eckerson, 2013). Business
analytics, clearly, is a science.
However, business analytics is also an art. Selecting the right data, algorithms and variables, and the right techniques for a particular business
problem is critical. Equally critical is clear communication of analytical results to end-users. Without an understanding of what the model
discovered and how it can benefit the business, the decision maker would be reluctant to act on any insight gained. One of the best ways to tell a
data story is to use a compelling visual. Organizations today are using a wide array of data visualization (dataviz) tools that help uncover valuable
insight in easier and more user-friendly ways.
BUSINESS ANALYTICS: METHODS AND TECHNIQUES
This section provides an overview of key methods and techniques within each of the main categories of business analytics: descriptive, predictive
and prescriptive.
Descriptive Analytics
Descriptive analytics is the use of data to reveal what happened in the past or is currently happening in the present. As presented in Figure 1, the
methods and techniques within descriptive analytics can be classified by the purpose they serve.
Reporting and visual displays provide information about activities in a particular area of business. They answer questions such as: What
happened? What is happening now? How many? How often? Where? Methods and techniques include both standard, predetermined report
generation and ad hoc, end-user created report generation. Dashboards and scorecards are also included in this category. A dashboard is a visual
display of the most important information needed to achieve objectives. A scorecard is a visualization of measures and their respective targets,
with visual indicators showing how each measure is performing against its targets.
Analysis, Query and Drill Down provide descriptive summaries, retrieve specific information from a database, and move deeper into a chain of
data, from higher-level information to more detailed, focused information. They answer questions such as: What exactly is the problem? Why is it
happening? What does this all mean? The methods and techniques within this classification include data descriptive summaries provided by
statistics (analysis), database manipulation capabilities provided by information systems technologies (query), and database navigation
capabilities provided by information systems technologies (drill down).
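As a concrete illustration of the summarize, query, and drill-down pattern, the short sketch below applies pandas (assumed here as the analysis library) to a small, invented sales table; the column names and figures exist only for the example.

```python
import pandas as pd

# Invented sample data: revenue by region and product line.
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West", "South"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "revenue": [120000, 95000, 80000, 110000, 60000, 70000],
})

# Analysis: descriptive summary of the revenue measure.
print(sales["revenue"].describe())

# Query: retrieve specific records, e.g. high-revenue transactions.
print(sales.query("revenue > 90000"))

# Drill down: from total revenue, to revenue by region, to region-and-product detail.
print(sales["revenue"].sum())
print(sales.groupby("region")["revenue"].sum())
print(sales.groupby(["region", "product"])["revenue"].sum())
```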
Data Discovery allows decision makers to interactively organize or visualize data. Some of the previously mentioned methods and techniques can
be considered traditional business intelligence. While data discovery can answer similar questions as traditional business intelligence, it is a
departure from the traditional because it emphasizes interactive analytics rather than static reporting. Data discovery allows for easy navigation
of data and quick interactive question-asking. These capabilities help the end-user gain new insights and ideas. Data discovery is recognized as
critical to business data analysis because of the rise of Big Data. Methods and techniques in this category include statistics, text and sentiment
analysis, graph analysis, path and pattern analytics, machine learning, visualization and complex data joins (Davenport, 2013).
Data Visualization capability is an important feature of data discovery because it visually represents patterns or relationships that are difficult to
perceive in underlying data. Dataviz also facilitates communication of what the data represent, especially when skilled data analysts communicate
to business decision makers. In this context, the picture is truly worth a thousand words.
There are a variety of conventional ways to visualize data: tables, histograms, pie charts, and bar graphs. Today, there are also very innovative, creative approaches to present data; a minimal charting example follows the list below. When selecting a visualization for data, it is important to:
1. Understand the data to be visually represented, including its size and cardinality;
4. Use a visual that conveys the information in the most direct and simplest way (SAS, 2012).
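For instance, a conventional bar chart of a small, invented dataset takes only a few lines with matplotlib (assumed here as the plotting library); the point is that the visual encodes the comparison more directly than the raw numbers do.

```python
import matplotlib.pyplot as plt

# Invented example data: revenue by region, in millions of dollars.
regions = ["East", "West", "South", "North"]
revenue = [12.4, 9.8, 7.1, 10.6]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(regions, revenue, color="steelblue")
ax.set_xlabel("Region")
ax.set_ylabel("Revenue ($M)")
ax.set_title("Revenue by Region")
fig.tight_layout()
fig.savefig("revenue_by_region.png")  # or plt.show() in an interactive session
```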
Predictive Analytics
Predictive analytics is the use of data to discover what could happen in the future. The questions that can be answered with predictive analytics
include: What will happen next? What trends will continue? As shown in Figure 2, predictive analytics encompasses a variety of methods and
techniques from statistics and data mining and can be used for purposes of prediction, classification, clustering, and association.
Prediction assigns a value to a target based on a model. Techniques include regression, forecasting, regression trees, and neural networks.
Classification assigns items in a collection to target categories or classes. Techniques include logistic regression, discriminant analysis,
classification trees, K-nearest neighbors, Naïve Bayes, and neural networks. Clustering finds natural groupings in data. Cluster analysis falls within this category. Association finds items that co-occur and specifies the rules that govern their co-occurrence. Association rule discovery, also known as affinity analysis, belongs to this category.
Within the field of data mining, techniques are described as supervised or unsupervised learning. In supervised learning, the variables under
investigation can be split into two groups: explanatory variables and dependent variables, also called the target. The objective of the analysis is to
specify a relationship between the explanatory variables and the dependent variables. In unsupervised learning, all variables are treated the
same; there is no distinction between explanatory and dependent variables.
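To make the distinction concrete, the sketch below uses scikit-learn (assumed here as the modeling library) on an invented toy dataset: a logistic regression learns from a labeled churn target (supervised classification), while k-means finds natural groupings with no target at all (unsupervised clustering).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Invented toy data: two features per customer, e.g. visits per month and average spend.
X = np.array([[2, 10], [3, 12], [10, 90], [12, 110], [11, 95], [1, 8]])
churned = np.array([1, 1, 0, 0, 0, 1])  # known outcomes: the supervised target

# Supervised learning (classification): model the relationship between the
# explanatory variables X and the known target, then predict for a new customer.
clf = LogisticRegression().fit(X, churned)
print("Churn prediction for [5, 40]:", clf.predict([[5, 40]])[0])

# Unsupervised learning (clustering): no target variable, just natural groupings.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster labels:", clusters)
```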
Prescriptive Analytics
Prescriptive analytics is the use of data to prescribe the best course of action for the future. The questions that can be answered with prescriptive
analytics include: What if we try this? What is the best that can happen? As depicted in Figure 3, prescriptive analytics encompasses a variety of
methods and techniques from management science including optimization and simulation.
Optimization provides a means to achieve the best outcome. Optimizations are formulated by combining historical data, business rules, mathematical models, variables, constraints, and machine learning algorithms. Stochastic optimization provides a means to achieve the best outcome while addressing uncertainty in the data. Sophisticated models, scenarios, and Monte Carlo simulations are run with known and randomized
variables to recommend next steps and display if/then analysis.
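To make the idea concrete, the sketch below runs a small Monte Carlo simulation for a hypothetical stocking decision: demand is uncertain, each candidate order quantity is evaluated across thousands of randomized demand scenarios, and the quantity with the best expected profit is prescribed. All figures are invented for illustration.

```python
import random

random.seed(42)

PRICE, COST = 10.0, 6.0             # invented unit economics
ORDER_OPTIONS = range(50, 151, 10)  # candidate actions to evaluate
SCENARIOS = 5000                    # randomized demand scenarios per candidate

def expected_profit(order_qty: int) -> float:
    """Average profit over simulated demand scenarios (demand roughly N(100, 20))."""
    total = 0.0
    for _ in range(SCENARIOS):
        demand = max(0, int(random.gauss(100, 20)))
        sold = min(order_qty, demand)
        total += sold * PRICE - order_qty * COST
    return total / SCENARIOS

# Prescriptive step: evaluate every candidate action and recommend the best one.
best_qty = max(ORDER_OPTIONS, key=expected_profit)
print("Recommended order quantity:", best_qty)
print("Expected profit:", round(expected_profit(best_qty), 2))
```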
BUSINESS ANALYTICS: IMPACT OF BIG DATA
Data and data analytics allow managers to measure, and therefore know, much more about their businesses. They can then directly translate
that knowledge into improved decision making, business process performance, and products and services. In recent years, businesses have
experienced the emergence of Big Data.
What Exactly Is Big Data?
Big Data is data that exceeds the processing capacity of conventional database systems; the data is too big, moves too fast, or doesn’t fit the
structures of database architectures (inside-bigdata.com). Big Data is typically defined by three dimensions known as the Three V’s: volume,
variety, and velocity. Volume refers to the amount of data, typically in the magnitude of multiple terabytes or petabytes. Variety refers to the
different types of structured and unstructured data, such as: transaction-level data, video, audio, and text. Velocity is the pace at which data flows
from sources, such as: business processes, machines, networks, social media sites, and mobile devices. Now an additional V has surfaced,
veracity. This fourth V refers to the biases, noise, and abnormality in data. Veracity is an indication of data integrity and trust in the data.
Big Data Means Big Challenges
Big Data brings with it big challenges. The technology, the analytic techniques, and the people, as well as the data, are all somewhat different
from those employed in traditional business analytics. In recent years, tools to handle the volume, velocity, and variety of Big Data have
improved. For example, new methods of working with Big Data, such as Hadoop and MapReduce, have evolved as alternatives to traditional data
warehousing. However, these new technologies require a new IT skills set for integrating relevant internal and external sources of data, and this
can present a challenge (McAfee & Brynjolfsson, 2012). Not only does Big Data require advanced technologies, it also requires advanced
analytics. In addition to the higher level traditional business analytics techniques, Big Data uses advanced techniques such as text analytics,
machine learning, and natural language processing.
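The MapReduce model itself is simple: a map step emits key-value pairs from each input record, and a reduce step aggregates all values that share a key; Hadoop's contribution is running those steps in parallel across many machines and disks. The single-process Python sketch below illustrates the programming model with the classic word count; it is not Hadoop code.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the values for each key (the shuffle/group step is implicit here)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    documents = ["big data brings big challenges", "big data brings opportunities"]
    # In Hadoop, map tasks run in parallel on chunks of the input; here we simply chain them.
    all_pairs = (pair for doc in documents for pair in map_phase(doc))
    print(reduce_phase(all_pairs))
```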
Together, these changes in technologies and techniques present managerial challenges that require organizational change. Organizations must
manage change effectively to realize the full benefits of using Big Data. Leadership, talent, and company culture must adapt.
The intrinsic resistance to organizational change can be effectively mitigated with time and education. The dedication and diligence of actionable
leadership can positively govern culture change within the organization. Leadership can also provide on-going support for malleable technology
and adaptive talent acquisition, as well as for building a management team that is cognitively tactical in its decision making responsibilities.
While critical, these management challenges can be overcome in organizations desiring to build integrated capabilities without the immediate
need for large capital investments (McAfee, 2012). However, these organizational refinements must be recognized as preexisting conditions
during the on-going transition period.
Furthermore, contemporary leaders must have the ability to respond to market issues that are both managerial and technical by nature (Tallon,
2008). This includes the ability to hire properly. Talent management and the acquisition of knowledgeable personnel are essential for
dynamically adapting organizations. The strategy is to achieve competitive advantage with a maturing resource pool (Bartlett, 2013). The
alternative strategy of acquiring talent and knowledge only by educating existing human capital is not a solution for either the short or the
intermediate term.
Most importantly, unless managers have the ability to convert corporate goals, objectives, competencies, and resourcefulness into meaningful
outcomes, organizations will fall short of the cyclical requirements essential for instilling creativity within talented people (Farley, 2005). Along
with these challenges, organizational leaders must understand the nuances of technology, the value-added reasons for change, the constraints on
human capital, and the impact that sequential implementation will have throughout the entire adaptive process (Hoving, 2007).
Big Data Analytics Means Big Opportunities
While the name may change in the future, Big Data is here to stay. According to Brynjolfsson and McAfee (2012), organizations in almost all
industries can enhance their operations and strategic decision-making by implementing Big Data analytics programs:
Almost no sphere of business activity will remain untouched by this movement. We’ve seen Big Data used in supply chain management to
understand why a carmaker’s defect rates in the field suddenly increased, in customer service to continually scan and intervene in the health
care practices of millions of people, in planning and forecasting to better anticipate sales on the basis of a data set of product characteristics
(p.1).
Increasingly, organizations seeking competitive advantage need to enable, learn from, and use data. Those who embrace Big Data and effective
data analytics as a business imperative can gain competitive advantage in the rapidly evolving digital economy (Johnson, 2012).
This work was previously published in the Handbook of Research on Organizational Transformations through Big Data Analytics edited by
Madjid Tavana and Kartikeya Puranam, pages 111, copyright year 2015 by Business Science Reference (an imprint of IGI Global).
REFERENCES
Bartlett, C., & Ghoshal, S. (2013). Building competitive advantage through people. Sloan Management Review , 43(2).
Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’. The McKinsey Quarterly , 4, 24–35.
Brynjolfsson, E., & McAfee, A. (2012). Big Data’s Management Revolution. In The promise and challenge of big data. Harvard Business Review
Insight Center Report.
Bughin, J., Chui, M., & Manyika, J. (2010). Clouds, big data, and smart assets: Ten tech-enabled business trends to watch. The McKinsey
Quarterly , 56(1), 75–86.
Davenport, T. (2010). The New World of Business Analytics. International Institute for Analytics.
Davenport, T. (2014). Big Data at Work: Dispelling the Myths, Uncovering the Opportunities . Harvard Business Review Press.
Davenport, T., & Harris, J. (2007). Competing on Analytics: The New Science of Winning . Boston, MA: Harvard Business School Publishing
Corporation.
Davenport, T., Harris, J., & Morison, R. (2010). Analytics at Work: Smarter Decisions, Better Results. Boston, MA: Harvard Business School Publishing Corporation. Retrieved from http://www.sas.com/resources/asset/IIA_NewWorldofBusinessAnalytics_March2010.pdf
Dijkman, R., Dumas, M., Van Dongen, B., Käärik, R., & Mendling, J. (2011). Similarity of business process models: Metrics and
evaluation. Information Systems , 36(2), 498–516. doi:10.1016/j.is.2010.09.006
Ewusi-Mensah, K. (1997). Critical issues in abandoned information systems development projects. Communications of the ACM , 40(9), 74–80.
doi:10.1145/260750.260775
Farley, C. (2005). HR's role in talent management and driving business results. Employment Relations Today , 32(1), 55–61.
doi:10.1002/ert.20053
Hoving, R. (2007). Information technology leadership challenges - past, present, and future. Information Systems Management, 24(2), 147–153. doi:10.1080/10580530701221049
Johnson, J. (2012). Big Data + Big Analytics = Big Opportunity . Financial Executive International.
Jugdev, K., & Müller, R. (2005). A retrospective look at our evolving understanding of project success. Project Management Journal , 36(4), 19–
31.
Langley, A. (1999). Strategies for theorizing from process data. Academy of Management Review, 24(4), 691–710.
Lustig, I., Dietrich, B., Johnson, C., & Dziekan, C. (2010). The Analytics Journey. Retrieved from analyticsmagazine.com
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review , 90(10), 60–68.
Poon, P., & Wagner, C. (2001). Critical success factors revisited: Success and failure cases of information systems for senior executives. Decision
Support Systems , 30(4), 393–418. doi:10.1016/S0167-9236(00)00069-5
Shim, J. P., Warkentin, M., Courtney, J. F., Power, D. J., Sharda, R., & Carlsson, C. (2002). Past, present, and future of decision support
technology. Decision Support Systems , 33(2), 111–126. doi:10.1016/S0167-9236(01)00139-7
Tallon, P. P. (2008). Inside the adaptive enterprise: An information technology capabilities perspective on business process agility. Information
Technology Management , 9(1), 21–36. doi:10.1007/s10799-007-0024-8
Vandenbosch, B. (1999). An empirical analysis of the association between the use of executive support systems and perceived organizational
competitiveness. Accounting, Organizations and Society , 24(1), 77–92. doi:10.1016/S0361-3682(97)00064-0
Wilson, H., Daniel, E., & McDonald, M. (2002). Factors for success in customer relationship management (CRM) systems. Journal of Marketing
Management, 18(1-2), 193-219.
Zeleny, M. (1987). Management support systems: Towards integrated knowledge management. Human Systems Management, 7(1), 59–70.
KEY TERMS AND DEFINITIONS
Business Analytics: Use of data, information technology, statistical analysis, and quantitative methods and models to support organizational
decision making and problem solving.
Descriptive Analytics: Use of data to find out what happened in the past or is currently happening.
Prescriptive Analytics: Use of data to prescribe the best course of action for the future.
Volume: Refers to the amount of data, typically in the magnitude of multiple terabytes or petabytes.
CHAPTER 83
SLOD-BI:
An Open Data Infrastructure for Enabling Social Business Intelligence
Rafael Berlanga
Universitat Jaume I, Spain
Lisette García-Moya
Universitat Jaume I, Spain
Victoria Nebot
Universitat Jaume I, Spain
María José Aramburu
Universitat Jaume I, Spain
Ismael Sanz
Universitat Jaume I, Spain
Dolores María Llidó
Universitat Jaume I, Spain
ABSTRACT
The tremendous popularity of web-based social media is attracting the attention of industry, which seeks to profit from the massive availability of sentiment data, considered of high value for Business Intelligence (BI). So far, BI has been mainly concerned with corporate data, paying little or no attention to the external world. However, for BI analysts, taking into account the Voice of the Customer (VoC) and the Voice of the Market (VoM) is crucial for putting the results of their analyses in context. Recent advances in Sentiment Analysis have made it possible to effectively extract and summarize sentiment data from these massive social media, and as a consequence the VoC and the VoM can now be heard through web-based social media (e.g., blogs, review forums, social networks, and so on). However, new challenges arise when attempting to integrate traditional corporate data and external sentiment data. This paper deals with these issues and proposes a novel semantic data infrastructure for BI aimed at providing new opportunities for integrating traditional and social BI. This infrastructure follows the principles of the Linked Open Data initiative.
1. INTRODUCTION
The massive adoption of web-based social media in the daily activity of e-commerce users, from customers to marketing departments, is attracting increasing attention from Business Intelligence (BI) companies. So far BI has been confined to corporate data, with little attention to external data. Capturing external data for contextualizing data analysis operations is a time-consuming and complex task that, however, would bring large benefits to current BI environments (Pérez et al., 2008a). The main external contexts for e-commerce applications are the Voice of the Customer (VoC) and the Voice of the Market (VoM) forums. The former concerns customer opinions about the products and services offered by a company, and the latter comprises all the information related to the target market that can affect the company's business. Listening to the VoM allows setting the strategic direction of a business based on in-depth consumer insights, whereas listening to the VoC helps to identify better ways of targeting and retaining customers. As pointed out by Reidenbach (2009), both perspectives are important for building long-term competitive advantage.
The traditional scenario for performing BI tasks has dramatically changed with the consolidation of the Web 2.0, and the proliferation of opinion
feeds, blogs, and social networks. Nowadays, we are able to listen to the VoM and VoC directly from these new social spaces thanks to the emergence of automatic methods for performing sentiment analysis over them (Liu, 2012). These methods deal directly with the posted texts to identify global assessments (i.e., reputation) of target items, and to detect the subject of an opinion (i.e., its aspects) and its orientation (i.e., polarity). From now on,
we will consider as social data the collective information produced by customers and consumers as they actively participate in online social
activities, and we will refer to all the data elements extracted from social data by means of sentiment analysis tools as sentiment data.
A good number of commercial tools for listening to and analyzing social media and product review forums have recently appeared on the market, for example Salesforce Radian6 (http://www.salesforce.com/marketing-cloud), Media Miser (http://www.mediamiser.com), and Synthesio (http://synthesio.com), to mention just a few. Unfortunately, these commercial tools aim to provide customized reports for end users, and the sentiment data on which these reports rely are not publicly available (indeed, this is the key to their business). Consequently, critical aspects such as the quality and reliability of the delivered data cannot be checked or validated by the analysts. This fact contrasts with the high quality that BI requires of corporate data in order to make reliable decisions.
Figure 1. BI contexts and their relation to the Web 3.0 data infrastructure
Apart from the sentiment analysis approaches, there is also great interest in publishing strategic data for BI tasks within the Linked Open Data (LOD) cloud (Heath & Bizer, 2011). The Web 3.0 and LOD are about publishing data identified and linked to each other through Uniform Resource Identifiers (URIs), and providing data with well-defined semantics so that users and machines can interpret them correctly. Projects like Schema.org are enabling the massive publication of product offers as micro-data, as well as specific vocabularies for e-commerce applications. Unfortunately, there is currently no open data infrastructure that allows users and applications to directly perform analysis tasks over the huge amounts of opinions published on the Web.
In this paper we discuss the opportunities and advantages of defining new data infrastructures for performing social BI. As Figure 1 shows, in this social BI infrastructure, VoC and VoM sentiment data must be integrated together with all the external factors that may potentially affect a business (e.g., new legislation, financial news, etc.). We claim that such a data infrastructure must follow the principles of the LOD initiative. As a result, if web-based social data is migrated to the Web 3.0 as linked data in order to be shared, validated and eventually integrated with corporate data, a new global BI scenario for e-commerce applications is enabled. Furthermore, most of the data and vocabularies used by researchers and companies for performing sentiment analysis could be better exploited if they were shared, contrasted and validated by the community. The main contributions of this paper are the following:
• We propose a novel semantic data infrastructure to publish both social data and automatically extracted sentiment data. This data
infrastructure follows the LOD principles, and therefore it is aimed at linking the social data with other related datasets in the LOD cloud;
• We propose a novel method for data provisioning, called ETLink, which covers the requirements identified in this scenario.
The rest of the paper is organized as follows. The next section describes the background of the proposal. Sections 3 and 4, respectively, present the proposed social BI infrastructure and describe its main component datasets. Afterwards, Section 5 discusses how the main components of the SLOD-BI data infrastructure are populated from the social resources. Section 6 presents the evaluation. An illustrative application of this infrastructure and some example analysis operations are depicted in Section 7, and the overall conclusions are summarized in Section 8.
2. BACKGROUND
BI refers to the methodologies, architectures and technologies that transform raw data into meaningful and useful information to enable more
effective decision-making. BI technologies provide historical, current and predictive views of business operations. Common functions of BI are
reporting, online analytical processing (OLAP), data mining, complex event processing and text mining among others. Often BI applications use
data gathered from a data warehouse (DW) or a data mart. In fact, one of the most successful approaches to BI has been the combination of DW
and OLAP (Codd, 1993).
Traditional BI follows a three-layered architecture consisting of the data sources layer, where all the potential data of any nature is gathered, the
integration layer, which transforms and cleanses the data from the sources and stores them in a DW, and the analysis layer, where different tools
exploit the integrated data to extract useful knowledge that is presented to the analyst as charts, reports, cubes, etc. For the integration layer, the
multidimensional (MD) model is used, where factual data gathered from the data sources layer must be expressed in terms of numerical measures and categorical dimensions. The semantics of this model consists in representing any interesting observation of the domain (the measures) in its context (the dimensions). The typical processes in charge of translating data from the data sources layer to the integration layer are called ETL processes (extract, transform, and load).
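As a minimal illustration of an ETL step (a hypothetical sketch; the table and column names are invented for the example and are not taken from the chapter), the following Python fragment extracts raw sales records, cleanses a value, and loads a fact keyed by a dimension surrogate key:

```python
import csv
from io import StringIO

raw = StringIO("date,store,product,amount\n2015-01-02,Valencia, camera ,199.0\n")

dim_product, fact_sales = {}, []                 # toy dimension and fact "tables"

for row in csv.DictReader(raw):                  # Extract
    product = row["product"].strip().lower()     # Transform: cleanse the value
    key = dim_product.setdefault(product, len(dim_product))  # surrogate key lookup
    fact_sales.append({"date": row["date"],      # Load: measure plus dimension keys
                       "store": row["store"],
                       "product_key": key,
                       "amount": float(row["amount"])})

print(dim_product, fact_sales)
```

In a production DW the same flow would be scheduled in batch and write to real dimension and fact tables, but the extract-transform-load structure is the same.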
Even though this traditional architecture has proved useful to analyze corporate data, it presents several limitations that make it unsuitable to
meet the analytical requirements of social BI. First of all, the previous architecture only works well in a closed-world scenario, where both the
data sources and the user requirements are static and known in advance. Moreover, the ETL processes are meant to periodically load well-
structured data in batch mode, as they usually apply heavy cleansing transformations. The massive availability of web-based social media related
to business processes has become a valuable asset for the BI community. The integration of these external and heterogeneous data sources with
corporate data would enable more insightful analysis and would bring new, so far unexplored, marketing opportunities. The need to incorporate external data into traditional analysis processes is not new. The majority of approaches try to incorporate external data into the already existing MD structures by establishing mappings. Thus, the integration is only circumstantial, and problems such as the lack of dynamicity and freshness remain. These problems require a shift in the traditional BI architecture towards a more dynamic, open and flexible infrastructure.
In recent years, opinion mining and sentiment analysis have become an important research area that combines techniques from Machine Learning (ML) and Natural Language Processing (NLP). One of the most relevant applications of sentiment analysis is aspect-based summarization (Liu, 2012). Given a stream of opinion posts, aspect-based summarization is aimed at identifying the most relevant opined aspects, also called features or facets, along with their sentiment orientation, usually given by a score and a polarity. For example, given a stream of opinions about digital
cameras, some relevant aspects can be the battery life, the quality of the lenses, etc. Sentences like “the battery life is too short” will contribute to
the negative orientation of the battery life aspect, whereas others like “we took very good pictures” will contribute to the positive orientation of
the picture quality aspect. Aspect-based summarization has usually been divided into three main tasks, namely: sentiment classification, subjectivity classification and aspect identification. The first is focused on detecting the sentiment orientation of a sentence, the second consists of detecting whether a sentence is subjective (i.e., whether it contains a sentiment), and the third consists of detecting the most relevant aspects of an opinion stream. Supervised ML approaches have been widely adopted to solve these problems, as they can easily be modelled as traditional classification problems. Unfortunately, it is unfeasible to obtain training examples for all the items and potential aspects regarded in opinion streams. Thus, supervised approaches have been restricted to obtaining sentiment lexicons and detecting sentence subjectivity with them (Liu, 2012).
As a consequence, sentiment analysis in open scenarios should rely on unsupervised or semi-supervised methods (García-Moya et al., 2013b).
Moreover, sentiment analysis must be blended with social network analysis, which basically aims to predict the diffusion and popularity of
opinions spread across social networks (Guille et al., 2013).
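To make the notion of aspect-based summarization concrete, the sketch below aggregates already-extracted (aspect, polarity) pairs into per-aspect counts; the hard extraction step is assumed to have been performed by an upstream sentiment analysis tool, and the data are invented:

```python
from collections import Counter, defaultdict

# (aspect, polarity) pairs assumed to come from an upstream extractor
extracted = [("battery life", -1), ("picture quality", +1),
             ("battery life", -1), ("lens", +1)]

summary = defaultdict(Counter)
for aspect, polarity in extracted:
    summary[aspect]["positive" if polarity > 0 else "negative"] += 1

for aspect, counts in summary.items():
    print(aspect, dict(counts))   # e.g. battery life {'negative': 2}
```

The output is exactly the kind of aspect-level summary (relevant facets plus their sentiment orientation) that the approaches cited above aim to produce at scale.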
The problem of how to exploit social data to extract sentiment data that could be useful for BI applications and how to integrate the extracted
opinion data into the existing corporate DW is still an open issue. Pérez et al. (2008a) proposed a first approach to this problem by
contextualizing a sales DW with on-line customer reviews about the company products/services. A contextualized warehouse allows users to
obtain strategic information by combining all their sources of structured data and documents (Pérez et al., 2008b). The analysis cubes of a
contextualized warehouse, denoted R-cubes, are special since each fact is linked to an ordered list of documents. These documents provide
information related to the fact (i.e., they describe the context of the fact). Similarly, the EROCS (Bhide et al., 2008) system basically constructs a
link table between DW facts and external documents. Named entity recognition (NER) techniques are used to identify fact dimension values
within document texts, and then valid combinations likely to represent facts are extracted to define the fact-document links. However, all these approaches only regard the explicit ratings of the opinion posts (e.g., 5-star ratings) as sentiment measures, which are clearly not enough to perform social BI analysis. Firstly, many opinion sources do not provide explicit ratings. Furthermore, the most interesting BI analyses involve facet-sentiment pairs, which must be extracted from the post texts.
A recent approach to integrating BI with sentiment data was proposed by García-Moya et al. (2013a), where a corporate DW is enriched with
sentiment data from opinion posts. In this approach, sentiment data are extracted from opinion posts and then stored into the corporate DW. As
a result, sentiment and corporate data can be jointly analyzed by means of OLAP tools. The main limitation of this approach is that sentiment
data must be fitted to a predefined MD schema, which reduces the range of analytical operations that can be performed over the extracted
sentiment data. In contrast to the closed and rigid scenario of DW/OLAP, in this paper we propose an open and dynamic framework based on
LOD, where data can be linked to external sources on demand, without being attached to rigid data structures or schemas.
As BI mainly involves the integration of disparate and heterogeneous information sources, semantic technologies are essential for effectively discovering and merging data. Most work proposed in this direction can be classified either as work focusing on web data (Pérez et al., 2008c), where the presence of Semantic Web (SW) technologies is taken for granted, or as work using SW technologies to tackle integration in any scenario. A pioneering approach was presented by Mena et al. (2000), where multiple data sources are expressed and integrated via description logics. The main idea behind this model is to achieve a loose coupling between the integrated data sources through semi-automatic ontology mapping tools. It is worth mentioning that this is also the main leitmotiv behind the LOD initiative (Bizer, Heath & Berners-Lee, 2009).
The LOD initiative aims at creating a global web-scale infrastructure for data. Relying on the existing web protocols, this initiative proposes to publish data under the same principles as web documents; that is, each data item must be identified through a Uniform Resource Identifier (URI), with which any user or machine can access its contents. Similarly to web documents, these data can also be linked to each other through their URIs. In order to manage the resulting data network, data must be provided with well-defined semantics to allow users and machines to interpret them correctly. For this purpose, the W3C consortium has proposed several standards for publishing and semantically describing data, mainly the Resource Description Framework (RDF) and the Web Ontology Language (OWL). In this paper we refer to the data networks that result from publishing and linking data with the standard formats RDF and OWL as semantic data infrastructures.
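As a minimal sketch of these principles using the rdflib Python library (the namespace, resource and property names below are illustrative placeholders, not the vocabularies defined later in the paper), a resource is identified by a URI, described with RDF triples, and serialized for publication:

```python
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/slod/")     # hypothetical namespace
g = Graph()

item = EX["Peugeot_308SW"]                     # every resource gets a URI
g.add((item, RDF.type, EX.Item))               # typed with an (illustrative) class
g.add((item, EX.hasName, Literal("Peugeot 308 SW")))

print(g.serialize(format="turtle"))            # publishable RDF representation
```

Other datasets can then link to the same URI, which is what turns isolated datasets into a web-scale data network.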
Semantic data infrastructures provide a series of standards and tools for editing, publishing and querying their data. The basic component of this
infrastructure is the dataset, which consists of a set of RDF triples that can be linked to other LOD datasets. These datasets usually provide a
SPARQL endpoint, with which data can be accessed via declarative queries. Additionally, SPARQL also enables distributed queries over linked
datasets. These data infrastructures are opening new opportunities for both data providers and consumers to develop new applications that go beyond corporate boundaries. More specifically, LOD has opened new ways to perform e-commerce activities such as retailing, promotion, and so on. Proposals like schema.org and GoodRelations (http://www.heppnetz.de/projects/goodrelations) are enabling the massive publication of product offers as micro-data, as well as specific vocabularies for e-commerce applications. Additionally, commercial search engines like Google and Yandex are adopting these formats to improve search over these data. As far as we know, there is no open data infrastructure
that allows users and applications to directly perform analysis tasks over huge amounts of published opinions in the Web. Some preliminary work
such as MARL (Westerski et al., 2011) attempts to provide proper schemas for expressing opinion data as linked data. However, MARL has not
been devised for performing large-scale BI analyses, and consequently it disregards the BI patterns with which data should be aggregated, as well
as data provisioning methods to populate the intended data infrastructure.
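To give a flavour of the declarative access such datasets offer, the sketch below runs a SPARQL query over a small in-memory rdflib graph; a real dataset in the proposed infrastructure would instead expose an HTTP SPARQL endpoint, and the vocabulary here is again purely illustrative:

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/slod/")     # hypothetical namespace
g = Graph()
g.add((EX.opinion1, EX.onFacet, Literal("battery life")))
g.add((EX.opinion1, EX.hasPolarity, Literal(-1)))

q = """
PREFIX ex: <http://example.org/slod/>
SELECT ?facet ?pol WHERE { ?o ex:onFacet ?facet ; ex:hasPolarity ?pol . }
"""
for row in g.query(q):          # declarative query over the RDF triples
    print(row.facet, row.pol)
```

The same query, sent to a SPARQL endpoint over HTTP, could aggregate opinions across distributed, linked datasets rather than a single local graph.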
Although these three worlds, BI, sentiment data and LOD technology, have kept unconnected to each other until recently, in this paper we
advocate for a BI paradigm shift towards a LOD infrastructure of sentiment data extracted from social media. With this infrastructure companies
are able to execute complex analysis operations that dynamically integrate corporate data with relevant social data. With this infrastructure, it is
possible to study the response of consumers to the company strategic decisions, to identify the sentiments that company products and services
produce among consumers or to analyse social data with the purpose of predicting new demands of the market.
3. A SOCIAL BI DATA INFRASTRUCTURE
From a BI point of view, social data can be regarded as a multidimensional model that can be blended with company data to support decision-making. For example, the reputation of a product, the most outstanding features of some brand, or the opined aspects of an item can be
represented as multidimensional data, and efficiently computed through OLAP tools (García-Moya et al., 2013a). In this section we first present a
new set of analytical patterns that combine corporate and social data. Then, in Section 3.2, the global requirements of our social BI infrastructure
are established. Finally, a structural view and a functional architecture for implementing the infrastructure are introduced in Section 3.3.
3.1. Analytical Patterns for Social BI
The main BI patterns to analyze and combine corporate and social data are summarized in Figure 2. The analysis patterns at the corporate data
side of the figure correspond to the traditional MD model of a typical DW (Codd, 1993). Patterns at the social data side constitute the main
contribution of our proposal, and they are explained in the next paragraphs.
Figure 2. Main BI patterns in a social analysis context scenario. Notice that some facts also act as dimensions of other facts (e.g., post fact and market fact).
In the figure, facts (labelled with ‘F’) represent spatio-temporal observations of some measure (e.g., units sold, units offered, number of positive
reviews, and so on), whereas dimensions (labelled with ‘D’) represent the contexts of such observations. In some cases, facts can have a dual
nature, behaving as either facts or dimensions according to the analyses at hand. For example, in Figure 2, a post can be considered either as a
fact or as a dimension of analysis for opinion and social facts. Dimensions can further provide different detail levels (labelled with ‘L’). For
example, the dimension Item is provided with the level Sentiment topic. In Figure 2 we have distinguished two kinds of corporate data that can be combined with social data, namely: Corporate fact, which concerns business transactions (e.g., sales, contracts, etc.), and Market fact, which concerns the promotions and offers of the company's products and services.
The main facts concerning social data are opinion facts, post facts, and social facts. Opinion facts are observations about sentiments expressed
by opinion holders concerning concrete facets about an item, along with their sentiment indicators. For example, the sentence “I don’t like the
camera zoom” expresses an opinion fact where the facet is “zoom”, and the sentiment indicator is “don’t like” (negative polarity). Post facts are
observations of published information about some target item, which can include a series of opinion facts. Examples of post facts can be reviews,
tweets, and comments published in a social network. Notice that opinion facts are usually expressed as free texts in the posts, and therefore it is
necessary to process these texts to extract the facts (Liu, 2012). Finally, social facts are observations about the opinion holders that exchange sentiments about some topic. These facts are usually extracted from social networks by analyzing the structure that emerges when the opinion holders discuss some topic (Pak & Paroubek, 2010). Notice that topic-based communities can be very dynamic, as they rise and fall according to time-dependent topics (e.g., news, events, and so on).
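The three kinds of social-side facts can be pictured with a few plain data structures (a purely illustrative sketch; the field names are invented and are not the SLOD-BI properties described later):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OpinionFact:          # facet + sentiment indicator extracted from text
    facet: str              # e.g., "zoom"
    indicator: str          # e.g., "don't like"
    polarity: int           # e.g., -1

@dataclass
class PostFact:             # a published review, tweet or comment about an item
    item: str
    source_url: str
    date: str
    opinions: List[OpinionFact] = field(default_factory=list)

@dataclass
class SocialFact:           # relevance of a holder/opinion within its community
    holder: str
    followers: int
    times_shared: int
```

The dual fact/dimension nature mentioned above shows up naturally here: a PostFact is a fact in its own right, but it also acts as the context (dimension) of the OpinionFact instances it contains.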
As for the measures associated to these facts, Table 1 shows some examples of typical measures used in the literature for sentiment and social
analysis.
Table 1. Examples of measures for social BI facts
It is important to notice that in Figure 2 the corporate and social BI patterns are separated by a dotted line. As the intended data infrastructure is
aimed at facilitating the integration of information, we define data bridges between corporate and social data elements (see the arrows that cross
the dotted line in Figure 2). Data bridges are the patterns that can be used to execute analysis operations that combine corporate and social data.
Each data bridge consists of an internal data element and an external data element (i.e., dimensions or facts) that are related in the analysis
scenario. For example, the analytical pattern between market facts and post facts can be applied to study the features of marketing campaigns from the point of view of their acceptance by consumers and, conversely, to analyse consumer opinions in the context of each campaign. Different applications and different scenarios can make use of different data bridges to integrate data.
Data bridges support the communication channels between the internal and external data sources, and it is very important for companies to enable all the means necessary to implement them. Of special interest are the data bridges that relate the Sentiment Topic to Facet and the Customer to Holder dimensions. In the first case, the company can specify the most important topics concerning its items (products or services) that require sentiment analysis; these usually coincide with some of the facets that appear in the opinions of a post. In order to facilitate the implementation of this data bridge, companies and social media users could apply the same hashtags to mark up these topics. In the second case, it is important to note that when the holder of an opinion is a known customer, both entities must be identified as the same. With respect to these data bridges, companies must ensure that the corporate data and metadata files include the key information needed to enable the recognition of corporate entities in social data by means of sentiment analysis tools.
Some examples of interesting sentiment and social analysis operations that can be performed with the previous BI patterns are the following:
• To predict the popularity of a topic (i.e., sentiment topic) in the different communities;
3.2. Requirements of a Social BI Data Infrastructure
Regarding the nature of the data to be published in the data infrastructure, we have identified a set of global requirements that are not covered
yet by current proposals, namely:
1. The infrastructure must give support for massive generation of sentiment data from posts (e.g., reviews, tweets, etc.) so that high volumes
of crawled data can be quickly processed and expressed as linked data. As in DWs, a series of ETL processes are needed to periodically feed
the data infrastructure. These ETL processes are quite unconventional, as they deal with semi-structured web data, perform some kind of
sentiment analysis, and output RDF triples;
3. Analysing social data can imply the massive generation of opinion facts from social media sources (i.e., Big Data). Consequently, the infrastructure must support massive processing and distribution of data, providing optimal partitions with respect to data usage. Since BI analysis is subject-oriented, data distribution should take advantage of the topics around which opinions are generated. For example, opinion facts should be organized into item families (e.g., electronic products, tourist services, etc.) and allocated to separate distributed datasets;
4. The infrastructure must provide fresh data by migrating published posts as quickly as possible. In this respect, depending on the scenario and other features, social data elements have different lifespans during which they can be considered fresh for real-time applications;
5. The infrastructure must ensure the quality and homogeneity of the datasets, dealing with the potential multi-lingual issues of a BI scenario. In this respect, it is essential to focus only on the posts with opinions that are relevant, discarding all those social data elements without a clear and valid meaning. As e-commerce acts in a global market, sentiment data extracted from different countries will be expressed in different languages. Datasets must support multi-lingual expressions as well as organize them around well-understood semantic concepts (see requirement 2). Additionally, links between datasets of the intended infrastructure must be as coherent as possible, using the appropriate classes and data types offered by our infrastructure. Some current approaches like MARL (Westerski & Iglesias, 2011) allow users to express opinion facts with any kind of resource (e.g., a string, a URI to an external entity, etc.). Although this makes the schema much more flexible in accommodating any opinion fact, it makes it unfeasible to perform BI analysis over these data;
6. The infrastructure must support complex analysis operations that integrate data with two different purposes. In many cases, companies
will exploit the social datasets to execute analysis operations on internal corporate data but contextualized with external sentiment data. For
this purpose, social data should be easily structured, loaded and integrated with corporate data in order to analyse it with the available BI
applications. In other cases, advanced applications working in the cloud will analyse relevant social data in the context of some company
events such as marketing campaigns or special offers. Although these applications will mainly use social data, they will also need relevant
data coming from the corporate databases.
In this paper, we mainly focus on points 1, 2, 5 and 6. Points 3 and 4 are left for future work, since they depend on the growth rate of the infrastructure: the number of followed opinion streams, the variety of domains to be regarded, and so on.
3.3. SLOD-BI Overview
Regarding the previous requirements, Figure 3 proposes the architecture for the intended social BI data infrastructure. First, we divide the
involved datasets into two layers. Thus, the inner ring of Figure 3 regards the main vocabularies and datasets of the proposed infrastructure,
whereas the outer ring comprises the external linked open vocabularies (LOV), and the datasets that are directly related to the infrastructure
(e.g., DBpedia and ProductDB). Every SLOD-BI component consists of a series of RDF-triple datasets regarding some of the perspectives we consider relevant for BI over sentiment data. For example, in the Item Component each dataset holds the products associated with a particular domain (e.g., cars, domestic devices, etc.). These datasets are elaborated and updated independently of each other, and can be allocated to different servers. All the datasets of a component share exactly the same schema (i.e., set of properties), reflecting the BI patterns defined in
Section 3.1.
Figure 3. Structural view of SLOD-BI. LOD stands for linked open data (http://linkeddata.org), and LOV for linked open vocabularies (http://lov.okfn.org)
In Figure 3, links between components are considered hard links, in the sense that they must be semantically coherent, and they are frequently
used when performing analysis tasks. Consequently, the infrastructure should facilitate join operations between triples of these datasets. On the
other hand, links between infrastructure components and external datasets are considered soft links, as they just establish possible connections
between entities of the infrastructure and external datasets. These external datasets are useful when performing exploratory analyses, that is,
when new dimensions of analysis could be identified in these external datasets. Links to external datasets like DBpedia play a very relevant role in
this infrastructure since they can facilitate the migration of existing review and opinion data. For example, reviews already containing micro-data
referring to some product in DBpedia will be automatically assigned to the item URI of the corresponding SLOD-BI dataset.
Figure 4. Proposed functional view for SLOD-BI infrastructure
Figure 4 summarizes the functional view for the proposed data infrastructure. At the bottom layer, the external web data sources are selected and
continuously monitored to extract, transform and link (ETLink) their contents according to the SLOD-BI infrastructure. As earlier stated, social
BI facts are regarded as spatio-temporal observations of user sentiments in social media. Therefore, both spatial and temporal attributes must be
captured and explicitly reflected in the ETLink processes.
The SLOD-BI infrastructure is exploited by means of the data service layer, which is in charge of hosting all the services consuming sentiment
data to produce the required data for the analytical tools. These services are implemented on top of a series of basic services provided by the
infrastructure, namely: a SPARQL endpoint to directly perform queries over sentiment data, a Linking service to map corporate data to the
infrastructure data (e.g., product names, locations, etc.), an RDF dumper to provide parts of the SLOD-BI to batch-processing services, an API for
performing specific operations over the infrastructure (e.g., registering, implementing access restrictions over parts of the infrastructure, etc.),
and visual tools for data exploration.
Notice that in the proposed functional view, sentiment data is integrated with corporate data at the corporate analytical tool by making use of some intermediate data service. In this case, corporate and sentiment data are aggregated separately and joined inside the analytical tools through a cross-join. This process is similar to Pentaho blending processes for integrating external and internal data (http://www.pentaho.com/bigdatablendoftheweek). The predictive models and exploration tools will allow the execution of complex processes over the sentiment data in the
infrastructure. In both cases, the data service layer will facilitate the retrieval of the relevant corporate data as necessary. An important advantage
of using the data service layer to query the corporate DW is that it helps to maintain the appropriate level of data governance and security
necessary for accurate and reliable analysis (Carey, 2012).
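The blending step can be pictured with pandas (an illustrative sketch; the keys and column names are invented): corporate and sentiment data are aggregated separately and then joined on their shared dimensions inside the analytical tool, rather than being integrated beforehand in a single warehouse:

```python
import pandas as pd

sales = pd.DataFrame({"product": ["308SW", "308SW", "C4"],
                      "month": ["2015-01", "2015-02", "2015-01"],
                      "units": [120, 90, 75]})
opinions = pd.DataFrame({"product": ["308SW", "308SW", "C4"],
                         "month": ["2015-01", "2015-02", "2015-01"],
                         "polarity": [0.6, -0.2, 0.1]})

# Aggregate each side separately, then join on the shared dimensions
corp_agg = sales.groupby(["product", "month"], as_index=False)["units"].sum()
sent_agg = opinions.groupby(["product", "month"], as_index=False)["polarity"].mean()

blended = corp_agg.merge(sent_agg, on=["product", "month"])
print(blended)
```

Here the join is key-based rather than a literal cross product, but the principle is the same: the two aggregates only meet inside the analytical tool, which keeps corporate data behind the data service layer.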
4. SLOD-BI DATASETS
In this section, we describe the main datasets that will constitute the SLOD-BI data infrastructure as represented in the inner ring of Figure 3. In
addition to complying with the W3C recommendations about publishing linked data, these datasets have been defined according to the following
criteria:
• Take profit from existing vocabularies and schemas as much as possible, mainly from schema.org, which is the de facto standard in e-
commerce;
• Distribute data according to the identified BI demands (e.g., subject and topics), in order to achieve high scalability;
• Keep the inner datasets as coherent as possible, so that they can be easily queried for analytical tasks;
• Provide data provenance metadata: all sentiment data captured from the web should be attached to their location (URL) and time, and all
calculated measures should be attached to the service (URL) used for such calculations.
The rest of the section presents the most relevant aspects of the datasets included in each component. In the specific schemas, we do not include the standard properties that are common to all datasets, namely: rdfs:label, to specify possible variants and synonyms of the described entity; owl:sameAs, to specify mappings between infrastructure elements and external datasets; and rdf:type, to classify instances into classes. Moreover, to represent and organize topics within the infrastructure we use the Simple Knowledge Organization System (SKOS, http://www.w3.org/TR/skos-reference/), and for data provenance the Dublin Core vocabulary (dc). We also adopt, whenever possible, the vocabulary of schema.org (namespace s), since it is the de facto standard for e-commerce micro-data.
4.1. Items Component
This component contains the datasets describing concrete products and services as well as their manufacturers (e.g., product brand, providers, facilitators, etc.). These datasets must be kept as simple as possible, providing just the attributes useful for BI tasks. Furthermore, additional attributes and relationships can be accessed through the links to external datasets such as eCl@ss, DBpedia, ProductDB, FreeBase, etc. The main class of this component is slod:Item, whose basic properties are summarized in Table 2.
Table 2. Basic properties for describing items
4.2. Facets Component
This component comprises all the elements subject to evaluation in the users' opinions, which are called facets (slod:Facet). According to the analytical patterns of Section 3.1, we consider two detail levels for judged elements, namely: sentiment topic and facet. A sentiment topic describes some BI perspective of an item family, like “design”, “safety” and “comfortability” for cars. Sentiment topics group facets, which can be any abstract or concrete aspect opined on by the users (e.g., “engine”, “diesel engine”, etc.). In order to account for the semantic relationships between facets (e.g., “diesel engine” is-an “engine”), we make use of the SKOS vocabulary. However, as this kind of relationship is not required for BI analysis, it can be omitted.
Table 3. Basic properties for describing facets
Currently, few LOD datasets include facets subject to opinions (e.g., some small GoodRelations ontologies). We may also consider technical specifications about products, as in eCl@ss, but they do not properly cover the features customers usually opine on (García-Moya et al., 2013b). As a consequence, facets should be extracted directly from text reviews by applying sentiment analysis methods (Liu, 2012). Indeed, one of the SLOD-BI goals is to conceptualize and make public the facets automatically extracted from reviews (see Section 5.2). For this purpose, we propose a simple schema (Table 3) to which item facets must be mapped. The main issues in performing these mappings are: grouping together expressions denoting the same facet (they should appear as different labels of the same instance) and classifying facets into sentiment topics. For the former issue, we make use of external datasets such as BabelNet by means of an automatic linking process (see Section 5.3), whereas the latter issue is addressed by manually defining the required mappings according to corporate criteria.
4.3. Sentiment Indicators Component
Sentiment analysis relies on the existence of a set of words and expressions that indicate some opinions about a subject. The Sentiment
Indicators Component is mainly based on linguistic resources that allow identifying facets from review texts as well as sentiments associated to
them.
Sentiment words, also known as opinion words, are the most important indicators of sentiments about a subject. These are words commonly
used to express positive or negative opinions. For example “excellent”, “amazing”, “good” are positive words whereas “bad”, “terrible”, “awful” are
negative ones. Additionally, there also exist expressions used to convey opinions; for example, “cost a pretty penny” and “cost an arm and a leg” both refer to the indicator expensive.
Sentiment indicators could be defined as context-independent or context-dependent (Lu et al., 2011). An opinion indicator is context-dependent
when its polarity depends on the domain and/or the facets it is modifying (e.g., “unexpected” for movies (+) and electronic devices (−)). Even
within the same domain, the polarity of an indicator may be different depending on the facet it applies to. For example, the word “long” in digital
cameras: “long delay between shots” (−) and “long battery life” (+). Another interesting kind of opinion indicators consists of expressions that
implicitly bring the facet. For example, the indicator “too expensive” refers to the aspect “price”. The main class of the Sentiment Indicator
Component is slod:Indicator. Table 4 shows its main properties.
Table 4. Properties for describing sentiment indicators
Nowadays there exist many sentiment lexicons, some of them available as LOD. The most popular ones are SentiWordNet (Esuli & Sebastiani, 2006) and SenticNet (Cambria et al., 2013), which provide sentiment-based characterizations for common words in English. Unfortunately, these lexicons are of limited use because they are general-purpose and do not take into account context-based indicators (Lu et al., 2011). Additionally, there is a proliferation of web services for computing polarities from free text (Thelwall et al., 2010). This kind of service could be applied to obtain the values of the property slod:hasPolarity. In order to account for both context-based indicators and sentiment indicators implying a facet, we include the property slod:onFacet. For example, the following sentiment indicators also imply a facet: expensive → cost, delicious → taste, spacious → comfort.
4.4. Post and Opinion Facts Components
Currently, we can find several proposals for representing the metadata of reviews and social data in LOD. One of the main references is schema.org, which has been adopted by Google for rich snippets over posts. This vocabulary covers all the aspects we need for the Post Component, and therefore we have adopted it without any extensions. Table 5 shows the main properties associated with the post fact class.
Table 5. Properties for describing post facts
Opinion facts express the associations between features/aspects and the opinion indicators that appear in the post texts. In our approach, an opinion fact is always linked to the post object from which it was identified. Consequently, each opinion fact takes the time and place dimensions from its linked post. Thus, the schema of an opinion fact can be expressed with just the feature/aspect and the indicator/shifters involved in the fact. Table 6 summarizes the properties associated with the opinion fact class.
Table 6. Properties for describing opinion facts
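Putting the post and opinion components together, an opinion fact might be materialized roughly as follows with rdflib; apart from slod:onFacet and slod:hasPolarity, which are named in the text, the namespace and remaining property names are hypothetical placeholders rather than the actual SLOD-BI schema:

```python
from rdflib import Graph, Namespace, Literal

SLOD = Namespace("http://example.org/slod/")       # placeholder namespace
g = Graph()

fact = SLOD["opinionFact/42"]
g.add((fact, SLOD.fromPost, SLOD["post/7"]))       # link to the originating post fact
g.add((fact, SLOD.onFacet, SLOD["facet/zoom"]))    # judged facet
g.add((fact, SLOD.hasIndicator, Literal("don't like")))
g.add((fact, SLOD.hasPolarity, Literal(-1)))       # time and place come from the linked post

print(g.serialize(format="turtle"))
```

Because the fact only stores the facet, the indicator/shifters and the link to its post, the spatio-temporal context is obtained by following that link, as described above.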
4.5. Social Facts Component
There are a few useful sources for extracting social data. The main one is provided by the social networks' own APIs (e.g., the Twitter API, the Google+ API and Facebook's Graph API). Opinions formulated in the context of these social networks usually have a large amount of metadata associated with them, which is accessed through these APIs. Opinion metadata can be used to find indicators of the impact of an opinion in the context of the social network (Guille et al., 2013). We refer to these indicators as social facts. Thus, the aim of social facts is to provide relevance indicators about holders and their opinions in the context of the community they belong to. Measures such as the number of followers and the number of times an opinion was shared are indirect indicators of the relevance of both opinions and holders as perceived by the social community. As a consequence, these metrics resemble those used to assess the reach of social media campaigns.
Table 7. Properties for describing social facts
5. DATA PROVISIONING
This section discusses how the main components of the SLOD-BI data infrastructure are populated from the selected web resources (e.g., blogs,
twitter streams, product reviews sites, and so on). The whole process of data provisioning is summarized as follows:
1. For each followed opinion stream (which is associated to a particular item), all metadata and micro-data are extracted and processed to
generate the corresponding social and post facts;
2. From each post, textual contents are pre-processed for normalization issues (Section 5.1). Then, the system automatically extracts facet
and opinion expressions by applying the vocabularies learnt from domain and background corpora (Section 5.2);
3. Automatically extracted sentiment data is then linked to the infrastructure, generating thus the opinion facts associated to posts (Section
5.3).
The different steps of the whole process are performed within a framework called ETLink, which provides the necessary operators to generate all the required data. A brief description of this framework is given in Section 5.4.
5.1. Text Pre-Processing
It is well known that product review web sites, forums, social networks, and so on, are written in casual language without much attention being paid to spelling. As our method is fully unsupervised and its results are statistical by nature (meaningless words are expected to reach low-probability values), the presence of many spelling errors significantly affects the results. Repeatedly misspelled adverbs, prepositions, conjunctions and so on may be considered “new” words by the method and, therefore, erroneously classified as facet or sentiment indicators. In order to alleviate this issue, the target collection should be fixed before applying the learning method. Basically, the text pre-processing phase is divided into the following three steps:
1. Fix negative contractions (English): When one or more letters are missed out in a contraction, an apostrophe is inserted (e.g., “isnt” is replaced by “isn’t”); a minimal sketch of this step is shown below;
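A minimal sketch of this first step, using a small hand-made list of negative contractions (illustrative only; the actual pre-processing rules used by the authors are not detailed here):

```python
import re

# A few common negative contractions missing the apostrophe (illustrative list)
FIXES = {"isnt": "isn't", "dont": "don't", "doesnt": "doesn't",
         "wasnt": "wasn't", "wont": "won't", "cant": "can't"}

def fix_negative_contractions(text: str) -> str:
    pattern = re.compile(r"\b(" + "|".join(FIXES) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: FIXES[m.group(0).lower()], text)

print(fix_negative_contractions("The zoom isnt great and I dont like it"))
```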
5.2. Vocabularies Construction
Regarding the structure of the proposed infrastructure (Figure 3), the first step towards its population consists in identifying the basic
vocabularies that allow describing sentiments over products and services. Basically, we need to distinguish between two almost disjoint
vocabularies, namely: facets and sentiment indicators. The construction of these vocabularies will be performed for each particular domain (e.g.,
cars, cameras, etc.) since each domain exhibits different terminologies and writing styles. It is worth mentioning that domain ontologies (if
existing) are usually targeted to other purposes different from sentiment analysis, such as e-commerce (e.g., technical product aspects), and
therefore they are usually incomplete for describing social sentiments. This is why some machine learning method is necessary to
comprehensively capture potential facets and sentiment indicators from target collections. Once these potential concepts are identified, they can
be linked to existing ontologies to get a richer view of them. For this purpose, we adopt the unsupervised statistical method proposed in (García-
Moya et al., 2013b), which aims at assigning probabilities to words acting as either facets or indicators. This method is summarized as follows.
We consider stochastic mappings between words to estimate a unigram language model of facets from a probabilistic model of opinion words.
The initial unigram language model for facets P is defined as follows:
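Reading the surrounding definitions of T and Q, a plausible form of this model (an assumption, not a quotation of the original formula) is that the facet model is obtained by propagating the opinion-word probabilities through the entailment matrix:

```latex
P(w_i) \;=\; \sum_{j=1}^{n} p(w_i \mid w_j)\, Q(w_j), \qquad 1 \le i \le n,
\quad \text{i.e.} \quad P = T\,Q .
```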
The matrix T = {p(w_i | w_j)}, 1 ≤ i, j ≤ n, represents the word-word entailment probabilities, which are estimated from the local contexts of a large collection of opinion posts of the target domain. The unigram model Q is a generative model of opinion words, which assigns to each word w the likelihood of its being an opinion word, denoted Q(w).
In addition, we consider refining the unigram model P to avoid assigning high-probability values to meaningless words such as prepositions and conjunctions. The refined unigram language model P' is obtained by means of an expectation-maximization (EM) process (Neal & Hinton, 1998) that minimizes the cross entropy with respect to a background model P_bg:
In (García-Moya et al., 2013b), these statistical models are used to generate a ranking of facet-sentiment pairs for either a single review or the whole collection. In this work, our aim is slightly different, as we want to build two basic vocabularies for the data infrastructure, namely: words acting as facets (model H), and words acting as sentiment indicators (model O). For this purpose, we apply the following iterative process:
The optimal H and O models at step j are obtained by applying the EM algorithm, taking as reference the model P' defined above. The initial model O^(0) is set to the model of opinion words Q, and α is set to 0.5. Finally, by applying a threshold over the H and O models we obtain the vocabularies to be used for facets and sentiment indicators, respectively.
5.3. Automatic Linking of Data
Opinion posts (e.g., product reviews, tweets) usually consist of free-text fields where users express their opinions. Therefore, opinion facts are usually expressed in these fields as natural language expressions. In order to extract these expressions and map them to the infrastructure, it is necessary to define an automatic semantic annotation process. In our proposal, this process consists of the following phases:
2. Recognize chunks corresponding to facet expressions and those corresponding to opinion expressions;
3. Find the relations between facet and opinion expressions;
6. Finally, generate the opinion fact for each facet-opinion expression pair, together with the information related to it (i.e., polarity and links).
Given a sentence, it is first cleaned and represented as a plain sequence of words. To chunk the sentence, we take into account four categories:
facet words, sentiment indicator words, shifters, and connector words. Facet words and sentiment indicator words are extracted from the
corresponding datasets of SLOD-BI (rdfs:label statements). Additionally, we also identify words that can change the valence of the polarity
assigned to sentiment indicators, like negations (“not”, “never”, “none”, etc.), intensifiers (“deeply”, “very”, “little”, “rather”, etc.), modal shifters
(“might”, “possibly”, etc.), and presuppositions (e.g., “lack”, “neglect”, “fail”, etc.). These words constitute the lexicon of shifters (Polanyi et al., 2006). Finally, connector words are those that connect words to express concepts (e.g., prepositions). In this way, facet expressions are subsequences of consecutive words categorized as either facet or connector, whereas opinion expressions are subsequences of consecutive words categorized as either sentiment indicator or shifter.
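A simple version of this chunking step can be sketched as follows (the word lists are toy examples; in the real pipeline the facet and indicator vocabularies come from the corresponding SLOD-BI datasets):

```python
FACETS = {"battery", "life", "zoom"}        # toy facet vocabulary
INDICATORS = {"short", "great", "like"}     # toy sentiment indicator vocabulary
SHIFTERS = {"not", "too", "very"}
CONNECTORS = {"of", "with"}

def categorize(word):
    if word in FACETS: return "facet"
    if word in INDICATORS: return "indicator"
    if word in SHIFTERS: return "shifter"
    if word in CONNECTORS: return "connector"
    return "other"

def chunks(words, kinds):
    """Return maximal runs of consecutive words whose category is in `kinds`."""
    result, current = [], []
    for w in words:
        if categorize(w) in kinds:
            current.append(w)
        elif current:
            result.append(" ".join(current))
            current = []
    if current:
        result.append(" ".join(current))
    return result

words = "the battery life is too short".split()
print(chunks(words, {"facet", "connector"}))     # facet expressions -> ['battery life']
print(chunks(words, {"indicator", "shifter"}))   # opinion expressions -> ['too short']
```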
Once facet and opinion expressions are identified, each facet expression must be associated with its opinion expressions. Accurate results can be obtained by using dependency analysis, thus assigning to each facet expression the opinion expressions whose words are syntactically related to the facet expression words. However, this operation is time-consuming and depends on the language of the posts. A simpler heuristic consists of just taking the opinion expressions adjacent to the facet expression, and checking whether they entail each other by applying the statistical model described in the previous section. For the current prototype of the infrastructure, we have applied this simple strategy.
Each opinion expression is analyzed to assign it a polarity. A simple algorithm for this analysis consists of the following steps: first, assign each sentiment indicator word its polarity score; then invert the sign of the words affected by shifters; and finally sum all the scores. Notice that the polarity score of a word can depend on its context (i.e., the facet expression to which it is assigned).
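A direct transcription of this simple algorithm (the scores, the shifter list and the shifter-scope rule are simplified toy choices; as noted above, real scores may be context-dependent):

```python
SCORES = {"good": 1.0, "short": -1.0, "expensive": -1.0}   # toy polarity lexicon
SHIFTERS = {"not", "never", "hardly"}                      # valence shifters

def polarity(opinion_words):
    """Sum indicator scores, inverting the sign of words preceded by a shifter."""
    total, flip = 0.0, False
    for w in opinion_words:
        if w in SHIFTERS:
            flip = True
        elif w in SCORES:
            total += -SCORES[w] if flip else SCORES[w]
            flip = False
    return total

print(polarity(["not", "good"]))   # -1.0
print(polarity(["short"]))         # -1.0
```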
In order to link the data extracted from posts to the resources of the corresponding datasets, we use the concept retrieval technique described in (Berlanga et al., 2010). Basically, given a text chunk T associated with either a feature or an opinion expression, we score the candidate resources R, whose labels are denoted labels(R), as follows:
In these functions, both text chunks and labels are expressed as sets of words. The info measure accounts for the relevance of the matched words with respect to the candidate resources, which is captured by their inverse frequency in a background corpus, P_bg(w). In practice, the strings T and L are previously normalized by applying stopword removal, case lowering and word lemmatizing, in order to favour their matching.
Finally, the top scored resources R whose scores are greater than a given threshold and that best cover the chunk T are selected to link the
opinion fact to the corresponding datasets. For example, the feature expression “my 308sw” would be linked to the resource
“slod:Peugeot_308SW”.
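One way to instantiate this scoring, assumed here for illustration rather than being the exact formula of Berlanga et al. (2010), is to weight the overlap between the normalized chunk and each candidate label by the information content of the matched words, info(w) = -log P_bg(w):

```python
import math

P_BG = {"my": 0.05, "308sw": 1e-6, "the": 0.06}      # toy background probabilities

def info(word, default=1e-6):
    return -math.log(P_BG.get(word, default))

def score(chunk_words, label_words):
    """Info-weighted coverage of the candidate label by the chunk (a sketch)."""
    matched = set(chunk_words) & set(label_words)
    denom = sum(info(w) for w in set(chunk_words) | set(label_words))
    return sum(info(w) for w in matched) / denom if denom else 0.0

chunk = ["my", "308sw"]
label = ["peugeot", "308sw"]        # normalized label of slod:Peugeot_308SW
print(score(chunk, label))
```

Rare, informative words such as “308sw” dominate the score, so the chunk is linked to the car resource despite the unmatched common words.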
5.4. ETLink Processes
Similarly to traditional DWs, we propose to populate the SLOD-BI infrastructure by means of ETL processes. An ETL process consists of a data
flow that periodically extracts data from the sources, and transforms them into elements of the DW (i.e., dimensions and facts). The processing
units of an ETL are called operators, which consume and produce tabular data. Operators perform SQL-like operations (e.g., selection, join,
union, group by, and so on), as well as other data transformations such as concatenation/split of columns, function application to columns, and
so on.
Table 8. Proposed ETLink operator types
The implementation of ETLink processes follows the same spirit as pygrametl (Thomsen & Pedersen, 2009). Broadly speaking, pygrametl provides a series of Python classes for performing data transformations and for populating DW structures (i.e., dimensions and facts). Data flows are then specified with Python scripts using these classes. In our approach, workflow operators consume and produce
either tabular data (CSV) or RDF triples. Instead of using DW structures, we use RDF primitives (RDFLib library) to generate the intermediate
data, and SPARQL to perform the required look-up operations. Moreover, we provide operators to perform both sentiment analysis and data
linkage.
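The following sketch shows what a single ETLink-style operator could look like with RDFLib: it consumes a CSV of extracted opinion facts and produces RDF triples, with SPARQL available for look-ups on the intermediate graph. The namespace, class and property names, and the CSV columns are illustrative assumptions rather than the actual SLOD-BI vocabulary.

import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# Sketch of one ETLink-style operator: CSV in, RDF triples out (RDFLib).
# slod-bi.example.org and all class/property names below are placeholders.
SLOD = Namespace("http://slod-bi.example.org/vocab#")

def opinion_facts_to_rdf(csv_path, out_path):
    g = Graph()
    g.bind("slod", SLOD)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            fact = SLOD[f"opinionFact/{i}"]
            g.add((fact, RDF.type, SLOD.OpinionFact))
            g.add((fact, SLOD.hasFacet, SLOD[row["facet"]]))
            g.add((fact, SLOD.hasItem, SLOD[row["item"]]))
            g.add((fact, SLOD.polarity,
                   Literal(float(row["polarity"]), datatype=XSD.decimal)))
    g.serialize(destination=out_path, format="turtle")

# Example look-up over the intermediate graph (also illustrative):
# g.query("SELECT ?f WHERE { ?f a slod:OpinionFact ; slod:polarity ?p . FILTER(?p < 0) }")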
Table 9. Summary of the ETLink processes for the SLOD-BI infrastructure
6. EVALUATION
To populate the SLOD-BI infrastructure we have selected a subset of opinion posts from several social media sources specializing in vehicles, and from Twitter. Table 10 summarizes the main statistics. Although many more opinion facts are extracted from Twitter, the opinion facts from specialized forums exhibit a much higher quality. Overall, there are many more positive comments than negative ones, whereas the usual situation is that negative comments dominate social sentiment data. This seems to be a particularity of this domain (cars), where customers are usually satisfied with their vehicles.
According to the structural view of SLOD-BI in Figure 3, the inner ring must be populated with the vocabularies and datasets for the car rental domain. The facet and sentiment indicator components have been constructed as explained in Section 5.2. For example, given a stream of opinions about cars, some relevant aspects are interior, engine, cost, consumption, etc. From sentences like “The interior design is attractive” or “The interior is superb quality and just so comfortable” we can extract the facet “interior” and the positive sentiment indicators “attractive” and “superb quality”. In the use case developed in the following section, facets are classified into six sentiment topics useful for analysis. For example, aspects such as “interior”, “style” and “dashboard” belong to the “design” topic, whereas aspects such as “clutch”, “wheel” and “gearbox” belong to the “mechanical” topic, and so on. The prototype of this dataset can be accessed through the SPARQL endpoint
http://krono.act.uji.es/SLOD-BI/sparql.
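As an illustration, the endpoint can be queried with any SPARQL client; the Python sketch below uses SPARQLWrapper to list facet labels per sentiment topic. Only rdfs:label is taken from the text above; the slod prefix and the hasSentimentTopic property are assumed placeholder names, not necessarily the published vocabulary.

from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: list facets grouped by sentiment topic from the public endpoint.
# slod:hasSentimentTopic and the slod namespace are illustrative assumptions.
endpoint = SPARQLWrapper("http://krono.act.uji.es/SLOD-BI/sparql")
endpoint.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX slod: <http://slod-bi.example.org/vocab#>
SELECT ?topic ?facetLabel WHERE {
  ?facet rdfs:label ?facetLabel ;
         slod:hasSentimentTopic ?topic .
} LIMIT 20
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["topic"]["value"], row["facetLabel"]["value"])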
Table 10. Statistics of opinion posts processed
The rest of the datasets of the SLOD-BI infrastructure (opinion facts, items, etc.) are populated and linked by semantically annotating and
processing the post facts as explained in Section 5.3. As a result, Figure 5 shows an excerpt of an opinion fact that has been extracted from a post
fact and linked in the SLOD-BI infrastructure.
Figure 5. Example of opinion fact in the SLOD-BI
infrastructure
6.1. Quality Results
As the semantic data is automatically extracted without any supervision, it is necessary to evaluate the quality of the generated data, as well as to find the parameter settings of the learning algorithms that achieve good enough results. For this purpose, we have built two reference lexicons (unigram models), for facet and sentiment words respectively, restricted to a particular domain. To build the lexicon of sentiment words, we have downloaded and merged more than ten freely available opinion word lists. The probability of each word is estimated from a large corpus of reviews of the specific domain. To build the lexicon of facet words, we manually chose a set of Wikipedia categories falling within the target domain, and then selected all Wikipedia entries having at least one of these categories.
With these two reference lexicons, we can evaluate the quality of the automatically extracted sentiment and facet expressions. First, each word of the constructed vocabulary (Section 5.2) is classified as either facet or sentiment according to its probabilities in the reference lexicons. The overall precision of the method is then calculated as the total number of correctly classified words divided by the total number of classified words. Notice that some words may remain unclassified, either because they do not appear in the reference lexicons or because they cannot be statistically classified. Table 11 shows the precision results for the automatically generated vocabularies for facets (H) and sentiment words (O) in the “cars” domain. Results are shown with respect to the probability threshold applied to the models.
Table 11. Precision results of the generated vocabularies for the “cars” domain
We can see that the quality achieved for sentiment words is quite good across the different probability thresholds. Results are not as good for the facet vocabulary, because many words affected by sentiment indicators do not belong to the domain (e.g., expressions like “I had a nice day”). In order to improve the quality of this vocabulary we make use of the entailments derived from the target collection and BabelNet (Navigli & Ponzetto, 2010). Thus, for the final vocabulary we only consider those facet words that participate in at least one entailment of each translation model. As a result, the facet vocabulary is reduced to 638 words, achieving a precision of 0.93.
7. AN EXAMPLE OF SOCIAL ANALYSIS WITH SLOD-BI
To demonstrate our proposal, we have developed a prototypical SLOD-BI infrastructure for the car rental domain. At the core of each rent-a-car company lies the idea of providing its customers with cost-effective, quality services. This vision must be reflected in each of its business activities, which range from accepting reservations for new and existing customers to providing cars to customers, handling car upgrades when there is a shortage of cars, or selecting the best promotional offer plan for each customer, among others.
To ensure business success, companies often have a series of strategic goals, such as optimum utilization of resources, customer satisfaction or
controlling costs, which are materialized by more specific and measurable objectives. The objectives are set up as the result of a decision making
process, which usually involves complex analytical queries over corporate data. The most established approach is to use a DW to periodically
store information subject to analysis. In the case of a car rental company, the DW schema to analyze rental agreements could be similar to the
one proposed in (Frias et al., 2003), where typical analysis dimensions include the rented vehicles, locations, customer features, etc. In order to
make decisions, analysts often request the generation of reports involving analytical queries, e.g., number of rental agreements per location and
time or preferred rented vehicles by location.
Apart from traditional analytical queries involving corporate data, there is a need to gain more insight into the business's internal processes in real time in order to react more efficiently. In particular, customer satisfaction has become the greatest asset for success and there is a growing need to know customers' opinions about a company's products and services. In this way, companies are able to dynamically integrate corporate data with relevant social data to analyze the response of customers to their strategic decisions or to predict the demands of the market.
For a successful analytical experience, the company must specify the most important topics of its items (products or services) that require sentiment analysis. In our use case, the company is interested in knowing people's opinions about the vehicles it offers for rent; therefore, it has set up six sentiment topics it considers of interest: comfortability, safety, driving perception, design, mechanical issues and price. By analyzing people's opinions of its vehicles with respect to these topics, the company is able to detect vehicles implying high maintenance costs, the vehicles preferred for their design, etc.
Once the SLOD-BI is set up for the car rental domain, sentiment data can be consumed by means of the data service layer in order to produce the
required data for the analytical tools. In the following, we present a series of examples of interesting analytical queries over the SLOD-BI that can
be integrated with corporate data:
1. The analyst has executed a query over the corporate DW to find out the top rented cars by location. However, they would like to gain more insight by aligning the top rented cars with people's opinions about such cars with respect to design, to check whether design is a relevant aspect for the customers behind those rentals. Figure 6 shows people's opinion (i.e., polarity) of different cars with respect to the sentiment topic design. This graph is the result of executing a query over the SPARQL endpoint service provided by SLOD-BI (a query sketch in this spirit is shown after this list). The query aggregates the polarities of all the aspects classified under the topic design. Notice that whereas the first car has a high positive polarity, the last four cars have a negative polarity, meaning that users are not happy with the design aspects of such cars;
2. The company is interested in acquiring a new fleet, but first it would like to analyse people's opinions about cars with respect to mechanical issues, in order to avoid acquiring cars that usually involve more mechanical problems. Figure 7 shows the result of aggregating people's opinions about the topic mechanical issues. Notice that the last two cars show a highly negative polarity, and therefore the acquisition of these cars should be avoided;
3. The firm Peugeot has offered the rental company a special price if it acquires more than 10 units of the “Peugeot 208”. However, the company would like to know people's opinion about this specific car with respect to the topics it considers relevant. Figure 8 shows the results in the form of a bar chart. From the graph, we observe that design and safety are the highest rated aspects, whereas price is the lowest;
4. Finally, the company is interested in blending corporate and social data to get some insight into how design opinions can affect the number of contracts with respect to the company fleet. For this purpose, the popular corporate analytical tool KNIME (http://www.knime.org/) is used. In a few words, KNIME is an open source data analytics, reporting and integration platform with a graphical user interface that allows the assembly of nodes for data pre-processing, modelling, analysis and visualization. We have implemented a node for performing SPARQL queries on SLOD-BI, and then we have used the workflow nodes of KNIME to integrate the social and corporate data. The resulting workflow is shown in Figure 9a. The bottom node queries corporate data to extract the number of rentals by car during 2013. The result is a table with two columns, the car and the number of rentals. The RDF QueryAP node executes a SPARQL query over the data service layer of the infrastructure to extract sentiment data about the topic “design”. After some processing, the Joiner node merges the two tables by the car column, and the resulting chart (Figure 9b) displays the number of car rentals (in blue) vs. the aggregated opinion on “design” aspects (in red) by car. In general, we observe a positive correlation between the two variables, as the most rented cars (i.e., Renault Megane, Peugeot 208 and 508) are the ones with the highest ratings of design aspects.
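As referenced in the first example above, a query in the spirit of Figure 6 could look like the following sketch, which aggregates the average polarity per car for the facets classified under the topic design. All slod terms are assumed placeholder names rather than the published SLOD-BI vocabulary; the endpoint URL is the one given in Section 6.

from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch of the kind of aggregation behind Figure 6: average design polarity per car.
# The slod prefix, classes and properties are illustrative placeholders.
endpoint = SPARQLWrapper("http://krono.act.uji.es/SLOD-BI/sparql")
endpoint.setQuery("""
PREFIX slod: <http://slod-bi.example.org/vocab#>
SELECT ?car (AVG(?polarity) AS ?designPolarity) WHERE {
  ?fact a slod:OpinionFact ;
        slod:hasItem ?car ;
        slod:hasFacet ?facet ;
        slod:polarity ?polarity .
  ?facet slod:hasSentimentTopic slod:Design .
}
GROUP BY ?car
ORDER BY DESC(?designPolarity)
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["car"]["value"], row["designPolarity"]["value"])

The resulting table could then be joined with the corporate rentals table on the car column, much as the Joiner node does in the KNIME workflow described above.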
This paper has presented SLOD-BI, a new semantic data infrastructure for capturing and publishing sentiment data to enable Social BI. The
infrastructure components are designed to cover the main BI patterns we have identified for analysing both corporate and social data in an
integrated way. The infrastructure also provides the functionality required to perform massive opinion analysis, for example the automatic
extraction of sentiment data from posts, and their linkage to the infrastructure. As a result, users will be able to incorporate opinion-related
dimensions in their analysis, which is out of reach of traditional BI.
For future work, we will study the performance of complex queries over the SLOD-BI infrastructure, for example OLAP-like operations, which
may require massive data processing methods. For this purpose, the datasets in the inner ring of SLOD-BI must be properly partitioned and
distributed according to the BI demands. For example, datasets should be partitioned with respect to domains and time slices. Moreover, functional map-reduce implementations (Dean & Ghemawat, 2004) can process such distributed partitions and parallelize complex analysis operators such as filter, join and aggregate (Sridhar et al., 2009). Additionally, to speed up costly operations within the inner SLOD-BI datasets, ad-hoc indexing mechanisms should be defined. It is more challenging, however, to efficiently perform BI operations involving external datasets, as we do not have control over them. Finally, to extend the functionality of the infrastructure we aim at linking data to multi-lingual resources such as
BabelNet. We also plan to introduce services for transforming the query results to the RDF Data Cube vocabulary, so that they can be included in
tools designed for this vocabulary.
Another issue to be addressed in future work is how the infrastructure can manage the high dynamicity of certain topics in some domains. Unfortunately, the problem of adapting sentiment analysis tools to evolving topics has received little attention in the literature. Moreover, the validation of a self-adapting approach for sentiment analysis requires a huge amount of data recorded over a long period in order to detect fast iteration cycles.
Finally, another open issue of the infrastructure is the mapping of automatically extracted facets to corporate sentiment topics. Currently, these mappings are performed manually, but this process has a high cost and is prone to errors. In future work we plan to study semi-automatic methods for performing these crucial mappings of the infrastructure.
This work was previously published in the International Journal of Data Warehousing and Mining (IJDWM), 11(4); edited by David Taniar,
pages 128, copyright year 2015 by IGI Publishing (an imprint of IGI Global).
ACKNOWLEDGMENT
This work has been partially funded by the “Ministerio de Economía y Competitividad” with contract numbers TIN2011-24147 and TIN2014-
55335-R. We would like to thank Avelino Font for helping us in implementing the first prototype of the SLOD-BI infrastructure.
REFERENCES
Berlanga, R., Nebot, V., & Jimenez, E. (2010). Semantic annotation of biomedical texts through concept retrieval. Procesamiento del Lenguaje Natural, 45, 247–250.
Bhide, M., Chakravarthy, V., Gupta, A., Gupta, H., Mohania, M., Puniyani, K., . . . Sengar, V. (2008). Enhanced business intelligence using EROCS. In Proc. of the 24th IEEE International Conference on Data Engineering, 1616–1619. IEEE Press.
Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked data - the story so far. International Journal on Semantic Web and Information Systems, 5(3), 1–22. doi:10.4018/jswis.2009081901
Cambria, E., Song, Y., Wang, H., & Howard, N. (2013). Semantic Multi-Dimensional Scaling for Open-Domain Sentiment Analysis. IEEE Intelligent Systems, 29(2), 44–51. doi:10.1109/MIS.2012.118
Carey, M. J., Onose, N., & Petropoulos, M. (2012). Data Services. Communications of the ACM, 55(6), 86–97. doi:10.1145/2184319.2184340
Codd, E. F., Codd, S. B., & Salley, C. T. (1993). Providing OLAP (Online Analytical Processing) to User Analysts: An IT Mandate. E. F. Codd and Associates.
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), 137–150.
Esuli, A., & Sebastiani, F. (2006). SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. In Proc. of the 5th International Conference on Language Resources and Evaluation (LREC), 417–422.
Frias, L., Queralt, A., & Olivé, A. (2003). EU-Rent Car Rentals Specification. Research report LSI-03-59-R, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya.
García-Moya, L., Anaya-Sánchez, H., & Berlanga, R. (2013b). A Language Model Approach for Retrieving Product Features and Opinions from Customer Reviews. IEEE Intelligent Systems, 28(3), 19–27. doi:10.1109/MIS.2013.37
García-Moya, L., Berlanga, R., & Anaya-Sánchez, H. (2012). Learning a statistical model of product aspects for sentiment analysis. Procesamiento del Lenguaje Natural, 49, 157–162.
García-Moya, L., Kudama, S., Aramburu, M. J., & Berlanga, R. (2013a). Storing and analysing voice of the market data in the corporate data warehouse. Information Systems Frontiers, 15(3), 331–349. doi:10.1007/s10796-012-9400-y
Guille, A., Hacid, H., Favre, C., & Zighed, D. A. (2013). Information Diffusion in Online Social Networks: A Survey. SIGMOD Record, 42(2), 17–28. doi:10.1145/2503792.2503797
Heath, T., & Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space (1st ed.). San Rafael, CA: Morgan & Claypool.
Liu, B. (2012). Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
Lu, Y., Castellanos, M., Dayal, U., & Zhai, C. X. (2011). Automatic construction of a context-aware sentiment lexicon: An optimization approach. In Proc. of WWW 2011, 347–356. doi:10.1145/1963405.1963456
Mena, E., Illarramendi, A., Kashyap, V., & Sheth, A. P. (2000). OBSERVER: An Approach for Query Processing in Global Information Systems Based on Interoperation Across Pre-Existing Ontologies. Distributed and Parallel Databases, 8(2), 223–271. doi:10.1023/A:1008741824956
Mendes, P., Jakob, M., García-Silva, A., & Bizer, C. (2011). DBpedia Spotlight: Shedding light on the web of documents. In Proc. of the 7th International Conference on Semantic Systems, 1–8. ACM. doi:10.1145/2063518.2063519
Navigli, R., & Ponzetto, S. P. (2010). BabelNet: Building a very large multilingual semantic network. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics, 216–225. ACL.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, 355–368. Springer Netherlands. doi:10.1007/978-94-011-5014-9_12
Pérez, J. M., Berlanga, R., Aramburu, M. J., & Pedersen, T. B. (2008a). Towards a data warehouse contextualized with web opinions. In Proc. of the 2008 IEEE International Conference on e-Business Engineering, 697–702. IEEE Press. doi:10.1109/ICEBE.2008.43
Pérez, J. M., Berlanga, R., Aramburu, M. J., & Pedersen, T. B. (2008b). Contextualizing data warehouses with documents. Decision Support Systems, 45(1), 77–94. doi:10.1016/j.dss.2006.12.005
Pérez, J. M., Berlanga, R., Aramburu, M. J., & Pedersen, T. B. (2008c). Integrating data warehouses with web data: A survey. IEEE Transactions on Knowledge and Data Engineering, 20(7), 940–955. doi:10.1109/TKDE.2007.190746
Polanyi, L., & Zaenen, A. (2006). Contextual valence shifters. In Computing Attitude and Affect in Text: Theory and Applications. The Information Retrieval Series, 20, 1–10. doi:10.1007/1-4020-4102-0_1
Reidenbach, R. E. (2009). Listening to the Voice of the Market: How to Increase Market Share and Satisfy Current Customers. CRC Press.
Sridhar, R., Ravindra, P., & Anyanwu, K. (2009). RAPID: Enabling Scalable Ad-Hoc Analytics on the Semantic Web. In Proc. of the 8th International Semantic Web Conference, 715–730. doi:10.1007/978-3-642-04930-9_45
Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the Association for Information Science and Technology (JASIST), 61(12), 2544–2558. doi:10.1002/asi.21416
Thomsen, C., & Pedersen, T. B. (2009). Pygrametl: A powerful programming framework for extract-transform-load programmers. In Proc. of the ACM 12th International Workshop on Data Warehousing and OLAP (DOLAP '09), 49–56. ACM. doi:10.1145/1651291.1651301
Westerski, A., & Iglesias, C. A. (2011). Exploiting Structured Linked Data in Enterprise Knowledge Management Systems: An Idea Management Case Study. In Proc. of the Enterprise Distributed Object Computing Conference Workshops, 395–403.
Yu, J., Zha, Z. J., Wang, M., Wang, K., & Chua, T. S. (2011). Domain-assisted product aspect hierarchy generation: Towards hierarchical organization of unstructured consumer reviews. In Proc. of the Conference on Empirical Methods in Natural Language Processing, 140–150. ACL.
CHAPTER 84
An Innovation Ecosystem beyond the Triple Helix Model:
The Trentino's Case
Alberto Ferraris
University of Turin, Italy
Stefano Leucci
University of Trento, Italy
Stefano Bresciani
University of Turin, Italy
Fausto Giunchiglia
University of Trento, Italy
ABSTRACT
In the current global scenario of crisis, the relevance and importance of social innovation has become critical. Because of its close link with the local area in which it takes place, social innovation is deeply rooted in the overall system, and thus involves the action of many different actors. The aim of this chapter is to highlight the presence of a new managerial model that is more suitable for promoting social innovation within an ecosystem. This analysis has been developed focusing on a new and innovative framework, the Social Innovation Pyramid, and on the Trentino ecosystem based in the North-East of Italy.
INTRODUCTION
In today's context, characterized by an unexpected social crisis, one of the striking features of our society is the increasing urge to enhance innovation. Enhancing innovation means developing a network of public and private institutions within which the production, diffusion and application of new knowledge and technology take place (Erikson et al., 2002). In this context the concept of social innovation is becoming more and more prominent. In particular, it is a form of innovation that explicitly aims for the social and public good (Harris & Albury, 2009). The OECD (2010) defines social innovation as the phenomenon that seeks new answers to social problems, through new services, new labor market integration processes, new competencies, new jobs, and new forms of participation. Several models have been developed to enhance innovation. At first, the Triple Helix Model was proposed (Etzkowitz & Leydesdorff, 1995), as a joint effort of firms, universities and the public actor for the development of innovation. Afterwards, this model was extended by Carayannis and Campbell (2009; 2010), who propose a fourth and also a fifth dimension from an inter-disciplinary and trans-disciplinary perspective of analysis that relates knowledge, innovation and the environment to each other. In their contributions, the authors moved towards a Quadruple Helix Model (adding the media-based and cultural-based public) and then a Quintuple Helix Model (adding the natural environment).
Another interesting framework has been developed by Giunchiglia (2013): the Social Innovation Pyramid (see Figure 3). This model proposes an innovative way to understand the interrelations between the actors and identifies a new and important actor that is fundamental to enhancing the innovation ecosystem: the innovation catalyst. In this context, intermediaries that can create the necessary links between the involved actors are crucial in order to foster social innovation. As affirmed by NESTA (2007), there is a notable absence of intermediaries able to connect demand and supply and to find the right organizational forms to put innovation into practice. The innovation catalyst meets this need. It plays the unique role of facilitating the interaction and collaboration between all the actors, combining their objectives, while at the same time protecting the whole ecosystem.
Figure 1. Triple Helix model
Source: adapted from Etzkowitz (2010)
In the Italian context, an excellent example is Trento Rise. It is based in the Trentino region, an area that in recent years has become one of the best examples of a virtuous and innovative ecosystem, a centre of excellence in Italy and Europe, in particular in ICT technologies. TrentoRise is a core partner of the European Institute of Innovation and Technology (EIT) ICT Labs (the European equivalent of MIT in the US) and its mission is to act as an innovation catalyst between Research, Education and Business, actively fostering social innovation through ICT. It is a fully operational institution merging the ICT branch of the largest research institution in Trento - Fondazione Bruno Kessler (with about 380 researchers) - with the Department of Information Engineering and Computer Science (DISI) of the University of Trento, across a wide spectrum of scientific areas and human sciences. Partly through the ability of Trento Rise, the city of Trento has become one of the most innovative cities in Italy. In October 2013 Trento was awarded first place in the ICityRate, the ranking of smart cities developed by the Italian FORUM PA, the institute that evaluates Italian smart cities. It is therefore on track to maintain this leadership through a virtuous alliance with local enterprises, research institutes and institutions, even managing the difficult task of engaging citizens in the testing of smart solutions that could improve the quality of life.
Starting from these premises, the purposes of this chapter are: a) to highlight a new framework in which it is easier to develop innovation and social innovation, in order to understand the main actors involved in the innovation process, the relationships between them, and the key success factors of this model; b) to provide a deep analysis of TrentoRise from the Social Innovation Pyramid perspective; and c) to highlight new evidence on the relevance of the Open and Big Data projects of the Autonomous Province of Trento within this framework. These two projects are concrete and successful examples that show the relationships between the actors involved in the Trentino innovation ecosystem. To do this, we will investigate in depth the Trentino ecosystem and the role of TrentoRise as a catalyst of social innovation. A deeper knowledge of the topic will allow us to understand how to develop scalability strategies and to replicate the model in other areas, helping regional policy makers to develop similar models according to the characteristics of their districts.
SOCIAL INNOVATION
Among the first definitions of social innovation, it is worth noting the one coined in 2000 by the Local Economic and Employment Development (LEED) Committee of the OECD, in the framework of its Forum on Social Innovation (FSI). This was a multi-stakeholder forum, created with the main objective of facilitating the international dissemination and transfer of best policies and practices in social innovation. The definition they came up with focuses on the concepts of change - organizational change and changes in financing - and of relationship, with stakeholders and territories. Basically, social innovation aims at finding new answers to social problems. This can happen mostly in two ways: by identifying and delivering new services that improve the quality of life of individuals and communities, and by identifying and implementing new labor market integration processes, new competencies, new jobs, and new forms of participation, which help improve the position of individuals in the workforce (OECD, 2010).
The need for social innovation arises from many social challenges that are resistant to conventional approaches to solving them. Social innovation means new responses to those needs and challenges, not only through its outcomes, but also through the processes it implements.
The definitions evidence a strong link between social innovation and local development, as social innovation is a way to improve the welfare of individuals and communities, and explicit reference is made to new relationships with territories (OECD, 2010).
Many contributions on the topic have focused on the boundaries and overlaps between social innovation and some related concepts. For example, the term is often used interchangeably with “social entrepreneurship”. Even though the terms have much in common, social entrepreneurship refers to an entrepreneurial activity run for the achievement of a social mission. The emphasis on profitability is what differentiates it from social innovation. Social innovation does not necessarily involve a commercial interest or an entrepreneurial activity, though it does not preclude such interest (Westley & Antadze, 2010). Its interest is wider, as it transcends sectors, levels of analysis, and methods to discover the processes that produce lasting impact (Phills et al., 2008). It is correct to say that social innovation aims at modifying the overall system in which social entrepreneurship can take form, creating the right framework and strategy in which it can develop and operate.
The definition of Phills et al. (2008) is also worth exploring; they define social innovation as “a novel solution to a social problem that is more effective, efficient and sustainable than existing solutions and for which the value created accrues primarily to society as a whole rather than private individuals”.
Moreover, Mulgan (2006) defined social innovation as: “innovations that are social both in their ends and in their means. In other words: it
covers new ideas (products, services, and models) that simultaneously meet socially recognized social needs (more effectively than alternatives)
and create new social relationships or collaborations, that are both good for society and enhance society’s capacity to act”.
The differences between business and social innovation have also been investigated in depth. The main difference lies in the fact that business innovation aims at introducing new types of production or exploiting new markets in themselves, while social innovation is completely driven by the goals of public good. However, it has to be noted that this view is not shared by all scholars. As stated by Pol and Ville (2009), it could be argued that business innovations also generate benefits not only for the innovator, but also for other parties such as consumers and competitors, through a process that they called innovation spillover. From this perspective the concept of social innovation adds nothing to what we already know about innovation in itself and is too vague ever to be useful. The key aspect that the authors underline is the way in which social innovations benefit human beings. The idea implied by this concept is that social innovation has the potential to improve either the quality or the quantity of life (Pol & Ville, 2009).
Finally, Battisti (2014) shows that the management of social innovation differs from the management of other kinds of innovation, in particular service innovation. On the one hand, service innovation management usually accounts for most of the economic aspects of target markets (a specific group of users of the innovation). In addition, a target market can be considered as a group of people who focus on the use of the innovation, with specific goals, to address their own social needs. Within this field of research, Battisti (2014) stated that the current models in the literature appear to have gone through extensive empirical testing and theoretical development since the seminal contributions of Chesbrough (2003) to the fields of open innovation and service innovation.
On the other hand, social innovation management requires an active and intensive collaboration between users and organisations, in order to address the economic and social aspects of target market needs. Moreover, it is important to understand that the different roles that users assume within the innovation process are crucial for organisations. From this perspective, Battisti (2014) stated that “social innovation requires at least the role of users categorised as citizens, relevant social groups of people, or a crowd of users, which are fundamental in the entire social innovation process”.
HOW TO DEVELOP AND FOSTER SOCIAL INNOVATION IN AN ICT ECOSYSTEM
Due to the close link between SI and the system in which it develops, several actors are involved in SI processes. In the SI scaling-up process it is necessary to identify the roles of the actors involved and the dynamics that affect the relationship between the supply of and demand for SI. Westley and Antadze (2010) first identified the vulnerable group that demands social innovation for its breakthrough. In response to this demand, socio-entrepreneurial organizations strive to attenuate their needs. On the other hand, this supply can be financed by governments or charitable foundations, because the source of financing cannot come from the users themselves. The success of grant proposals depends not only on the evident needs of the vulnerable client group, but also on the skills of the grant writers in mediating such needs so as to fit the priorities of government programs (Ferraris and Grieco, 2015). Thus, innovation processes are interactive (Bresciani & Ferraris, 2012; Ferraris, 2013). They can be better studied by specifying the actors and the linkages between them (Cook et al., 1997; Bresciani et al., 2013; Ferraris, 2014). This study can be done using models developed to explain how innovation emerges from the interaction of different parties. In this
sense Etzkowitz and Leydesdorff (1995) developed the Triple Helix Model (see Figure 1), an important and famous landmark within this field of
study. It has been advocated as a useful method for fostering entrepreneurship and growth, analyzing the dynamics existing between three
helices: state, academia, and industry. In accordance with the OECD classification of sectors the state represents the government sector,
academia the higher education sector, and industry the business enterprise sector. Ranga and Etzkowitz (2013) stated that in a knowledge society the Triple Helix thesis needs to rethink the role of the university, giving more importance to this actor and to the hybridization of elements from university, industry and government to generate new institutional and social formats for the production, transfer and application of knowledge. The relationship between the three actors spans networks that enable and constrain the flow of communication (Ranga & Etzkowitz, 2013). Within this model, all the actors should generate a knowledge infrastructure in terms of overlapping institutional spheres, with each taking the role of the other and with hybrid organizations emerging at the interfaces (Etzkowitz & Leydesdorff, 2000).
Etzkowitz and Leydesdorff (2000) are inclined to talk about trilateral networks and the hybrid organizations that arise where the helices overlap. In this context, Dzisah and Etzkowitz (2008) explain the Triple Helix concept of circulation among university, industry and government, which becomes fundamental in order to enhance knowledge development. They stated that, in addition to decentralization and devolution of decision making, underdevelopment can be overcome by enhancing the circulation of persons, ideas and innovations. Naturally, this circulation needs to be adapted to different cultural and national contexts. This could be done through a three-step process. First, a neutral environment in which all the actors are brought together has to be built, in order to have a free and frank discussion of the strengths, weaknesses and blockages of the triple helix actors and partners. Second, opportunities, limitations and barriers to overcome need to be identified, perhaps through a commissioned study. Third, an action plan that may adapt organizational models or invent new ones has to be formulated.
Summarizing, Triple Helix circulation is an alternative model of development based upon the notion of society as a series of interpenetrating rather than separate institutional spheres (Dzisah & Etzkowitz, 2008). A circulation strategy enhances the opportunities of the Triple Helix for rapid socio-economic development in the transition towards a knowledge-based society. The critical elements of Triple Helix circulation are persons, ideas and innovations (Figure 2), together with such sub-elements as dual life, alternation, innovation networks, etc.
Source: adapted from Dzisah and Etzkowitz (2008)
Regarding the first element, it reflects a sort of revolving door that allows the introduction of viable ideas from one sphere to another through the flow of people (Dzisah & Etzkowitz, 2008). This may lead to collaborative projects and to cross-institutional initiatives. The circulation of persons may involve the movement of people from one sphere to another, for example university professors moving to industry as high-tech firm entrepreneurs.
Regarding the second element, it reflects collaboration premised on information and communication through networks at various levels of research, knowledge production, dissemination and utilization activities. This innovation network aids the communication and dissemination of government policies and funding resources, of cutting-edge research results from universities and their implications for new technologies and industries, of collaboration needs from industry, and of the support of innovative regions (Dzisah & Etzkowitz, 2008).
Regarding the third element, it reflects the instantiation and dissemination of various results to potential users and innovators, to be put into practice on a much larger scale to assist knowledge-based development. The production, dissemination and use include forward and reverse linear elements, creating an interacting environment - a ‘seamless web’ - among the triple helix actors. In this regard, ‘reciprocity’ among actors and ‘equality of contribution’ to innovation are crucial factors in enhancing the system in a reflexive manner (Dzisah & Etzkowitz, 2008). As such, a gap in translating ideas into innovations might appear if there is a negative imbalance in contributions among the triple helix actors. Conversely, a positive imbalance might stimulate other actors to increase their efforts, thereby enhancing their institutional sphere.
In recent years Carayannis and Campbell (2009) have further widened this model with the addition of elements intended to better complete the framework from which innovation can emerge. They added the element of the public as a fourth helix, more precisely identified as the “media-based and culture-based public”. The authors justify the introduction of this helix by explaining how culture and values, and the way in which reality is constructed and communicated by the media, highly influence every national innovation system (Ferraris & Grieco, 2015). Public discourses transported through and interpreted by the media are crucial for a society to assign top priority to innovation and knowledge.
Afterwards, the same authors kept enriching the model, adding a fifth helix that links to the established model the role of the “natural environment or natural environments of society” (Carayannis & Campbell, 2010). With this configuration, the renewed Quintuple Helix model becomes an analytical framework for sustainable development and social ecology, and outlines what sustainable development might mean and imply for eco-innovation and eco-entrepreneurship in the current scenario.
A NEW MODEL TO DEVELOP SOCIAL INNOVATION: THE SOCIAL INNOVATION PYRAMID
In an ecosystem that wants to develop social innovation, another important actor is represented by the citizens, or the whole society in a wider perspective. In the model of Giunchiglia (2013), in fact, the citizens are at the top of the social innovation pyramid. The citizens are the first actor and their final purpose is improving their quality of life. In this new perspective, citizens are given services and products by firms that create service and product innovation in a B2C business model.
At the same time, the public actor, as a buyer and main user of new products or services, is at the top of the pyramid and has an important role. In this way the public administration or actor (PA) makes creation, bootstrapping and evolution easier. It also facilitates the sustainability of the environment in the long term, because it is itself the main user and creator, laying the foundations for a future increase in private participation (Giunchiglia, 2013).
At the bottom of the pyramid (see Figure 3) there are: a) firms, which provide technological innovation, either as services or as products to other firms in a B2B business model; b) the research system, which provides know-how and skills to the firms already mentioned; c) the training and higher training system, which provides new personnel and transfers knowledge on a large scale, either to firms or to the research system; and d) the public actor, which plays the role of financer.
Source: adapted from Giunchiglia (2013)
In addition, there are also some indirect stakeholders, which are secondary but no less important; they are indeed necessary for creating the ecosystem (Giunchiglia, 2013).
The main examples are: a) the political system, which must guarantee the correct working of the Social Innovation Pyramid mechanism in order to ensure its standard use. The political system also has the function of launching all the innovation, organizational, territorial and legislative processes. It is important, although not easy, that the figures of the financer and the PA coincide; b) the social partner system, which should support the process while restricting any negative side effects; and c) private financers, venture capital included, which can accelerate the process with further financing for research and innovation.
Giunchiglia (2013) stated that the direct and indirect stakeholders of innovation are not sufficient to create an innovation ecosystem because of their diversity. This diversity prevents them from collaborating with ease, in particular along three important dimensions: a) roles and responsibilities; b) objectives; c) time.
Regarding the main actors, from the first perspective, research has a key role in producing new knowledge and new researchers. Training institutions need to transfer skills, while service companies (such as those that provide energy, mobility, connectivity, etc.) and the PA (public administration) must provide services. Finally, companies must provide new technology systems.
From the second perspective, researchers have as their objective the expansion of human knowledge, educators aim to train students to a high standard, companies to generate profit, and the PA to offer the best services at the lowest possible cost.
From the third perspective, it is clear that at least three years are needed to build a new skill in research, and even that is a short period. A single year is often the minimum unit for measuring training results, while for companies the basic unit of time is the month, because of the monthly cadence with which salaries have to be paid.
This diversity of roles, responsibilities and time frames in fact makes collaboration between the actors of the innovation ecosystem difficult, almost impossible (Giunchiglia, 2013). The solution to this apparent paradox is the creation of a convergent interaction among the different actors. A new actor is needed, aimed at producing concrete results such as, for example, the safeguarding of the whole ecosystem, preserving both the specificity of each component and their diversity.
Today, it is evident that social innovation requires a wide range of actors. What emerges, however, is the lack of a certain kind of intermediary that in some way acts as a link between all the different actors involved. In this sense, the image proposed by NESTA (2007) is useful. Social innovation can be seen as the result of a combination of “bees” and “trees”. The first are small organizations, individuals and groups who have new ideas; the second are big organizations, such as governments or companies, which are generally poor at creativity but good at implementation and which have the resilience, roots and scale to make things happen. The problem in this picture is how to connect bees and trees. There is a notable absence of intermediaries able to connect demand and supply and to find the right organizational forms to put innovation into practice (NESTA, 2007). This need is also highlighted by the OECD (2010), which, in developing policy recommendations, expresses the need for incubators and intermediaries, as their absence in the social field is seen as a key reason why too few innovations succeed. Summarizing, a component is needed that promotes and accelerates the process of creating innovation (Ferraris & Grieco, 2015).
This is because innovation is something really difficult to capture and appreciate in all its complexity, and it is unpredictable because it is linked to creativity. Being innovative means having the ability to analyze the shortcomings of the present but, above all, to imagine the challenges of the future (Heunks, 1998). Moreover, as an innovation process cannot be engineered, designed from the top, or planned at the table, it is only possible to increase the probability that innovation happens (Krause, 2004), and the probability of generating innovation tends to increase in societies where there is a greater “inclination to innovation” (Giunchiglia, 2013). A culture open to innovation is necessary because it helps to see change as an opportunity rather than a threat (Chesbrough, 2003).
In this context, it is necessary to develop an innovation ecosystem, such as the famous examples in Silicon Valley and in Sweden. This allows a permanent, stable and self-generating process of innovation that enhances the possibility of generating new ideas (Adner & Kapoor, 2010). In fact, the core concepts of open innovation are the circulation and the opening of ideas, knowledge and projects (Chesbrough et al., 2006).
These innovation ecosystems contain the main stakeholders identified in the Social Innovation Pyramid. In Silicon Valley, for example, there are major companies such as Intel, top universities like Berkeley, Caltech and UCLA, and of course the public actor. In Sweden, too, an innovation ecosystem has been developed over the years thanks to a huge amount of investment made by the public actor. What is similar in both cases is the presence of a particular actor, what we call the “innovation catalyst”, an actor that plays a crucial role in the birth and development of these regions: the Defense Advanced Research Projects Agency (DARPA) and VINNOVA. These public actors invest large amounts of money every year, mainly in promoting collaboration between businesses, universities, research centers and the public sector, encouraging greater use of research, making long-term investments and creating catalytic meeting places.
Like DARPA and VINNOVA, a catalyst of innovation must be a streamlined, non-hierarchical, agile structure (Ferraris & Grieco, 2015). It should not be afraid of risk; it should be guided by ideas and be results-oriented (Giunchiglia, 2013). It must have a strong link with the territory in which it operates, but at the same time be open to the world, because change is global. In short, it must act from a glocal perspective, which is the basis of the creation and functioning of successful innovation ecosystems. This is necessary in order to achieve the right flexibility to manage and anticipate change.
Moreover, in Europe, catalysts of social innovation should work following general operating principles, in particular using public-private partnerships. The reasons are manifold. For example, in the Euro zone, where the public sector is much more developed than in other continents and has a great capacity for financing, this tool could be better exploited in order to achieve a competitive advantage. This applies not just to the public actor as a funding body (where many states outnumber us, first of all the United States and Korea), but rather to the public actor as public administration, capitalizing on the fact that the citizen is the first and immediate beneficiary of social innovation (Giunchiglia, 2013). Finally, it is crucial that the various processes of collaboration be enabled via incentives. As it makes no sense to impose innovation from the top, one cannot impose the project activities aimed at increasing the probability of generating innovation either. Only those who see in the initiative a chance of return, measured according to their own value chain, will tend to participate. This is also the way to ensure medium to long term sustainability, even after the end of the project (Ferraris & Grieco, 2015).
TRENTINO ECOSYSTEM AND TRENTO RISE
The Trentino ecosystem is an Italian example of an innovative ecosystem that has grown in recent years into a technological ICT cluster, a landmark in the Italian and European ICT framework. Trentino is, along with South Tyrol, one of the two provinces which make up the region of Trentino-Alto Adige/Südtirol, which is designated an autonomous region under the Italian Constitution. The province covers an area of more than 20,000 km2, with a total population of about 0.5 million. But this “small” territory has “big” numbers, above the EU average: 2.19% of GDP invested in R&D activities (Italy: 1.27, EU27: 2.02) (Istat, 2013); 6.1 people employed in R&D per 1,000 inhabitants (Italy: 3.8, EU27: 5.1) (Istat, 2013); 1 university, 12 public research centers and 6 industrial research centers.
In this context, all the actors interact according to the Social Innovation Pyramid (Giunchiglia, 2013). In particular, research is a key factor because it permanently generates new skills and new ideas that can enable innovation. Research aimed at enabling both technological and social innovation has to be multidisciplinary (thus including economics, social sciences, law, neuroscience, etc.) and interdisciplinary, in order to better meet the new challenges posed by a society changing in depth (Carayannis & Campbell, 2009; 2010). From this perspective, if research produces new knowledge, training in its various forms (education, higher education, lifelong learning) is the way to transfer the new knowledge to society in all its components.
Following the Social Innovation Pyramid (Giunchiglia, 2013), the actors involved in this ecosystem are: the University of Trento (in particular the ICT branch); the Bruno Kessler and Edmund Mach Foundations as public research centers; Telecom and Microsoft as private research centers; the Autonomous Province of Trento (APT); indigenous firms; and the citizens.
In this context, TrentoRise is the innovation catalyst of the Trentino ecosystem (Giunchiglia, 2013). It is a fully operational institution merging the ICT branch of the largest research institution in Trento - Fondazione Bruno Kessler - with the Department of Information Engineering and Computer Science (DISI) of the University of Trento, across a wide spectrum of scientific areas and human sciences. Its main goal is to play the unique role of combining all the actors' objectives, facilitating the interaction and collaboration between them, and protecting the whole ecosystem (Ferraris & Grieco, 2015). It develops numerous relationships with the territory but also with Europe, operating in a global perspective. In fact, it is a core partner of the European Institute of Innovation and Technology (EIT) ICT Labs (the European answer to MIT) and part of EIT ICT Labs Italy.
Its mission is “to act as an innovation catalyst between Research, Education and Business actively fostering social innovation through ICT”, and this highlights how this actor is an effective and ideal instrument for the integration of the education, research and business dimensions.
In the coming years, TrentoRise aims to become one of the leading hubs in the ICT sector in Europe and to drive the internationalization and innovation of Trentino. Its main activities are: a) to promote business development through innovation projects that meet societal needs; b) to promote scientific research that creates added value for people, the market and society at large; c) to promote new business creation, fostering highly innovative startups in the ICT sector; and d) to attract highly motivated students by launching initiatives in the field of higher education, offering not only academic but also entrepreneurial education.
The focus areas are: a) energy and environment, b) health and wellbeing, and c) tourism and culture. These have been chosen with regard to their impact on the Trentino area, in conjunction with the Autonomous Province of Trento (APT), following the medium/long-term strategy of social innovation as defined by the APT.
The enabling infrastructure (systems and technologies) covers two areas: open and big data (the provision of large amounts of data by the public actor, and not only, and their use for enabling personalized services), and smart services (for example the creation of intelligent services for the citizen, also enabled by the data made available by the open and big data projects). These aspects will be detailed in the next section (see the last paragraph for a deeper analysis of these topics).
The third dimension covers the three core competencies of TrentoRise: Education, Business and Research. From the Education perspective, the University of Trento has proved to meet the best Italian standards, regularly scores as a top university in Italy, and is able to obtain important research projects funded by the European Commission. It is one of the most internationally oriented Italian universities, seamlessly connected with more than 100 EIT partners, and it is strongly focused on entrepreneurship and innovation. English is the official language, and more than 200 Ph.D. students - 70% of them foreigners - work in the area. Regarding this perspective, Trento Rise promotes the International Master School, the Doctorate Training Center, Summer and Winter Schools, PhD scholarships and professional training.
From the Research perspective, more than 800 top-class researchers work in the Trentino ecosystem and have a strong network with more than 40 EIT-linked universities and research centers. In addition, large enterprises such as Telecom, IBM, Nokia and Siemens are involved in research programs. Regarding this perspective, Trento Rise invests in new talent coming from the best international schools and in areas considered strategic but currently under-represented or absent in Trentino, for the creation of social, business, service and technology innovation. Moreover, it develops research projects with the participation of local enterprises, and international research projects in order to obtain funds coming from the EU.
The definition of the instruments used by TrentoRise essentially follows the guidelines defined by the membership of TrentoRise in EIT ICT Labs, the most important of which is Pre-Commercial Procurement (PCP). In recent years, the European Commission has been concentrating more and more attention and interest on PCP issues and investing considerable resources to encourage the use of PCP in Europe, developing a policy framework and directly supporting several surveys, programs, projects and awareness-building and dissemination events. PCP is a process empowering public authorities to buy the technologically innovative solutions that fit their needs. Public procurers act as first buyers who share with suppliers the benefits and risks of pulling technology from early-stage research to pre-commercial products. It focuses on domains where no commercial solutions yet exist on the market (European Commission, 2008).
First-buyer involvement in the early phases of industry R&D delivers better products at lower costs. Thus, PCP dramatically reduces the risks and the cost of failure at the deployment stage, both for procurers and for suppliers. Healthy competition is guaranteed by having several suppliers compete in developing solutions at the pre-commercial stage.
Other tools that TrentoRise is using to stimulate social innovation are: a) Trentino as a Lab (TasLab), which enables the area to test solutions before going to market and produces advantages both for companies and for the territory; b) attracting enterprises to co-location centers in order to develop R&D programs and to create synergies with the research and education areas; and c) co-financing and IPR sharing.
For example, TasLab is the regional network of innovation in ICT (Information & Communication Technologies) that promotes innovation in the services of the Public Administration. The initiative is coordinated by Informatica Trentina SpA, on the recommendation of the Autonomous Province of Trento. TasLab aims at developing a network of territorial innovation involving research centers, enterprises and the public administration of the Trentino region, in order to nurture opportunities for discussion among the regional stakeholders, aimed at collaboration towards service innovation in the PA (ideas, project proposals, partnerships, projects, new innovation initiatives, etc.).
Moreover, TrentoRise promotes numerous other initiatives in society for the growth of collective awareness (Ferraris & Grieco, 2015), such as:
• ICT Days: Annual event for the sharing and the development of awareness by the population and the major stakeholders of innovation,
about the process of social change and its proactive management;
• Territorial Seminars: Decentralized intermediate events for the growth of awareness by the population, about the process of social
change and its proactive management;
• Social Innovation Laboratories (SIL): Work roundtables with the main stakeholder groups for the proactive management of social
change, and
• TEDx: An international event of global significance for raising awareness of the most innovative ideas developed nationally and internationally on the reference topics of “quality of life” and “social innovation”.
Another successful example is TechPeaks, which offers six months of free housing, food and an office in Trento to individuals or teams with “deep technical or design” expertise. Applicants do not have to belong to a team or even have a specific idea, just a passion for business and information technology. The program is the result of a partnership between seven tech universities and international accelerators. There are grants of €25,000 and possible match-funding for private investments of up to €200,000. A total of €13 million in funding is reportedly available for four years (Wall Street Journal, 2013).
Participants follow a dedicated track according to the category they belong to, supported by coaching and mentoring services provided by highly qualified entrepreneurs, investors and professionals of national and international standing, and have access to more than five hundred (500) ICT researchers active in the Trentino area. Coaching and mentorship sessions cover strategic, business and technological matters. The organisation and content of these activities evolve in accordance with the development of participants’ business ideas. At the end of ten weeks, participants have to submit three to five analyses of innovative ideas in different areas of business and technology, with a preference for the following areas: (i) Data collection and analysis (big, open, linked, audio, video), (ii) Financial technology and advanced payment methods, (iii) Health and wellbeing, (iv) Internet of Things and wearable computing, and (v) Tourism, sports, food and wine.
TRENTORISE FLAGSHIP PROJECTS: OPEN AND BIG DATA PROJECT IN TRENTINO, SMART CAMPUS, AND SMART
CROWD
One of the most important aspects of innovation in Trentino is the re-use of open data. In the definition provided by the Open Knowledge Foundation, “open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and share-alike”. Open data can be provided mainly by three kinds of actor: governments, like the Autonomous Province of Trento; corporations, like Enel S.p.a.; and communities, like OpenStreetMap.
The public sector collects, produces, reproduces and disseminates a wide range of information in many areas of activity, such as social, economic,
geographical, weather, tourist, business, patent and educational information. Wider possibilities of re-using public sector information should
allow European companies to exploit its potential and contribute to economic growth and job creation.
The European Union has endorsed the process of opening up public sector information since 1998, with the publication of the Green Paper on Public Sector Information (European Commission, 1998). The movement then became a shared policy with Directive 2003/98/EC (the Public Sector Information Directive), later amended by Directive 2013/37/EU. In the specific domain of geo-data, Europe defined a shared data structure with Directive 2007/2/EC (the INSPIRE Directive).
Nowadays, data are becoming the means through which new kinds of innovative services can be provided by governments and corporations (Manyika et al., 2013). Open data is accessible public data “that people, companies, and organizations can use to launch new ventures, analyze patterns and trends, make data-driven decisions, and solve complex problems” (Gurin, 2014). Opening up data can also increase public transparency, generate insights into how to improve government performance (Ubaldi, 2013) and help fight corruption (Grönlund, 2010). Recently,
the G8 Global Summit approved the “Open Data Charter” (G8, 2013) where they “recognise [that] the benefit of open data can and should be
enjoyed by citizens of all nations”.
Releasing open data is a matter of increasing societal efficiency. Sharing data “has the potential to unlock large amounts of economic value, by improving the efficiency and effectiveness of existing processes; making possible new products, services, and markets; and creating value for individual consumers and citizens” (McKinsey, 2013). Agencies all over the world have published reports quantifying the potential economic value that could be extracted from these data: Pira International (2000) estimated the value at around €68 billion annually; the MEPSIR study (EU, 2006) put the overall market for public sector information in the European Union at €26.1 billion; Vickery (EU Commission, 2011) estimated the current total direct and indirect economic value of public sector information at €140 billion per year for the EU27; and McKinsey (2013) estimated that the potential value is shared between the United States ($1.1 trillion), Europe ($900 billion) and the rest of the world ($1.7 trillion).
In line with the ideas, policies and legal instruments described above, four main flagship projects are now active in Trentino thanks to the coordination of TrentoRise: the Open Data Project in Trentino, the Big Data Project, Smart Campus and Smart Crowd. These projects are carried out in collaboration with public and private partners to make Trentino an intelligent and competitive territory with high potential and an excellent quality of life. Their aim is to create open source platforms that manage data and services for small and medium enterprises, as well as for citizens in general, for the development of innovative solutions.
At the time of writing, the most mature project is the Open Data Project in Trentino. Governments around the world have started releasing large quantities of datasets (Ubaldi, 2013). Along these lines, the Autonomous Province of Trento, TrentoRise, other business actors (Informatica Trentina S.p.a., SpazioDati S.r.l.) and research institutions (Università di Trento and Fondazione Bruno Kessler) endorse the “Open Data Project in Trentino”. It aims at publishing the data held by all the departments of the Province in order to generate accountability and transparency and to foster economic growth, as expressed in the official guidelines for the reuse of public data. The project is led by the Department of Innovation of the Province, and the project team is composed of technical, legal and socio-economic experts covering the different aspects involved.
The process started by adapting and improving existing European good practices, the state of the art, for the local administration context. It has involved all the local public authorities from the beginning, by asking every provincial department to open at least one dataset per month. A data catalogue (http://www.dati.trentino.it) has been released and is fed daily with new datasets coming from the provincial departments. At the same time, the team has focused on the creation of “Data as a Culture” through educational actions both inside and outside the authorities involved. In this context, a “School of Data” has been organized by Fondazione Bruno Kessler and the Open Knowledge Foundation with the purpose of disseminating tools and best practices for the re-use of data.
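As a small illustration of how such a catalogue can be consumed programmatically, the sketch below lists dataset identifiers through a CKAN-style API of the kind that many open data portals, reportedly including dati.trentino.it, expose; the endpoint path and its availability are assumptions to be verified against the portal's own documentation, and Python with the requests library is used purely for illustration.

# Minimal sketch: list datasets from an open data catalogue that exposes a
# CKAN-style API. The endpoint path and its availability are assumptions.
import requests

CATALOGUE = "http://www.dati.trentino.it"

def list_datasets(limit=10):
    """Return up to `limit` dataset identifiers from the catalogue."""
    resp = requests.get(CATALOGUE + "/api/3/action/package_list", timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    if not payload.get("success"):
        raise RuntimeError("catalogue API reported failure")
    return payload["result"][:limit]

if __name__ == "__main__":
    for name in list_datasets():
        print(name)

Listing identifiers in this way is typically the first step before downloading individual resources for re-use in the kinds of services described in this chapter.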
As of April 2014, the open data catalogue contained 650 datasets from about 60 provincial departments. The most important aspect relates to the open licenses used for sharing data (Ricolfi et al., 2011). Securing personal data is also a crucial concern (Van Der Sloot, 2011). The open data paradigm is very close to that of big data (see Figure 4), but they are not the same: “open data brings a perspective that can make big data more useful, more democratic, and less threatening” (Gurin, 2014).
Figure 4. Source: Gurin, 2014
The definition of big data is based on three main characteristics: volume, variety, and velocity (Zikopoulos et al., 2012). Volume relates to the size of the data, which can reach terabytes (TB), petabytes (PB) or even zettabytes (ZB). Variety refers to the types of data; in addition, big data is produced by different sources such as sensors, devices, social networks, the web and mobile phones (Zaslavsky, 2012). Velocity relates to the frequency at which data are generated. The Big Data Project aims at developing a platform based on state-of-the-art technologies and techniques for delivering advanced services to a wide range of users and applications. The platform will consolidate all the knowledge currently produced in Trentino by public and private bodies, promoting services, research and development and enabling a better quality of life for citizens.
The Venn diagram illustrates the overlaps between open data and big data provided by different actors. The direction of the Big Data Project is also related to the use of various big data techniques to increase the re-use of open data: “when the government turns big data into open data, it’s especially powerful” (Gurin, 2014).
It is clear that “a vibrant ecosystem of developers will be necessary to transform open data into valuable tools” (McKinsey, 2013). Data as a Culture is not a matter for a few people. It requires “buy-in across an organization, which in turn requires educating employees about the power of
data, and empowering them through training” (The Economist Intelligence Unit, 2013).
This is the intuition behind the two projects, Smart Campus and Smart Crowd, which have been endorsed by TrentoRise. They aim at creating an environment of services based on data, empowering people to re-use those data.
First of all, Smart Campus aims at empowering citizens of a smart city with a more active role in designing, developing and delivering the services
they want and like. It uses the campus with its students, researchers, and institutions as a scaled-down, but complete, model of a smart city. It is
only the first step toward the vision of a smart city lab that will cover the whole territory of Trentino.
Smart Campus is both a lab and a community. The lab builds a social and technical environment for collaborative service design and personalized
service delivery. The community is composed of all the students, researchers and staff who use the services and collaborate in their creation.
Finally, the Smart Crowd Territorial Lab consists of a large set of citizens living in the Province of Trento, who participate in R&D and innovation
projects promoted by Trento RISE and its partners. Citizens are profiled according to their socio-demographic background, and other, more
specific information such as personal health profile, expense profile, ICT literacy, etc. Citizens who participate usually own Android smartphones
and are trained to install and test mobile applications provided by Trento RISE and partners. Users are able to participate in user-experience
research activities, as well as participatory design activities. Citizens' participation in the Territorial Lab is gamified, in order to ensure long term
commitment, community building, and trust among the members of the community.
CONCLUSION AND FUTURE DIRECTIONS
This chapter aimed at understanding how social innovation may be better developed in order to fit social needs and promote societal change. The best place to develop innovation is an innovation ecosystem where different actors interact with each other and where ideas and knowledge circulate well both within and outside it. To this end, we proposed a new framework based on the Social Innovation Pyramid (Giunchiglia, 2013), in which it is easier to develop innovation and social innovation, and we highlighted the main actors involved in the innovation process, the relationships between them and the key success factors of this model.
We investigated in depth a virtuous Italian ICT ecosystem based in the Trentino region, in the north-east of Italy, according to the proposed framework. Moreover, we identified a new and insightful need that emerges when the diversity of the actors involved becomes a problem for the efficiency and functioning of the whole ecosystem: the presence of a new actor, the innovation catalyst. We analysed the features that this kind of actor should have and examined the TrentoRise case, the innovation catalyst of the Trentino ecosystem.
Finally, we described some concrete examples in which all the actors involved work together to enable a smart and innovative environment. These arguments are particularly relevant for policy makers, because deeper knowledge of the topic helps in understanding how to develop scalability strategies or to replicate the model in other areas, developing similar models according to the characteristics of each district.
This work was previously published in the Handbook of Research on Entrepreneurial Success and its Impact on Regional Development edited by Luísa Carvalho, pages 631-648, copyright year 2016 by Information Science Reference (an imprint of IGI Global).
REFERENCES
Adner, R., & Kapoor, R. (2010). Value creation in innovation ecosystems: How the structure of technological interdependence affects firm
performance in new technology generations. Strategic Management Journal , 31(3), 306–333. doi:10.1002/smj.821
Autonomous Province of Trento (APT). (2012). Delibera Giunta Provinciale 2858/2012, Retrieved from
http://www.innovazione.provincia.tn.it/binary/pat_innovazione/notizie/Lineeguida_21dicembre_def.1356705195.pdf
Battisti, S. (2014). Managing social innovation: the shaping of information and communication technology in dynamic
environments [Unpublished doctoral dissertation]. Politecnico di Milano, Italy.
Becchetti, L., De Panizza, A., & Oropallo, F. (2007). Role of industrial district externalities in export and value-added performance: Evidence
from the population of Italian firms. Regional Studies, 41(5), 601–621. doi:10.1080/00343400701281691
Bresciani, S., & Ferraris, A. (2012). Imprese multinazionali: innovazione e scelte localizzative . Milano: Maggioli.
Bresciani, S., Vrontis, D., & Thrassou, A. (2013). Change through Innovation in Family Businesses: Evidence from an Italian Sample. World Review of Entrepreneurship, Management and Sustainable Development, 9(2), 195–215.
Brundin, E., Wigren, C., Isaacs, E., Friedrich, C., & Visser, K. (2008). Triple helix networks in a multicultural context: Triggers and barriers for
fostering growth and sustainability. Journal of Developmental Entrepreneurship , 13(1), 77–98. doi:10.1142/S1084946708000867
Carayannis, E. G., & Campbell, D. F. (2009). 'Mode 3' and 'Quadruple Helix': Toward a 21st century fractal innovation ecosystem. International Journal of Technology Management, 46(3), 201–234. doi:10.1504/IJTM.2009.023374
Carayannis, E. G., & Campbell, D. F. (2010). Triple Helix, Quadruple Helix and Quintuple Helix and how do knowledge, innovation and the
environment relate to each other? A proposed framework for a trans-disciplinary analysis of sustainable development and social
ecology. International Journal of Social Ecology and Sustainable Development , 1(1), 41–69. doi:10.4018/jsesd.2010010105
Chesbrough, H. (2003). Open Innovation: the new imperative for creating and profiting from technology . Boston, MA: Harvard Business School
Press.
Chesbrough, H., Vanhaverbeke, W., & West, J. (2006). Open Innovation: Researching a New Paradigm . Oxford, UK: Oxford University Press.
Cooke, P., Gomez Uranga, M., & Etxebarria, G. (1997). Regional innovation systems: Institutional and organisational dimensions. Research Policy, 26(4), 475–491. doi:10.1016/S0048-7333(97)00025-5
Dekkers, M., Polman, F., Te Velde, R., & De Vries, M. (2006).Final Report of Study on Exploitation of public sector information - benchmarking
of EU framework conditions. MEPSIR Measuring European Public Sector Information Resources . European Commission.
Dzisah, J., & Etzkowitz, H. (2008). Triple Helix Circulation: The Heart of Innovation and Development . International Journal of Technology
Management and Sustainable Development , 7(2), 101–115. doi:10.1386/ijtm.7.2.101_1
Erikson, R. S., MacKuen, M. B., & Stimson, J. A. (2002). The macro polity . Cambridge, UK: Cambridge University Press.
Etzkowitz, H., & Leydesdorff, L. (1995). The triple helix university–industry–government relations: A laboratory for knowledge-based economic
development. EASST Review , 14(1), 14–19.
Etzkowitz, H., & Leydesdorff, L. (2000). The dynamics of innovation: From National Systems and “Mode 2” to a Triple Helix of university–
industry–government relations. Research Policy ,29(2), 109–123. doi:10.1016/S0048-7333(99)00055-4
European Commission. (2008). Strategy for ICT research and innovation unit . Information Society and Media.
Ferraris, A. (2014). Rethinking the literature on Multiple Embeddedness and Subsidiary Specific Advantages. Multinational Business Review, 22(1), 15–33.
Ferraris, A., & Grieco, C. (2015). The role of the innovation catalyst in social innovation – an Italian case study. Sinergie Italian Journal of
Management, 33(97), 127-144.
Giunchiglia, F. (2013). Innovazione sociale – La nuova frontiera . Trento: Department of Information Engineering and Computer Science.
Grönlund, A. (2010). Using ICT to combat corruption, SPIDER Center - Swedish Program for ICT in Developing Regions. SPIDER ICT4D, (3), 7-
27.
Gurin, J. (2014). Big data and open data: what's what and why does it matter? Retrieved from http://www.theguardian.com/public-leaders-
network/2014/apr/15/big-data-open-data-transform-government
Harris, M., & Albury, D. (2009). The Innovation Imperative: Why radical innovation is needed to reinvent public services for the recession and beyond (Discussion paper). London: The Lab, Nesta.
Heunks, F. J. (1998). Innovation, creativity and success. Small Business Economics , 10(3), 263–272. doi:10.1023/A:1007968217565
Krause, D. E. (2004). Influence-based leadership as a determinant of the inclination to innovate and of innovation-related behaviors: An
empirical investigation. The Leadership Quarterly , 15(1), 79–102. doi:10.1016/j.leaqua.2003.12.006
Manyika, J., Chui, M., Groves, P., Farrell, D., Van Kuiken, S., & Almasi Doshi, E. (2013). Open data: Unlocking innovation and performance
with liquid information, McKinsey Global Institute. Retrieved from
http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information
Phills, J. A., Deiglmeier, K., & Miller, D. T. (2008). Rediscovering social innovation. Stanford Social Innovation Review , 6(4), 34–43.
Pol, E., & Ville, S. (2009). Social innovation: Buzz word or enduring term? Journal of Socio-Economics , 38(6), 878–885.
doi:10.1016/j.socec.2009.02.011
Ranga, M., & Etzkowitz, H. (2013). Triple Helix systems: An analytical framework for innovation policy and practice in the Knowledge
Society. Industry and Higher Education , 27(4), 237–262. doi:10.5367/ihe.2013.0165
Ricolfi, M., Van Eechoud, M., Morando, F., Tziavos, P., & Ferrao, L. (2011). The “Licensing” of Public Sector Information. Informatica e Diritto, (1-2), 129-146.
The Economist Intelligence Unit. (2013). Fostering a data-driven culture. Retrieved from
http://www.cfoinnovation.com/system/files/fostering%20a%20data%20driving%20culture.pdf
Ubaldi, B. (2013). Open Government Data: Towards Empirical Analysis of Open Government Data Initiatives. OECD Working Papers on Public
Governance, (22), OECD Publishing.
Van Der Sloot, B. (2011). Public Sector Information & Data Protection: A Plea for Personal Privacy Settings for the Re-use of PSI. Informatica e
Diritto, (1-2), 219-238.
Clayton, N. (2013). Edugames, Art Auctions and Data Analytics: The European Funding Round-Up. Wall Street Journal. Retrieved from
http://blogs.wsj.com/tech-europe/2013/04/08/edugames-art-auctions-and-data-analytics-the-european-funding-round-up/
Westley, F., & Antadze, N. (2010). Making a difference: Strategies for scaling social innovation for greater impact. Innovation Journal , 15(2), 1–
19.
Ashok Kumar Wahi
Jaypee Institute of Information Technology, India
Yajulu Medury
Jaypee Group, India
ABSTRACT
Customers are no longer at the receiving end in the new digital economies. They have a say in everything and are co-creating products and services. Their connections with other customers are stronger, and the influence they exert collectively on businesses is phenomenal. All this has been enabled by the technologies of the collaborative internet. Businesses have discarded hierarchies and functional pyramid structures in favor of flat, empowered structures to improve decision responsiveness in the new age. Competency is fast replacing compatibility amongst successful employees. Geography is dead, and interactions take place across boundaries of distance, time, language and culture. This transformation of the business enterprise into Enterprise 2.0 has become possible because Web 2.0 tools have become commonplace, and it has had far-reaching implications. The question it raises is whether all organizations are equally well equipped to take advantage of these changes, or whether the relative power equation amongst them will shift, making some small, forward-looking, technology-savvy organizations suddenly more powerful than the erstwhile successful large giants who built themselves on the strength of their products and markets over time. This paper aims at creating a framework that can help evaluate this emerging equation and assess the state of readiness of organizations to meet this onslaught of business change. The framework addresses these technologies and the way they are impacting business strategy, and spells out what organizations need to do to gear up to face the changing fabric of the new-age enterprise.
INTRODUCTION
The last decade has seen many new technologies impacting business enterprises. Beyond the internet itself, mobile, social networks and cloud technologies have become mainstream and are having real commercial impact; they are no longer a novelty. Big Data is arriving in a big way, helping businesses reach the right customers at the right time and affecting both the top line and the bottom line. However, it requires enterprises to be more nimble and responsive. The resulting changes in people skills, culture and attitude, training, talent acquisition, etc. call for a new strategy, evolved techniques and new business models. Not all organizations can or will respond to this equally effectively. This calls for a discussion of how success will be measured in the next decade and of the state of preparedness of different organizations.
Big Data, on the other hand, promises that if you have the capability to tackle the 3Vs, viz. variety, volume and velocity, of the data typhoon, you can obtain real value from it. It brings its own series of challenges and opportunities. The insights that can be gained can be game changers for many businesses, and combining interactions with transactions can multiply the worth of both many times over. The right combination of structured transactions with unstructured interactions can thus be harnessed to achieve significant impact on the top line and bottom line of the business.
REVIEW OF CURRENT LITERATURE
Web 2.0
Web 2.0 “technologies are used on organizations' intranets and extranets”. Enterprise 2.0 aims to help employees, customers and suppliers collaborate, share, and organize information via Web 2.0 technologies. Enterprise 2.0 refers to “the use of emergent social software platforms within companies, or between companies and their partners or customers” (McAfee, 2006).
The Web 2.0 wave has spilled strongly over corporations' boundaries and created a good deal of discussion about the way businesses have been managed so far and how they will evolve in the future. Under the name of Enterprise 2.0, social media has found its way into many businesses. From many examples we can now see which approaches lead to success, and the mistakes that were made at the beginning do not need to be repeated. Early case studies in the field of Enterprise 2.0 were marked by a bottom-up, unplanned approach. Nowadays there is strong agreement on a more strategic approach for future initiatives, extending the idea towards an enterprise-wide approach with a transformative scope for the organization.
Organizations saw a lot of activity around blogs and wikis at the beginning. Collaboration and the collection of knowledge were the main focus. Some of these activities went very well, but some projects led to great frustration. Connecting people with a Facebook-like social network or using micro-blogging seems to be the approach that works better than other Enterprise 2.0 activities. The activity stream provides employees with information and gives companies the ability to communicate easily.
Social Media
Social media these days has a big influence on people’s personal and professional lives; people all over the world are affected by it. This medium not only helps individuals in social interaction, but also helps companies advertise their brands and establish relationships with customers, where customers are free to speak their minds in a way that is visible to all other customers. This ability of social media to let customers interact with and influence each other has reversed the traditional relationship between the company and its consumers, in which all the power lay with the company. The consumer thought process is gaining visibility, as is peer-to-peer consumer interaction. Social media has altered marketing not only in terms of the scale of influence consumers have over each other, but also in the way consumers interact with, evaluate and finally choose information, and it has put them directly into the driver’s seat. This is what Saul Berman et al. call delivering an end-to-end experience for the consumer, without timing, pricing or brand-experience breakages or conflicts (Saul Berman et al., 2007).
Business Management in Enterprise 2.0
Enterprise 2.0 involves the creation of competitive advantage for an organization through interactive and collaborative business models. Enterprise 2.0 is no longer a vision of the future; it is today’s reality. Companies now need to respond urgently, as most business models are based on customization and customer self-service. As customers have become very comfortable with technology, outsourcing and collaborating with partners has become routine. The workforce consists of people who are internet-savvy, but there still exists a huge gap between the way business models work and business is managed and how IT systems are implemented. Competitive advantage involves collaboration, communication and excellence in management, and can no longer be achieved through command, control or operational excellence (Paul Turner, 1991).
CHANGES IN THE SOCIETY
The contemporary organization is no longer represented by a hierarchical model. Power and command are still used for managing people and for setting strategic direction, but they do not work for solving problems or, for that matter, for optimizing business results. The current generation of professionals uses social networks to complete their tasks and achieve their goals within the organization. They work freely with partners and customers and even consult peers in other organizations, resulting in full-fledged business networks.
The way the current generation is shaping their work is totally transforming the way organizations need to be managed. This generation wants
open access to information and technology. The workers today seek an environment which involves constant learning. They want access to
knowledge to develop their own expertise and learn new skills. But there still is a limit up to which the organizations can give access to employees
and this limit needs to be maintained. The current generation is truly capable of multitasking and can engage with different sources of
information at the same time. This proves beneficial for businesses, as such workers are able to accomplish more in a shorter timeframe. They work in their own personalized and collaborative way, which requires different approaches to motivation and rewards.
Today, professionals demand to participate in the organization's decision-making process. They have a strong voice and a desire to be heard. Therefore, to be successful, the organization needs to develop a participatory culture. Employees need to be valued and treated as primary assets. They need to be involved in corporate decision making, both for the organization to flourish and for them to remain committed to the
organization.
Young professionals desire constant feedback. Web 2.0 enables this, as feedback can be provided easily and almost instantaneously. Organizations should also encourage peer-to-peer feedback; rankings can be based on productivity, time spent, quality of work, etc.
With the change in the professionals working in the organization, the structure, the way the organization is managed and the way it does business are also changing (Francesco A. Calabrese, 2010). New-generation professionals prefer organizations that quickly and easily adapt to any kind of demographic or structural change. Today information drives the organization and the business; it is no longer product- or money-driven. This is the era of knowledge businesses, or information-driven businesses. This information includes customer preferences, which help the organization understand its customers and deliver accordingly, which in turn acts as a differentiator and provides product leadership. Information about individual customers helps the organization customize the product to their needs rather than produce the same product in vast quantities (Saul Berman, 2012). Today only the companies that provide a collaborative learning environment in addition to growth will be able to attract and retain the best and brightest minds: a global talent pool that can work anywhere, at any time.
To achieve competitive advantage, organizations need to be smart and agile, i.e. able to implement changes easily and instantly, and aligned, which involves sharing their best practices with others. Not only has the volume of information available grown exponentially, the techniques for accessing it have also evolved over the last four decades. Mobile computing devices have made anywhere-anytime access truly feasible. It is no longer just a matter of obtaining real-time information, which itself was a luxury a few decades ago; it is now expected that we get real-time analytics and business driven by rules that are themselves dynamic. This has been brought out well in the concept of “RTE enterprises” by Malhotra, who argues that “strategic execution of the business models was accelerated with the help of technologies” (Yogesh Malhotra, 2005).
ROLE OF ENTERPRISE INFORMATION SYSTEMS
As businesses have grown in size and complexity, spanning multiple locations across multiple continents, and as the velocity of data generation and the demands of decision making have increased, organizations have tended to rely more and more on information systems rather than on classical management techniques. Management has thus become more impersonal and more technology-dependent.
In this category are enterprise systems such as ERP (Enterprise Resource Planning) and CRM (Customer Relationship Management) systems.
These systems have helped large organizations manage costs vis-à-vis growth expectations while ensuring compatibility with business partners in
terms of speed of carrying out transactions. Simultaneously GRC (Governance, Risk and Compliance) issues have forced organizations to be extra
vigilant in the last couple of decades.
All the above factors have led organizations to focus on business processes, and a new discipline of Business Process Management (BPM) has emerged. This goes beyond the traditional focus on standardization, streamlining or reengineering of processes, and visibility into processes and their monitoring, to newer techniques such as business rules management and process optimization. Technology thus takes over from manual control of processes, ensuring their alignment with business objectives and strategy on an ongoing basis. Processes are now managed by “business rules and interfaces that transform recurrent requests, process them via a web of
interactions involving the firm, its customers, and other stakeholders in its value chain, and deliver unique value to the stakeholders” (Henry M.
Kim, Rajani Ramkaran, 2004).
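To make the idea of rules-driven process management more concrete, the sketch below shows a minimal, hypothetical rules engine in Python; the rule names, thresholds and request fields are illustrative assumptions and do not correspond to any particular BPM product.

# Minimal sketch of rules-driven request handling (illustrative only).
# Rule names, thresholds and request fields are hypothetical assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # predicate evaluated on a request
    action: str                        # routing decision if the rule fires

RULES: List[Rule] = [
    Rule("auto_approve_small_orders", lambda r: r["amount"] < 500, "approve"),
    Rule("escalate_new_customers", lambda r: r["customer_age_days"] < 30, "manual_review"),
    Rule("flag_high_value", lambda r: r["amount"] >= 10_000, "senior_approval"),
]

def route(request: dict) -> str:
    """Return the action of the first rule whose condition matches."""
    for rule in RULES:
        if rule.condition(request):
            return rule.action
    return "default_queue"  # fall-through when no rule fires

if __name__ == "__main__":
    print(route({"amount": 120, "customer_age_days": 400}))   # approve
    print(route({"amount": 2_000, "customer_age_days": 10}))  # manual_review

The point of the pattern is that the rules table can be changed without touching the process code, which is what allows alignment with business objectives to be adjusted on an ongoing basis.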
Another major advantage of technology taking over the role of process management is that, for a large organization, doing it the traditional manual way could let things fall through the cracks, whereas technology can ensure connectedness across a large set of disparate systems (Jason Underwood et al., 2011). This has become more meaningful with the advent of techniques like SOA (Service Oriented Architecture), which lets old and new technologies coexist seamlessly without major development efforts. SOA “develops open and standard ways of connecting traditionally independent systems in a dynamic and flexible way” (Apostolos Malatras et al., 2008).
DATA MANAGEMENT
Data used for managing businesses has grown not only larger in volume but also more diverse over the last two decades. It flows in larger chunks and has changed from being merely alphanumeric transactional data to multimedia interaction data which needs to be understood and
acted upon in real time. We need to obtain data faster, digest it quicker and then do the right thing with it sooner and discard it when it has
outlived its utility. Before internet based business became large, it was good enough if each of these stages could be accomplished in “days”.
Today most of it has to happen in “hours” if not minutes. Data composition has also altered substantially. Earlier, more than 80% of the data was
internal to the organization and to a large extent produced in a controlled environment. Today the same proportion of the data is from external
sources and with no control on its quality or diversity.
The benefit of this data management philosophy is evident when an external event disrupts the business. When organizations have mastered the art of onboarding external data and making it useful and relevant for organizational decision making in a “faster” manner, they are able to visualize the impact of the disruptive event, study the alternative responses possible
and shortlist the most effective reaction before any substantial damage is done. This improves the business continuity of the organization and
consequently its fiscal viability or profitability.
Even in the business-as-usual scenario, these organizations perform better than their less capable competitors. They are more responsive to changes in customer preferences, supply situations, fiscal tightening in the economy or labor availability. However, this requires adequate preparedness, not only in terms of technology to obtain, store and analyze the data, but also in terms of expertise (whether in-house or external) to evolve as the complexion of the data changes rapidly.
The last dimension of data management that assumes critical importance in today’s connected business is data security. Data Loss Prevention (DLP) and data protection techniques have evolved as the proportion of external data has become predominant. Keeping pace with them, and ensuring that revenue is not leaking out of the organization along with data, has become paramount. It also needs continuing dedication, as it is not a one-time effort of putting things in place but an ongoing activity that is to some extent a cultural issue.
Even though all of us are now more aware of, and more vigilant about, these issues, the number and size of such incidents has not reduced. This is a cause for concern, as it indicates that cyber criminals and hackers have become more technologically savvy and have access to newer tools and techniques. These data breaches quite often go undetected for long periods of time and are not only for immediate financial gain. Another disturbing fact is that insider threats are growing rapidly as a proportion of the total. Combined with the fact that detecting such malware through standard detection software is also becoming more difficult, this makes guarding against data breaches even more difficult and costly.
Insider threats can be classified as malicious or non-malicious. While non-malicious threats are errors, often unintentional and due to carelessness, it is the malicious ones that are long-term, recurring threats and can result in major financial losses. The troubling factor here is that the compromised organization often takes weeks or even months to realize that something is amiss.
Ensuring that end users are able to use the available data effectively requires tools for predictive analytics with interactive dashboards, so that once the system has been set up and rolled out, the effort needed to maintain it is minimal. These tools reduce the end user’s dependence on IT and make the process self-driven, both in analyzing the situation and in taking appropriate corrective action.
Another technology making its presence felt is in-memory computing, which is superior to traditional disk-based data management and analytics solutions because it provides insight into the data as events happen, rather than storing and accumulating the data and only then seeing its impact. This has sometimes been the change that has built trust in the power of data and the decisions based on it, and it has helped win top-management support for data-driven decision making in place of the earlier “experience”-based decisions.
The internet and email have always been vehicles for intrusion into the data of large corporates, but with mobile phones and other devices being
allowed inside almost all organizations through the main door, cybercriminals do not need to use the back door to enter the data vaults.
Many of these entry points are ones that business organizations cannot restrict or block access to without affecting organizational productivity, such as search engines, social media sites, and news and media sites. Signature-based defenses have been ineffective in controlling this threat, and there is a need for real-time defenses to track it.
With the new trend of users always being connected to their social media sites on the latest mobile devices, the threats to corporate data have increased significantly. This habit lowers users' guard and makes data vulnerable to shortened web links that act as carriers of malicious content, quietly slipping through even with security-aware users. It is estimated that nearly one fifth of all tweets containing web links use shortened links, and that nearly one third of all malicious web links have used shortened links to sneak in.
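One simple real-time defence implied above is to resolve a shortened link to its final destination before deciding whether it may be followed. The sketch below is a minimal illustration in Python using the requests library; the example link and the blocklist are hypothetical placeholders, and a production filter would consult reputation services rather than a hard-coded set.

# Minimal sketch: expand a shortened URL and check its final destination
# against a (hypothetical) blocklist before allowing it to be followed.
from urllib.parse import urlparse
import requests

BLOCKED_DOMAINS = {"malicious.example.com"}  # illustrative placeholder list

def resolve_final_url(short_url, timeout=5.0):
    """Follow redirects with a HEAD request and return the final URL."""
    response = requests.head(short_url, allow_redirects=True, timeout=timeout)
    return response.url

def is_allowed(short_url):
    domain = urlparse(resolve_final_url(short_url)).netloc.lower()
    return domain not in BLOCKED_DOMAINS

if __name__ == "__main__":
    # Hypothetical shortened link; replace with a real one to test.
    print(is_allowed("https://bit.ly/example"))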
In the e-book “The Digital Edge: Exploiting New Technology and Information for Business Advantage”, Mark P. McDonald, group vice president
and Gartner fellow, reveals how forging an edge requires new thinking and new approaches for bringing together digital and physical resources.
The result is a new source of advantage, where technology supports growth. Most businesses have already covered the Automate and Apply stages
in their model and need to get ready for the Accompany and Augment stages of their evolution. The author advocates that companies must build
up their internal digital capabilities to produce externally relevant outcomes and results.
EVOLUTION OF BUSINESS
Businesses need to evolve, adapt and embrace Enterprise 2.0 principles. The new-generation enterprise will be an extended enterprise in which multiple stakeholders are willing to work in collaboration to deliver effective and efficient results to consumers.
1. Customer Self-Service: To achieve organizational excellence, customers need to be allowed to access the system and processes. This helps lower company costs and also ensures a high quality of data. It often involves customization, as customers access the system according to their specific needs; customers have been effectively directing the business processes (Carolyn Heller Baird et al., 2011). Customization has also removed the difference between the front office and the back office: in Enterprise 2.0 they are fully integrated and transparent to the customer. Thus business involves continuous interaction and collaboration;
2. Expanded Sourcing: Previously, organizations continued to grow as long as the cost of doing things themselves remained smaller than the cost of outsourcing them to the market. Now outsourcing is proving to be a cheaper, faster and better option;
3. Co-Creation: Organizations are working towards creating relationships with nonprofit institutions which help to meet social responsibility goals and also create a useful and meaningful working environment for employees. Technology is licensed from partners and then incorporated into the organizations’ products or services.
It is, however, obvious that the extent of adoption will not be uniform across organizations and will vary by function, age and level of the stakeholders. It has become fashionable to profess awareness of the techniques being written about in management journals and to show them off. “Even in organizations that state that they use WEB 2.0 applications and tools, it is not really understood what the actual adoption level is” (Moria Levy, 2009). It is therefore essential to quantify the usage, and the range of applications based on these tools, with which organizations have developed a comfort level. The proposed framework helps formalize this assessment of the state of readiness.
TECHNOLOGICAL ADVANCES
Technological advances have led to new business models that tap the collaborative and conversational mode of information exchange between the organization, its partners, employees and customers. As a result, hierarchical communication is no longer effective or efficient. Instead, information should be made available to all, so that those who need it can use, consume or modify it. The focus is now on collaboration. Enterprise 2.0 promotes collaboration between companies and their partners and consumers, and between employees as well. Enterprise 2.0 combines various aspects of Web 2.0 into a secure platform where business tasks are executed in the context of business goals. It has a modular structure which enables organizations to add components and resources as and when required as the business grows (Dr. R. Cherinka).
A good Enterprise 2.0 needs to incorporate three main capabilities:
1. A centralized information system that contains both structured and unstructured information;
2. Collaboration services;
Marketing success is usually measured in terms of growing market share (revenue growth) that is also profitable (margin growth). This requires insight into the customer’s reasons for affinity with the product or service on offer. Aligning strategic inputs with customer tastes requires ongoing monitoring of marketplace intelligence through multiple tools. In the past, transactional data would suffice, and behavioral data was collected periodically through surveys in order to realign. With lifecycles shortening and customers moving into the driving seat thanks to the new technology options available, it is now mandatory that even unstructured data be monitored, to ensure that responses are not delayed and competition does not play spoilsport.
Some of the techniques that marketers have adopted to meet the new challenge of this century are improving the quality of the sales leads and
their follow-up to conversion in the pipeline, improved revenue from cross-selling and up-selling, better control on marketing spends through
constant monitoring and mid course corrections when needed, newer ways of maintaining contact with the customer and customer sentiment
and finally much higher levels of collaboration both within and outside the organization (Anthony Patino, et al., 2012).
The new markets demand a much higher success rate from marketing campaigns, as there is very little scope for error and correction. The current “word of mouse” is much louder and more pervasive than the past “word of mouth”. Any misreading of the customers’ minds needs to be corrected in hours and days rather than weeks and months. This necessitates monitoring both structured and unstructured data on a more or less real-time basis. Decisions cannot wait for the monthly management review meetings; employees have to be empowered to change or correct situations in near real time.
This change of “pace” in marketing and sales has a positive rub-off effect if managed well. Customer retention levels are much higher than those achieved by the marketing and sales efforts of the last century, which consequently reduces costs too. Marketing is expected to change customer behavior through targeted marketing campaigns under constant monitoring on the one hand, and closer collaboration with sales, manufacturing, logistics, supply chain, etc. on the other (Strategic Direction, 2012).
It forces the marketer to constantly review which campaigns and media options are delivering, and thus to improve focus on them. The fact that customers are now more aware of facts and opinions, and more technology-savvy, requires a new way of segmenting markets than was traditionally prevalent. It gives the traditional mantra of “the right input at the right time at the right place to the right customer” a totally new meaning in the new world (Patrick McCole, 2004).
Effective marketing organizations now do not just use technology but also other techniques such as knowledge repositories, business intelligence
tools and business process management techniques to enable themselves to keep pace with their customers. Their processes are better defined,
responsibilities better delineated and top-management support better coordinated to ensure effect at the market end. Among the activities they undertake are website visitor tracking, lead management and enterprise performance management (EPM) (Ranjit Bose, 2006). They are better able
to link which campaign activities result in what impact on the end sales. Such organizations are more data driven than their competitors. The
level of collaboration both intra and inter organization is much higher.
Organizationally, they are better equipped to align the appropriate salesperson with the relevant customer and, if necessary, make personalized offers to specific high-risk or high-value customers for retention or cross-selling. Their processes are more responsive and better quantified
and often decisions are managed by effective business rules rather than rigid policy frameworks. Performance indicators are monitored and
managed real time.
Knowledge repositories are more formal, and technical support is more often than not crowd-sourced. Other (satisfied) customers are their biggest advocates and often help with product and service issues for new, less satisfied or less knowledgeable customers. The lead times between identifying, collecting, aggregating and analyzing data are much lower than those of the competition, making these organizations more agile than their counterparts
(Wei Li, 2010). This ensures accuracy, currency and relevance of the data and thus higher quality of decision management. Organizations “covet
their knowledge assets” in these virtual environments (Jennifer Rowley, 2002).
A common, single knowledge repository also offers the advantage of “one version of the truth” for the entire organization, and thus no reconciliation effort over the what and why of events that take place. This also improves accountability, and thus responsibility management, in the organization. Measurement, rather than just action, is emphasized in their intra- and inter-organizational communications. Rather than shooting in the dark, people tend to take calculated risks and evaluate the results to improve their subsequent moves. While it will never be possible to remove a certain amount of guesswork or speculation from marketing activities, these organizations are able to reduce it substantially.
Most of these tools and techniques have put the three-legged stool of customer management on a more even keel. The other two legs, sales and service, have traditionally been more data-driven, and this largely reduces the criticism of marketing as the “gut-based” cousin of the two. All these data-driven marketing initiatives lead to a higher return on marketing investment (ROMI), making marketing once again the blue-eyed boy of senior management (Mike Bradbury & Neal Kissel, 2006).
Collaboration amongst the employees of an organization, and with external stakeholders, is rapidly becoming extremely important. What matters is not just the presence or absence of collaboration but also the rate of response to external stimuli. The fact that Web 2.0 tools enable Enterprise 2.0 organizations to exploit opportunities, and to avoid threats, in real time is what distinguishes them from the crowd. Human Resource departments are trying to inject into their organizations a certain percentage of employees who are aware of, and capable with, this expertise.
The benefits of this approach are highlighted by Monika Wencek in her interview with Gareth Bell (2012), where it is pointed out that “A
collaborative organization enables HR departments to not only engage and communicate effectively internally, but also to transform
relationships with external parties by connecting in a collaborative workspace, where files can be easily shared and discussed on the go. These
innovative ways of engaging result in shortened project life cycles, retention of specialist knowledge and development of new products, services
and industries.”
Issues of managerial support that emerged in a similar study are related to “basic implementation issues such as creating awareness about the
tools, promoting their usage and communicating the benefits to encourage adoption among employees” (Sotirios Paroutis, Alya Al Saleh, 2009).
ENTERPRISE 2.0 AND ROI
Enterprise 2.0 should be able to generate a solid return on investment (ROI). The main difficulty lies in developing a business case. While Enterprise 2.0 is about creating competitive advantage through interaction and collaboration, the business case addresses not the technology but organizational effectiveness and the drivers of profitability, which include customer satisfaction, loyalty, employee satisfaction, revenue, profitability, forecast accuracy, etc.
Enterprise 2.0 combines multiple disciplines, technologies and experiences, which calls for an integrative business strategy. Businesses now need to determine their action plan for developing an Enterprise 2.0 strategy.
Data (which became real-time data and then analytical data) has evolved into Big Data, characterized by the 3Vs, viz. Volume, Velocity and Variety. Information has become the most valuable and most easily stolen resource of business enterprises, and in the process data security has assumed new dimensions. With the plethora of devices creating and moving that information, there is a need to protect it from hackers and competitors.
BIG DATA
In recent times another related term has appeared on the landscape. Big Data is a term used to describe data sets so large, so complex, or requiring such rapid processing (sometimes called the Volume/Variety/Velocity problem) that they become difficult or impossible to work with using standard database management or analytical tools. Manipulating data sets like these often requires massively parallel software running on tens, hundreds, or even thousands of servers.
The growth of Big Data has essentially come about through the use of Web 2.0 technologies and includes the explosion of social media, video, photos and unstructured text, in addition to the data gathered by ubiquitous sensing devices, including smartphones. Among the many difficulties associated with Big Data are capture, storage, search, sharing, analysis, and data visualization (Brown, Chui, & Manyika, 2011).
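As a toy illustration of the “massively parallel” processing style referred to above, the sketch below splits a word-count job across worker processes on a single machine; a production system would distribute the same map/reduce pattern across many servers (for example with Hadoop or Spark), and the sample documents are stand-ins for real data.

# Toy map/reduce-style word count run in parallel on one machine.
# Real big-data systems apply the same pattern across many servers.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.lower().split())

def parallel_word_count(documents, workers=4):
    """Fan chunks out to worker processes, then reduce by summing counters."""
    with Pool(processes=workers) as pool:
        partial_counts = pool.map(count_words, documents)
    total = Counter()
    for partial in partial_counts:  # reduce step
        total.update(partial)
    return total

if __name__ == "__main__":
    docs = ["big data needs parallel processing",
            "open data and big data are not the same",
            "parallel processing scales across servers"]
    print(parallel_word_count(docs).most_common(3))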
It seems obvious, therefore, that when senior executives aim to introduce Web 2.0 into the day-to-day functioning of their organizations, they need to build an appropriate culture and environment in which it is welcomed rather than resisted. They also have to delegate decision making to the operational levels while simultaneously building checks and balances to ensure that relevant controls are not diluted as a result of this move towards Enterprise 2.0. This has been corroborated by Schneckenberg, who states that “Once the Web
2.0-based enterprise platform has been implemented, essential corporate information has to be distributed through it to make tools like Wikis,
Blogs and RSS feeds relevant to employees and to trigger the development of a new communication culture within the organization” (Dirk
Schneckenberg, 2009).
RESEARCH GAP
Most of the current literature has focused on the emergence of Web 2.0 and its effect on businesses, and thus on the transformation of enterprises into Enterprise 2.0. However, there is inadequate research on how to evaluate and measure an organization's state of readiness for this transformation and thus forecast the likelihood of success in undertaking it. McAfee and O’Reilly, who have been thought leaders on the new business models being shaped by Web 2.0, have not proposed any methodology for assessing the preparedness of an organization to take the leap from traditional business models to the new real-time models based on a mix of transactions and interactions and able to handle voluminous unstructured data at huge velocity. The only exception is the work of Brown, Chui and Manyika (2011), published in the McKinsey Quarterly, which raises the relevant questions but does not propose any measurement framework.
The current research work attempts to fill this gap by suggesting a measurement mechanism and then offering benchmarks for organizations to use for self-evaluation before starting this transformation journey. This move is multi-dimensional, with importance attached not just to technical readiness but also to the cultural and organizational-capability readiness of the organization. While the same technical competence may be available to multiple organizations, one may succeed superbly and another fail miserably in the transformation.
2. Evaluates the understanding of Enterprise 2.0 factors in the organization and the extent of usage, both current and planned;
3. Estimates the foreseen benefits and then relates them to the state of readiness for Big Data in a reasonable time frame.
RESEARCH QUESTIONS
The Research Instrument will help collect data to create measurement techniques to address the following research questions:
1. How prevalent is the understanding of the concepts of Web 2.0 and Enterprise 2.0 in the organization? Is it limited to the sales and
marketing and IS personnel only or does it pervade the entire organization?
2. Which of the Web 2.0 technologies do people value as productive for themselves and for their organizations?
3. How actively are these techniques being pursued and encouraged? Who are the change agents / leaders? Are projects still in the pilot stage or are they delivering benefits? Are the benefits only qualitative or also quantitative?
4. Has the concept of Big Data penetrated the different layers of the organization? Which aspects of big data have caught the fancy of users, and are there any perceived benefits? Is this usage concentrated in certain functions or is it all-pervasive?
5. Have all these techniques improved the organization’s ability to forecast the future behavior of its customers and to respond to these changes? Is all this only informal, or has it penetrated the formal methods of the organization?
THE FRAMEWORK
The research instrument (Annexure I) has been formulated in order to assess and evaluate the state of readiness of the organizations in a
particular country or market. The instrument addresses all the parameters of Enterprise 2.0, which have been discussed in the paper above. The
assessment will aim at identifying the areas of opportunity where further strengthening in organizations is needed in the context of Enterprise
2.0.
This research instrument will help participants judge the current health of their data management and analytics practices and then link it to the awareness and use of Web 2.0 technologies at both the individual and organizational levels in those enterprises. It attempts to understand the driving forces for this transition to new technologies and the hindrances faced, if any. It then finds out the perspective plan for the use of these technologies and discovers the depth to which they have taken root in those organizations.
In the second part, the instrument attempts to quantify the benefits that organizations already on this path perceive they have gained, and how they hope to enhance them. It specifically focuses on Big Data use and its effect on the analytical prowess of the organization, including areas where it has affected the top line or bottom line of the business. The survey also collects demographic data about the participants, to be used in correlating financial performance with the use of the new technologies.
Subsequently, this research instrument hopes to be able to provide a benchmark for enterprises seeking to understand the state of Web 2.0
initiatives among peer institutions and covers issues such as:
• What are the initial applications, kinds of data, and approaches that enterprises are employing for their Web 2.0 initiatives?
• Where do organizations stand in terms of the comparative maturity of their Web 2.0 initiatives and their rate of progress?
The research instrument will be subjected to relevant reliability and validity tests before being formally administered to the sample population of
respondents from the corporate sector.
PROPOSED DATA COLLECTION FOR THE STUDY
Data for the study will be collected from a wide cross section of Indian businesses spread across the country, with representation given to large-, medium-, and small-scale enterprises in a broad range of industries.
Designing the Research Instrument
The Research Instrument has been designed after (a) a series of personal interviews with CEOs and CIOs of select organizations and (b) a review of the literature available in the field. Well-defined reliability and validity tests will be conducted to ensure the appropriateness of the Research Instrument.
Sample Population for Data Collection
The Research Instrument will be administered to a selected database of 100 CIOs and CEOs. The individuals thus chosen are personal contacts of the author, developed during his 35 years of work experience in Indian businesses. The concerned individuals can be treated as experts in the field, on account of the following criteria:
The sample population will aim to include organizations from industries which have a natural inclination towards digitization, such as Banking and Travel, as well as non-IT-oriented industry verticals like Construction and Agriculture. It is further proposed to publish the results in a research journal.
Data Analysis
The data collected from the respondents will be empirically analyzed using appropriate research analysis techniques such as factor analysis, regression analysis, and cluster analysis. The framework will subsequently help future researchers replicate the results in other parts of the world and quantify the extent of movement on this journey on a global scale.
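Purely as an illustrative sketch of how the cluster-analysis step might be operationalized (the variable names, Likert scales, and the choice of three clusters below are hypothetical and are not part of the research instrument), readiness responses could be grouped along the following lines:

```python
# Illustrative sketch only: clusters hypothetical Likert-scale survey responses
# into readiness groups. The column names and the choice of three clusters are
# assumptions, not part of the research instrument described above.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical responses: one row per organization, 1-5 Likert scores
responses = pd.DataFrame({
    "web2_awareness":   [5, 2, 4, 1, 3],
    "tool_adoption":    [4, 1, 5, 2, 3],
    "bigdata_projects": [3, 1, 4, 1, 2],
})

# Standardize so that each readiness dimension carries equal weight
scaled = StandardScaler().fit_transform(responses)

# Group organizations into three readiness clusters (e.g., laggards,
# explorers, leaders); in practice the number of clusters would be chosen
# with a criterion such as the elbow method or silhouette score.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
responses["readiness_cluster"] = labels
print(responses)
```

In practice, the input dimensions and the number of clusters would be driven by the factor analysis and by the instrument itself rather than fixed in advance.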
It will also help businesses assess areas for improvement and thus strengthen their approach to Enterprise 2.0, which is going to affect all of us. It will help address the question of ensuring that the physical and digital resources of the business complement each other. It will also help businesses recognize the areas in which to infuse new digital capability so that not only revenues but also the value added by the business grow.
ACADEMIC AND PRACTICAL IMPLICATIONS
Tomorrow’s businesses will only be successful if they have the ability and agility to listen to their stakeholders online in real time and then
assimilate that knowledge into their operations after appropriate analysis and filtering to create a strategy based on topical insights. They will
have to be business-process driven and have the ability to modify their business models and business rules to respond to developing situations without losing the opportunity to competitors. All these are tenets of Enterprise 2.0 as described in the literature and measured in this research.
Organizations will need new tools to be able to work in this environment. The new ecosystem will call for marrying the traditional ERP and CRM
systems based on structured alphanumeric data with new social intelligence tools based on unstructured and multimedia data. Their
characteristics are also diametrically opposite in the sense that traditional systems could run in “batch” mode, while new media has to be analyzed in real time. Tomorrow’s customers will have different expectations from those of today, and businesses cannot ignore their characteristics.
Even the technology of today will not suffice in the new times. Traditional databases and computing models will need to be replaced as they will
be extremely inefficient in the new ecosystem. This has economic implications of investment and ROI. Businesses will have to be pragmatic
enough to replace capital expenditures with operational expenditures and move to “cloud” based solutions to keep flexibility to scale up or down
as the situations evolve. This will not only give them advantages of agility and scalability at cost effective rates but also the flexibility to change
their minds should the situation so demand.
All these have not only practical implications for tomorrow’s managers but also call for developing new academic frameworks of defining and
measuring these new parameters. The current research is an attempt in that direction and is expected to help focus on relevant metrics which are
an essential part of understanding any new area of business.
CONCLUSION
It is important to assess the readiness of the organization to leverage the new technologies that are emerging. If the organization is not suitably equipped, it can use the opportunity to obtain the resources and competencies needed to compete in the new Digital Age.
In tomorrow’s organizations it will be imperative not only for members of the organization to collaborate and contribute to organizational goals, but also to ensure that their suppliers, customers, and other stakeholders join appropriately in the symphony. The creation and leverage of business value will depend on this mass collaboration in real time. This implies that only those organizations which can capably transform themselves into the Enterprise 2.0 model will be able to script success stories for themselves. Getting ready for this before their competitors do is going to be a challenge for almost all organizations of today.
The aim of the current research is to be the guidepost of this journey into the new future.
LIMITATIONS AND DIRECTIONS FOR FUTURE RESEARCH
One major limitation of the current study will be the fact that random sampling of businesses is not possible at this stage, as this is an emerging area and most organizations are not even aware of the concepts and technologies. Thus the sample may be skewed in favor of technology-savvy organizations. The sample size of 100 organizations may also not be seen as adequate by traditional researchers, but is considered sufficient for this study, since the 100 organizations will represent the viewpoints of thousands of individuals.
Future research can attempt to build on the findings of this study and refine the framework to cover the total population of businesses in India.
Similar studies can be attempted in other internet-savvy countries.
This study does not attempt to link the state of readiness to various causative factors, which can be a topic for future research. Similarly, it provides a current snapshot of the status in Indian businesses. Future research can attempt longitudinal studies to assess the movement in
this direction year on year. However, it is expected that this framework will work as a baseline for a lot of future research in this area and set up
the foundations for a lot of subsequent work.
This work was previously published in the International Journal of Virtual Communities and Social Networking (IJVCSN), 6(1); edited by Subhasish Dasgupta, pages 52-66, copyright year 2014 by IGI Publishing (an imprint of IGI Global).
REFERENCES
Baird, C. H., & Parasnis, G. (2011). From social media to social customer relationship management. Strategy and Leadership, 39(5), 30–37. doi:10.1108/10878571111161507
Bell, I. G. (2012). Enterprise 2.0: Bringing social media inside your organization: An interview with Monika Wencek, Senior Customer Success Manager at Yammer. Human Resource Management International Digest, 20(6), 47–49. doi:10.1108/09670731211260915
Berman, S. J. (2012). Digital transformation: Opportunities to create new business models. Strategy and Leadership, 40(2), 16–24. doi:10.1108/10878571211209314
Berman, S. J., Abraham, S., Battino, B., Shipnuck, L., & Neus, A. (2007). New business models for the new media world. Strategy and Leadership, 35(4), 23–30. doi:10.1108/10878570710761354
Bose, R. (2006). Understanding management data systems for enterprise performance management. Industrial Management & Data Systems, 106(1), 43–59. doi:10.1108/02635570610640988
Bradbury, M., & Kissel, N. (2006). Investment in marketing: The allocation conundrum. The Journal of Business Strategy, 27(5), 17–22. doi:10.1108/02756660610692662
Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’? Parsing the benefits: Not all industries are created equal. The McKinsey Quarterly, 4, 24–35.
Cherinka, R., Miller, R., Prezzama, J., & Smith, C. (n.d.). Reshaping the enterprise with web 2.0 capabilities: Challenges with main-stream adoption across the Department of Defense. The MITRE Corporation.
Francesco, A. C. (2010). Evolution of twenty-first century knowledge workers. On the Horizon, 18(3), 160–170. doi:10.1108/10748121011072618
Kim, H. M., & Ramkaran, R. (2004). Best practices in e-business process management: Extending a re-engineering framework. Business Process Management Journal, 10(1), 27–43. doi:10.1108/14637150410518310
Levy, M. (2009). WEB 2.0 implications on knowledge management. Journal of Knowledge Management, 13(1), 120–134. doi:10.1108/13673270910931215
Li, W. (2010). Virtual knowledge sharing in a cross-cultural context. Journal of Knowledge Management, 14(1), 38–50. doi:10.1108/13673271011015552
Malatras, A., Asgari, A. H., Baugé, T., & Irons, M. (2008). A service-oriented architecture for building services integration. Journal of Facilities Management, 6(2), 132–151. doi:10.1108/14725960810872659
Malhotra, Y. (2005). Integrating knowledge management technologies in organizational business processes: Getting real time enterprises to deliver real business performance. Journal of Knowledge Management, 9(1), 7–28. doi:10.1108/13673270510582938
McAfee, A. P. (2006). Enterprise 2.0: The dawn of emergent collaboration. MIT Sloan Review, 47(3), 21–28.
McCole, P. (2004). Refocusing marketing to reflect practice: The changing role of marketing for business. Marketing Intelligence & Planning, 22(5), 531–539. doi:10.1108/02634500410551914
McDonald, M. P., & Rowsell-Jones, A. (2012). The digital edge: Exploiting new technology and information for business advantage. Gartner E-Books.
New Vantage. (2012). Big data survey questionnaire. Retrieved from http://newvantage.com/wp-content/uploads/2012/12/Big-Data-Survey-Questionnaire.pdf
Paroutis, S., & Al Saleh, A. (2009). Determinants of knowledge sharing using Web 2.0 technologies. Journal of Knowledge Management, 13(4), 52–63. doi:10.1108/13673270910971824
Patino, A., Pitta, D. A., & Quinones, R. (2012). Social media's emerging importance in market research. Journal of Consumer Marketing, 29(3), 233–237. doi:10.1108/07363761211221800
Rowley, J. E. (2002). Reflections on customer knowledge management in e-business. Qualitative Market Research: An International Journal, 5(4), 268–280. doi:10.1108/13522750210443227
Schneckenberg, D. (2009). Web 2.0 and the empowerment of the knowledge worker. Journal of Knowledge Management, 13(6), 509–520. doi:10.1108/13673270910997150
Strategic Direction. (2012). New media needs new marketing: Social networking challenges traditional methods. Strategic Direction, 28(6), 24–27. doi:10.1108/02580541211224085
Turner, P. (1991). Using information to enhance competitive advantage – The marketing options. European Journal of Marketing, 25(6), 55–64. doi:10.1108/03090569110000664
Underwood, J., & Isikdag, U. (2011). Emerging technologies for BIM 2.0. Construction Innovation: Information, Process, Management, 11(3), 252–258. doi:10.1108/14714171111148990
APPENDIX
State of Readiness for Enterprise 2.0 in Indian Businesses
The survey should take no more than 40-50 minutes to complete. Respondents will remain anonymous, and responses, at either the respondent or company level, will not be shared or identified (see Tables 1 and 2).
Table 1. State of readiness for Enterprise 2.0 survey (column headings: Sl. No., Question; the questionnaire sections include Company Background, Enclosures/Attachments, and Contact Information)
ABSTRACT
This chapter is mainly crafted in order to give a business-centric view of big data analytics. The reader can find the major application domains / use cases of big data analytics and the compelling needs and reasons for wholeheartedly embracing this new paradigm. The emerging use cases include the use of real-time data, such as sensor data, to detect abnormalities in plant and machinery, and the batch processing of sensor data collected over a period to conduct failure analysis of plant and machinery. The author describes the short-term as well as the long-term benefits and seeks to dispel all kinds of doubts and misgivings about this new idea, which has been pervading and penetrating every tangible domain. The ultimate goal is to demystify this cutting-edge technology so that its acceptance and adoption levels go up significantly in the days to unfold.
INTRODUCTION
Today, besides data getting originated in multiple formats and types, data sources, speeds, and sizes are growing expediently and exponentially.
The device ecosystem is expanding fast thereby resulting in a growing array of fixed, portable, wireless, wearable, nomadic, and mobile devices,
instruments, machines, consumer electronics, kitchen wares, household utensils, equipment, etc. Further on, there are trendy and handy,
implantable, macro and nano-scale, disposable, disappearing, and diminutive sensors, actuators, chips and cards, tags, speckles, labels, stickers,
smart dust, and dots being manufactured in large quantities and deployed in different environments for gathering environment intelligence in
real time. Elegant, slim and sleek personal gadgets and gizmos are really appealing to people today. Self-, situation- and surroundings-awareness are being projected as next-generation features for even commonplace, inexpensive things. With the set of empowerment tasks such as digitalization,
service-enablement and extreme connectivity, the future hardware and software systems are bound to exhibit real-time and real-world
intelligence in their operations, outputs and offerings. Knowledge extraction, engineering and exposition will become a common affair.
There are deeper and extreme connectivity methods flourishing these days. Integration and orchestration techniques, platforms, and products
have matured significantly. The result is that Information and Communication Technology (ICT) infrastructures, platforms, applications,
services, social sites, and databases at the cyber level are increasingly interconnected with devices, digitalized objects (smart and sentient
materials) and people at the ground level via a variety of networks and middleware solutions. There is a strategic, seamless and spontaneous
convergence between the virtual and physical worlds. All these clearly underscore the point that data creation/generation, capture, transmission, storage, and leverage needs have been growing ceaselessly. This positive and progressive trend conveys a number of key messages to be seriously contemplated by worldwide business and IT executives, engineers, and experts. New techniques, tips, and tools need to be unearthed in
order to simplify and streamline the knowledge discovery process out of data heaps. The scope is bound to enlarge and there will be a number of
fresh possibilities and opportunities for business houses. Solution architects, researchers and scholars need to be cognizant of the niceties,
ingenuities, and nitty-gritty of the impending tasks of transitioning from data to information and then to knowledge. That is, the increasing data
volume, variety, and velocity have to be smartly harnessed and handled through a host of viable and valuable mechanisms in order to extract and
sustain the business value.
THE UNWRAPPING OF BIG DATA COMPUTING
We have discussed the fundamental and far-reaching changes happening in the IT and business domains. Service-enablement of applications, platforms, infrastructures, and even everyday devices, together with varying yet versatile connectivity methods, has laid strong and stimulating foundations for man- as well as machine-generated data. The tremendous rise in data collection, along with all its complications, has compelled both business and IT leaders to act on this huge impending data-driven opportunity for growing corporates. This is the beginning of the much-discussed big data computing discipline. This paradigm is getting formalized with
the deeper and decisive collaboration amongst product vendors, service organizations, independent software vendors, system integrators,
innovators, and research organizations. Having understood the strategic significance, all the different and distributed stakeholders have come
together in complete unison in creating and sustaining simplifying and streamlining techniques, platforms and infrastructures, integrated
processes, best practices, design patterns, and key metrics to make this new discipline pervasive and persuasive. Today the acceptance and activation levels of big data computing are consistently on the climb. It is bound to raise a number of critical challenges, but at the same time, if taken seriously, it can be highly impactful and insightful for business organizations seeking to traverse the right route confidently. The continuous unearthing of integrated processes, platforms, patterns, practices, and products is a good indication of bright days ahead for the big data phenomenon.
The implications of big data are vast and varied. The principal activity is to do a variety of tool-based and mathematically sound analyses on big
data for instantaneously gaining big insights. It is a well-known fact that any organization having the innate ability to swiftly and succinctly
leverage the accumulating data assets is bound to be successful in what it operates, provides, and aspires to. That is, besides instinctive decisions, informed decisions go a long way in shaping and confidently steering organizations. Thus, just gathering data is no longer enough; the IT-enabled, timely extraction of actionable insights from those data assets serves the betterment of businesses. Analytics is the formal discipline in IT for methodically doing data collection, filtering, cleaning, translation, storage, representation, processing, mining, and analysis with the aim of extracting useful and usable intelligence. Big data analytics is the newly coined term for accomplishing analytical operations on big data. With this renewed focus, big data analytics is gaining more market and mind share across the world. With a string of new capabilities and competencies accruing from this recent and riveting innovation, worldwide corporates are jumping onto the big data analytics bandwagon. This chapter aims to demystify the hidden niceties and ingenuities of big data analytics.
BIG DATA CHARACTERISTICS
Big data is the general term used to represent massive amounts of data that are not stored in the relational form in traditional enterprise-scale
databases. New-generation database systems are being unearthed in order to store, retrieve, aggregate, filter, mine and analyze big data
efficiently. The following are the general characteristics of big data.
• Data volumes are in the order of petabytes, exabytes, etc., compared to current storage limits of gigabytes and terabytes
• There can be multiple structures (structured, semi-structured and less-structured) for big data
• Multiple types of data sources (sensors, machines, mobiles, social sites, etc.) and resources for big data
• Data is time-sensitive (near real-time as well as real-time). That means big data consists of data collected with time relevance, so that timely insights can be extracted
A Perspective on Big Data Computing
A series of revolutions on the web and the device ecosystem have resulted in multi-structured (unstructured, semi-structured and structured)
data being produced in large volumes, gathered and transmitted over the Internet communication infrastructure from distant, and distributed
sources. Then they are subjected to processing, filtering, cleansing, transformation, and prioritization through a slew of computer and data-
intensive processes, and stocked in high-end storage appliances and networks. For decades, companies have been making business-critical
decisions based on transactional data (structured) stored in relational databases. Today the scene is quite different and the point worth
mentioning here is that data are increasingly less structured, exceptionally huge in volumes, and complicatedly diverse in data formats. Decision-
enabling data are being generated and garnered in multiple ways and can be classified as man-generated and machine-generated. Incidentally, machine-generated data volumes are huge compared to the ones originating from human beings. Cameras’ still images and videos, clickstreams, industry-generic and industry-specific business transactions and operations, knowledge content (e-mail messages, PDF files, Word documents, presentations, Excel sheets, e-books, etc.), chats and conversations, data emitted from sensors and actuators, and scientific experiment data are the latest less- and medium-structured data types.
Millions of people everyday use a number of web 2.0 (social web) platforms and sites that facilitate users from every nook and corner of this
connected world to read and write their views and reviews on all subjects under the sun, to voluntarily pour their complaints, comments and
clarifications on personal as well as professional services and solutions, to share their well-merited knowledge to a wider people community, to
form user communities for generic as well as specific purposes, to advertise and promote newer ideas and products, to communicate and
collaborate, to enhance people productivity, etc. Thus weblogs and musings from people across the globe lead to data explosion. These can be
appropriately integrated, stocked, streamed and mined for extracting useful and usable information in the forms of tips, trends, hidden
associations, alerts, impending opportunities, reusable and responsible patterns, insights, and other hitherto unexplored facts.
Data have become a torrent flowing into every area of the global economy. Companies churn out a burgeoning volume of transactional data, capturing trillions of bytes of information about their customers, suppliers, and operations; millions of networked sensors are being embedded in the physical world in devices such as mobile phones, smart energy meters, automobiles, and industrial machines that sense, create, and communicate data in the age of the Internet of Things. Indeed, as companies and organizations go about their business and interact with individuals, they are generating a tremendous amount of digital “exhaust data,” i.e., data that are created as a byproduct of other activities. Social media sites, smartphones, and other consumer devices including PCs and laptops have allowed billions of individuals around the world to contribute to the amount of Big Data available. And the growing volume of multimedia content has played a major role in the exponential growth in the amount of Big Data. Each second of high-definition video, for example, generates more than 2,000 times as many bytes as required to store a single page of text. In a digitized world, consumers going about their day, communicating, browsing, buying, sharing, and searching, create their own enormous trails of data. (From the McKinsey Global Institute Report on Big Data)
Big data computing involves a bevy of powerful procedures, products and practices to comprehensively and computationally analyze multi-
structured and massive data heap to create and sustain fresh business value. Sharp reductions in the cost of both storage and compute power
have made it feasible to collect, crunch and capitalize this new-generation data proactively and pre-emptively with greater enthusiasm.
Companies are looking for ways and means to include non-traditional yet potentially valuable data along with their traditional enterprise data for
predictive and prescriptive analyses. The McKinsey Global Institute (MGI) estimates that data volume is growing 40% per year. There are four important characteristics that define the ensuing era of big data computing.
• Volume: As indicated above, machine-generated data is growing exponentially in size compared to man-generated data volume. For
instance, digital cameras produce high-volume image and video files to be shipped, succinctly stored and subjected to a wider variety of tasks
for different reasons including video-based security surveillance. Research labs such as CERN generate massive data, avionics and
automotive electronics too generate a lot of data, smart energy meters and heavy industrial equipment like oil refineries and drilling rigs
generate huge data volumes.
• Velocity: These days social networking and micro-blogging sites create a large amount of information. Though the size of information
created and shared is comparatively small here, the number of users is huge and hence the frequency is on the higher side resulting in a
massive data collection. Even at 140 characters per tweet, the high velocity of Twitter data ensures large volumes (over 8 terabytes (TB) per
day).
• Variety: Newer data formats are arriving compounding the problem further. As enterprise IT is continuously strengthened with the
incorporation of nimbler embedded systems and versatile cloud services to produce and provide premium and people-centric applications to
the expanding user community, new data types and formats are evolving.
• Value: Data is an asset and it has to be purposefully and passionately processed, prioritized, protected, mined and analysed utilizing
advanced technologies and tools in order to bring out the hidden knowledge that enables individuals and institutions to contemplate and
carry forward the future course of actions correctly.
WHY BIG DATA COMPUTING?
The main mandate of information technology is to capture, store and process a large amount of data to output useful information in a preferred
and pleasing format. With continued advancements in IT, lately there arose a stream of competent technologies to derive usable and reusable
knowledge from the expanding information base. The much-wanted transition from data to information and to knowledge has been simplified
through the meticulous leverage of those IT solutions. Thus data have been the main source of value creation for the past five decades. Now with
the eruption of big data and the enabling platforms, corporates and consumers are eyeing and yearning for better and bigger value derivation.
Indeed the deeper research in the big data domain breeds a litany of innovations to realize robust and resilient productivity-enhancing methods
and models for sustaining business value. The hidden treasures in big data troves are being technologically exploited to the fullest extent in order
to zoom ahead of competitors. The big data-inspired technology clusters facilitate newer business acceleration and augmentation
mechanisms. In a nutshell, the scale and scope of big data is to ring in big shifts. The proliferation of social networks and multifaceted devices
and the unprecedented advancements in connectivity technologies have laid a strong and stimulating foundation for big data. There are several
market analyst and research reports coming out with positive indications that bright days are ahead for big data analytics.
The Application Domains
Every technological innovation in our everyday life is being recognized and renowned when it has the inherent wherewithal to accomplish new
things or to exalt existing things to newer heights. There is an old saying that necessity is the mother of all inventions. As the data germination,
capture and storage scene is exponentially growing, knowledge discovery occupies the center stage. This has pushed technology consultants,
product vendors and system integrators to ponder about a library of robust and resilient technologies, platforms, and procedures that come
handy in quickly extracting practical insights from the data heaps. Today there are a number of industry segments coming out of their comfort
zone and capitalizing on the noteworthy advancements in big data computing to zoom ahead of their competitors and to plan and provide premium services to retain their current customers as well as to attract new consumers. In the paragraphs below, I have described a few verticals that stand to benefit enormously from the maturity of big data computing.
For governments, the big data journey assures a bright and blissful opportunity to boost their efficacy in their citizen services’ delivery
mechanisms. IT spend will come down even as IT-based automation in governance is enhanced. There are research results that reinforce the
view that the public sector can boost its productivity significantly through the effective use of big data. For corporates, when big data is dissected,
distilled and analysed in combination with traditional enterprise data, the corporate IT can gain a more comprehensive and insightful
understanding of its business, which can lead to enhanced productivity, a stronger competitive position in the marketplace and an aura of greater
innovation. All of these will have a momentous impact on the bottom-line revenue.
For people, there is a growing array of immensely incredible benefits. For example, the use of in-home and in-body monitoring devices such as
implantable sensors, wearables, fixed and portable actuators, robots, compute devices, LED displays, smart phones, etc. and their ad-hoc
networking capabilities to measure vital body parameters accurately and to monitor progress continuously is a futuristic way to improve patients’
health drastically. That is, sensors are the eyes and ears of new-generation IT and their contribution spans from environmental monitoring to
body-health monitoring. This is a breeding ground for establishing elegant and exotic services for the entire society. Sellers and shoppers can
gain much through communication devices and information appliances. The proliferation of smart phones and other GPS devices offers
advertisers an opportunity to target consumers when they are in close proximity to a store, a coffee shop or a restaurant. This opens up uncharted
and hitherto unforeseen avenues for fresh revenues for service providers and businesses. The market and mind share of those pioneering
businesses are bound to grow by leaps and bounds. Retailers can make use of social computing sites to understand people’s preferences and
preoccupations to smartly spread out their reach. The hidden facts and patterns elicited can enable them to explore and execute much more
effective micro-customer segmentation and targeted marketing campaigns. Further on, they come handy in eliminating any supply chain
disturbances and deficiencies. Thus big data computing is set to contribute to productivity improvements across all kinds of enterprises. It has become
a compelling reason for people to ponder about its tactical as well as strategic significance.
BIG DATA CONCERNS AND CHALLENGES
Since big data is an emerging domain, there can be some uncertainties, potential roadblocks and landmines that could probably unsettle the
expected progress. Let us consider a few that are more pertinent.
• Technology: Technologies and tools are very important for creating business value of big data. There are multiple products and platforms
from different vendors. However the technology choice is very important for firms to plan and proceed without any hitch in their pursuit.
The tool and technology choices will vary depending on the types of data to be manipulated (e.g. XML documents, social media, sensor data,
etc.), business drivers (e.g. sentiment analysis, customer trends, product development, etc.) and data usage (analytic or product development
focused).
• Data Governance: Any system has to be appropriately governed in order to be strategically beneficial. Due to the sharp increase in data
sources, types, channels, formats, and platforms, data governance is an important component in efficiently regulating the data-driven tasks.
Other important motivations include data security while in transit and in persistence, data integrity and confidentiality. Further on, there are
governmental regulations and standards from world bodies and all these have to be fully complied with in order to avoid any kind of
ramifications at a later point of time.
• Skilled Resources: It is predicted by MGI that there will be a huge shortage of human talent for organizations providing big data-based
services and solutions. There will be requirements for data modelers, scientists, and analysts in order to get all the envisaged benefits of big
data. This is a definite concern to be sincerely attended by companies and governments across the world.
• Accessibility, Consumability and Simplicity: Big Data product vendors need to bring forth solutions that abstract away the complexities of the big data framework from users, enabling them to extract business value. The operating interfaces need to be intuitive and informative so that ease of use can be ensured for people using big data solutions.
Big data’s reputation has taken a bit of a battering lately thanks to the allegations that the NSA is silently and secretly collecting and storing
people’s web and phone records. This has led to a wider debate about the appropriateness of such extensive data-gathering activities. But this
negative publicity should not distract people from the reality of big data. That is, big data is ultimately meant to benefit society as a whole. There’s more
to these massive data sets than simply catching terrorists or spying on law-abiding citizens.
In short, big data applications, platforms, appliances and infrastructures need to be designed in a way to facilitate their usage and leverage for
everyday purposes. Awareness of the potential needs to be propagated widely, and professionals need to be trained in order to extract better business value out of big data. Competing technologies, enabling methodologies, prescriptive patterns, evaluation metrics, key guidelines, and best practices need to be unearthed and made into reusable assets.
INTRODUCING BIG DATA ANALYTICS
This recent entrant of big data analytics into the continuously expanding technology landscape has generated a lot of interest among industry
professionals as well as academicians. Big Data has become an unavoidable trend and it has to be solidly and succinctly handled in order to derive
time-sensitive and actionable insights. There is a dazzling array of tools, techniques and tips evolving in order to quickly capture data from
diverse distributed resources and process, analyze, and mine the data to extract actionable business insights to bring in technology-sponsored
business transformation and sustenance. In short, analytics is the thriving phenomenon in every sphere and segment today. Especially with the
automated capture, persistence, and processing of the tremendous amounts of multi-structured data generated by men as well as machines,
the analytical value, scope, and power of data are bound to blossom further in the days to unfold. Precisely speaking, data is a strategic asset for
organizations to insightfully plan to sharply enhance their capabilities and competencies and to embark on the appropriate activities that
decisively and drastically power up their short as well as long-term offerings, outputs, and outlooks. Business innovations can happen in plenty
and be sustained too when there is a seamless and spontaneous connectivity between data-driven and analytics-enabled business insights and
business processes.
In the recent past, real-time analytics have gained much prominence and several product vendors have been flooding the market with a number
of elastic and state-of-the-art solutions (software as well as hardware) for facilitating on-demand, ad-hoc, real-time and runtime analysis of
batch, online transaction, social, machine, operational and streaming data. There are a number of advancements in this field due to its huge
potentials for worldwide companies in considerably reducing operational expenditures while gaining operational insights. Hadoop-based
analytical products are capable of processing and analyzing any data type and quantity across hundreds of commodity server clusters. Stream
Computing drives continuous and cognitive analysis of massive volumes of streaming data with sub-millisecond response times. There are
enterprise data warehouses, analytical platforms, in-memory appliances, etc. Data Warehousing delivers deep operational insights with advanced
in-database analytics. The EMC Greenplum Data Computing Appliance (DCA) is an integrated analytics platform that accelerates analysis of Big
Data assets within a single integrated appliance. IBM PureData System for Analytics architecturally integrates database, server and storage into a
single, purpose-built, easy-to-manage system. SAP HANA is another exemplary platform for efficient big data analytics. Platform vendors are tying up with infrastructure vendors, especially cloud service providers (CSPs), to take analytics to the cloud so that the goal of analytics as a service (AaaS) becomes a reality sooner rather than later. There are multiple startups with innovative product offerings to speed up and
simplify the complex part of big data analysis.
The Big Trends of Big Data Analytics
The future of business definitely belongs to those enterprises that swiftly embrace the big data analytics movement and use it strategically to their
own advantages. It is pointed out that business leaders and other decision-makers, who are smart enough to adopt a flexible and futuristic big
data strategy, can take their businesses towards greater heights. Successful companies are already extending the value of classic and conventional
analytics by integrating cutting-edge big data technologies and outsmarting their competitors. There are several forecasts, exhortations,
expositions, and trends on the discipline of Big Data analytics. Market research and analyst groups have come out with positive reports and
briefings, detailing its key drivers and differentiators, the future of this brewing idea, its market value, the revenue potentials and application
domains, the fresh avenues and areas for renewed focus, the needs for its sustainability, etc. Here come the top trends emanating from this field.
The Rapid Growth of the Cloud Paradigm
The cloud movement is expediently thriving and trend-setting a host of delectable novelties. A number of tectonic transformations on the
business front are being activated and accentuated with faster and easier adaptability of the cloud IT principles. The cloud concepts have opened
up a deluge of fresh opportunities for innovators, individuals, and institutions to conceive and concretize new-generation business services and
solutions. Without an iota of doubt, a dazzling array of path-breaking and mission-critical business augmentation models and mechanisms have
emerged which are consistently evolving towards perfection as the cloud technology grows relentlessly in conjunction with other enterprise-class
technologies.
The Integrated Big Data Analytics Platforms
Integrated platforms are very essential in order to automate several tasks enshrined in the data capture, analysis and knowledge discovery
processes. A converged platform comes out with a reliable workbench to empower developers to facilitate application development and other
related tasks such as data security, virtualization, integration, visualization, and dissemination. Special consoles are being attached with new-
generation platforms for performing other important activities such as management, governance, enhancement, etc. Hadoop is a disruptive
technology for data distribution amongst hundreds of commodity compute machines for parallel data crunching and any typical big data
platform is blessed with Hadoop software suite.
Further on, the big data platform enables entrepreneurs, investors, chief executive, information, operations, knowledge, and technology officers (CXOs), and marketing and sales people to explore and experiment on big data at scale, at a fraction of the time and cost required previously. That is, platforms are to provide all kinds of stakeholders and end-users with actionable insights that in turn enable them to take informed decisions in time. Knowledge workers such as business analysts and data scientists could be the other main beneficiaries through these
empowered platforms. Knowledge discovery is an important portion here and the platform has to be equipped with real-time and real-world
tips, associations, patterns, trends, risks, alerts, and opportunities. In-memory and in-database analytics are gaining momentum for high-
performance and real-time analytics. New advancements in the form of predictive and prescriptive analytics are emerging fast with the maturity
and stability of big data technologies, platforms, infrastructures, tools, and finally a cornucopia of sophisticated data mining and analysis
algorithms. Thus platforms need to be fitted with new features, functionalities and facilities in order to provide next-generation insights.
Optimal Infrastructures for Big Data Analytics
There is no doubt that consolidated and compact platforms accomplish a number of essential actions towards simplified big data analysis and
knowledge discovery. However they need to run in optimal, dynamic, and converged infrastructures to be effective in their operations. In the
recent past, IT infrastructures went through a host of transformations such as optimization, rationalization, and simplification. The cloud idea
has captured the attention of infrastructure specialists these days as the cloud paradigm is being proclaimed as the most pragmatic approach for
achieving the ideals of infrastructure optimization. Hence with the surging popularity of cloud computing, every kind of IT infrastructure
(servers, storages, and network solutions) is being consciously subjected to a series of modernization tasks to empower them to be policy-based,
software-defined, cloud-compliant, service-oriented, networkable, programmable, etc. That is, Big Data analytics is to be performed in
centralised/federated, virtualised, automated, shared, and optimized cloud infrastructures (private, public or hybrid). Application-specific IT
environments are being readied for the big data era. Application-aware networks are the most sought-after communication infrastructures for big
data transmission and processing. Figure 1 illustrates all the relevant and resourceful components for simplifying and streamlining big data
analytics.
As with data warehousing, data marts and online stores, an infrastructure for big data too has some unique requirements. The ultimate goal here
is to easily integrate big data with enterprise data to conduct deeper and influential analytics on the combined data set. As per the White paper
titled “Oracle: Big Data for the Enterprise”, there are three prominent requirements (data acquisition, organization and analysis) for a typical big
data infrastructure. NoSQL has all these three intrinsically.
• Acquire Big Data: The infrastructure required to support the acquisition of big data must deliver low and predictable latency in both
capturing data and in executing short and simple queries. It should be able to handle very high transaction volumes often in a distributed
environment and also support flexible and dynamic data structures. NoSQL databases are the leading infrastructure to acquire and store big
data. NoSQL databases are well-suited for dynamic data structures and are highly scalable. The data stored in a NoSQL database is typically
of a high variety because the systems are intended to simply capture all kinds of data without categorizing and parsing the data. For example,
NoSQL databases are often used to collect and store social media data. While customer-facing applications frequently change, underlying
storage structures are kept simple. Instead of designing a schema with relationships between entities, these simple structures often just
contain a major key to identify the data point and then a content container holding the relevant data. This extremely simple and nimble
structure allows changes to take place without any costly reorganization at the storage layer (a minimal illustration follows this list).
• Organize Big Data: In classical data warehousing terms, organizing data is called data integration. Because there is such a huge volume
of data, there is a tendency and trend gathering momentum to organize data at its original storage location. This saves a lot of time and
money as there is no data movement. The brewing need is to have a robust infrastructure that is innately able to organize big data, process
and manipulate data in the original storage location. It has to support very high throughput (often in batch) to deal with large data
processing steps and handle a large variety of data formats (unstructured, less structured and fully structured).
• Analyse Big Data: The data analysis can also happen in a distributed environment. That is, data stored in diverse locations can be
accessed from a data warehouse to accomplish the intended analysis. The appropriate infrastructure required for analysing big data must be
able to support deeper analytics such as statistical analysis and data mining on a wider variety of data types stored in diverse systems, to
scale to extreme data volumes, to deliver faster response times driven by changes in behaviour and to automate decisions based on analytical
models. Most importantly, the infrastructure must be able to integrate analysis on the combination of big data and traditional enterprise
data to produce exemplary insights for fresh opportunities and possibilities. For example, analysing inventory data from a smart vending
machine in combination with the events calendar for the venue in which the vending machine is located, will dictate the optimal product mix
and replenishment schedule for the vending machine.
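To make the “major key plus content container” structure described under Acquire Big Data above more concrete, the following minimal sketch captures heterogeneous events against a single key without imposing any schema. An in-memory Python dictionary stands in for a NoSQL store purely for illustration; the keys and event fields are hypothetical.

```python
# Minimal stand-in for a schema-less key-value store: each record is just a
# major key plus an opaque content container (here, a JSON document).
# A real deployment would use a NoSQL database; the dict is illustrative only.
import json
from datetime import datetime, timezone

store = {}  # major key -> serialized content container

def acquire(event_id: str, payload: dict) -> None:
    """Capture an event as-is, without parsing or categorizing its fields."""
    store[event_id] = json.dumps(payload)

# Two events with completely different shapes can coexist without any
# schema change or storage-layer reorganization.
acquire("tweet:1001", {"user": "alice", "text": "loving the new phone",
                       "ts": datetime.now(timezone.utc).isoformat()})
acquire("sensor:42",  {"device": "meter-7", "kwh": 3.2, "alerts": []})

# Retrieval is a simple lookup by the major key.
print(json.loads(store["tweet:1001"])["text"])
```

The point of the sketch is the design choice rather than the container: because each record is just a key and an opaque document, records of entirely different shapes can be acquired side by side without any reorganization at the storage layer.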
Newer and Nimbler Big Data Applications
The success of any technology is to be squarely decided based on the number of mission-critical applications it could create and sustain. That is,
the applicability or employability of the new paradigm to as many application domains as possible is the main deciding factor for its successful journey. As far as development is concerned, big data applications could differ considerably from other software applications. Web and mobile enablement of big data applications is also important. As big insights are becoming mandatory for multiple industry segments, there is greater scope for big data applications. Therefore there is a big market for big data application development platforms, patterns, metrics, methodologies, reusable components, etc.
Tending Towards a Unified Architecture
It is an unassailable truth that an integrated IT environment is a minimum requirement for attaining the expected success out of the big data
concepts. Deploying big data platforms in an IT environment that lacks a unified architecture and does not seamlessly and spontaneously
integrate distributed and diverse data sources, metadata, and other essential resources would not produce the desired insights. Such deployments
will quickly lead to a torrent of failed big data projects and in a fragmented setup, achieving the desired results remains a pipe dream forever.
Hence a unified and modular architecture is the need of the hour for taking forward the ideals of the big data discipline. Deploying big data
applications in a synchronized enterprise or cloud IT environment makes analytics simpler, faster, cheaper, and more accurate, while remarkably
reducing deployment and operational costs.
Blending of Capabilities
In the ensuing era of big data, there could be multiple formats for data representation, transmission and persistence. The related trend is that
there are databases without any formal schema. SQL is the standard query language for traditional databases whereas in the big data era, there
are NoSQL databases that do not support SQL, the standard for structured querying. Special file systems such as Hadoop Distributed
File System (HDFS) are being produced in order to facilitate big data storage and access. Thus analytics in the big data period is quite different
from the analytics on the SQL databases. However there is a firm place for SQL-based analytics and hence there is an insistence on converging
both to fulfil the varying needs of business intelligence (BI). Tools and technologies that provide a native blending of classic and new data
analytics techniques will have an inherent advantage.
The Rise of Big Data Appliances
Appliances (hardware and virtual) are being prescribed as a viable and value-adding approach for scores of business-critical application
infrastructure solutions such as service integration middleware, messaging brokers, security gateways, load balancing, etc. They are fully
consolidated and pre-fabricated with the full software stack so that their deployment and time-to-operation is quick and simple. There are XML
and SOA appliances in plenty in the marketplace for eliminating all kinds of performance bottlenecks in business IT solutions. In the recent past,
EMC Greenplum and SAP HANA appliances have been attracting attention. SAP HANA is being projected as a game-changing and real-
time platform for business analytics and applications. While simplifying the IT stack, it provides powerful features like significant processing
speed, the ability to handle big data, predictive capabilities and text mining capabilities. Thus the emergence and evolution of appliances
represents a distinct trend as far as big data is concerned.
Big Data Processes
Besides converged architecture, infrastructures, application domains, and platforms, synchronized processes are very important in order to
augment and accelerate big data analytics. Already, analytics-attached processes are emerging and evolving consistently. That is, analytics has become so important an activity that it is becoming tightly coupled with processes. Also, the analytics as a service (AaaS) paradigm is on the verge of massive adoption, and hence analytics-oriented process integration, innovation, control, and management aspects will gain more prominence
and dominance in the days to unfold.
BIG DATA ANALYTICS FRAMEWORKS AND INFRASTRUCTURE
There are mainly two types of big data processing: real-time and batch processing. Data is flowing endlessly from countless sources these
days. Data sources are on the climb. Innumerable sensors, varying in size, scope, structure, smartness, etc. are pouring data continuously. Stock
markets are emitting a lot of data every second, system logs are being received, stored, processed, analyzed, and acted upon ceaselessly.
Monitoring agents are working tirelessly producing a lot of usable and useful data, business events are captured, knowledge discovery is initiated,
information visualization is realized, etc. to empower enterprise operations. Stream computing is the latest paradigm being aptly prescribed as
the best course of action for real-time receipt, processing and analysis of online, live and continuous data. Real-time data analysis through in-
memory and in-database computing models is gaining a lot of ground these days with the sharp reduction in computer memory costs. For the
second category of batch processing, the Hadoop technology is being recommended with confidence. It is clear that there is a need for competent
products, platforms, and methods for efficiently and expectantly working with both real-time as well as batch data. There is a separate chapter for
in-memory computing towards real-time data analysis and for producing timely and actionable insights.
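As a conceptual illustration of the stream computing model described above (it is not a sketch of any particular commercial product), the following snippet maintains an incremental aggregate over an unbounded sequence of readings and reacts to each item as it arrives, instead of waiting for a batch; the data source and alert threshold are hypothetical.

```python
# Conceptual stream-processing sketch: process each reading as it arrives,
# maintain incremental state, and react immediately; no batch accumulation.
# The data source and threshold below are hypothetical.
from typing import Iterable

def temperature_stream() -> Iterable[float]:
    """Stand-in for an endless sensor feed (a real source would never end)."""
    yield from [21.5, 22.0, 25.7, 31.2, 30.9, 22.4]

def monitor(stream: Iterable[float], alert_above: float = 30.0) -> None:
    count, total = 0, 0.0
    for reading in stream:
        count += 1
        total += reading
        running_avg = total / count        # incremental aggregate, O(1) per item
        if reading > alert_above:          # react per event, not per batch
            print(f"ALERT: {reading:.1f} exceeds {alert_above} "
                  f"(running avg {running_avg:.2f})")

monitor(temperature_stream())
```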
In this section, you can read more about the Hadoop technology. As elucidated before, big data analysis is not a simple affair and there are
Hadoop-based software programming frameworks, platforms, and appliances emerging to tackle the innate complications. The Hadoop
programming model has turned out to be the central and core method to propel the field of big data analysis. The Hadoop ecosystem is
continuously spreading its wings wider and enabling modules are being incorporated freshly to make Hadoop-based big data analysis simpler,
succinct and quicker.
Apache Hadoop
Apache Hadoop is an open source framework that allows for the distributed processing of large data sets across clusters of computers using a
simple programming model. Hadoop was originally designed to scale up from a single server to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the Hadoop software library itself is designed to detect and
handle failures at the application layer. Therefore, it delivers a highly available service on top of a cluster of cheap computers, each of which may
be prone to failures. Hadoop is based on a modular architecture, and thereby any of its components can be swapped with competent
alternatives if such a replacement brings noteworthy advantages.
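By way of illustration of the “simple programming model” referred to above, a word count expressed in the MapReduce style might look as follows. The sketch is written for Hadoop Streaming, which allows the map and reduce roles to be supplied as ordinary scripts reading standard input and writing standard output; the file names and the invocation shown are indicative only.

```python
#!/usr/bin/env python3
# Word count in the MapReduce style, written for Hadoop Streaming so that the
# map and reduce roles are plain scripts reading stdin and writing stdout.
# Hadoop sorts the mapper output by key, so the reducer sees each word as a
# contiguous run of lines. File and path names are placeholders.
import sys

def mapper() -> None:
    """Emit a (word, 1) pair for every word on every input line."""
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer() -> None:
    """Sum the counts for each contiguous run of identical words."""
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Local smoke test:
    #   cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```

On a cluster, the same script would typically be wired in through the Hadoop Streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" (paths and jar name are placeholders).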
The Hadoop Software Family
Despite all the hubbub and hype around Hadoop, few IT professionals know its key drivers, differentiators and killer applications. Because of the
newness and complexity of Hadoop, there are several areas wherein confusion reigns and restrains its full-fledged assimilation and adoption. The
Apache Hadoop product family includes the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, Zookeeper, Flume, Sqoop,
Oozie, Hue, and so on. HDFS and MapReduce together constitute core Hadoop, which is the foundation for all Hadoop-based applications. For
applications in business intelligence (BI), data warehousing (DW), and big data analytics, core Hadoop is usually augmented with Hive and
HBase, and sometimes Pig. The Hadoop file system excels with big data that is file based, including files that contain non-structured data.
Hadoop is excellent for storing and searching multi-structured big data, but advanced analytics is possible only with certain combinations of
Hadoop products, third-party products or extensions of Hadoop technologies. The Hadoop family has its own query and database technologies
and these are similar to standard SQL and relational databases. That means BI/DW professionals can learn them quickly.
The HDFS is a distributed file system designed to run on clusters of commodity hardware. HDFS is highly fault-tolerant because it automatically
replicates file blocks across multiple machine nodes and is designed to be deployed on low-cost hardware. HDFS provides high throughput access
to application data and is suitable for applications that have large data sets. As a file system, HDFS manages files that contain data. Because it is
file-based, HDFS itself does not offer random access to data and has limited metadata capabilities when compared to a DBMS. Likewise, HDFS is
strongly batch-oriented and hence has limited real-time data access functions. To overcome these challenges, you can layer HBase over HDFS to
gain some of the mainstream DBMS capabilities. HBase is one of the many products in the Apache Hadoop product family. HBase is modeled after Google's Bigtable and hence HBase, like Bigtable, excels at random and real-time access to very large tables containing billions of rows and millions of columns. Today HBase is limited to straightforward tables and records, with little support for more complex data structures. The Hive meta-store gives Hadoop some DBMS-like metadata capabilities.
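To make the file-based nature of HDFS concrete, the following is a minimal Python sketch using the community WebHDFS client from the hdfs package; the NameNode address, user name and paths are hypothetical, and the cluster is assumed to expose WebHDFS:

from hdfs import InsecureClient  # community WebHDFS client, not part of Hadoop itself

# Hypothetical NameNode exposing WebHDFS on port 9870
client = InsecureClient('http://namenode.example.com:9870', user='analyst')

# Write a raw, schema-free log file into HDFS; structure is imposed only later, at analysis time.
client.write('/data/raw/clickstream/2016-01-01.log',
             data='2016-01-01T00:00:01\tuser42\t/home\n',
             encoding='utf-8', overwrite=True)

# Whole-file, batch-style read back; HDFS by itself offers no random, record-level access.
with client.read('/data/raw/clickstream/2016-01-01.log', encoding='utf-8') as reader:
    print(reader.read())

# Minimal metadata: directory listings and file statuses, far less than a DBMS catalog.
print(client.list('/data/raw/clickstream'))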
When HDFS and MapReduce are combined, Hadoop easily parses and indexes the full range of data types. Furthermore, as a distributed system,
HDFS scales well and has a certain amount of fault-tolerance based on data replication even when deployed atop commodity hardware. For these
reasons, HDFS and MapReduce can complement existing BI/DW systems that focus on structured and relational data. MapReduce is a general-
purpose execution engine that works with a variety of storage technologies including HDFS, other file systems and some DBMSs.
As an execution engine, MapReduce and its underlying data platform handle the complexities of network communication, parallel programming,
and fault-tolerance. In addition, MapReduce controls hand-coded programs and automatically provides multi-threading processes so they can
execute in parallel for massive scalability. The controlled parallelization of MapReduce can apply to multiple types of distributed applications, not
just analytic ones. In a nutshell, Hadoop MapReduce is a software programming framework for easily writing massively parallel applications
which process massive amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable and fault-tolerant
manner. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely
parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks, which in turn assemble one or more result sets.
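As a concrete illustration of this map-sort-reduce flow, here is a minimal word-count sketch written for Hadoop Streaming, which lets any executable act as mapper or reducer; the script name and the input and output paths are hypothetical:

#!/usr/bin/env python3
"""Hypothetical word-count job run via Hadoop Streaming (mapper + reducer in one file)."""
import sys

def mapper():
    # Map phase: emit (word, 1) for every word in every input line.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key, so counts for the same
    # word are adjacent and can be summed in a single pass.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    # Submitted with the Hadoop Streaming jar, for example:
    #   hadoop jar hadoop-streaming.jar \
    #     -input /data/raw -output /data/counts \
    #     -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
    mapper() if sys.argv[1:] == ["map"] else reducer()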
Hadoop is not just for new analytic applications; it can revamp old ones too. For example, analytics for risk and fraud based on statistical analysis or data mining benefits from the much larger data samples that HDFS and MapReduce can wring from diverse big data. Furthermore, most 360-degree customer views include hundreds of customer attributes. Hadoop can provide the insight and data to bump that up to thousands of attributes, which in turn provides greater detail and precision for customer-based segmentation and other customer analytics. Hadoop is a promising technology that allows large data volumes to be organized and processed while keeping the data on the original data storage cluster. For example, weblogs can be turned into browsing behavior (sessions) by running MapReduce programs (Hadoop) on the cluster and generating aggregated results on the same cluster. These aggregated results are then loaded into a relational DBMS.
HBase is the mainstream Apache Hadoop database. It is an open source, non-relational (column-oriented), scalable and distributed database management system that supports structured data storage. Apache HBase, which is modelled after Google Bigtable, is the right approach when you need random and real-time read/write access to your big data. It is designed for hosting very large tables (billions of rows by millions of columns) on top of clusters of commodity hardware. Just as Google Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. HBase also supports client access through Avro, REST, and Thrift interfaces.
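The random, real-time read/write access that HBase offers can be sketched with the community happybase client, which talks to HBase through its Thrift gateway; the host, table and column-family names below are hypothetical, and the table is assumed to already exist:

import happybase

# Hypothetical HBase Thrift gateway
connection = happybase.Connection('hbase-thrift.example.com', port=9090)
table = connection.table('user_events')

# Random, real-time write: a row key plus sparse columns in a column family 'e'
table.put(b'user42#20160101', {b'e:page': b'/home', b'e:status': b'200'})

# Random, real-time read of a single row by key
print(table.row(b'user42#20160101'))

# Range scan over all rows for one user, exploiting the sorted row-key design
for key, data in table.scan(row_prefix=b'user42#'):
    print(key, data)

connection.close()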
NoSQL Databases
Next-generation databases are mandated to be non-relational, distributed, open-source and horizontally scalable. The original inspiration came from modern web-scale databases. Additional characteristics such as being schema-free, easy replication support, a simple API, and eventual consistency/BASE (rather than ACID) are also being demanded. Traditional Relational Database Management Systems (RDBMSs) use Structured Query Language (SQL) for accessing and manipulating data that reside in structured columns of relational tables. However, unstructured data is typically stored as key-value pairs in a data store and therefore cannot be accessed using SQL. Such data stores are called NoSQL data stores and are accessed via get and put commands. There are some big advantages of NoSQL databases over relational databases, as illustrated at http://www.couchbase.com/why-nosql/nosql-database.
• Flexible Data Model: Relational and NoSQL data models are very different. The relational model takes data and separates it into many interrelated tables that contain rows and columns. Tables reference each other through foreign keys, which are themselves stored in columns. When looking up data, the desired information needs to be collected from many tables and combined before it can be provided to the application. Similarly, when writing data, the write needs to be coordinated and performed across many tables. NoSQL databases follow a very different model. For example, a document-oriented NoSQL database takes the data you want to store and aggregates it into documents using the JSON format. Each JSON document can be thought of as an object to be used by your application. A JSON document might, for example, take all the data stored in a row that spans 20 tables of a relational database and aggregate it into a single document/object. The resulting data model is flexible, and the resulting documents are easy to distribute. Another major difference is that relational technologies have rigid schemas while NoSQL models are schema-less. Changing the schema once data is inserted is a big deal, extremely disruptive and frequently avoided. However, the exact opposite behavior is desired in the big data era: application developers need to constantly and rapidly incorporate new types of data to enrich their applications. (A minimal document-store sketch appears after this list.)
• High Performance and Scalability: To deal with the increase in concurrent users (Big Users) and the amount of data (Big Data), applications and their underlying databases need to scale using one of two approaches: scale up or scale out. Scaling up implies a centralized approach that relies on bigger and bigger servers. Scaling out implies a distributed approach that leverages many commodity physical or virtual servers. Prior to NoSQL databases, the default scaling approach at the database tier was to scale up. This was dictated by the fundamentally centralized, shared-everything architecture of relational database technology. To support more concurrent users and/or store more data, you need a bigger server with more CPUs, memory, and disk storage to keep all the tables. Big servers tend to be highly complex, proprietary and disproportionately expensive. NoSQL databases were developed from the ground up to be distributed, scale-out databases. They use a cluster of standard physical or virtual servers to store data and support database operations. To scale, additional servers are joined to the cluster, and the data and database operations are spread across the larger cluster. Since commodity servers are expected to fail from time to time, NoSQL databases are built to tolerate and recover from such failures, making them highly resilient. NoSQL databases therefore provide a much easier and more linear approach to database scaling. If 10,000 new users start using your application, simply add another database server to your cluster; to add 10,000 more, add another server. There is no need to modify the application as you scale, since the application always sees a single (distributed) database. NoSQL databases share some common characteristics with respect to scaling and performance.
o Auto-Sharding: A NoSQL database automatically spreads data across servers without requiring applications to participate. Servers can be added or removed from the data layer without application downtime, with data (and I/O) automatically spread across the servers. Most NoSQL databases also support data replication, storing multiple copies of data across the cluster and even across data centers, to ensure high availability (HA) and to support disaster recovery (DR). A properly managed NoSQL database system should never need to be taken offline for any reason.
o Distributed Query Support: Sharding a relational database can reduce, or in certain cases eliminate, the ability to perform complex data queries. NoSQL database systems retain their full query expressive power even when distributed across hundreds of servers.
o Integrated Caching: To reduce latency and increase sustained data throughput, advanced NoSQL database technologies transparently cache data in system memory. This behavior is transparent to the application developer and the operations team, in contrast to relational technology, where a caching tier is usually a separate infrastructure tier that must be deployed on separate servers and explicitly managed by the operations team.
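To ground the flexible, schema-less document model described in the first item above, here is a minimal sketch using MongoDB's pymongo driver; the connection string, database, collection and field names are hypothetical:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')   # hypothetical single-node deployment
customers = client['shop']['customers']

# One self-contained JSON-style document replaces a row scattered across many relational tables.
customers.insert_one({
    'name': 'Asha Rao',
    'email': 'asha@example.com',
    'address': {'city': 'Bengaluru', 'country': 'India'},
    'orders': [{'sku': 'B-1001', 'qty': 2}, {'sku': 'B-2040', 'qty': 1}],
})

# Schema-less: a later document may simply carry new fields, with no ALTER TABLE step.
customers.insert_one({'name': 'Jo Lee', 'email': 'jo@example.com', 'loyalty_tier': 'gold'})

# Queries reach into nested structures directly, without joins.
for doc in customers.find({'address.country': 'India'}):
    print(doc['name'])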
There are some serious flaws in relational databases that get in the way of meeting the unique requirements of modern-day social web applications, which are gradually moving to reside in cloud infrastructures. Another noteworthy fact is that data analysis for business intelligence (BI) is increasingly happening in the cloud; that is, cloud analytics is emerging as a hot topic for diligent and deeper study and investigation. Some groups in academic and industrial circles are striving hard to bring in the advancements necessary to prop up traditional databases so that they cope with the evolving and demanding requirements of social networking applications. However, NoSQL and NewSQL databases are the new breeds of versatile and vivacious solutions capturing the imagination and attention of many.
The business need to leverage complex and connected data is driving the adoption of scalable and high-performance NoSQL databases. This new entrant is set to sharply enhance the data management capabilities of various businesses. Several variants of NoSQL databases have emerged over the past decade in order to handle handsomely the terabytes, petabytes and even exabytes of data generated by enterprises and consumers. They are specifically capable of processing multiple data types. That is, NoSQL databases can contain different data types such as text, audio, video, social network feeds, weblogs and many more that are not handled well by traditional databases. These data are highly complex and deeply interrelated. Therefore the demand is to unravel the truth hidden behind these huge yet diverse data assets, and to understand the insights and act on them so that businesses can plan and surge ahead.
Having understood the changing scenario, web-based businesses have been crafting their own custom NoSQL databases to elegantly manage the increasing data volume and diversity. Amazon's Dynamo and Google's Bigtable are the shining examples of homegrown databases that can store lots of data. These NoSQL databases were designed for handling highly complex and heterogeneous data. The key differentiation here is that they are not built for high-end transactions but for analytic purposes.
Why NoSQL Databases?
B2C e-commerce and B2B e-business applications are highly transactional, and the leading enterprise application frameworks and platforms such as Java Enterprise Edition (JEE) directly and distinctly support a number of transaction types (simple, distributed, nested, etc.). For a trivial example, a flight reservation application has to be rigidly transactional; otherwise everything is bound to collapse. As enterprise systems become increasingly distributed, the transaction feature is being pronounced a mandatory one.
In the recent past, social applications have grown fast; youth in particular are fascinated by a stream of social computing sites, which has resulted in an astronomical growth of those sites. It is no secret that the popularity, ubiquity and utility of Facebook, LinkedIn, Twitter, Google+ and other blogging sites are surging incessantly. There is a steady convergence between enterprise and social applications, with the idea of adequately empowering enterprise applications with additional power and value. For example, online sellers understand and utilize customers' choices, leanings, historical transactions, feedback, feelings, etc. in order to do more business. That is, businesses are more interactive, open and inclined towards customers' participation in order to garner and glean their views, to reach out to more people across the globe, and to roll out Richer Enterprise Applications (REAs). There are specialized protocols and web 2.0 technologies (Atom, RSS, AJAX, mash-ups, etc.) to programmatically tag information about people, places and proclivities, and to dynamically conceive, conceptualize and concretize more and more people-centric and premium services.
The point to be conveyed here is that the dormant and dumb database technology has to evolve faster in order to accomplish these new-generation IT abilities. With modern data being more complicated and connected, NoSQL databases need the implicit and innate strength to handle multi-structured and massive data. A NoSQL database should enable high-performance queries on the data. Users should be able to ask questions such as “Who are all my contacts in Europe?” and “Which of my contacts ordered from this catalog?” A white paper titled “NoSQL for the Enterprise” by Neo Technology lists the unique qualities of NoSQL databases for enterprises. The essential points from that paper are reproduced below.
• A Simplified Data Representation: A NoSQL database should be able to easily represent the complex and connected data that makes up today's enterprise applications. Unlike with traditional databases, a flexible schema that allows for multiple data types enables developers to easily change applications without disrupting live systems. Databases must be extensible and adaptable. With the massive adoption of clouds, NoSQL databases also ought to be well suited for clouds.
• End-to-End Transactions: Traditional databases are famous for “all or nothing” transactions, whereas NoSQL databases give a kind of leeway on this crucial property. This is due to the fact that the prime reason for the emergence and evolution of NoSQL databases was to process massive volumes of data in double-quick time and come out with actionable insights. In other words, traditional databases suit enterprise applications whereas NoSQL databases suit social applications. Specifically, the consistency aspect of ACID transactions is not rigidly insisted upon in NoSQL databases. Here and there an operation could fail in a social application, and it does not matter much. For instance, billions of short messages are tweeted every day, and Twitter will probably survive if a single Tweet is lost. But online banking applications relying on traditional databases have to ensure very tight consistency in order to be meaningful. That does not mean that NoSQL databases are off the ACID hook. Instead, they are supposed to support ACID transactions, including an XA-compliant distributed two-phase commit protocol. The connections between data should be stored on disk in a structure designed for high-performance retrieval of connected data sets, all while enforcing strict transaction management. This design delivers significantly better performance for connecting data than that offered by relational databases.
• Enterprise-Grade Durability: Every NoSQL database for the enterprise needs to have the enterprise-class quality of durability. That is, any transaction committed to the database must not be lost under any circumstances. If a flight ticket is reserved and the system crashes thereafter due to an internal or external problem, the allotted seat still has to be there when the system comes back. Predominantly, the durability feature is ensured through the use of database backups and transaction logs that facilitate the restoration of committed transactions in spite of any software or hardware hitches. Relational databases have successfully employed the replication method for years to guarantee enterprise-strength durability.
The Classification of NoSQL Databases
There are four major categories of NoSQL databases available today: Key-Value stores, Column Family databases, Document databases and
Graph databases. Each was designed to accommodate the huge volumes of data as well as to have room for future data types. The choice of
NoSQL database depends on the type of data you need to store, its size and complexity.
• Key-Value Stores: The key-value data model is quite simple. It stores data in key and value pairs, where every key maps to a value. It can scale across many machines but supports only this very simple data model. Key-value data stores use a data model similar to the popular memcached distributed in-memory cache, with a single key-value index for all the data. Unlike memcached, these systems generally provide a persistence mechanism and additional functionality as well: replication, versioning, locking, transactions, sorting, and/or other features. The client interface provides inserts, deletes and index lookups. Like memcached, none of these systems offer secondary indices or keys. A key-value store is ideal for applications that require massive amounts of simple data, such as sensor data, or for rapidly changing data such as stock quotes. Key-value stores support massive data sets of very primitive data. Amazon's Dynamo was built as a key-value store.
• Column Family Databases: A column family database can handle semi-structured data because, in theory, every row can have its own schema. It has few mandatory attributes and few optional attributes. It is a powerful way to capture semi-structured data, but it often sacrifices consistency to ensure availability. Column family databases can accommodate huge amounts of data, and the key differentiator is that they help sift through the data very fast. Writes are much faster than reads, so one natural niche is real-time data analysis. Logging real-time events is a perfect use case, and another is random and real-time read/write access to big data. Google's Bigtable is built on the column family model. Apache Cassandra, originally developed at Facebook, is another well-known example, designed to store billions of columns per row. However, it is unable to support unstructured data types or end-to-end query transactions.
• Document Databases: A document database contains a collection of key-value pairs stored in documents. Document databases support more complex data than key-value stores. While a document database is good at storing documents, it was not designed with enterprise-strength transactions and durability in mind. Document databases are the most flexible of the key-value-style stores, perfect for storing a large collection of unrelated and discrete documents. Unlike key-value stores, these systems generally support secondary indexes, multiple types of documents (objects) per database, and nested documents or lists. A good application would be a product catalog, which can display individual items but not related items: you can see what is available for purchase, but you cannot connect it to what other products similar customers bought after they viewed it. MongoDB and CouchDB are examples of document databases.
• Graph Databases: A graph database uses nodes, relationships between nodes and key-value properties, instead of tables, to represent information. This model is typically substantially faster for associative data sets and uses a schema-less, bottom-up model that is ideal for capturing ad-hoc and rapidly changing data. Much of today's complex and connected data can be easily stored in a graph database where there is great value in the relationships among data sets. A graph database accesses data using traversals. A traversal is how you query a graph, navigating from starting nodes to related nodes according to an algorithm, finding answers to questions like “What music do my friends like that I don't yet own?” or “If this power supply goes down, what web services are affected?” (a traversal of the first kind is sketched below). Using traversals, you can easily conduct end-to-end transactions that represent real user actions.
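The “music my friends like that I don't yet own” traversal quoted above can be expressed in Cypher and run through the Neo4j Python driver; the connection details, node labels and relationship types here are hypothetical:

from neo4j import GraphDatabase

# Hypothetical Neo4j instance and credentials
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'secret'))

# Traverse from one person to their friends, then to the albums those friends like,
# keeping only albums the starting person does not already own.
CYPHER = """
MATCH (me:Person {name: $name})-[:FRIEND]->(friend:Person)-[:LIKES]->(album:Album)
WHERE NOT (me)-[:OWNS]->(album)
RETURN DISTINCT album.title AS title
"""

with driver.session() as session:
    for record in session.run(CYPHER, name='Alice'):
        print(record['title'])

driver.close()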
Cloud Databases
RDBMSs are an integral and indispensable component in enterprise IT and their importance is all set to grow and not diminish. However with
the advent of cloud-hosted and managed computing and storage infrastructures, the opportunity to offer a DBMS as an offloaded and outsourced
service is gaining momentum. Carlo Curino and his team members have introduced a new transactional “database as a service” (DBaaS). A
DBaaS promises to move much of the operational burden of provisioning, configuration, scaling, performance tuning, backup, privacy, and
access control from the database users to the service operator, effectively offering lower overall costs to users. However, the DBaaS offerings provided by leading cloud service providers (CSPs) do not address three important challenges: efficient multi-tenancy, elastic scalability and database privacy. The authors argue that before database software and management are outsourced into cloud environments, these three challenges need to be surmounted. The key technical features of this DBaaS are: (1) a workload-aware approach to multi-tenancy that identifies the workloads that can be co-located on a database server, achieving higher consolidation and better performance than existing approaches; (2) the use of a graph-based data partitioning algorithm to achieve near-linear elastic scale-out even for complex transactional workloads; and (3) an adjustable security scheme that enables SQL queries to run over encrypted data, including ordering operations, aggregates, and joins. An underlying theme in the design of the components of this DBaaS is the notion of workload awareness: by monitoring query patterns and data accesses, the system obtains information useful for various optimization and security functions, reducing the configuration effort for users and operators. By centralizing and automating many database management tasks, a DBaaS can substantially reduce operational costs while still performing well. There are myriad advantages of using cloud databases, some of which are as follows:
• Either built into a larger package with nothing to configure, or delivered with a straightforward GUI-based configuration.
• Automated on-the-go scaling, with the ability to simply define scaling rules or adjust manually.
• Anytime, anywhere, any-device, any-media, any-network discoverability, accessibility and usability.
The main caveat to weigh against these advantages is the risk of vendor lock-in.
Thus, newer realities such as NoSQL and NewSQL database solutions are fast arriving and being adopted eagerly. At the same time, traditional database management systems are being modernized and migrated to cloud environments to substantiate the era of providing everything as a service. Data as a service, insights as a service, etc. are bound to grow considerably in the days to come as their realization technologies mature fast.
BIG DATA ANALYTICS USE CASES
Enterprises can understand and gain the value of big data analytics from the number of value-adding use cases and from how some hitherto hard-to-solve problems can be tackled easily with the help of big data analytics technologies and tools. Every enterprise is mandated to grow with the help of analytics. As elucidated before, with big data, big analytics is the norm for businesses to take informed decisions. Several domains are eagerly enhancing their IT capability to have embedded analytics, and there are several reports eulogizing the elegance of big data analytics. The following are some of the prominent use cases.
Customer Satisfaction Analysis
This is a prime problem for most product organizations across the globe. There is no foolproof mechanism in place to understand customers' feelings about, and feedback on, their products. Gauging the feelings of people correctly and quickly goes a long way towards helping enterprises bring in proper rectifications and recommendations in product design, development, servicing and support, and this has been a vital task for any product manufacturer that wants to remain relevant to its customers and product consumers. Customers' reviews of product quality therefore need to be carefully collected through various internal as well as external sources such as channel partners, distributors, sales and service professionals, retailers, and, in the recent past, through social sites, micro-blogs, surveys, etc. However, the issue is that the data being gleaned are extremely unstructured, repetitive, unfiltered, and unprocessed. Extracting actionable insights becomes a difficult affair here, and hence leveraging big data analytics for a single view of the customer (SVoC) will help enterprises gain sufficient insight into the much-needed customer mindset, solve customers' problems effectively, and avoid those problems in their new product lines.
Market Sentiment Analysis
In today's competitive and knowledge-driven market economy, business executives and decision-makers need to gauge the market environment deeply to be successful in their dreams, decisions and deeds. Which products are shining in the market, where the market is heading, who the real competitors are, what their top-selling products are, how they are doing in the market, what the bright spots and prospects are, and what customers' preferences are in the short as well as the long term: all of this must be established through deeper analysis conducted legally and ethically. This information is available in a variety of web sites, social media sites and other public domains. Big data analytics on this data can provide an organization with the much-needed information about strengths, weaknesses, opportunities and threats (SWOT) for its product lines.
Epidemic Analysis
Epidemics and seasonal diseases like flu start and spread with certain noticeable patterns among people, so it is pertinent to extract the hidden information in order to arrest the outbreak of the infection in time. It is all about capturing all types of data originating from different sources, subjecting them to a series of investigations to extract actionable insights quickly, and contemplating the appropriate countermeasures. There is a news item describing how analyzing people's data can actually help medical professionals save lives. Data can be gathered from many different sources, but few are as rich as Twitter; tools such as TwitterHose facilitate this data collection, allowing anyone to download a random 1% of tweets made during a specified hour, giving researchers a nice cross-section of the Twitterverse. Researchers at Johns Hopkins University have been taking advantage of this tool, downloading tweets at random and sifting through the data to flag any and all mentions of flu or cold-like symptoms. Because the tweets are geo-tagged, the researchers can then figure out where the sickness reports are coming from, cross-referencing this with flu data from the Centers for Disease Control and Prevention to build up a picture of how the virus spreads and, more importantly, to predict where it might spread next.
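A drastically simplified sketch of that kind of pipeline is shown below; it assumes the sampled tweets have already been downloaded (for example via TwitterHose) into dictionaries carrying the tweet text and a geo-derived region, which is an assumption of this illustration rather than part of the original study:

from collections import Counter

FLU_TERMS = ('flu', 'fever', 'cough', 'sore throat', 'chills')   # illustrative keyword list

def looks_like_flu(text):
    """Very naive symptom detector over a tweet's text."""
    lowered = text.lower()
    return any(term in lowered for term in FLU_TERMS)

def flu_reports_by_region(tweets):
    """Count symptom mentions per region from geo-tagged tweets.

    Each tweet is assumed to be a dict such as
    {'text': 'Down with the flu today', 'region': 'Maryland'}.
    """
    counts = Counter()
    for tweet in tweets:
        if looks_like_flu(tweet.get('text', '')):
            counts[tweet.get('region', 'unknown')] += 1
    return counts

sample = [
    {'text': 'Terrible cough and fever, staying home', 'region': 'Maryland'},
    {'text': 'Great game last night!', 'region': 'Ohio'},
]
print(flu_reports_by_region(sample))   # Counter({'Maryland': 1})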
In a similar vein, leveraging the innumerable advancements being accomplished and articulated in the multifaceted discipline of big data analytics, myriad industry segments are jumping onto the big data bandwagon in order to acquire superior competencies and capabilities, especially in the anticipation, ideation, implementation and improvisation of premium and path-breaking services and solutions for the world market. Big data analytics brings forth fresh ways for businesses and governments to analyze vast amounts of unstructured data (streaming as well as stored) so as to stay highly relevant to their customers and constituencies.
Using Big Data Analytics in Healthcare
The healthcare industry has been a late adopter of technology when compared to other industries such as banking, retail and insurance. As per the McKinsey report on big data from June 2011, if US health care could use big data creatively and effectively to drive efficiency and quality, the potential value from data in this sector could be more than $300 billion every year, two-thirds of which would come from reducing national health care expenditures by about 8 percent.
• Reduce Hospital Readmissions: One major cost in healthcare is hospital readmission resulting from insufficient follow-up and proactive engagement with patients. The necessary follow-up appointments and tests are often documented only as free text in patients' hospital discharge summaries and notes. This unstructured data can be mined using text analytics. If timely alerts were sent, appointments scheduled or education materials dispatched, proactive engagement could potentially reduce readmission rates by over 30 percent.
• Patient Monitoring: In-patient, out-patient, emergency visits, ICU: everything is becoming digitized. With rapid progress in technology, sensors are embedded in weighing scales, blood glucose devices, wheelchairs, patient beds, X-ray machines, etc. Digitized devices generate large streams of data in real time that can provide insights into patients' health and behavior. If this data is captured, it can be put to use to improve the accuracy of information and enable practitioners to better utilize limited provider resources. It will also significantly enhance the patient experience at a health care facility by providing proactive risk monitoring, improved quality of care and personalized attention. Big data can enable complex event processing (CEP) by providing real-time insights to doctors and nurses in the control room.
• Preventive Care for ACOs: One of the key goals of accountable care organizations (ACOs) is to provide preventive care. Disease identification and risk stratification will be crucial to this business function. Managing real-time feeds coming in from health information exchanges (HIEs), pharmacists, providers and payers will deliver the key information needed to apply risk stratification and predictive modeling techniques. In the past, companies were limited to historical claims and HRA/survey data, but with HIEs the whole dynamic of data availability for health analytics has changed. Big data tools can significantly enhance the speed of processing and data mining.
• Epidemiology: Through HIEs, most providers, payers and pharmacists will be connected through networks in the near future. These networks will facilitate the sharing of data to better enable hospitals and health agencies to track disease outbreaks, patterns and trends in health issues across a geographic region or across the world, allowing determination of the source and of containment plans.
• Patient Care Quality and Program Analysis: With the exponential growth of data and the need to gain insight from information comes the challenge of processing the voluminous variety of information to produce metrics and key performance indicators (KPIs) that can improve patient care quality and Medicaid programs. Big data provides the architecture, tools and techniques that allow processing terabytes and petabytes of data to deliver deep analytic capabilities to stakeholders.
TRADITIONAL DW ANALYTICS VS. BIG DATA ANALYTICS
Data diversity is one of the most formidable challenges in the BI arena today. This is because most BI platforms and products are designed for
operating on relational data and other forms of structured data. Many organizations struggle to wring BI value from the wide range of
unstructured and less-structured data types including text, clickstreams, log files, social media, documents, location data, sensor data, etc.
Hadoop and its allied and associated technologies are renowned for making sense out of diverse big data. For example, developers can push files
containing a wide range of unstructured data into HDFS without needing to define data types or structures at load time. Instead, data is
structured at query or analysis time. This is a good match for analytic methods that are open-ended for discovery purposes, since imposing
structure can alter or hide detailed data that discovery depends on. For BI/DW tools and platforms that demand structured data, Hadoop Hive
and MapReduce can output records and tables as needed. This way, HDFS can be an effective source of unstructured data, yet with structured
output for BI/DW purposes.
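The “structure at query time” idea can be sketched with Hive: raw files already sitting in HDFS are covered by an external table definition and only then queried as rows and columns. The snippet below uses the community PyHive client; the HiveServer2 host, table name and HDFS location are hypothetical:

from pyhive import hive   # community client for HiveServer2

conn = hive.connect(host='hiveserver.example.com', port=10000, username='analyst')
cursor = conn.cursor()

# Schema on read: the table definition merely overlays raw, tab-delimited files
# that were loaded into HDFS without any declared structure.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_clicks (
        ts STRING,
        user_id STRING,
        url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/clickstream'
""")

# The query is compiled down to MapReduce (or a newer engine) over the raw files,
# emitting structured rows that a BI/DW tool can consume.
cursor.execute("SELECT url, COUNT(*) AS hits FROM raw_clicks GROUP BY url ORDER BY hits DESC")
for url, hits in cursor.fetchall():
    print(url, hits)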
Hadoop products show great promise as viable and valuable platforms for advanced analytics, complementing the report-oriented data warehouse with new analytical capabilities, especially over unstructured data. Outside of BI and DW, Hadoop products also show promise for online archiving, content management, and staging multi-structured data for a variety of applications. This puts pressure on vendors to offer better integration with Hadoop and to provide tools that reduce manual coding. There are numerous scenarios for big data computing where Hadoop can contribute immensely to mainstream analytics.
In the trend towards advanced analytics, users are looking for platforms that enable analytics as an open-ended discovery or exploratory mission. Discovering new facts and relationships typically results from tapping big data that was previously inaccessible to BI. The discovery also comes from mixing data of various types from various sources. HDFS and MapReduce enable the exploration of this eclectic mix of big data. As enunciated earlier, there are a few critical differences between traditional data warehouse analytics and current big data analytics. The data sources, sizes, scopes, successes, and structures are hugely different between the old and the new. With the continued maturity of the big data analytics discipline, there can be more realistic and rewarding results in the form of real-time and real-world analytics. Further on, there will be more decisive and drastic improvements in predictive and prescriptive analytics. The prominent differences are given in Table 1.
Table 1. Traditional data warehouse analytics vs. big data analytics
MACHINE DATA ANALYTICS BY SPLUNK
All your IT applications, platforms and infrastructures generate data every millisecond of every day. Machine data is one of the fastest growing and most complex areas of big data. It is also one of the most valuable, containing a definitive record of users' transactions, customer behavior, sensor activity, machine behavior, security threats, fraudulent activity and more. Machine data hold critical insights useful across the enterprise; for example, threat-scenario behavior patterns can be mapped and visualized to improve an organization's security posture.
Making use of machine data is challenging. It is difficult to process and analyze with traditional data management methods, or in a timely manner. Machine data is generated by a multitude of disparate sources, and hence correlating meaningful events across them is complex. The data is unstructured and difficult to fit into a predefined schema. Machine data is high-volume and time-series based, requiring new approaches for management and analysis. The most valuable insights from this data are often needed in real time. Traditional business intelligence, data warehouse or IT analytics solutions are simply not engineered for this class of high-volume, dynamic and unstructured data.
As indicated at the beginning, machine-generated data is more voluminous than human-generated data. Thus, without an iota of doubt, machine data analytics occupies a significant portion of big data analytics. Machine data is being produced 24x7x365 by nearly every kind of software application and electronic device. The applications, servers, network devices, storage and security appliances, sensors, browsers, compute machines, cameras and various other systems deployed to support business operations are continuously generating information relating to their status and activities. Machine data can be found in a variety of formats such as application log files, call detail records, user profiles, key performance indicators (KPIs), clickstream data associated with user web interactions, data files, system configuration files, alerts and tickets. Machine data is generated by both machine-to-machine (M2M) and human-to-machine (H2M) interactions. Outside of the traditional IT infrastructure, every processor-based system, including HVAC controllers, smart meters, GPS devices, actuators and robots, manufacturing systems, and RFID tags, and consumer-oriented systems such as medical instruments, personal gadgets and gizmos, aircraft, scientific experiments, and automobiles that contain embedded devices, is continuously generating machine data. The list is constantly growing. Machine data can be structured or unstructured. The growth of machine data has accelerated in recent times with the trends of IT consumerization and industrialization. That is, IT infrastructure complexity has gone up remarkably, driven by the adoption of portable devices, virtual machines, bring your own device (BYOD), and cloud-based services.
The goal here is to aggregate, parse, and visualize these data to spot trends and act accordingly. By monitoring and analyzing the data emitted by a deluge of diverse, distributed and decentralized devices, there are opportunities galore. Someone wrote that sensors are the eyes and ears of future applications. Environmental monitoring sensors in remote and rough places bring forth the right and relevant knowledge about their operating environments in real time. Sensor data fusion leads to context- and situation-aware applications. With machine data analytics in place, any kind of performance degradation of machines can be identified in real time and corrective actions can be initiated with full knowledge and confidence. Security and surveillance cameras pump in still images and video data that in turn help analysts and security experts to preemptively stop any kind of undesirable intrusion. Firefighting can become smarter with the utilization of machine data analytics.
The much-needed end-to-end visibility, analytics and real-time intelligence across all of their applications, platforms and IT infrastructures enable business enterprises to achieve required service levels, manage costs, mitigate security risks, demonstrate and maintain compliance, and gain new insights to drive better business decisions and actions. Machine data provides a definitive, time-stamped record of current and historical activity and events within and outside an organization, including application and system performance, user activity, system configuration changes, electronic transaction records, security alerts, error messages and device locations. Machine data in a typical enterprise is generated in a multitude of formats and structures, as each software application or hardware device records and creates machine data associated with its specific use. Machine data also varies among vendors and even within the same vendor across product types, families and models.
There are a number of newer use cases being formulated around the pioneering improvements in smart sensors; their ad-hoc, purpose-specific network formation capability; data collection, consolidation, correlation, corroboration and dissemination; knowledge discovery; information visualization; etc. Splunk is a low-profile big data company specializing in extracting actionable insights out of diverse, distributed and decentralized data. Some real-world customer examples include:
• E-Commerce: A typical e-commerce site serving thousands of users a day will generate gigabytes of machine data, which can be used to provide significant insights into IT infrastructure and business operations. Expedia uses Splunk to avoid website outages by monitoring server and application health and performance. Today, roughly 3,000 users at Expedia use Splunk to gain real-time visibility into tens of terabytes of unstructured, time-sensitive machine data (not only from their IT infrastructure, but also from online bookings, deal analysis and coupon use).
• Software as a Service (SaaS): Salesforce.com uses Splunk to mine the large quantities of data generated by its entire technology stack. It has more than 500 users of Splunk dashboards, from IT staff monitoring customer experience to product managers performing analytics on services like ‘Chatter.' With Splunk, SFDC claims to have taken application troubleshooting for 100,000 customers to the next level.
• Digital Publishing: NPR uses Splunk to gain insight into its digital asset infrastructure, to monitor and troubleshoot its end-to-end asset delivery infrastructure, to measure program popularity and views by device, to reconcile royalty payments for digital rights, to measure abandonment rates, and more.
Figure 3 vividly illustrates how Splunk captures data from numerous sources and does the processing, filtering, mining and analysis to generate
actionable insights out of multi-structured machine data.
Figure 3. Splunk reference architecture for machine data analytics
Splunk Enterprise is the leading platform for collecting, analyzing and visualizing machine data. It provides a unified way to organize and extract
real-time insights from massive amounts of machine data from virtually any source. This includes data from websites, business applications,
social media platforms, application servers, hypervisors, sensors, and traditional databases. Once your data is in Splunk, you can search, monitor,
report, and analyze it, no matter how unstructured, large or diverse it may be. Splunk software gives a real-time understanding of what is
happening and a deep analysis of what has happened, driving new levels of visibility and insight. This is called operational intelligence.
Most organizations maintain a diverse set of data stores: machine data, relational data and other unstructured data. Splunk DB Connect delivers real-time connectivity between Splunk and one or many relational databases, while Splunk Hadoop Connect delivers bi-directional connectivity to Hadoop. Both Splunk apps enable you to drive more meaningful insights from all of your data. The Splunk App for HadoopOps provides real-time monitoring and analysis of the health and performance of the end-to-end Hadoop environment, encompassing all layers of the supporting infrastructure.
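As a small illustration of how such searches might be driven programmatically, the sketch below calls Splunk's REST search API with the generic requests library; the host, credentials, index and search string are hypothetical, and the /services/search/jobs/export endpoint streams results back as newline-delimited JSON:

import json
import requests

SPLUNK = 'https://splunk.example.com:8089'      # hypothetical management port
AUTH = ('admin', 'changeme')                    # hypothetical credentials

response = requests.post(
    f'{SPLUNK}/services/search/jobs/export',
    auth=AUTH,
    verify=False,          # self-signed certificates are common on the management port
    stream=True,
    data={
        'search': 'search index=web status>=500 | stats count by host',
        'earliest_time': '-1h',
        'output_mode': 'json',
    },
)

# Each non-empty line is a JSON object; aggregated rows arrive under the "result" key.
for line in response.iter_lines():
    if line:
        payload = json.loads(line)
        if 'result' in payload:
            print(payload['result'])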
IBM Accelerator for Machine Data Analytics
Machines produce huge amounts of data that contain valuable and actionable information. However, accessing and working with that information requires large-scale import, extraction, transformation, and statistical analysis. IBM® Accelerator for Machine Data Analytics, a set of end-to-end applications, helps you import, extract, index, transform, and analyze your data to:
• Search within and across multiple log entries based on a text search, faceted search, or timeline-based search to find events;
• Enrich the context of log data by adding and extracting log types into the existing repository;
• Uncover patterns.
BIG DATA MIDDLEWARE SOLUTIONS
Industry research firm IDC defines the digital universe as a measure of all the digital data created, replicated and consumed in a single year. Everything is based on data, and enterprises will be driven by data. Therefore, organizations have to ponder seriously the implementable ways and means of collecting, storing, and analyzing tremendous amounts of data. They need to harness the increased volume, variety and speed of data to succeed in the competitive marketplace. For the foreseeable future, organizations will continue to rely on infrastructure specifically designed for big data applications to run reliably and scale seamlessly, keeping up with the pace at which data is being generated or transferred. Infrastructures, middleware platforms and tools are very important for enabling big data processing and knowledge extraction. In order to make the data accessible, understandable and interoperable, novel middleware architectures, algorithms and application development frameworks need to be in place.
Middleware plays a compelling role in big data analytics. Big data movement infrastructure is one such well-recognized solution for the ensuing big data era. A number of technological products are emerging for moving big data quickly and efficiently. Most enterprises need to make substantial investments in upgrading their data movement infrastructures and also focus on application design practices to meet ensuing big data business requirements. The volume, velocity and variety of data used in commercial, scientific and governmental applications are increasing rapidly as the cost of generating, capturing, storing, moving and using data plummets. Big data, machine-to-machine (M2M) communications, real-time enterprise initiatives, an enlarging device landscape, and enterprise mobility sharply increase the amount of data to be distributed and processed. High-bandwidth data communication networks provide the underlying pipes for moving large amounts of data quickly, but additional features are required to make this bandwidth usable by applications. Data movement infrastructure consists of software or hardware (appliances) that provides important quality-of-service (QoS) features such as assured delivery, content-based routing, security, caching and retrieval, transformation, and a wide choice of communication semantics. These features are absent from the unadorned standard communication protocols such as HTTP/TCP/IP and from application protocols such as Remote Procedure Call (RPC), Remote Method Invocation (RMI), the Distributed Component Object Model (DCOM) or Object Request Brokers (ORBs). Conventional Message-Oriented Middleware (MOM) provides some of these features, but none of it can scale up quickly enough to handle big data application requirements.
The paradigm of Event-Driven Architecture (EDA) is picking up fast, especially in the financial industry. Complex Event Processing (CEP) engines facilitate faster capture and analysis of event messages, knowledge discovery, and subsequent actuations and accelerations in time, based on the knowledge extracted in the previous step. EDA is a critical cog in the roaring success of the Business Activity Monitoring (BAM) and Enterprise Performance Management (EPM) needs of any customer-facing organization. There are scenarios wherein billions of messages are produced and streamed from multiple sources into enterprise systems (control, operational and transactional) per minute, and the value of CEP therefore goes up significantly in bringing forth compelling and cognitive insights. Besides traditional databases for data storage and query-based data retrieval, there are analytics-facilitating data marts, cubes, and warehouses wherein all kinds of data are being directed and persisted for safe custody. With the flourishing of data sources and sizes, schema-less NoSQL databases and file systems such as HDFS have also emerged. In short, the era of the data-driven enterprise has arrived, with the emerging IT and business landscapes being bombarded with data and content messages in addition to event messages. All of this clearly indicates that, in an increasingly networked world, data movement has to be accomplished efficiently and effectively.
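The essence of a CEP rule, correlating many low-level events into one higher-level insight within a time window, can be conveyed with a small, self-contained Python sketch; the event fields, threshold and window length are illustrative assumptions, not any particular CEP engine's API:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60        # sliding window length
THRESHOLD = 5              # matching events per source that trigger an alert

windows = defaultdict(deque)   # source id -> timestamps of recent matching events

def on_event(event):
    """Feed every incoming event to this function; it evaluates one simple CEP rule:
    'THRESHOLD or more error events from one source within WINDOW_SECONDS'."""
    if event.get('level') != 'ERROR':
        return
    now = event.get('ts', time.time())
    window = windows[event['source']]
    window.append(now)
    # Evict timestamps that have slid out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= THRESHOLD:
        print(f"ALERT: {len(window)} errors from {event['source']} "
              f"in the last {WINDOW_SECONDS}s")

# Example feed: six rapid errors from one device trip the rule on the fifth event.
for _ in range(6):
    on_event({'source': 'pump-17', 'level': 'ERROR', 'ts': time.time()})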
Having understood the need for next-generation data movement infrastructures that offer higher throughput (number of messages or bytes transmitted per second), lower latency (in microseconds for end-to-end delivery time) and more endpoints (producers [senders] and consumers [receivers]), accomplished product vendors have come up with robust and resilient data movement solutions. In the recent past, with the surging popularity of in-memory computing, data movement products have appeared that comply with the distributed data grid paradigm, which is the base for new-generation In-Memory Data Grids (IMDGs). An IMDG provides a reliable, transactionally consistent, distributed in-memory data store. These grids can be used to incrementally extend the performance and scalability of established applications as well as to produce brand-new high-performance and highly scalable applications. The fundamental idea is to use main memory for fast access, to distribute data in order to scale, to work with another master data repository, to maintain a duplicate, and to leverage remote nodes to provide resilience and persistence.
Big data movement infrastructures have to be efficient and scalable so that they can use fewer instances and much less network bandwidth, CPU power and memory, providing lower latency than a large number of conventional MOM servers handling the same workload. Having fewer instances also leads to less hardware and fewer technical support people. The Total Cost of Ownership (TCO) has to be on the lower side, whereas the Return on Investment (RoI) has to be on the higher side. With data becoming pervasive and persuasive, there is a renewed focus on unearthing sophisticated technologies and products for performing data movement activities at an affordable cost. There are a variety of data movement infrastructures from different vendors, leveraging diverse technologies, topologies, protocols and architectures.
Solace Appliances for Data Movement
This is one of the most popular products for big data movement infrastructure. Building the infrastructure to intelligently route big data by horizontally scaling software is quite expensive and inefficient, and hence companies see significant savings by scaling vertically within the footprint of their Solace appliances. The new 6x10GE network acceleration blade allows each Solace 3260 appliance to route 40 Gigabits per second of bidirectional traffic, for a total of 80 Gigabits per second across its six 10 Gigabit Ethernet ports. This is a fourfold increase in throughput compared to the earlier top-of-the-line 2x10GE version of the product. With this blade, a pair of Solace 3260 appliances can move as much as 1.7 petabytes of data per day. The new 6x10GE network acceleration blade also bolsters the 3260 appliance's performance by increasing internal memory and end-user connection counts, and by doubling compression capacity and the speed of secure SSL encryption/decryption.
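A quick sanity check on these vendor figures: 80 gigabits per second is 10 gigabytes per second, and sustained over the 86,400 seconds in a day that is roughly 864 terabytes, or about 0.86 petabyte, per appliance; a pair therefore comes to approximately 1.7 petabytes per day, consistent with the quoted capacity.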
A DETAILED LOOK ON DATA INTEGRATION
Data integration is the leading contributor to the goal of business integration and insights. Hence, it has been an increasingly strategic endeavor for enterprise IT. With cloud, social, web, mobile, and device IT, big data integration is the norm and paves the way for more comprehensive and smarter enterprises. Business integration unarguably brings in a number of business, technical and user benefits. Precisely speaking, business integration ultimately leads to on-demand, sensitive, responsive, cost-effective, competitive, and smarter enterprises. As corporate data volume surges, its value too shoots up, provided it is leveraged proactively and systematically through time-tested technologies and tools. Timely and actionable intelligence derived from data that is generated, buffered, processed, and mined is turning out to be potentially powerful, and
hence companies are going the extra miles to have an effective and e