ENTERPRISE BIG DATA PROFESSIONAL
Big Data Framework®
Note: Enterprise Big Data training and Enterprise Big Data personal certification are accredited by APMG-International. Only accredited training providers or their affiliates are allowed to provide courses and examinations in APMG-International's qualification schemes. Readers wishing to know more about APMG-International or the examination requirements are invited to visit the website: www.apmg-international.com
Big Data Framework®, Enterprise Big Data Professional®, Enterprise Big Data Analyst®,
Enterprise Big Data Scientist®, Enterprise Big Data Engineer® are registered trademarks of the
Enterprise Big Data Framework.
ISBN: 978-90-828958-0-3
• Attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
• ShareAlike – If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
Target audience
The target audience of this document includes:
So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualisation, and so on. 1
You have the data; the question is, do you know how to maximize its use?
Richard Pharro
CEO APMG-International
All that data potentially contains a wealth of information. By analyzing the data that is
generated every day, governments, researchers and companies might discover knowledge that
they could use for their benefit. For governments, this might be to prevent potential attacks.
For researchers, it might be to develop new medicines. And for companies, it might be the best
location to open a new store. The value of the knowledge is different for each type of
organization, but the process to extract these insights out of the data is very similar.
Extracting valuable knowledge out of massive quantities of data is, however, more difficult than
it sounds. Due to the sheer volume of data that is generated every day, databases grow
massively, and it becomes difficult to capture, form, store, manage, share, analyze and visualize
meaningful insights out of the data.2 For that reason, knowledge about how to deduce valuable
information out of large sets of data has become an area of great interest. This domain of
knowledge is collectively described as Big Data.
Although the importance of Big Data has been recognized over the last decade, people still
have differing opinions on its definition.3 In general, the term Big Data is used when datasets
cannot be managed or processed by traditional commercial software and hardware tools within
a tolerable time. However, Big Data is more than just processing capabilities of the underlying
data sets, and it has gradually evolved into an entire domain of study.
In the rest of this guide, we will therefore adhere to the following definition when we are
discussing Big Data:
The objective of this guide is to discuss these techniques, skills and technologies in a
structured approach, so that the reader is equipped with the knowledge to deduce valuable
insights to support future decisions. We will first introduce the general background and state-
of-the-art of Big Data. Next, we will discuss related technologies and some fundamental
definitions. From chapter 2 onwards, the structure of this guide will be built around the Big
Data Framework, a holistic model of six capabilities to increase Big Data proficiency in
enterprises.
There are various ways in which value can be captured through Big Data, and which enterprises can leverage to facilitate growth or become more efficient.4 Each of these drives the digital transformation of organizations and has a long-term effect on the way enterprises will have to be designed, organized and managed.
Enterprises can capture value from Big Data in one of the following five ways:
2) Data driven discovery. As enterprises create and store more and more transactional data in digital forms, more performance data becomes available. Big Data can provide tremendous new insights that might not have been identified previously by finding patterns or trends in data sets. In the insurance industry for example, Big Data can help to determine profitable products and provide improved ways to calculate insurance premiums.
4) The power of automation. The underlying algorithms that analyze Big Data sets can be
used to replace manual decisions and labor-intensive calculations by automated
decisions. Automation can optimize enterprise processes and improve accuracy or
response times. Retailers, for example, can leverage Big Data algorithms to make
purchasing decisions or determine how much stock will provide an optimal rate of return.
5) Innovation and new products. Big Data can unearth patterns that identify the need for new products or improve the design of current products or services. By analyzing purchasing data or search volumes, organizations can identify demand for products that the organization might be unaware of. Universities or colleges, for example, might study
Besides the five ways described above, there are many other potential business gains or ways
to capture value with Big Data. Many examples and business cases in this area already exist
and more are designed almost every day. The main challenge for existing enterprises is then to
translate this business value into tangible benefits. Chapter 2 will therefore further discuss how
to formulate a Big Data strategy.
In its true essence, Big Data is not something that is completely new or only of the last two
decades. Over the course of centuries, people have been trying to use data analysis and
analytics techniques to support their decision-making process. The ancient Egyptians around
300 BC already tried to capture all existing data in the library of Alexandria. Moreover, the
Roman Empire used to carefully analyze statistics of their military to determine the optimal
distribution for their armies.
However, in the last two decades, the volume and speed with which data is generated have changed beyond human comprehension. The total amount of data in the world
was 4.4 zettabytes in 2013. That is set to rise steeply to 44 zettabytes by 2020. To put that in
perspective, 44 zettabytes is equivalent to 44 trillion gigabytes.6 Even with the most advanced
technologies today, it is impossible to analyze all this data. The need to process these
increasingly larger (and unstructured) data sets is how traditional data analysis transformed
into Big Data in the last decade.
To illustrate this development over time, the evolution of Big Data can roughly be sub-divided
into three main phases.7 Each phase has its own characteristics and capabilities. In order to
understand the context of Big Data today, it is important to understand how each phase
contributed to the contemporary meaning of Big Data.
From a data analysis, data analytics, and Big Data point of view, HTTP-based web traffic
introduced a massive increase in semi-structured and unstructured data. Besides the standard
structured data types, organizations now needed to find new approaches and storage solutions
to deal with these new data types in order to analyze them effectively. The arrival and growth
of social media data greatly aggravated the need for tools, technologies and analytics
techniques that were able to extract meaningful information out of this unstructured data.
Mobile devices not only give the possibility to analyze behavioral data (such as clicks and
search queries), but also give the possibility to store and analyze location-based data (GPS-
data). With the advancement of these mobile devices, it is possible to track movement, analyze
physical behavior and even health-related data (number of steps you take per day). This data
provides a whole new range of opportunities, from transportation, to city design and health
care.
DBMS-based, structured content:
• RDBMS & data warehousing
• Extract, Transform, Load (ETL)
• Online Analytical Processing
• Dashboards & scorecards
• Data mining & statistical analysis

Web-based, unstructured content:
• Information retrieval and extraction
• Opinion mining
• Question answering
• Web analytics and web intelligence
• Social media analytics
• Social network analysis
• Spatial-temporal analysis

Mobile and sensor-based content:
• Location-aware analysis
• Person-centered analysis
• Context-relevant analysis
• Mobile visualization
• Human-Computer Interaction
The characteristics of Big Data are commonly referred to as the four Vs:
1) Volume ‒ The volume of data refers to the size of the data sets that need to be
analyzed and processed, which are now frequently larger than terabytes and petabytes.
The sheer volume of the data requires distinct and different processing technologies
than traditional storage and processing capabilities. In other words, this means that the
data sets in Big Data are too large to process with a regular laptop or desktop
processor. An example of a high-volume data set would be all credit card transactions on
a day within Europe.
2) Velocity ‒ Velocity refers to the speed with which data is generated. High velocity data is generated at such a pace that it requires distinct (distributed) processing techniques. An example of data that is generated with high velocity would be Twitter messages or Facebook posts.
3) Variety ‒ Variety makes Big Data really big. Big Data comes from a great variety of sources and generally is one of three types: structured, semi-structured and unstructured (as discussed in the next section). The variety in data types frequently
4) Veracity ‒ Veracity refers to the quality of the data that is being analyzed. High veracity data has many records that are valuable to analyze and that contribute in a meaningful way to the overall results. Low veracity data, on the other hand, contains a high percentage of meaningless data. The non-valuable data in these data sets is referred to as noise. An example of a high veracity data set would be data from a medical experiment or trial.
Data that is high volume, high velocity and high variety must be processed with advanced tools
(analytics and algorithms) to reveal meaningful information.9 Because of these characteristics
of the data, the knowledge domain that deals with the storage, processing, and analysis of
these data sets has been labeled Big Data.
1) Data analysis
2) Analytics
3) Business Intelligence
4) Big Data
Although all four definitions are closely related, there are some subtle differences between the
terms that have an impact for the design of Big Data solutions.
Data analysis
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of
discovering useful information, suggesting conclusions, and supporting decision-making. Data analysis
has multiple facets and approaches, encompassing diverse techniques under a variety of
names, in different business, science, and social science domains.
Analytics
Analytics is the discovery, interpretation, and communication of meaningful patterns in data.
Especially valuable in areas rich with recorded information, analytics relies on the simultaneous
application of statistics, computer programming and operations research to quantify
performance.
1) Descriptive analytics: Descriptive analytics or data mining are at the bottom of the Big Data value chain, but they can be valuable for uncovering patterns that offer insight. A simple example of descriptive analytics would be reviewing the number of people that visited the company's website over the past few months. Descriptive analytics can be
2) Diagnostic analytics: Diagnostic analytics are used for discovery or to determine why something happened. In a social media marketing campaign for example, diagnostic analytics can be used to determine why certain advertisements resulted in increased conversion rates. Diagnostic analytics provide valuable insights for organizations, because they help them understand which decisions impact the company's performance.
3) Predictive analytics: Predictive analytics use Big Data to identify past patterns to predict the future. From trends or patterns in existing data sets, predictive algorithms calculate the probability that a certain event will occur. For example, some companies are using predictive analytics for sales lead scoring, indicating which incoming sales leads will have the highest chance of converting into an actual customer. Properly tuned predictive analytics can be used to support sales, marketing, or other types of complex forecasts (a minimal lead-scoring sketch follows this list).
4) Prescriptive analytics: Prescriptive analytics is the last and most valuable level of
analytics. While Big Data analytics in general sheds light on a subject, prescriptive
analytics gives you a laser-like focus to answer specific questions. For example, in
the health care industry, you can better manage the patient population by using
prescriptive analytics to measure the number of patients who are clinically obese, then
add filters for factors like diabetes and LDL cholesterol levels to determine where to
focus treatment. The same prescriptive model can be applied to almost any industry
target group or problem.
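To make the predictive level more concrete, the sketch below scores synthetic sales leads with a logistic regression model. It assumes scikit-learn and NumPy are available; the feature names and data are invented for illustration and are not taken from this guide.

```python
# A minimal, illustrative sketch of predictive lead scoring, assuming scikit-learn
# is available and using synthetic data; real models would use historical CRM data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical features per lead: number of website visits and email opens.
X = rng.integers(0, 20, size=(500, 2))
# Synthetic label: leads with more engagement are more likely to convert.
y = (X.sum(axis=1) + rng.normal(0, 5, 500) > 20).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# The predicted probability of conversion acts as the lead score.
scores = model.predict_proba(X_test)[:, 1]
print("Top lead score:", scores.max().round(2))
```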
Business Intelligence
Business Intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis of business information.11 Business Intelligence uses both data analysis and analytics techniques to consolidate and summarize information that is specifically useful in an enterprise context.
The key challenge with Business Intelligence is to consolidate the different enterprise information systems and data sources into a single integrated data warehouse on which analysis or analytics operations can be performed. A data warehouse is a (large) centralized database in an organization that combines a variety of different databases from different sources. An example of Business Intelligence would be to build a management dashboard that visualizes key enterprise KPIs across different divisions around the world.
Big Data
As discussed in section 1.3, Big Data is characterized by four key characteristics – the four Vs. Big Data makes use of both data analysis and analytics techniques and frequently builds
upon the data in enterprise data warehouses (as used in BI). As such, it can be considered the
next step in the evolution of Business Intelligence.
• The data that is analyzed in Big Data environments is larger than what most traditional BI
solutions can cope with, and therefore requires distinct and distributed storage and
processing solutions.
• Big Data is characterized by the variety of its data sources and includes unstructured or semi-structured data. Big Data solutions need, for example, to be able to process images or audio files.
The difference between Big Data and Business Intelligence is depicted in figure 6:
Figure 6: Big Data includes unstructured data and requires distributed storage / processing
Figure 6 provides a very high-level overview of Big Data solutions. The technical structure and
architecture of Big Data environments will be further discussed in detail in chapter 4 (Big Data
Architecture).
For the analysis of data, it is important to understand that there are three common types of
data structures:
1) Structured data: Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze. Structured data conforms to a tabular format with relationships between the different rows and columns. Common examples of structured data are Excel files or SQL databases. Each of these has structured rows and columns that can be sorted.
2) Unstructured data: Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, compared to data stored in structured databases. Common examples of unstructured data include audio and video files or NoSQL databases.
3) Semi-structured data: Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as a self-describing structure.13 JSON and XML are common forms of semi-structured data, as illustrated in the sketch below.
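As a small illustration of this self-describing property, the sketch below parses two invented JSON records whose fields differ per record; the field names are hypothetical and not part of the guide.

```python
# A small sketch showing why JSON is "self-describing": field names travel with
# the values, and records do not need identical columns. Data below is invented.
import json

records = """
[
  {"customer": "A-1001", "purchases": [12.50, 7.95], "email": "a@example.com"},
  {"customer": "A-1002", "purchases": [3.20], "loyalty_tier": "gold"}
]
"""

for record in json.loads(records):
    # Tags (keys) mark the semantic elements; missing fields are simply absent.
    total = sum(record.get("purchases", []))
    print(record["customer"], "spent", total)
```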
Most traditional data analysis and analytics techniques (including most Business Intelligence
solutions) have the ability to process structured data. Processing unstructured or semi-
structured data is however much more complex and requires distinct solutions for analysis.
Because of the growing interest in Big Data and its increased use in enterprise organizations,
many data products have been developed. Commercial software companies bundle many data
products together, and license these as Big Data solutions to enterprise organizations. Big Data
solutions are a quick way for enterprises to start leveraging the potential of Big Data analysis,
because enterprises do not need to develop all required data products in-house. The downside
of (commercial) Big Data solutions is that they are often expensive, and it is difficult to alter
any of the underlying algorithms of the Big Data solution.
There are many Big Data solutions available on the market and almost every large Enterprise IT
provider (Google, Amazon, Microsoft, SAP, etc.) now offers one or more Big Data solutions.
Additionally, start-ups play a very important role in the development of Big Data solutions
because they come up with new and innovative data products. The Big Data Framework has
been developed from a vendor-independent perspective and therefore does not recommend
any specific Big Data solutions.
Hadoop
It would not be possible to discuss Big Data solutions without mentioning Hadoop. Hadoop is an open-source software framework used for distributed storage and processing of Big Data sets using the MapReduce programming model. It consists of computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.14
Distributed storage, distributed processing and the Hadoop framework will be explained in
more detail in chapter 4 (Big Data Architecture). However, it is important to note that most Big
Data solutions make use of the Hadoop framework as their underlying software framework.
The term Hadoop has therefore also become known as the ecosystem that connects different
Big Data solutions (and commercial vendors) together.
The knowledge domain of Artificial Intelligence has evolved over the years to include Machine Learning algorithms (discussed in the next section) and finally Deep Learning, which is driving today's AI explosion. The evolution of AI, Machine Learning and Deep Learning is depicted in figure 9:
Figure 9: The evolution of AI, Machine Learning and Deep Learning. Source: NVIDIA
In the course of the evolution of Artificial Intelligence, the underlying algorithms have become more complex and more powerful. Besides its technical challenges and complexity, Artificial Intelligence also raises many sociological and ethical questions that make the subject even more complex. The foundational concepts of Artificial Intelligence, Machine Learning and Deep Learning are further explained in chapter 8 of this guide.
A popular example of the application of AI is self-driving cars. The objective of self-driving cars is to mimic the same behavior as human drivers (or preferably even better behavior, without any accidents). The input data comes from different sensors (high variety) and thousands of signals need to be processed every single second (high velocity and high volume) as traffic situations change.
Many large tech organizations envision Artificial Intelligence as a new wave of economic
potential that provides growth potential for large enterprises.16 AI builds on the capabilities and
knowledge of Big Data and it is therefore important to have a solid foundation of this topic.
Machine Learning aims to teach computers to perform certain operations (by running machine learning algorithms), so that the computer is able to make improved decisions in the future and can learn from previous situations. Machine Learning is widely used for the purpose of data mining, which is the practice of sifting through large amounts of data to find unknown or hidden patterns.
At the highest level, Machine Learning can be subdivided into two different classes:
2) Unsupervised Machine Learning: Here, a computer is fed data and needs to infer relationships in the data, without any prior knowledge about the data set. Any set of data can be fed into the computer, after which the machine will try to find certain patterns and relationships within the data. Unsupervised machine learning is therefore ideal for the purpose of data mining. The techniques most commonly associated with unsupervised machine learning are clustering and correlation. An example of unsupervised machine learning would be to feed large amounts of insurance claims into a computer. Based on unsupervised learning algorithms, the computer might find that certain claims do not fit within a regular pattern and therefore might be fraudulent. These outliers would then need to be evaluated and validated by insurance agents (a minimal sketch of this example follows below).
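A minimal sketch of the insurance-claim example, assuming scikit-learn and NumPy are available. IsolationForest is used here as one possible unsupervised outlier detector; the claim amounts are synthetic.

```python
# A minimal sketch of the insurance-claim example above, assuming scikit-learn.
# IsolationForest is one of several unsupervised techniques that flag records
# which do not fit the regular pattern; the claim amounts here are synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Most claims cluster around a few thousand; a handful are extreme.
regular_claims = rng.normal(loc=2500, scale=600, size=(500, 1))
suspicious_claims = np.array([[25000.0], [31000.0], [40000.0]])
claims = np.vstack([regular_claims, suspicious_claims])

model = IsolationForest(contamination=0.01, random_state=0).fit(claims)
labels = model.predict(claims)          # -1 marks potential outliers

flagged = claims[labels == -1]
print(f"{len(flagged)} claims flagged for review by insurance agents")
```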
Classification, regression, clustering and correlation are further explained in chapter 5 (Big
Data Algorithms). Although it is technically not necessary to have big data sets in order to
perform machine learning operations, much value can be generated when the two are paired.
In the rest of this guide, we will therefore consider machine learning in the context of Big Data.
Much of its recently acquired big data, however, comes from telematics sensors in over
46,000 vehicles. The data on UPS package cars (trucks), for example, include their speed,
direction, braking, and drive train performance. The data is not only used to monitor daily
performance, but also to drive a major redesign of UPS drivers route structures. This
initiative, called ORION (OnRoad Integrated Optimization and Navigation), is arguably the
world s largest operations research project.
It also relies heavily on online map data, and will eventually reconfigure a driver's pickups and drop-offs in real time. The initiative has already saved more than 8.4 million gallons of fuel by cutting 85 million miles off of daily routes. UPS estimates that saving only one daily mile driven per driver saves the company $30 million, so the overall dollar savings are substantial.
The Big Data Framework was developed because, although the benefits and business cases of Big Data are apparent, many organizations struggle to embed a successful Big Data practice in their organization. The structure provided by the Big Data Framework offers an approach that takes into account all organizational capabilities of a successful Big Data practice, all the way from the definition of a Big Data strategy to the technical tools and capabilities an organization should have.
• The Big Data Framework provides a structure for organizations that want to start with
Big Data or aim to develop their Big Data capabilities further.
• The Big Data Framework includes all organizational aspects that should be taken into
account in a Big Data organization.
• The Big Data Framework is vendor independent. It can be applied to any organization
regardless of choice of technology, specialization or tools.
• The Big Data Framework provides a common reference model that can be used across
departmental functions or country boundaries.
• The Big Data Framework identifies core and measurable capabilities in each of its six
domains so that the organization can develop over time.
Big Data is a people business. Even with the most advanced computers and processors in the world, organizations will not be successful without the appropriate knowledge and skills.
The Big Data Framework provides a holistic structure toward Big Data. It looks at the various
components that enterprises should consider while setting up their Big Data organization.
Every element of the framework is of equal importance and organizations can only develop
further if they provide equal attention and effort to all elements of the Big Data framework.
The Big Data Framework consists of the following six main elements:
In order to achieve tangible results from investments in Big Data, enterprise organizations need
a sound Big Data strategy. How can return on investments be realized, and where to focus
effort in Big Data analysis and analytics? The possibilities to analyze are literally endless and
organizations can easily get lost in the zettabytes of data. A sound and structured Big Data
strategy is the first step to Big Data success. In chapter 3, we explore the business drivers of
Big Data and discuss how to formulate a Big Data strategy.
The Big Data Architecture element of the Big Data Framework considers the technical
capabilities of Big Data environments. It discusses the various roles that are present within a
Big Data Architecture and looks at the best practices for design. In line with the vendor-
independent structure of the Framework, chapter 4 will consider the Big Data reference
architecture of the National Institute of Standards and Technology (NIST).
The Big Data algorithms element of the framework focuses on the (technical) capabilities of
everyone who aspires to work with Big Data. It aims to build a solid foundation that includes
basic statistical operations and provides an introduction to different classes of algorithms.
Chapter 6 provides an overview of three fundamental Big Data processes that are applicable to
every organization. It discusses the benefits of every process and provides a step-by-step
description of the process activities to embed a Big Data practice in the organization.
In the Big Data Functions chapter, we discuss the non-technical aspects of Big Data. Chapter 7 discusses the practical aspects of setting up a Big Data Center of Excellence (BDCoE) and provides detailed guidance on the elements of the BDCoE. Additionally, it also addresses critical success factors for starting Big Data projects in the organization.
Artificial Intelligence
The last element of the Big Data Framework addresses Artificial Intelligence (AI). One of the
major areas of interest in the world today, AI provides a whole world of potential. In this part of
the framework, we address the relation between Big Data and Artificial Intelligence and outline
key characteristics of AI.
Many organizations are keen to start Artificial Intelligence projects, but most are unsure where
to start their journey. This guide takes a functional view of AI in the context of bringing
business benefits to enterprise organizations. Chapter 8 therefore showcases how AI follows as
For example, the company has data about its customers from its Total Rewards loyalty
program, web clickstreams, and from real-time play in slot machines. It has traditionally used
all those data sources to understand customers, but it has been difficult to integrate and act
on them in real time, while the customer is still playing at a slot machine or in the resort.
In order to pursue this objective, Caesars has acquired both Hadoop clusters and open-
source and commercial analytics software. It has also added some data scientists to its
analytics group.
There are other goals for the big data capabilities as well. Caesars pays fanatical attention – typically through human observation – to ensuring that its most loyal customers don't wait in lines. With video analytics on Big Data tools, it may be able to employ more automated means for spotting service issues involving less frequent customers. Caesars is also beginning to analyze mobile data and is experimenting with targeted real-time offers to mobile devices.
A Big Data maturity assessment provides tools that assist organizations to define goals around their Big Data program and to communicate their Big Data vision to the entire organization. The underlying maturity models also provide a methodology to measure and monitor the state of a company's Big Data capability, the effort required to complete its current stage or phase of maturity, and the effort required to progress to the next stage.19 Additionally, the Big Data maturity assessment measures and manages the speed of both the progress and adoption of Big Data programs in the organization.
The five levels of Big Data maturity are based on the five-scale CMM-levels:
1) Analytically Impaired (chaotic and ad hoc activity) - Minimal analytics activities and
infrastructure across the enterprise, with ambiguous data and analytics strategy.
Every area of the Big Data Framework is subsequently assessed to determine the level of
capability. The outcome of the Big Data Framework maturity assessment is depicted in the
figure below and provides valuable information on the potential improvement areas for the
organization.
More information about the criteria for performing a Big Data Maturity Assessment, together
with detailed guidance is available on the Big Data Framework website.
In the past, that valuable data from calls couldn't be analyzed. Now, however, United is turning the voice data into text, and then analyzing it with natural language processing software. The analysis process can identify – though it's not easy, given the vagaries of the English language – customers who use terms suggesting strong dissatisfaction. A United representative can then make some sort of intervention, perhaps a call exploring the nature of the problem. The decision being made is the same as in the past – how to identify a dissatisfied customer – but the tools are different.
To analyze the text data, United Healthcare uses a variety of tools. The data initially goes into a data lake using Hadoop and NoSQL storage, so the data doesn't have to be normalized. The natural language processing – primarily a "singular value decomposition," or modified word count – takes place on a database appliance. A variety of other technologies are being surveyed and tested to assess their fit within the future state architecture. United also makes use of interfaces between its statistical analysis tools and Hadoop.
Case Study 3: Big Data at United Healthcare, International Institute for Analytics
Although most enterprises agree that Big Data provides a competitive advantage, many organizations remain well behind the curve. Cross-industry studies show that, on average,
The reason why so many companies are struggling to realize their competitive advantage
through Big Data is because they have not (adequately) defined a Big Data strategy. In many
organizations, Big Data is still project-based, instead of being embedded into the veins of the
organization.
In order to avoid these pitfalls and realize a long-term competitive advantage, the Big Data Framework starts with defining and formulating a Big Data Strategy. Every other activity or process that is further discussed throughout the Big Data Framework should relate back to the Big Data Strategy.
A number of business drivers are at the core of this success and explain why Big Data has
quickly risen to become one of the most coveted topics in the industry. Six main business
drivers can be identified:
In this section, we will explore a high-level overview of each of these business drivers. Each of these adds to the competitive advantage of enterprises by creating new revenue streams or by reducing operational costs.
Figure 16: Historical Costs of Computer Memory, reprinted from McCallum and Blok, 2017
Social media data provides insights into the behaviors, preferences and opinions of the public
on a scale that has never been known before. Due to this, it is immensely valuable to anyone
who is able to derive meaning from these large quantities of data. Social media data can be
used to identify customer preferences for product development, target new customers for
future purchases, or even target potential voters in elections.28 Social media data might even
be considered one of the most important business drivers of Big Data.
A Big Data strategy defines and lays out a comprehensive vision across the enterprise and sets a foundation for the organization to employ data-related or data-dependent capabilities. A well-defined and comprehensive Big Data strategy makes the benefits of Big Data actionable for the organization. It sets out the steps that an organization should execute in order to become a Data Driven Enterprise (as discussed in section 2.4). The Big Data strategy incorporates guiding principles to accomplish the data-driven vision, directs the organization to select specific business goals and is the starting point for data driven planning across the enterprise.30
Besides the gains of realizing a competitive advantage, enterprises require a Big Data strategy
because it transcends organizational boundaries. Without a Big Data strategy, enterprises will
be forced to deal with a variety of data related activities that will most likely be initiated by
different business units. Various departments are likely to start up their own analytics, Business
Intelligence or data management programs, without taking into account the overall long-term
strategic objectives.
The driving force behind the formulation of an enterprise Big Data strategy should be the combination of either the CEO/CIO (when Big Data defines the enterprise) or the COO/CIO (when Big Data optimizes the enterprise). This recognizes that data is not only an IT asset, but also an organization-wide corporate asset.
A well-defined enterprise Big Data strategy should be actionable for the organization. In order to achieve this, organizations can take the following five-step approach to formulate their Big Data strategy:
Each of the steps to formulate a Big Data strategy is explained in further detail in the sections
below.
The Big Data strategy should align with the corporate business objectives and address key business problems, as the primary purpose of Big Data is to capture value by leveraging data. One way to accomplish this is to align with the enterprise strategic planning process, as most organizations already have this process in place.
Examples of frequently occurring business objectives from a recent survey have been listed in
figure 20.
• Executive sponsors. The importance of finding and aligning with executive sponsors cannot be overstated. Their support is essential throughout the ups and downs of formulating the Data Strategy and implementing it.
• Right talent on the team. Involving people with the right talent and skill sets is essential
in determining the right business objectives. Explore both internal talent as well as
external consultants.
• Potential troublemakers. Every project or initiative will have some stakeholders who either deliberately or unintentionally are opposed to change. Knowing who they are and their motivations upfront will help later in the process.
As an example, if the scope of the data strategy is to get a 360 degree view of customers and
potential customers, the current state assessment would include any business process, data
In this stage, it is also important to identify and nurture some data evangelists. These people
truly believe in the power of data in making decisions and may already be using the data and
analytics in a powerful way. By involving these people, asking for their input, it becomes easier
to formulate the roadmap in a later stage.
Well-defined Use Cases provide a clear and effective way to define how Big Data technologies
and solutions can realize business goals. After the Use Cases have been developed, the next
One of the most effective ways to prioritize Use Cases is by using a Prioritization Matrix. The Prioritization Matrix facilitates the discussion and debate between the Business and IT stakeholders in identifying the right Use Cases to start a Big Data initiative – those Use Cases with both meaningful business value (from the business stakeholders' perspectives) and reasonable feasibility of successful implementation.31
The Prioritization Matrix in the figure below is an excellent management tool for driving organizational alignment and commitment around the organization's top-priority Use Cases.
With the desired future state in mind, the Roadmap should focus on identifying gaps in data architecture, technology and tools, processes and, of course, people (skills, training, etc.). The current state assessment and Use Cases will present multiple strategic options for initiatives, and the next task is to prioritize these options based on complexity, budget and potential benefits.
The sponsors and stakeholders will have a key role to play in prioritizing these initiatives. The
end result of this phase is a roadmap to roll out the prioritized Big Data initiatives.
The Big Data Strategy documents should include the following sections:
Background / Context: This section should articulate the background that necessitated the Data Strategy in the first place. Examples could be: corporate strategic direction, a Digital Transformation initiative, or a mergers & acquisitions related context.
An example of a Big Data strategy document is available on the Big Data Framework website.
Everyone presently studying the domain of Big Data should have a basic understanding of how
Big Data environments are designed and operated in enterprise environments, and how data
flows through different layers of an organization. Understanding the fundamentals of Big Data
architecture will help system engineers, data scientists, software developers, data architects,
and senior decision makers to understand how Big Data components fit together, and to
develop or source Big Data solutions.
In this chapter, we will only cover the fundamentals of Big Data architecture that apply to every
enterprise. A more in-depth overview is provided in the Enterprise Big Data Engineer workbook.
The objective of a reference architecture is to create an open standard, one that every organization can use for their benefit. The National Institute of Standards and Technology (NIST), one of the leading organizations in the development of standards, has developed such a reference architecture: the NIST Big Data Reference Architecture.33
The NIST Big Data Reference Architecture is a vendor-neutral approach and can be used by any organization that aims to develop a Big Data architecture. The Big Data Reference Architecture is shown in Figure 24 and represents a Big Data system composed of five logical functional components or roles connected by interoperability interfaces (i.e., services). Two fabrics envelop the components, representing the interwoven nature of management and security and privacy with all five of the components. In the next few paragraphs, each component will be discussed in further detail, along with some examples.
The NIST Big Data Reference Architecture is organized around five major roles and multiple
sub-roles aligned along two axes representing the two Big Data value chains: the Information
Value (horizontal axis) and the Information Technology (IT; vertical axis). Along the Information
The five main roles of the NIST Big Data Reference Architecture, shown in Figure 24 represent
the logical components or roles of every Big Data environment, and present in every
enterprise:
• System Orchestrator;
• Data Provider;
• Big Data Application Provider;
• Big Data Framework Provider;
• Data Consumer.
The two dimensions shown in Figure 24 encompassing the five main roles are:
• Management;
• Security & Privacy.
These dimensions provide services and functionality to the five main roles in the areas specific to Big Data and are crucial to any Big Data solution. The Management and Security & Privacy dimensions are further discussed in chapter 6: Big Data Processes.
System Orchestrator
System Orchestration is the automated arrangement, coordination, and management of
computer systems, middleware, and services.34 Orchestration ensures that the different
applications, data and infrastructure components of Big Data environments all work together.
In order to accomplish this, the System Orchestrator makes use of workflows, automation and
change management processes.
A much-cited comparison to explain system orchestration – and the explanation of its name – is the management of a music orchestra. A music orchestra consists of a collection of different musical instruments that can all play at different tones and at different paces. The task of the conductor is to ensure that all elements of the orchestra work and play together in sync.
System orchestration is very similar in that regard. A Big Data IT environment consists of a
collection of many different applications, data and infrastructure components. The System
Orchestrator (like the conductor) ensures that all these components work together in sync.
Data Provider
The Data Provider role introduces new data or information feeds into the Big Data system for
discovery, access, and transformation by the Big Data system. The data can originate from
different sources, such as human generated data (social media), sensory data (RFID tags) or
third-party systems (bank transactions).
One of the key characteristics of Big Data (see paragraph 1.3) is its variety aspect, meaning that data can come in different formats from different sources. Input data can come in the form of text files, images, audio, weblogs, etc. Sources can include internal enterprise systems (ERP, CRM, Finance) or external systems (purchased data, social feeds). Consequently, data from different sources may have different security and privacy considerations.
Big Data Application Provider
The Big Data Application Provider executes the following set of activities within the data life cycle:
• Collection;
• Preparation;
• Analytics;
• Visualization;
• Access.
The extent and types of applications (i.e., software programs) that are used in this component
of the reference architecture vary greatly and are based on the nature and business of the
enterprise. For financial enterprises, applications can include fraud detection software, credit
score applications or authentication software. In production companies, the Big Data
Application Provider components can be inventory management, supply chain optimization or
route optimization software.
The Big Data Framework Provider can be further sub-divided into the following sub-roles: the infrastructure layer, the platform layer and the processing layer.
The infrastructure layer concerns itself with the networking, computing and storage needs to ensure that large and diverse formats of data can be stored and transferred in a cost-efficient, secure and scalable way. At its very core, the key requirement of Big Data storage is that it is able to handle very large quantities of data, that it keeps scaling with the growth of the organization, and that it can provide the input/output operations per second (IOPS) necessary to deliver data to applications. IOPS is a measure of storage performance that indicates how many read and write operations a storage system can perform per second.
The platform layer is the collection of functions that facilitates high performance processing
of data. The platform includes the capabilities to integrate, manage and apply processing jobs
to the data. In Big Data environments, this effectively means that the platform needs to
facilitate and organize distributed processing on distributed storage solutions. One of the most
widely used platform infrastructure for Big Data solutions is the Hadoop open source
framework (as discussed in section 4.3). The reason Hadoop provides such a successful
platform infrastructure is because of the unified storage (distributed storage) and processing
(distributed processing) environment.
The processing layer of the Big Data Framework Provider delivers the functionality to query
the data. Through this layer, commands are executed that perform runtime operations on the
data sets. Frequently, this will be through the execution of an algorithm that runs a processing
job. In this layer, the actual analysis takes place. It facilitates the crunching of the numbers in
order to achieve the desired results and value of Big Data.
Data Consumer
Similar to the Data Provider, the role of Data Consumer within the Big Data Reference
Architecture can be an actual end user or another system. In many ways, this role is the mirror
image of the Data Provider. The activities associated with the Data Consumer role include the
following:
• Download;
• Analyze Locally;
• Reporting;
• Visualization;
The Data Consumer uses the interfaces or services provided by the Big Data Application
Provider to get access to the information of interest. These interfaces can include data
reporting, data retrieval and data rendering.
Traditional data analysis – as performed by millions of organizations every day – has a fairly straightforward and static design. Most enterprises create structured data with stable data models via a variety of enterprise applications, such as CRM, ERP and various financial systems.35 Various data integration tools subsequently use extract, transform and load (ETL) operations to load the data from these enterprise applications into a centralized data warehouse.
Figure 25: traditional data analysis, reprinted from Kumar and Pandey, 2014
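The sketch below imitates the extract, transform and load (ETL) steps on a toy scale, using an in-memory SQLite database as a stand-in for the data warehouse; the source rows, column names and table name are invented.

```python
# A toy extract-transform-load (ETL) run, assuming an in-memory SQLite database
# stands in for the data warehouse; the CRM rows and column names are invented.
import sqlite3

# Extract: rows as they might arrive from a source application.
crm_rows = [
    ("anna",  "2017-03-01", "1,200.00"),
    ("bernd", "2017-03-02",   "950.50"),
]

# Transform: normalize names and convert amounts to numeric values.
clean_rows = [(name.title(), date, float(amount.replace(",", "")))
              for name, date, amount in crm_rows]

# Load: insert the cleaned rows into a structured warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, order_date TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)

for row in warehouse.execute("SELECT customer, amount FROM sales"):
    print(row)
```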
In the data warehouse, the different data (originating from the different applications) are neatly
stored into databases with structured rows and columns (similar to a large Excel sheet). Due to
In order to deal with the size (volume) and disparity (variety) of this data, a different
architecture is necessary to ensure that the performance levels are maintained, and the
processing of Big Data brings actual value to the enterprise. To achieve these objectives, Big
Data architectures commonly adhere to the following four core design principles:
In order to process the distributed data, each part is subsequently analyzed within the cluster
itself. Rather than bringing all the data to one central location, processing occurs at each node
simultaneously and therefore parallel to each other. The local processing of data within the
nodes is called distributed processing. Finally, analytics and visualization technology can be
used to display the end result.
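The following sketch mimics the distributed processing idea on a single machine: the data is split into parts, each part is processed "locally" by a separate worker process, and the partial results are then combined. Worker processes stand in for cluster nodes here; a real cluster would distribute the parts across machines.

```python
# A single-machine sketch of the idea: split the data into parts, process each
# part "locally" in parallel, then combine the partial results.
from multiprocessing import Pool

def process_partition(values):
    # Each "node" computes a partial result on its own slice of the data.
    return sum(values)

if __name__ == "__main__":
    data = list(range(1_000_000))
    partitions = [data[i::4] for i in range(4)]   # distribute the data in 4 parts

    with Pool(processes=4) as pool:
        partial_sums = pool.map(process_partition, partitions)

    print("Combined result:", sum(partial_sums))  # aggregate the local results
```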
The four design principles are embedded in the Hadoop open-source software framework, which will be further discussed in the next section.
As discussed in section 4.3, in traditional data analysis environments the data storage that is used to store and retrieve data from CRM, ERP or finance systems consists of structured Relational Database Management Systems (RDBMS). However, because of the unstructured nature of Big Data, different storage systems are required.
Storage in Big Data environments is a complex subject because of two opposing forces that apply. On one hand, the storage infrastructure needs to provide reliable information storage space. On the other hand, it must provide a powerful access interface for querying and analyzing Big Data sets.
• Storage Area Network (SAN). A Storage Area Network is a network which provides access to consolidated, block-level data storage. SANs are primarily used to make storage devices, such as disk arrays, tape libraries, and optical jukeboxes, accessible to servers so that the devices appear to the operating system as locally attached devices.
Direct Attached Storage is only suitable on a small scale (the storage needs to be physically
attached to the computer). Most enterprises therefore use network storage (in data centres)
to accommodate their storage requirements.
Components of distributed storage for Big Data may be classified into three bottom-up levels:
1) File systems. A file system is used to control how data is stored and retrieved. Without a file system, information placed in a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins. By separating the data into pieces and giving each piece a name, the information is easily isolated and identified. Taking its name from the way paper-based information systems are named, each group of data is called a "file". The structure and logic rules used to manage the groups of information and their names are called "file systems". Important file systems in Big Data are the Google File System (GFS) and the Hadoop Distributed File System (HDFS), both of which are expandable distributed file systems.
• Offline analysis. Offline analysis is used for applications that are less time sensitive and for which the real-time value of data is less urgent. Offline processing (also known as batch processing) imports data at set times and subsequently processes it at time intervals. Most enterprises utilize the offline analysis architecture based on Hadoop in order to reduce costs and improve the efficiency of data processing. Examples of offline analysis tools include Scribe (Facebook), Kafka (LinkedIn), TimeTunnel (Taobao) and Chukwa (Hadoop open source).
Although both real-time analysis and offline analysis provide adequate results, most enterprises utilize offline processing if the timeliness of data is not a key requirement.
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term.
Sub-projects and contrib modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (Pig, for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the JobTracker keeps track of MapReduce jobs.
As discussed in section 4.2, the main benefit of the Hadoop software framework is that it
incorporates the four design principles of a Big Data architecture. Since Hadoop is so essential
to understanding how Big Data solutions work, this section will highlight its main components.
One of the core properties of the HDFS is that each of the data parts is replicated multiple
times and distributed across multiple nodes within the cluster. If one node fails, another node
has a copy of that specific data package that can be used for processing.41 Due to this, data
can still be processed and analyzed even when one of the nodes fails due to a hardware failure.
This makes HDFS and Hadoop a very robust system.
NameNode
Since HDFS stores multiple copies of the data parts across different nodes in the cluster, it is
very important to keep track of where the data parts are stored, and which nodes are available
or have failed. The NameNode performs this task. It acts as a facilitator that communicates
where data parts are stored and if they are available.
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files
in the file system, and tracks where across the cluster the file data is kept. It does not store the
data of these files itself.42
MapReduce
Once the data parts are stored across different nodes in the cluster, they can be processed. The MapReduce framework ensures that these tasks are completed by enabling the parallel distributed processing of the data parts across the multiple nodes in the cluster.
The first operation of the MapReduce framework is to perform a Map procedure. One of the nodes in the cluster requests the Map procedure – usually in the form of a Java query – in order to process some data. The node that initiates the Map procedure is labelled the Job Tracker (discussed next). The Job Tracker then refers to the NameNode to determine which data is needed to execute the request and where the data parts are located in the cluster. Once the location of the necessary data parts is established, the Job Tracker submits the query to the individual nodes, where it is processed. The processing thus takes place locally within each node, establishing the key characteristic of distributed processing.
The second operation of the MapReduce framework is to execute the Reduce method. This
operation happens after processing. When the Reduce job is executed, the Job Tracker will
locate the local results (from the Map procedure) and aggregate these components together
into a single final result. This final result is the answer to the original query and can be loaded
into any number of analytics and visualization environments.
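A simplified, single-process imitation of the Map and Reduce steps described above is sketched below as a word count. In a real Hadoop cluster the map calls would run on the nodes that hold the data parts and the framework would group and reduce the intermediate key-value pairs; here both phases run locally for illustration.

```python
# A simplified imitation of the Map and Reduce phases, counting words.
from collections import defaultdict

data_parts = [
    "big data needs distributed processing",
    "distributed storage and distributed processing",
]

# Map: each data part is turned into intermediate (key, value) pairs locally.
intermediate = []
for part in data_parts:
    intermediate.extend((word, 1) for word in part.split())

# Reduce: pairs with the same key are aggregated into a single final result.
counts = defaultdict(int)
for word, value in intermediate:
    counts[word] += value

print(dict(counts))
```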
Slave Node
Slave Nodes are the nodes in the cluster that follow directions from the Job Tracker. Unlike the NameNode, the Slave Nodes do not keep track of the location of the data.
Job Tracker
The Job Tracker – introduced in the MapReduce section – is the node in the cluster that initiates and coordinates processing jobs. Additionally, the Job Tracker invokes the Map procedure and the Reduce method.
In order to find value in datasets, data scientists apply algorithms. Algorithms are unambiguous specifications of how to solve a class of problems. Algorithms can perform calculations, data processing and automated reasoning tasks. By applying algorithms to large volumes of data, valuable knowledge and insights can be obtained. A very basic example of an algorithm – one that finds the maximum value in a set of data – is depicted in the figure below.
Figure 28: Example of simple algorithm to find the maximum value in a data set
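Figure 28 is not reproduced here, but the algorithm it depicts can be sketched in a few lines: scan the data once and keep the largest value seen so far. The sample values below are invented.

```python
# A minimal sketch of a maximum-finding algorithm: walk through the data set
# once, keeping the largest value seen so far.
def find_maximum(values):
    maximum = values[0]               # start with the first value
    for value in values[1:]:
        if value > maximum:           # a larger value replaces the current maximum
            maximum = value
    return maximum

print(find_maximum([23, 19, 31, 27, 22]))   # prints 31
```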
Algorithms can vary from very simple with only a few lines of code, to very sophisticated and
complex, with millions of lines of code. In this chapter, we start with the basic operations behind
algorithms. More advanced and complex examples are discussed in the Enterprise Big Data
Scientist Guide.
The application of algorithms, and its subsequent use for Big Data, is grounded in the scientific
domain of statistics. Everyone involved in data science should therefore have a fundamental
knowledge about statistical operations and how they could be applied in algorithms. This
chapter will therefore discuss essential statistical operations and provide common algorithms
that are used in Big Data analysis and analytics solutions.
For example, the shooting percentage in basketball is a descriptive statistic that summarizes
the performance of a player or a team. This number is the number of shots made divided by
the number of shots taken. For example, a player who shoots 33% is making approximately one
shot in every three. The percentage summarizes or describes multiple discrete events, and
everyone can compare the statistic to the shooting percentages of other players.
• Central Tendency Statistics
• Dispersion Statistics
• Distribution Shapes
Each descriptive statistic is illustrated with a short example that explains how the statistic
should be calculated, so that it can subsequently be used in the development of Big Data
Algorithms.
Mean
The arithmetic mean (or simply "mean") of a sample is the sum of the sampled values divided by the number of items in the sample. In the example below, the mean is calculated for a group of basketball players.
If the number of values is even, the median is calculated by taking the middle two values, adding them together and dividing by two. One of the core properties of the median is that it is not greatly affected by outliers in the data set ‒ extreme values will not have a large impact on the median.
Mode
The mode of a set of data values is the value that appears most often. In other words, it is the
value that is most likely to be sampled. Similar to the mean and median, the mode can provide
key information about a data set. In the example below, the mode is determined for a group of
basketball players.
If two values both appear most frequently, the data set can be classified as bimodal. If more than two values appear most frequently, the data set is labeled as multimodal.
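The three central tendency statistics can be computed with Python's standard library, as in the sketch below; the ages are an invented set of basketball players rather than the figures from the examples above.

```python
# Calculating the three central-tendency statistics with Python's standard
# library; the ages below are an invented set of basketball players.
import statistics

ages = [19, 22, 22, 24, 25, 27, 31]

print("mean:",   statistics.mean(ages))     # sum of values / number of values
print("median:", statistics.median(ages))   # middle value of the sorted data
print("mode:",   statistics.mode(ages))     # most frequently occurring value
```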
Dispersion Statistics
In statistics, dispersion (also called variability, scatter, or spread) is the extent to which
a distribution is stretched or squeezed. Dispersion statistics indicate how data points are
Range
The range of a set of data is the difference between the largest and smallest values.
Interquartile Range
The interquartile range (IQR), also called the mid-spread or middle 50%, is a measure of dispersion, equal to the difference between the 75th and 25th percentiles (IQR = Q3 − Q1). In other words, the IQR is a statistic that indicates where the middle 50% of values are located, as per the example below.
The IQR is a very useful statistic in data science, because it is not influenced by extreme values (outliers). Extreme values in data sets are commonly generated by corrupt data. Consequently, the IQR is an adequate basis for detecting and eliminating outliers.
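A minimal sketch of both dispersion measures is shown below, assuming NumPy is available; the sample ages are hypothetical and include one extreme value.

import numpy as np

ages = np.array([19, 21, 22, 24, 25, 27, 31, 45])  # hypothetical ages; 45 is an extreme value

data_range = ages.max() - ages.min()    # range: largest minus smallest value
q1, q3 = np.percentile(ages, [25, 75])  # 25th and 75th percentiles
iqr = q3 - q1                           # spread of the middle 50%, unaffected by the extreme value

print(data_range, q1, q3, iqr)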
Variance
Variance is the expectation of the squared deviation of a random variable from its mean.
Informally, it measures how far a set of (random) numbers are spread out from their average
value. The closer the variance is to zero, the more closely the data points are clustered
together.
The formula for the variance is: σ² = Σ(X − μ)² / N
1) Subtract the mean from each value in the data. This gives you a measure of the distance
of each value from the mean.
2) Square each of these distances (so that they are all positive values), and add all of the
squares together.
3) Divide the sum of the squares by the number of values in the data set.
In the example below, the variance is calculated for the ages of a group of basketball players:
Standard Deviation
The standard deviation (SD, also represented by the Greek letter sigma σ or the Latin letter s)
is a measure that is used to quantify the amount of variation or dispersion of a set of data
values.
A low standard deviation indicates that the data points tend to be close to the mean (also
called the expected value) of the set, while a high standard deviation indicates that the data
points are spread out over a wider range of values.44 The difference between a low- and a high
standard deviation is depicted in figure 35.
The standard deviation can be calculated by taking the square root of the variance. A useful
property of the standard deviation is that, unlike the variance, it is expressed in the same units
as the data.
The formula for the standard deviation is: σ = √( Σ(X − μ)² / N )
The standard deviation is calculated in exactly the same way as the variance, with the additional step of taking the square root of the result. The example below calculates the standard deviation for the same data set of basketball players.
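The calculation steps translate directly into code. The sketch below (hypothetical ages, NumPy assumed) computes the population variance and standard deviation and checks them against NumPy's built-in functions.

import numpy as np

ages = np.array([19, 22, 22, 24, 25, 27, 31])  # hypothetical ages of basketball players

mean = ages.mean()
variance = ((ages - mean) ** 2).sum() / len(ages)  # squared distances from the mean, averaged
std_dev = variance ** 0.5                          # standard deviation: square root of the variance

# NumPy's var() and std() use the same population formulas by default (ddof=0)
assert np.isclose(variance, ages.var())
assert np.isclose(std_dev, ages.std())
print(variance, std_dev)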
In Big Data analysis and analytics, a number of common distributions are used:
• Frequency distribution
• Probability distribution
• Sampling distribution
• Normal distribution
Frequency distribution
A frequency distribution is a table or a graph displaying the frequency of various outcomes in
a sample. Each entry in the table contains the frequency or count of the occurrences of values
within a particular group or interval, and in this way, the table summarizes the distribution of
values in the sample.
Interval   Frequency
<15        5
16-19      12
20-23      40
24-26      20
27-30      34
31-34      15
>35        6
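A frequency distribution like the one above can be produced by assigning each value to a group and counting the group sizes. The Python sketch below uses hypothetical ages and illustrative interval boundaries.

from collections import Counter

ages = [14, 18, 21, 22, 22, 25, 28, 29, 33, 36]  # hypothetical sample of ages

def age_group(age):
    # Map an age to the interval it belongs to (boundaries are illustrative).
    if age < 16:
        return "<16"
    if age < 20:
        return "16-19"
    if age < 24:
        return "20-23"
    if age < 27:
        return "24-26"
    if age < 31:
        return "27-30"
    if age < 35:
        return "31-34"
    return ">=35"

frequency = Counter(age_group(age) for age in ages)
print(frequency)  # count of values per interval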
Probability distribution
A probability is the chance or likelihood that a certain outcome will happen. The probability that a coin flip will be tails is 0.5, which indicates that there is a 50% chance that the coin will show tails in a future coin flip.
A probability distribution is a summary graph that depicts the likelihood of all potential
outcomes. It is a mathematical function that can be thought of as providing the probabilities of
occurrence of different possible outcomes in an experiment. The distribution here shows
probabilities on the y-axis.
One of the core properties of the probability distribution is that all potential outcomes should
sum to 100 percent. In other words, the area under the curve of the probability distribution
should always be 1.0 (which is the same as 100% of all probabilities).
Probability distributions are widely used in Big Data, because one of the primary aims is to
predict certain outcomes. When banks, for example, provide new credit cards to potential
clients, they aim to minimize the risk that the client will default, whilst optimizing the chance
that the client will become a profitable customer.
Sampling distribution
A sampling distribution is the probability distribution of a given statistic based on a random
sample. Sampling distributions are important in Big Data because they provide a major
simplification that can be used for predictive analytics. More specifically, sampling distributions allow population inferences to be based on a sample (e.g., the Big Data set), rather than on the joint probability distribution of all the individual values in the population.
Normal distribution
The normal distribution (also known as the Gaussian distribution) is the most important and most common continuous probability distribution. Normal distributions are important in statistics and are often used in data science to represent real-valued random variables whose distributions are not known.45
A normal distribution represents data that occurs commonly in practice: most values are close to the average value and only a few values are found at the extremities. In a normal distribution, approximately 99.7% of the values lie within three standard deviations of the mean, and the area under the curve is equal to one.
Skewness
Skewness is a measure of the asymmetry of a probability distribution of a real-valued random
variable about its mean. The skewness value can be positive or negative, or undefined.
Consider the two distributions in the figure just above. Within each graph, the values on the
right side of the distribution taper differently from the values on the left side. These tapering
sides are called tails, and they provide a visual means to determine which of the two kinds of
skewness a distribution has46:
1) Negative skew: A distribution is negatively skewed when the tail of the curve is longer on the left side, and the mean is less than the median and mode. The majority of the values exist on the right side of the curve.
2) Positive skew: A distribution is positively skewed when the tail of the curve is longer on the right side, and the mean is greater than the median and mode. The majority of the values exist on the left side of the curve.
Skewness in distributions is important in data science because skewness can indicate potential bias (i.e., not an adequate representation of the actual data) in data sets.
Standardization
Because of the properties of the standard normal distribution, it is possible to express the values of other distributions in terms of their number of standard deviations below or above the mean. This conversion is called standardization. Data points are converted into standard scores (better known as z-scores) by using the following formula:
z = (x − μ) / σ
Where μ is the mean of the population and σ is its standard deviation. An example illustrates the use of standardization. Suppose there is a dataset with different kinds of properties of basketball players (height, age, average score per game, number of passes, etc.). The means and standard deviations of these properties cannot be compared directly, because the properties are expressed in different units. However, when standardized values are used, all properties are expressed on the same scale and can be compared.
Standardization is one of the most important processes when analyzing Big Data, because it allows different variables to be combined. Standardized values are almost always used in the design and execution of algorithms. An in-depth treatment of the application of standardization is provided in the Enterprise Big Data Scientist guide.
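A minimal sketch of standardization is shown below, assuming NumPy; the heights and ages are hypothetical. After conversion to z-scores, both variables are expressed in standard deviations from their own mean and can be compared or combined.

import numpy as np

heights = np.array([1.88, 1.92, 1.95, 2.01, 2.10])  # hypothetical heights in metres
ages = np.array([19, 22, 24, 27, 31])               # hypothetical ages in years

def standardize(values):
    # z = (x - mu) / sigma, using the population standard deviation (ddof=0)
    return (values - values.mean()) / values.std()

print(standardize(heights))
print(standardize(ages))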
If, for example, a study of a sample of 500 NBA basketball players shows that 99% of the players in the sample are taller than 1.95 meters, it could be inferred that 99% of all basketball players in the NBA are taller than 1.95 meters. This would be a statement based on inferential statistics. Whether this statement is true depends on whether the sample data is a representative subset of the entire population.
A sample is a subset of the population that is being analyzed, and about which the data is
available. The elements of a sample are known as sample points, sampling units or
observations. Statistical analysis and algorithms are applied on the sample data in order to
make assumptions and statements about the entire population.
• A sample of 10,000 people who play basketball has been collected by conducting a survey at basketball clubs.
One of the most important considerations in sampling is to ensure that the sample is a
representative subset of the entire population. Otherwise, there could be bias.
Bias
If the sample that has been selected is not an adequate representation of the entire
population, it is called a biased sample. Bias will result in inadequate or wrong predictions
about the future, because in inferential statistics, assumptions are made about the entire
population based on the sample.
• In order to determine shirt size for basketball players around the world, a sports apparel
company is interested to know the average player height of basketball players in the
world.
• One of the datasets that is readily available is the dataset of player characteristics of the
professional basketball players in the National Basketball Association (NBA) in the United
States.
• Based on this sample, it is concluded that 99% of basketball players in the NBA are
above 1.95 meters tall.
• Although all the calculations have been made correctly, the sports apparel company would make a big mistake if it produced only shirt sizes for players taller than 1.95 meters, because the sample is biased.
• The sample only considered NBA players, a very small subset consisting of the most successful basketball players in the world. Within this small group the average height might be above 1.95 meters, but this does not represent the average height of basketball players around the world.
Most incorrect predictions in statistics are made when the sample data is biased.
In the same example, instead of analyzing the dataset of NBA players, the sports apparel company could also choose to analyze the data of all members of basketball clubs throughout the world. If the company had access to this massive quantity of data, it could fairly accurately predict player shirt sizes by looking at the average height.
5.4. Correlation
Dependence (or association) is any statistical relationship, whether causal or not, between two
random variables or bivariate data. Correlation is any of a broad class of statistical
relationships involving dependence, although it is mostly used to indicate whether two variables
have a linear relationship. An example of correlation is the relationship between the height of
basketball players and their selection for tryouts in the NBA.
Correlations are useful because they can indicate a predictive relationship that can be exploited in practice, for example to make purchasing decisions or to forecast future sales. However, the presence of a correlation is not sufficient to infer the presence of a causal relationship (i.e., correlation does not imply causation).
In correlation, two (or more) variables are compared to each other. These variables can either
be dependent or independent:
• Independent variables are not changed or affected by changes in the other variable. They operate independently, and are frequently changed deliberately to test the effect on the dependent variable. Common examples of independent variables are temperature, age, or the height of basketball players.
• Dependent variables are the variables that change based on the fluctuations of the
independent variable. The dependent variables represent the output or outcome whose
variation is being studied. In the example above, the chance of selection for NBA tryouts
is the dependent variable that we would like to know (dependent on the independent
variable of player height).
Correlation refers to a specific relationship between two variables, and a number of different correlation coefficients exist to quantify it. In this guide, we only discuss the most frequently used correlation, the Pearson correlation, which is sensitive to a linear relationship between two variables.
ρ(X,Y) = cov(X, Y) / (σX · σY)
Where cov is the covariance, σX is the standard deviation of X, and σY is the standard deviation of Y.
Correlations that are close to either -1 or +1 are considered strong correlations, because the two variables move closely together (in the same direction for values near +1, and in opposite directions for values near -1). Even with strong correlations, though, always keep in mind that correlation does not imply causation.
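The formula can be verified with a short sketch (hypothetical values, NumPy assumed); the manual calculation should match NumPy's corrcoef function.

import numpy as np

height = np.array([1.85, 1.90, 1.95, 2.00, 2.05])  # hypothetical player heights (independent variable)
tryout = np.array([0.10, 0.25, 0.40, 0.60, 0.80])  # hypothetical tryout selection rates (dependent variable)

cov = ((height - height.mean()) * (tryout - tryout.mean())).mean()  # covariance
rho = cov / (height.std() * tryout.std())                           # Pearson correlation coefficient

print(rho)
print(np.corrcoef(height, tryout)[0, 1])  # should match the manual calculation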
5.5. Regression
Regression analysis is a set of statistical processes for estimating the relationships among
variables. It includes many techniques for modeling and analyzing several variables, when the
focus is on the relationship between a dependent variable and one or more independent
variables (or 'predictors').
The most important characteristic of regression is that it always aims to estimate a function of
the independent variables ̶ the regression function. In other words, in regression, we are
trying to find the best fit line in order to make predictions (or forecasts) about the relationship
between variables. Because of its predictive nature, it is widely used in machine learning in
order to find relationships in datasets.
y = αx + β
where α is the slope of the best fit line and β is the y-intercept. The goal is to find the values of α and β that provide the best fit through all available data points. These values can be found by minimizing the distance between the regression line and the actual data points (i.e., by minimizing the sum of squared errors). In the case of simple linear regression with standardized variables, the slope α equals the Pearson correlation coefficient; more generally, α equals the Pearson correlation coefficient multiplied by the ratio of the standard deviations of y and x.
Figure 43: Linear regression aims to find the best fit line.
• The square of Pearson's correlation coefficient is the same as the R2 in simple linear
regression;
• Neither simple linear regression nor correlation answer questions of causality directly.
• While correlation typically refers to the linear relationship, it can refer to other forms of
dependence, such as polynomial or truly nonlinear relationships;
• While correlation typically refers to Pearson's correlation coefficient, there are other
types of correlation, such as Spearman's rank correlation.
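A minimal least-squares sketch (hypothetical data, NumPy assumed) shows how α and β can be estimated and how the squared Pearson coefficient relates to R².

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical dependent variable

# Least-squares estimates for the best fit line y = alpha * x + beta
alpha = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
beta = y.mean() - alpha * x.mean()

r = np.corrcoef(x, y)[0, 1]
print(alpha, beta)
print(r ** 2)  # equals the R-squared of the simple linear regression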
5.6. Classification
Classification is the problem of identifying to which of a set of categories a new observation
belongs, based on a training set of data containing observations whose category membership
is known. Because the computer is fed sample data, classification is a form of supervised
machine learning.
1) A computer is fed sample data that contains information about the class of each data point. For example, it learns to classify carrots as "vegetables" and oranges as "fruits".
2) After the training of the machine, new data or observations are provided to the computer.
3) The computer now starts to classify by itself. In the example, edibles that have similar characteristics to carrots will be labeled as "vegetables", whereas edibles that have similar characteristics to oranges will be labeled as "fruits".
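These three steps can be sketched with a k-nearest-neighbours classifier; scikit-learn is assumed here purely for illustration, and the feature values (a weight and a sweetness score) are hypothetical.

from sklearn.neighbors import KNeighborsClassifier

# Step 1: sample data with known class membership
features = [[60, 2], [70, 3], [65, 2], [150, 8], [140, 7], [160, 9]]  # [weight in grams, sweetness score]
labels = ["vegetable", "vegetable", "vegetable", "fruit", "fruit", "fruit"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(features, labels)

# Steps 2 and 3: new observations are classified by the trained model
print(model.predict([[68, 2], [145, 8]]))  # expected: ['vegetable' 'fruit']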
5.7. Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar (in some sense) to each other than to
those in other groups (clusters).
In order to arrive at a cluster, the computer needs to run a clustering algorithm. There are
many known clustering algorithms available, depending on characteristics of the problem to be
solved. A commonality is that most clustering algorithms look at the similarity between data
points.48
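A minimal clustering sketch, again assuming scikit-learn and hypothetical observations; k-means groups the observations into a chosen number of clusters based on their similarity.

from sklearn.cluster import KMeans

# Hypothetical observations: [height in metres, average points per game]
observations = [[1.80, 5], [1.82, 6], [1.85, 4], [2.05, 22], [2.08, 25], [2.10, 24]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(observations)
print(kmeans.labels_)           # cluster assignment for each observation
print(kmeans.cluster_centers_)  # centre point of each cluster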
Macys.com utilizes a variety of leading-edge technologies for big data, most of which are not
used elsewhere within the company. They include open-source tools like Hadoop, R, and
Impala, as well as purchased software such as SAS, IBM DB2, Vertica, and Tableau. Analytical
initiatives are increasingly a blend of traditional data management and analytics technologies,
and emerging big data tools. The analytics group employs a combination of machine learning
approaches and traditional hypothesis-based statistics.
Case Study 5: Big Data at Macys.com, Source: International institute for analytics
Outliers are generally data points that appear to be unexpected in comparison with the rest of
the data ̶ they do not fit into the pattern of the other data points. As discussed in paragraph
5.2, the standard normal distribution can be used to detect outliers. Remember that within the standard normal distribution, approximately 99.7% of data points fall within three standard deviations of the mean. If one or more data points are more than three standard deviations from the mean, this might be an indication that these points are incorrect or contain flawed data.
Outlier detection is a widely used technique, especially in the context of Big Data. Insurance and credit card companies use outlier detection to identify fraudulent claims or transactions by looking at data that does not fit the regular pattern. Similarly, outlier detection algorithms are used by intelligence agencies to detect anomalies in individual behavior that might pose a threat to national security.
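A simple outlier check based on the three-standard-deviation rule can be sketched as follows (hypothetical transaction amounts, NumPy assumed).

import numpy as np

rng = np.random.default_rng(42)
amounts = np.append(rng.normal(50, 5, size=200), 400.0)  # hypothetical transactions plus one extreme value

z_scores = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z_scores) > 3]  # flag values more than three standard deviations from the mean
print(outliers)                           # the 400.0 transaction is flagged for review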
In this section, we will briefly discuss the most common data visualization techniques and their
properties:
• Bar charts
• Histograms
• Scatter plots
• Bi-plots
• Box plots
• Q-Q plots
• Pie charts
• Radar charts
More detailed data visualization techniques are discussed in the Enterprise Big Data Analyst
guide.
Bar charts
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart.
Histograms
A histogram is an accurate representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable. A histogram is similar to a bar chart, but because the underlying variable is continuous, the bars are adjacent to each other with no gaps between them.
To construct a histogram, the first step is to "bin" the range of values̶that is, divide the entire
range of values into a series of intervals̶and then count how many values fall into each
interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
The bins (intervals) must be adjacent and are often (but are not required to be) of equal size.49
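The binning step can be illustrated with NumPy's histogram function; the values below are randomly generated purely for the purpose of the example.

import numpy as np

values = np.random.default_rng(1).normal(loc=25, scale=4, size=500)  # hypothetical ages
counts, edges = np.histogram(values, bins=10)  # ten adjacent, equal-width bins

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")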
Scatter plots
A scatter plot displays the values of two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
Biplots
A Biplot is an enhanced scatterplot that makes use of both points and vectors to represent
structure. A biplot uses points to represent the scores of the observations on the principal
components, and it uses vectors to represent the coefficients of the variables on the principal
components.
The advantage of the biplot is that the relative location of the points can be interpreted. Points
that are close together correspond to observations that have similar scores on the
components displayed in the plot. To the extent that these components fit the data well, the
points also correspond to observations that have similar values on the variables.
Box plots
A box plot or boxplot is a method for graphically depicting groups of numerical data through
their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers)
indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker
plot and box-and-whisker diagram. Outliers may be plotted as individual points.
A box plot is very useful in Big Data because it immediately shows the median, the first and third quartiles (Q1 and Q3, and therefore the IQR), and any potential outliers. It captures and communicates key information very quickly.
Q-Q plots
A Q-Q (quantile-quantile) plot compares two probability distributions by plotting their quantiles against each other. First, the set of intervals for the quantiles is chosen. A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). The line is thus a parametric curve, with the parameter being the number of the interval for the quantile.
If the two distributions being compared are similar, the points in the Q‒Q plot will approximately
lie on the line y = x. If the distributions are linearly related, the points in the Q‒Q plot will
approximately lie on a line, but not necessarily on the line y = x. Q-Q plots can also be used as a
graphical means of estimating parameters in a location-scale family of distributions.
In data science, Q-Q plots are of great importance because they immediately show if one of the
data sets that is analyzed has a greater variance than the other. If the points form a line that is
flatter, the distribution plotted on the x-axis has a greater variance as compared to the
distribution plotted on the y-axis. However, if the points form a steeper line, then the
distribution plotted on the y-axis has a greater variance as compared to the distribution plotted
on the x-axis.
Radar charts
A radar chart is a graphical method of displaying multivariate data in the form of a two-
dimensional chart of three or more quantitative variables represented on axes starting from
the same point. The relative position and angle of the axes is typically uninformative. The radar
chart is also known as spider chart due to the nature of its design.
Radar charts are a useful way to display multivariate observations with an arbitrary number of
variables.50 Each point represents a single observation. Typically, radar charts are generated in
a multi-plot format with many stars on each page and each star representing one observation.
In order to avoid the potential pitfalls that Big Data brings, processes can help enterprises to
focus their direction. Processes bring structure, measurable steps and can be effectively
managed on a day-to-day basis. Additionally, processes embed Big Data expertise within the organization: by following similar procedures and steps, Big Data analysis becomes an established practice of the organization. Analysis becomes less dependent on individuals, which greatly enhances the chances of capturing value in the long term.
Setting up Big Data processes in the enterprise might be a time-consuming task at first, but it definitely provides benefits in the long run. In this section, we will discuss how Big Data processes can provide structure in the analysis of data. Big Data processes can be subdivided into three main sub-processes:
• The data analysis process
• The data governance process
• The data management process
Although closely related and beneficial in any organization, each of these sub-processes has a different focus and function: control, compliance or quality, respectively.
With a very large amount of customer data across multiple channels and relationships, the
bank historically was unable to analyze all of its customers at once, and relied on systematic
samples. With Big Data technology, it can increasingly process and analyze data from its full
customer set.
Other than some experiments with analysis of unstructured data, the primary focus of the bank's big data efforts is on understanding the customer across all channels and interactions, and presenting consistent, appealing offers to well-defined customer segments.
For example, the Bank utilizes transaction and propensity models to determine which of its
primary relationship customers may have a credit card, or a mortgage loan that could
benefit from refinancing at a competitor. When the customer comes online, calls a call
centre, or visits a branch, that information is available to the online app, or the sales
associate to present the offer. The various sales channels can also communicate with each
other, so a customer who starts an application online but doesn't complete it could get a follow-up offer in the mail, or an email to set up an appointment.
Case Study 6: Bank of America Case Study, International Institute for Analytics
As with any process, the data analysis process is sequential and has a clearly identified start
(the trigger) and end result (the outcome). By managing the stages in the data analysis
process, enterprises can better control the outcomes and results of their Big Data projects.
Within Big Data projects, the business objectives (and hence the underlying problems) can frequently be subdivided into six types.52 Each of these types has its own way of dealing with the outcome of the problem and the way in which the final results need to be interpreted:
An exploratory business objective aims to find a relation between two or more different variables or data sets. The goal of this objective is to find a pattern or relationship in the data that
can be used to optimize performance. Examples could include the identification of products
that are bought together (market basket analysis) or the identification of a sales pattern based
on the weather conditions.
A causal business objective aims to find the underlying relationship of a certain phenomenon
(the cause). This type of objective aims to find the root cause of certain data patterns in order
to better understand relationships. A causal business objective aims to learn why certain data
were created. Examples include finding out why sales performance was higher in a certain
month or what the root cause is of increased quality defects.
A mechanistic business objective aims to find how variables influence the outcomes of data sets. It requires a deeper understanding of the underlying relationships and patterns within the data sets.
The first step of big data analysis ̶ determining the business objective ̶ is important
because it specifies which algorithms and techniques (discussed in chapter 5) should be used
to solve the problem. A complete mapping of business objectives and algorithms is discussed
in the Enterprise Big Data Analyst guide.
Most data analysis starts with the identification of the raw data. Raw data is data that has not yet been processed and that comes directly from the source. Data sources can include internal enterprise systems as well as external sources such as social media feeds or weather data.
In order to identify the necessary data to meet the business objectives, a data identification
graph can be drawn, that works backward from the processed data towards the raw (source)
data. A data identification graph is depicted in figure 58.
Data collection
Within most enterprises, (internal) data is stored at various physical locations or data centers
across the globe. In order to make use of this data, the data analyst or data scientist should
obtain the appropriate access rights and collaborate with the data management team (see next
section). Measures should be set up to ensure data integrity and data confidentiality, safeguarding that the data does not fall into the wrong hands. Additionally, most countries have privacy and regulatory requirements stipulating that personal data may not cross national borders. This could cause significant issues if Big Data teams, for example, aim to compare customer behavior from Singapore (located in a data center in Singapore) with customer behavior from the USA (located in a data center in Boston).
Besides the security and privacy concerns, a second consideration in the data collection
process is to deal with the volume, variety, and especially velocity aspects of the data sets. If
data is renewed or refreshed on a daily basis (for example a Twitter feed), collection includes
decisions regarding the frequency of data imports (real-time vs. batch imports) and how to
deal with previous imports.
Data sourcing
To obtain value from Big Data, internal enterprise data sets are combined with external data
sets (for example weather information or Twitter feeds). Some of these external data sets
might be available for free, but most data sets will need to be acquired from external vendors.
Data acquisition and sourcing also require the involvement of procurement and, according to sourcing procedures and regulations in most countries, an open and transparent bidding process. Besides the fact that these procedures take time, great care should be taken to ensure that the data vendor is not overselling the value of its data products, and that no major data processing costs are hidden in the data sourcing process.
Reviewing the acquired data sets serves two main objectives:
• To determine whether there are any problems or issues with the data sets;
• To determine whether the business objectives (step 1) can be realized with these data sets.
The reality in practice is that almost every data set, even if procured from expensive and trustworthy data providers, has incorrect or missing values that need to be accounted for. The data review process is therefore of paramount importance, because missing values or outliers can have a profound impact on the end result.
In case there is flawed, incorrect or corrupt data, the data set needs to be cleansed in the next
step.
Mathematical formulas or models called algorithms may be applied to the data to identify relationships among the variables, such as correlation or causation. In general terms, models may be developed to evaluate a particular variable in the data based on other variables in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).54
In the domain of politics, for example, a data analyst can use a sample of polls in order to
predict the outcome of an election. In order to do this, the analyst would need to build a model
that can be applied to the data. The process of building a model involves imposing a specific
structure on the data and creating a summary of the data. The (statistical) model is one of the
most valuable steps in the data analysis process, because the accuracy of the model
determines the end result.
Depending on the type of processing required, the data processing step can be as simple as querying a dataset for averages, modes or medians. On the other hand, it can be as complex as combining multiple sophisticated algorithms for facial recognition, DNA sequencing, or financial market predictions. The duration of the data processing stage therefore varies depending on the requirements.
Since Big Data and its underlying processing algorithms are sometimes difficult to explain to
business leaders, good communication is essential to the success and value of Big Data.
Communicating on a regular basis (e.g., interim reports) and in a structured fashion (every
Friday) will provide teams and decision makers in enterprise organizations the required trust
that structured procedures are followed.
One of the best ways to communicate the results of any Big Data project is to use the data
visualization techniques that were discussed in section 5.9. By summarizing data into graphs
and figures, it becomes easier to understand. Data visualization technologies can be as
powerful as they are easy to use, allowing data analysts to quickly and easily articulate and
share the insights across the enterprise to others who are less comfortable with the nuances
of data analysis.
The focus on data governance and its accompanying process has grown greatly over the last few years, especially due to the increased data privacy and data confidentiality requirements that have been set by individual countries. The data governance process therefore not only needs to set the policies and assign responsibilities across the enterprise, it additionally needs to ensure that the enterprise is compliant with (local) data laws and regulations.
There is a close relationship between the data governance process and the data management process, which will be discussed in the next section. Where the data governance process is primarily concerned with setting policies and ensuring compliance, the data management process focuses on the day-to-day activities that ensure data quality.
Figure 60: The synergy between data governance and data management
The primary objective of the data management process is to ensure data quality. The value
that can be obtained by analyzing Big Data is highly dependent on the quality of the input data.
Even with the most sophisticated Big Data solution the general Garbage-In-Garbage-Out rule
still applies. If data sets are corrupt or erroneous, data analysis might result in invalid results or
conclusions.
Enterprises therefore need the data management process to continually verify, update and
clean the enterprise data. The data management process outlined in this chapter provides a
structured and practical approach to implement the following ideas:
• Enterprises need a way to formalize their data quality expectations, as well as a way to measure the conformance of data to these expectations;
One of the important elements of this process activity is the generation (and subsequent follow
up) of alerts. Alerts need to be generated if it is detected that data has been corrupted or
changed.
Using validation rules and transformation rules, the quality of data can be improved as
depicted in figure 62.
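What such validation rules could look like in practice is sketched below; the rules, field names and the use of pandas are illustrative assumptions, not a prescribed implementation.

import pandas as pd

# Hypothetical customer records containing deliberately flawed rows
records = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, -5, 41, None],
    "email": ["[email protected]", "b@example", "[email protected]", "[email protected]"],
})

# Each validation rule counts the records that violate it; a non-zero count could raise an alert
alerts = {
    "duplicate customer ids": int(records["customer_id"].duplicated().sum()),
    "invalid ages": int(((records["age"] < 0) | records["age"].isna()).sum()),
    "malformed emails": int((~records["email"].str.match(r"^[^@]+@[^@]+\.[^@]+$")).sum()),
}
print(alerts)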
In order to improve this knowledge, training programs can reduce user errors, increase
productivity and increase compliance with key controls. Education addresses core data
principles and data quality practices complemented by role-specific training. In particular, data
collectors have to understand why and how consumers use data.
Old ways of working are deeply ingrained, especially if there is an underlying distrust of Big Data and analytics. Setting up a Big Data organization is therefore just as much change management as it is sourcing the right skills, processes and technology. The benefits of Big
Data can only be reaped if the hearts and minds of the people in the organization are aligned
with the Big Data strategy. Not only will people need to start working in a different way, they
will also need to make decisions differently. Insights that are deduced with Big Data analysis
should be integrated in the daily decision-making process in order to become a data-driven
enterprise (as discussed in chapter 2).
Organizational culture, organizational structures, and job roles have a large impact on the
success of Big Data initiatives. In this chapter, we will therefore review some best practices on
how to establish a data-driven organization.
In many organizations, there is a (large) gap between the first launch of a Big Data project
(initiated by an enthusiastic sponsor) and scaling-up the benefits of a Big Data project across
the enterprise. In order to obtain long-term value from Big Data and become a truly data-driven organization, it is crucial to set up a Big Data Centre of Excellence (BDCoE).
A BDCoE is essential to accelerating Big Data adoption by the enterprise in a fast and
structured manner. It reduces the implementation times drastically and therefore the time-to-
market to deploy new data-driven products and services. More importantly, it ensures that the
best practices and methodologies are shared through different teams in the organization. A
BDCoE should be a live and evolving organizational function that expands and grows as the
organization s needs evolve.
A centralized BDCoE can be the foundation for establishing a data-driven enterprise that values data as a strategic asset. The BDCoE can partner with the business to identify which use cases and opportunities will deliver the most value.
An effective BDCoE consists of five major pillars that together form the structure for obtaining
value from the centralized function.
Since the topic of the Big Data team is so important to achieving success, a more detailed description of Big Data roles and responsibilities is provided in section 7.3. Section 7.4 discusses the required skills of Big Data professionals.
A second important requirement for Big Data labs is hardware that is suitable for Big Data processing. In general, Big Data labs require hardware with considerably more RAM than usual in order to process large data sets.
Proof-of-Concepts are usually requested by internal business units or customers based upon specific Big Data questions (discussed in section 6.2, the first step of the Big Data analysis process). By demonstrating clear Proof-of-Concepts for potential use cases, the BDCoE can showcase its knowledge in the enterprise.
Agile Methodology
Agility and the ability to fail fast or achieve quick results are essential to reaching the potential
of Big Data. An Agile working methodology provides the tools to deliver outcomes quickly and
transparently, typically within two- to three-week sprints. The ability to fail fast is a key Big Data
opportunity ̶ business and technical roadmaps for delivering value need to change more
often than in a traditional waterfall environment.
Charging Models
At the core of the BDCoE are the charging models to justify the (sometimes large) investments
in the people, processes and technology of the Centre. In order to display value, a clear
approach needs to be devised to charge other business units or external clients for services
rendered.
Charging models can be devised based on the number of users, the volume of data processed, or the frequency of reports, or they can be subscription based. A sound and unambiguous charging model will greatly help to showcase the value of the BDCoE to the enterprise.
Big Data analysts are expected to know R, Python, HTML, SQL, C++, and Javascript. They need
to be more than a little familiar with data retrieval and storing systems, data visualization and
data warehousing using ETL tools, Hadoop-based analytics, and Business Intelligence concepts.
These persistent and passionate data miners usually have a strong background in math,
statistics, machine learning, and programming.
Big Data analysts are involved in data crunching and data visualization. If there are requests for data insights from stakeholders, data analysts have to query databases. They are in charge of the data that is scraped, assuring its quality and managing it. They have to interpret data and effectively communicate the findings.
The Big Data scientist job role is a senior role that requires deep understanding of algorithms
and data processing operations. People in this role are expected to be experts in R, SAS,
Python, SQL, MatLab, Hive, Pig, and Spark. Data scientists typically hold higher degrees in
quantitative subjects such as statistics and mathematics and are proficient in Big Data
technologies and analytical tools.
Big Data engineers are computer engineers who must know Pig, Hadoop, MapReduce, Hive,
MySQL, Cassandra, MongoDB, NoSQL, SQL, Data streaming, and programming. Data engineers
have to be proficient in R, Python, Ruby, C++, Perl, Java, SAS, SPSS, and Matlab. Other must-
have skills include knowledge of ETL tools, data APIs, data modeling, and data warehousing
solutions. They are typically not expected to know analytics or machine learning.
Big Data engineers develop, construct, test, and maintain highly scalable data management
systems. Unlike data scientists who seek an exploratory and iterative path to arrive at a
solution, data engineers look for the linear path. Data engineers will improve existing systems
by integrating newer data management technologies. They will develop custom analytics
applications and software components. Data engineers collect and store data, do real-time or
batch processing, and serve it for analysis to data scientists via an API.
The Big Data engineer is frequently also referred to as a "Big Data architect". Since both job roles are very similar in nature (e.g., managing Big Data infrastructure), this guide will use the term Big Data engineer.
The Big Data Analyst focuses on the movement and interpretation of data, typically with a focus on the past and present. The Data Scientist, alternatively, may be primarily responsible for summarizing data in such a way as to provide forecasting, or an insight into the future based on the patterns identified from past and current data. The Big Data Engineer, lastly, is more concerned with making sure the underlying Big Data infrastructure is available before the processing begins.
What are the skills that people who work in Big Data need to have? People working in Big Data
are required to have six core skills for success, as indicated in figure 66. 60
Leadership skills
The domain of Big Data is located between the business domain and IT domain and, at times,
will be under pressure from both sides. People involved in a Big Data team require strong
leadership skills, i.e., the ability to guide or lead other individuals in the organization. Especially
when establishing a Big Data organization (when building the BDCoE), many conflicting interests and priorities will need to be managed.
Technical skills
Big Data is deeply rooted in the domain of Information Technology. Without the technology, it
would not be possible to achieve valuable results. Individuals involved in Big Data should
therefore be strongly interested in technology and should understand the underlying concepts
of Big Data solutions and processing technologies. Most of the roles discussed in the previous
section outlined key skills and requirements that are requested most throughout the job
market. Basic skills in technologies such as R, SAS, Python, SQL, MatLab, Hive, Pig, and Spark
are highly recommended to succeed in the domain of Big Data.
Analytical skills
Without the analysis of data and insights, a business wouldn't be able to function effectively. As a Big Data professional, it is important to have a solid understanding of the business environment and the domain in which the business you work for operates. The ability to visualize and interpret data is also an essential Big Data skill that combines both creativity and science. Being able to visualize and analyze data requires a lot of precise hard science and mathematics, but it also calls for creativity, imagination, and curiosity.
Statistical skills
Big Data, and the analysis of data in general, is always based on the scientific domain of statistics. Data processing and algorithms perform statistical operations on large data sets.
Communication skills
A requirement to be successful in almost any profession, communication skills are essential for
Big Data professionals. Big Data analysis, operations and algorithms are frequently complex
and require a deeper subject matter expertise. In order to explain concepts or provide
progress updates to business leaders, strong communication skills are required. The ability to
translate complex processes and calculations into easy-to-understand summaries and advice is
one of the most essential elements of success in Big Data projects.
1) Establish a vision on how to create value: The first milestone is to gain a clear view of what your organization is trying to accomplish with Big Data. The fact that your organization captures terabytes of data on a daily basis is meaningless if there is not a clear vision of how that data will be used to create value.
2) To succeed with Big Data, start small: Building Big Data capabilities takes time. A one-time large investment in a Big Data team is not going to produce immediate results. Therefore, a small start with controlled growth is recommended. First, define a few relatively simple Big Data projects that won't take much time or data to run. For example, an online retailer might start by identifying what products each customer viewed so that the company can send a follow-up offer if they don't purchase. A few intuitive examples
like this allow the organization to see what the data can do. More importantly, this
approach yields results that are easy to test to see what type of returns Big Data
provides.
3) Establish Big Data processes from the start: Make it clear from the very beginning
who is responsible for what. Design effective data governance and data management
processes, specifying who is responsible for data definition, creation, verification,
curation, and validation̶the business, IT, or the BDCoE. Section 6.1 discusses Big Data
processes in more detail.
5) Assess your readiness for Big Data: In order to determine where potential gaps and risks might arise, conduct a Big Data readiness assessment. This is an assessment of the readiness of your IT environment and in-house skill sets to implement your organization's Big Data projects, and of the extent to which members of your existing team can be empowered as citizen data scientists to put the power of Big Data to work and drive the business forward.
6) Set up an on-going Big Data training program: Knowledge and skills are the most important keys to success, yet among the most difficult elements to obtain. Skilled Big Data professionals are not easy to find, and even when a team is established, its members require continual updates to their knowledge in order to grow further. Setting up an on-going Big Data education program will increase the competency of the organization and embed a culture of continuous learning.
Artificial Intelligence is the art of creating machines that perform functions that require
intelligence when performed by people. 62 Although there are many different theories about
what intelligence actually means, this guide will focus on the operational definition of
intelligence by the mathematician and computer scientist Alan Turing, who developed one of the first theories on Artificial Intelligence. According to this definition, a computer possesses intelligence if a human interrogator, given the task of determining which of two players (A or B) is a computer and which is a human, is unable to tell the difference. The interrogator is limited to using written questions. This operational definition of intelligence is now famously known as the Turing test.63
• Natural language processing ̶ the computer needs to be able to communicate successfully in a human language.
• Knowledge representation ̶ the computer needs to store input data and retrieve that same data at a later time.
• Automated reasoning ̶ the computer needs to be able to use the stored information
to answer questions and draw conclusions. In order to achieve this, the computer would
need to apply an algorithm.
• Machine learning ̶ the computer needs to adapt its response to previous input data in
order to formulate new responses.
Each of these four disciplines is integral to the domain of Artificial Intelligence, and it is
therefore easy to determine the link between Big Data and Artificial intelligence. The same
statistical techniques and algorithms (discussed in chapter 5) that are applied for Big Data
analysis are used in the study of Artificial Intelligence.
The main difference between Big Data and Artificial Intelligence is that, where Big Data analysis and analytics mostly stop at predictive and prescriptive analytics, Artificial Intelligence goes one step further. Artificial Intelligence aims to include cognitive science techniques in order to model the human brain. However, there is strong overlap between Big Data and AI, and the two domains continue to reinforce each other.
Investments in Artificial Intelligence are growing fast, predominantly in the tech sector, with companies such as Google and Baidu leading the way. The McKinsey Global Institute estimates that between $20 billion and $30 billion was spent on AI research and development in 2016.65
However, outside of the tech sector, AI is frequently at early and experimental stages. Most
organizations concentrate their main efforts on machine learning ̶ one of the enabling
capabilities of AI (further discussed in section 8.3).
• So how can businesses benefit from Artificial Intelligence in a practical way, with clearly defined business objectives and return on investment? In recent years, a number of use cases have shown that AI can bring long-term business value.
• Highly autonomous (self-driving) cars are being developed at most car production companies and are forecast to make up 10 to 15% of global car sales by 2030. The concept of self-driving cars can only be realized with AI technology, because the car needs to make decisions in the same way a human does.
• The use of virtual personal assistants has grown rapidly in recent years, and they are now included in almost every smartphone. Apple has the famous personal assistant Siri, Microsoft has Cortana, and Amazon launched its own virtual assistant Alexa in 2014. Virtual personal assistants need to be able to interpret (spoken) speech and translate it into answers, which can only be realized with Artificial Intelligence techniques.
• Call centre solutions based on AI provide real-time input to service desk agents about the emotions of the persons on the phone. The technology detects whether customers on the other side of the line are happy, angry or afraid, and adjusts the scripts and advice accordingly. The detection of emotions and their translation into the correct solution is a clear application of Artificial Intelligence.
• Smart thermostats in homes can now adjust the temperature based on the individual characteristics of their users. These thermostats learn the behavioral patterns of the persons living in the home and autonomously adjust their decisions based on these patterns.
As these examples show, the business opportunities that can be realized by applying Artificial Intelligence are numerous and cover many different business domains. AI has great potential in almost every industry, because it facilitates better decision making, almost to the level of human decisions.
Cognitive analytics is the design and development of algorithms that are able to reflect human decision making, based on the perceived environment and personalized characteristics. Cognitive analytics differs from other forms of analytics for two main reasons:
In order to achieve these two key characteristics of Artificial Intelligence, cognitive analytics concerns itself with the development of rational agents. An agent is simply something that acts (from the Latin agere, which means "to do"). A rational agent is one that acts so as to achieve the best outcome or, where there is uncertainty, the best expected outcome.66 A rational agent therefore tries to mimic the rational decisions that are made by humans, and cognitive analytics concentrates on the design and development of such agents.
As can be seen in figure 69, an agent perceives data from a specific environment (traffic in the case of self-driving cars, or speech in the case of Siri) through one or more sensors. The agent subsequently processes this data (with some kind of algorithm) and then takes an action that affects the environment.
External data sources (Big Data) provide input or reference data to the rational agent for its calculations. Since these external data sources can also be updated in real time (weather forecasts, for example), decisions might vary per individual person.
Let's consider the rational intelligent agent for a self-driving car as an example. The environment provides thousands of input signals to the rational agent. These input signals could be traffic light colors, the distance to the object in front, the speed of the car behind, etc. The agent combines this input data with external data from other sources. This external data could be the vehicle's driving history (personalized data) or data from external databases, such as the weather forecast (if rain or snow is expected, the car's speed is reduced). Based on the input data of the sensors and the external data, the rational agent makes a decision. In this example, that means the car increases or decreases speed, brakes, changes gears or changes direction. The decision of the rational agent has a direct impact on the environment, because it needs to ensure that everyone can drive safely.
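The decision logic of such a rational agent can be sketched in a few lines of code; the percepts, thresholds and external data below are purely hypothetical and greatly simplified compared to a real self-driving system.

def decide(percepts, external_data):
    # Combine sensor input with external reference data to choose an action.
    target_speed = 80 if external_data["rain_expected"] else 120  # reduce speed if rain or snow is expected
    if percepts["traffic_light"] == "red" or percepts["distance_to_object_m"] < 10:
        return "brake"
    if percepts["speed_kmh"] > target_speed:
        return "decelerate"
    return "accelerate"

percepts = {"traffic_light": "green", "distance_to_object_m": 45, "speed_kmh": 95}
external_data = {"rain_expected": True}
print(decide(percepts, external_data))  # prints "decelerate": 95 km/h exceeds the rainy-weather target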
Most Artificial Intelligence solutions will need to use some form of NLP in order to facilitate the
transfer of data from the environment to the rational agent (as depicted in figure 70). In the
example of speech recognition in call centre operations, NLP needs to detect the language of
the person, detect the sequences of words, and potentially detect emotions in the way the message is communicated.
Because of the sheer number of different combinations that are possible in natural language, the development of NLP applications relies heavily on Big Data environments. The English dictionary consists of approximately 170,000 words, and the number of Chinese words is approximately 370,000. The number of combinations that can be made with these words is almost infinite, and the amount of data necessary to store and process these combinations can run into multiple zettabytes.
Key challenges in NLP have to do with syntax and semantics. Syntax is the way sentences are constructed and how combinations of words give meaning to a sentence. Semantics, on the other hand, concerns the meaning of the individual words and of the sentence as a whole.
Knowledge representation
Knowledge representation is the field of Artificial Intelligence dedicated to representing
information about the world in a form that a computer system can utilize to solve complex
tasks. Knowledge representation incorporates findings from psychology about how humans
solve problems and represent knowledge in order to design logical statements that make
complex systems easier to design and build. As such, it heavily relies on the application of logic
in order to model reasoning.
Since most data sets are heterogeneous in terms of their type, structure and accessibility, they
pose challenges for computer systems to interpret them in a systematic manner. Knowledge
representation helps to identify where data is stored and how it can be retrieved at a later
stage when it requires processing. In particular, it aims at building systems that know about
their world and are able to act in an informed way in it, as humans do. A crucial part of these
systems is that knowledge is represented symbolically, and that reasoning procedures are able
to extract consequences of such knowledge as new symbolic representations.67
Automated reasoning
Automated reasoning in Artificial Intelligence is the capability that concerns itself with building reasoning capabilities into computer systems. The goal of automated reasoning is to design computer systems that can reason completely automatically (without
human involvement). Automated reasoning is necessary in the design of any Artificial
Intelligence system in order to mimic the process that happens in the human brain. Given the
conditions that can be observed or sensed, the computer system needs to arrive at the best
possible conclusion by following an (automated) thought process.
Machine Learning
Machine Learning, as introduced in section 1.7, is one of the fundamental capabilities required
for both Big Data analytics as well as Artificial Intelligence. The objective of machine learning is
to design a system that improves and gets better over time. Just like humans memorize
information or relationships when they are presented to them, so can computer systems learn
from previous interactions. Common machine learning algorithms (classification, regression and
clustering) were discussed in chapter 5. A more in-depth review will be discussed in the
Enterprise Big Data Scientist guide.
Figure 71: The evolution of AI, Machine Learning and Deep Learning. Source: NVIDIA
Deep learning is a type of machine learning that can process a wider range of data resources,
requires less data pre-processing by humans, and can often produce more accurate results
than traditional machine learning approaches (although it requires a larger amount of data to
do so).
Conventional machine-learning techniques that are used in the analysis of Big Data are limited in their ability to process data in its raw form. In the example of facial recognition systems, raw data (i.e., photos of individuals) first needs to be transformed into feature vectors that the learning algorithm can process.
Deep Learning solves this problem by learning data representations. This allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. In order to achieve this, Deep Learning breaks down raw data into a number of layers (trained using the backpropagation algorithm), and subsequently compares these layers with each other. Using this technique, it becomes more efficient to break down large data sets into structured pieces of information that can be analyzed.
Deep Learning is predominantly used in processing images, video, speech and audio. Examples
of situations in which Deep Learning is used are depicted in figure 72.
A more in-depth review of Big Data analysis techniques and Big Data algorithms is provided in the Enterprise Big Data Analyst and Enterprise Big Data Scientist professional guides. We highly encourage students and data enthusiasts to download copies of these guides in order to further study the world of Big Data.
2Sagiroglu, S. and Sinanc, D., 2013, May. Big data: A review. In Collaboration Technologies and Systems (CTS), 2013 International
Conference on (pp. 42-47). IEEE.
3 O'Reilly Radar Team, 2011. Big data now: Current perspectives from O'Reilly Radar. O'Reilly Media, Incorporated.
4Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Byers, A.H., 2011. Big data: The next frontier for innovation,
competition, and productivity.
5Mashey, J.R., 1997, October. Big Data and the next wave of infraS-tress. In Computer Science Division Seminar, University of California,
Berkeley.
6 Gantz, John, and David Reinsel. "The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east." IDC
7 Chen, H., Chiang, R.H. and Storey, V.C., 2012. Business intelligence and analytics: From big data to big impact. MIS quarterly, 36(4).
8Sagiroglu, S. and Sinanc, D., 2013, May. Big data: A review. In Collaboration Technologies and Systems (CTS), 2013 International
Conference on (pp. 42-47). IEEE.
9 Lee, J., Bagheri, B. and Kao, H.A., 2014. Recent advances and trends of cyber-physical systems and big data analytics in industrial
informatics. In Proceedings of the International Conference on Industrial Informatics (INDIN) (pp. 1-6).
11 Dedić, N. and Stanier, C., 2016. Measuring the Success of Changes to Existing Business Intelligence Solutions to Improve Business
Intelligence Reporting. Lecture Notes in Business Information Processing. Springer International Publishing. pp. 225–236.
12 Wegner, P. and Reilly, E.D. Encyclopedia of Computer Science. Chichester, UK: John Wiley and Sons Ltd. pp. 507–512.
ISBN 0470864125.
13 Buneman, P., 1997, May. Semistructured data. In Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on
Principles of database systems (pp. 117-121). ACM.
14 Zikopoulos, P. and Eaton, C., 2011. Understanding big data: Analytics for enterprise class hadoop and streaming data. McGraw-Hill
Osborne Media.
15 Copeland, M., 2016. What’s the Difference Between Artificial Intelligence, Machine Learning, and Deep Learning? [ONLINE]
Available at: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/.
[Accessed 4 April 2018].
16 Clark, J., 2015. Why 2015 was a breakthrough year in artificial intelligence. Bloomberg Business, 8.
19 Braun, H.T., 2015. Evaluation of Big Data maturity models – a benchmarking study to support Big Data maturity assessment in
organizations.
21 IDG Enterprise, 2016. 2016 Data & Analytics Research. [ONLINE] Available at: https://www.idg.com/tools-for-marketers/tech-2016-
data-analytics-research/. [Accessed 25 February 2018].
22 Dallemule, L. and Davenport, T.H., 2017. What’s Your Data Strategy? Harvard Business Review, 95(3), pp.112-121.
23 Simon, P., 2013. Too big to ignore: The business case for big data (Vol. 72). John Wiley & Sons.
24 John Walker, S., 2014. Big data: A revolution that will transform how we live, work, and think.
26 Davenport, T.H. and Patil, D.J., 2012. Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, October 2012, pp. 70–76.
27 Tufekci, Z., 2014. Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls. ICWSM, 14,
pp.505-514.
28 Sudhahar, S., Veltri, G.A. and Cristianini, N., 2015. Automated analysis of the US presidential elections using Big Data and network
analysis. Big Data & Society, 2(1), p.2053951715572916.
29 Nordrum, A., 2016. Popular internet of things forecast of 50 billion devices by 2020 is outdated. IEEE Spectrum, 18.
30 Brown, B., Chui, M. and Manyika, J., 2011. Are you ready for the era of ‘big data’. McKinsey Quarterly, 4(1), pp.24-35.
31 Schmarzo, B., 2013. Big Data: Understanding how data powers big business. John Wiley & Sons.
32 Maier, M., Serebrenik, A. and Vanderfeesten, I.T.P., 2013. Towards a big data reference architecture. University of Eindhoven.
33 NIST Big Data Public Working Group Reference Architecture Subgroup, 2017. NIST Big Data Interoperability Framework: Volume 6,
Reference Architecture. [ONLINE] Available at: https://bigdatawg.nist.gov/V2_output_docs.php. [Accessed 21 January 2018].
34 A Business Resolution Engine for Cloud Marketplaces, IEEE Third International Conference on Cloud Computing Technology and
Science (CloudCom), IEEE, 2011, pp. 462-469.
36 Shvachko, K., Kuang, H., Radia, S. and Chansler, R., 2010, May. The hadoop distributed file system. In Mass storage systems and
technologies (MSST), 2010 IEEE 26th symposium on (pp. 1-10). IEEE.
37 Chen, M., Mao, S. and Liu, Y., 2014. Big data: A survey. Mobile networks and applications, 19(2), pp.171-209.
39 Mohan, C., 2013, March. History repeats itself: sensible and NonsenSQL aspects of the NoSQL hoopla. In Proceedings of the 16th
International Conference on Extending Database Technology (pp. 11-16). ACM.
42 Borthakur, D., 2007. The hadoop distributed file system: Architecture and design. Hadoop Project Website, 11(2007), p.21.
45 Dixon, W.J. and Massey, F.J., 1950. Introduction to Statistical Analysis. McGraw-Hill Book Company, Inc; New York.
46 Johnson, N.L., Kotz, S. and Balakrishnan, N., 1994. Continuous Univariate Distributions (Vol. 1).
48 Everitt, B.S., Landau, S., Leese, M. and Stahl, D., 2011. Hierarchical clustering. Cluster Analysis, 5th Edition, pp.71-110.
51 Bughin, J., 2016. Big data: Getting a better read on performance. The McKinsey Quarterly.
52 Leek, J.T. and Peng, R.D., 2015. What is the question? Science, 347(6228), pp.1314-1315.
53 Wu, S., 2013. A review on coarse warranty data and analysis. Reliability Engineering & System Safety, 114, pp.1-11.
54 Judd, C.M., McClelland, G.H. and Ryan, C.S., 2011. Data analysis: A model comparison approach. Routledge.
56 Otto, B., Wende, K., Schmidt, A. and Osl, P., 2007. Towards a framework for corporate data quality management. ACIS 2007
Proceedings, p.109.
57 Franks, B., 2012. Taming the big data tidal wave: Finding opportunities in huge data streams with advanced analytics (Vol. 49). John Wiley & Sons.
58 IBM Big Data & Analytics Hub, 2016. Building a big data center of excellence. [ONLINE] Available at: http://www.ibmbigdatahub.com/
blog/building-big-data-center-excellence. [Accessed 11 February 2018].
59 Heneghan, L. and Snyder, M.E., 2016. Harvey Nash / KPMG CIO Survey 2016. [ONLINE] Available at: https://home.kpmg.com/xx/en/
home/insights/2016/05/harvey-nash-kpmg-cio-survey-2016.html. [Accessed 11 February 2018].
60 Debortoli, S., Müller, O. and vom Brocke, J., 2014. Comparing business intelligence and big data skills. Business & Information Systems
Engineering, 6(5), pp.289-300.
61 Katal, A., Wazid, M. and Goudar, R.H., 2013, August. Big data: issues, challenges, tools and good practices. In Contemporary
Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
62 Kurzweil, R., Richter, R., Kurzweil, R. and Schneider, M.L., 1990. The age of intelligent machines (Vol. 579). Cambridge: MIT Press.
63 Turing, A.M., 2009. Computing machinery and intelligence. In Parsing the Turing Test (pp. 23-65). Springer, Dordrecht.
64 Russell, S.J., Norvig, P., Canny, J.F., Malik, J.M. and Edwards, D.D., 2003. Artificial intelligence: a modern approach (Vol. 2, No. 9).
Upper Saddle River: Prentice Hall.
65 Chui, M., 2017. Artificial intelligence: The next digital frontier? McKinsey and Company Global Institute, p.47.
66 Russell, S.J., Norvig, P., Canny, J.F., Malik, J.M. and Edwards, D.D., 2003. Artificial intelligence: a modern approach (Vol. 2, No. 9).
Upper Saddle River: Prentice Hall.
67 Baral, C. and De Giacomo, G., 2015, January. Knowledge Representation and Reasoning: What's Hot. In AAAI (pp. 4316-4317).