BDA CH1
Main points
• Overview of Big Data, Big Data Characteristics
• Different Types of Data
• Data Analytics
• Data Analytics Life Cycle
Data Analytics
Data analytics is a broader term that encompasses data analysis. It is a discipline that includes the management of the complete data lifecycle, which encompasses collecting, cleansing, organizing, storing, analyzing and governing data. The term also covers the development of analysis methods, scientific techniques and automated tools. In Big Data environments, data analytics methods have been developed to allow data analysis to be carried out with highly scalable distributed technologies and frameworks that are capable of analyzing large volumes of data from different sources. The Big Data analytics lifecycle generally involves identifying, procuring, preparing and analyzing large amounts of raw, unstructured data to extract meaningful information that can serve as input for identifying patterns, enriching existing enterprise data and performing large-scale searches.
Different kinds of organizations use data analytics tools and techniques in
different ways.
Take, for example, these three sectors:
• In business-oriented environments, data analytics results can lower operational costs and facilitate strategic decision-making.
• In the scientific domain, data analytics can help identify the cause of a phenomenon to improve the accuracy of predictions.
• In service-based environments like public sector organizations, data analytics can help strengthen the focus on delivering high-quality services by driving down costs.
Data analytics enables data-driven decision-making with scientific backing, so that decisions are based on factual data rather than on past experience or intuition alone. There are four general categories of analytics, distinguished by the results they produce:
• descriptive analytics
• diagnostic analytics
• predictive analytics
• prescriptive analytics
The different analytics types leverage different techniques and analysis
algorithms. This implies that there may be varying data, storage and
processing requirements to facilitate the delivery of multiple types of
analytic results.
Descriptive Analytics
Descriptive analytics are carried out to answer questions about events
that have already occurred. This form of analytics contextualizes data to
generate information.
Sample questions can include:
• What was the sales volume over the past 12 months?
• What is the number of support calls received as categorized by severity and geographic location?
• What is the monthly commission earned by each sales agent?
It is estimated that 80% of generated analytics results are descriptive in nature. In terms of value, descriptive analytics provide the least worth and
require a relatively basic skillset. Descriptive analytics are often carried
out via ad-hoc reporting or dashboards, as shown in Figure 1.5. The
reports are generally static in nature and display historical data that is
presented in the form of data grids or charts. Queries are executed on
operational data stores from within an enterprise, for example a Customer
Relationship Management system (CRM) or Enterprise Resource Planning
(ERP) system.
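As an illustration, the following minimal sketch shows how one such descriptive question ("What was the sales volume over the past 12 months?") might be answered in Python with pandas. The file name and column names are assumptions for the example; in practice the query would typically run against the CRM or ERP data store itself.

# A minimal sketch of a descriptive query; "sales_orders.csv" and its
# columns ("order_date", "amount") are hypothetical.
import pandas as pd

orders = pd.read_csv("sales_orders.csv", parse_dates=["order_date"])

# Keep only the past 12 months of orders.
cutoff = orders["order_date"].max() - pd.DateOffset(months=12)
recent = orders[orders["order_date"] >= cutoff]

# Roll the facts up into a monthly sales volume report.
monthly_volume = recent.groupby(recent["order_date"].dt.to_period("M"))["amount"].sum()
print(monthly_volume)  # describes what has already happened, nothing more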
Diagnostic Analytics
Diagnostic analytics aim to determine the cause of a phenomenon that
occurred in the past
using questions that focus on the reason behind the event. The goal of
this type of analytics is to determine what information is related to the
phenomenon in order to enable answering questions that seek to
determine why something has occurred.
Such questions include:
• Why were Q2 sales less than Q1 sales?
• Why have there been more support calls originating from the Eastern region than from the Western region?
• Why was there an increase in patient re-admission rates over the past three months?
Diagnostic analytics provide more value than descriptive analytics but
require a more advanced skillset. Diagnostic analytics usually require
collecting data from multiple sources and storing it in a structure that
lends itself to performing drill-down and roll-up analysis, as shown in
Figure 1.6. Diagnostic analytics results are viewed via interactive
visualization tools that enable users to identify trends and patterns. The
executed queries are more complex compared to those of descriptive
analytics and are performed on multidimensional data held in analytic
processing systems.
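To make the roll-up and drill-down idea concrete, the sketch below uses Python with pandas on a hypothetical support-call extract (the file and its "region", "severity" and "call_id" columns are assumptions); real diagnostic work would usually run against a multidimensional, OLAP-style store.

# A minimal sketch of roll-up and drill-down over support-call data.
import pandas as pd

calls = pd.read_csv("support_calls.csv")  # hypothetical extract

# Roll-up: total calls per region (the summary that raises the question).
calls_per_region = calls.groupby("region")["call_id"].count()

# Drill-down: break the Eastern region out by severity to look for the cause.
eastern_by_severity = (
    calls[calls["region"] == "Eastern"]
    .groupby("severity")["call_id"]
    .count()
)

print(calls_per_region)
print(eastern_by_severity)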
Predictive Analytics
Predictive analytics are carried out in an attempt to determine the
outcome of an event that might occur in the future. With predictive
analytics, information is enhanced with meaning to generate knowledge
that conveys how that information is related. The strength and magnitude
of the associations form the basis of models that are used to generate
future predictions based upon past events. It is important to understand
that the models used for predictive analytics have implicit dependencies
on the conditions under which the past events occurred. If these
underlying conditions change, then the models that make predictions
need to be updated.
Questions are usually formulated using a what-if rationale, such as the
following:
• What are the chances that a customer will default on a loan if they
have missed a monthly payment?
• What will be the patient survival rate if Drug B is administered
instead of Drug A?
• If a customer has purchased Products A and B, what are the
chances that they will also purchase Product C?
Predictive analytics try to predict the outcomes of events, and predictions
are made based on patterns, trends and exceptions found in historical
and current data. This can lead to the identification of both risks and
opportunities.
This kind of analytics involves the use of large datasets composed of internal and external data and various data analysis techniques. It provides greater value and requires a more advanced skillset than both descriptive and diagnostic analytics. The tools used generally abstract the underlying statistical intricacies by providing user-friendly front-end interfaces.
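The sketch below illustrates the idea with the loan-default question, using scikit-learn's logistic regression as a stand-in for whatever modeling tool is actually used; the file "loan_history.csv" and its columns are assumptions for the example.

# A minimal sketch of a predictive model; dataset and features are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

history = pd.read_csv("loan_history.csv")
X = history[["missed_payments", "loan_amount", "income"]]
y = history["defaulted"]

# Fit on past events; the model remains valid only while the underlying
# conditions that produced this history still hold.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Estimate the chance of default for a customer who has missed one payment.
new_customer = pd.DataFrame([{"missed_payments": 1, "loan_amount": 12000, "income": 45000}])
print(model.predict_proba(new_customer)[:, 1])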
Prescriptive Analytics
Prescriptive analytics build upon the results of predictive analytics by
prescribing actions that should be taken. The focus is not only on which
prescribed option is best to follow, but why. In other words, prescriptive
analytics provide results that can be reasoned about because they embed
elements of situational understanding. Thus, this kind of analytics can be
used to gain an advantage or mitigate a risk.
Sample questions may include:
• Among three drugs, which one provides the best results?
• When is the best time to trade a particular stock?
Prescriptive analytics provide more value than any other type of analytics
and correspondingly require the most advanced skillset, as well as
specialized software and tools. Various outcomes are calculated, and the
best course of action for each outcome is suggested. The approach shifts
from explanatory to advisory and can include the simulation of various
scenarios.
This sort of analytics incorporates internal data with external data.
Internal data might include current and historical sales data, customer
information, product data and business rules. External data may include
social media data, weather forecasts and government produced
demographic data. Prescriptive analytics involve the use of business rules
and large amounts of internal and external data to simulate outcomes
and prescribe the best course of action.
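A minimal sketch of the scenario-simulation idea follows; the candidate actions, prices, costs and demand model are all illustrative assumptions, not a real prescriptive engine.

# Simulate several pricing actions and prescribe the one with the best
# average outcome. All figures below are invented for illustration.
import random

random.seed(42)
DEMAND_FORECAST = 1000  # units, taken from an upstream predictive model
ACTIONS = {"discount_10_percent": 0.90, "keep_price": 1.00, "premium_bundle": 1.15}
BASE_PRICE, UNIT_COST = 50.0, 30.0

def simulated_profit(price_factor: float) -> float:
    """Average profit over simulated demand scenarios for one pricing action."""
    profits = []
    for _ in range(500):
        demand = DEMAND_FORECAST * random.uniform(0.8, 1.2) / price_factor
        profits.append(demand * (BASE_PRICE * price_factor - UNIT_COST))
    return sum(profits) / len(profits)

best_action = max(ACTIONS, key=lambda name: simulated_profit(ACTIONS[name]))
print("Recommended action:", best_action)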
Big Data Characteristics
Volume
The anticipated volume of data that is processed by Big Data solutions is
substantial and ever-growing. High data volumes impose distinct data
storage and processing demands, as well as additional data preparation,
curation and management processes.
Typical data sources that are responsible for generating high data
volumes can include:
• online transactions, such as point-of-sale and banking
• scientific and research experiments, such as the Large Hadron
Collider and Atacama Large Millimeter/Submillimeter Array telescope
• sensors, such as GPS sensors, RFIDs, smart meters and telematics
• social media, such as Facebook and Twitter
Velocity
In Big Data environments, data can arrive at fast speeds, and enormous
datasets can accumulate within very short periods of time. From an
enterprise’s point of view, the velocity of data translates into the amount
of time it takes for the data to be processed once it enters the
enterprise’s perimeter. Coping with the fast inflow of data requires the
enterprise to design highly elastic and available data processing solutions
and corresponding data storage capabilities. Depending on the data source, velocity may not always be high.
Variety
Data variety refers to the multiple formats and types of data that need to
be supported by Big Data solutions. Data variety brings challenges for
enterprises in terms of data integration, transformation, processing, and
storage. Figure 1.14 provides a visual representation of data variety, which includes structured data in the form of financial transactions, semi-structured data in the form of emails and unstructured data in the form of images.
Veracity
Veracity refers to the quality or fidelity of data. Data that enters Big Data
environments needs to be assessed for quality, which can lead to data
processing activities to resolve invalid data and remove noise. In relation
to veracity, data can be part of the signal or noise of a dataset. Noise is
data that cannot be converted into information and thus has no value,
whereas signals have value and lead to meaningful information. Data with
a high signal-to-noise ratio has more veracity than data with a lower ratio.
Data that is acquired in a controlled manner, for example via online
customer registrations, usually contains less noise than data acquired via
uncontrolled sources, such as blog postings. Thus the signal-to-noise ratio
of data is dependent upon the source of the data and its type.
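The sketch below shows one simple way such a quality assessment could look in Python with pandas; the file, columns and validity rules are assumptions for the example.

# A minimal sketch of a veracity check: rows passing the quality rules are
# treated as signal, the rest as noise. "registrations.csv" is hypothetical.
import pandas as pd

records = pd.read_csv("registrations.csv")

# Illustrative quality rules: an email must be present and age must be plausible.
is_signal = records["email"].notna() & records["age"].between(18, 120)

noise_count = int((~is_signal).sum())
ratio = is_signal.sum() / max(noise_count, 1)
print(f"signal-to-noise ratio: {ratio:.2f}")

clean = records[is_signal]  # only the signal is passed on for analysis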
Value
Value is defined as the usefulness of data for an enterprise. The value
characteristic is intuitively related to the veracity characteristic in that the
higher the data fidelity, the more value it holds for the business. Value is
also dependent on how long data processing takes because analytics
results have a shelf-life; for example, a 20 minute delayed stock quote
has little to no value for making a trade compared to a quote that is 20
milliseconds old. As demonstrated, value and time are inversely related.
The longer it takes for data to be turned into meaningful information, the
less value it has for a business. Stale results inhibit the quality and speed
of informed decision-making. Figure 1.15 provides two illustrations of how
value is impacted by the veracity of data and the timeliness of generated
analytic results.
Data Visualization
The ability to analyze massive amounts of data and find useful insights
carries little value if the only ones that can interpret the results are the
analysts. The Data Visualization stage is dedicated to using data
visualization techniques and tools to graphically communicate the
analysis results for effective interpretation by business users.
Business users need to be able to understand the results in order to
obtain value from the analysis and subsequently have the ability to
provide feedback, as indicated by the dashed line leading from stage 8
back to stage 7. The results of completing the Data Visualization stage
provide users with the ability to perform visual analysis, allowing for the
discovery of answers to questions that users have not yet even
formulated.
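As a small illustration of this stage, the sketch below turns an analysis result into a chart with matplotlib; the figures and file name are invented, and in practice the output would feed a dashboard or reporting tool used by the business.

# A minimal sketch of the Data Visualization stage.
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative analysis result: sales volume per month.
monthly_volume = pd.Series(
    [120, 135, 128, 150],
    index=["2024-01", "2024-02", "2024-03", "2024-04"],
)

fig, ax = plt.subplots(figsize=(6, 3))
monthly_volume.plot.bar(ax=ax)
ax.set_xlabel("Month")
ax.set_ylabel("Sales volume")
ax.set_title("Monthly sales volume")
fig.tight_layout()
fig.savefig("monthly_sales.png")  # exported for a dashboard or report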
Utilization of Analysis Results
After analysis results are made available to business users to support business decision-making, for example via dashboards, there may be further opportunities to utilize the analysis results. The Utilization of Analysis Results stage is dedicated to determining how and where
processed analysis data can be further leveraged. Depending on the
nature of the analysis problems being addressed, it is possible for the
analysis results to produce “models” that encapsulate new insights and
understandings about the nature of the patterns and relationships that
exist within the data that was analyzed. A model may look like a
mathematical equation or a set of rules. Models can be used to improve
business process logic and application system logic, and they can form
the basis of a new system or software program.
Common areas that are explored during this stage include the following:
• Input for Enterprise Systems – The data analysis results may be automatically or manually fed directly into enterprise systems to enhance and optimize their behaviors and performance. For example, an online store can be fed processed customer-related analysis results that may impact how it generates product recommendations. New models may be used to improve the programming logic within existing enterprise systems or may form the basis of new systems.
• Business Process Optimization – The identified patterns, correlations and anomalies discovered during the data analysis are used to refine business processes. An example is consolidating transportation routes as part of a supply chain process. Models may also lead to opportunities to improve business process logic.
• Alerts – Data analysis results can be used as input for existing alerts or may form the basis of new alerts. For example, alerts may be created to inform users via email or SMS text about an event that requires them to take corrective action.
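The sketch below shows how an analysis result might be turned into a simple alert rule; the metric name, threshold and notify() stub are hypothetical placeholders rather than a real messaging integration.

# A minimal sketch of an alert derived from analysis results.
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    threshold: float

    def breached(self, value: float) -> bool:
        return value > self.threshold

def notify(message: str) -> None:
    # In practice this would send an email or SMS text; here it just prints.
    print("ALERT:", message)

# Rule derived from diagnostic findings about Eastern-region call volumes.
rule = AlertRule(metric="eastern_support_calls_per_hour", threshold=40.0)

latest_value = 57.0  # illustrative reading from the monitoring feed
if rule.breached(latest_value):
    notify(f"{rule.metric} = {latest_value} exceeds threshold {rule.threshold}")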