Unit II
Data Analytics
INTRODUCTION TO ANALYTICS
• Analytics is often used to discover, interpret, and
communicate meaningful patterns in data. In business,
healthcare, sports, and many other fields, analytics helps
to inform decision-making and improve efficiency,
effectiveness, and profitability.
• Data Analytics refers to the techniques to analyze data
to enhance productivity and business gain. Data is
extracted from various sources and is cleaned and
categorized to analyze different behavioral patterns. The
techniques and the tools used vary according to the
organization or individual.
Types of analytics
1. Descriptive Analytics (“What has happened?”)
(Data aggregation, summary, data mining)
2. Predictive Analytics (“What might happen?”)
(Regression, LSE, MLE)
3. Prescriptive Analytics (“What should we do?”)
(Optimization, Recommendation)
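As a small sketch (hypothetical sales figures, plain Python): descriptive analytics summarizes what has happened, while predictive analytics fits a least-squares (LSE) trend line to estimate what might happen next.

```python
# Hypothetical monthly sales figures (illustrative only).
sales = [100, 110, 125, 138, 151]  # units sold per month

# Descriptive analytics: "What has happened?" -- aggregate and summarize.
total = sum(sales)
average = total / len(sales)

# Predictive analytics: "What might happen?" -- least-squares trend line,
# fit y = a*x + b by hand and extrapolate one month ahead.
n = len(sales)
xs = list(range(n))
x_mean = sum(xs) / n
y_mean = average
a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales)) / \
    sum((x - x_mean) ** 2 for x in xs)
b = y_mean - a * x_mean
forecast = a * n + b  # predicted sales for month 6

print(total, round(average, 1), round(forecast, 1))
```

Prescriptive analytics would go one step further and recommend an action (e.g., stock levels) based on the forecast.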
Data Analytics vs Data Analysis

Form
• Data Analytics: the 'general' form of analytics, used in businesses to make data-driven decisions.
• Data Analysis: a specialized form of data analytics used in businesses to analyze data and draw insights from it.

Structure
• Data Analytics: consists of data collection and inspection in general, and it has one or more users.
• Data Analysis: consists of defining the data, then investigating, cleaning, and transforming it to produce a meaningful outcome.

Tools
• Data Analytics: there are many analytics tools in the market, but mainly R, Tableau Public, Python, SAS, Apache Spark, and Excel are used.
• Data Analysis: for analyzing data, tools such as OpenRefine, KNIME, RapidMiner, Google Fusion Tables, Tableau Public, NodeXL, and WolframAlpha are used.

Sequence
• Data Analytics: the life cycle consists of Business Case Evaluation, Data Identification, Data Acquisition & Filtering, Data Extraction, Data Validation & Cleansing, Data Aggregation & Representation, Data Analysis, Data Visualization, and Utilization of Analysis Results.
• Data Analysis: the sequence followed is data gathering, data scrubbing, analysis of data, and precise interpretation, so that you understand what your data is telling you.

Usage
• Data Analytics: in general, can be used to find hidden patterns, unknown correlations, customer preferences, market trends, and other information that supports better-informed business decisions.
• Data Analysis: can be used in various ways; one can perform descriptive, exploratory, inferential, or predictive analysis and draw useful insights from the data.

Example
• Data Analytics: say you have 1 GB of customer purchase data from the past year and want to predict customers' next likely purchases; you would use data analytics for that.
• Data Analysis: suppose you have 1 GB of customer purchase data from the past year and you are trying to find out what has happened so far; in data analysis we look into the past.
Why is Data Analytics important?
As an enormous amount of data gets generated, the need to extract
useful insights is a must for a business enterprise. Data Analytics has
a key role in improving your business.
Here are 4 main factors which signify the need for Data Analytics:
• Gather Hidden Insights – Hidden insights from data are gathered
and then analyzed with respect to business requirements.
• Generate Reports – Reports are generated from the data and
passed on to the respective teams and individuals for further
action to drive business growth.
• Perform Market Analysis – Market analysis can be performed to
understand the strengths and weaknesses of competitors.
• Improve Business Requirements – Analysis of data helps improve
the business to meet customer requirements and expectations.
Tools in Data Analytics
With the increasing demand for Data Analytics in the market, many
tools have emerged with various functionalities for this purpose.
Ranging from open-source software to user-friendly commercial
products, the top tools in the data analytics market are as follows.
• R programming – This tool is the leading analytics tool used for
statistics and data modeling. R compiles and runs on various
platforms such as UNIX, Windows, and Mac OS. It also provides
tools to automatically install all packages as per user-
requirement.
• Python – Python is an open-source, object-oriented
programming language which is easy to read, write and
maintain. It provides various machine learning and visualization
libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas,
Keras etc. It can also connect to almost any data platform, such
as a SQL server, a MongoDB database, or JSON sources.
• Tableau Public – This is a free software that connects
to any data source such as Excel, corporate Data
Warehouse etc. It then creates visualizations, maps,
dashboards etc with real-time updates on the web.
• QlikView – This tool offers in-memory data
processing with the results delivered to the end-
users quickly. It also offers data association and data
visualization with data being compressed to almost
10% of its original size.
• SAS – A programming language and environment for
data manipulation and analytics, this tool is easily
accessible and can analyze data from different
sources.
Tools in Data Analytics ...
• Microsoft Excel – This tool is one of the most widely used
tools for data analytics. Mostly used for clients’ internal
data, it summarizes and analyzes data with features such as
pivot-table previews.
• RapidMiner – A powerful, integrated platform that can
integrate with any data source types such as Access, Excel,
Microsoft SQL, Teradata, Oracle, Sybase etc. This tool is
mostly used for predictive analytics, such as data mining,
text analytics, machine learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-
source data analytics platform, which allows you to analyze
and model data. With the benefit of visual programming,
KNIME provides a platform for reporting and integration
through its modular data pipeline concept.
• OpenRefine – Also known as GoogleRefine, this data
cleaning software will help you clean up data for
analysis. It is used for cleaning messy data, the
transformation of data and parsing data from
websites.
• Apache Spark – One of the largest large-scale data
processing engines, this tool executes applications in
Hadoop clusters 100 times faster in memory and 10
times faster on disk. This tool is also popular for data
pipelines and machine learning model development.
Data Workflows
Reporting Vs Analytics:
Reporting is presenting the results of data analysis,
while analytics is the process or system involved in
analyzing data to obtain a desired output.
Various steps involved in Analytics:
• Define your Objective
• Understand Your Data Source
• Prepare Your Data
• Analyze Data
• Report on Results
Step 1 - Define Your Objective
Ask the following questions:
• What are you trying to achieve?
• What could the result look like?
Step 2 - Understand Your Data Source
Ask the following questions:
• What information do I need?
• Can I get the data myself, or do I need to ask an IT
resource?
Step 3 - Prepare Your Data
Ask the following questions:
• Does the data need to be cleansed?
• Does the data need to be normalized?
Step 4 - Analyze Data
Ask the following questions:
• What tests can I run on the data?
• Is help available to understand results?
Step 5 - Report Results
Ask the following questions:
• Will management understand the results?
• Can you represent the results visually?
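The five-step workflow above can be sketched end-to-end in plain Python; the order records and field names below are hypothetical.

```python
# A toy walk-through of the five workflow steps using only the
# standard library; data and field names are illustrative.
from statistics import mean

# Step 1-2: objective = average order value; source = raw order records.
raw_orders = [{"id": 1, "value": "120"}, {"id": 2, "value": None},
              {"id": 3, "value": "95"}, {"id": 4, "value": "85"}]

# Step 3: prepare -- cleanse by dropping records with missing values
# and normalizing the value field from string to number.
orders = [{"id": o["id"], "value": float(o["value"])}
          for o in raw_orders if o["value"] is not None]

# Step 4: analyze -- run a simple summary statistic.
avg_value = mean(o["value"] for o in orders)

# Step 5: report -- present the result in a readable form.
print(f"Average order value: {avg_value:.2f} over {len(orders)} orders")
```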
Various Analytics techniques are:
• Data Preparation
• Reporting, Dashboards & Visualization
• Segmentation
• Forecasting
• Descriptive Modelling
• Predictive Modelling
Application of Modeling in Business
• A statistical model embodies a set of assumptions
concerning the generation of the observed data, and
similar data from a larger population.
• A model represents, often in considerably idealized form,
the data-generating process. Signal processing is an
enabling technology that encompasses the fundamental
theory, applications, algorithms, and implementations of
processing or transferring information contained in many
different physical, symbolic, or abstract formats broadly
designated as signals. It uses mathematical, statistical,
computational, heuristic, and linguistic representations,
formalisms, and techniques for representation, modelling,
analysis, synthesis, discovery, recovery, sensing,
acquisition, extraction, learning, security, or forensics.
Application of Modeling in Business...
– Sparsity
• Only presence counts
– Resolution
• Patterns depend on the scale
– Size
• Type of analysis may depend on size of data
Record Data
• Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Example attributes of a record data table: Tid, Refund, Marital Status, Taxable Income, Cheat.

Document-term matrix example (each document is described by how often each term occurs in it):

              timeout  season  coach  game  score  play  team  win  ball  lost
Document 1       3       0       5     0     2      6     0     2    0     2
Document 2       0       7       0     2     1      0     0     3    0     0
Document 3       0       1       0     0     1      2     2     0    3     0
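A sketch of how a document-term matrix like the one above is produced: each document becomes a vector of term counts. The example document below is hypothetical.

```python
# Build one row of a document-term matrix by counting term occurrences.
from collections import Counter

vocabulary = ["timeout", "season", "coach", "game", "score",
              "play", "team", "win", "ball", "lost"]

doc = "team play team win game score play coach"
counts = Counter(doc.split())

# One row of the matrix: the count of each vocabulary term in the document.
row = [counts[term] for term in vocabulary]
print(row)
```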
Transaction Data
• A special type of record data, where
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitute a transaction, while the individual products that
were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
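A minimal sketch of working with transaction data like the table above: the support count of an itemset (how many transactions contain all of its items) is a basic quantity in market-basket analysis.

```python
# Transaction data represented as a dict of item sets (table above).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Support count of the itemset {Diaper, Milk}: the number of
# transactions containing both items.
support = sum(1 for items in transactions.values()
              if {"Diaper", "Milk"} <= items)
print(support)
```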
Graph Data
• Examples: Generic graph, a molecule, and webpages
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean.)
What is Data?
• Data is a collection of objects and their attributes
• An attribute is also known as a variable, field, characteristic,
dimension, or feature
• A collection of attributes describe an object
– An object is also known as a record, point, case, sample,
entity, or instance

Example rows of a record data table:

Tid  Refund  Marital Status  Taxable Income  Cheat
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols
assigned to an attribute for a particular object
(Figure: two scales for measuring length. One scale preserves only
the ordering property of length; the other preserves both the
ordering and additivity properties.)
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts
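A small sketch (illustrative values, not from the slides) of which operations are meaningful for each of the four attribute types:

```python
# Nominal: only distinctness (= and !=) is meaningful.
eye_color = "brown"
assert eye_color != "blue"

# Ordinal: ordering is meaningful (encode the ranking explicitly).
height_rank = {"short": 0, "medium": 1, "tall": 2}
assert height_rank["tall"] > height_rank["short"]

# Interval: differences are meaningful, ratios are not
# (30 degrees C is NOT "twice as hot" as 15 degrees C).
celsius_a, celsius_b = 30.0, 15.0
diff = celsius_a - celsius_b  # a meaningful 15-degree difference

# Ratio: a true zero exists, so ratios are meaningful.
kelvin_a, kelvin_b = 300.0, 150.0
ratio = kelvin_a / kelvin_b   # 300 K really is twice 150 K

print(diff, ratio)
```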
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties/operations it possesses:
– Distinctness: =, ≠
– Order: <, >
– Differences are meaningful: +, -
– Ratios are meaningful: *, /
• Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data
• ID numbers
– Nominal, ordinal, or interval?
• Biased Scale
– Interval or Ratio
Key Messages for Attribute Types
• The types of operations you choose should be “meaningful” for the
type of data you have
– Distinctness, order, meaningful intervals, and meaningful ratios
are only four properties of data
– The data type you see – often numbers or strings – may not
capture all the properties or may suggest properties that are not
there
Substitution
• Impute the value from a new individual who was not
selected to be in the sample.
• In other words, go find a new subject and use their
value instead.
Hot deck imputation
• A randomly chosen value from an individual in the
sample who has similar values on other variables.
• In other words, find all the sample subjects who are
similar on other variables, then randomly choose one
of their values on the missing variable.
• One advantage is you are constrained to only possible
values. In other words, if Age in your study is
restricted to being between 5 and 10, you will always
get a value between 5 and 10 this way.
• Another is the random component, which adds in
some variability. This is important for accurate
standard errors.
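A minimal hot deck sketch in plain Python; the records, the Group variable, and the similarity rule (same group) are hypothetical.

```python
# Hot deck imputation sketch: for a subject with missing Age, randomly
# borrow Age from a "similar" subject in the sample (same group here).
import random

sample = [
    {"group": "A", "age": 7}, {"group": "A", "age": 9},
    {"group": "B", "age": 6}, {"group": "A", "age": None},
]

random.seed(0)  # only to make the sketch reproducible
for record in sample:
    if record["age"] is None:
        donors = [r["age"] for r in sample
                  if r["group"] == record["group"] and r["age"] is not None]
        record["age"] = random.choice(donors)  # the random component

# The imputed value is constrained to observed values (7 or 9 here).
print(sample[3]["age"])
```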
Cold deck imputation
• A systematically chosen value from an
individual who has similar values on other
variables.
• This is similar to Hot Deck in most ways, but
removes the random variation. So for
example, you may always choose the third
individual in the same experimental condition
and block.
Regression imputation
• The predicted value obtained by regressing
the missing variable on other variables.
• So instead of just taking the mean, you’re
taking the predicted value, based on other
variables. This preserves relationships among
variables involved in the imputation model,
but not variability around predicted values.
Stochastic regression imputation
• The predicted value from a regression plus a
random residual value.
• This has all the advantages of regression
imputation but adds in the advantages of the
random component.
• Most multiple imputation is based on some
form of stochastic regression imputation.
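A sketch contrasting regression and stochastic regression imputation on hypothetical (x, y) pairs, with a hand-rolled least-squares fit; drawing the residual from the observed residuals is one of several possible choices for the random component.

```python
# Regression vs. stochastic regression imputation (illustrative data).
import random

pairs = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]  # observed (x, y)
x_missing = 5                                      # y is missing here

# Least-squares fit of y = slope * x + intercept.
n = len(pairs)
x_mean = sum(x for x, _ in pairs) / n
y_mean = sum(y for _, y in pairs) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in pairs) / \
        sum((x - x_mean) ** 2 for x, _ in pairs)
intercept = y_mean - slope * x_mean

# Regression imputation: plug in the predicted value (no variability).
y_regression = slope * x_missing + intercept

# Stochastic regression imputation: predicted value + a random residual,
# drawn here from the model's observed residuals.
residuals = [y - (slope * x + intercept) for x, y in pairs]
random.seed(1)
y_stochastic = y_regression + random.choice(residuals)

print(round(y_regression, 2), round(y_stochastic, 2))
```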
Interpolation and extrapolation
• An estimated value from other observations from
the same individual. It usually only works in
longitudinal data.
• Use caution, though. Interpolation, for example,
might make more sense for a variable like height in
children–one that can’t go back down over time.
Extrapolation means you’re estimating beyond the
actual range of the data, and that requires making
more assumptions than you should.
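A sketch of linear interpolation on a hypothetical longitudinal height series; the missing year is estimated from the same subject's neighbouring measurements rather than from other subjects.

```python
# Linear interpolation for a gap in a longitudinal series (cm per year).
heights = [110.0, 113.0, None, 119.0]

for i, h in enumerate(heights):
    if h is None:
        # Estimate the gap as the midpoint of its two neighbours.
        heights[i] = (heights[i - 1] + heights[i + 1]) / 2

print(heights)  # the gap is filled with 116.0
```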
Single or Multiple Imputation?
• There are two types of imputation–single or multiple.
Usually when people talk about imputation, they
mean single.
• Single refers to the fact that you come up with a
single estimate of the missing value, using one of the
methods described above.
• It’s popular because it is conceptually simple and
because the resulting sample has the same number
of observations as the full data set.
• Single imputation looks very tempting when listwise
deletion eliminates a large portion of the data set.
But it has limitations.
• Some imputation methods result in biased parameter
estimates, such as means, correlations, and regression
coefficients, unless the data are Missing Completely at
Random (MCAR). The bias is often worse than with
listwise deletion, the default in most software.
• The extent of the bias depends on many factors,
including the imputation method, the missing data
mechanism, the proportion of the data that is missing,
and the information available in the data set.
• Moreover, all single imputation methods
underestimate standard errors.
• Since the imputed observations are themselves
estimates, their values have corresponding random
error. But when you put in that estimate as a data
point, your software doesn’t know that. So it
overlooks the extra source of error, resulting in too-
small standard errors and too-small p-values.
• And although imputation is conceptually simple, it
is difficult to do well in practice. So it’s not ideal but
might suffice in certain situations.
• So multiple imputation comes up with multiple
estimates. Two of the methods listed above work
as the imputation method in multiple imputation–
hot deck and stochastic regression.
• Because these two methods have a random
component, the multiple estimates are slightly
different. This re-introduces some variation that your
software can incorporate in order to give your model
accurate estimates of standard error.
• Multiple imputation was a huge breakthrough in
statistics about 20 years ago. It solves a lot of
problems with missing data (though, unfortunately
not all) and if done well, leads to unbiased parameter
estimates and accurate standard errors.
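A toy sketch of the multiple-imputation idea, using hypothetical values and random hot deck draws as the imputation method: impute several times, analyze each completed data set, then pool the results.

```python
# Multiple imputation sketch: several imputed data sets, pooled estimate.
import random
from statistics import mean

observed = [4, 6, 7, 9]      # observed values; one value is missing
num_imputations = 5

random.seed(42)              # only to make the sketch reproducible
estimates = []
for _ in range(num_imputations):
    imputed = random.choice(observed)   # random component per draw
    completed = observed + [imputed]    # one completed data set
    estimates.append(mean(completed))   # analyze each completed set

pooled = mean(estimates)     # pool the estimates across imputations
print(round(pooled, 2))
```

The spread of the per-imputation estimates is what lets the software re-introduce the variability that single imputation loses.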
Need for Business Modeling
5 Ways Data Analytics is Transforming Business
Models
1. Strategic Analytics
2. Platform Analytics
3. Enterprise Information Management (EIM)
4. Business Model Transformation
5. Making Data-centric Business
1. Strategic Analytics
Strategic analytics is detailed, data-driven analysis of your entire
system to help you determine what’s driving customer and market
behavior.
The key to strategic analytics is doing it in the right order:
Step 1 — Competitive Advantage Analytics to identify your
capability strengths and weaknesses
Step 2 — Enterprise Analytics to get diagnostics at the enterprise,
business unit and business process levels
Step 3 — Human Capital Analytics for diagnostics at the individual
level to get actionable insights
The data should answer critical questions like:
• What are the key decisions that drive the most value for us?
• What new data is available that hasn’t been mined yet?
• What new analytics techniques haven’t been fully explored?
2. Platform Analytics
• This helps you fuse analytics into your decision-making to
improve core operations. It can help your company harness the
power of data to identify new opportunities.
The important questions to ask include:
• How can we integrate analytics into everyday processes?
• Which processes will benefit from automatic, repeatable, real-
time analysis?
• Could our back-end system benefit from big data analytics?
Platform analytics must include more than a stack of technologies.
As it’s available via many formats and channels, it can be used to
check the pulse of your organization.
It will help you incorporate data analysis into key decisions across all
departments, including sales, marketing, the supply chain, customer
service, customer experience, and other core business functions.
3. Enterprise Information Management (EIM)
• Almost 80% of vital business information is stored in
unmanaged repositories. With strategic and platform
analytics already in place, EIM helps you take advantage of
social, mobile, analytics and cloud technologies (SMAC) to
improve the way data is managed and used across the
company.
• By building agile data management operations with tools for
information creation, capture, distribution and consumption,
EIM will help you:
– Streamline your business practices
– Enhance collaboration efforts
– Boost employee productivity in and out of the office
• When defining your EIM strategy, identify the business
requirements, key issues and opportunities for initiating EIM.
Also, identify potential programs and projects whose success
4. Business Model Transformation
• Companies that embrace big data analytics and transform their
business models in parallel will create new opportunities for revenue
streams, customers, products and services.
• From forecasting demand and sourcing materials to accounting and
the recruitment and training of staff, every aspect of your business
can be reinvented.
Needed changes include:
• Having a big data strategy and vision that identifies and capitalizes on
new opportunities
• Fostering a culture of innovation and experimentation with data
• Understanding how to leverage new skills and technologies, and
managing the impact they have on how information is accessed and
safeguarded
• Building trust with consumers who hold vital data
• Creating partnerships both within and outside your core industry
5. Making Data-centric Business
• Do you generate a large volume of data? Could that data benefit
other organizations, both inside and outside your industry?
• For a data-centric business, data isn’t just an asset, it’s
currency. It’s the source of your core competitiveness, and it’s
worth its weight in gold.
There are three main categories of data analytics:
• Insight: Includes mining, cleansing, clustering and segmenting
data to understand customers and their networks, influence and
product insights
• Optimization: Analyzing business functions, processes and
models
• Innovation: Exploring new, disruptive business models to further
the evolution and growth of your customer base.