Chapter 2
Descriptive Analytics I:
Nature of Data, Statistical
Modeling, and Visualization
Copyright © 2018, 2014, 2011 Pearson Education, Inc. All Rights Reserved
Learning Objectives (1 of 2)
Learning Objectives (2 of 2)
The Nature of Data (1 of 2)
The Nature of Data (2 of 2)
Metrics for Analytics Ready Data
• Data source reliability
• Data content accuracy
• Data accessibility
• Data security and data privacy
• Data richness
• Data consistency
• Data currency/data timeliness
• Data granularity
• Data validity and data relevancy
A Simple Taxonomy of Data (1 of 2)
A Simple Taxonomy of Data (2 of 2)
The Art and Science of Data Preprocessing (1 of 2)
• Data reduction
1. Variables
– Dimensional reduction
– Variable selection
2. Cases/samples
– Sampling
– Balancing / stratification
Data Preprocessing Tasks and Methods (1 of 3)
Data Preprocessing Tasks and Methods (2 of 3)
Main Task | Subtasks | Popular Methods
--- | --- | ---
Data cleaning | Find and eliminate erroneous data | Identify the erroneous values in data (other than outliers), such as odd values, inconsistent class labels, odd distributions; once identified, use domain expertise to correct the values or remove the records holding the erroneous values.
Data transformation | Normalize the data | Reduce the range of values in each numerically valued variable to a standard range (e.g., 0 to 1 or -1 to +1) by using a variety of normalization or scaling techniques.
Data transformation | Discretize or aggregate the data | If needed, convert the numeric variables into discrete representations using range- or frequency-based binning techniques; for categorical variables, reduce the number of values by applying proper concept hierarchies.
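The normalization and binning rows above can be illustrated with a minimal sketch. It assumes min-max scaling as the normalization technique and equal-width (range-based) bins as the discretization technique; the function names and the income values are hypothetical and only serve the illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound to avoid dividing by zero.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width (range-based) bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it into bin n_bins-1.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


incomes = [20, 35, 50, 80, 120]      # hypothetical values, in thousands
scaled = min_max_scale(incomes)      # every value now lies between 0 and 1
bins = equal_width_bins(incomes, 4)  # four bins, each 25 units wide
```

Frequency-based binning would instead choose bin edges so that each bin holds roughly the same number of records.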
Data Preprocessing Tasks and Methods (3 of 3)
Main Task | Subtasks | Popular Methods
--- | --- | ---
Data transformation | Construct new attributes | Derive new and more informative variables from the existing ones using a wide range of mathematical functions (as simple as addition and multiplication or as complex as a hybrid combination of log transformations).
Data reduction | Reduce number of attributes | Principal component analysis, independent component analysis, chi-square testing, correlation analysis, and decision tree induction.
Data reduction | Reduce number of records | Random sampling, stratified sampling, expert-knowledge-driven purposeful sampling.
Data reduction | Balance skewed data | Oversample the less represented or undersample the more represented classes.
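As a rough illustration of the "balance skewed data" row, the sketch below oversamples the less represented classes by drawing records at random, with replacement, until every class is as large as the largest one. This is only one simple way to oversample; the function name, the `churn` label, and the records are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(records, label_key, seed=0):
    """Duplicate randomly chosen records of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    by_class = {}
    for record in records:
        by_class.setdefault(record[label_key], []).append(record)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw with replacement to fill the gap between this class and the largest class.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical skewed data set: 8 "no" records and 2 "yes" records.
records = [{"id": i, "churn": "no"} for i in range(8)]
records += [{"id": i, "churn": "yes"} for i in range(8, 10)]

balanced = oversample_minority(records, "churn")
counts = Counter(r["churn"] for r in balanced)
```

Undersampling works the other way around: it keeps all records of the smaller class and a random subset of the larger one.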
Statistical Modeling for Business Analytics (1 of 2)
Statistical Modeling for Business Analytics (2 of 2)
• Statistics
– A collection of mathematical techniques to
characterize and interpret data
• Descriptive Statistics
– Describing the data (as it is)
• Inferential statistics
– Drawing inferences about the population based on
sample data
• Descriptive statistics for descriptive analytics
Descriptive Statistics: Measures of Central Tendency
• Arithmetic mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Median
– The number in the middle
• Mode
– The most frequent observation
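A small sketch of the three measures, using Python's standard `statistics` module; the data values are hypothetical.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical observations

mean = sum(data) / len(data)      # arithmetic mean: sum of the values divided by n
median = statistics.median(data)  # middle value; average of the two middle values when n is even
mode = statistics.mode(data)      # most frequent observation
```

With these values the mean (5.0) sits above the median (4.0) because the single large observation pulls the mean upward, while the mode (3) is unaffected.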
Descriptive Statistics: Measures of Dispersion (1 of 2)
• Dispersion
– Degree of variation in a given
variable
• Range
– Max - Min
• Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Standard Deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• Mean Absolute Deviation (MAD)
– Average absolute deviation from the mean
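The dispersion measures on this slide can be computed directly from their definitions. The sketch below uses the sample (n - 1) form of the variance; the data values are hypothetical.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
n = len(data)
mean = sum(data) / n

value_range = max(data) - min(data)                      # Range: Max - Min
variance = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance, s^2
std_dev = math.sqrt(variance)                            # standard deviation, s
mad = sum(abs(x - mean) for x in data) / n               # mean absolute deviation from the mean
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 denominator and can serve as a cross-check.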
Descriptive Statistics: Measures of Dispersion (2 of 2)
• Quartiles
• Box-and-Whiskers Plot
– a.k.a. box-plot
– Versatile / informative
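A box plot is drawn from a five-number summary (minimum, first quartile, median, third quartile, maximum). The sketch below computes that summary with `statistics.quantiles`. Two choices here are assumptions rather than something stated on the slide: the "inclusive" quartile method (other conventions give slightly different quartiles) and the common 1.5 × IQR rule for flagging outliers. The data values are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return the five values a box-and-whiskers plot is built from."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}


def iqr_outliers(values, k=1.5):
    """Flag values more than k interquartile ranges below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


delivery_days = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical data with one extreme value
summary = five_number_summary(delivery_days)
outliers = iqr_outliers(delivery_days)
```

Here the quartiles are 3, 5, and 7, so the interquartile range is 4 and the value 30 lies beyond the upper fence of 13.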
Descriptive Statistics: Shape of a Distribution
$$\text{Skewness} = S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3}$$
• Kurtosis
– Peak/tall/skinny nature of the distribution
$$\text{Kurtosis} = K = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\, s^4} - 3$$
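The two shape formulas can be sketched in a few lines. The code follows the slide's formulas term by term, with s taken as the sample standard deviation (n - 1 denominator); statistical packages often use slightly different corrections, so their results may not match exactly. The data sets are hypothetical.

```python
import math


def sample_std(values):
    """Sample standard deviation s, with n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def skewness(values):
    """S = sum((x_i - mean)^3) / ((n - 1) * s^3)."""
    n = len(values)
    mean = sum(values) / n
    s = sample_std(values)
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * s ** 3)


def kurtosis(values):
    """K = sum((x_i - mean)^4) / (n * s^4) - 3."""
    n = len(values)
    mean = sum(values) / n
    s = sample_std(values)
    return sum((x - mean) ** 4 for x in values) / (n * s ** 4) - 3


symmetric = [1, 2, 3, 4, 5]         # hypothetical, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical, long right tail

s_sym = skewness(symmetric)      # zero for symmetric data
s_right = skewness(right_skewed) # positive for a long right tail
k_sym = kurtosis(symmetric)      # negative: flatter than a normal distribution
```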
Relationship Between Dispersion and Shape Properties
Technology Insights 2.1 (1 of 2)
Descriptive Statistics in Excel
Technology Insights 2.1 (2 of 2)
Descriptive Statistics in Excel: Creating a Box Plot in Microsoft Excel
Regression Modeling for Inferential Statistics
• Regression
– A part of inferential statistics
– The most widely known and used analytics technique
in statistics
– Used to characterize the relationship between
explanatory (input) and response (output) variables
• It can be used for
– Hypothesis testing (explanation)
– Forecasting (prediction)
Regression Modeling (1 of 3)
Regression Modeling (2 of 3)
Regression Modeling (3 of 3)
• x: input, y: output
• Simple Linear Regression
$$y = \beta_0 + \beta_1 x$$
• Multiple Linear Regression
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n$$
• The meaning of the beta ($\beta$) coefficients
– Sign (+ or -) and magnitude
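A minimal sketch of fitting the simple linear regression line, assuming ordinary least squares as the estimation method (the slide does not name one). The function name and the data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate b0 and b1 in y = b0 + b1 * x by ordinary least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx         # slope: change in y per one-unit change in x
    b0 = y_bar - b1 * x_bar  # intercept: value of y when x is 0
    return b0, b1


ad_spend = [1, 2, 3, 4, 5]  # hypothetical input (x)
sales = [3, 5, 7, 9, 11]    # hypothetical output (y)

b0, b1 = fit_simple_linear_regression(ad_spend, sales)
forecast = b0 + b1 * 6      # predicted y for a new x of 6
```

In this example the slope is positive, so y rises as x rises, and its magnitude (2) says by how much per unit of x.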
Process of Developing a Regression Model
– $R^2$ (R-squared)
– p-values
– Error measures (for prediction problems)
▪ MSE, MAD, RMSE
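The fit measures listed above can be sketched from their usual definitions, given actual and predicted values. In this sketch MAD is taken as the mean absolute deviation of the prediction errors, and R-squared as one minus the ratio of the error sum of squares to the total sum of squares. The function name and the data are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Compute R-squared and common error measures for a set of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / n  # mean squared error
    rmse = math.sqrt(mse)                  # root mean squared error, in the units of y
    mad = sum(abs(e) for e in errors) / n  # mean absolute deviation of the errors
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_error = sum(e ** 2 for e in errors)
    r_squared = 1 - ss_error / ss_total    # share of the variation in y explained by the model
    return {"r_squared": r_squared, "mse": mse, "rmse": rmse, "mad": mad}


actual = [3, 5, 7, 9]     # hypothetical observed values
predicted = [2, 5, 8, 9]  # hypothetical model output
metrics = regression_metrics(actual, predicted)
```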
Regression Modeling Assumptions
• Linearity
• Independence
• Normality (Normal Distribution)
• Constant Variance
• Multicollinearity (the explanatory variables should not be highly correlated with one another)
• What happens if the assumptions do not hold?
– What do we do then?
Logistic Regression Modeling (1 of 2)
Logistic Regression Modeling (2 of 2)
$$f(y) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
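The logistic function maps any value of $\beta_0 + \beta_1 x$ to a number strictly between 0 and 1, which is why it can be read as a probability. A short sketch follows; the coefficient values are hypothetical and chosen only so that the curve crosses 0.5 at x = 2.

```python
import math


def logistic(x, b0, b1):
    """Return 1 / (1 + e^-(b0 + b1 * x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


b0, b1 = -4.0, 2.0  # hypothetical coefficients

p_low = logistic(0, b0, b1)   # linear part is -4: probability close to 0
p_mid = logistic(2, b0, b1)   # linear part is 0: probability exactly 0.5
p_high = logistic(4, b0, b1)  # linear part is +4: probability close to 1
```

Because the curve is symmetric around the point where the linear part equals zero, `p_low` and `p_high` sum to 1.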
Time Series Forecasting
Business Reporting Definitions and Concepts
Business Reporting
Types of Business Reports
Data Visualization
A Brief History of Data Visualization
The First Pie Chart Created by William Playfair in 1801
William Playfair is widely credited as the inventor of the modern
chart, having created the first line and pie charts.
Decimation of Napoleon’s Army During the 1812 Russian Campaign
An Example Gapminder Chart: Wealth and Health of Nations
The Emergence of Data Visualization and Visual Analytics (2 of 2)
Visual Analytics
Visual Analytics by SAS Institute (1 of 2)
Technology Insight 2.3
Performance Dashboards (1 of 4)
Performance Dashboards (2 of 4)
Performance Dashboards (3 of 4)
• Dashboard design
– The fundamental challenge of dashboard design is to
display all the required information on a single screen,
clearly and without distraction, in a manner that can
be assimilated quickly
• Three layers of information
– Monitoring
– Analysis
– Management
Performance Dashboards (4 of 4)
• What to look for in a dashboard
– Use of visual components to highlight data and exceptions
that require action
– Transparent to the user, meaning that they require minimal
training and are extremely easy to use
– Combine data from a variety of systems into a single,
summarized, unified view of the business
– Enable drill-down or drill-through to underlying data
sources or reports
– Present a dynamic, real-world view with timely data
– Require little coding to implement, deploy, and maintain
Best Practices in Dashboard Design
End of Chapter 2
• Questions / Comments