5-6 - Nature of Data, Statistical Modeling, and Visualization
Chapter 3
Nature of Data, Statistical Modeling, and Visualization
Data cleaning
• Handle missing values in the data
– Fill in missing values (imputations) with most appropriate values (mean, median, min/max, mode, etc.); recode the missing values with a constant such as “ML”; remove the record of the missing value; do nothing.
• Identify and reduce noise in the data
– Identify the outliers in data with simple statistical techniques (such as averages and standard deviations) or with cluster analysis; once identified, either remove the outliers or smooth them by using binning, regression, or simple averages.
• Find and eliminate erroneous data
– Identify the erroneous values in data (other than outliers), such as odd values, inconsistent class labels, odd distributions; once identified, use domain expertise to correct the values or remove the records holding the erroneous values.
Data transformation
• Normalize the data
– Reduce the range of values in each numerically valued variable to a standard range (e.g., 0 to 1 or −1 to +1) by using a variety of normalization or scaling techniques.
• Discretize or aggregate the data
– If needed, convert the numeric variables into discrete representations using range- or frequency-based binning techniques; for categorical variables, reduce the number of values by applying proper concept hierarchies.
• Construct new attributes
– Derive new and more informative variables from the existing ones using a wide range of mathematical functions (as simple as addition and multiplication or as complex as a hybrid combination of log transformations).
Data reduction
• Reduce number of attributes
– Use principal component analysis, independent component analysis, chi-square testing, correlation analysis, and decision tree induction.
• Reduce number of records
– Perform random sampling, stratified sampling, expert-knowledge-driven purposeful sampling.
• Balance skewed data
– Oversample the less represented or undersample the more represented classes.
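A few of the preprocessing subtasks listed above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation: the function names, the sample `ages` list, and the z-score threshold of 2.0 are all assumptions made for the example, and it covers only median imputation, standard-deviation-based outlier detection, min-max normalization, and range-based (equal-width) binning.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, z_threshold=2.0):
    """Flag values lying more than z_threshold sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) / sd > z_threshold for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range 0 to 1 (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins range-based bins, numbered from 0."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical sample: one missing value and one suspiciously large value.
ages = [23, 25, None, 31, 29, 27, 95]

filled = impute_median(ages)                 # data cleaning: missing values
outlier_flags = flag_outliers(filled)        # data cleaning: noise/outliers
kept = [v for v, is_out in zip(filled, outlier_flags) if not is_out]
scaled = min_max_normalize(kept)             # data transformation: normalize
bins = equal_width_bins(kept, 2)             # data transformation: discretize

print(filled)
print(outlier_flags)
print(scaled)
print(bins)
```

Removing the flagged record is only one of the options named above; smoothing it, or keeping it after a domain expert confirms it is valid, would be equally legitimate choices.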
• Arithmetic mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Median
– The number in the middle
• Mode
– The most frequent observation
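The mean, median, and mode can be computed with Python's standard `statistics` module. The sketch below uses a made-up `incomes` list (an assumption for illustration only) that contains one extreme value, which pulls the mean upward while leaving the median and mode unchanged.

```python
import statistics

# Hypothetical sample with one extreme value at the end.
incomes = [32, 35, 35, 41, 48, 52, 250]

mean = statistics.mean(incomes)      # sum of the values divided by n
median = statistics.median(incomes)  # middle value of the sorted data
mode = statistics.mode(incomes)      # most frequent observation

print(mean, median, mode)
```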
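• Variance and standard deviation
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$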
• Mean Absolute Deviation (MAD)
– Average absolute deviation from the mean
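As a rough sketch, the dispersion measures on this slide (sample variance, sample standard deviation, and mean absolute deviation) can be written directly from their definitions. The function names and the `data` list are assumptions for illustration; the variance uses the n − 1 denominator and the mean absolute deviation averages over n.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

def mean_absolute_deviation(xs):
    """Average absolute deviation from the mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum(abs(x - x_bar) for x in xs) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample with mean 5

variance = sample_variance(data)
std_dev = sample_std(data)
mad = mean_absolute_deviation(data)

print(variance, std_dev, mad)
```

Because the deviations are not squared, the mean absolute deviation is less sensitive to extreme values than the standard deviation.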
• Skewness
$$\text{Skewness} = S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3}$$
• Kurtosis
– Peak/tall/skinny nature of the distribution
$$\text{Kurtosis} = K = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$$
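The shape measures can be sketched in the same way, assuming the formulas on these slides: skewness divides the summed cubed deviations by (n − 1)s³, and kurtosis divides the summed fourth-power deviations by ns⁴ and subtracts 3, where s is the sample standard deviation. The function names and sample lists are illustrative assumptions, and other software may use slightly different normalizing constants.

```python
import math

def sample_std(xs):
    """Sample standard deviation s, using the n - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))

def skewness(xs):
    """S = sum((x - mean)^3) / ((n - 1) * s^3)."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = sample_std(xs)
    return sum((x - x_bar) ** 3 for x in xs) / ((n - 1) * s ** 3)

def kurtosis(xs):
    """K = sum((x - mean)^4) / (n * s^4) - 3."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = sample_std(xs)
    return sum((x - x_bar) ** 4 for x in xs) / (n * s ** 4) - 3

symmetric = [1, 2, 3, 4, 5]          # made-up, symmetric around 3
right_skewed = [1, 1, 2, 2, 3, 10]   # made-up, long tail to the right

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

A skewness near zero indicates a symmetric sample, while a positive value indicates a longer right tail, as in the second list.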
Prediction Results
1. Classification
2. Regression
Source: gapminder.org.
Copyright © 2021 Pearson Education Ltd. All Rights Reserved.
The Emergence of Data Visualization
And Visual Analytics (1 of 2)
Figure 3.23 Magic Quadrant for Business Intelligence and Analytics Platforms.
• Magic Quadrant for Business
Intelligence and Analytics
Platforms (Source: Gartner.com)
• Many data visualization companies
are in the 4th quadrant
• There is a move towards
visualization