Chapter 2: Data Exploration, Preprocessing and Visualization
Contents
● Data integration
● Data transformation
● Data reduction
● Discretization and generating concept hierarchies
● Data visualization tools and techniques
Getting to know your data
Before jumping into mining, we need to familiarize ourselves with the data.
Non-graphical methods: Univariate data
● Categorical data: a simple tabulation of the frequency of each category
Source: http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf
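A minimal sketch of such a tabulation in Python with pandas (the color values below are hypothetical):

```python
import pandas as pd

# Hypothetical categorical sample
color = pd.Series(["red", "blue", "red", "green", "red", "blue"])

print(color.value_counts())                 # absolute frequency of each category
print(color.value_counts(normalize=True))   # relative frequencies (proportions)
```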
Non-graphical methods: Univariate data
● Quantitative data: describing the population distribution of the variable using the observed sample
○ Central tendency: mean, median, mode etc.
○ Spread (an indicator of how far away from the center we are still likely to find data
values): variance, standard deviation, interquartile range (IQR)
○ Modality (number of peaks in the pdf)
○ Shape: Skewness (measure of asymmetry), Kurtosis (measure of peakedness)
○ Outliers (an observation that lies "far away" from other values)
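A sketch of these statistics with pandas (the sample is made up; the last value plays the role of an outlier):

```python
import pandas as pd

x = pd.Series([2, 3, 3, 4, 5, 5, 5, 7, 9, 25])   # hypothetical sample; 25 is an outlier

print(x.mean(), x.median(), x.mode().iloc[0])    # central tendency
print(x.var(), x.std())                          # spread: variance, standard deviation
print(x.quantile(0.75) - x.quantile(0.25))       # spread: interquartile range (IQR)
print(x.skew(), x.kurt())                        # shape: skewness and (excess) kurtosis
```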
Non-graphical methods: Bivariate data
● Categorical data:
○ Cross-tabulation
○ Measures of association: chi-squared test, Cramér's V, etc.
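A sketch of cross-tabulation and these association measures using pandas and SciPy (the gender/preference data is invented for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "gender":     ["M", "F", "F", "M", "F", "M", "F", "M"],
    "preference": ["A", "B", "B", "A", "A", "A", "B", "B"],
})

table = pd.crosstab(df["gender"], df["preference"])   # cross-tabulation
chi2, p, dof, expected = chi2_contingency(table)      # chi-squared test of independence

n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))   # Cramér's V, in [0, 1]
print(chi2, p, cramers_v)
```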
Non-graphical methods: Bivariate data
● Quantitative data:
○ Covariance: a measure of how much two variables "co-vary"
■ Positive covariance: when one measurement is above its mean, the other will probably also be above its mean, and vice versa
■ Negative covariance: when one variable is above its mean, the other is probably below its mean
■ Covariance near zero: no linear co-variation between the variables (this alone does not imply independence)
○ Correlation: the statistical relationship between two variables; a commonly used measure is Pearson's correlation coefficient:
■ -1 = data lie on a perfect straight line with a negative slope
■ 0 = no linear relationship between the variables
■ 1 = data lie on a perfect straight line with a positive slope
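A sketch with NumPy (the paired measurements are hypothetical; y roughly doubles x, so both quantities should come out strongly positive):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # approximately 2 * x

cov_xy = np.cov(x, y)[0, 1]   # sample covariance of x and y
r = np.corrcoef(x, y)[0, 1]   # Pearson's correlation coefficient, in [-1, 1]
print(cov_xy, r)
```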
Non-graphical methods: Multivariate data
● Covariance and correlation matrices: pairwise covariances and/or correlations assembled into a matrix
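With a pandas DataFrame these matrices come from built-in methods; a minimal sketch on invented columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50_000, 10_000, 100),
                   "age":    rng.uniform(20, 65, 100)})

print(df.cov())    # matrix of pairwise covariances
print(df.corr())   # matrix of pairwise Pearson correlations
```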
Graphical methods: Univariate data
● Bar chart
● Histogram
● Density plot
● Box and whiskers plot
● Time-series plot
Bar plot
Bar plots display the distribution (frequencies) of a
categorical variable through vertical or horizontal
bars.
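A minimal matplotlib sketch (the category counts are hypothetical):

```python
import matplotlib.pyplot as plt

categories = ["red", "blue", "green"]
counts = [12, 7, 3]           # hypothetical frequencies

plt.bar(categories, counts)   # vertical bars; plt.barh(...) gives horizontal bars
plt.ylabel("frequency")
plt.show()
```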
Histogram
Histograms are constructed by binning the
data and counting the number of
observations in each bin
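A sketch with matplotlib (synthetic normal data):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(size=500)   # hypothetical quantitative sample

plt.hist(data, bins=20)   # bin the data and count the observations in each bin
plt.xlabel("value")
plt.ylabel("count")
plt.show()
```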
Histogram
Bar plot vs histogram
Density plot
Density plots can be thought of as plots of
smoothed histograms
Density plot
The smoothness is controlled by a bandwidth
parameter that is analogous to the histogram
binwidth
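A sketch of a kernel density estimate with SciPy, where bw_method plays the role of the bandwidth parameter (synthetic data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

data = np.random.default_rng(0).normal(size=500)   # hypothetical sample

kde = gaussian_kde(data, bw_method=0.3)   # smaller bandwidth -> wigglier curve
xs = np.linspace(data.min(), data.max(), 200)
plt.plot(xs, kde(xs))                     # smoothed-histogram view of the data
plt.show()
```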
Box and whiskers plot
Presents information about the central
tendency, symmetry, and skew as well as
outliers
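A minimal matplotlib sketch (a skewed synthetic sample, so some points should show up as outliers):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).exponential(size=200)   # skewed hypothetical sample

# Box spans the IQR with a line at the median; points beyond the whiskers are outliers
plt.boxplot(data)
plt.show()
```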
Time series plot
Plots the values of a variable against time, i.e., the value observed at each time t.
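A minimal matplotlib sketch (a synthetic random-walk series):

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(100)                                         # time index
y = np.cumsum(np.random.default_rng(0).normal(size=100))   # hypothetical series

plt.plot(t, y)   # value of the variable at each time t
plt.xlabel("time t")
plt.show()
```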
Graphical methods: Bivariate data
● Scatter plot
● Regression
● Box and whiskers plot
● Bar chart
Scatter plot
Uses Cartesian coordinates to show the relationship between two variables of a set
of data
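A minimal matplotlib sketch (two synthetic, positively related variables):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + rng.normal(0, 2, 100)   # hypothetical related pair

plt.scatter(x, y)   # each point is one observation in Cartesian coordinates
plt.show()
```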
Bar plot
Stacked bar plot
Bar plot
Grouped bar plot
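Both variants fall out of pandas' plotting API; a sketch on an invented two-way frequency table:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts: rows = categories, columns = groups
counts = pd.DataFrame({"2022": [10, 15, 7], "2023": [12, 11, 9]},
                      index=["A", "B", "C"])

counts.plot(kind="bar", stacked=True)   # stacked bar plot
counts.plot(kind="bar")                 # grouped (side-by-side) bar plot
plt.show()
```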
Graphical methods: Multivariate data
● Scatter plot matrix
● Bubble plot
● Line chart
Scatter plot matrix
Can be used to roughly determine if there
is a linear correlation between multiple
variables
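A sketch with pandas (two of the three invented columns are linearly related, one is not):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2 * df["x"] + rng.normal(size=100)   # linearly related to x
df["z"] = rng.normal(size=100)                 # unrelated to x and y

pd.plotting.scatter_matrix(df)   # one scatter plot per pair of variables
plt.show()
```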
Bubble plot
Bubble plots can display the
relationship between three
quantitative variables
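A minimal matplotlib sketch, where the marker size encodes the third variable (all three variables are synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.uniform(0, 10, 30), rng.uniform(0, 10, 30)
z = rng.uniform(1, 5, 30)     # third quantitative variable

plt.scatter(x, y, s=z * 40)   # bubble size encodes z (scale factor is arbitrary)
plt.show()
```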
Line chart
Connecting the points in a scatter plot moving from left to right gives a line plot. Multiple lines can be drawn in the same plot.
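A sketch that makes the "left to right" step explicit by sorting on x first (synthetic data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = np.sin(x) + rng.normal(0, 0.1, 30)

order = np.argsort(x)              # move from left to right
plt.plot(x[order], y[order])       # connect the sorted points into a line
plt.plot(x[order], y[order] + 1)   # a second line in the same plot
plt.show()
```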
Graphical methods

              | Quantitative                                   | Categorical | Quantitative & Categorical
Bivariate     | Scatterplots                                   |             |
Multivariate  | Scatterplot matrix, Bubble plots (3 variables) |             |
Why preprocess the data?
In many practical situations
● Data contains too many attributes, some of which are clearly irrelevant or redundant
● Data is incomplete (some values are missing)
● Data is inaccurate or noisy (contains errors, or values that deviate from the
expected)
● Data is inconsistent (e.g., containing discrepancies in the department codes)
● For the same real-world entity, attribute values from different sources may differ
● Possible reasons: different representations, different scales (e.g., metric vs. British units), different grading schemes in different countries/universities, etc.
● Metadata information can be used to detect such discrepancies
● Data scrubbing tools and data auditing tools can aid in the discrepancy
detection step
Data reduction
● Complex data analysis and mining on huge amounts of data can take a long time
● Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume, yet closely maintains the integrity
of the original data
Data reduction
● Data reduction strategies
○ Dimensionality reduction: Reducing the number of random variables or attributes under
consideration
○ Numerosity reduction: Replacing the original data volume by alternative, smaller forms of data
representation
○ Data compression: Applying transformations to obtain a reduced or “compressed” representation
of the original data
○ Data Aggregation: Combining multiple data points into a single data point by applying a
summarization function
○ Data Generalization: Replacing a data point with a more general data point that still preserves
the important information
Dimensionality reduction
● Reducing the number of random variables or attributes under consideration
● Some methods:
○ Attribute subset selection / feature selection
○ Principal Component Analysis
○ Wavelet Transforms
Attribute subset selection
● Reduces the data set size by removing irrelevant or
redundant attributes (or dimensions)
● The goal is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes.
● For n attributes, there are 2^n possible subsets.
○ An exhaustive search for the optimal subset of attributes (brute-force approach) may not be
feasible
○ Heuristic/greedy methods that explore a reduced search space are commonly used for attribute
subset selection.
○ Stepwise forward selection, stepwise backward selection, decision tree induction etc.
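A sketch of stepwise forward selection; SequentialFeatureSelector in recent scikit-learn versions is one greedy implementation (the synthetic data and the choice of a decision tree are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Greedily add the attribute that most improves cross-validated accuracy
selector = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                     n_features_to_select=3,
                                     direction="forward")   # or "backward"
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the selected attributes
```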
Principal component analysis
● Transforms the data from d-dimensional
space into a new coordinate system of
dimension p, where p ≤ d
● Unlike attribute subset selection, which
reduces the attribute set size by retaining a
subset of the initial set of attributes, PCA
“combines” the essence of attributes by
creating an alternative, smaller set of
variables. The initial data can then be
projected onto this smaller set.
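A minimal scikit-learn sketch, projecting the 4-dimensional iris data onto p = 2 principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                   # d = 4 original attributes

pca = PCA(n_components=2)              # p = 2 new combined variables
X_proj = pca.fit_transform(X)          # project the data onto the new axes
print(pca.explained_variance_ratio_)   # variance captured by each component
```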
Wavelet transforms
● The discrete wavelet transform (DWT) is a linear signal processing technique
that, when applied to a data vector X, transforms it to a numerically different
vector, X′, of wavelet coefficients. The two vectors are of the same length.
● A compressed approximation of the data can be retained by storing only a small
fraction of the strongest of the wavelet coefficients.
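A sketch assuming the PyWavelets package (pywt); zeroing small detail coefficients mimics keeping only the strongest ones:

```python
import numpy as np
import pywt

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])   # data vector X

cA, cD = pywt.dwt(x, "haar")   # X': approximation + detail coefficients
cD[np.abs(cD) < 1.0] = 0.0     # keep only the strongest detail coefficients

x_approx = pywt.idwt(cA, cD, "haar")   # compressed approximation of X
print(x_approx)
```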
Numerosity reduction
● Replacing the original data volume by alternative, smaller forms of data
representation
● These techniques may be parametric or non-parametric
○ Parametric: Regression and log-linear models
○ Non-parametric: Sampling, histograms, clustering
Sampling
● Allows a large data set to be represented by a much smaller random data
sample (or subset)
● Types
○ Simple random sampling with replacement (SRSWR)
○ Simple random sampling without replacement (SRSWOR)
○ Cluster sampling
○ Stratified sampling
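A sketch of three of these schemes with pandas (the data and the stratum column are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"value":   rng.normal(size=1000),
                   "stratum": rng.choice(["a", "b"], size=1000)})

srswr  = df.sample(n=100, replace=True,  random_state=0)   # SRSWR
srswor = df.sample(n=100, replace=False, random_state=0)   # SRSWOR
strat  = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))          # stratified sampling
```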
Clustering
● Partition the objects into groups, or clusters, so that objects within a cluster are
“similar” to one another and “dissimilar” to objects in other clusters
● In data reduction, the cluster representations of the data are used to replace the
actual data
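A sketch with scikit-learn's KMeans, replacing the raw points by cluster centroids (the data and cluster count are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))     # hypothetical raw data

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
X_reduced = km.cluster_centers_    # 10 centroids stand in for 1000 points
counts = np.bincount(km.labels_)   # how many original points each one represents
```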
Data compression
● Applying transformations to obtain a reduced or “compressed” representation of
the original data
● If the original data can be reconstructed from the compressed data without any
information loss, the data reduction is called lossless.
● If we can reconstruct only an approximation of the original data, then the data
reduction is called lossy. Examples: PCA, DWT
Data transformation
● Transforming or consolidating data into forms appropriate for mining
● Strategies
○ Smoothing: removing noise from data
○ Attribute construction: constructing new attributes from the given set of attributes
○ Aggregation: summarizing or aggregating the data
○ Normalization: scaling the attribute data so that they fall within a smaller range
○ Discretization: replacing numerical attribute values with interval labels or conceptual labels
○ Concept hierarchy generation for nominal data: constructing hierarchies of concepts by generalizing concepts to higher levels
Normalization
● Transforming the data to fall within a smaller or common range such as [−1,1]
or [0.0, 1.0].
● Normalization is particularly useful for classification algorithms involving
neural networks or distance measurements such as nearest-neighbor
classification and clustering.
● For distance-based methods, normalization helps prevent attributes with
initially large ranges (e.g., income) from outweighing attributes with initially
smaller ranges (e.g., binary attributes).
● Some methods: min-max normalization, z-score normalization, normalization by
decimal scaling etc.
Min-max normalization
Performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, vi, of A to vi' in the range [new_minA, new_maxA] by computing

vi' = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Min-max normalization
Suppose that the minimum and maximum values for the attribute income are
$12,000 and $98,000, respectively. We would like to map income to the range
[0.0,1.0]. By min-max normalization, what will be the new value for $73,600?
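Applying the formula: vi' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 61,600 / 86,000 ≈ 0.716.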
Z-score normalization
The values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A:

vi' = (vi − Ā) / σA

where Ā and σA are the mean and standard deviation, respectively, of attribute A.
Z-score normalization is useful when the actual minimum and maximum of attribute A are
unknown, or when there are outliers that dominate the min-max normalization.
Z-score normalization
Suppose that the mean and standard deviation of the values for the attribute income
are $54,000 and $16,000, respectively.
With z-score normalization, what will be the new value for $73,600?
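Applying the formula: vi' = (73,600 − 54,000) / 16,000 = 19,600 / 16,000 = 1.225.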
Normalizing by decimal scaling
Normalizes by moving the decimal point of values of attribute A: vi' = vi / 10^j, where j is the smallest integer such that max(|vi'|) < 1.
Example: Suppose that the recorded values of A range from −986 to 917. The maximum
absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by
1000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
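A NumPy sketch of all three normalizations (the income values echo the examples above; note the z-scores here are computed from this small sample itself, not from the $54,000/$16,000 figures):

```python
import numpy as np

v = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])   # hypothetical incomes

minmax = (v - v.min()) / (v.max() - v.min())       # min-max to [0.0, 1.0]
zscore = (v - v.mean()) / v.std()                  # z-score normalization
j = int(np.floor(np.log10(np.abs(v).max()))) + 1   # smallest j with max |v'| < 1
decimal = v / 10**j                                # decimal scaling (here j = 5)
```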
Discretization
● Dividing the range of a continuous attribute into intervals
● Interval labels can then be used to replace actual data values
● Some classification algorithms only accept categorical attributes
● Types:
○ Supervised (using class information) vs unsupervised (without using class information)
○ Top-down / splitting or bottom-up / merging
Discretization
● Unsupervised discretization
○ Equal-width or equal-frequency
● Supervised discretization
○ Clustering, decision tree
○ Entropy-based discretization
○ Chi square discretization
○ etc.
Unsupervised discretization
● Does not use class information
● Equal-width discretization
○ First, find the minimum and maximum values for the continuous attribute
○ Then, divide the range of the attribute values into the user-specified equal-width discrete
intervals
[Figure: original data vs. the same data after equal-width discretization]
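A sketch with pandas, whose cut function implements exactly this splitting (the values are made up):

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])   # hypothetical data

binned = pd.cut(values, bins=3)   # 3 equal-width intervals over [min, max]
print(binned.value_counts())      # counts per interval need not be equal
```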
Unsupervised discretization
● Equal-frequency discretization
○ Sort the values of the attribute in ascending order
○ Find the number of all possible values for the attribute
○ Then, divide the attribute values into the user-specified number of intervals such that each
interval contains the same number of sorted sequential values
[Figure: original data vs. the same data after equal-frequency discretization]
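The pandas counterpart is qcut, which picks boundaries so the intervals hold roughly equal counts (same made-up values as above):

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

binned = pd.qcut(values, q=3)   # 3 intervals with ~equal numbers of values
print(binned.value_counts())
```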
Supervised discretization
Uses class information
Concept hierarchy generation for nominal data
● Nominal attributes have a finite (but possibly large) number of distinct values,
with no ordering among the values. Examples: geographic_location,
job_category etc.
● The concept hierarchies can be used to transform the data into multiple levels of
granularity.
Concept hierarchy generation for nominal data
● Specification of a partial/total ordering of attributes explicitly at the schema
level by users or experts
○ street < city < state < country
● Specification of a hierarchy for a set of values by explicit data grouping
○ {Urbana, Champaign, Chicago} ⊂ Illinois
● Specification of only a partial set of attributes
○ e.g., only street < city, not others
● Automatic generation of hierarchies (or attribute levels) by the analysis of the
number of distinct values
○ e.g., for a set of attributes: {street, city, state, country}
Automatic concept hierarchy generation
● Some hierarchies can be automatically
generated based on the analysis of the
number of distinct values per attribute in the
data set
● The attribute with the most distinct values is
placed at the lowest level of the hierarchy
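A sketch of the distinct-value heuristic with pandas (the location data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "street":  ["Oak St", "Elm St", "Main St", "Pine St", "Lake St"],
    "city":    ["Urbana", "Champaign", "Chicago", "Chicago", "New York"],
    "state":   ["Illinois", "Illinois", "Illinois", "Illinois", "New York"],
    "country": ["USA", "USA", "USA", "USA", "USA"],
})

# Fewer distinct values -> higher level: country < state < city < street
print(df.nunique().sort_values())
```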
Data Visualization
● Aims to communicate data clearly and effectively
through graphical representation
● Aids in discovering data relationships that are
otherwise not easily observable by looking at the
raw data
A famous example: John Snow's cholera map. Mapping cholera cases during an 1854 outbreak in London showed that the cases clustered around a shared water pump, leading to the discovery that cholera was water-borne.
Goals of Data Visualization
Visualization should communicate the data clearly and effectively.

Pixel-oriented Visualization
The colors of the pixels reflect the corresponding values (e.g., the smaller the value, the lighter the shading).
Example:
Consider a dataset containing: income, credit_limit, transaction_volume, age
Sorting the data in increasing order of income and using pixel-based visualization, we can observe that
● credit limit increases as income
increases
● customers whose income is in
the middle range are more likely
to purchase more
● there is no clear correlation
between income and age.
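A sketch of this example with matplotlib, using synthetic data built to reproduce the three observations (one pixel per record, records sorted by income):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 400   # 400 records -> one 20 x 20 pixel window per attribute
income = np.sort(rng.uniform(20_000, 100_000, n))       # sorted by income
credit_limit = 0.5 * income + rng.normal(0, 5_000, n)   # tracks income
transaction_volume = -((income - 60_000) / 1_000) ** 2 \
                     + rng.normal(0, 300, n)            # peaks in the mid range
age = rng.uniform(18, 70, n)                            # unrelated to income

fig, axes = plt.subplots(1, 4)
cols = [("income", income), ("credit_limit", credit_limit),
        ("transaction_volume", transaction_volume), ("age", age)]
for ax, (name, col) in zip(axes, cols):
    ax.imshow(col.reshape(20, 20), cmap="gray")   # pixel shade encodes the value
    ax.set_title(name, fontsize=8)
    ax.axis("off")
plt.show()
```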
Pixel-oriented Visualization
Windows do not have to be rectangular.
Geometric Projection Visualization
Pixel-oriented visualization techniques do not show whether there is a dense area in
a multidimensional subspace.