Data Preprocessing
Data Visualization
Data Quality
Data Integration
Data Reduction
Data Transformation
1
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto graphical
primitives
Provide a qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships among
data
Help to find interesting regions and suitable parameters for further
quantitative analysis
Provide a visual proof of patterns derived through analytics / mining
2
Example: Sea Surface Temperature
The picture shows the Sea Surface Temperature (SST) at different locations
over the globe
Tens of thousands of data points are summarized in a single figure
4
Pixel-Oriented Visualization Techniques
For a data set of m dimensions, sort the records according to a global order
(e.g., determined by the query)
Create m windows on the screen, one for each dimension
The m dimension values of a record are mapped to pixels at the record's global
position in the corresponding windows
The colors of the pixels reflect the corresponding values
The figure highlights the correlation between income and the other attributes:
(a) Income (b) Credit limit (c) Transaction volume (d) Age
5
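A minimal sketch of the pixel-oriented idea described above, using synthetic data (the attribute names and distributions are assumptions, not the credit data from the slide): records are sorted by income and each attribute gets its own window, one pixel per record, colored by value.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 10_000                                    # number of records (synthetic)
income = rng.gamma(shape=2.0, scale=25_000, size=n)
data = {
    "income": income,
    "credit_limit": income * 1.5 + rng.normal(0, 10_000, n),        # correlated with income
    "transaction_volume": income * 0.1 + rng.normal(0, 5_000, n),   # correlated with income
    "age": rng.uniform(18, 80, n),                                   # uncorrelated
}

order = np.argsort(data["income"])            # global order: sort records by income
side = int(np.ceil(np.sqrt(n)))               # lay pixels out in a square window

fig, axes = plt.subplots(1, len(data), figsize=(12, 3))
for ax, (name, values) in zip(axes, data.items()):
    pixels = np.full(side * side, np.nan)
    pixels[:n] = np.asarray(values)[order]    # same global pixel position in every window
    ax.imshow(pixels.reshape(side, side), cmap="viridis")
    ax.set_title(name)
    ax.axis("off")
plt.show()
```

Correlated attributes show a similar dark-to-light gradient as income; the uncorrelated one looks like noise.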
Laying Out Pixels in Circle Segments
Each attribute is assigned a segment of a circle; the pixels within a segment
are colored with an intensity that reflects the attribute value
The pixels representing the first records are located closest to the center,
followed by later records arranged outward in a continuum within each segment
8
Scatter Plot Array of Iris Attributes
Parallel Coordinates
Presents k-D data along k equidistant axes which are parallel to one of the
screen axes; each axis corresponds to an attribute
The axes are scaled to the [minimum, maximum] range of the corresponding
attribute
Every data item corresponds to a polygonal line which intersects each axis at
the point corresponding to its value for that attribute
11
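A minimal sketch of such a plot for the Iris data, using pandas' built-in parallel-coordinates helper (assumes scikit-learn is available to load the data; the column scaling follows the [min, max] normalization mentioned above).

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})
df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

# Scale each axis to the [min, max] range of its attribute.
cols = iris.feature_names
df[cols] = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())

# One polygonal line per data item, colored by class.
parallel_coordinates(df, class_column="species", colormap="viridis", alpha=0.4)
plt.show()
```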
Parallel Coordinates Plots for Iris Data
Icon-Based Visualization Techniques
13
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x be
eyebrow slant, y be eye size, z be nose length, etc.
The figure shows faces produced using 10 characteristics (head eccentricity,
eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size,
mouth shape, mouth size, and mouth opening); each characteristic is assigned
one of 10 possible values
Faces generated using Mathematica (S. Dickson)
14
Hierarchical Visualization Techniques
15
Worlds-within-Worlds
16
Worlds-within-worlds:Example
To visualize the variation of a feature F with respect to the remaining five
dimensions X1, X2, ..., X5, the values of the last three dimensions are fixed
at, say, a, b, and c, and the variation of F w.r.t. X1 and X2 is shown as a
3-D plot. This plot is called an inner world, located at the origin (a, b, c)
in the outer world defined by (X3, X4, X5).
Tree-Map
Screen-filling method which uses a hierarchical partitioning of the screen
into nested rectangular regions depending on the attribute values
The x- and y-dimensions of the screen are partitioned alternately according to
the attribute values (classes)
Example: Google News stories organized into 7 categories, each in turn divided
into sub-categories
Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
18
Visualizing Complex Data and Relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags on a web site
The importance of a tag is represented by font size / color
Besides text data, there are also methods to visualize relationships, such as
social networks
20
Data Exploration & Preprocessing
Data Visualization
Data Quality
Data Cleaning
Data Integration
Data Reduction
Data Transformation
22
Chapter 3: Data Preprocessing
Data Visualization
Data Quality
Data Cleaning
Data Integration
Data Reduction
Data Transformation
29
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then smooth by bin means, bin median, or bin boundaries (as illustrated on the
next slide)
30
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
31
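A minimal sketch that reproduces the equal-frequency binning and smoothing results above (the bin count of 3 and the rounding of bin means to integers follow the slide's example).

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]  # equi-depth bins

by_means, by_boundaries = [], []
for b in bins:
    # Smoothing by bin means: replace every value with the (rounded) bin mean.
    mean = round(sum(b) / len(b))
    by_means.extend([mean] * len(b))
    # Smoothing by bin boundaries: replace every value with the closer boundary.
    lo, hi = min(b), max(b)
    by_boundaries.extend([lo if v - lo <= hi - v else hi for v in b])

print(by_means)        # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
print(by_boundaries)   # [4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34]
```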
Regression Analysis
(Figure: data points and a fitted regression line.)
32
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
33
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify the same real-world entities from multiple data sources,
e.g., S.V.N. Rao = S.V. Narayan Rao = S. Venkat Narayana Rao
Correlation analysis for nominal data uses the χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
36
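A minimal sketch of how the χ² statistic above can be computed; the 2×2 contingency table here is hypothetical, not the example from the deck.

```python
import numpy as np

# Hypothetical observed counts for two nominal attributes (rows vs. columns).
observed = np.array([[90.0, 60.0],
                     [30.0, 120.0]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()   # expected counts under independence

chi2 = ((observed - expected) ** 2 / expected).sum()  # Σ (Observed − Expected)² / Expected
print(round(chi2, 2))   # the larger the value, the stronger the evidence of correlation
```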
Chi-Square Calculation: An Example
Scatter plots showing the correlation coefficient (rA,B) ranging from −1 to 1
39
Covariance (Numeric Data)
Covariance is similar to correlation
Correlation coefficient: ρ(A, B) = Cov(A, B) / (σA σB)
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: Will the prices rise or fall together? If so, the stocks are
affected by the same industry trend.
Ā = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
B̄ = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
Cov(A, B) = E(A·B) − Ā·B̄ = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 − 4 × 9.6
= 42.4 − 38.4 = 4
Since Cov(A, B) > 0, the prices of A and B tend to rise together
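A minimal sketch that reproduces the calculation above, using the population form Cov(A, B) = E(A·B) − Ā·B̄ from the slide (not the n−1 sample estimator).

```python
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]

mean_A = sum(A) / len(A)                                   # 4.0
mean_B = sum(B) / len(B)                                   # 9.6
cov_AB = sum(a * b for a, b in zip(A, B)) / len(A) - mean_A * mean_B
print(round(cov_AB, 2))                                    # 4.0 > 0: prices tend to rise together
```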
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
42
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes (wavelet
transforms, principal component analysis, attribute subset selection)
Numerosity reduction, e.g., regression, histograms, clustering, sampling
Data compression
43
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required for data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
44
Mapping Data to a New Space
Fourier transform
Wavelet transform
45
Wavelet Transformation
Discrete wavelet transform (DWT) is a linear signal
processing technique that transforms a given data vector X,
into a numerically different vector, X’ of wavelet coefficients
Useful for data reduction via compressed approximation: store only a small
fraction of the strongest wavelet coefficients (obtained by thresholding)
Provides multi-resolution analysis and denoising & cleaning
Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
46
What Is Wavelet Transform?
(Figure: the Haar-2 and Daubechies-4 wavelet families.)
Several families of DWTs exist; two of them are shown in the figure above
General procedure is the Hierarchical Pyramid Algorithm:
Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
Each transform has 2 functions:
f1: smoothing for aggregation
f2: difference for details
The two functions are applied to pairs of successive data points, resulting in
two sets of data of length L/2, corresponding to the low- and high-frequency
content
The two functions are applied recursively until the length of each resulting
list reaches 2
47
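A minimal sketch of the hierarchical pyramid algorithm for the Haar DWT, in its simple averaging/differencing (unnormalized) form; the input vector here is just an illustration, and the length is padded with zeros to a power of 2 as described above.

```python
def haar_dwt(x):
    n = 1
    while n < len(x):
        n *= 2
    x = list(x) + [0.0] * (n - len(x))        # pad with 0's up to a power of 2

    coeffs = []
    while len(x) > 1:
        smooth = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]  # f1: smoothing / aggregation
        detail = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]  # f2: difference / details
        coeffs = detail + coeffs              # keep this level's high-frequency details
        x = smooth                            # recurse on the low-frequency half (length L/2)
    return x + coeffs                         # overall average, then detail coefficients

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Thresholding small detail coefficients to zero and storing only the strongest ones gives the compressed approximation mentioned on the previous slide.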
Why Wavelet Transform?
48
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
(Figure: principal components of the data in the x1–x2 plane.)
49
Principal Component Analysis (Steps)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
The principal components are sorted in the order of decreasing
“significance” or strength
Each input data (vector) is represented in a low dimensional space
defined by the k principal component vectors
Since the components are sorted, the size of the data can be reduced
by eliminating the weak components, i.e., those with low variance
(i.e., using the strongest principal components, it is possible to
reconstruct a good approximation of the original data)
Works for numeric data only
50
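A minimal sketch of the steps above via the eigenvectors of the covariance matrix; the data is synthetic and k is chosen by the caller (a production implementation would typically use an existing library routine).

```python
import numpy as np

def pca(X, k):
    X = X - X.mean(axis=0)                    # normalize: center each attribute
    cov = np.cov(X, rowvar=False)             # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvectors define the new space
    order = np.argsort(eigvals)[::-1]         # sort by decreasing "significance" (variance)
    components = eigvecs[:, order[:k]]        # keep only the k strongest components
    return X @ components                     # data represented in the k-D space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]         # make two attributes strongly correlated
print(pca(X, k=2).shape)                      # (100, 2)
```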
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate the information contained in one or more
other attributes
E.g., price of a product and the sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' CGPA
51
Heuristic Search in Attribute Selection
At each step, select the best remaining attribute (forward selection) or
eliminate the worst attribute (backward elimination)
Decision-tree induction: all attributes that appear in the tree are selected
and the rest eliminated
52
Attribute Creation (Feature Generation)
Create new attributes (features) that can capture the
important information in a data set more effectively than the
original ones
Attribute construction: combining existing attributes
Example: constructing an area attribute from height and width
53
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression, log-linear models)
Assume the data fits some model; estimate the model parameters, store only
the parameters, and discard the data (except possible outliers)
Non-parametric methods
Do not assume models; major families include histograms, clustering, and
sampling
54
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
The least-squares criterion is applied to the known values of Y1, Y2, …,
and X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional
space for a combination of discretized attributes, based on smaller
subsets of dimensional combinations
Useful for data smoothing and also for dimensionality reduction
55
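A minimal sketch of fitting Y = wX + b by least squares and a multiple regression Y = b0 + b1 X1 + b2 X2; the data here is synthetic, and numpy's polyfit / lstsq are just one convenient way to obtain the least-squares estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
Y = 3.0 * X + 1.0 + rng.normal(0, 1, size=50)      # true w = 3, b = 1, plus noise

w, b = np.polyfit(X, Y, deg=1)                      # least-squares estimates of w and b
print(f"w ~= {w:.2f}, b ~= {b:.2f}")

# Multiple regression Y = b0 + b1*X1 + b2*X2, fit with ordinary least squares:
X2 = rng.uniform(0, 10, size=50)
design = np.column_stack([np.ones_like(X), X, X2])
coeffs, *_ = np.linalg.lstsq(design, Y + 0.5 * X2, rcond=None)   # b0, b1, b2
print(coeffs)
```

Storing only the estimated coefficients (and discarding the raw data) is what makes regression a parametric numerosity-reduction method.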
Histogram Analysis
Divide data into buckets and
store average (sum) for each
bucket
Partitioning rules:
Equal-width: equal bucket
range
Equal-frequency (or equal-
depth)
56
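A minimal sketch contrasting the two partitioning rules above on synthetic values; only the bucket edges and per-bucket means are kept, which is the reduced representation a histogram stores.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.sort(rng.gamma(shape=2.0, scale=10.0, size=1000))
n_buckets = 4

# Equal-width: buckets with equal value range.
width_edges = np.linspace(values.min(), values.max(), n_buckets + 1)

# Equal-frequency (equal-depth): buckets holding roughly the same number of values.
freq_edges = np.quantile(values, np.linspace(0, 1, n_buckets + 1))

for name, edges in [("equal-width", width_edges), ("equal-frequency", freq_edges)]:
    means = [values[(values >= lo) & (values <= hi)].mean()
             for lo, hi in zip(edges[:-1], edges[1:])]
    print(name, np.round(edges, 1), np.round(means, 1))
```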
Clustering
Partition data set into clusters based on similarity, and
store cluster representatives (e.g., centroid and
diameter) only
Can be very effective if data is clusterable but not
effective otherwise (if the data is “smeared”)
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
57
Sampling
58
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular item from a given
population
Sampling without replacement: once an item is selected, it is removed from
the population
Sampling with replacement: a selected item is not removed, so it may be drawn
more than once
(Figure: raw data and the samples drawn from it.)
60
Sampling: Stratified Sampling
61
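A minimal sketch of the sampling types above plus stratified sampling with proportional allocation; the data and the stratum labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1000)                                   # raw data (record ids)
stratum = np.where(data < 900, "common", "rare")         # hypothetical skewed strata

srs_wor = rng.choice(data, size=50, replace=False)       # simple random sampling w/o replacement
srs_wr  = rng.choice(data, size=50, replace=True)        # simple random sampling w/ replacement

# Stratified sampling: draw from each stratum in proportion to its size,
# so the small "rare" stratum is still represented in the sample.
sample = []
for s in np.unique(stratum):
    members = data[stratum == s]
    k = max(1, round(50 * len(members) / len(data)))
    sample.extend(rng.choice(members, size=k, replace=False))

print(len(srs_wor), len(srs_wr), len(sample))
```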
Data Cube Aggregation
(Figure: original data and its approximated, aggregated representation.)
63
Data Compression
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
65
Data Transformation
A function that maps the entire set of values of a given attribute to a new
set of replacement values such that each old value can be identified with
one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization / Concept hierarchy generation
66
Normalization
Min-max normalization: to [new_minA, new_maxA]
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μA) / σA
Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
67
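A minimal sketch of the three normalization methods above, checked against the income example (v = 73,600); the decimal-scaling input values are just illustrative.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Min-max normalization to [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    # Z-score normalization using the attribute's mean and standard deviation.
    return (v - mu) / sigma

def decimal_scaling(values):
    # Find the smallest j such that max(|v'|) < 1, then divide every value by 10^j.
    m = max(abs(v) for v in values)
    j = 0
    while m / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling([-991, 23, 450]))            # j = 3 -> [-0.991, 0.023, 0.45]
```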
Data Discretization
Discretization: Divide the range of a continuous attribute into
intervals and the intervals can be labelled
Interval labels are used to replace actual data values so that
numeric features are transformed into nominal features for further
analysis, e.g., classification
Typical methods: All the methods can be applied recursively
Binning: top-down split, unsupervised
Histogram analysis: top-down split, unsupervised
Clustering analysis: unsupervised, top-down split or bottom-up merge
Decision-tree analysis: supervised, top-down split
Correlation (e.g., χ²) analysis: supervised, bottom-up merge
(adjacent intervals having similar class distributions are merged)
68
Simple Discretization: Binning
70
Discretization by Classification &
Correlation Analysis
Classification (Decision tree analysis)
Supervised: Given class labels, e.g., cancerous vs. benign
Using entropy to determine split points (discretization points)
Top-down, recursive split
Correlation analysis (Chi-merge: χ2-based discretization)
Supervised: use class information
Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to merge
Merging is performed recursively until a predefined stopping condition is met
71
Concept Hierarchy Generation
Concept hierarchy organizes features hierarchically and is usually
associated with each dimension in a data warehouse
Concept hierarchies facilitate drilling down and rolling up to view / analyze
data (maintained in a data warehouse) at multiple levels of granularity
Concept hierarchy formation:
Recursively reduce the data by collecting and replacing low level
values / concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers OR
Concept hierarchy can be automatically formed for both numeric and
nominal data.
For numeric data, discretization methods are used.
72
Concept Hierarchy Generation
for Nominal Data
Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data
grouping
{Vizag, Vijayawada, Tirupati} < AP
Specification of a set of attributes only that constitute the
concept hierarchy but not their ordering
E.g., for a set of attributes: {street, city, state, country}
Solution: Automatic generation of hierarchies (or attribute levels) by
analyzing the dataset based on the number of distinct values
73
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
The attributes with more distinct values are placed at the lower levels of the
hierarchy
Exceptions exist, e.g., weekday has only 7 distinct values yet sits below
month, quarter, and year
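A minimal sketch of the distinct-value heuristic above; the location data is hypothetical and uses the city names from the earlier grouping example.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["India", "India", "India", "India"],
    "state":   ["AP", "AP", "Telangana", "Telangana"],
    "city":    ["Vizag", "Vizag", "Hyderabad", "Warangal"],
    "street":  ["MG Road", "Ring Rd", "Tank Bund Rd", "Station Rd"],
})

# Fewer distinct values -> higher level of the hierarchy.
hierarchy = df.nunique().sort_values().index.tolist()   # top level first
print(" < ".join(reversed(hierarchy)))                  # street < city < state < country
```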
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
75
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g., handling missing values, noisy data, and outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
76