Datascience
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
• Purpose of aggregation
  – Data reduction
    ◆ Reduce the number of attributes or objects
  – Change of scale
    ◆ Cities aggregated into regions, states, countries, etc.
    ◆ Days aggregated into weeks, months, or years
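A minimal sketch of the "days aggregated into months" case, using only the Python standard library. The records in `daily_sales` and the helper name `aggregate_by_month` are hypothetical and chosen purely for illustration; summing is only one possible way to combine the daily values.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily records: (day, amount). Not taken from the slides.
daily_sales = [
    (date(2020, 1, 30), 10.0),
    (date(2020, 1, 31), 12.5),
    (date(2020, 2, 1), 7.0),
    (date(2020, 2, 2), 8.5),
    (date(2020, 2, 3), 9.0),
]


def aggregate_by_month(records):
    """Sum daily amounts into one total per (year, month)."""
    totals = defaultdict(float)
    for day, amount in records:
        totals[(day.year, day.month)] += amount
    return dict(totals)


monthly_sales = aggregate_by_month(daily_sales)
print(monthly_sales)
```

Five daily objects collapse into two monthly objects, which illustrates both effects listed above: fewer objects (data reduction) and a coarser time unit (change of scale).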
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
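One way to see this sparsity, sketched under simple assumptions: draw a fixed number of uniform random points in the unit hypercube, split every axis into the same number of bins, and measure the fraction of grid cells that contain at least one point. The function name `occupied_fraction`, the point count, the bin count, and the chosen dimensions are all illustrative choices, not values from the slides.

```python
import random


def occupied_fraction(n_points, n_dims, bins_per_axis, rng):
    """Fraction of grid cells in the unit hypercube holding at least one point."""
    occupied = set()
    for _ in range(n_points):
        # Map each uniformly drawn coordinate to the index of its bin.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(n_dims))
        occupied.add(cell)
    return len(occupied) / bins_per_axis ** n_dims


rng = random.Random(0)  # fixed seed so the run is repeatable
fractions = {d: occupied_fraction(1000, d, 3, rng) for d in (1, 2, 4, 8)}
for d, frac in fractions.items():
    print(f"{d} dimensions: {frac:.3f} of cells occupied")
```

With 3 bins per axis, 8 dimensions already give 3^8 = 6561 cells, so 1000 points can occupy at most about 15% of them. The number of points is unchanged, but most of the space is empty.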
• Purpose of dimensionality reduction:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise
• Techniques
  – Principal Components Analysis (PCA)
  – Singular Value Decomposition (SVD)
  – Others: supervised and non-linear techniques
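A minimal sketch of PCA on a tiny data set, using only the standard library. It centers the data, builds the sample covariance matrix, and estimates the first principal component by power iteration; this is one simple way to obtain the leading component, not necessarily how the slides compute it. In practice a numerical library would normally be used. The data in `points` and the name `first_principal_component` are hypothetical.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Estimate the leading PCA direction by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    # Center each attribute at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Repeatedly multiply a start vector by cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading direction.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # v has unit length, so v^T cov v is the variance captured along v.
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, means, variance / total_variance


# Hypothetical 2-D data lying close to the line y = 2x.
points = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, means, explained = first_principal_component(points)

# Project each point onto the component: two attributes reduced to one.
reduced = [sum((p[j] - means[j]) * direction[j] for j in range(2)) for p in points]
print(direction, explained)
```

Because the points lie almost on a line, a single component captures nearly all of the variance, so the one-dimensional `reduced` values lose very little information.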
01/27/2020 Introduction to Data Mining, 2nd Edition 79
Tan, Steinbach, Karpatne, Kumar
Dimensionality Reduction: PCA
[Figure: frequency histogram of petal length. x-axis: Petal Length (0 to 8); y-axis: Counts (0 to 30).]
Data consists of four groups of points and two outliers. Data is one-dimensional, but a random y component is added to reduce overlap.
Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.