Chapter 2: Data
Types of Data
Data Quality
Data Preprocessing
Attributes and Objects
An attribute is a property or characteristic of an object, also known as a variable, field, characteristic, dimension, or feature; a collection of attributes describes an object.
[Figure: sample data table; rows are objects, columns are attributes (Tid, Refund, Marital Status, Taxable Income, Cheat), e.g., object 5: No, Divorced, 95K, Yes.]
[Figure: measurement of the length of five objects, A–E, with two scales. One scale (B→2, C→3, D→4, E→5) preserves only the ordering property of length; the other (B→7, C→8, D→10, E→15) preserves both the ordering and additivity properties of length.]
Properties of Attribute Values
The type of an attribute depends on which of the following properties/operations it possesses:
– Distinctness: =, ≠ (nominal, e.g., sex: {male, female})
– Order: <, > (ordinal)
– Meaningful differences: +, − (interval)
– Meaningful ratios: ×, ÷ (ratio)
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as
important
– Examples: words present in documents, items present in customer transactions
Critiques of the attribute categorization
Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data
– The data type you see – often numbers or strings – may not
capture all the properties or may suggest properties that are
not present
Important characteristics of data
– Sparsity: only presence counts
– Resolution: patterns depend on the scale
– Size: type of analysis may depend on size of data
Transaction Data
A special type of record data, where each record (transaction) involves a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
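As a sketch, the table above could be represented in plain Python as a mapping from TID to an item set; the support-count helper is illustrative, not from the slides:

```python
# Each transaction is a set of items, keyed by transaction ID (TID).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Support count of an itemset: how many transactions contain it.
def support_count(itemset, transactions):
    return sum(itemset <= items for items in transactions.values())

print(support_count({"Diaper", "Milk"}, transactions))  # 3
```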
Data Matrix
Example: a document-term matrix, where each document is a term vector; each term is an attribute, and its value counts how often the term appears in the document.

Document     team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
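A minimal sketch of building such a document-term matrix with scikit-learn's CountVectorizer; the documents here are made up, not the ones behind the table above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "team play team score game game lost",  # illustrative documents
    "coach coach win season",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)           # sparse document-term count matrix
print(vec.get_feature_names_out())    # terms (the columns)
print(X.toarray())                    # rows = documents, values = term counts
```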
Ordered Data
Sequences of transactions: [Figure: a sequence of transactions; each element of the sequence is a set of items/events.]
Ordered Data: genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean.]
Noise
[Figure: three signals plotted as magnitude versus time (seconds): two sine waves; the observed signal (the sum of the two sine waves); the observed signal with noise.]
Duplicate Data
Data sets may include objects that are duplicates, or almost duplicates, of one another.
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
If data quality issues are not handled carefully, data mining algorithms will produce erroneous or spurious output.
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature Subset Selection
Feature Creation
Discretization and Binarization
Variable Transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object).
Quantitative attributes
– such as price, are typically aggregated by taking a sum or an average
Qualitative attributes
– such as item, can either be omitted or summarized in terms of a higher-level category, e.g., televisions versus electronics
Disadvantages of aggregation
– Potential loss of interesting details
– In the store example, aggregating over months loses information about which day of the week has the highest sales
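A small pandas sketch of both aggregation cases on hypothetical store data (all column names and values are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Jan", "Jan", "Feb"],
    "price": [10.0, 12.5, 9.0, 11.0],
    "item":  ["TV", "radio", "TV", "TV"],
})

# Quantitative attribute aggregated by a sum; collapsing days into months
# is exactly where day-of-week detail would be lost.
monthly = sales.groupby(["store", "month"])["price"].sum().reset_index()
print(monthly)

# Qualitative attribute summarized as a higher-level category.
sales["category"] = sales["item"].map({"TV": "electronics", "radio": "electronics"})
```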
Stratified Sampling
– Split the data into several partitions (strata), then draw random samples from each partition
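A minimal pandas sketch of stratified sampling on an imbalanced, made-up "group" column:

```python
import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "group": ["rare"] * 10 + ["common"] * 90,  # imbalanced strata
})

# Draw 20% from each stratum, so the rare group stays represented.
sample = df.groupby("group").sample(frac=0.2, random_state=0)
print(sample["group"].value_counts())  # common: 18, rare: 2
```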
Curse of Dimensionality
– When dimensionality increases, data becomes increasingly sparse in the space that it occupies
– Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
Dimensionality Reduction
Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Principal Components Analysis (PCA)
– A method of extracting important variables from a large number of variables available in a dataset
– It extracts a set of low-dimensional features from a high-dimensional dataset, with the goal of capturing as much of the information (variance) in the data as possible
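A short scikit-learn sketch of PCA on synthetic data; the shapes and the choice of two components are arbitrary, not from the slides:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 points in 10 dimensions with correlated coordinates.
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

pca = PCA(n_components=2)    # keep the two directions of highest variance
Z = pca.fit_transform(X)     # low-dimensional representation
print(Z.shape)                           # (200, 2)
print(pca.explained_variance_ratio_)     # fraction of variance captured
```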
Feature Subset Selection
Techniques
✔ Brute-Force approach:
Try all possible feature subsets as input to the data mining
algorithm, and then take the subset that produces the
best results
✔ Embedded approaches:
Feature selection occurs naturally as part of the data
mining algorithm
✔ Filter approaches:
Features are selected before the data mining algorithm is run,
using an approach that is independent of the data mining task.
For example: select sets of attributes whose pairwise
correlation is as low as possible.
✔ Wrapper approaches:
Use the data mining algorithm as a black box to find best
subset of attributes
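As an illustration of the filter idea (select attributes with low pairwise correlation), a sketch that greedily keeps a feature only if it is not highly correlated with any feature already kept; the 0.9 threshold and the data are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df["e"] = df["a"] * 0.95 + rng.normal(scale=0.1, size=100)  # near-duplicate of "a"

corr = df.corr().abs()
keep = []
for col in df.columns:
    # Keep a feature only if it is not highly correlated with one already kept.
    if all(corr.loc[col, k] < 0.9 for k in keep):
        keep.append(col)
print(keep)  # "e" is dropped: it correlates strongly with "a"
```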
Discretization
[Figure: data consists of four groups of points and two outliers. The data is one-dimensional, but a random y component is added to reduce overlap.]
Aggregation
✔ Normally a bunch of data is combined, and the cumulative data is used.
Sampling
✔ Only a few representative data objects are kept; the rest are discarded.
Dimensionality Reduction
✔ Keeping only the attributes that are important.
Variable Transformation
✔ Scaling values by some factor.
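Two common variable transformations as a NumPy sketch (the data is made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max scaling to [0, 1]: (x - min) / (max - min)
print((x - x.min()) / (x.max() - x.min()))   # [0.   0.25 0.5  1.  ]

# Standardization: subtract the mean, divide by the standard deviation.
print((x - x.mean()) / x.std())
```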
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
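A small sketch of the relationship between the two: a distance is a dissimilarity, and one common transform (an assumption here, not from the slides) maps it into (0, 1] via s = 1 / (1 + d):

```python
import math

def dissimilarity(p, q):
    return math.dist(p, q)            # Euclidean distance: 0 when identical

def similarity(p, q):
    return 1.0 / (1.0 + dissimilarity(p, q))

print(dissimilarity((0, 2), (2, 0)))  # 2.828..., larger when less alike
print(similarity((0, 2), (2, 0)))     # 0.261..., larger when more alike
print(similarity((1, 1), (1, 1)))     # 1.0, the maximum
```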
Similarity/Dissimilarity for Simple Attributes
[Table: definitions of similarity and dissimilarity for a single attribute of nominal, ordinal, or interval/ratio type.]
Euclidean Distance
dist(x, y) = sqrt( sum_k (x_k − y_k)^2 ), where x_k and y_k are the k-th attributes of data objects x and y.
[Figure: points p1–p4 plotted in the plane.]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
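The distance matrix above can be reproduced with SciPy; a minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

# Pairwise Euclidean distances; reproduces the distance matrix above.
print(np.round(cdist(points, points, metric="euclidean"), 3))
```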
Minkowski Distance
A generalization of Euclidean distance: dist(x, y) = ( sum_k |x_k − y_k|^r )^(1/r), where r is a parameter.
– r = 1: city block (Manhattan, L1 norm) distance
– r = 2: Euclidean (L2) distance
– r → ∞: supremum (L∞ norm) distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrices
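The three matrices correspond to SciPy's cityblock (r = 1), euclidean (r = 2), and chebyshev (r → ∞) metrics; a sketch reproducing them:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])  # p1..p4

for name in ("cityblock", "euclidean", "chebyshev"):  # L1, L2, L-infinity
    print(name)
    print(np.round(cdist(points, points, metric=name), 3))
```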
Mahalanobis Distance
mahalanobis(x, y) = (x − y)^T Σ^(−1) (x − y), where Σ is the covariance matrix of the input data. (This is the squared form; some definitions take the square root.)

Covariance matrix:
Σ = [ 0.3  0.2 ]
    [ 0.2  0.3 ]

Points: A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
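A NumPy sketch verifying both values, using the squared form of the Mahalanobis distance from the example above:

```python
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)   # [[6, -4], [-4, 6]]

def mahal_sq(x, y):
    # Squared Mahalanobis distance: (x - y)^T * cov_inv * (x - y)
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return d @ cov_inv @ d

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahal_sq(A, B))  # ≈ 5.0
print(mahal_sq(A, C))  # ≈ 4.0
```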