Chapter 2: Data
Types of Data
Data Quality
Data Preprocessing
Attributes and Objects
An attribute is a property or characteristic of an object, also known as a variable, field, characteristic, dimension, or feature; a collection of attributes describes an object.
[Figure: sample data table; rows are objects, columns are attributes (Tid, Refund, Marital Status, Taxable Income, Cheat), e.g., object 5: No, Divorced, 95K, Yes.]
[Figure: measurement of the length of five objects, A–E, with two scales. One scale (B→2, C→3, D→4, E→5) preserves only the ordering property of length; the other (B→7, C→8, D→10, E→15) preserves both the ordering and additivity properties of length.]
Properties of Attribute Values
The type of an attribute depends on which of the following properties/operations it possesses:
– Distinctness: =, ≠ (nominal, e.g., sex: {male, female})
– Order: <, > (ordinal)
– Meaningful differences: +, − (interval)
– Meaningful ratios: ×, ÷ (ratio)
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as
important
– Examples: words present in documents, items present in customer transactions
Critiques of the attribute categorization
Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data
– The data type you see – often numbers or strings – may not
capture all the properties or may suggest properties that are
not present
Important characteristics of data
– Sparsity: only presence counts
– Resolution: patterns depend on the scale
– Size: type of analysis may depend on size of data
Transaction Data
A special type of record data, where each record (transaction) involves a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
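As a sketch, the table above could be represented in plain Python as a mapping from TID to an item set; the support-count helper is illustrative, not from the slides:

```python
# Each transaction is a set of items, keyed by transaction ID (TID).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Support count of an itemset: how many transactions contain it.
def support_count(itemset, transactions):
    return sum(itemset <= items for items in transactions.values())

print(support_count({"Diaper", "Milk"}, transactions))  # 3
```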
Data Matrix
Example: a document-term matrix, where each document is a term vector; each term is an attribute, and its value counts how often the term appears in the document.

Document     team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
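A minimal sketch of building such a document-term matrix with scikit-learn's CountVectorizer; the documents here are made up, not the ones behind the table above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "team play team score game game lost",  # illustrative documents
    "coach coach win season",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)           # sparse document-term count matrix
print(vec.get_feature_names_out())    # terms (the columns)
print(X.toarray())                    # rows = documents, values = term counts
```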
Ordered Data
Sequences of transactions: [Figure: a sequence of transactions; each element of the sequence is a set of items/events.]
Ordered Data: genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean.]
Noise
[Figure: three signals plotted as magnitude versus time (seconds): two sine waves; the observed signal (the sum of the two sine waves); the observed signal with noise.]
Duplicate Data
Data sets may include objects that are duplicates, or almost duplicates, of one another.
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
If data quality issues are not handled carefully, data mining algorithms will produce erroneous or spurious output.
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature Subset Selection
Feature Creation
Discretization and Binarization
Variable Transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object).
Quantitative attributes
– such as price, are typically aggregated by taking a sum or an average
Qualitative attributes
– such as item, can either be omitted or summarized in terms of a higher-level category, e.g., televisions versus electronics
Disadvantages of aggregation
– Potential loss of interesting details
– In the store example, aggregating over months loses information about which day of the week has the highest sales
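A small pandas sketch of both aggregation cases on hypothetical store data (all column names and values are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Jan", "Jan", "Feb"],
    "price": [10.0, 12.5, 9.0, 11.0],
    "item":  ["TV", "radio", "TV", "TV"],
})

# Quantitative attribute aggregated by a sum; collapsing days into months
# is exactly where day-of-week detail would be lost.
monthly = sales.groupby(["store", "month"])["price"].sum().reset_index()
print(monthly)

# Qualitative attribute summarized as a higher-level category.
sales["category"] = sales["item"].map({"TV": "electronics", "radio": "electronics"})
```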
Stratified Sampling
– Split the data into several partitions (strata), then draw random samples from each partition
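A minimal pandas sketch of stratified sampling on an imbalanced, made-up "group" column:

```python
import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "group": ["rare"] * 10 + ["common"] * 90,  # imbalanced strata
})

# Draw 20% from each stratum, so the rare group stays represented.
sample = df.groupby("group").sample(frac=0.2, random_state=0)
print(sample["group"].value_counts())  # common: 18, rare: 2
```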
Curse of Dimensionality
– When dimensionality increases, data becomes increasingly sparse in the space that it occupies
– Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
Dimensionality Reduction
Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Principal Components Analysis (PCA)
– A method of extracting important variables from a large number of variables available in a dataset
– It extracts a set of low-dimensional features from a high-dimensional dataset, with the goal of capturing as much of the information (variance) in the data as possible
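A short scikit-learn sketch of PCA on synthetic data; the shapes and the choice of two components are arbitrary, not from the slides:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 points in 10 dimensions with correlated coordinates.
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

pca = PCA(n_components=2)    # keep the two directions of highest variance
Z = pca.fit_transform(X)     # low-dimensional representation
print(Z.shape)                           # (200, 2)
print(pca.explained_variance_ratio_)     # fraction of variance captured
```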
Feature Subset Selection
Techniques
✔ Brute-Force approach:
Try all possible feature subsets as input to the data mining
algorithm, and then take the subset that produces the
best results
✔ Embedded approaches:
Feature selection occurs naturally as part of the data
mining algorithm
✔ Filter approaches:
Features are selected before the data mining algorithm is run,
using an approach that is independent of the data mining task.
For example: select sets of attributes whose pairwise
correlation is as low as possible.
✔ Wrapper approaches:
Use the data mining algorithm as a black box to find best
subset of attributes
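As an illustration of the filter idea (select attributes with low pairwise correlation), a sketch that greedily keeps a feature only if it is not highly correlated with any feature already kept; the 0.9 threshold and the data are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df["e"] = df["a"] * 0.95 + rng.normal(scale=0.1, size=100)  # near-duplicate of "a"

corr = df.corr().abs()
keep = []
for col in df.columns:
    # Keep a feature only if it is not highly correlated with one already kept.
    if all(corr.loc[col, k] < 0.9 for k in keep):
        keep.append(col)
print(keep)  # "e" is dropped: it correlates strongly with "a"
```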
Discretization
[Figure: data consists of four groups of points and two outliers. The data is one-dimensional, but a random y component is added to reduce overlap.]
Aggregation
✔ Normally a bunch of data is combined, and the cumulative data is used.
Sampling
✔ Only a few representative data objects are kept; the rest are discarded.
Dimensionality Reduction
✔ Keeping only the attributes that are important.
Variable Transformation
✔ Scaling values by some factor.
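Two common variable transformations as a NumPy sketch (the data is made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max scaling to [0, 1]: (x - min) / (max - min)
print((x - x.min()) / (x.max() - x.min()))   # [0.   0.25 0.5  1.  ]

# Standardization: subtract the mean, divide by the standard deviation.
print((x - x.mean()) / x.std())
```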
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
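A small sketch of the relationship between the two: a distance is a dissimilarity, and one common transform (an assumption here, not from the slides) maps it into (0, 1] via s = 1 / (1 + d):

```python
import math

def dissimilarity(p, q):
    return math.dist(p, q)            # Euclidean distance: 0 when identical

def similarity(p, q):
    return 1.0 / (1.0 + dissimilarity(p, q))

print(dissimilarity((0, 2), (2, 0)))  # 2.828..., larger when less alike
print(similarity((0, 2), (2, 0)))     # 0.261..., larger when more alike
print(similarity((1, 1), (1, 1)))     # 1.0, the maximum
```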
Similarity/Dissimilarity for Simple Attributes
[Table: definitions of similarity and dissimilarity for a single attribute of nominal, ordinal, or interval/ratio type.]
Euclidean Distance
dist(x, y) = sqrt( sum_k (x_k − y_k)^2 ), where x_k and y_k are the k-th attributes of data objects x and y.
[Figure: points p1–p4 plotted in the plane.]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
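The distance matrix above can be reproduced with SciPy; a minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

# Pairwise Euclidean distances; reproduces the distance matrix above.
print(np.round(cdist(points, points, metric="euclidean"), 3))
```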
Minkowski Distance
A generalization of Euclidean distance: dist(x, y) = ( sum_k |x_k − y_k|^r )^(1/r), where r is a parameter.
– r = 1: city block (Manhattan, L1 norm) distance
– r = 2: Euclidean (L2) distance
– r → ∞: supremum (L∞ norm) distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrices
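The three matrices correspond to SciPy's cityblock (r = 1), euclidean (r = 2), and chebyshev (r → ∞) metrics; a sketch reproducing them:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])  # p1..p4

for name in ("cityblock", "euclidean", "chebyshev"):  # L1, L2, L-infinity
    print(name)
    print(np.round(cdist(points, points, metric=name), 3))
```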
Mahalanobis Distance
mahalanobis(x, y) = (x − y)^T Σ^(−1) (x − y), where Σ is the covariance matrix of the input data. (This is the squared form; some definitions take the square root.)

Covariance matrix:
Σ = [ 0.3  0.2 ]
    [ 0.2  0.3 ]

Points: A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
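A NumPy sketch verifying both values, using the squared form of the Mahalanobis distance from the example above:

```python
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)   # [[6, -4], [-4, 6]]

def mahal_sq(x, y):
    # Squared Mahalanobis distance: (x - y)^T * cov_inv * (x - y)
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return d @ cov_inv @ d

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahal_sq(A, B))  # ≈ 5.0
print(mahal_sq(A, C))  # ≈ 4.0
```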