
Data Mining: Data

Lecture Notes for Chapter 2

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

Outline

 Attributes and Objects

 Types of Data

 Data Quality

 Data Preprocessing

 Similarity and Distance

What is Data?

 Collection of data objects and their attributes

 An attribute is a property or characteristic of an object
   – Examples: eye color of a person, temperature, etc.
   – Attribute is also known as variable, field, characteristic,
     dimension, or feature

 A collection of attributes describes an object
   – Object is also known as record, point, case, sample, entity,
     instance, vector, or observation

In the table below, the rows are objects and the columns are attributes:

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Attribute Values
 Attribute values are numbers or symbols assigned to
an attribute for a particular object

 Distinction between attributes and attribute values


– Same attribute can be mapped to different attribute
values
 Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of


values
 Example: Attribute values for ID and age are integers

– But the properties of an attribute can be different from the
  properties of the values used to represent the attribute
Measurement of Length
 The way you measure an attribute may not match the
attribute's properties.

[Figure: the lengths of five objects, A through E, measured on two
different scales. One scale (values 1, 2, 3, 4, 5) preserves only the
ordering property of length; the other (values 5, 7, 8, 10, 15)
preserves both the ordering and additivity properties of length.]
Properties of Attribute Values

 The type of an attribute depends on which of the


following properties/operations it possesses:
– Distinctness: = and ≠
– Order: <, ≤, >, and ≥
– Addition: + and −
– Multiplication: * and /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful
differences
– Ratio attribute: all 4 properties/operations
Types of Attributes

 There are different types of attributes


– Nominal
 Examples: ID numbers, eye color, zip codes
– Ordinal
 Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height {tall, medium, short}
– Interval
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
 Examples: temperature in Kelvin, length, counts, elapsed
time (e.g., time to run a race)
Difference Between Ratio and Interval

 Is it physically meaningful to say that a


temperature of 10° is twice that of 5° on
– the Celsius scale?
– the Fahrenheit scale?
– the Kelvin scale?

 Consider measuring the height above average


– If Bill’s height is three inches above average and
Bob’s height is six inches above average, then would
we say that Bob is twice as tall as Bill?
– Is this situation analogous to that of temperature?
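
A quick numeric check makes the contrast concrete (a minimal Python
sketch; 273.15 is the standard Celsius-to-Kelvin offset):

    # Ratios are meaningful on the Kelvin (ratio) scale but not on the
    # Celsius (interval) scale, whose zero point is arbitrary.
    def celsius_to_kelvin(c):
        return c + 273.15

    t1, t2 = 10.0, 5.0
    print(t1 / t2)                                        # 2.0, but physically meaningless
    print(celsius_to_kelvin(t1) / celsius_to_kelvin(t2))  # ~1.018, a meaningful ratio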

Different attribute types

Categorical (Qualitative)
  Nominal: values only distinguish. (=, ≠)
    Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
    Operations: mode, entropy, contingency correlation, χ² test
  Ordinal: values also order objects. (<, >)
    Examples: hardness of minerals, {good, better, best}, grades, street numbers
    Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)
  Interval: differences between values are meaningful. (+, −)
    Examples: calendar dates, temperature in Celsius or Fahrenheit
    Operations: mean, standard deviation, Pearson's correlation, t and F tests
  Ratio: both differences and ratios are meaningful. (*, /)
    Examples: temperature in Kelvin, monetary quantities, counts, age, mass,
    length, electrical current
    Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens


Transformations that define attribute levels

Categorical (Qualitative)
  Nominal: any permutation of values.
    Comment: if all employee ID numbers were reassigned, it would not
    make any difference.
  Ordinal: an order-preserving change of values, i.e.,
    new_value = f(old_value), where f is a monotonic function.
    Comment: an attribute encompassing the notion of {good, better, best}
    can be represented equally well by the values {1, 2, 3} or by
    {0.5, 1, 10}.

Numeric (Quantitative)
  Interval: new_value = a * old_value + b, where a and b are constants.
    Comment: the Fahrenheit and Celsius temperature scales differ in
    where their zero value is and the size of a unit (degree).
  Ratio: new_value = a * old_value.
    Comment: length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens


Discrete and Continuous Attributes

 Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
 Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
Asymmetric Attributes
 Only presence (a non-zero attribute value) is regarded as
important
 Words present in documents
 Items present in customer transactions

 If we met a friend in the grocery store, would we ever say
the following?
“I see our purchases are very similar since we didn’t buy most of
the same things.”
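
A minimal Python sketch of this idea (the vectors and the use of the
Jaccard coefficient, which ignores 0-0 matches, are our illustration,
not part of the slide):

    # Binary purchase vectors for two customers over 8 items.
    x = [1, 0, 0, 0, 0, 0, 1, 0]
    y = [0, 0, 0, 0, 0, 0, 1, 1]

    matches = sum(a == b for a, b in zip(x, y))             # counts 0-0 matches too
    both    = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either  = sum(a == 1 or b == 1 for a, b in zip(x, y))

    print(matches / len(x))  # simple matching: 0.75, "similar" mostly via non-purchases
    print(both / either)     # Jaccard: ~0.33, counts only items actually bought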

Critiques of the attribute categorization

 Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data

 Real data is approximate and noisy


– This can complicate recognition of the proper attribute type
– Treating one attribute type as another may be approximately
correct

Key Messages for Attribute Types

 The types of operations you choose should be


“meaningful” for the type of data you have
– Distinctness, order, meaningful intervals, and meaningful
ratios are only four (among many possible) properties of data

– The data type you see – often numbers or strings – may not
capture all the properties or may suggest properties that are
not present

– Analysis may depend on these other properties of the data
    Many statistical analyses depend only on the distribution

– In the end, what is meaningful can be specific to domain

Important Characteristics of Data

– Dimensionality (number of attributes)


 High dimensional data brings a number of challenges

– Sparsity
 Only presence counts

– Resolution
 Patterns depend on the scale

– Size
 Type of analysis may depend on size of data

Types of data sets
 Record
– Data Matrix
– Document Data
– Transaction Data
 Graph
– World Wide Web
– Molecular Structures
 Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

Record Data

 Data that consists of a collection of records, each


of which consists of a fixed set of attributes
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Transaction Data

 A special type of data, where


– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased are
the items.
– Can represent transaction data as record data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
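
A minimal Python sketch (item names taken from the table above) of
re-encoding these transactions as record data with one asymmetric
binary attribute per item:

    transactions = {
        1: {"Bread", "Coke", "Milk"},
        2: {"Beer", "Bread"},
        3: {"Beer", "Coke", "Diaper", "Milk"},
        4: {"Beer", "Bread", "Diaper", "Milk"},
        5: {"Coke", "Diaper", "Milk"},
    }

    items = sorted(set().union(*transactions.values()))
    print("TID", *items)
    for tid, basket in transactions.items():
        # 1 if the transaction contains the item, 0 otherwise
        print(tid, *[1 if item in basket else 0 for item in items])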
Data Matrix

 If data objects have the same fixed set of numeric


attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

 Such a data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute
Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1

Document Data

 Each document becomes a ‘term’ vector


– Each term is a component (attribute) of the vector
– The value of each component is the number of times
the corresponding term occurs in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season

Document 1     3     0      5     0     2      6     0    2      0        2
Document 2     0     7      0     2     1      0     0    3      0        0
Document 3     0     1      0     0     1      2     2    0      3        0
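
A minimal Python sketch of building such term vectors (the toy
documents are our own):

    from collections import Counter

    docs = [
        "team play score game game lost season",
        "coach ball lost score",
    ]
    vocab = sorted({word for doc in docs for word in doc.split()})

    for doc in docs:
        counts = Counter(doc.split())
        # one component per vocabulary term: its frequency in the document
        print([counts.get(term, 0) for term in vocab])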

Graph Data
 Examples: Generic graph, a molecule, and webpages

[Figures: a generic graph with weighted edges, the benzene molecule
(C6H6), and linked webpages.]


Ordered Data

 Sequences of transactions
[Figure: a sequence of transactions per customer; each element of the
sequence is a set of items/events.]
Ordered Data

 Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data

 Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean.]

Data Quality

 Poor data quality negatively affects many data processing


efforts
 Data mining algorithms extract only what is in the data.
 If data quality issues are not handled carefully, data mining
algorithms will produce erroneous or spurious output.
 Data mining example: a classification model for detecting
people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default

Data Quality ..

 To overcome poor data quality, data mining focuses on:

1) The detection and correction of data quality problems
   (often called data cleaning)

2) The use of algorithms that can tolerate poor data quality

Data Quality …

 What kinds of data quality problems?


 How can we detect problems with the data?
 What can we do about these problems?

 Examples of data quality problems:


– Noise and outliers
– Wrong data
– Fake data
– Missing values
– Duplicate data

Noise

 For objects, noise is an extraneous object


 For attributes, noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone
connection and “snow” on a television screen
– The figures below show two sine waves of the same magnitude and
different frequencies, the waves combined, and the two sine waves with
random noise
 The magnitude and shape of the original signal is distorted

[Figures: two sine waves; the observed signal (the sum of the two sine
waves); and the observed signal with random noise. Each plot shows
magnitude vs. time (seconds).]

Outliers

 Outliers are data objects with characteristics that
are considerably different from most of the other
data objects in the data set

For example: In fraud and


network intrusion detection,
the goal is to find unusual
objects or events from
among a large number of
normal ones.

Missing Values

 Reasons for missing values


– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

 Handling missing values


– Eliminate data objects or variables
– Estimate missing values
 Example: time series of temperature
 Example: census results

– Ignore the missing value during analysis


– Replace with all possible values (weighted by their probabilities)
Duplicate Data

 Data set may include data objects that are


duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

 Examples:
– Same person with multiple email addresses

 Data cleaning
– Process of dealing with duplicate data issues

Data Quality - In a nutshell

Data mining algorithms extract only what is in the data.

If data quality issues are not handled carefully, data mining
algorithms will produce erroneous or spurious output.

Preprocessing is therefore a very important step for solving
data quality problems. (next topic)

Data Preprocessing

 Aggregation
 Sampling
 Dimensionality Reduction
 Feature Subset Selection
 Feature Creation
 Discretization and Binarization
 Variable Transformation

Aggregation

 Combining two or more attributes (or objects) into a single


attribute (or object)
 Purpose
– Data reduction - reduce the number of attributes or objects
– Change of scale
 Cities aggregated into regions, states, countries, etc.
 Days aggregated into weeks, months, or years
– More “stable” data - aggregated data tends to have less variability

Aggregation

 An obvious issue is how an aggregate transaction is created.

 Quantitative attributes
– such as price, are typically aggregated by taking a sum or an average

 Qualitative attributes
– such as item, can either be omitted or summarized in terms of a higher level
category, e.g., televisions versus electronics

 Disadvantages of aggregation
– Potential loss of interesting details
– In store example: aggregation over months loses information about which day of the
week has the highest sales.
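
A minimal pandas sketch of this trade-off (the data and column names
are illustrative): aggregating daily sales to monthly totals reduces
the data but discards the day-of-week detail.

    import pandas as pd

    daily = pd.DataFrame({
        "date":  pd.to_datetime(["2021-01-04", "2021-01-05", "2021-02-01"]),
        "store": ["S1", "S1", "S1"],
        "sales": [120.0, 80.0, 95.0],
    })

    # Quantitative attribute aggregated by sum; day-of-week information is lost.
    monthly = daily.groupby([daily["date"].dt.to_period("M"), "store"])["sales"].sum()
    print(monthly)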

Example: Precipitation in Australia

 This example is based on precipitation in Australia


from the period 1982 to 1993.
The next slide shows
– A histogram for the standard deviation of average
monthly precipitation for 3,030 0.5° by 0.5° grid cells in
Australia, and
– A histogram for the standard deviation of the average
yearly precipitation for the same locations.
 The average yearly precipitation has less variability
than the average monthly precipitation.
 All precipitation measurements (and their standard
deviations) are in centimeters.
Example: Precipitation in Australia …

Variation of Precipitation in Australia

[Histograms: standard deviation of average monthly precipitation (left)
and standard deviation of average yearly precipitation (right).]
Sampling
 Sampling is the main technique employed for data
reduction.
– It is often used for both the preliminary investigation of the
data and the final data analysis.

 Statisticians often sample because obtaining the entire


set of data of interest is too expensive or time
consuming.

 Sampling is typically used in data mining because


processing the entire set of data of interest is too
expensive or time consuming.

Sampling …

 The key principle for effective sampling is the


following:

– Using a sample will work almost as well as using the


entire data set, if the sample is representative

– A sample is representative if it has approximately the


same properties (of interest) as the original set of data

Sample Size

[Figure 2.9: example of the loss of structure with sampling
(8000 points, 2000 points, 500 points).]

Types of Sampling
 Simple Random Sampling
   – There is an equal probability of selecting any particular item
   – Sampling without replacement: as each item is selected, it is
     removed from the population
   – Sampling with replacement: objects are not removed from the
     population as they are selected, so the same object can be
     picked more than once

 Stratified sampling
   – Split the data into several partitions; then draw random
     samples from each partition
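
A minimal numpy sketch of the three schemes (data and strata are
illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.arange(100)
    groups = np.repeat([0, 1, 2, 3], 25)          # four equal-sized strata

    without_rep = rng.choice(data, size=10, replace=False)  # no duplicates possible
    with_rep    = rng.choice(data, size=10, replace=True)   # duplicates possible

    # Stratified: draw the same number of objects from each partition.
    stratified = np.concatenate(
        [rng.choice(data[groups == g], size=3, replace=False) for g in np.unique(groups)]
    )
    print(without_rep, with_rep, stratified, sep="\n")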

Sample Size
 What sample size is necessary to get at least one
object from each of 10 equal-sized groups?
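
A minimal Python sketch that answers this by inclusion-exclusion (the
helper name is ours): the probability that a random sample of size s,
drawn with equal group probabilities, contains at least one object from
each of g groups.

    from math import comb

    def p_all_groups(s, g=10):
        # inclusion-exclusion over the groups that are missed entirely
        return sum((-1) ** k * comb(g, k) * ((g - k) / g) ** s for k in range(g + 1))

    for s in (10, 20, 40, 60):
        print(s, round(p_all_groups(s), 3))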

Curse of Dimensionality

 When dimensionality increases, data becomes increasingly sparse in


the space that it occupies

 Definitions of density and distance between points, which are critical for
clustering and outlier detection, become less meaningful

 Many clustering and classification algorithms have trouble with high-


dimensional data leading to reduced classification accuracy and poor
quality clusters.

Dimensionality Reduction

 Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise

 Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques

Dimensionality Reduction: PCA

 A data reduction technique that transforms a large
number of correlated variables into a smaller set of
uncorrelated variables called principal components

 A method of extracting important variables from a large
number of variables available in a dataset

 It extracts a set of low-dimensional features from a high-
dimensional dataset, with the goal of capturing as much
information (variance) in the data as possible.

Dimensionality Reduction: PCA

 Steps involved in Principal Component Analysis
(see the sketch below):

1. Standardize the dataset
2. Compute the covariance matrix for the features in the dataset
3. Compute the eigenvalues and eigenvectors of the covariance matrix
4. Sort the eigenvalues and their corresponding eigenvectors
5. Choose the eigenvectors corresponding to the k largest
   eigenvalues to form a projection matrix
6. Transform the original matrix
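
A minimal numpy sketch of these steps on illustrative random data:

    import numpy as np

    X = np.random.default_rng(0).normal(size=(100, 5))  # 100 objects, 5 attributes

    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # 1. standardize
    C = np.cov(Z, rowvar=False)                    # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # 3. eigenvalues/eigenvectors

    order = np.argsort(eigvals)[::-1]              # 4. sort by decreasing eigenvalue
    eigvecs = eigvecs[:, order]

    k = 2
    X_reduced = Z @ eigvecs[:, :k]                 # 5.-6. project onto k components
    print(X_reduced.shape)                         # (100, 2)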

Dimensionality Reduction: PCA

 Goal is to find a projection that captures the


largest amount of variation in data
[Figure: data points in the (x1, x2) plane, with the direction of
maximum variance indicated.]
Feature Subset Selection

 Another way to reduce dimensionality of data


✔ Use only a subset of the features
 Redundant features
– Duplicate much or all of the information contained in one or more
other attributes
– Example: purchase price of a product and the amount of sales tax
paid
 Irrelevant features
– Contain no information that is useful for the data mining task at hand
– Example: students' ID is often irrelevant to the task of predicting
students' GPA
 Redundant and irrelevant features can reduce classification
accuracy and the quality of the clusters that are found.
Feature Subset Selection

 Techniques
✔ Brute-Force approach:

Try all possible feature subsets as input to data mining
algorithm, and then take the subset that produces the
best results

✔ Embedded approaches:

Feature selection occurs naturally as part of the data
mining algorithm

 the data mining algorithm itself decides which attributes to


use and which to ignore.

 For example: Algorithms for building decision tree


classifiers
Feature Subset Selection

 Techniques
✔ Filter approaches:

Features are selected before the data mining algorithm is run
 Using some approach that is independent of the data mining
task.
 For example: select sets of attributes whose pairwise
correlation is as low as possible.

✔ Wrapper approaches:

Use the data mining algorithm as a black box to find best
subset of attributes

An Architecture for Feature Subset Selection

 It is possible to encompass both the filter and


wrapper approaches within a common architecture

 The feature selection process can be viewed as consisting of
four parts:
✔ A measure for evaluating a subset

Filter methods and Wrapper methods differ only in the way in
which they evaluate a subset of features
✔ A search strategy that controls the generation of a new subset of
features
✔ A stopping criterion
✔ A validation procedure

Feature Creation

 Create new attributes that can capture the


important information in a data set much more
efficiently than the original attributes

 Three general methodologies:


– Feature extraction
 Example: extracting edges from images
– Feature construction
 Example: dividing mass by volume to get density
– Mapping data to new space
 Example: Fourier and wavelet analysis

Mapping Data to a New Space

 Fourier and wavelet transform

[Figures: a signal of two sine waves plus noise in the time domain,
and its frequency-domain representation after a Fourier transform.]

Discretization

 Discretization is the process of converting a


continuous attribute into a categorical attribute
– A potentially infinite number of values are mapped into
a small number of categories
– Discretization is used in both unsupervised and
supervised settings

 Discretization is typically applied to attributes that


are used in classification or association analysis

Discretization of continuous attributes

 Transformation of a continuous attribute to a


categorical attribute involves two subtasks:
– deciding how many categories, n, to have
– determining how to map the values of the continuous
attribute to these categories.
 In the first step, after the values of the continuous
attribute are sorted, they are then divided into n
intervals by specifying n − 1 split points.
 In the second step, all the values in one interval are
mapped to the same categorical value.
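
A minimal numpy sketch of the two most common unsupervised choices of
split points (illustrative data; supervised methods would instead use
class labels):

    import numpy as np

    x = np.random.default_rng(0).normal(size=200)
    n = 4                                            # number of categories

    # Equal interval width: n - 1 split points evenly spaced over the range.
    width_splits = np.linspace(x.min(), x.max(), n + 1)[1:-1]
    width_bins = np.digitize(x, width_splits)

    # Equal frequency (equal depth): split points at the quartiles.
    freq_splits = np.quantile(x, [0.25, 0.5, 0.75])
    freq_bins = np.digitize(x, freq_splits)

    print(np.bincount(width_bins))   # interval counts differ
    print(np.bincount(freq_bins))    # interval counts are (nearly) equal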

Unsupervised Discretization

Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.

[Figures: the same data discretized into four categories using the
equal interval width, equal frequency (equal depth), and K-means
approaches.]
Binarization
 Binarization maps a continuous or categorical
attribute into one or more binary variables
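
A minimal Python sketch of one common scheme, one binary variable per
categorical value (the values are illustrative):

    values = ["awful", "poor", "OK", "good", "great"]
    categories = sorted(set(values))

    for v in values:
        # one asymmetric binary attribute per category
        print(v, [1 if v == c else 0 for c in categories])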

Attribute Transformation

 An attribute transform is a function that maps the


entire set of values of a given attribute to a new
set of replacement values such that each old
value can be identified with one of the new
values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and Normalization
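
A minimal numpy sketch of the two named transformations (illustrative
values):

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

    standardized = (x - x.mean()) / x.std()           # mean 0, standard deviation 1
    normalized = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
    print(standardized)
    print(normalized)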

Data Preprocessing – In a Nutshell

 Aggregation
✔ Several data objects or attributes are combined, and the cumulative data is used

 Sampling
✔ Only a few representative data objects are kept; the rest are discarded

 Dimensionality Reduction
✔ Keeping only the attributes that are important

 Feature Subset Selection


 Feature Creation/Extraction
 Discretization and Binarization
✔ Discretize the values; the raw continuous value may not be useful (for example, age ranges)

 Variable Transformation
✔ Scale it by some factor

Similarity and Dissimilarity Measures

 Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
 Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
 Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single, simple
attribute.

Nominal:         d = 0 if x = y, 1 if x ≠ y;   s = 1 − d
Ordinal:         d = |x − y| / (n − 1), where the values are mapped
                 to integers 0 to n − 1;   s = 1 − d
Interval/Ratio:  d = |x − y|;   s = −d, s = 1/(1 + d), or
                 s = 1 − (d − min_d)/(max_d − min_d)

Euclidean Distance

 Euclidean Distance:

      d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}

where n is the number of dimensions (attributes) and x_k and y_k
are, respectively, the kth attributes (components) of data objects
x and y.

 Standardization is necessary if scales differ.

Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

Distance Matrix
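
A minimal Python sketch applying the Euclidean distance formula to
these points:

    from math import sqrt

    points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

    def euclidean(x, y):
        return sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

    print(round(euclidean(points["p1"], points["p2"]), 3))  # 2.828, as in the matrix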
Minkowski Distance

 Minkowski Distance is a generalization of Euclidean Distance:

      d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions (attributes),
and x_k and y_k are, respectively, the kth attributes (components)
of data objects x and y.

Minkowski Distance: Examples

 r = 1. City block (Manhattan, taxicab, L1 norm) distance.


– A common example of this for binary vectors is the Hamming
distance, which is just the number of bits that are different
between two binary vectors

 r = 2. Euclidean distance

 r = ∞. “supremum” (L_max norm, L_∞ norm) distance.
– This is the maximum difference between any component of
the vectors

 Do not confuse r with n, i.e., all these distances are


defined for all numbers of dimensions.

Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1     p2     p3     p4
p1     0      4      4      6
p2     4      0      2      4
p3     4      2      0      2
p4     6      4      2      0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1     p2     p3     p4
p1     0      2      3      5
p2     2      0      1      3
p3     3      1      0      2
p4     5      3      2      0

Distance Matrix
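
A minimal numpy sketch computing all three matrices for these points:

    import numpy as np

    P = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
    diff = np.abs(P[:, None, :] - P[None, :, :])   # pairwise component differences

    L1   = diff.sum(axis=-1)                  # city block (r = 1)
    L2   = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean (r = 2)
    Linf = diff.max(axis=-1)                  # supremum (r = infinity)
    print(L1, np.round(L2, 3), Linf, sep="\n\n")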
Mahalanobis Distance

      mahalanobis(x, y) = (x - y) \Sigma^{-1} (x - y)^{T}

where Σ is the covariance matrix of the input data.

[Figure: for the two red points shown, the Euclidean distance is 14.7
and the Mahalanobis distance is 6.]

Mahalanobis Distance

Covariance Matrix:

      Σ = | 0.3  0.2 |
          | 0.2  0.3 |

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A,B) = 5
Mahal(A,C) = 4
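
A minimal numpy sketch reproducing these values:

    import numpy as np

    Sigma = np.array([[0.3, 0.2], [0.2, 0.3]])
    Sigma_inv = np.linalg.inv(Sigma)

    def mahalanobis(x, y):
        d = np.asarray(x) - np.asarray(y)
        return d @ Sigma_inv @ d      # (x - y) Sigma^-1 (x - y)^T

    A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
    print(mahalanobis(A, B))   # ~5
    print(mahalanobis(A, C))   # ~4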
