All Data Mining Chapters

The lecture notes cover key concepts in data mining, including attributes, types of data, data quality, and data preprocessing. They categorize attributes into nominal, ordinal, interval, and ratio types, discuss the importance of data quality and common issues such as noise, outliers, and missing values, and describe different types of data sets, including record, document, transaction, and graph data.


Data Mining: Data

Lecture Notes for Chapter 2

Introduction to Data Mining , 2nd Edition


by
Tan, Steinbach, Kumar

09/14/2020 Introduction to Data Mining, 2nd Edition 1


Tan, Steinbach, Karpatne, Kumar
Outline

Attributes and Objects

Types of Data

Data Quality

Similarity and Distance

Data Preprocessing

09/14/2020 Introduction to Data Mining, 2nd Edition 2


Tan, Steinbach, Karpatne, Kumar
What is Data?

Collection of data objects and their attributes

An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature

A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values

Attribute values are numbers or symbols


assigned to an attribute for a particular object

Distinction between attributes and attribute values


– Same attribute can be mapped to different attribute
values
◆ Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of


values
◆ Example: Attribute values for ID and age are integers
– But properties of an attribute can be different from the properties of the values used to represent the attribute
Tan, Steinbach, Karpatne, Kumar
Measurement of Length
The way you measure an attribute may not match the attribute's properties.

[Figure: the lengths of five objects (A, B, C, D, …) measured on two scales. The scale on the left preserves only the ordering property of length; the scale on the right preserves both the ordering and additivity properties of length.]
Types of Attributes

There are different types of attributes


– Nominal
◆ Examples: ID numbers, eye color, zip codes
– Ordinal
◆ Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height {tall, medium, short}
– Interval
◆ Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
◆ Examples: temperature in Kelvin, length, counts,
elapsed time (e.g., time to run a race)
09/14/2020 Introduction to Data Mining, 2nd Edition 6
Tan, Steinbach, Karpatne, Kumar
Properties of Attribute Values

The type of an attribute depends on which of the


following properties/operations it possesses:
– Distinctness: =, ≠
– Order: <, >
– Differences are meaningful: +, -
– Ratios are meaningful: *, /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences
– Ratio attribute: all 4 properties/operations

09/14/2020 Introduction to Data Mining, 2nd Edition 7


Tan, Steinbach, Karpatne, Kumar
Difference Between Ratio and Interval

Is it physically meaningful to say that a


temperature of 10° is twice that of 5° on
– the Celsius scale?
– the Fahrenheit scale? 5°C=41°F, 10°C=50°F
– the Kelvin scale? 5°C=278.15°K, 10°C=283.15°K

Consider measuring the height above average


– If Bill’s height is three inches above average and
Bob’s height is six inches above average, then would
we say that Bob is twice as tall as Bill?
– Is this situation analogous to that of temperature?

09/14/2020 Introduction to Data Mining, 2nd Edition 8


Tan, Steinbach, Karpatne, Kumar
Attribute Type | Description | Examples | Operations

Nominal (categorical, qualitative)
  Nominal attribute values only distinguish. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal (categorical, qualitative)
  Ordinal attribute values also order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval (numeric, quantitative)
  For interval attributes, differences between values are meaningful. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio (numeric, quantitative)
  For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
  Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens


Attribute Type | Allowed Transformations | Comments

Nominal (categorical, qualitative)
  Any permutation of values.
  If all employee ID numbers were reassigned, would it make any difference?

Ordinal (categorical, qualitative)
  An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
  An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval (numeric, quantitative)
  new_value = a * old_value + b, where a and b are constants.
  Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio (numeric, quantitative)
  new_value = a * old_value
  Length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens


Discrete and Continuous Attributes

Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
09/14/2020 Introduction to Data Mining, 2nd Edition 11
Tan, Steinbach, Karpatne, Kumar
Important Characteristics of Data

– Dimensionality (number of attributes)


◆ High dimensional data brings a number of challenges

– Sparsity
◆ Only presence counts

– Resolution
◆ Patterns depend on the scale

– Size
◆ Type of analysis may depend on size of data

09/14/2020 Introduction to Data Mining, 2nd Edition 15


Tan, Steinbach, Karpatne, Kumar
Types of data sets
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

09/14/2020 Introduction to Data Mining, 2nd Edition 16


Tan, Steinbach, Karpatne, Kumar
Record Data

Data that consists of a collection of records, each


of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

09/14/2020 Introduction to Data Mining, 2nd Edition 17


Tan, Steinbach, Karpatne, Kumar
Data Matrix

If data objects have the same fixed set of numeric


attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

Such a data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

09/14/2020 Introduction to Data Mining, 2nd Edition 18


Tan, Steinbach, Karpatne, Kumar
Document Data

Each document becomes a ‘term’ vector


– Each term is a component (attribute) of the vector
– The value of each component is the number of times
the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0

09/14/2020 Introduction to Data Mining, 2nd Edition 19


Tan, Steinbach, Karpatne, Kumar
Transaction Data

A special type of data, where


– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
– Can represent transaction data as record data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
09/14/2020 Introduction to Data Mining, 2nd Edition 20
Tan, Steinbach, Karpatne, Kumar
Graph Data

Examples: Generic graph, a molecule, and webpages


Benzene Molecule: C6H6


09/14/2020 Introduction to Data Mining, 2nd Edition 21
Tan, Steinbach, Karpatne, Kumar
Ordered Data

Sequences of transactions
[Figure: a sequence of transactions ordered in time; each element of the sequence is a set of items/events.]
09/14/2020 Introduction to Data Mining, 2nd Edition 22
Tan, Steinbach, Karpatne, Kumar
Ordered Data

Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

09/14/2020 Introduction to Data Mining, 2nd Edition 23


Tan, Steinbach, Karpatne, Kumar
Ordered Data

Spatio-Temporal Data

Average Monthly
Temperature of
land and ocean

09/14/2020 Introduction to Data Mining, 2nd Edition 24


Tan, Steinbach, Karpatne, Kumar
Data Quality

Poor data quality negatively affects many data processing


efforts

Data mining example: a classification model for detecting


people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default

09/14/2020 Introduction to Data Mining, 2nd Edition 25


Tan, Steinbach, Karpatne, Kumar
Data Quality …

What kinds of data quality problems?


How can we detect problems with the data?
What can we do about these problems?

Examples of data quality problems:


– Noise and outliers
– Wrong data
– Fake data
– Missing values
– Duplicate data
09/14/2020 Introduction to Data Mining, 2nd Edition 26
Tan, Steinbach, Karpatne, Kumar
Noise

For objects, noise is an extraneous object


For attributes, noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone
and “snow” on television screen
– The figures below show two sine waves of the same magnitude and
different frequencies, the waves combined, and the two sine waves with
random noise
◆ The magnitude and shape of the original signal is distorted

09/14/2020 Introduction to Data Mining, 2nd Edition 27


Tan, Steinbach, Karpatne, Kumar
Outliers

Outliers are data objects with characteristics that


are considerably different than most of the other
data objects in the data set
– Case 1: Outliers are
noise that interferes
with data analysis

– Case 2: Outliers are


the goal of our analysis
◆ Credit card fraud
◆ Intrusion detection

Causes?
09/14/2020 Introduction to Data Mining, 2nd Edition 28
Tan, Steinbach, Karpatne, Kumar
Missing Values

Reasons for missing values


– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

Handling missing values


– Eliminate data objects or variables
– Estimate missing values
◆ Example: time series of temperature
◆ Example: census results

– Ignore the missing value during analysis

09/14/2020 Introduction to Data Mining, 2nd Edition 29


Tan, Steinbach, Karpatne, Kumar
Duplicate Data

Data set may include data objects that are


duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

Examples:
– Same person with multiple email addresses

Data cleaning
– Process of dealing with duplicate data issues

When should duplicate data not be removed?


09/14/2020 Introduction to Data Mining, 2nd Edition 30
Tan, Steinbach, Karpatne, Kumar
Similarity and Dissimilarity Measures

Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
09/14/2020 Introduction to Data Mining, 2nd Edition 31
Tan, Steinbach, Karpatne, Kumar
Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

Attribute Type     Dissimilarity                              Similarity
Nominal            d = 0 if x = y, d = 1 if x ≠ y             s = 1 if x = y, s = 0 if x ≠ y
Ordinal            d = |x − y| / (n − 1)                      s = 1 − d
                   (values mapped to integers 0 to n − 1)
Interval or Ratio  d = |x − y|                                s = −d, s = 1/(1 + d), or
                                                              s = 1 − (d − min_d)/(max_d − min_d)

09/14/2020 Introduction to Data Mining, 2nd Edition 32


Tan, Steinbach, Karpatne, Kumar
Euclidean Distance

Euclidean Distance:

    dist(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)² )

where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

Standardization is necessary, if scales differ.

09/14/2020 Introduction to Data Mining, 2nd Edition 33


Tan, Steinbach, Karpatne, Kumar
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix:

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
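A short Python sketch (NumPy assumed) that reproduces this distance matrix; the variable names are mine:

    import numpy as np

    # the four 2-D points from the example above
    points = np.array([[0, 2],   # p1
                       [2, 0],   # p2
                       [3, 1],   # p3
                       [5, 1]])  # p4

    # pairwise Euclidean distances: square root of summed squared differences
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    print(np.round(dist, 3))     # matches the distance matrix above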
09/14/2020 Introduction to Data Mining, 2nd Edition 34
Tan, Steinbach, Karpatne, Kumar
Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance:

    dist(x, y) = ( Σ_{k=1..n} |x_k − y_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

09/14/2020 Introduction to Data Mining, 2nd Edition 35


Tan, Steinbach, Karpatne, Kumar
Minkowski Distance: Examples

r = 1. City block (Manhattan, taxicab, L1 norm) distance.


– A common example of this for binary vectors is the
Hamming distance, which is just the number of bits that are
different between two binary vectors

r = 2. Euclidean distance

r → ∞. “supremum” (Lmax norm, L∞ norm) distance.


– This is the maximum difference between any component of
the vectors

Do not confuse r with n, i.e., all these distances are


defined for all numbers of dimensions.

09/14/2020 Introduction to Data Mining, 2nd Edition 36


Tan, Steinbach, Karpatne, Kumar
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrices
09/14/2020 Introduction to Data Mining, 2nd Edition 37
Tan, Steinbach, Karpatne, Kumar
Mahalanobis Distance

    mahalanobis(x, y) = (x − y) Σ⁻¹ (x − y)ᵀ

Σ is the covariance matrix of the data.

For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.


09/14/2020 Introduction to Data Mining, 2nd Edition 38
Tan, Steinbach, Karpatne, Kumar
Mahalanobis Distance

Covariance matrix:

    Σ = [ 0.3  0.2 ]
        [ 0.2  0.3 ]

Points:
    A = (0.5, 0.5)
    B = (0, 1)
    C = (1.5, 1.5)

    Mahal(A, B) = 5
    Mahal(A, C) = 4

09/14/2020 Introduction to Data Mining, 2nd Edition 39


Tan, Steinbach, Karpatne, Kumar
Mahalanobis Distance

Given A = (0.5, 0.5) and B = (0, 1), Mahal(A, B) = (A − B) Σ⁻¹ (A − B)ᵀ

For a 2×2 matrix M = [ a  b ; c  d ],  M⁻¹ = 1/(ad − bc) [ d  −b ; −c  a ]

Since Σ = [ 0.3  0.2 ; 0.2  0.3 ]:

    Σ⁻¹ = 1/(0.09 − 0.04) [ 0.3  −0.2 ; −0.2  0.3 ] = 20 [ 0.3  −0.2 ; −0.2  0.3 ] = [ 6  −4 ; −4  6 ]

    (0.5  −0.5) [ 6  −4 ; −4  6 ] (0.5  −0.5)ᵀ = 5
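A minimal NumPy check of this calculation (variable names are mine):

    import numpy as np

    A = np.array([0.5, 0.5])
    B = np.array([0.0, 1.0])
    Sigma = np.array([[0.3, 0.2],
                      [0.2, 0.3]])

    d = A - B
    mahal = d @ np.linalg.inv(Sigma) @ d   # (A − B) Σ⁻¹ (A − B)ᵀ
    print(mahal)                           # ≈ 5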
09/14/2020 Introduction to Data Mining, 2nd Edition 40
Tan, Steinbach, Karpatne, Kumar
Common Properties of a Distance

Distances, such as the Euclidean distance,


have some well-known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)

where d(x, y) is the distance (dissimilarity) between


points (data objects), x and y.

A distance that satisfies these properties is a


metric
09/14/2020 Introduction to Data Mining, 2nd Edition 41
Tan, Steinbach, Karpatne, Kumar
Common Properties of a Similarity

Similarities, also have some well known


properties.

1. s(x, y) = 1 (or maximum similarity) only if x = y.


(does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data


objects), x and y.

09/14/2020 Introduction to Data Mining, 2nd Edition 42


Tan, Steinbach, Karpatne, Kumar
Similarity Between Binary Vectors
Common situation is that objects, x and y, have only
binary attributes

Compute similarities using the following quantities


f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)

J = number of 11 matches / number of non-zero attributes


= (f11) / (f01 + f10 + f11)

09/14/2020 Introduction to Data Mining, 2nd Edition 43


Tan, Steinbach, Karpatne, Kumar
SMC versus Jaccard: Example

x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
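A small Python sketch of SMC and Jaccard for the binary vectors above (function names are mine):

    def smc(x, y):
        # simple matching coefficient: all matches / all attributes
        matches = sum(1 for a, b in zip(x, y) if a == b)
        return matches / len(x)

    def jaccard(x, y):
        # 1-1 matches / attributes that are non-zero in at least one vector
        f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
        nonzero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
        return f11 / nonzero if nonzero else 0.0

    x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
    print(smc(x, y), jaccard(x, y))   # 0.7 0.0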

09/14/2020 Introduction to Data Mining, 2nd Edition 44


Tan, Steinbach, Karpatne, Kumar
Cosine Similarity

If d1 and d2 are two document vectors, then


cos( d1, d2 ) = <d1,d2> / ||d1|| ||d2|| ,
where <d1,d2> indicates inner product or vector dot
product of vectors, d1 and d2, and || d || is the length of
vector d.

Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2 ) = 0.3150
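A quick check of this value in Python (NumPy assumed):

    import numpy as np

    d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
    d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

    cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    print(round(cos, 4))   # 0.315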

09/14/2020 Introduction to Data Mining, 2nd Edition 45


Tan, Steinbach, Karpatne, Kumar
Correlation measures the linear relationship between objects:

    corr(x, y) = covariance(x, y) / ( std(x) · std(y) )

09/14/2020 Introduction to Data Mining, 2nd Edition 46


Tan, Steinbach, Karpatne, Kumar
Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.

09/14/2020 Introduction to Data Mining, 2nd Edition 47


Tan, Steinbach, Karpatne, Kumar
Drawback of Correlation

x = (-3, -2, -1, 0, 1, 2, 3)


y = (9, 4, 1, 0, 1, 4, 9)

yi = xi2

mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74

corr = [ (−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5) ] / ( 6 * 2.16 * 3.74 )
     = 0

09/14/2020 Introduction to Data Mining, 2nd Edition 48


Tan, Steinbach, Karpatne, Kumar
Correlation vs Cosine vs Euclidean Distance
Compare the three proximity measures according to their behavior under
variable transformation
– scaling: multiplication by a value
– translation: adding a constant
Property                               Cosine  Correlation  Euclidean Distance
Invariant to scaling (multiplication)  Yes     Yes          No
Invariant to translation (addition)    No      Yes          No

Consider the example


– x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0)
– ys = y * 2 (scaled version of y), yt = y + 5 (translated version)
Measure             (x, y)   (x, ys)  (x, yt)
Cosine              0.9667   0.9667   0.7940
Correlation         0.9429   0.9429   0.9429
Euclidean Distance  1.4142   5.8310   14.2127

09/14/2020 Introduction to Data Mining, 2nd Edition 49


Tan, Steinbach, Karpatne, Kumar
Correlation vs cosine vs Euclidean distance

Choice of the right proximity measure depends on the domain


What is the correct choice of proximity measure for the
following situations?
– Comparing documents using the frequencies of words
◆ Documents are considered similar if the word frequencies are similar

– Comparing the temperature in Celsius of two locations


◆ Two locations are considered similar if the temperatures are similar in
magnitude

– Comparing two time series of temperature measured in Celsius


◆ Two time series are considered similar if their “shape” is similar, i.e., they
vary in the same way over time, achieving minimums and maximums at
similar times, etc.

09/14/2020 Introduction to Data Mining, 2nd Edition 50


Tan, Steinbach, Karpatne, Kumar
Comparison of Proximity Measures

Domain of application
– Similarity measures tend to be specific to the type of
attribute and data
– Record data, images, graphs, sequences, 3D-protein
structure, etc. tend to have different measures
However, one can talk about various properties that
you would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
The measure must be applicable to the data and
produce results that agree with domain knowledge
09/14/2020 Introduction to Data Mining, 2nd Edition 51
Tan, Steinbach, Karpatne, Kumar
Information Based Measures

Information theory is a well-developed and


fundamental discipline with broad applications

Some similarity measures are based on


information theory
– Mutual information in various versions
– Maximal Information Coefficient (MIC) and related
measures
– General and can handle non-linear relationships
– Can be complicated and time intensive to compute

09/14/2020 Introduction to Data Mining, 2nd Edition 52


Tan, Steinbach, Karpatne, Kumar
Information and Probability

Information relates to possible outcomes of an event


– transmission of a message, flip of a coin, or measurement
of a piece of data

The more certain an outcome, the less information


that it contains and vice-versa
– For example, if a coin has two heads, then an outcome of
heads provides no information
– More quantitatively, the information is related to the probability of an outcome
◆ The smaller the probability of an outcome, the more information it
provides and vice-versa
– Entropy is the commonly used measure
09/14/2020 Introduction to Data Mining, 2nd Edition 53
Tan, Steinbach, Karpatne, Kumar
Entropy

For
– a variable (event) X,
– with n possible values (outcomes) x1, x2, …, xn
– each outcome having probability p1, p2, …, pn
– the entropy of X, H(X), is given by

    H(X) = − Σ_{i=1..n} p_i log2(p_i)

Entropy is between 0 and log2n and is measured in


bits
– Thus, entropy is a measure of how many bits it takes to
represent an observation of X on average
09/14/2020 Introduction to Data Mining, 2nd Edition 54
Tan, Steinbach, Karpatne, Kumar
Entropy Examples

For a coin with probability p of heads and


probability q = 1 – p of tails
    H = − p log2(p) − q log2(q)
– For p= 0.5, q = 0.5 (fair coin) H = 1
– For p = 1 or q = 1, H = 0

What is the entropy of a fair four-sided die?

09/14/2020 Introduction to Data Mining, 2nd Edition 55


Tan, Steinbach, Karpatne, Kumar
Entropy for Sample Data: Example

Hair Color  Count  p     −p log2(p)
Black       75     0.75  0.3113
Brown       15     0.15  0.4105
Blond       5      0.05  0.2161
Red         0      0.00  0
Other       5      0.05  0.2161
Total       100    1.0   1.1540

Maximum entropy is log2(5) = 2.3219
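A short Python sketch that reproduces this entropy value (standard library only):

    import math

    counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}
    total = sum(counts.values())

    # entropy: sum of -p*log2(p) over categories with non-zero counts
    H = -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)
    print(round(H, 4))                        # 1.154
    print(round(math.log2(len(counts)), 4))   # 2.3219, the maximum for 5 categories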

09/14/2020 Introduction to Data Mining, 2nd Edition 56


Tan, Steinbach, Karpatne, Kumar
Entropy for Sample Data

Suppose we have
– a number of observations (m) of some attribute, X,
e.g., the hair color of students in the class,
– where there are n different possible values
– And the number of observation in the ith category is mi
– Then, for this sample

    H(X) = − Σ_{i=1..n} (m_i / m) log2(m_i / m)

For continuous data, the calculation is harder


09/14/2020 Introduction to Data Mining, 2nd Edition 57
Tan, Steinbach, Karpatne, Kumar
General Approach for Combining Similarities

Sometimes attributes are of many different types, but an


overall similarity is needed.
1: For the kth attribute, compute a similarity, sk(x, y), in the
range [0, 1].
2: Define an indicator variable, δ_k, for the kth attribute as follows:
   δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
   δ_k = 1 otherwise
3: Compute

    similarity(x, y) = Σ_{k=1..n} δ_k s_k(x, y) / Σ_{k=1..n} δ_k

09/14/2020 Introduction to Data Mining, 2nd Edition 61


Tan, Steinbach, Karpatne, Kumar
Using Weights to Combine Similarities

May not want to treat all attributes the same.


– Use non-negative weights ω_k

    similarity(x, y) = Σ_{k=1..n} ω_k δ_k s_k(x, y) / Σ_{k=1..n} ω_k δ_k

Can also define a weighted form of distance

09/14/2020 Introduction to Data Mining, 2nd Edition 62


Tan, Steinbach, Karpatne, Kumar
Data Preprocessing

Aggregation
Sampling
Discretization and Binarization
Attribute Transformation
Dimensionality Reduction
Feature subset selection
Feature creation

09/14/2020 Introduction to Data Mining, 2nd Edition 63


Tan, Steinbach, Karpatne, Kumar
Aggregation

Combining two or more attributes (or objects) into


a single attribute (or object)

Purpose
– Data reduction
◆ Reduce the number of attributes or objects
– Change of scale
◆ Cities aggregated into regions, states, countries, etc.
◆ Days aggregated into weeks, months, or years
– More “stable” data
◆ Aggregated data tends to have less variability

09/14/2020 Introduction to Data Mining, 2nd Edition 64


Tan, Steinbach, Karpatne, Kumar
Example: Precipitation in Australia

This example is based on precipitation in


Australia from the period 1982 to 1993.
The next slide shows
– A histogram for the standard deviation of average
monthly precipitation for 3,030 0.5◦ by 0.5◦ grid cells in
Australia, and
– A histogram for the standard deviation of the average
yearly precipitation for the same locations.
The average yearly precipitation has less
variability than the average monthly precipitation.
All precipitation measurements (and their
standard deviations) are in centimeters.
09/14/2020 Introduction to Data Mining, 2nd Edition 65
Tan, Steinbach, Karpatne, Kumar
Example: Precipitation in Australia …

Variation of Precipitation in Australia

Standard Deviation of Average Standard Deviation of


Monthly Precipitation Average Yearly Precipitation
09/14/2020 Introduction to Data Mining, 2nd Edition 66
Tan, Steinbach, Karpatne, Kumar
Sampling
Sampling is the main technique employed for data
reduction.
– It is often used for both the preliminary investigation of
the data and the final data analysis.

Statisticians often sample because obtaining the


entire set of data of interest is too expensive or
time consuming.

Sampling is typically used in data mining because


processing the entire set of data of interest is too
expensive or time consuming.

09/14/2020 Introduction to Data Mining, 2nd Edition 67


Tan, Steinbach, Karpatne, Kumar
Sampling …

The key principle for effective sampling is the


following:

– Using a sample will work almost as well as using the


entire data set, if the sample is representative

– A sample is representative if it has approximately the


same properties (of interest) as the original set of data

09/14/2020 Introduction to Data Mining, 2nd Edition 68


Tan, Steinbach, Karpatne, Kumar
Sample Size

8000 points 2000 Points 500 Points

09/14/2020 Introduction to Data Mining, 2nd Edition 69


Tan, Steinbach, Karpatne, Kumar
Types of Sampling
Simple Random Sampling
– There is an equal probability of selecting any particular
item
– Sampling without replacement
◆ As each item is selected, it is removed from the
population
– Sampling with replacement
◆ Objects are not removed from the population as they
are selected for the sample.
◆ In sampling with replacement, the same object can
be picked up more than once
Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition

09/14/2020 Introduction to Data Mining, 2nd Edition 70


Tan, Steinbach, Karpatne, Kumar
Sample Size
What sample size is necessary to get at least one
object from each of 10 equal-sized groups.

09/14/2020 Introduction to Data Mining, 2nd Edition 71


Tan, Steinbach, Karpatne, Kumar
Discretization

Discretization is the process of converting a


continuous attribute into an ordinal attribute
– A potentially infinite number of values are mapped
into a small number of categories
– Discretization is used in both unsupervised and
supervised settings

09/14/2020 Introduction to Data Mining, 2nd Edition 72


Tan, Steinbach, Karpatne, Kumar
Unsupervised Discretization

Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.

09/14/2020 Introduction to Data Mining, 2nd Edition 73


Tan, Steinbach, Karpatne, Kumar
Unsupervised Discretization

Equal interval width approach used to obtain 4 values.

09/14/2020 Introduction to Data Mining, 2nd Edition 74


Tan, Steinbach, Karpatne, Kumar
Unsupervised Discretization

Equal frequency approach used to obtain 4 values.

09/14/2020 Introduction to Data Mining, 2nd Edition 75


Tan, Steinbach, Karpatne, Kumar
Unsupervised Discretization

K-means approach to obtain 4 values.
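As a sketch of these three unsupervised approaches, scikit-learn's KBinsDiscretizer (if available) covers all of them through its strategy parameter; the synthetic data below is only an illustration, not the data set shown in the figures:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    # synthetic 1-D data with four groups (stand-in for the figures above)
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(loc, 0.5, 50) for loc in (0, 4, 8, 12)]).reshape(-1, 1)

    for strategy in ("uniform", "quantile", "kmeans"):   # equal width, equal frequency, k-means
        disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
        labels = disc.fit_transform(x).ravel().astype(int)
        print(strategy, np.bincount(labels))             # number of objects per interval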

09/14/2020 Introduction to Data Mining, 2nd Edition 76


Tan, Steinbach, Karpatne, Kumar
Discretization in Supervised Settings

– Many classification algorithms work best if both


the independent and dependent variables have
only a few values
– We give an illustration of the usefulness of
discretization using the Iris data set

09/14/2020 Introduction to Data Mining, 2nd Edition 77


Tan, Steinbach, Karpatne, Kumar
Iris Sample Data Set

Iris Plant data set.


– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician Douglas Fisher
– Three flower types (classes):
◆ Setosa
◆ Versicolour
◆ Virginica
– Four (non-class) attributes
◆ Sepal width and length
◆ Petal width and length Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
09/14/2020 Introduction to Data Mining, 2nd Edition 78
Tan, Steinbach, Karpatne, Kumar
Discretization: Iris Example …

How can we tell what the best discretization is?


– Unsupervised discretization: find breaks in the data values
  ◆ Example: [Figure: histogram of Petal Length — counts on the y-axis, petal length from 0 to 8 on the x-axis]

– Supervised discretization: Use class labels to find


breaks
09/14/2020 Introduction to Data Mining, 2nd Edition 79
Tan, Steinbach, Karpatne, Kumar
Discretization: Iris Example

Petal width low or petal length low implies Setosa.
Petal width medium or petal length medium implies Versicolour.
Petal width high or petal length high implies Virginica.
Tan, Steinbach, Karpatne, Kumar
Binarization

Binarization maps a continuous or categorical


attribute into one or more binary variables

Typically used for association analysis

Often convert a continuous attribute to a


categorical attribute and then convert a
categorical attribute to a set of binary attributes
– Association analysis needs asymmetric binary
attributes
– Examples: eye color and height measured as
{low, medium, high}
09/14/2020 Introduction to Data Mining, 2nd Edition 81
Tan, Steinbach, Karpatne, Kumar
Attribute Transformation

An attribute transform is a function that maps the


entire set of values of a given attribute to a new
set of replacement values such that each old
value can be identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Normalization
Refers to various techniques to adjust to

differences among attributes in terms of frequency
of occurrence, mean, variance, range
◆ Take out unwanted, common signal, e.g.,
seasonality
– In statistics, standardization refers to subtracting off
the means and dividing by the standard deviation
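A small illustration of standardization (z-scores) in Python; the numbers are made up and NumPy is assumed:

    import numpy as np

    x = np.array([12.0, 15.0, 9.0, 21.0, 18.0])   # illustrative attribute values

    z = (x - x.mean()) / x.std()   # subtract the mean, divide by the standard deviation
    print(z.round(3))              # standardized values
    print(round(z.mean(), 3), round(z.std(), 3))   # mean 0, standard deviation 1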
09/14/2020 Introduction to Data Mining, 2nd Edition 82
Tan, Steinbach, Karpatne, Kumar
Example: Sample Time Series of Plant Growth
Minneapolis

Net Primary
Production (NPP)
is a measure of
plant growth used
by ecosystem
scientists.

Correlations between time series


Minneapolis Atlanta Sao Paolo
Minneapolis 1.0000 0.7591 -0.7581
Atlanta 0.7591 1.0000 -0.5739
Sao Paolo -0.7581 -0.5739 1.0000
09/14/2020 Introduction to Data Mining, 2nd Edition 83
Tan, Steinbach, Karpatne, Kumar
Seasonality Accounts for Much Correlation
Minneapolis
Normalized using
monthly Z Score:
Subtract off monthly
mean and divide by
monthly standard
deviation

Correlations between time series


Minneapolis Atlanta Sao Paolo
Minneapolis 1.0000 0.0492 0.0906
Atlanta 0.0492 1.0000 -0.0154
Sao Paolo 0.0906 -0.0154 1.0000
09/14/2020 Introduction to Data Mining, 2nd Edition 84
Tan, Steinbach, Karpatne, Kumar
Curse of Dimensionality

When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies

Definitions of density and


distance between points,
which are critical for
clustering and outlier
detection, become less
meaningful

[Figure: randomly generate 500 points and compute the difference between the max and min distance between any pair of points, as dimensionality increases.]
09/14/2020 Introduction to Data Mining, 2nd Edition 85
Tan, Steinbach, Karpatne, Kumar
Dimensionality Reduction

Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise

Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques

09/14/2020 Introduction to Data Mining, 2nd Edition 86


Tan, Steinbach, Karpatne, Kumar
Dimensionality Reduction: PCA

Goal is to find a projection that captures the


largest amount of variation in data
[Figure: scatter of 2-D data (attributes x1 and x2) illustrating the direction of largest variation.]
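A minimal PCA sketch using NumPy's SVD; this is an illustration of the idea on random data, not code from the book:

    import numpy as np

    rng = np.random.default_rng(0)
    # correlated 2-D data: most variation lies along one direction
    x = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

    xc = x - x.mean(axis=0)            # center the data
    _, s, vt = np.linalg.svd(xc, full_matrices=False)
    print(vt[0])                       # first principal component (direction of largest variation)
    print((s**2) / (s**2).sum())       # fraction of variance captured by each component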
09/14/2020 Introduction to Data Mining, 2nd Edition 87
Tan, Steinbach, Karpatne, Kumar
Dimensionality Reduction: PCA

09/14/2020 Introduction to Data Mining, 2nd Edition 88


Tan, Steinbach, Karpatne, Kumar
Feature Subset Selection

Another way to reduce dimensionality of data


Redundant features
– Duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
Irrelevant features
– Contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Many techniques developed, especially for
classification
09/14/2020 Introduction to Data Mining, 2nd Edition 89
Tan, Steinbach, Karpatne, Kumar
Feature Creation

Create new attributes that can capture the


important information in a data set much more
efficiently than the original attributes

Three general methodologies:


– Feature extraction
◆ Example: extracting edges from images
– Feature construction
◆ Example: dividing mass by volume to get density
– Mapping data to new space
◆ Example: Fourier and wavelet analysis

09/14/2020 Introduction to Data Mining, 2nd Edition 90


Tan, Steinbach, Karpatne, Kumar
Mapping Data to a New Space

Fourier and wavelet transform

Frequency

Two Sine Waves + Noise Frequency

09/14/2020 Introduction to Data Mining, 2nd Edition 91


Tan, Steinbach, Karpatne, Kumar
Data Mining
Classification: Alternative Techniques

Imbalanced Class Problem

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar
Class Imbalance Problem

Lots of classification problems where the classes


are skewed (more records from one class than
another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly
line

02/03/2018 Introduction to Data Mining, 2nd Edition 2


Challenges

Evaluation measures such as accuracy are not well-suited for imbalanced classes

Detecting the rare class is like finding a needle in a haystack

02/03/2018 Introduction to Data Mining, 2nd Edition 3


Confusion Matrix

Confusion Matrix:

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

02/03/2018 Introduction to Data Mining, 2nd Edition 4


Accuracy

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

Most widely-used metric:

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
02/03/2018 Introduction to Data Mining, 2nd Edition 5
Problem with Accuracy

Consider a 2-class problem


– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

02/03/2018 Introduction to Data Mining, 2nd Edition 6


Problem with Accuracy

Consider a 2-class problem


– Number of Class NO examples = 990
– Number of Class YES examples = 10

If a model predicts everything to be class NO,


accuracy is 990/1000 = 99 %
– This is misleading because the model does
not detect any class YES example
– Detecting the rare class is usually more
interesting (e.g., frauds, intrusions, defects,
etc)

02/03/2018 Introduction to Data Mining, 2nd Edition 7


Alternative Measures

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

    Precision (p) = a / (a + c)
    Recall (r)    = a / (a + b)
    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
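A Python sketch of these measures from confusion-matrix counts (a, b, c, d as defined above; function name is mine):

    def measures(a, b, c, d):
        precision = a / (a + c)
        recall    = a / (a + b)
        f_measure = 2 * recall * precision / (recall + precision)
        accuracy  = (a + d) / (a + b + c + d)
        return precision, recall, f_measure, accuracy

    # the imbalanced example on the next slide: a=10, b=0, c=10, d=980
    print([round(v, 2) for v in measures(10, 0, 10, 980)])   # [0.5, 1.0, 0.67, 0.99]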
02/03/2018 Introduction to Data Mining, 2nd Edition 8
Alternative Measures
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    10          0
CLASS    Class=No     10          980

    Precision (p) = 10 / (10 + 10) = 0.5
    Recall (r)    = 10 / (10 + 0)  = 1
    F-measure (F) = 2 * 1 * 0.5 / (1 + 0.5) ≈ 0.67
    Accuracy      = 990 / 1000 = 0.99

02/03/2018 Introduction to Data Mining, 2nd Edition 9


Alternative Measures
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    10          0
CLASS    Class=No     10          980

    Precision (p) = 10 / (10 + 10) = 0.5
    Recall (r)    = 10 / (10 + 0)  = 1
    F-measure (F) = 2 * 1 * 0.5 / (1 + 0.5) ≈ 0.67
    Accuracy      = 990 / 1000 = 0.99

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    1           9
CLASS    Class=No     0           990

    Precision (p) = 1 / (1 + 0) = 1
    Recall (r)    = 1 / (1 + 9) = 0.1
    F-measure (F) = 2 * 0.1 * 1 / (1 + 0.1) ≈ 0.18
    Accuracy      = 991 / 1000 = 0.991
02/03/2018 Introduction to Data Mining, 2nd Edition 10
Alternative Measures

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    40          10
CLASS    Class=No     10          40

    Precision (p) = 0.8,  Recall (r) = 0.8,  F-measure (F) = 0.8,  Accuracy = 0.8

02/03/2018 Introduction to Data Mining, 2nd Edition 11


Alternative Measures

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    40          10
CLASS    Class=No     10          40

    Precision (p) = 0.8,  Recall (r) = 0.8,  F-measure (F) = 0.8,  Accuracy = 0.8

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    40          10
CLASS    Class=No     1000        4000

    Precision (p) ≈ 0.04,  Recall (r) = 0.8,  F-measure (F) ≈ 0.08,  Accuracy ≈ 0.8

02/03/2018 Introduction to Data Mining, 2nd Edition 12


Measures of Classification Performance

PREDICTED CLASS
Yes No
ACTUAL
Yes TP FN
CLASS
No FP TN

α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).

β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).

02/03/2018 Introduction to Data Mining, 2nd Edition 13


Alternative Measures

PREDICTED CLASS Precision (p) = 0.8


TPR = Recall (r) = 0.8
Class=Yes Class=No
FPR = 0.2
Class=Yes 40 10 F - measure (F) = 0.8
ACTUAL
CLASS Class=No 10 40 Accuracy = 0.8

PREDICTED CLASS
Precision (p) =~ 0.04
Class=Yes Class=No
TPR = Recall (r) = 0.8
ACTUAL
Class=Yes 40 10 FPR = 0.2
CLASS Class=No 1000 4000 F - measure (F) =~ 0.08
Accuracy =~ 0.8

02/03/2018 Introduction to Data Mining, 2nd Edition 14


Alternative Measures

PREDICTED CLASS
Class=Yes Class=No
Precision (p) = 0.5
Class=Yes 10 40
TPR = Recall (r) = 0.2
ACTUAL
Class=No 10 40
FPR = 0.2
CLASS

PREDICTED CLASS
Precision (p) = 0.5
Class=Yes Class=No
TPR = Recall (r) = 0.5
Class=Yes 25 25
ACTUAL FPR = 0.5
Class=No 25 25
CLASS

PREDICTED CLASS Precision (p) = 0.5


Class=Yes Class=No
TPR = Recall (r) = 0.8
Class=Yes 40 10
ACTUAL FPR = 0.8
CLASS Class=No 40 10

02/03/2018 Introduction to Data Mining, 2nd Edition 15


ROC (Receiver Operating Characteristic)

A graphical approach for displaying trade-off


between detection rate and false alarm rate
Developed in 1950s for signal detection theory to
analyze noisy signals
ROC curve plots TPR against FPR
– Performance of a model represented as a
point in an ROC curve
– Changing the threshold parameter of classifier
changes the location of the point

02/03/2018 Introduction to Data Mining, 2nd Edition 16


ROC Curve

(TPR,FPR):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line:
◆ prediction is opposite
of the true class

02/03/2018 Introduction to Data Mining, 2nd Edition 17


ROC (Receiver Operating Characteristic)

To draw ROC curve, classifier must produce


continuous-valued output
– Outputs are used to rank test records, from the most
likely positive class record to the least likely positive
class record

Many classifiers produce only discrete outputs (i.e.,


predicted class)
– How to get continuous-valued outputs?
◆ Decision trees, rule-based classifiers, neural networks,
Bayesian classifiers, k-nearest neighbors, SVM

02/03/2018 Introduction to Data Mining, 2nd Edition 18


Example: Decision Trees
[Figure: a decision tree with splits on attributes x1 and x2 (e.g., x2 < 12.63, x1 < 13.29, x2 < 17.35, x1 < 6.56, x1 < 2.15, x1 < 7.24, x2 < 8.64, x2 < 1.38, x1 < 12.11, x1 < 18.88). To obtain continuous-valued outputs, each leaf is annotated with a score (0.059, 0.220, 0.107, 0.071, 0.727, 0.164, 0.143, 0.669, 0.271, 0.654, 0) rather than a hard class label.]

02/03/2018 Introduction to Data Mining, 2nd Edition 19


ROC Curve Example

[Figure: the same decision tree with continuous-valued leaf scores (0.059, 0.220, 0.107, 0.071, 0.727, 0.164, 0.143, 0.669, 0.271, 0.654, 0), from which an ROC curve can be constructed.]

02/03/2018 Introduction to Data Mining, 2nd Edition 20


ROC Curve Example
- 1-dimensional data set containing 2 classes (positive and negative)
- Any points located at x > t is classified as positive

At threshold t:
TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88
02/03/2018 Introduction to Data Mining, 2nd Edition 21
Using ROC for Model Comparison

No model consistently outperforms the other
M1 is better for
small FPR
M2 is better for
large FPR

Area Under the ROC


curve
Ideal:
▪ Area =1
Random guess:
▪ Area = 0.5

02/03/2018 Introduction to Data Mining, 2nd Edition 22


How to Construct an ROC curve

• Use a classifier that produces a continuous-valued score for each instance
  – The more likely it is for the instance to be in the + class, the higher the score
• Sort the instances in decreasing order according to the score
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold
  – TPR = TP / (TP + FN)
  – FPR = FP / (FP + TN)

Instance  Score  True Class
1         0.95   +
2         0.93   +
3         0.87   -
4         0.85   -
5         0.85   -
6         0.85   +
7         0.76   -
8         0.53   +
9         0.43   -
10        0.25   +

02/03/2018 Introduction to Data Mining, 2nd Edition 23


How to construct an ROC curve
Class         +     -     +     -     -     -     +     -     +     +
Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP            5     4     4     3     3     3     3     2     2     1     0
FP            5     5     4     4     3     2     1     1     0     0     0
TN            0     0     1     1     2     3     4     4     5     5     5
FN            0     1     1     2     2     2     2     3     3     4     5
TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve:
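A Python sketch that reproduces the TPR/FPR rows from the scores and labels (standard library only; one threshold per unique score):

    scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

    P = labels.count('+')          # total positives
    N = labels.count('-')          # total negatives

    for t in sorted(set(scores)) + [1.00]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
        print(t, round(tp / P, 1), round(fp / N, 1))   # threshold, TPR, FPR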

02/03/2018 Introduction to Data Mining, 2nd Edition 24


Handling Class Imbalanced Problem

Class-based ordering (e.g. RIPPER)


– Rules for rare class have higher priority

Cost-sensitive classification
– Misclassifying rare class as majority class is
more expensive than misclassifying majority
as rare class

Sampling-based approaches

02/03/2018 Introduction to Data Mining, 2nd Edition 25


Cost Matrix

                      PREDICTED CLASS
                      Class=Yes      Class=No
ACTUAL   Class=Yes    f(Yes, Yes)    f(Yes, No)
CLASS    Class=No     f(No, Yes)     f(No, No)

C(i, j): cost of misclassifying a class i example as class j

Cost Matrix           PREDICTED CLASS
C(i, j)               Class=Yes      Class=No
ACTUAL   Class=Yes    C(Yes, Yes)    C(Yes, No)
CLASS    Class=No     C(No, Yes)     C(No, No)

    Cost = Σ_{i,j} C(i, j) × f(i, j)

02/03/2018 Introduction to Data Mining, 2nd Edition 26


Computing Cost of Classification

Cost Matrix           PREDICTED CLASS
C(i, j)               +      -
ACTUAL   +            -1     100
CLASS    -            1      0

Model M1              PREDICTED CLASS
                      +      -
ACTUAL   +            150    40
CLASS    -            60     250
Accuracy = 80%, Cost = 3910

Model M2              PREDICTED CLASS
                      +      -
ACTUAL   +            250    45
CLASS    -            5      200
Accuracy = 90%, Cost = 4255
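A quick Python check of these numbers (Cost = Σ C(i,j) · f(i,j); NumPy assumed):

    import numpy as np

    C  = np.array([[-1, 100],    # cost matrix: rows = actual (+, -), cols = predicted (+, -)
                   [ 1,   0]])
    M1 = np.array([[150,  40],
                   [ 60, 250]])
    M2 = np.array([[250,  45],
                   [  5, 200]])

    for name, f in (("M1", M1), ("M2", M2)):
        print(name, "accuracy =", np.trace(f) / f.sum(), "cost =", (C * f).sum())
    # M1: accuracy 0.8, cost 3910;  M2: accuracy 0.9, cost 4255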
02/03/2018 Introduction to Data Mining, 2nd Edition 27
Cost Sensitive Classification

Example: Bayesian classifer


– Given a test record x:
◆ Compute p(i|x) for each class i
◆ Decision rule: classify x as class k if

    k = argmax_i p(i | x)

– For 2-class, classify x as + if p(+|x) > p(-|x)


◆ This decision rule implicitly assumes that
C(+|+) = C(-|-) = 0 and C(+|-) = C(-|+)

02/03/2018 Introduction to Data Mining, 2nd Edition 28


Cost Sensitive Classification

General decision rule:
– Classify test record x as class k if

    k = argmin_j Σ_i p(i | x) C(i, j)

2-class:
– Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
– Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
– Decision rule: classify x as + if Cost(+) < Cost(-)
  ◆ if C(+,+) = C(-,-) = 0, this reduces to: classify x as + if

    p(+ | x) > C(-,+) / ( C(-,+) + C(+,-) )
02/03/2018 Introduction to Data Mining, 2nd Edition 29
Sampling-based Approaches

Modify the distribution of training data so that rare


class is well-represented in training set
– Undersample the majority class
– Oversample the rare class

Advantages and disadvantages

02/03/2018 Introduction to Data Mining, 2nd Edition 30


Data Mining
Classification: Alternative Techniques

Lecture Notes for Chapter 4

Instance-Based Learning

Introduction to Data Mining , 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar
Nearest Neighbor Classifiers

Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck

[Figure: to classify a test record, compute its distance to the training records and choose the k "nearest" records.]
2/12/2020 Introduction to Data Mining, 2nd Edition 2


Nearest-Neighbor Classifiers
Requires three things
– The set of labeled records
– Distance metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

To classify an unknown record:


– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)

2/12/2020 Introduction to Data Mining, 2nd Edition 3


Nearest Neighbor Classification

Compute proximity between two points:


– Example: Euclidean distance

    d(x, y) = sqrt( Σ_i (x_i − y_i)² )

Determine the class from nearest neighbor list


– Take the majority vote of class labels among
the k-nearest neighbors
– Weight the vote according to distance
◆ weight factor, w = 1 / d²
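A minimal sketch of a k-nearest-neighbor vote with 1/d² weighting (NumPy assumed; names and data are mine, not the book's code):

    import numpy as np
    from collections import defaultdict

    def knn_predict(X_train, y_train, x, k=3):
        # Euclidean distance from x to every training record
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]                        # indices of the k nearest records
        votes = defaultdict(float)
        for i in nearest:
            votes[y_train[i]] += 1.0 / (d[i] ** 2 + 1e-12) # distance-weighted vote
        return max(votes, key=votes.get)

    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
    y_train = ['A', 'A', 'B', 'B']
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))   # 'A'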
2/12/2020 Introduction to Data Mining, 2nd Edition 4
Nearest Neighbor Classification…

Choosing the value of k:


– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes

2/12/2020 Introduction to Data Mining, 2nd Edition 5


Nearest Neighbor Classification…

Choice of proximity measure matters


– For documents, cosine is better than correlation or
Euclidean

111111111110 000000000001
vs
011111111111 100000000000

Euclidean distance = 1.4142 for both pairs

2/12/2020 Introduction to Data Mining, 2nd Edition 6


Nearest Neighbor Classification…

Data preprocessing is often required


– Attributes may have to be scaled to prevent distance
measures from being dominated by one of the
attributes
◆Example:

– height of a person may vary from 1.5m to 1.8m


– weight of a person may vary from 90lb to 300lb
– income of a person may vary from $10K to $1M

– Time series are often standardized to have zero mean and a standard deviation of 1

2/12/2020 Introduction to Data Mining, 2nd Edition 7


Nearest-neighbor classifiers

Nearest neighbor
classifiers are local
classifiers

They can produce 1-nn decision boundary is


decision boundaries of a Voronoi Diagram
arbitrary shapes.

2/12/2020 Introduction to Data Mining, 2nd Edition 8


Nearest Neighbor Classification…

How to handle missing values in training and


test sets?
– Proximity computations normally require the
presence of all attributes
– Some approaches use the subset of attributes
present in two instances
◆ This may not produce good results since it
effectively uses different proximity measures for
each pair of instances
◆ Thus, proximities are not comparable

2/12/2020 Introduction to Data Mining, 2nd Edition 9


Nearest Neighbor Classification…

Handling irrelevant and redundant attributes


– Irrelevant attributes add noise to the proximity
measure
– Redundant attributes bias the proximity
measure towards certain attributes
– Can use variable selection or dimensionality
reduction to address irrelevant and redundant
attributes

2/12/2020 Introduction to Data Mining, 2nd Edition 10


Improving KNN Efficiency

Avoid having to compute distance to all objects in


the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
Condensing
– Determine a smaller set of objects that give
the same performance
Editing
– Remove objects to improve efficiency
2/12/2020 Introduction to Data Mining, 2nd Edition 11
Data Mining
Classification: Alternative Techniques

Bayesian Classifiers

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar
Bayes Classifier

• A probabilistic framework for solving classification problems

• Conditional Probability:

    P(Y | X) = P(X, Y) / P(X)
    P(X | Y) = P(X, Y) / P(Y)

• Bayes theorem:

    P(Y | X) = P(X | Y) P(Y) / P(X)

02/10/2020 Introduction to Data Mining, 2nd Edition 2


Using Bayes Theorem for Classification

• Consider each attribute and class label as random variables

• Given a record with attributes (X1, X2, …, Xd)
  – Goal is to predict class Y
  – Specifically, we want to find the value of Y that maximizes P(Y | X1, X2, …, Xd)

• Can we estimate P(Y | X1, X2, …, Xd) directly from data?

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

02/10/2020 Introduction to Data Mining, 2nd Edition 3


Example Data
Given a Test Record:

    X = (Refund = No, Divorced, Income = 120K)

• Can we estimate P(Evade = Yes | X) and P(Evade = No | X)?

In the following we will replace Evade = Yes by Yes, and Evade = No by No.

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

02/10/2020 Introduction to Data Mining, 2nd Edition 4


Using Bayes Theorem for Classification

• Approach:
  – compute the posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem

    P(Y | X1, X2, …, Xd) = P(X1, X2, …, Xd | Y) P(Y) / P(X1, X2, …, Xd)

  – Maximum a-posteriori: choose the Y that maximizes P(Y | X1, X2, …, Xd)

  – Equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y)

• How to estimate P(X1, X2, …, Xd | Y)?

02/10/2020 Introduction to Data Mining, 2nd Edition 5
Example Data
Given a Test Record:

    X = (Refund = No, Divorced, Income = 120K)

(Training data as in the table above.)

02/10/2020 Introduction to Data Mining, 2nd Edition 6


Naïve Bayes Classifier

• Assume independence among attributes Xi when the class is given:
  – P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj)

  – Now we can estimate P(Xi | Yj) for all Xi and Yj combinations from the training data

  – A new point is classified to Yj if P(Yj) Π P(Xi | Yj) is maximal.

02/10/2020 Introduction to Data Mining, 2nd Edition 7


Conditional Independence

• X and Y are conditionally independent given Z if


P(X|YZ) = P(X|Z)

• Example: Arm length and reading skills


– Young child has shorter arm length and
limited reading skills, compared to adults
– If age is fixed, no apparent relationship
between arm length and reading skills
– Arm length and reading skills are conditionally
independent given age

02/10/2020 Introduction to Data Mining, 2nd Edition 8


Naïve Bayes on Example Data
Given a Test Record:

    X = (Refund = No, Divorced, Income = 120K)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)

P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)

02/10/2020 Introduction to Data Mining, 2nd Edition 9


Estimate Probabilities from Data
• P(y) = fraction of instances of class y
  – e.g., P(No) = 7/10, P(Yes) = 3/10

• For categorical attributes:

    P(Xi = c | y) = n_c / n

  – where n_c is the number of instances having attribute value Xi = c and belonging to class y, and n is the number of instances of class y
  – Examples:
    P(Status = Married | No) = 4/7
    P(Refund = Yes | Yes) = 0

02/10/2020 Introduction to Data Mining, 2nd Edition 10


Estimate Probabilities from Data

• For continuous attributes:


– Discretization: Partition the range into bins:
◆ Replace continuous value with bin value
– Attribute changed from continuous to ordinal

– Probability density estimation:


◆ Assume attribute follows a normal distribution
◆ Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
◆ Once probability distribution is known, use it to
estimate the conditional probability P(Xi|Y)

02/10/2020 Introduction to Data Mining, 2nd Edition 11


Estimate Probabilities from Data

• Normal distribution:

    P(Xi | Yj) = 1 / sqrt(2π σij²) · exp( −(Xi − μij)² / (2 σij²) )

  – One for each (Xi, Yj) pair

• For (Income, Class = No):
  – If Class = No
    ◆ sample mean = 110
    ◆ sample variance = 2975

    P(Income = 120 | No) = 1 / (sqrt(2π) · 54.54) · exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
02/10/2020 Introduction to Data Mining, 2nd Edition 12
Example of Naïve Bayes Classifier
Given a Test Record:

X = (Refund = No, Divorced, Income = 120K)


Naïve Bayes Classifier (from the training data):

P(Refund = Yes | No) = 3/7        P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0         P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0

For Taxable Income:
If class = No:  sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90,  sample variance = 25

• P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)
            = 4/7 × 1/7 × 0.0072 = 0.0006

• P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)
             = 1 × 1/3 × 1.2 × 10⁻⁹ = 4 × 10⁻¹⁰

Since P(X | No) P(No) > P(X | Yes) P(Yes), therefore P(No | X) > P(Yes | X)
=> Class = No

02/10/2020 Introduction to Data Mining, 2nd Edition 13
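The following Python sketch (again, not from the slides) puts the pieces of this example together: the priors, the categorical conditional probabilities, and the Gaussian estimate for Taxable Income. The likelihoods match the 0.0006 and 4 x 10^-10 above before being multiplied by the priors.

from math import exp, pi, sqrt

def gaussian(x, mean, variance):
    return exp(-(x - mean) ** 2 / (2 * variance)) / sqrt(2 * pi * variance)

priors        = {"No": 7/10, "Yes": 3/10}
p_refund_no   = {"No": 4/7,  "Yes": 1.0}                # P(Refund = No | class)
p_divorced    = {"No": 1/7,  "Yes": 1/3}                # P(Divorced | class)
income_params = {"No": (110, 2975), "Yes": (90, 25)}    # (mean, variance) for Taxable Income

scores = {}
for c in ("No", "Yes"):
    mean, var = income_params[c]
    likelihood = p_refund_no[c] * p_divorced[c] * gaussian(120, mean, var)  # P(X | class)
    scores[c] = likelihood * priors[c]                                      # proportional to P(class | X)

print(scores)                        # "No" wins by several orders of magnitude
print(max(scores, key=scores.get))   # => Class = No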


Naïve Bayes Classifier can make decisions with partial information about the attributes in the test record.

Even in the absence of information about any attribute, we can use the a priori probabilities of the class variable:
  P(Yes) = 3/10, P(No) = 7/10

Naïve Bayes Classifier (as before):
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No:  sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90,  sample variance = 25

If we only know that Marital Status = Divorced, then:
  P(Yes | Divorced) = 1/3 x 3/10 / P(Divorced)
  P(No | Divorced)  = 1/7 x 7/10 / P(Divorced)

If we also know that Refund = No, then:
  P(Yes | Refund = No, Divorced) = 1 x 1/3 x 3/10 / P(Divorced, Refund = No)
  P(No | Refund = No, Divorced)  = 4/7 x 1/7 x 7/10 / P(Divorced, Refund = No)

If we also know that Taxable Income = 120, then:
  P(Yes | Refund = No, Divorced, Income = 120) = 1.2 x 10^-9 x 1 x 1/3 x 3/10 / P(Divorced, Refund = No, Income = 120)
  P(No | Refund = No, Divorced, Income = 120)  = 0.0072 x 4/7 x 1/7 x 7/10 / P(Divorced, Refund = No, Income = 120)
02/10/2020 Introduction to Data Mining, 2nd Edition 14
Example of Naïve Bayes Classifier
Given a Test Record:

X = (Refund = No, Divorced, Income = 120K)


Naïve Bayes Classifier (as before):
P(Yes) = 3/10, P(No) = 7/10
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No:  sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90,  sample variance = 25

P(Yes | Divorced) = 1/3 x 3/10 / P(Divorced)
P(No | Divorced)  = 1/7 x 7/10 / P(Divorced)

P(Yes | Refund = No, Divorced) = 1 x 1/3 x 3/10 / P(Divorced, Refund = No)
P(No | Refund = No, Divorced)  = 4/7 x 1/7 x 7/10 / P(Divorced, Refund = No)

02/10/2020 Introduction to Data Mining, 2nd Edition 15


Issues with Naïve Bayes Classifier

Naïve Bayes Classifier (as before):
P(Yes) = 3/10, P(No) = 7/10
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No:  sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90,  sample variance = 25

P(Yes | Married) = 0 x 3/10 / P(Married)
P(No | Married)  = 4/7 x 7/10 / P(Married)

02/10/2020 Introduction to Data Mining, 2nd Edition 16


Issues with Naïve Bayes Classifier
Consider the training set with the record Tid = 7 deleted:

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No    (deleted)
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Naïve Bayes Classifier:
P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3
For Taxable Income:
If class = No:  sample mean = 91, sample variance = 685
If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K):
P(X | No)  = 2/6 × 0 × 0.0083 = 0
P(X | Yes) = 0 × 1/3 × 1.2 × 10^-9 = 0
Naïve Bayes will not be able to classify X as Yes or No!

02/10/2020 Introduction to Data Mining, 2nd Edition 17


Issues with Naïve Bayes Classifier

• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Need to use other estimates of conditional probabilities than simple fractions
• Probability estimation:

  original:          P(Xi = c | y) = nc / n
  Laplace estimate:  P(Xi = c | y) = (nc + 1) / (n + v)
  m-estimate:        P(Xi = c | y) = (nc + m p) / (n + m)

  where
  n:  number of training instances belonging to class y
  nc: number of instances with Xi = c and Y = y
  v:  total number of attribute values that Xi can take
  p:  initial estimate of P(Xi = c | y), known a priori
  m:  hyper-parameter for our confidence in p

02/10/2020 Introduction to Data Mining, 2nd Edition 18
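A minimal sketch of the three estimators above, applied to the zero-probability case from the previous slides (Marital Status = Married with class = Yes: nc = 0, n = 3 Yes instances, v = 3 possible status values). The values m = 3 and p = 1/3 are illustrative choices, not taken from the slides.

def original(nc, n):
    return nc / n

def laplace(nc, n, v):
    return (nc + 1) / (n + v)

def m_estimate(nc, n, m, p):
    return (nc + m * p) / (n + m)

nc, n, v = 0, 3, 3                     # Marital Status = Married, class = Yes
print(original(nc, n))                 # 0.0 -> wipes out the entire product
print(laplace(nc, n, v))               # 1/6
print(m_estimate(nc, n, m=3, p=1/3))   # 1/6 (uniform prior p = 1/v, confidence m = 3)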


Example of Naïve Bayes Classifier

A: attributes, M: mammals, N: non-mammals

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record A:  Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

P(A | M) P(M) > P(A | N) P(N)  => Mammals

02/10/2020 Introduction to Data Mining, 2nd Edition 19


Naïve Bayes (Summary)

• Robust to isolated noise points

• Handle missing values by ignoring the instance


during probability estimate calculations

• Robust to irrelevant attributes

• Redundant and correlated attributes will violate


class conditional assumption
– Use other techniques such as Bayesian Belief Networks (BBN)

02/10/2020 Introduction to Data Mining, 2nd Edition 20


Bayesian Belief Networks

• Provides graphical representation of probabilistic


relationships among a set of random variables
• Consists of:
– A directed acyclic graph (dag) A B
◆ Node corresponds to a variable
◆ Arc corresponds to dependence
C
relationship between a pair of variables

– A probability table associating each node to its


immediate parent

02/10/2020 Introduction to Data Mining, 2nd Edition 22


Conditional Independence

D
D is parent of C
A is child of C
C
B is descendant of D
D is ancestor of A

A B

• A node in a Bayesian network is conditionally


independent of all of its nondescendants, if its
parents are known
02/10/2020 Introduction to Data Mining, 2nd Edition 23
Conditional Independence

• Naïve Bayes assumption:

X1 X2 X3 X4 ... Xd

02/10/2020 Introduction to Data Mining, 2nd Edition 24


Probability Tables

• If X does not have any parents, table contains


prior probability P(X)
Y

• If X has only one parent (Y), table contains


conditional probability P(X|Y) X

• If X has multiple parents (Y1, Y2,…, Yk), table


contains conditional probability P(X|Y1, Y2,…, Yk)

02/10/2020 Introduction to Data Mining, 2nd Edition 25


Example of Bayesian Belief Network

Network structure: Exercise → Heart Disease ← Diet;  Heart Disease → Chest Pain;  Heart Disease → Blood Pressure

P(Exercise = Yes) = 0.7      P(Diet = Healthy) = 0.25
P(Exercise = No)  = 0.3      P(Diet = Unhealthy) = 0.75

P(Heart Disease | Exercise, Diet):
            E=Yes, D=Healthy   E=Yes, D=Unhealthy   E=No, D=Healthy   E=No, D=Unhealthy
  HD=Yes    0.25               0.45                 0.55              0.75
  HD=No     0.75               0.55                 0.45              0.25

P(Chest Pain | HD):              P(Blood Pressure | HD):
            HD=Yes   HD=No                  HD=Yes   HD=No
  CP=Yes    0.8      0.01         BP=High   0.85     0.2
  CP=No     0.2      0.99         BP=Low    0.15     0.8

02/10/2020 Introduction to Data Mining, 2nd Edition 26


Example of Inferencing using BBN

• Given: X = (E=No, D=Yes, CP=Yes, BP=High)


– Compute P(HD|E,D,CP,BP)?
• P(HD=Yes| E=No,D=Yes) = 0.55
P(CP=Yes| HD=Yes) = 0.8
P(BP=High| HD=Yes) = 0.85
– P(HD=Yes | E=No, D=Yes, CP=Yes, BP=High)
  ∝ 0.55 × 0.8 × 0.85 = 0.374
• P(HD=No | E=No, D=Yes) = 0.45
  P(CP=Yes | HD=No) = 0.01
  P(BP=High | HD=No) = 0.2
– P(HD=No | E=No, D=Yes, CP=Yes, BP=High)
  ∝ 0.45 × 0.01 × 0.2 = 0.0009

=> Classify X as HD = Yes

02/10/2020 Introduction to Data Mining, 2nd Edition 27
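A small Python sketch (not from the slides) of this inference. Since Exercise and Diet are the parents of Heart Disease, and Chest Pain and Blood Pressure are its children, P(HD | E, D, CP, BP) is proportional to P(HD | E, D) · P(CP | HD) · P(BP | HD). Only the CPT entries needed for the test record X (E=No, D=Yes read as Diet=Healthy, CP=Yes, BP=High) are included below.

p_hd_given_e_no_d_healthy = {"Yes": 0.55, "No": 0.45}   # P(HD | E=No, D=Healthy)
p_cp_yes_given_hd         = {"Yes": 0.80, "No": 0.01}   # P(CP=Yes | HD)
p_bp_high_given_hd        = {"Yes": 0.85, "No": 0.20}   # P(BP=High | HD)

scores = {hd: p_hd_given_e_no_d_healthy[hd]
              * p_cp_yes_given_hd[hd]
              * p_bp_high_given_hd[hd]
          for hd in ("Yes", "No")}

print(scores)                        # Yes ≈ 0.374, No ≈ 0.0009
print(max(scores, key=scores.get))   # classify X as HD = Yes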


Data Mining
Classification: Alternative Techniques

Lecture Notes for Chapter 4

Rule-Based

Introduction to Data Mining , 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar
Rule-Based Classifier

Classify records by using a collection of


“if…then…” rules
Rule: (Condition) → y
– where
◆ Condition is a conjunction of tests on attributes
◆ y is the class label
– Examples of classification rules:
◆ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
◆ (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

2/12/2020 Introduction to Data Mining, 2nd Edition 2


Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
2/12/2020 Introduction to Data Mining, 2nd Edition 3
Application of Rule-Based Classifier

A rule r covers an instance x if the attributes of


the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

The rule R1 covers a hawk => Bird


The rule R3 covers the grizzly bear => Mammal

2/12/2020 Introduction to Data Mining, 2nd Edition 4


Rule Coverage and Accuracy
Coverage of a rule:
– Fraction of records that satisfy the antecedent of the rule

Accuracy of a rule:
– Fraction of records that satisfy the antecedent that also satisfy the consequent of the rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%

2/12/2020 Introduction to Data Mining, 2nd Edition 5
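A quick Python sketch (not from the slides) that computes the coverage and accuracy of the rule (Status=Single) → No on the 10-record training set above.

records = [   # (Refund, Marital Status, Taxable Income, Class)
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

covered = [r for r in records if r[1] == "Single"]   # records satisfying the antecedent
correct = [r for r in covered if r[3] == "No"]       # ... that also satisfy the consequent

print(len(covered) / len(records))   # coverage = 4/10 = 40%
print(len(correct) / len(covered))   # accuracy = 2/4  = 50%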


How does Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules

2/12/2020 Introduction to Data Mining, 2nd Edition 6


Characteristics of Rule Sets: Strategy 1

Mutually exclusive rules


– Classifier contains mutually exclusive rules if
the rules are independent of each other
– Every record is covered by at most one rule

Exhaustive rules
– Classifier has exhaustive coverage if it
accounts for every possible combination of
attribute values
– Each record is covered by at least one rule
2/12/2020 Introduction to Data Mining, 2nd Edition 7
Characteristics of Rule Sets: Strategy 2

Rules are not mutually exclusive


– A record may trigger more than one rule
– Solution?
◆ Ordered rule set
◆ Unordered rule set – use voting schemes

Rules are not exhaustive


– A record may not trigger any rules
– Solution?
◆ Use a default class
2/12/2020 Introduction to Data Mining, 2nd Edition 8
Ordered Rule Set

Rules are rank ordered according to their priority


– An ordered rule set is known as a decision list
When a test record is presented to the classifier
– It is assigned to the class label of the highest ranked rule it has
triggered
– If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
2/12/2020 Introduction to Data Mining, 2nd Edition 9
Rule Ordering Schemes

Rule-based ordering
– Individual rules are ranked based on their quality
Class-based ordering
– Rules that belong to the same class appear together

Rule-based Ordering:
  (Refund=Yes) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
  (Refund=No, Marital Status={Married}) ==> No

Class-based Ordering:
  (Refund=Yes) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
  (Refund=No, Marital Status={Married}) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

2/12/2020 Introduction to Data Mining, 2nd Edition 10


Building Classification Rules

Direct Method:
◆ Extract rules directly from data
◆ Examples: RIPPER, CN2, Holte’s 1R

Indirect Method:
◆ Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
◆ Examples: C4.5rules

2/12/2020 Introduction to Data Mining, 2nd Edition 11


Direct Method: Sequential Covering

1. Start from an empty rule


2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion
is met

2/12/2020 Introduction to Data Mining, 2nd Edition 12


Example of Sequential Covering

(i) Original Data (ii) Step 1

2/12/2020 Introduction to Data Mining, 2nd Edition 13


Example of Sequential Covering…

R1 R1

R2

(iii) Step 2 (iv) Step 3

2/12/2020 Introduction to Data Mining, 2nd Edition 14


Rule Growing

Two common strategies

[Figure: (a) General-to-specific rule growing: start from the empty rule {} => Class=Yes (covering Yes: 3, No: 4) and add one conjunct at a time (e.g. Refund=No, Status=Single, Status=Divorced, Status=Married, ..., Income>80K), evaluating each extension by the class counts of the records it covers.
(b) Specific-to-general rule growing: start from specific rules built from positive examples, e.g. (Refund=No, Status=Single, Income=85K) => Class=Yes and (Refund=No, Status=Single, Income=90K) => Class=Yes, and generalize them, e.g. to (Refund=No, Status=Single) => Class=Yes.]

2/12/2020 Introduction to Data Mining, 2nd Edition 15


Rule Evaluation
Foil's Information Gain
(FOIL: First Order Inductive Learner, an early rule-based learning algorithm)

– R0: {} => class (initial rule)
– R1: {A} => class (rule after adding a conjunct)

– Gain(R0, R1) = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ]

– p0: number of positive instances covered by R0
  n0: number of negative instances covered by R0
  p1: number of positive instances covered by R1
  n1: number of negative instances covered by R1

2/12/2020 Introduction to Data Mining, 2nd Edition 16
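A short sketch of Foil's information gain as defined above. The example call uses counts of the kind shown in the rule-growing figure: the initial rule covers 3 positive and 4 negative records, and the candidate extension covers 2 positives and 1 negative (which attribute test produces those counts is not important here).

from math import log2

def foil_gain(p0, n0, p1, n1):
    # gain of extending R0 (covering p0 pos / n0 neg) to R1 (covering p1 pos / n1 neg)
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

print(foil_gain(p0=3, n0=4, p1=2, n1=1))   # ~1.27 bits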


Rule Evaluation

2/12/2020 Introduction to Data Mining, 2nd Edition 17


Rule Evaluation

2/12/2020 Introduction to Data Mining, 2nd Edition 18


Direct Method: RIPPER

Building a Rule Set:


– Use sequential covering algorithm
◆ Finds the best rule that covers the current set of
positive examples
◆ Eliminate both positive and negative examples
covered by the rule
– Each time a rule is added to the rule set,
compute the new description length
◆ Stop adding new rules when the new description
length is d bits longer than the smallest description
length obtained so far

2/12/2020 Introduction to Data Mining, 2nd Edition 21


Indirect Methods

[Figure: decision tree with root P; P=No leads to a test on Q, P=Yes leads to a test on R, and R=Yes leads to a further test on Q. The tree is converted into the rule set below.]

Rule Set
r1: (P=No, Q=No) ==> -
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> -
r5: (P=Yes, R=Yes, Q=Yes) ==> +

2/12/2020 Introduction to Data Mining, 2nd Edition 23


Advantages of Rule-Based Classifiers

Has characteristics quite similar to decision trees


– As highly expressive as decision trees
– Easy to interpret
– Performance comparable to decision trees
– Can handle redundant attributes

Better suited for handling imbalanced classes

Harder to handle missing values in the test set

2/12/2020 Introduction to Data Mining, 2nd Edition 29




Data Mining

Chapter 5
Association Analysis: Basic Concepts

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

02/14/2018 Introduction to Data Mining, 2nd Edition 1


Association Rule Mining

Given a set of transactions, find rules that will predict the


occurrence of an item based on the occurrences of other
items in the transaction

Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

02/14/2018 Introduction to Data Mining, 2nd Edition 2


Definition: Frequent Itemset
Itemset
– A collection of one or more items
  ◆ Example: {Milk, Bread, Diaper}
– k-itemset
  ◆ An itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
02/14/2018 Introduction to Data Mining, 2nd Edition 3
Definition: Association Rule
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Rule Evaluation Metrics
– Support (s)
  ◆ Fraction of transactions that contain both X and Y
– Confidence (c)
  ◆ Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
02/14/2018 Introduction to Data Mining, 2nd Edition 4
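A minimal Python sketch (not from the slides) that computes the support and confidence of {Milk, Diaper} → {Beer} on the five market-basket transactions above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    # support count: number of transactions containing the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(sigma(X | Y) / len(transactions))   # support    = 2/5 = 0.4
print(sigma(X | Y) / sigma(X))            # confidence = 2/3 ≈ 0.67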
Association Rule Mining Task

Given a set of transactions T, the goal of


association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
02/14/2018 Introduction to Data Mining, 2nd Edition 5
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ]
    = 3^d − 2^(d+1) + 1

  If d = 6, R = 602 rules

02/14/2018 Introduction to Data Mining, 2nd Edition 6
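A quick numeric check (a sketch, not slide material) of the closed form R = 3^d − 2^(d+1) + 1 against the brute-force double sum; both give 602 for d = 6.

from math import comb   # Python 3.8+

def rule_count(d):
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(rule_count(d))            # 602
print(3**d - 2**(d + 1) + 1)    # 602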


Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
02/14/2018 Introduction to Data Mining, 2nd Edition 7
Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support  minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still


computationally expensive

02/14/2018 Introduction to Data Mining, 2nd Edition 8


Frequent Itemset Generation
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE


Given d items, there
are 2d possible
ABCDE candidate itemsets
02/14/2018 Introduction to Data Mining, 2nd Edition 9
Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
Transactions (N = number of transactions, w = maximum transaction width):
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

List of candidates (M = number of candidate itemsets)

– Match each transaction against every candidate


– Complexity ~ O(NMw) => Expensive since M = 2d !!!
02/14/2018 Introduction to Data Mining, 2nd Edition 10
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)


– Complete search: M=2d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)


– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every
transaction

02/14/2018 Introduction to Data Mining, 2nd Edition 11


Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent

Apriori principle holds due to the following property


of the support measure:

∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support

02/14/2018 Introduction to Data Mining, 2nd Edition 12


Illustrating Apriori Principle

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

Found to be
Infrequent
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

Pruned
ABCDE
supersets
02/14/2018 Introduction to Data Mining, 2nd Edition 13
Illustrating Apriori Principle

TID Items
Items (1-itemsets)
1 Bread, Milk
Item Count
2 Beer, Bread, Diaper, Eggs
Bread 4
3 Beer, Coke, Diaper, Milk Coke 2
4 Beer, Bread, Diaper, Milk Milk 4
Beer 3
5 Bread, Coke, Diaper, Milk Diaper 4
Eggs 1

Minimum Support = 3

If every subset is considered,


6C1 + 6C2 + 6C3
6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

02/14/2018 Introduction to Data Mining, 2nd Edition 14


Illustrating Apriori Principle

TID Items
Items (1-itemsets)
1 Bread, Milk
2 Beer, Bread, Diaper, Eggs Item Count
Bread 4
3 Beer, Coke, Diaper, Milk
Coke 2
4 Beer, Bread, Diaper, Milk Milk 4
5 Bread, Coke, Diaper, Milk Beer 3
Diaper 4
Eggs 1

Minimum Support = 3

If every subset is considered,


6C1 + 6C2 + 6C3
6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

02/14/2018 Introduction to Data Mining, 2nd Edition 15


Illustrating Apriori Principle

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Pairs (2-itemsets)
Beer 3 {Bread,Milk}
Diaper 4 {Bread, Beer } (No need to generate
Eggs 1 {Bread,Diaper}
{Beer, Milk}
candidates involving Coke
{Diaper, Milk} or Eggs)
{Beer,Diaper}

Minimum Support = 3

If every subset is considered,


6C1 + 6C2 + 6C3
6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

02/14/2018 Introduction to Data Mining, 2nd Edition 16


Illustrating Apriori Principle

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Beer, Bread} 2 (No need to generate
Eggs 1 {Bread,Diaper} 3 candidates involving Coke
{Beer,Milk} 2
{Diaper,Milk} 3 or Eggs)
{Beer,Diaper} 3
Minimum Support = 3

If every subset is considered,


6C1 + 6C2 + 6C3
6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

02/14/2018 Introduction to Data Mining, 2nd Edition 17


Illustrating Apriori Principle

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Bread,Beer} 2 (No need to generate
Eggs 1
{Bread,Diaper} 3 candidates involving Coke
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3
{Beer,Diaper} 3
Minimum Support = 3
Triplets (3-itemsets)
If every subset is considered, Itemset
6C1 + 6C2 + 6C3
{ Beer, Diaper, Milk}
6 + 15 + 20 = 41 { Beer,Bread,Diaper}
With support-based pruning, {Bread, Diaper, Milk}
6 + 6 + 4 = 16 { Beer, Bread, Milk}

02/14/2018 Introduction to Data Mining, 2nd Edition 18


Illustrating Apriori Principle

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Bread,Beer} 2 (No need to generate
Eggs 1
{Bread,Diaper} 3 candidates involving Coke
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3
{Beer,Diaper} 3
Minimum Support = 3
Triplets (3-itemsets)
If every subset is considered, Itemset Count
6C1 + 6C2 + 6C3
{ Beer, Diaper, Milk} 2
6 + 15 + 20 = 41 { Beer,Bread, Diaper} 2
With support-based pruning, {Bread, Diaper, Milk} 2
6 + 6 + 4 = 16 {Beer, Bread, Milk} 1
6 + 6 + 1 = 13

02/14/2018 Introduction to Data Mining, 2nd Edition 19


Apriori Algorithm

– Fk: frequent k-itemsets


– Lk: candidate k-itemsets
Algorithm
– Let k=1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty
◆ Candidate Generation: Generate Lk+1 from Fk
◆ Candidate Pruning: Prune candidate itemsets in Lk+1
containing subsets of length k that are infrequent
◆ Support Counting: Count the support of each candidate in
Lk+1 by scanning the DB
◆ Candidate Elimination: Eliminate candidates in Lk+1 that are
infrequent, leaving only those that are frequent => Fk+1

02/14/2018 Introduction to Data Mining, 2nd Edition 20
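The following compact Python sketch (an illustration, not the book's code) runs the loop above (F(k-1) x F(k-1) candidate generation, candidate pruning, support counting, candidate elimination) on the market-basket transactions of this chapter with minsup = 3; it finds the four frequent 1-itemsets and four frequent 2-itemsets shown on the earlier slides.

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    # number of transactions containing every item of the itemset
    return sum(1 for t in transactions if set(itemset) <= t)

def apriori(minsup):
    items = sorted({i for t in transactions for i in t})
    Fk = [(i,) for i in items if support_count((i,)) >= minsup]   # F1
    frequent, k = list(Fk), 1
    while Fk:
        # Candidate generation: merge frequent k-itemsets sharing their first k-1 items
        Lk1 = [tuple(sorted(set(a) | set(b)))
               for a, b in combinations(Fk, 2) if a[:k - 1] == b[:k - 1]]
        # Candidate pruning: drop candidates having an infrequent k-subset
        Lk1 = [c for c in Lk1 if all(s in Fk for s in combinations(c, k))]
        # Support counting and candidate elimination
        Fk = [c for c in Lk1 if support_count(c) >= minsup]
        frequent.extend(Fk)
        k += 1
    return frequent

for itemset in apriori(minsup=3):
    print(itemset, support_count(itemset))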


Candidate Generation: Brute-force method

02/14/2018 Introduction to Data Mining, 2nd Edition 21


Candidate Generation: Merge Fk-1 and F1 itemsets

02/14/2018 Introduction to Data Mining, 2nd Edition 22


Candidate Generation: Fk-1 x Fk-1 Method

02/14/2018 Introduction to Data Mining, 2nd Edition 23


Candidate Generation: Fk-1 x Fk-1 Method

Merge two frequent (k-1)-itemsets if their first (k-2) items


are identical

F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE

– Do not merge(ABD,ACD) because they share only


prefix of length 1 instead of length 2

02/14/2018 Introduction to Data Mining, 2nd Edition 24


Candidate Pruning

Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be
the set of frequent 3-itemsets

L4 = {ABCD,ABCE,ABDE} is the set of candidate


4-itemsets generated (from previous slide)

Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent

After candidate pruning: L4 = {ABCD}


02/14/2018 Introduction to Data Mining, 2nd Edition 25
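A small sketch (not the book's code) of the two steps above for this F3, with itemsets written as strings of item letters; it generates {ABCD, ABCE, ABDE} and prunes the set down to {ABCD}.

from itertools import combinations

F3 = ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]

# Candidate generation: merge two frequent 3-itemsets whose first k-2 = 2 items match
candidates = []
for a, b in combinations(F3, 2):
    if a[:2] == b[:2]:
        candidates.append("".join(sorted(set(a) | set(b))))
print("Generated:", candidates)   # ['ABCD', 'ABCE', 'ABDE']

# Candidate pruning: keep a candidate only if every 3-item subset is frequent
pruned = [c for c in candidates
          if all("".join(s) in F3 for s in combinations(c, 3))]
print("After pruning:", pruned)   # ['ABCD']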
Alternate Fk-1 x Fk-1 Method

Merge two frequent (k-1)-itemsets if the last (k-2) items of


the first one is identical to the first (k-2) items of the
second.

F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE

02/14/2018 Introduction to Data Mining, 2nd Edition 26


Candidate Pruning for Alternate Fk-1 x Fk-1 Method

Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be
the set of frequent 3-itemsets

L4 = {ABCD,ABDE,ACDE,BCDE} is the set of


candidate 4-itemsets generated (from previous
slide)
Candidate pruning
– Prune ABDE because ADE is infrequent
– Prune ACDE because ACE and ADE are infrequent
– Prune BCDE because BCE is infrequent
After candidate pruning: L4 = {ABCD}
02/14/2018 Introduction to Data Mining, 2nd Edition 27
Illustrating Apriori Principle

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Bread,Beer} 2 (No need to generate
Eggs 1
{Bread,Diaper} 3 candidates involving Coke
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3
{Beer,Diaper} 3
Minimum Support = 3
Triplets (3-itemsets)
If every subset is considered, Itemset Count
6C1 + 6C2 + 6C3
6 + 15 + 20 = 41 {Bread, Diaper, Milk} 2
With support-based pruning,
6 + 6 + 1 = 13 Use of Fk-1xFk-1 method for candidate generation results in
only one 3-itemset. This is eliminated after the support
counting step.

02/14/2018 Introduction to Data Mining, 2nd Edition 28


Support Counting of Candidate Itemsets

Scan the database of transactions to determine the


support of each candidate itemset
– Must match every candidate itemset against every transaction,
which is an expensive operation

TID  Items                           Candidate itemsets:
1    Bread, Milk                     {Beer, Diaper, Milk}
2    Beer, Bread, Diaper, Eggs       {Beer, Bread, Diaper}
3    Beer, Coke, Diaper, Milk        {Bread, Diaper, Milk}
4    Beer, Bread, Diaper, Milk       {Beer, Bread, Milk}
5    Bread, Coke, Diaper, Milk

02/14/2018 Introduction to Data Mining, 2nd Edition 29


Support Counting of Candidate Itemsets

To reduce number of comparisons, store the candidate


itemsets in a hash structure
– Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets

Transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

[Figure: each of the N transactions is hashed and matched only against the candidate itemsets stored in the corresponding buckets of a hash structure with k buckets.]

02/14/2018 Introduction to Data Mining, 2nd Edition 30


Support Counting: An Example
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

How many of these itemsets are supported by transaction (1,2,3,5,6)?

Transaction t = (1, 2, 3, 5, 6)

Systematically enumerate the 3-item subsets of t, level by level (Level 1: fix the first item, Level 2: fix the second, Level 3: the complete subsets):
  123, 125, 126, 135, 136, 156, 235, 236, 256, 356


02/14/2018 Introduction to Data Mining, 2nd Edition 31
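A quick Python check (a sketch) of the question above: enumerate the C(5,3) = 10 three-item subsets of t = (1, 2, 3, 5, 6) and test which of the 15 candidates appear among them.

from itertools import combinations

candidates = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (1, 3, 6), (2, 3, 4),
    (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
t = (1, 2, 3, 5, 6)

subsets = set(combinations(t, 3))                   # the 10 three-item subsets of t
supported = [c for c in candidates if c in subsets]
print(supported)   # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]  -> 3 itemsets are supported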
Support Counting Using a Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node (if number of
candidate itemsets exceeds max leaf size, split the node)

Hash function: branch on item value, with 1, 4, 7 to the left, 2, 5, 8 to the middle, and 3, 6, 9 to the right.

[Figure: candidate hash tree whose leaf nodes hold the 15 candidate 3-itemsets.]
02/14/2018 Introduction to Data Mining, 2nd Edition 32
Support Counting Using a Hash Tree

[Figure: the 15 candidate 3-itemsets stored in the leaves of the candidate hash tree; each level branches on the hash function (1,4,7 / 2,5,8 / 3,6,9).]

To count support for transaction (1 2 3 5 6), split it recursively using the hash function:
  Level 1: 1+ (2 3 5 6), 2+ (3 5 6), 3+ (5 6)
  Level 2: 12+ (3 5 6), 13+ (5 6), 15+ (6), ...
Each path leads to a leaf, and only the candidates stored in the visited leaves are compared against the transaction.

Match the transaction against 11 out of 15 candidates.
02/14/2018 Introduction to Data Mining, 2nd Edition 41
