Data Mining: Concepts and Techniques - Chapter 6
Target marketing
Medical diagnosis
Fraud detection
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model: the known label of each test sample is compared with the classified result from the model; the test set must be independent of the training set, otherwise over-fitting will occur
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
[Figure omitted: the two-step process. A classification algorithm is applied to the training data to build a classifier, which is then evaluated on testing data and applied to unseen data.]
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
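A minimal, illustrative Python sketch of this two-step usage (not from the slides): a hypothetical learned rule of the kind a classifier might produce is checked against the test tuples above and then applied to the unseen tuple (Jeff, Professor, 4).

```python
# Illustrative sketch of "model usage": evaluate a hypothetical classifier on
# the independent test set, then classify an unseen tuple.
def classify(rank, years):
    """Hypothetical learned rule: professors, or anyone with more than 6 years, are tenured."""
    return "yes" if rank == "Professor" or years > 6 else "no"

test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Estimate accuracy of the model on the test set.
correct = sum(classify(rank, years) == label for _, rank, years, label in test_data)
print("test accuracy:", correct / len(test_data))   # 0.75 for this toy rule

# If the accuracy is acceptable, classify unseen data.
print("Jeff, Professor, 4 ->", classify("Professor", 4))   # 'yes' under this rule
```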
Supervised vs. Unsupervised Learning
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Accuracy: classifier accuracy (predicting class labels) and predictor accuracy (guessing the value of predicted attributes)
Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
[Figure omitted: decision tree for buys_computer. The root tests age? with branches <=30, 31..40, and >40; the leaves predict yes or no.]
but Gini{medium,high} is 0.30 and is thus the best split since it is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
The reduction in impurity needs to be maximized
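As a concrete illustration of the Gini criterion, the Python sketch below computes the impurity of a label set and of a binary split; the class counts in the example correspond to splitting the 14-tuple buys_computer data (shown later) on income in {low, medium} vs. {high}.

```python
# Minimal sketch of the Gini-index split criterion for a two-way (CART-style) split.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini index of a binary split into subsets D1 (left) and D2 (right)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Splitting the buys_computer labels on income in {low, medium} vs. {high}:
d1 = ["yes"] * 7 + ["no"] * 3   # 10 tuples with income in {low, medium}
d2 = ["yes"] * 2 + ["no"] * 2   # 4 tuples with income = high
print(gini(d1 + d2))            # impurity before the split, about 0.459
print(gini_split(d1, d2))       # impurity after the split, about 0.443
```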
Derivation of Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X|Ci) = ∏_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
This greatly reduces the computation cost: Only counts
the class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = 1 / (√(2π) σ) · e^( −(x − μ)² / (2σ²) )
and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
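A minimal Python sketch of estimating P(xk|Ci) for a continuous attribute with the Gaussian density above; the age values used for class Ci are hypothetical.

```python
# Estimate P(x_k | C_i) for a continuous attribute via the Gaussian density,
# using the class-specific sample mean and standard deviation.
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def likelihood_continuous(x, class_values):
    """P(x | C_i) estimated from the attribute values observed in class C_i."""
    n = len(class_values)
    mu = sum(class_values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in class_values) / n)
    return gaussian(x, mu, sigma)

# Hypothetical ages of customers in class C_i (buys_computer = 'yes'):
ages_yes = [25, 35, 42, 45, 38, 33, 30, 29, 41]
print(likelihood_continuous(38, ages_yes))
```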
Naïve Bayesian Classifier: Training Dataset
Class labels: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
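For the data sample X above, the rest of the worked example can be reproduced by simple counting over the 14-tuple training table. A minimal Python sketch:

```python
# Compute P(X|Ci)P(Ci) for X = (age<=30, income=medium, student=yes, credit_rating=fair)
# by counting over the training table shown above.
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")   # attribute values of the data sample X

for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    prior = len(rows) / len(data)                 # P(Ci): 9/14 for 'yes', 5/14 for 'no'
    likelihood = 1.0
    for k, value in enumerate(x):                 # P(X|Ci) = prod_k P(x_k|Ci)
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    print(c, prior * likelihood)                  # about 0.028 for 'yes', 0.007 for 'no'
```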
Laplacian correction (adding 1 to each count) avoids the zero-probability problem, and the "corrected" probability estimates are close to their "uncorrected" counterparts
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Disadvantages
Assumption: class conditional independence, therefore
loss of accuracy
Practically, dependencies exist among variables; such dependencies cannot be modeled by a Naïve Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
Classification:
predicts categorical class labels
x1: # of occurrences of the word "homepage"
x2: # of occurrences of the word "welcome"
Mathematically:
x ∈ X = ℝⁿ, y ∈ Y = {+1, −1}
We want a function f: X → Y
[Figure omitted: a single neuron (perceptron). The input vector x = (x0, x1, ..., xn) is combined with the weight vector w = (w0, w1, ..., wn) in a weighted sum, which an activation function f maps to the output y.]
For example: y = sign( Σ_{i=0..n} wi xi − μk ), where μk is the bias (threshold) of the unit
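A minimal Python sketch of this single-unit computation; the inputs, weights, and bias are illustrative.

```python
# Single-unit computation: a weighted sum of the inputs minus a bias,
# passed through a sign activation.
def perceptron_output(x, w, mu_k):
    """y = sign(sum_i w_i * x_i - mu_k), with sign(.) mapped to {+1, -1}."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu_k
    return 1 if s >= 0 else -1

# Hypothetical 3-input example:
print(perceptron_output(x=[1.0, 0.5, -0.2], w=[0.4, 0.3, 0.9], mu_k=0.2))
```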
[Figure omitted: a multi-layer feed-forward network. The input vector X enters the input layer, weights wij connect it to the hidden layer, and the hidden layer connects to the output layer, which produces the output vector.]
Net input to unit j: Ij = Σ_i wij Oi + θj
Output of unit j (sigmoid): Oj = 1 / (1 + e^(−Ij))
Error at an output-layer unit: Errj = Oj (1 − Oj) (Tj − Oj)
Error at a hidden-layer unit: Errj = Oj (1 − Oj) Σ_k Errk wjk
Weight update: wij = wij + (l) Errj Oi
Bias update: θj = θj + (l) Errj
where l is the learning rate
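A minimal Python sketch of one backpropagation step on a tiny 2-2-1 network, following the formulas above; the initial weights, biases, learning rate, and training tuple are illustrative, not taken from the slides.

```python
# One backpropagation step on a network with 2 inputs, 2 hidden units, 1 output unit.
import math

def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

x = [1.0, 0.0]          # input vector X
t = 1.0                 # target value T for the single output unit
l = 0.9                 # learning rate

w_ih = [[0.2, -0.3], [0.4, 0.1]]   # w_ih[i][j]: input i -> hidden j
w_ho = [-0.3, -0.2]                # hidden j -> output
theta_h = [-0.4, 0.2]              # hidden-unit biases
theta_o = 0.1                      # output-unit bias

# Forward pass: I_j = sum_i w_ij O_i + theta_j ; O_j = sigmoid(I_j)
o_h = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(2)) + theta_h[j]) for j in range(2)]
o_out = sigmoid(sum(o_h[j] * w_ho[j] for j in range(2)) + theta_o)

# Backward pass: errors at the output unit and at the hidden units
err_out = o_out * (1 - o_out) * (t - o_out)
err_h = [o_h[j] * (1 - o_h[j]) * err_out * w_ho[j] for j in range(2)]

# Updates: w_ij += l * Err_j * O_i ; theta_j += l * Err_j
for j in range(2):
    w_ho[j] += l * err_out * o_h[j]
    theta_h[j] += l * err_h[j]
    for i in range(2):
        w_ih[i][j] += l * err_h[j] * x[i]
theta_o += l * err_out

print("output before update:", o_out)
```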
How Does a Multi-Layer Neural Network Work?
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple with associated class label yi
There are infinitely many lines (hyperplanes) separating the two classes, but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
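A minimal sketch of finding the MMH on a toy 2-D dataset, assuming scikit-learn is available; the data points and the large-C (near hard-margin) setting are illustrative.

```python
# Fit a linear SVM on a small, linearly separable 2-D dataset and inspect the
# maximum marginal hyperplane. Requires scikit-learn.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],      # class -1
     [4, 4], [5, 4], [4, 5]]      # class +1
y = [-1, -1, -1, +1, +1, +1]

clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])   # hyperplane w.x + b = 0
print("support vectors:", clf.support_vectors_)        # the tuples that define the margin
print("prediction for (3, 3):", clf.predict([[3, 3]]))
```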
SVM—Linearly Inseparable
Transform the original input data into a higher
dimensional space
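A minimal sketch of this idea: 1-D data that is not linearly separable becomes separable after mapping each point into a higher-dimensional space. The dataset and the simple polynomial map x -> (x, x²) are illustrative.

```python
# Mapping 1-D points to (x, x^2) makes the classes below linearly separable.
points = [(-2, +1), (-1, -1), (0, -1), (1, -1), (2, +1)]   # (x, label)

mapped = [((x, x * x), label) for x, label in points]
# In the (x, x^2) plane, the horizontal line x2 = 2.5 separates the two classes:
for (x1, x2), label in mapped:
    predicted = +1 if x2 > 2.5 else -1
    print((x1, x2), "true:", label, "predicted:", predicted)
```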
SVM can also be used for classifying multiple (> 2) classes and for
regression analysis (with additional user parameters)
SVM Website
http://www.kernel-machines.org/
Representative implementations
LIBSVM: an efficient implementation of SVM supporting multi-class classification, nu-SVM, and one-class SVM, with various interfaces (Java, Python, etc.)
SVM-light: simpler, but its performance is not better than LIBSVM; supports only binary classification and only the C language
SVM-torch: another implementation, also written in C
CMAR (Classification based on Multiple Association Rules; Li, Han, Pei, ICDM'01)
Efficiency: Uses an enhanced FP-tree that maintains the distribution of
class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
Prunes rules for which the rule antecedent and class are not
positively correlated, based on a χ2 test of statistical significance
Classification based on generated/pruned rules
If only one rule satisfies tuple X, assign the class label of the rule
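A minimal Python sketch of the generality/confidence pruning check described above; the rule representation and the example rules are illustrative, not CMAR's actual data structures.

```python
# Prune rule R2 whenever some R1 has a more general antecedent and conf(R1) >= conf(R2).
def more_general(r1, r2):
    """R1 is more general than R2 if R1's antecedent is a subset of R2's."""
    return set(r1[0]) <= set(r2[0])

def prune(rules):
    """Each rule is (antecedent_items, class_label, confidence)."""
    kept = []
    for r2 in rules:
        if not any(r1 is not r2 and more_general(r1, r2) and r1[2] >= r2[2] for r1 in rules):
            kept.append(r2)
    return kept

rules = [
    ({"student=yes"}, "buys_computer=yes", 0.80),
    ({"student=yes", "age<=30"}, "buys_computer=yes", 0.75),  # pruned: less general, lower confidence
]
print(prune(rules))
```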
Instance-based learning:
Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
Typical approaches:
k-nearest neighbor approach: instances represented as points in a Euclidean space
Locally weighted regression: constructs a local approximation
Case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of
Euclidean distance, dist(X1, X2)
Target function could be discrete- or real- valued
For discrete-valued, k-NN returns the most common
value among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
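A minimal Python sketch of k-NN classification with Euclidean distance and a majority vote among the k nearest training examples; the training points are illustrative.

```python
# k-NN: find the k training points closest to the query and take a majority vote.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(xq, training, k=3):
    """training: list of (point, label); returns the most common label among
    the k examples nearest to the query point xq."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [((1, 1), "-"), ((1, 2), "-"), ((2, 1), "-"),
            ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
print(knn_classify((2, 2), training, k=3))   # "-"
```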
[Figure omitted: a query point xq surrounded by '+' and '−' training examples, illustrating the decision surface induced by 1-NN (Voronoi diagram).]
Discussion on the k-NN Algorithm
Non-linear regression
Linear regression (method of least squares):
w1 = Σ_{i=1..|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1..|D|} (xi − x̄)²
w0 = ȳ − w1 x̄
Relative absolute error: Σ_{i=1..d} |yi − yi′| / Σ_{i=1..d} |yi − ȳ|
Relative squared error: Σ_{i=1..d} (yi − yi′)² / Σ_{i=1..d} (yi − ȳ)²
where yi is the actual value, yi′ the predicted value, and ȳ the mean of the yi
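A minimal Python sketch of the two relative error measures above, with illustrative actual and predicted values.

```python
# Relative absolute error and relative squared error for actual values y and predictions y_pred.
def relative_absolute_error(y, y_pred):
    y_bar = sum(y) / len(y)
    return sum(abs(a - p) for a, p in zip(y, y_pred)) / sum(abs(a - y_bar) for a in y)

def relative_squared_error(y, y_pred):
    y_bar = sum(y) / len(y)
    return sum((a - p) ** 2 for a, p in zip(y, y_pred)) / sum((a - y_bar) ** 2 for a in y)

y = [3.0, 5.0, 7.0, 9.0]          # actual values (illustrative)
y_pred = [2.5, 5.5, 6.5, 9.5]     # predicted values (illustrative)
print(relative_absolute_error(y, y_pred), relative_squared_error(y, y_pred))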
Random sampling: a variation of holdout; repeat the holdout k times and take the accuracy as the average of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets, each approximately equal in size; at the i-th iteration, use subset Di as the test set and the remaining subsets together as the training set
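A minimal Python sketch of the fold construction for k-fold cross-validation; the dataset size and k are illustrative.

```python
# Partition the data indices into k mutually exclusive folds; in iteration i,
# fold i is the test set and the remaining folds form the training set.
import random

def k_fold_splits(n_examples, k=10, seed=0):
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k mutually exclusive subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_splits(n_examples=20, k=5):
    print(len(train), "training /", len(test), "test examples")
```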
Ensemble methods
Use a combination of models to increase accuracy
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
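A minimal Python sketch of combining classifier outputs by an unweighted majority vote (as in bagging) or a weighted vote (as in boosting); the predictions and weights below are stand-ins, not learned models.

```python
# Combine the outputs of several classifiers for one tuple by (weighted) voting.
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Weighted combination: each classifier's vote counts with its weight."""
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w
    return max(scores, key=scores.get)

def majority_vote(predictions):
    """Unweighted combination: each classifier gets one vote."""
    return weighted_vote(predictions, [1.0] * len(predictions))

preds = ["yes", "no", "yes"]                   # outputs of three classifiers for one tuple
print(majority_vote(preds))                    # 'yes'
print(weighted_vote(preds, [0.2, 0.9, 0.3]))   # 'no': the heavily weighted classifier dominates
```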