
Classification

Part 1

CME4416 – Introduction to Data Mining

Asst. Prof. Dr. Göksu Tüysüzoğlu


Outline

◘ What Is Classification?
◘ Classification Examples
◘ Classification Methods
– Decision Trees
– Bayesian Classification
– K-Nearest Neighbor
– Neural Network
– Support Vector Machines (SVM)
– Fuzzy Set Approaches
What Is Classification?

◘ Classification
– Construction of a model to classify data
– When constructing the model, use the training set and the class labels
(e.g., yes/no) in the target column

Training Set → Model

Terminology

◘ Classifier: An algorithm that maps the input data to a specific category.


◘ Classification model: A classification model tries to draw conclusions
from the input values given for training and uses them to predict the class
labels/categories of new data.
◘ Binary Classification: Classification task with two possible outcomes.
Eg: Gender classification (Male / Female)
◘ Multi-class classification: Classification with more than two classes. In
multi-class classification, each sample is assigned to one and only one
target label.
Eg: An animal can be a cat or dog but not both at the same time.
◘ Multi-label classification: Classification task where each sample is
mapped to a set of target labels (more than one class).
Eg: A news article can be about sports, a person, and a location at the
same time.
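A tiny sketch of how the target labels differ across these three task types; the label values are hypothetical, not from the slides:

```python
# Hypothetical target labels illustrating the three task types above.
binary      = ["male", "female", "male"]                # binary: two possible outcomes
multi_class = ["cat", "dog", "cat"]                     # multi-class: exactly one of several classes per sample
multi_label = [["sports", "person"], ["location"], []]  # multi-label: a set of labels per sample
```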
Classification Steps

1. Model construction
– Each tuple is assumed to belong to a predefined class
– The set of tuples used for model construction is training set
– The model is represented as classification rules, trees, or mathematical formulas

2. Test Model
– Using test set, estimate accuracy rate of the model
• Accuracy rate is the percentage of test set samples that are correctly classified
by the model

3. Model Usage (Classifying future or unknown objects)


– If the accuracy is acceptable, use the model to classify data tuples whose class
labels are not known
Classification Steps

Training Set (step 1, Construct Model: Training Set → Learn → Classifier Model)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (step 2, Test Model)
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             No
Yes     Married         50K             Yes
No      Married         150K            Yes
Yes     Divorced        90K             No

New Data (step 3, Use Model)
Refund  Marital Status  Taxable Income  Cheat
Yes     Divorced        50K             ?
No      Married         50K             ?
Yes     Single          150K            ?
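A minimal sketch of the three steps on a toy version of the "Cheat" data above; pandas and scikit-learn are assumed to be available, and the dummy encoding and choice of a decision tree are illustrative, not prescribed by the slides.

```python
# A sketch of the three classification steps (construct, test, use).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

train = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                "Married", "Divorced", "Single", "Married", "Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Refund":  ["No", "Yes", "No", "Yes"],
    "Marital": ["Single", "Married", "Married", "Divorced"],
    "Income":  [75, 50, 150, 90],
    "Cheat":   ["No", "Yes", "Yes", "No"],
})

# 1. Model construction: learn from the training set and its class labels.
X_train = pd.get_dummies(train.drop(columns="Cheat"), dtype=int)
model = DecisionTreeClassifier(random_state=0).fit(X_train, train["Cheat"])

# 2. Test the model: estimate its accuracy rate on the held-out test set.
X_test = pd.get_dummies(test.drop(columns="Cheat"), dtype=int).reindex(
    columns=X_train.columns, fill_value=0)
print("accuracy:", accuracy_score(test["Cheat"], model.predict(X_test)))

# 3. Model usage: classify a new tuple whose class label is unknown.
new = pd.DataFrame({"Refund": ["Yes"], "Marital": ["Divorced"], "Income": [50]})
X_new = pd.get_dummies(new, dtype=int).reindex(columns=X_train.columns, fill_value=0)
print("predicted class:", model.predict(X_new)[0])
```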
Applications of Classification Algorithms

◘ Email spam classification


◘ Predicting whether bank customers are willing to repay a loan
◘ Identifying cancerous tumor cells
◘ Sentiment analysis
◘ Drug classification
◘ Facial key point detection
◘ Pedestrian detection in autonomous driving

(Figure: data from previous customers, with attributes Age, Salary, Profession, Location, and Customer type (Good/Bad), is fed to a classifier, which learns rules such as "Salary > 5 L" and "Prof. = Exec"; the rules are then applied to a new applicant's data.)


Classification Techniques

1. Decision Trees
2. Bayesian Classification
3. K-Nearest Neighbor
4. Neural Network
5. Support Vector Machines (SVM)
6. Fuzzy Set Approaches

(Bayesian classification picks the class c that maximizes (p(c_j) / p(d)) · Π_{i=1..n} p(a_i | c_j).)

Classification Techniques

Decision Trees

Bayesian Classification

K-Nearest Neighbor

Classification Neural Network

Support Vector Machines (SVM)

Fuzzy Set Approaches


Decision Trees

◘ Decision Tree is a tree where
– internal nodes are simple decision rules on one or more attributes
– leaf nodes are predicted class labels
◘ Decision trees are used for deciding between several courses of action

Training data (buys_computer):
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Resulting decision tree (each internal node tests an attribute, branches carry attribute values, leaves give the classification):
age?
  <=30   → student?        (no → no, yes → yes)
  31…40  → yes
  >40    → credit rating?  (excellent → no, fair → yes)
Decision Regions
Rules Indicated by Decision Trees

◘ Write a rule for each path in the decision tree from the root to a leaf.
Entropy
Information Gain

◘ Information gain is a measure of the reduction in the overall


entropy of a set of instances that is achieved by testing on a
descriptive feature
◘ Computing information gain is a three-step process:
1. Compute the entropy of the original dataset with respect to the
target feature. This gives us a measure of how much information is
required in order to organize the dataset into pure sets.
2. For each descriptive feature, create the sets that result by
partitioning the instances in the dataset using their feature values, and
then sum the entropy scores of each of these sets. This gives a
measure of the information that remains required to organize the
instances into pure sets after we have split them using the descriptive
feature.
3. Subtract the remaining entropy value (computed in step 2) from the
original entropy value (computed in step 1) to give the information
gain.
Information Gain

◘ Which attribute is the best classifier?


– Calculate the information gain G(S,A) for each attribute A.
– Select the attribute with the highest information gain.

Entropy(S) = − Σ_{i=1..m} p_i · log2(p_i)        (for two classes: Entropy(S) = −p1 log2(p1) − p2 log2(p2))

Gain(S, A) = Entropy(S) − Σ_{i ∈ Values(A)} (|S_i| / |S|) · Entropy(S_i)
Information Gain

Classification tree training [1]


Decision Tree Algorithms

◘ ID3
– Quinlan (1981)
– Tries to reduce the expected number of comparisons
◘ C4.5
– Quinlan (1993)
– An extension of ID3
– Just starting to be used in data mining applications
– Also used for rule induction
◘ CART
– Breiman, Friedman, Olshen, and Stone (1984)
– Classification and Regression Trees
◘ CHAID
– Kass (1980)
– Oldest decision tree algorithm
– Well established in database marketing industry
◘ QUEST
– Loh and Shih (1997)
ID3 Algorithm

Input: A decision table with discrete-valued attributes.


Output: A decision tree.

1. For each attribute A, compute its information gain.


2. Select the attribute A* with the maximum information gain.
3. Partition the data set into k subsets according to the values of A*,
where k is the number of distinct values of A*.
4. For each subset, if the class labels of the instances are all the same,
create a leaf node with that label; otherwise, repeat the above process on the subset.
5. Output a decision tree.
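A compact Python sketch of the ID3 loop above; the row format (a list of dicts with a "label" key) and the function names are illustrative, not from the slides.

```python
# A minimal ID3 sketch for discrete-valued attributes.
import math
from collections import Counter

def entropy(rows, target="label"):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target="label"):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attrs, target="label"):
    labels = {r[target] for r in rows}
    if len(labels) == 1:                      # pure subset -> leaf node with that label
        return labels.pop()
    if not attrs:                             # no attributes left -> majority label
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))   # step 2
    tree = {best: {}}
    for value in {r[best] for r in rows}:     # step 3: one branch per value of the best attribute
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best], target)
    return tree
```

Calling id3 on the PlayTennis table with attributes Outlook, Temperature, Humidity, and Wind would reproduce the tree built step by step on the following slides.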
Decision Tree Construction

Which attribute first?


Decision Tree Construction

Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940


Decision Tree Construction
Values(Wind) = {weak, strong}
S_weak   = [6+, 2-]
S_strong = [3+, 3-]

Day  Wind    Tennis?
D1   weak    no
D2   strong  no
D3   weak    yes
D4   weak    yes
D5   weak    yes
D6   strong  no
D7   strong  yes
D8   weak    no
D9   weak    yes
D10  weak    yes
D11  strong  yes
D12  strong  yes
D13  weak    yes
D14  strong  no
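A quick numeric check of these counts (plain Python, only the math module): it reproduces Entropy(S) = 0.940 and, looking ahead, Gain(S, Wind) = 0.048.

```python
# S has 9 "yes" / 5 "no"; S_weak = [6+, 2-], S_strong = [3+, 3-].
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(p / total * math.log2(p / total) for p in (pos, neg) if p)

e_s      = entropy(9, 5)            # ≈ 0.940
e_weak   = entropy(6, 2)            # ≈ 0.811
e_strong = entropy(3, 3)            # = 1.000
gain_wind = e_s - (8 / 14) * e_weak - (6 / 14) * e_strong
print(round(e_s, 3), round(gain_wind, 3))   # 0.94 0.048
```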
Decision Tree Construction
Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Outlook:
  Sunny    [2+, 3-]   E = 0.971
  Overcast [4+, 0-]   E = 0.0
  Rain     [3+, 2-]   E = 0.971

Humidity:
  High     [3+, 4-]   E = 0.985
  Normal   [6+, 1-]   E = 0.592
Decision Tree Construction

Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Decision Tree Construction

Outlook
  Sunny    → ?    [2+, 3-]
  Overcast → yes  [4+, 0-]
  Rain     → ?    [3+, 2-]
Decision Tree Construction

◘ Which attribute is next?

Outlook
  Sunny    → ?    [2+, 3-]
  Overcast → yes  [4+, 0-]
  Rain     → ?    [3+, 2-]

Gain(S_sunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
Gain(S_sunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(S_sunny, Temperature) = 0.970 − (2/5)·0 − (2/5)·1 − (1/5)·0 = 0.570

Decision Tree Construction

Outlook
  Sunny    → Humidity
               High   → no   [D1, D2, D8]
               Normal → yes  [D9, D11]
  Overcast → yes  [D3, D7, D12, D13]
  Rain     → Wind
               Weak   → yes  [D4, D5, D10]
               Strong → no   [D6, D14]
Converting the Tree to Rules

Outlook
  Sunny    → Humidity   (High → No, Normal → Yes)
  Overcast → Yes
  Rain     → Wind       (Strong → No, Weak → Yes)

R1: If (Outlook = Sunny) ∧ (Humidity = High) Then PlayTennis = No
R2: If (Outlook = Sunny) ∧ (Humidity = Normal) Then PlayTennis = Yes
R3: If (Outlook = Overcast) Then PlayTennis = Yes
R4: If (Outlook = Rain) ∧ (Wind = Strong) Then PlayTennis = No
R5: If (Outlook = Rain) ∧ (Wind = Weak) Then PlayTennis = Yes
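The five rules, written out as a small hypothetical Python function for illustration:

```python
# Rules R1-R5 above as a predict function; attribute values are the strings used in the slides.
def play_tennis(outlook, humidity, wind):
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"      # R1, R2
    if outlook == "Overcast":
        return "Yes"                                       # R3
    if outlook == "Rain":
        return "No" if wind == "Strong" else "Yes"         # R4, R5

print(play_tennis("Sunny", "Normal", "Weak"))   # Yes (rule R2)
```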
Gain Ratio for Attribute Selection (C4.5)
◘ The information gain measure is biased towards attributes with a large
number of values
◘ C4.5 (a successor of ID3) uses gain ratio to overcome this problem
(a normalization of information gain)

SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) · log2(|D_j| / |D|)

◘ GainRatio(A) = Gain(A) / SplitInfo_A(D)
◘ The attribute with the maximum gain ratio is selected as the splitting
attribute
Computation of Gain Ratio
◘ Suppose the attribute "Wind" partitions D into 8 tuples in D1: {Weak} and 6 in
D2: {Strong}

SplitInfo_Wind(D) = −(8/14) log2(8/14) − (6/14) log2(6/14) = 0.9852

Gain(S, Wind) = 0.048

GainRatio(Wind) = Gain(S, Wind) / SplitInfo_Wind(D) = 0.048 / 0.9852 = 0.0487
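A quick check of the split-information and gain-ratio values above in plain Python:

```python
# Wind splits D into 8 Weak / 6 Strong tuples.
import math

split_info = -sum(n / 14 * math.log2(n / 14) for n in (8, 6))
gain_wind = 0.048                        # from the earlier information-gain slide
print(round(split_info, 4))              # 0.9852
print(round(gain_wind / split_info, 4))  # ≈ 0.0487
```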
Gini Index (CART, IBM IntelligentMiner)
◘ If a data set D contains examples from n classes, the gini index gini(D) is
defined as

      gini(D) = 1 − Σ_{j=1..n} p_j²

  where p_j is the relative frequency of class j in D
◘ If a data set D is split on A into two subsets D1 and D2, the gini index
gini_A(D) is defined as

      gini_A(D) = (|D1| / |D|) · gini(D1) + (|D2| / |D|) · gini(D2)

◘ Reduction in impurity:  Δgini(A) = gini(D) − gini_A(D)
◘ The attribute that provides the largest reduction in impurity is chosen to
split the node
◘ A gain measure analogous to information gain can be computed by replacing
the entropy measure with the Gini index.
Computation of Gini Index
◘ Ex.: D has 9 tuples with "PlayTennis" = "yes" and 5 with "no"

      gini(D) = 1 − (9/14)² − (5/14)² = 0.459

◘ Suppose the attribute "Wind" partitions D into 8 tuples in D1: {Weak} and 6 in
D2: {Strong}

      gini_Wind(D) = (8/14) · gini(D_Weak) + (6/14) · gini(D_Strong)

      gini(D_Weak)   = 1 − (6/8)² − (2/8)² = 0.375
      gini(D_Strong) = 1 − (3/6)² − (3/6)² = 0.5

      gini_Wind(D) = (8/14) · 0.375 + (6/14) · 0.5 = 0.429

      Δgini(Wind) = gini(D) − gini_Wind(D) = 0.459 − 0.429 = 0.03
Overfitting and Tree Pruning
◘ Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or
outliers
– Poor accuracy for unseen samples
◘ Two approaches to avoid overfitting
– Prepruning: Halt tree construction early; do not split a node if this
would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
• Use a set of data different from the training data to decide
which is the “best pruned tree”
Random Forest
◘ A random forest F = (G1, …, Gm) is an ensemble of random decision trees Gi.
◘ Initially proposed by Breiman in 2001 [1].
◘ It uses bootstrap sampling and aggregation (bagging) [2] when building each
individual tree.
◘ The prediction of an uncorrelated forest of trees is more accurate than that
of any individual tree.
Disadvantage:
High memory consumption, O(2^d): memory consumption grows exponentially
with the depth d of the trees.
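A minimal random-forest sketch, assuming scikit-learn is available; the synthetic dataset and the hyperparameters are illustrative, not from the slides.

```python
# Train and evaluate a random forest of bootstrap-sampled trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; limiting max_depth also limits the O(2^d) memory growth noted above.
forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```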
Classification Techniques

Decision Trees

Bayesian Classification

K-Nearest Neighbor

Classification Neural Network

Support Vector Machines (SVM)

Fuzzy Set Approaches


Classification Techniques
2- Bayesian Classification
◘ A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities.
◘ Foundation: Based on Bayes' Theorem.
Given training data X, the posterior probability of a hypothesis H, P(H|X),
follows Bayes' theorem:

      P(H|X) = P(X|H) · P(H) / P(X)

Example: According to a study, 1 out of 43 children develops a certain disease in
adulthood. The test is not completely reliable: an infected child tests positive 80% of the
time, and a healthy child tests positive 10% of the time. Given this information, what is the
probability that a child with a positive test result is actually ill?

P(A): the probability that the child is ill = 1/43
P(B): the probability that the test is positive = (1/43) · 0.80 + (42/43) · 0.10 = 5/43
P(A|B): the probability that the child is ill given a positive test (the unknown)
P(B|A): the probability that an ill child tests positive = 0.80

P(A|B) = P(B|A) · P(A) / P(B) = (0.80 · 1/43) / (5/43) = 0.16 = 16%
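The same example, checked in a few lines of Python:

```python
# Bayes' theorem applied to the disease-test example above.
p_ill = 1 / 43
p_pos_given_ill = 0.80
p_pos = p_ill * 0.80 + (42 / 43) * 0.10           # total probability of a positive test
p_ill_given_pos = p_pos_given_ill * p_ill / p_pos
print(round(p_ill_given_pos, 2))                  # 0.16
```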


Classification Techniques
2- Bayesian Classification
◘ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
(using the buys_computer training data shown in the Decision Trees section)

◘ P(C1): P(buys_computer = "yes") = 9/14 = 0.643
  P(C2): P(buys_computer = "no") = 5/14 = 0.357

◘ Compute P(X|Ci) for each class
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

◘ P(X|C1): P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X|C2): P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X|Ci) · P(Ci):
  P(X | buys_computer = "yes") · P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") · P(buys_computer = "no") = 0.007

Therefore, X belongs to the class ("buys_computer = yes")
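A short Python check of this Naive Bayes computation, using the probabilities read off the table above:

```python
# Priors and per-class conditional probabilities for X = (age<=30, income=medium, student=yes, credit=fair).
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

scores = {}
for c in priors:
    p = priors[c]
    for q in cond[c]:
        p *= q                              # naive (conditional independence) assumption
    scores[c] = p

print({c: round(p, 3) for c, p in scores.items()})       # {'yes': 0.028, 'no': 0.007}
print("predicted class:", max(scores, key=scores.get))   # yes
```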


Classification Techniques
2- Bayesian Classification
Classification Techniques

Decision Trees

Bayesian Classification

K-Nearest Neighbor

Classification Neural Network

Support Vector Machines (SVM)

Fuzzy Set Approaches


K-Nearest Neighbor (k-NN)

◘ An object is classified by a majority vote of its neighbors (the k closest
members).

◘ If k = 1, then the object is simply assigned to the class of its nearest


neighbor.

◘ The Euclidean distance measure is used to calculate how close the neighbors are.

K-Nearest Neighbor (k-NN)
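A minimal k-NN sketch in plain Python (majority vote under Euclidean distance); the toy data points are illustrative.

```python
# Classify a query point by a majority vote of its k nearest neighbors.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # sort training points by distance to the query and take the k closest
    neighbors = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.5, 3.9), (0.9, 1.1)]
y = ["A", "A", "B", "B", "A"]
print(knn_predict(X, y, (1.1, 1.0), k=3))   # A
```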
Confusion Matrix with Classification Metrics

                           Predicted: Positive                    Predicted: Negative
Actual: Positive (True)    True Positive (TP)                     False Negative (FN) (Type II Error)
Actual: Negative (False)   False Positive (FP) (Type I Error)     True Negative (TN)
Confusion Matrix of Email Classification
                           Predicted: Positive    Predicted: Negative
Actual: Positive (True)    TP = 45                FN = 20
Actual: Negative (False)   FP = 5                 TN = 30

◘ 69.23% of the spam emails are correctly classified as spam (recall = TP / (TP + FN) = 45/65).
◘ 85.71% of the non-spam emails are correctly classified as non-spam (specificity = TN / (TN + FP) = 30/35).
◘ 90% of the examples classified as spam are actually spam (precision = TP / (TP + FP) = 45/50).
◘ 75% of all examples are correctly classified by the classifier (accuracy = (TP + TN) / total = 75/100).
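The four percentages above, computed from the confusion-matrix counts:

```python
# Metrics derived from the email confusion matrix (TP=45, FN=20, FP=5, TN=30).
TP, FN, FP, TN = 45, 20, 5, 30

recall      = TP / (TP + FN)                   # ≈ 0.6923  (share of actual spam caught)
specificity = TN / (TN + FP)                   # ≈ 0.8571  (share of non-spam kept)
precision   = TP / (TP + FP)                   # 0.90      (share of predicted spam that is spam)
accuracy    = (TP + TN) / (TP + FN + FP + TN)  # 0.75

print(round(recall, 4), round(specificity, 4), precision, accuracy)
```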
Validation Techniques
◘ Simple Validation
– The data is split once into a training set and a test set.

◘ Cross Validation
– The data is split into two parts; each part serves once as the training set and once as the test set.

◘ n-Fold Cross Validation
– The data is split into n folds; each fold serves once as the test set while the remaining n−1 folds form the training set.
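A minimal n-fold cross-validation sketch (n = 5 here), assuming scikit-learn is available; the Iris dataset and the k-NN classifier are illustrative choices.

```python
# Estimate accuracy with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", scores.mean().round(3))
```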
References
◘ [1] Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision forests: A unified framework for
classification, regression, density estimation, manifold learning and semi-supervised
learning. Foundations and Trends® in Computer Graphics and Vision, 7(2–3), 81-227.
