
COMP5009

DATA MINING

DATA
CLASSIFICATION
DR PAUL HANCOCK
CURTIN UNIVERSITY
SEMESTER 2
CLASSIFICATION BASICS

 Key concepts
  Data Partitioning
  Cross-validation
  Evaluation
 Aggarwal Chapters/Sections: 10.1, 10.2-10.2.1, 10.3, 10.5-10.5.1, 10.8, 10.9, 11.1, 11.2, 11.3

COMP5009 – DATA MINING, CURTIN UNIVERSITY 2


CLASSIFICATION

D: Training data
 Already partitioned into groups, categories, or classes
 Class labels provided by domain experts
 Each sample has a label indicating which class it belongs to
 For selecting models and tuning parameters

Y: Test data
 To predict class labels
 A test sample must also belong to one of the known categories/classes
 Important statistical (learning) assumption: test samples come from the same distribution that generates the training samples
 If a test sample is identical to a sample in the training data, it must also have the same class label

COMP5009 – DATA MINING, CURTIN UNIVERSITY 3


DATA CLASSIFICATION
Train:
edible or poisonous?  cap-shape  cap-surface  cap-color  odor  gill-size  gill-color  stalk-shape  stalk-root
p                     x          s            n          p     n          k           e            e
e                     x          s            y          a     b          k           e            c
e                     b          s            w          l     b          n           e            c
p                     x          y            w          p     n          n           e            e
e                     x          s            g          n     b          k           t            e
e                     x          y            y          a     b          n           e            c
e                     b          s            w          a     b          g           e            c
e                     b          y            w          l     b          n           e            c
p                     x          y            w          p     n          p           e            e
e                     b          s            y          a     b          g           e            c
e                     x          y            y          l     b          g           e            c

Predict:
?                     x          y            y          a     b          n           e            c
?                     b          s            y          a     b          w           e            c
https://archive-beta.ics.uci.edu/ml/datasets/73
COMP5009 – DATA MINING, CURTIN UNIVERSITY 4
DATA CLASSIFICATION

[Figure: labelled example images of Apples and Pears; a new fruit image is to be classified as Apple or Pear?]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 5




TRAINING, VALIDATION, AND TEST DATA

 Test data must be kept separate to avoid overfitting (and over-confidence)
 Try multiple models
 Models typically have hyper-parameter(s)
  Select values that produce the best average performance with cross validation
  Tuning = optimizing over hyper-parameters
 Select best model and parameters for final model
 Retrain best model using the optimum parameters and all training data
 Test best model on testing data

COMP5009 – DATA MINING, CURTIN UNIVERSITY 8


SUPERVISED LEARNING

Classification: supervised learning
 Training data used to learn structure of the groups
 Two main phases:
  Training: construct predictive model from training data
  Testing: apply model to test samples to predict labels

Training models: mathematical description of how attributes/features are mapped to the classes. Examples:
 Extreme case: no model at all, just memory-based (k-NN)
 Partitioning the attribute spaces into regions of dominant labels (decision trees)
 Linear combination of attributes (SVM, linear discriminant analysis)
 A neural network with suitable weights
 Probabilistic representation (Bayesian methods)

COMP5009 – DATA MINING, CURTIN UNIVERSITY 9


PYTHON EXAMPLE: IRIS DATA

 Use Iris flower measurements to predict species
 Learn from labeled data
 Predict class of unseen data
 See prac 05
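A minimal sketch of this workflow (not the prac 05 code itself), assuming scikit-learn is available; the Iris data ships with sklearn.datasets:

# Sketch: learn from labelled Iris data, then predict the class of unseen samples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                 # 4 measurements per flower, 3 species
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = KNeighborsClassifier(n_neighbors=5)          # memory-based classifier
clf.fit(X_train, y_train)                          # "learn" from labelled data
print("Test accuracy:", clf.score(X_test, y_test))
print("Predicted classes for 5 unseen samples:", clf.predict(X_test[:5]))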

COMP5009 – DATA MINING, CURTIN UNIVERSITY 10


MODEL EVALUATION

Accuracy
 The fraction of test instances that were correctly labeled: (TP + TN) / Total
Precision
 Fraction of reported positives that are correct: TP / (TP + FP)
Recall
 Fraction of positives that are reported: TP / (TP + FN)
F1-measure
 2 * Precision * Recall / (Precision + Recall)
 = 2*TP / (2*TP + FP + FN)
Receiver Operating Curve (ROC)
 Area under the graph of TPR vs FPR ∈ [0, 1]

Confusion matrix:
             Actual 1         Actual 0
Predicted 1  True positive    False positive
Predicted 0  False negative   True negative
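The same quantities written out as code, a sketch assuming the raw confusion-matrix counts are already available (the numbers here are hypothetical):

# Sketch: evaluation metrics from confusion-matrix counts (toy values).
TP, FP, FN, TN = 40, 10, 5, 45

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)                              # a.k.a. true positive rate (TPR)
f1        = 2 * precision * recall / (precision + recall)
assert abs(f1 - 2 * TP / (2 * TP + FP + FN)) < 1e-12    # equivalent form from the slide

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")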

COMP5009 – DATA MINING, CURTIN UNIVERSITY 11


TPR, FPR AND THRESHOLD

 Models/methods typically have a threshold parameter that is used to separate class A from class B (e.g. a threshold on the likelihood of being class A)
 TPR and FPR are thus functions of the threshold
 Low threshold -> more detections, more false alarms
 High threshold -> fewer detections, fewer false alarms

COMP5009 – DATA MINING, CURTIN UNIVERSITY 12


RECEIVER OPERATING CURVE (ROC)

 Plot of detection probability (TPR) vs false alarm rate (FPR)
 Starts from (0,0) and ends at (1,1)
 Desirable curve: detection probability close to 1 for small false alarm value

https://en.wikipedia.org/wiki/Receiver_operating_characteristic
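A short sketch of how the curve is traced in practice, assuming scikit-learn is available; the labels and classifier scores below are hypothetical:

# Sketch: ROC curve and AUC from true labels and classifier scores (toy values).
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                       # hypothetical ground truth
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]      # likelihood of being class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # one (FPR, TPR) point per threshold
print("AUC =", roc_auc_score(y_true, y_score))           # 1.0 is ideal, 0.5 is chance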

COMP5009 – DATA MINING, CURTIN UNIVERSITY 13


ROC AND DESIRABLE PERFORMANCE

 How to quantify “desirable performance”?
  Area under the curve (AUC): as close to 1 as possible
  Equal error rate (EER): miss detection rate = false alarm rate
 Practical interpretations
  Unavoidable trade-off between false alarm rate and detection probability
  Operating point: depends on the specific application

COMP5009 – DATA MINING, CURTIN UNIVERSITY 14


CONFUSION MATRIX

 Useful for binary and multi-label classification methods
 Can identify similar classes or classes which are easily confused

https://towardsdatascience.com/visual-guide-to-the-confusion-matrix-bb63730c8eba

COMP5009 – DATA MINING, CURTIN UNIVERSITY 15


VALIDATION

Data subsets
 With class labels: for building the predictive model
 Without class labels: for predicting class labels by applying the model
 Question: How to make best use of data with class labels?

Goal of classification: predictive power
 Predict the test samples as accurately as possible
 Accurately explaining the training data is not the main goal
 How to predict accuracy on unseen test samples?
  Take a subset of the labelled data out for validation
  Validation subset needs to reflect the statistics of unseen data

COMP5009 – DATA MINING, CURTIN UNIVERSITY 16


N-FOLD CROSS VALIDATION

COMP5009 – DATA MINING, CURTIN UNIVERSITY 17


VALIDATION EXAMPLE

 Use subsets to train and validate
 Apply best model to new data
 Hope/Assume that sample data is representative of population data

COMP5009 – DATA MINING, CURTIN UNIVERSITY 18


CROSS VALIDATION

n-fold cross validation
 Divide labelled data into n blocks
 Take one block out for validation, train on the others
 Repeat for all blocks
 Measure average performance across all blocks

Stratified cross-validation:
 Ensures that each class is represented proportionally in all the subsets

Leave-one-out cross validation:
 Special case when block size = 1 sample
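A sketch of n-fold (stratified) cross-validation, assuming scikit-learn and the Iris data again:

# Sketch: 5-fold stratified cross-validation of a k-NN classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # each class kept proportional in every block
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)
print("fold accuracies:", scores)
print("average performance:", scores.mean())
# Leave-one-out is the special case where n_splits equals the number of samples (sklearn: LeaveOneOut).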

COMP5009 – DATA MINING, CURTIN UNIVERSITY 19


CROSS-VALIDATION SPLITTING SCHEMES

COMP5009 – DATA MINING, CURTIN UNIVERSITY 20


OPTIMAL USE OF YOUR TRAINING DATA

1. Choose a cross validation method

2. Choose a classification method

3. Determine the optimal parameters for the classification method using cross validation

 This will use most data for training during each iteration

4. Once you have the optimal parameters, retrain on ALL the data

 This will use all data for training your best model
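A sketch of steps 1-4 with scikit-learn (assumed available); GridSearchCV with its default refit=True performs the final retraining on all of the supplied training data:

# Sketch: tune hyper-parameters with cross-validation, then retrain the best model on ALL the data.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                    # stands in for the labelled training set
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},     # step 3: find optimal parameters via CV
    cv=5,
    refit=True)                                      # step 4: refit the best model on all the data
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
best_model = search.best_estimator_                  # trained on all the training data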

COMP5009 – DATA MINING, CURTIN UNIVERSITY 21


SCALING AND PREPROCESSING FOR CLASSIFICATION TASKS

1. Use your training data to determine how you will scale your data (e.g. decide on min/max/μ/σ)

2. Apply your scaling to ALL your data (both the training and test data)

3. If you use test data to decide on the scaling properties, then you leak information into your training data set, and
become overconfident in the performance of your classifier.
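A minimal sketch of the leak-free ordering, using scikit-learn's StandardScaler as one example of "deciding on μ/σ from the training data":

# Sketch: fit the scaler on training data only, then apply the same scaling to train AND test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)    # step 1: mu/sigma determined from TRAINING data only
X_train_s = scaler.transform(X_train)     # step 2: apply the same scaling to all data
X_test_s  = scaler.transform(X_test)
# Fitting the scaler on X_test (or on all of X) would leak test statistics -> overconfidence.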

COMP5009 – DATA MINING, CURTIN UNIVERSITY 22


BINARY VS MULTI-CATEGORY CLASSIFICATION

 Many algorithms primarily developed for binary classification, i.e. 2 classes/categories
 Classification methods developed for binary classification can be extended to the multi-category case (see the sketch below):

One-versus-all
 Meaning: compare one category (positive) against “the rest” (negative)
 Build k binary classifiers during training
 Testing: each classifier produces a score (confidence) for whether the result for that category is positive
 Pick the category with the highest score (out of k scores)

One-versus-one (also called All-versus-all)
 Build k(k − 1)/2 binary classifiers during training
 Apply all k(k − 1)/2 binary classifiers during testing
 A category gets +1 if predicted by a classifier
 Pick the category with the highest score
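A sketch of both schemes using scikit-learn's wrappers (assumed available), with a linear SVM as the base binary learner:

# Sketch: one-versus-all (k classifiers) vs one-versus-one (k(k-1)/2 classifiers).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                   # k = 3 classes
ova = OneVsRestClassifier(LinearSVC()).fit(X, y)    # builds k binary classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)     # builds k(k-1)/2 binary classifiers
print(len(ova.estimators_), "one-versus-all classifiers")   # 3
print(len(ovo.estimators_), "one-versus-one classifiers")   # 3*2/2 = 3 when k = 3
print(ova.predict(X[:2]), ovo.predict(X[:2]))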

COMP5009 – DATA MINING, CURTIN UNIVERSITY 23


CLASSIFICATION ISSUES

Problems
 Data size: number of training samples for each category is small
  The learned model does not reflect well how data points are distributed => poor prediction
 Overfitting: model optimized only on the training set and not on the unseen samples
  Lacks generalization ability => poor prediction
 Underfitting: model too simple to describe the data statistics, or model not suitable to the data
  Low prediction accuracy

Solutions
 Obtain more data samples: collect or inject samples (resampling, introducing small noise, data augmentation)
 Select the right models, need to understand the data well
 Regularization: more penalty for more complex models
 Use validation/cross-validation to select robust training models
 Select/combine suitable classification methods

COMP5009 – DATA MINING, CURTIN UNIVERSITY 24


ADDRESSING RARE CLASSES AND SMALL DATA

Problems
 Some classes rarely occur in the training data and are often misclassified
 Training data is limited, and no new data can be obtained (easily)

Solutions
 Some algorithms can incorporate weights
  Down-weight common classes, up-weight rare classes
  Include weights into decision boundaries and/or evaluation metrics
 Biased sampling
  Over-sample the rare class, or under-sample common classes
 Synthetic oversampling (SMOTE*)
  Use existing data to generate synthetic training data

*Synthetic Minority Over-sampling Technique

COMP5009 – DATA MINING, CURTIN UNIVERSITY 25
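A sketch of the class-weighting idea with scikit-learn (SMOTE itself lives in the separate imbalanced-learn package and is not shown here); the dataset below is synthetic:

# Sketch: up-weight the rare class so the classifier does not ignore it.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: roughly 5% positives (the rare class).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

plain    = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
# "balanced" down-weights the common class and up-weights the rare class
# in inverse proportion to class frequency.
print("positives predicted (plain):   ", plain.predict(X).sum())
print("positives predicted (weighted):", weighted.predict(X).sum())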
CLASSIFICATION ALGORITHMS

 k-NN
 Decision trees
 Naïve Bayes

Aggarwal Ch 10.2, 10.3, 10.5.1

COMP5009 – DATA MINING, CURTIN UNIVERSITY 26


K-NEAREST NEIGHBOURS CLASSIFICATION

 Data labeled as Red, Green, Blue
 Label new data (star) based on the class of its neighbors

COMP5009 – DATA MINING, CURTIN UNIVERSITY 27


K-NEAREST NEIGHBOURS CLASSIFICATION

 k=1 nearest neighbor

 k>1 (odd) majority vote of neighbors

COMP5009 – DATA MINING, CURTIN UNIVERSITY 28


K-NN

Pros
 Simple
 Memory based: no need to train a model
 Can be used with few training examples

Cons
 Slow classification when training size is large
  Solution: pre-processing (clustering, e.g. BIRCH)
 Curse of dimensionality: reduction?
 Sensitive to local noise/outliers
 Does not exploit data structure
 Non-resolvable cases with n > 2 categories

COMP5009 – DATA MINING, CURTIN UNIVERSITY 29


K-NN CONSIDERATIONS

 Nearest neighbors
  Smallest distances
  Most similar
 Data similarity and distances
  Numeric: Lp norm, cosine, correlation, etc.
  Categorical: overlap measure with/without inverse occurrence frequency
  Mixed-type data: weighted sum of numerical and categorical similarities
 Choice of k and distance metric is important

COMP5009 – DATA MINING, CURTIN UNIVERSITY 30


K-NN EXAMPLE
Outlook   Temperature  Humidity  Windy  Play Golf
Rainy     Hot          High      False  No
Rainy     Hot          High      True   No
Overcast  Hot          High      False  Yes
Sunny     Mild         High      False  Yes
Sunny     Cool         Normal    False  Yes
Sunny     Cool         Normal    True   No
Overcast  Cool         Normal    True   Yes
Rainy     Mild         High      False  No
Rainy     Hot          Normal    False  ?

 Use overlap measure
 Use k=1
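A sketch of the overlap (mismatch-count) distance applied to this table, where the last row is the query and k=1:

# Sketch: 1-NN with the overlap measure (distance = number of mismatching attributes).
train = [
    # (Outlook, Temperature, Humidity, Windy) -> Play Golf
    (("Rainy",    "Hot",  "High",   False), "No"),
    (("Rainy",    "Hot",  "High",   True),  "No"),
    (("Overcast", "Hot",  "High",   False), "Yes"),
    (("Sunny",    "Mild", "High",   False), "Yes"),
    (("Sunny",    "Cool", "Normal", False), "Yes"),
    (("Sunny",    "Cool", "Normal", True),  "No"),
    (("Overcast", "Cool", "Normal", True),  "Yes"),
    (("Rainy",    "Mild", "High",   False), "No"),
]
query = ("Rainy", "Hot", "Normal", False)

def overlap_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

distances = sorted((overlap_distance(x, query), label) for x, label in train)
print(distances[0])   # nearest neighbour has distance 1 and label "No" -> predict Play Golf = No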

COMP5009 – DATA MINING, CURTIN UNIVERSITY 31


DECISION TREES

 Tree – hierarchical structure of a/b choices or splits
 Each split partitions the space into sub-spaces
 After some number of splits, each partition is labeled

From Zaki + Meira

COMP5009 – DATA MINING, CURTIN UNIVERSITY 34


DECISION TREES

From Zaki + Meira

COMP5009 – DATA MINING, CURTIN UNIVERSITY 35


DECISION TREES

Model
 A set of hierarchical decisions on the feature variables
 Tree-like structure
 Split criterion: divide a subset of the training data into two or more parts
 Internal nodes: where splits happen
 Leaf nodes: dominant class labels

Predict
 Traverse from root -> leaf according to the splits

Key questions
 How do we split?
 When do we stop?
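A short sketch of this model, assuming scikit-learn's DecisionTreeClassifier as one concrete implementation:

# Sketch: a decision tree as hierarchical splits, trained and then used for prediction.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))          # internal nodes: split conditions; leaves: dominant class labels
print(tree.predict(X[:3]))        # prediction = traverse root -> leaf for each sample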

COMP5009 – DATA MINING, CURTIN UNIVERSITY 36


SPLIT CRITERIA

Goal: to maximize the separation of the different classes among the child nodes
 Dependent on the type of attribute
  Binary: only one choice
  Categorical with r different values: r-way split, or converting to binary
  Numeric: several options
   r-way split if containing a small number of r ordered values
   Common: split using a binary condition, e.g. x ≤ a

How do we measure the best split option?
 Error rate/purity
 Gini index
 Entropy
COMP5009 – DATA MINING, CURTIN UNIVERSITY 37


SPLITTING WITH ERROR RATE OR PURITY

 Purity = fraction of samples having the dominant class label
 Error rate = 1 − Purity
 Best split is based on the smallest weighted average of error rates
  Given an r-way split of set S into S1, S2, ..., Sr
  Nr: number of samples in Sr
  For each subset compute the error rate er = 1 − pr
  Compute the weighted average of error rates: esplit = (N1·e1 + ... + Nr·er) / (N1 + ... + Nr)
 Repeat this for all possible r-way splits
 Select the one with the lowest weighted average error rate

COMP5009 – DATA MINING, CURTIN UNIVERSITY 38


EXAMPLE 1

 Using error rate, which is the better split?

COMP5009 – DATA MINING, CURTIN UNIVERSITY 39




Split 1
 Left half: 6 squares, 2 circles, total 8 samples, dominant class is square
  e1 = 2/(2+6) = 2/8
 Right half: 3 squares, 7 circles, total 10 samples, dominant class is circle
  e2 = 3/(3+7) = 3/10
 Weighted average of error rates
  esplit1 = (2/8 x 8 + 3/10 x 10) / (8+10) = 5/18

Split 2
 Left half: 7 squares, 4 circles, total 11 samples, dominant class is square
  e1 = 4/(4+7) = 4/11
 Right half: 2 squares, 5 circles, total 7 samples, dominant class is circle
  e2 = 2/(2+5) = 2/7
 Weighted average of error rates
  esplit2 = (4/11 x 11 + 2/7 x 7) / (11+7) = 6/18

5/18 < 6/18, so split 1 is better

COMP5009 – DATA MINING, CURTIN UNIVERSITY 42


SPLITTING WITH GINI INDEX

 Given an r-way split of set S into S1, S2, ..., Sr
 Nr: number of samples in Sr
 For each subset Sr
  p1, p2, ..., pk: fraction of samples from the k classes
  Gini index: G(Sr) = 1 − Σi pi²
 Compute the weighted average of Gini indices: Gsplit = (N1·G(S1) + ... + Nr·G(Sr)) / (N1 + ... + Nr)
 Select the split with the lowest weighted average Gini index

COMP5009 – DATA MINING, CURTIN UNIVERSITY 43


EXAMPLE 2

 Using the Gini index, which is the better split?

COMP5009 – DATA MINING, CURTIN UNIVERSITY 44




Split 1
 Left half: 6 squares, 2 circles, total 8 samples, dominant class is square
 Right half: 3 squares, 7 circles, total 10 samples, dominant class is circle
 Weighted average of Gini indices

Split 2
 Left half: 7 squares, 4 circles, total 11 samples, dominant class is square
 Right half: 2 squares, 5 circles, total 7 samples, dominant class is circle
 Weighted average of Gini indices

Split 1 has the lower weighted Gini index, so split 1 is the better split
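The weighted Gini values on this slide appear as images in the original; a small sketch in plain Python that reproduces the arithmetic for both candidate splits, alongside the error-rate and entropy criteria from the neighbouring slides:

# Sketch: score both splits; (squares, circles) counts taken from the example figures.
from math import log2

def error_rate(counts):                 # 1 - purity
    return 1 - max(counts) / sum(counts)

def gini(counts):                       # G = 1 - sum(p_i^2)
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):                    # E = -sum(p_i log2 p_i), with 0*log2(0) treated as 0
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def weighted(measure, *subsets):        # weighted average over the child nodes
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * measure(s) for s in subsets)

split1 = [(6, 2), (3, 7)]
split2 = [(7, 4), (2, 5)]
for name, split in [("split 1", split1), ("split 2", split2)]:
    print(name,
          "error=%.3f" % weighted(error_rate, *split),
          "gini=%.3f" % weighted(gini, *split),
          "entropy=%.3f" % weighted(entropy, *split))
# Split 1 scores lower on all three criteria, so split 1 is the better split.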
COMP5009 – DATA MINING, CURTIN UNIVERSITY 47
SPLITTING WITH ENTROPY

Entropy: measure of disorder or uncertainty
 Given an r-way split of set S into S1, S2, ..., Sr
 Nr: number of samples in Sr
 For each subset Sr
  p1, p2, ..., pk: fraction of samples from the k classes
  Entropy: E(Sr) = − Σi pi log2(pi), with pi = 0 => pi log2(pi) = 0
 Compute the weighted average of entropy scores
 Select the split with the lowest weighted entropy

COMP5009 – DATA MINING, CURTIN UNIVERSITY 48


WHEN TO STOP SPLITTING

 Recall: overfitting is one major issue to address
  Overfitting in decision trees: deep trees can partition the data space with zero training error
  Deep trees = more complex model => may not perform well on unseen data
 Stopping criteria: prevent further splits of a node when
  Size: the number of data points in the current node ≤ a predefined size threshold η (typically small); OR
  Purity: the purity of the current set ≥ a pre-defined purity threshold π
 Tree pruning: converting internal nodes to leaf nodes
  Consider both the training error and the tree complexity
  Error typically measured on the validation subset to evaluate the effectiveness of pruning

COMP5009 – DATA MINING, CURTIN UNIVERSITY 49


EXAMPLE 3

 Given the above split, do we need to split either group further?
 Assume thresholds:
  η = 5 (size)
  π = 0.9 (purity)

COMP5009 – DATA MINING, CURTIN UNIVERSITY 50


DECISION TREES

Pros
 Simple and interpretable
 Easy addition of new scenarios
 Effective and efficient
 Analytical power

Cons
 Complex calculation if outcomes are linked
 High cost with large trees
 Prone to overfitting
 Lack of probabilistic sense (confidence, certainty, etc.)

COMP5009 – DATA MINING, CURTIN UNIVERSITY 51


BAYES CLASSIFICATION

 Probability of a joint event and conditional probability: P(A, B) = P(A|B) P(B) = P(B|A) P(A)
 Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
 A = {class of interest}, B = {all samples}

COMP5009 – DATA MINING, CURTIN UNIVERSITY 52


BAYES CLASSIFICATION

 P(A): the prior likelihood of a sample belonging to the class of interest (based on training data)
 P(B): the prior likelihood of the sample existing (ignored)
 P(A|B): given the current sample, how likely it is from the class of interest
 P(B|A): if the sample comes from the class of interest, how likely it is observed
 A = {class of interest}, B = {all samples}

COMP5009 – DATA MINING, CURTIN UNIVERSITY 53


BAYES CLASSIFICATION IN PRACTICE

 D: training data set
 x1, ..., xn: training samples, each with d dimensions
 Classes {c1, ..., ck}
 P(ci): probability of class ci = fraction of samples from ci
 P(x|ci): probability of x, given class ci
 Prediction of class: the class that maximizes the posterior probability, ĉ = argmax over ci of P(ci|x), with P(ci|x) ∝ P(x|ci) P(ci)
 For numeric attributes, we can replace the probability with a density f

COMP5009 – DATA MINING, CURTIN UNIVERSITY 54


EXAMPLE 1D - IRIS

 For a test point x we compute P(ci|x) ∝ f(x|ci) P(ci) for all 3 classes

COMP5009 – DATA MINING, CURTIN UNIVERSITY 55


MODELLING THE CLASS DISTRIBUTION P(X|C)

Categorical
 Simple counting (1D)
 Bernoulli modeling (multi-dimensional)
 Example: training data have 10 Red and 8 Blue values for class c1
  Current test sample x is Red
  P(x|c1) = 10/18
 Note: if some categories have few or zero counts, adjust the base count with +1

Numeric
 Approximate by a Gaussian and replace probability with a density evaluation
 Mean and standard deviation estimated from training data
 Multi-dimensional data: mean vector and covariance matrix

COMP5009 – DATA MINING, CURTIN UNIVERSITY 56


EXAMPLE 2D - IRIS

COMP5009 – DATA MINING, CURTIN UNIVERSITY 57


LARGE DIMENSIONALITY – BAYES CLASSIFIER

 Categorical: joint probability of attribute values  Numeric: mean vectors and covariance matrix

COMP5009 – DATA MINING, CURTIN UNIVERSITY 58


NAÏVE BAYES

Issue: estimating μ/Σ reliably from the training data is difficult!
 d² − d cross-terms in the covariance matrix
 They describe how attributes vary against each other
 Need a lot of samples to estimate reliably
 Need a lot of computing power (time)

Naïve Bayes: ignore the cross-terms!
 Assume attributes are independent
 Estimate their distributions separately (multiple 1D problems)
 Simplify the calculations
 Works better than expected

COMP5009 – DATA MINING, CURTIN UNIVERSITY 59


LARGE DIMENSIONALITY – NAÏVE BAYES CLASSIFIER

 Categorical: joint probability of attribute values  Numeric: mean vectors and covariance matrix

COMP5009 – DATA MINING, CURTIN UNIVERSITY 60


EXAMPLE 2D - IRIS

Note the lack of rotation: Naïve Bayes ignores the correlation between attributes

COMP5009 – DATA MINING, CURTIN UNIVERSITY 61


EXAMPLE

 Given the data table below, classify a new object (T, F, 1) using Naïve Bayes.

Attribute 1  Attribute 2  Attribute 3  Class
T            T            5            Y
T            T            7            Y
T            F            8            N
F            F            3            Y
F            T            7            N
F            T            4            N
F            F            5            N
T            F            6            Y
F            T            1            N

COMP5009 – DATA MINING, CURTIN UNIVERSITY 62


EXAMPLE

 Split the table into classes
 Compute the mean and variance of the numerical attribute

Class Y:
Attribute 1  Attribute 2  Attribute 3  Class
T            T            5            Y
T            T            7            Y
F            F            3            Y
T            F            6            Y
 μ = 5.25, σ = 1.71

Class N:
Attribute 1  Attribute 2  Attribute 3  Class
T            F            8            N
F            T            7            N
F            T            4            N
F            F            5            N
F            T            1            N
 μ = 5, σ = 2.74

The density function is then a Gaussian with these parameters, evaluated at the test value.

COMP5009 – DATA MINING, CURTIN UNIVERSITY 63


EXAMPLE

 Test for class Y (class tables as on the previous slide; query object: T, F, 1)
  Class prior P(Y) = 4/9
  Categorical
   P(a1 = T | Y) = 3/4
   P(a2 = F | Y) = 1/2
  Numerical
   μ = 5.25, σ = 1.71
  All up: P = 0.44 x 0.75 x 0.5 x 0.026 = 0.0043

Query row: T  F  1  ?

COMP5009 – DATA MINING, CURTIN UNIVERSITY 64


EXAMPLE

 Test for class N (same tables; query object: T, F, 1)
  Class prior P(N) = 5/9
  Categorical
   P(a1 = T | N) = 1/5
   P(a2 = F | N) = 2/5
  Numerical
   μ = 5, σ = 2.74
  All up: P = 0.55 x 0.2 x 0.4 x 0.126 = 0.0055

 0.0055 > 0.0043 -> predict class N

Query row classified: T  F  1  N
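A sketch of the same calculation in code, using Python's statistics module and the full Gaussian density; the slide's intermediate values appear to omit the constant 1/sqrt(2π) factor, which is identical for both classes and so does not change the predicted class:

# Sketch: Naïve Bayes by hand for the table above; query object x = (T, F, 1).
from statistics import mean, stdev      # stdev = sample standard deviation, as on the slide
from math import sqrt, pi, exp

Y_rows = [("T", "T", 5), ("T", "T", 7), ("F", "F", 3), ("T", "F", 6)]
N_rows = [("T", "F", 8), ("F", "T", 7), ("F", "T", 4), ("F", "F", 5), ("F", "T", 1)]
x = ("T", "F", 1)

def score(rows, total):
    prior = len(rows) / total
    p_a1 = sum(r[0] == x[0] for r in rows) / len(rows)      # P(attribute 1 = T | class)
    p_a2 = sum(r[1] == x[1] for r in rows) / len(rows)      # P(attribute 2 = F | class)
    mu, sd = mean(r[2] for r in rows), stdev(r[2] for r in rows)
    density = exp(-(x[2] - mu) ** 2 / (2 * sd ** 2)) / (sd * sqrt(2 * pi))
    return prior * p_a1 * p_a2 * density

total = len(Y_rows) + len(N_rows)
print("class Y:", score(Y_rows, total))   # proportional to the slide's 0.0043
print("class N:", score(N_rows, total))   # proportional to the slide's 0.0055 -> predict class N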

COMP5009 – DATA MINING, CURTIN UNIVERSITY 65


SUMMARY

Classification basics
 Key concepts
 Data Partitioning
 Cross-validation
 Evaluation

Classification algorithms
 k-NN
 Decision trees
 Naïve Bayes

COMP5009 – DATA MINING, CURTIN UNIVERSITY 66


NEXT: REGRESSION
CHAPTER 11.5.1, 11.5.5, 11.5.6

COMP5009 – DATA MINING, CURTIN UNIVERSITY 67
