DATA MINING
DATA CLASSIFICATION
DR PAUL HANCOCK
CURTIN UNIVERSITY
SEMESTER 2
CLASSIFICATION BASICS
Key concepts
- Data Partitioning
- Cross-validation
- Evaluation

Aggarwal Chapters/Sections
- 10.1, 10.2-10.2.1, 10.3, 10.5-10.5.1, 10.8, 10.9
- 11.1, 11.2, 11.3
D: Training data
- Already partitioned into groups, categories, or classes
- Class labels provided by domain experts
- Each sample has a label indicating which class it belongs to
- Used for selecting models and tuning parameters

Y: Test data
- Used to predict class labels
- A test sample must also belong to one of the known categories/classes
- Important statistical (learning) assumption: test samples come from the same distribution that generates the training samples
- If a test sample is identical to a sample in the training data, it must also have the same class label
Train:
  e | x y y a b n e c
  e | b s w a b g e c
  e | b y w l b n e c
  p | x y w p n p e e
  e | b s y a b g e c
  e | x y y l b g e c

Predict:
  ? | x y y a b n e c
  ? | b s y a b w e c

(UCI Mushroom dataset: the first column is the class label, e = edible, p = poisonous; the remaining columns are categorical attributes.)
https://archive-beta.ics.uci.edu/ml/datasets/73
DATA CLASSIFICATION
[Figure: labelled training samples in two groups, Apples and Pears; a new unlabelled sample must be classified as "Apple or Pear?"]
Training data is used to learn the structure of the groups.

Two main phases:
- Training: construct a predictive model from the training data
- Testing: apply the model to test samples to predict labels

Possible forms of the model:
- Extreme case: no model at all, just memory-based (k-NN)
- Partitioning the attribute space into regions of dominant labels (decision trees)
- Linear combination of attributes (SVM, linear discriminant analysis)
- A neural network with suitable weights
- Probabilistic representation (Bayesian methods)
Accuracy
- The fraction of test instances that were correctly labelled: (TP + TN) / Total

Precision
- Fraction of reported positives that are correct: TP / (TP + FP)

Recall
- Fraction of positives that are reported: TP / (TP + FN)

F1-measure
- 2 * Precision * Recall / (Precision + Recall) = 2*TP / (2*TP + FP + FN)

Confusion matrix

                Actual 1          Actual 0
  Predicted 1   True positive     False positive
  Predicted 0   False negative    True negative

Receiver Operating Characteristic (ROC) curve
- Area under the graph of TPR against FPR, a value in [0, 1]
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
Practical interpretations
- Unavoidable trade-off between false alarm rate and detection probability
- Operating point: depends on the specific application
https://towardsdatascience.com/visual-guide-to-the-confusion-matrix-bb63730c8eba
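As a concrete check of the formulas above, here is a minimal Python sketch (the function name and the example counts are invented for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # equal to 2*P*R / (P + R)
    return accuracy, precision, recall, f1

# Example with invented counts: 40 TP, 10 FP, 5 FN, 45 TN
print(classification_metrics(40, 10, 5, 45))
# -> (0.85, 0.8, 0.888..., 0.842...)
```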
- With class labels: for building the predictive model
- Without class labels: for predicting class labels by applying the model

Question: How to make best use of data with class labels?
- Accurately explaining the training data is not the main goal
- How to predict accuracy on unseen test samples? Take a subset of the labelled data out for validation
- The validation subset needs to reflect the statistics of unseen data
3. Determine the optimal parameters for the classification method using cross-validation
   This will use most of the data for training during each iteration
4. Once you have the optimal parameters, retrain on ALL the data
   This will use all data for training your best model (see the sketch below)
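One way to realise steps 3-4, as a minimal sketch using scikit-learn (the classifier and parameter grid are illustrative assumptions; X and y are assumed to hold all the labelled data):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}  # illustrative parameter grid

# Step 3: 5-fold cross-validation scores each setting, training on 4/5 of
# the data and validating on the remaining 1/5 in each iteration.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

# Step 4: with refit=True (the default), the best parameter setting is
# automatically retrained on ALL of the data.
best_model = search.best_estimator_
```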
1. Use your training data to determine how you will scale your data (e.g. decide on min/max/μ/σ)
2. Apply your scaling to ALL your data (both the training and test data)
3. If you use test data to decide on the scaling properties, then you leak information into your training data set and become overconfident in the performance of your classifier (see the sketch below)
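A minimal sketch of this rule using scikit-learn's StandardScaler (X_train and X_test are assumed to exist; any scaler with fit/transform behaves the same way):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                    # step 1: mu and sigma from TRAINING data only
X_train_s = scaler.transform(X_train)  # step 2: apply the scaling to training data...
X_test_s = scaler.transform(X_test)    # ...and to test data, with the SAME mu/sigma

# Fitting the scaler on X_test (or on train+test combined) would leak test
# statistics into training -- the pitfall described in step 3.
```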
Problems
- Some classes rarely occur in the training data and are often misclassified
- Training data is limited, and no new data can be obtained (easily)

Solutions (see the sketch after this list)
- Some algorithms can incorporate weights: down-weight common classes, up-weight rare classes
- Include weights into decision boundaries and/or evaluation metrics
- Biased sampling: over-sample rare classes, or under-sample common classes
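As an illustration of inverse-frequency weighting, a minimal sketch (class names and counts are invented; the formula matches scikit-learn's class_weight="balanced" option):

```python
import numpy as np

y = np.array(["common"] * 95 + ["rare"] * 5)  # invented, heavily imbalanced labels

# Inverse-frequency weights: down-weight common classes, up-weight rare ones.
classes, counts = np.unique(y, return_counts=True)
weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
print(weights)  # {'common': ~0.53, 'rare': 10.0}
```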
Pros
- Simple
- Memory based: no need to train a model
- Can be used with few training examples

Cons
- Slow classification when the training set is large (solution: pre-processing by clustering, e.g. BIRCH)
- Curse of dimensionality: dimensionality reduction?
- Sensitive to local noise/outliers
- Does not exploit data structure

Nearest neighbours: smallest distances = most similar; use k = 1 for the single nearest neighbour (a minimal sketch follows).
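A minimal memory-based k-NN classifier in plain NumPy (the fruit data are invented), showing that "training" is just storing the samples:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=1):
    """Label x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every sample
    nearest = np.argsort(dists)[:k]              # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
y_train = np.array(["apple", "apple", "pear", "pear"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> apple
```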
Model
- A set of hierarchical decisions on the feature variables
- Tree-like structure
- Split criterion: divides a subset of the training data into two or more parts
- Internal nodes: where splits happen
Purity pi: fraction of samples in subset Si having the dominant class label
Error rate: ei = 1 - pi

Given an r-way split of set S into S1, S2, ..., Sr:
- Ni: number of samples in Si
- For each subset compute the error rate ei = 1 - pi
- Compute the weighted average of error rates: sum over i of (Ni / N) * ei, where N is the total number of samples in S
- Repeat this for all possible r-way splits
- Select the one with the lowest weighted average error rate
[Worked example (figure): two candidate splits of the same set of squares and circles. For split 1 the right half has 3 squares and 7 circles (10 samples, dominant class circle); for split 2 the right half has 2 squares and 5 circles (7 samples, dominant class circle).]
Comparing weighted average error rates, split 1 is better (see the sketch below).
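A minimal sketch of the comparison; the right-half counts come from the example above, while the left-half counts are invented so that both splits partition the same 20 samples (10 squares, 10 circles):

```python
def weighted_error(subsets):
    """subsets: one [squares, circles] count pair per part of the split."""
    total = sum(sum(s) for s in subsets)
    # error rate of each part = 1 - purity = 1 - dominant count / part size
    return sum(sum(s) / total * (1 - max(s) / sum(s)) for s in subsets)

split_1 = [[7, 3], [3, 7]]  # left half invented; right half: 3 squares, 7 circles
split_2 = [[8, 5], [2, 5]]  # left half invented; right half: 2 squares, 5 circles
print(weighted_error(split_1))  # 0.30
print(weighted_error(split_2))  # 0.35 -> split 1 has the lower weighted error
```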
SPLITTING WITH ENTROPY
Entropy of a set S: E(S) = - sum over classes j of pj * log2(pj), where pj is the fraction of samples in S with class label j. Select the split with the lowest weighted entropy (see the sketch below).
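The same two candidate splits scored with weighted entropy (a sketch reusing the invented counts from the error-rate example):

```python
import math

def weighted_entropy(subsets):
    """Weighted average of per-part entropies, weights = part size / total."""
    total = sum(sum(s) for s in subsets)
    def entropy(s):
        n = sum(s)
        return -sum(c / n * math.log2(c / n) for c in s if c > 0)
    return sum(sum(s) / total * entropy(s) for s in subsets)

print(weighted_entropy([[7, 3], [3, 7]]))  # ~0.881
print(weighted_entropy([[8, 5], [2, 5]]))  # ~0.927 -> split 1 again wins
```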
Overfitting in decision trees: deep trees can partition the data space with zero training error. Deep trees = more complex model => may not perform well on unseen data.

When to stop splitting?
- Purity: stop when the purity of the current set ≥ a pre-defined purity threshold π

Tree pruning: converting internal nodes to leaf nodes
- Consider both the training error and the tree complexity
- Error is typically measured on the validation subset to evaluate the effectiveness of pruning
[Table: pros and cons of decision trees (content not recovered from the slide)]
Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B), where A = {class of interest} and B = {all samples}.

Categorical attributes
- P(ci): probability of class ci = fraction of samples from ci
- P(x|ci): probability of x given that it belongs to class ci
- Example: the training data have 10 Red and 8 Blue values for class c1; the current test sample x is Red, so P(x|c1) = 10/18
- Note: if some categories have few or zero counts, adjust the base count with +1 (see the sketch below)

Numeric attributes
- Replace the probability with a density f
- Mean and standard deviation estimated from the training data
- Multi-dimensional data: mean vector and covariance matrix
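A minimal sketch of the categorical estimate with the +1 adjustment (Laplace smoothing); the helper name is invented:

```python
def smoothed_prob(count, class_total, n_categories):
    """P(value | class) with every base count adjusted by +1."""
    return (count + 1) / (class_total + n_categories)

# Class c1 has 10 Red and 8 Blue samples (two categories: Red, Blue):
print(10 / 18)                   # raw estimate of P(Red | c1)
print(smoothed_prob(10, 18, 2))  # smoothed: 11/20 = 0.55
print(smoothed_prob(0, 18, 2))   # an unseen category gets 1/20, never 0
```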
- Categorical: joint probability of attribute values
- Numeric: mean vectors and covariance matrix
Evaluation

Classification algorithms:
- k-NN
- Decision trees
- Naïve Bayes