Data Analytics
CLASSIFICATION
Classification: Motivation and Applications
Train-Validation Split and Cross-Validation
Evaluation Metrics and Class Imbalance
Overfitting
kNN Classifier
Naive Bayes Classifier
Decision Tree
Entropy, Conditional Entropy, Information Gain
US Classification 1 / 56
Classification: Definition
Classification is a supervised task
Input: a collection of objects (instances o1, ..., on), each described by a feature vector (x1, x2, ..., xm) with a class label y
Output: a model for the class attribute as a function of the other attributes
[Figure: table of train instances with known class labels and test instances whose labels (?) are to be predicted]
Training Set: instances whose class labels are used for learning
Test Set: instances with the same attributes as the training set but missing/hidden class labels
Goal: the model should accurately assign class labels to unlabeled instances
US Classification 2 / 56
Classification
[Figure: train instances (feature vectors with class labels) are used to learn a model for the class attribute; the model predicts the unknown labels (?) of test instances]
source: javapoint.com
US Classification 3 / 56
Classification
US Classification 4 / 56
Classification: Applications
Targeted Advertisement
Enhance marketing by identifying customers who are likely to buy a product
Use customer purchase history, demographics, etc. for similar (old) products
buy/no-buy as class labels
US Classification 5 / 56
Classification: Applications
Credit Card Fraudulent Transaction Detection
Use transaction history and cardholder characteristics
fair/fraud as class labels
source: Benchaji et al. (2019)
US Classification 6 / 56
Classification: Applications
Predict Customer Attrition/Churn
Use customers' transaction and feedback history
churn/no-churn as class labels
US Classification 7 / 56
Classification: Applications
Text Classification
Text is converted into feature vectors before classification
Document Classification
source: towardsdatascience.com
Sentiment Analysis, Emotion Mining
US Classification 8 / 56
Classification: Applications
Sky Survey Cataloging
Classify astronomical objects as stars or galaxies
Use telescope survey images (from the Palomar Observatory): 3000 images
with 23040 × 23040 pixels per image
Extract feature values: 40 features per object
star/galaxy as class labels
US Classification 9 / 56
Classification: Applications
US Classification 10 / 56
Classification Evaluation Metrics
Train-Validation Split and Cross-Validation
US Classification 11 / 56
Classification
The model (classifier) is learned by finding patterns in the training set
Performance on the training set does not (necessarily) indicate the generalization power of the model
A validation set (a subset of the training set) is used to learn parameters, tune the architecture of the classifier, and estimate error
For the model to generalize, the validation set must be representative of the input instances
Since the test set is never used during training, it provides an unbiased estimate of the generalization error
US Classification 12 / 56
Classification: Training-Validation split
Generally obtained by randomly splitting the dataset
e.g. a 70-30 or 80-20 random train-validation split
Use the average performance over multiple random splits
source: medium.com
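A minimal sketch of such a random split, using scikit-learn's train_test_split (the toy data and the 70-30 ratio are illustrative assumptions):

```python
# Hypothetical example: random 70-30 train-validation split with scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # 100 instances, 4 features (toy data)
y = rng.integers(0, 2, size=100)     # binary class labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_val.shape)    # (70, 4) (30, 4)
```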
US Classification 13 / 56
Classification: Cross-Validation
The dataset is randomly split into k folds
In iteration i (of k), the i-th fold is used for validation and the remaining folds for training
Every instance is used once for validation and k-1 times for training
e.g. 5-fold, 10-fold cross-validation
source: Scikit-learn
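A minimal sketch of k-fold cross-validation with scikit-learn; the toy data and the choice of a kNN classifier are assumptions for illustration:

```python
# Hypothetical example: 5-fold cross-validation, reporting the average validation accuracy
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[val_idx], y[val_idx]))   # accuracy on the held-out fold

print(np.mean(scores))                                  # average over the 5 folds
```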
US Classification 14 / 56
Classification: Evaluation Metrics
Binary Classifiers (for classifying into two classes) are evaluated by
tabulating the classification results in a Confusion Matrix
                        Actual Positive        Actual Negative
  Predicted Positive    True Positive (TP)     False Positive (FP)
  Predicted Negative    False Negative (FN)    True Negative (TN)
Some summary statistics of the confusion matrix are
$$ \text{ACCURACY} = \frac{TP + TN}{TP + TN + FP + FN} \qquad \text{ERROR} = \frac{FP + FN}{TP + TN + FP + FN} $$
ACCURACY and ERROR are usually reported as percentages
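A minimal sketch computing these counts and summary statistics from scratch (the label vectors are toy values):

```python
# Hypothetical example: confusion-matrix counts and the summary statistics above
def confusion_counts(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
error = (fp + fn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy, error)   # 3 3 1 1 0.75 0.25
```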
US Classification 15 / 56
Classification: Evaluation Metrics
                        Actual Positive        Actual Negative
  Predicted Positive    True Positive (TP)     False Positive (FP)
  Predicted Negative    False Negative (FN)    True Negative (TN)
With a big imbalance in the classes, ACCURACY and ERROR are misleading
In a tumor dataset where 99% of the samples are negative, (blindly) predicting all samples as negative gives 99% accuracy, but cancer is never detected
Have to use a cost matrix / loss function (essentially a weighted accuracy)
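A minimal sketch of a cost-sensitive evaluation under an assumed cost matrix (the costs and data below are purely illustrative):

```python
# Hypothetical cost matrix: cost[(actual, predicted)]; a missed tumor (FN) is assumed
# to be far more costly than a false alarm (FP)
cost = {(1, 1): 0, (1, 0): 100, (0, 1): 1, (0, 0): 0}

def average_cost(y_true, y_pred):
    return sum(cost[(t, p)] for t, p in zip(y_true, y_pred)) / len(y_true)

# 1% positives; a classifier that blindly predicts "negative" is 99% accurate
y_true = [1] * 1 + [0] * 99
y_pred_blind = [0] * 100
print(average_cost(y_true, y_pred_blind))   # 1.0 -> the single missed tumor dominates the cost
```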
US Classification 16 / 56
Classification: Evaluation Metrics
$$ \text{PRECISION} = \frac{TP}{TP + FP} \quad \text{(measure of exactness)} $$
$$ \text{RECALL} = \frac{TP}{TP + FN} \quad \text{also called sensitivity (measure of completeness)} $$
F-measure: the harmonic mean of the two; it is high only when both are high
$$ F_1 = \frac{2}{\frac{1}{\text{PRECISION}} + \frac{1}{\text{RECALL}}} $$
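A minimal sketch computing these three metrics from the confusion-matrix counts (toy counts):

```python
# Hypothetical example: precision, recall and F1 from confusion-matrix counts
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)   # harmonic mean of precision and recall
    return precision, recall, f1

print(precision_recall_f1(tp=3, fp=1, fn=1))   # (0.75, 0.75, 0.75)
```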
US Classification 17 / 56
Classification: OVERFITTING
Overfitting: the phenomenon when a model performs very well on training data but does not generalize to test data
The model learns the data and not the underlying function, essentially learning by rote
The model has too much freedom (many parameters with wide ranges)
Validation, cross-validation, early stopping, regularization, model comparison, and Bayesian priors help avoid overfitting
US Classification 18 / 56
Classification: OVERFITTING
US Classification 19 / 56
Classifier/Model
A classifier utilizes training data to understand how input variables
are related to the class variable
A model is built, which can be used to predict labels for unseen data
Kinds of Classifiers
Lazy Classifiers
Eager Classifiers
US Classification 20 / 56
Kinds of Classifiers
Lazy Classifiers
Store the training data and wait for testing data
For an unseen test data record (data point), assign class label based on the
most related points in the training data
Less training time, more prediction time
Examples: k-Nearest Neighbor (kNN) Classifier
Eager Classifiers
Construct a classification model based on the training data
For a test data point, use the model to assign a class label
More training time but less prediction time
Examples: Naive Bayes, Decision Tree
US Classification 21 / 56
Nearest Neighbors Classification and Regression
US Classification 22 / 56
k-Nearest Neighbor (kNN) Classifier
k-NN is a simple method used for classification (and also for regression)
The class label of a test instance x is predicted to be the most common
class among the k nearest neighbors of x in the train set
Assign the test instance (?) either class A or class B
k = 3 nearest neighbors (ℓ2 distance): 1 neighbor of class A and 2 of class B ⇒ assigned label = B
k = 7 nearest neighbors (ℓ2 distance): 4 neighbors of class A and 3 of class B ⇒ assigned label = A
US Classification 23 / 56
k-Nearest Neighbor (kNN)
The class label of a test instance x is predicted to be the most common
class among the k nearest neighbors of x in the train set
Assumes that the proximity measure captures class membership
Definition of proximity measure (defining ‘nearest’) is critical
The parameter k is important, and the result is sensitive to the local structure of the data
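A minimal from-scratch sketch of a kNN classifier with ℓ2 distance and majority vote (the toy data are an assumption for illustration):

```python
# A minimal k-NN classifier: l2 distance, majority vote among the k nearest train instances
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)           # l2 distance to every train instance
    nearest = np.argsort(dists)[:k]                        # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]  # most common class among them

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # "A"
```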
US Classification 24 / 56
k-Nearest Neighbor (kNN) Regression
In k-NN regression, for a test instance x the value of the target variable y is the 'average' of the y values of the k nearest neighbors of x in the train set
The 'average' can be the weighted mean (weighted by similarity); in this case k is generally taken large enough that all points are included in the neighborhood
$$ y(x) = \frac{\sum_{x' \in D} \text{sim}(x, x')\, y(x')}{\sum_{x' \in D} \text{sim}(x, x')} $$
where y(x') is the value of the target variable y for instance x'
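A minimal sketch of similarity-weighted kNN regression; the Gaussian similarity function and toy data are assumptions:

```python
# Similarity-weighted regression over all training points (the neighborhood is the whole train set)
import numpy as np

def weighted_knn_regress(X_train, y_train, x, bandwidth=1.0):
    sim = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2 * bandwidth ** 2))  # similarity weights
    return np.sum(sim * y_train) / np.sum(sim)                                # weighted mean of y

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(weighted_knn_regress(X_train, y_train, np.array([1.5])))   # a value between 1 and 4
```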
US Classification 25 / 56
Naive Bayes Classifier
US Classification 26 / 56
Naive Bayes Classifier
Classify x = (x1, ..., xn) into one of K classes C1, ..., CK
Naive Bayes is a conditional probability model
For instance x it computes probabilities Pr[class = Cj |x] for each class Cj
Assumes that
1 All attributes are equally important
2 Attributes are statistically independent given the class label
knowing value of one attribute says nothing about value of another
The independence assumption is almost never correct (thus the word Naive) ... but it works well in practice
The model consists of the probabilities calculated from the training data for each attribute value with respect to the class label
US Classification 27 / 56
Naive Bayes Classifier
Classify x = (x1, ..., xn) into one of K classes C1, ..., CK
We want to compute the Posterior: the probability of class Cj given the object x
$$ P(C_j \mid x) = \frac{P(x \mid C_j) \times P(C_j)}{P(x)} $$
Likelihood P(x | Cj): the probability of the predictor(s) given class Cj, computed from the frequencies of the predictor(s) within class Cj in the train set
Prior P(Cj): the probability of class Cj without considering x, estimated from the frequency of label Cj in the train set
Evidence P(x): the probability of observing x; it does not depend on the classes, and x is given, so it is effectively constant
Apply the independence assumption
$$ P(x \mid C_j) = P(x_1 \mid C_j) \times P(x_2 \mid C_j) \times \cdots \times P(x_n \mid C_j) $$
Substitute in the numerator and ignore the denominator
$$ P(C_j \mid x) \propto P(x_1 \mid C_j) \times P(x_2 \mid C_j) \times \cdots \times P(x_n \mid C_j) \times P(C_j) $$
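A minimal sketch of this decision rule; the spam/ham attribute probabilities below are made-up illustrative numbers, not from the slides:

```python
# Hypothetical Naive Bayes tables (illustrative numbers only): priors and per-attribute conditionals
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": {"word=offer": 0.30, "word=meeting": 0.05},
    "ham":  {"word=offer": 0.02, "word=meeting": 0.20},
}

def predict(features):
    scores = {}
    for c in priors:
        p = priors[c]
        for f in features:
            p *= cond[c][f]       # independence assumption: product of per-attribute conditionals
        scores[c] = p             # proportional to P(C_j | x); the denominator P(x) is ignored
    return max(scores, key=scores.get), scores

print(predict(["word=offer", "word=meeting"]))
# ('spam', {'spam': ~0.006, 'ham': ~0.0024})
```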
US Classification 28 / 56
Naive Bayes: Running Example
Train on records of weather conditions and whether or not the game was played.
Given a weather condition (test instance), predict whether the game will be played
N. Milkic & U. Krcadinac @ Uni. of Belgrade
US Classification 29 / 56
Naive Bayes: Running Example
$$ P(\text{play}=\text{yes} \mid x) \propto P(\text{outl}=\ast \mid \text{yes}) \times P(\text{temp}=\ast \mid \text{yes}) \times P(\text{humid}=\ast \mid \text{yes}) \times P(\text{wind}=\ast \mid \text{yes}) \times P(\text{yes}) $$
$$ P(\text{play}=\text{no} \mid x) \propto P(\text{outl}=\ast \mid \text{no}) \times P(\text{temp}=\ast \mid \text{no}) \times P(\text{humid}=\ast \mid \text{no}) \times P(\text{wind}=\ast \mid \text{no}) \times P(\text{no}) $$
US Classification 30 / 56
Naive Bayes: Running Example
US Classification 31 / 56
Naive Bayes: Running Example
US Classification 32 / 56
Naive Bayes: Running Example
Given weather condition x = (sunny, cool, high, true) will game be played?
$$ P(\text{play}=\text{yes} \mid x) \propto P(\text{sunny}\mid\text{yes}) \times P(\text{cool}\mid\text{yes}) \times P(\text{high}\mid\text{yes}) \times P(\text{true}\mid\text{yes}) \times P(\text{yes}) = 0.22 \times 0.33 \times 0.33 \times 0.33 \times 0.64 \approx 0.0053 $$
$$ P(\text{play}=\text{no} \mid x) \propto P(\text{sunny}\mid\text{no}) \times P(\text{cool}\mid\text{no}) \times P(\text{high}\mid\text{no}) \times P(\text{true}\mid\text{no}) \times P(\text{no}) = 0.60 \times 0.20 \times 0.80 \times 0.60 \times 0.36 \approx 0.0206 $$
Since 0.0206 > 0.0053, the predicted label is play = no
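A quick check of this arithmetic, using the exact frequency fractions of the standard 14-instance weather dataset (they round to the probabilities shown above):

```python
# Verifying the running-example arithmetic with exact fractions from the frequency tables
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)    # P(sunny|yes) P(cool|yes) P(high|yes) P(true|yes) P(yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)    # P(sunny|no)  P(cool|no)  P(high|no)  P(true|no)  P(no)
print(round(p_yes, 4), round(p_no, 4))            # 0.0053 0.0206
print("play =", "yes" if p_yes > p_no else "no")  # play = no
```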
US Classification 33 / 56
Naive Bayes: Issues
Some issues for the Naive Bayes classifier (you are encouraged to read about them):
Zero frequency problem: probability = 0 for an attribute value in a class
For example: P[Outlook = sunny | yes] = 0
One zero would make whole product zero
Solution: Laplace smoothing (add-one smoothing)
Missing value of an attribute for a test instance
usually attribute is omitted from probability calculation
What if values of attributes are continuous?
Discretization solves the problem in many cases
Can also assume a probability distribution for each continuous
attribute and learn distribution parameters from training set
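A minimal sketch of Laplace (add-one) smoothing for one conditional probability; the counts assume the standard weather dataset's Outlook values within class no:

```python
# Laplace (add-one) smoothing for a conditional probability estimate P(value | class)
# count[v] = number of training instances of the class that have attribute value v
def smoothed_prob(count, value, n_class_instances, n_values):
    # add 1 to every value's count; add n_values to the denominator so the probabilities still sum to 1
    return (count.get(value, 0) + 1) / (n_class_instances + n_values)

count_outlook_given_no = {"sunny": 3, "overcast": 0, "rainy": 2}
print(smoothed_prob(count_outlook_given_no, "overcast", n_class_instances=5, n_values=3))  # 1/8 = 0.125
```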
US Classification 34 / 56
Decision Tree Classifier
US Classification 35 / 56
Decision Tree
Fundamentally, an if-then rule set for classifying objects
Builds a model in the form of a tree structure
US Classification 36 / 56
Decision Tree
Zemel, Urtasun, Fidler @ Uni. of Toronto
Decision tree for binary classification of an instance with nominal attributes:
Outlook?
  sunny -> Humidity?
    high -> Temp?
      high -> Windy? (true -> No, false -> Yes)
      mild -> No
      cool -> No
    normal -> Yes
  overcast -> Yes
  rainy -> Windy? (true -> No, false -> Yes)
[The slide also shows a decision tree for binary classification of an instance with numeric attributes]
Each internal node tests an attribute xi
Branches correspond to possible (subsets of) values of xi
Each leaf node assigns a class label y
US Classification 37 / 56
Classification using Decision Trees
To classify a test instance x traverse the tree from root to leaf
Take branches at internal nodes according to results of their tests
Predict the class label at the leaf node reached
Zemel, Urtasun, Fidler @ Uni. of Toronto
US Classification 38 / 56
Classification using Decision Trees
To classify a test instance x traverse the tree from root to leaf
Take branches at internal nodes according to results of their tests
Predict the class label at the leaf node reached
Given weather condition x = (sunny, cool, high, true)
will game be played?
Outlook?
  sunny -> Humidity?
    high -> Temp?
      high -> Windy? (true -> No, false -> Yes)
      mild -> No
      cool -> No
    normal -> Yes
  overcast -> Yes
  rainy -> Windy? (true -> No, false -> Yes)
Path for x: Outlook = sunny -> Humidity = high -> Temp = cool -> predict No
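A minimal sketch of the same tree written as nested if-then rules (the structure is read off the figure above):

```python
# The example decision tree as nested if/else rules
def play_decision(outlook, temp, humidity, windy):
    if outlook == "overcast":
        return "Yes"
    if outlook == "rainy":
        return "No" if windy else "Yes"
    # outlook == "sunny"
    if humidity == "normal":
        return "Yes"
    if temp in ("mild", "cool"):
        return "No"
    return "No" if windy else "Yes"    # temp == "high": fall through to the Windy test

print(play_decision("sunny", "cool", "high", windy=True))   # "No"
```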
US Classification 39 / 56
Building Decision Tree
Building the optimal decision tree is an NP-HARD problem
J. Leskovec @ Stanford
Recursively build the tree top-down, using greedy heuristics
Start with an empty decision tree
Split the current dataset by the best attribute until a stopping condition is met
US Classification 40 / 56
Building Decision Tree
Suppose we are at some node G in the tree built so far
J. Leskovec @ Stanford
Shall we continue building the tree?
If yes, G is an internal node: which attribute do we split on (test)?
If no, G is a leaf: what is the prediction rule?
US Classification 41 / 56
Building Decision Tree
Stop, making G a leaf, when the sub-dataset DG at G
J. Leskovec @ Stanford
is pure (all instances have the same class label), or
when its size is small, e.g. |DG| ≤ 5
.. .
US Classification 42 / 56
Building Decision Tree
If we stop at G, then prediction at G can be
J. Leskovec @ Stanford
the mode of the class labels in the sub-dataset DG
For a numeric target variable, the prediction could be the average of the target variable values in DG
When the target variable is numeric, the tree is called a Regression Tree
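A minimal sketch of the two leaf prediction rules (toy label and target values):

```python
# Leaf prediction rules: mode of the class labels (classification), mean of the targets (regression tree)
from collections import Counter
from statistics import mean

labels_at_G = ["yes", "no", "yes", "yes"]         # class labels in the sub-dataset D_G (toy values)
print(Counter(labels_at_G).most_common(1)[0][0])  # classification leaf -> "yes"

targets_at_G = [3.0, 5.0, 4.0]                    # numeric target values (toy values)
print(mean(targets_at_G))                         # regression-tree leaf -> 4.0
```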
US Classification 43 / 56
Attribute Selection
US Classification 44 / 56
Building Decision Tree
Splitting attributes are selected based on metrics
J. Leskovec @ Stanford
e.g.
Entropy
Information Gain
Gini Index
Common algorithms for building Decision Trees are ID3, C4.5, ...
US Classification 45 / 56
Entropy
In information theory, entropy quantifies the average level of information content or uncertainty in a random variable
Flip a fair coin and a biased coin:
Outcome of Coin 1: 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 ?   (16 zeros, 4 ones)
Outcome of Coin 2: 1 0 0 1 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 ?   (9 zeros, 11 ones)
In which case would we be more surprised if the next outcome is a 1?
Entropy values (for a binary variable) range between 0 and 1 bit; the bit is the unit of entropy
Max surprise is for the fair coin (p = 1/2): no reason to expect one outcome over another
Min entropy value is 0 bits, for p = 0 or p = 1
These slides about information theory concepts are adapted from Grosse, Farahmand, & Carrasquilla, Uni. of Toronto
US Classification 46 / 56
Entropy
A random variable X taking values x1, ..., xn has entropy
$$ H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) $$
For the fair coin, p = 1/2: $H = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = 1$
For a one-sided coin, p = 1 or p = 0: $H = -1\log 1 - 0\log 0 = 0$
For Coin 1: $H = -\tfrac{16}{20}\log\tfrac{16}{20} - \tfrac{4}{20}\log\tfrac{4}{20} \approx 0.7219$
For Coin 2: $H = -\tfrac{9}{20}\log\tfrac{9}{20} - \tfrac{11}{20}\log\tfrac{11}{20} \approx 0.9928$
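A minimal sketch computing the entropies quoted above (log base 2, i.e. bits):

```python
# Entropy of a discrete distribution in bits
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)   # 0 log 0 is taken as 0

print(entropy([0.5, 0.5]))        # fair coin -> 1.0
print(entropy([1.0, 0.0]))        # one-sided coin -> 0.0
print(entropy([16/20, 4/20]))     # Coin 1 -> ~0.7219
print(entropy([9/20, 11/20]))     # Coin 2 -> ~0.9928
```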
US Classification 47 / 56
Entropy
A random variable X taking values x1, ..., xn has entropy
$$ H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) $$
source: Wikipedia
[Figure: entropy H(X) (expected surprise) of a coin flip, in bits, plotted versus the bias of the coin Pr(X = 1) = P(heads)]
US Classification 48 / 56
Entropy of joint distribution
Entropy of the joint distribution of random variables X and Y
Grosse, Farahmand, & Carrasquilla, Uni. of Toronto
US Classification 49 / 56
Conditional Entropy
Grosse, Farahmand, & Carrasquilla, Uni. of Toronto
US Classification 50 / 56
Conditional Entropy
Grosse, Farahmand, & Carrasquilla, Uni. of Toronto
US Classification 51 / 56
Conditional Entropy
Grosse, Farahmand, & Carrasquilla, Uni. of Toronto
US Classification 52 / 56
Conditional Entropy
Grosse, Farahmand, & Carrasquilla, Uni. of Toronto
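For reference, the standard definitions of joint and conditional entropy (with log base 2, in bits) are:
$$ H(X, Y) = -\sum_{x}\sum_{y} p(x, y)\,\log p(x, y) \qquad H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x) = -\sum_{x}\sum_{y} p(x, y)\,\log p(y \mid x) $$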
US Classification 53 / 56
Information Gain
Grosse, Farahmand, & Carrasquilla, Uni. of Toronto
US Classification 54 / 56
Information Gain
Grosse, Farahmand, & Carrasquilla, Uni. of Toronto
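For reference, the information gain of an attribute X about the class label Y is $IG(Y; X) = H(Y) - H(Y \mid X)$. A minimal sketch computing it for the Outlook attribute of the weather running example (the class counts assume the standard 14-instance dataset):

```python
# Information gain of the Outlook attribute for the weather ("play tennis") running example
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# class counts (play = yes, play = no) overall and per Outlook value
overall = (9, 5)
by_outlook = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}

h_play = entropy(overall)                                                          # ~0.940 bits
n = sum(overall)
h_play_given_outlook = sum(sum(c) / n * entropy(c) for c in by_outlook.values())   # ~0.694 bits
print(h_play - h_play_given_outlook)                                               # information gain ~0.247 bits
```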
US Classification 55 / 56
Classification: Some other Concepts
Some other concepts related to classification you should be familiar with
Decision boundary, ROC curve
Multi-class classification from binary classifiers
ONE-VS-ALL (ONE-VS-REST)
Some classifiers you should read about (at least a Wikipedia-level understanding is essential for reading papers and using them in your projects):
Random Forest, Support Vector Machine, Neural Networks, Deep
Learning
US Classification 56 / 56