CSCI417
Machine Intelligence
Lecture # 2
Spring 2024
1
Tentative Course Topics
1. Machine Learning Basics
2. Classifying with k-Nearest Neighbors
3. Splitting datasets one feature at a time: decision trees
4. Classifying with probability theory: naïve Bayes
5. Logistic regression
6. Support vector machines
7. Model Evaluation and Improvement: Cross-validation, Grid Search, Evaluation Metrics, and Scoring
8. Ensemble learning and improving classification with the AdaBoost meta-algorithm.
9. Introduction to Neural Networks - Building NN for classification (binary/multiclass)
10. Convolutional Neural Networks (CNN)
11. Pretrained models (VGG, AlexNet, ...)
12. Machine learning pipeline and use cases.
2
Other Names
Non-parametric classification algorithm.
Instance-based algorithm.
Lazy learning algorithm.
Competitive learning algorithm.
4
Classification revisited
Classification is dividing up objects so that each is assigned to one of
a number of discrete and definite categories known as classes.
Examples:
• customers who are likely to buy or not buy a particular product in a
supermarket
• people who are at high, medium or low risk of acquiring a certain illness
5
Classification revisited
Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Training Set → Learn → Model (Classifier)

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
6
Before K-NN vs. After K-NN
[Figure: two panels over attributes x1 and x2. Before K-NN, a new data point of unknown class lies between Category 1 and Category 2; after K-NN, the new data point is assigned to Category 1.]
1. A positive integer k is specified, along with a new sample (e.g., k = 1, 3, 5)
2. We select the k entries in our training data set which are closest to
the new sample
3. We find the most common classification of these entries
4. This is the classification we give to the new sample
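As a rough illustration of these four steps (a minimal sketch, not code from the lecture; the function name and toy data are made up), plain NumPy is enough:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new sample to every stored training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]               # indices of the k closest entries
    # Steps 3-4: the most common label among the k neighbours becomes the prediction
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy usage: two classes, two attributes
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 9.0], [9.0, 10.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([9.1, 11.0]), k=3))   # -> 1
```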
Example
Two classes, two attributes.
How do we classify the new point (9.1, 11)?
9
STEP 1: Choose the number K of neighbors, e.g. K = 5.
[Figure: scatter plot over x1 and x2 showing the new data point among Category 1 and Category 2.]
[Figure: two points P1(x1, y1) and P2(x2, y2) plotted on the x-y plane.]
Euclidean distance between P1 and P2:
d(P1, P2) = sqrt((x2 − x1)² + (y2 − y1)²)
Among the K = 5 nearest neighbors: Category 1 has 3 neighbors and Category 2 has 2 neighbors, so the new data point is assigned to Category 1.
[Figure: scatter plot over x1 and x2 highlighting the 5 nearest neighbors of the new data point.]
Classifying with distance measurements: k-Nearest Neighbors (kNN)
• Store all training examples
Each point is a “vector” of attributes
• Classify new examples based on most “similar” training examples
Similar means “closer” in vector space
16
Components of K-NN classifier
Distance metric
− How do we measure distance between instances?
− Determines the layout of the example space (the appropriate metric differs for categorical vs. continuous variables)
The k hyper-parameter
− How large a neighborhood should we consider?
− Determines the complexity of the hypothesis space
26
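The slide leaves the choice of metric open; as an assumed illustration (not from the lecture), Euclidean distance is a common choice for continuous attributes and Hamming distance for categorical ones:

```python
import numpy as np

def euclidean(a, b):
    # common metric for continuous attributes
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def hamming(a, b):
    # common metric for categorical attributes: number of mismatching positions
    return sum(x != y for x, y in zip(a, b))

print(euclidean([3.0, 4.0], [0.0, 0.0]))              # 5.0
print(hamming(["Yes", "Single"], ["No", "Single"]))   # 1
```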
Decision Boundaries
• Regions in feature space closest to each training example
• If a test point falls in the region corresponding to a given training point, return its label
28
Decision Boundary of a kNN
The value of k tunes the complexity of the hypothesis space:
− If k = 1, every training example has its own neighborhood
− If k = N, the entire feature space is one neighborhood!
• Higher k yields smoother decision boundaries
30
Variations on kNN
Weighted voting
• Default: all neighbors have equal weight
• Extension: weight votes of neighbors by (inverse) distance
• The intuition behind weighted kNN is to give more weight to points which are nearby and less weight to points which are farther away.
Epsilon Ball Nearest Neighbors
Same general principle as kNN, but change the method for selecting which training examples vote
• Instead of using the K nearest neighbors, use all training examples x_i such that d(x, x_i) ≤ ε
31
https://www.cs.umd.edu/
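A minimal sketch of both variations (illustrative only; the helper names and the 1/d weighting scheme are assumptions, not the lecture's exact formulation):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_new, k=5):
    # Weight each of the k nearest neighbours by inverse distance: closer points count more
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    weights = 1.0 / (d[nearest] + 1e-9)            # small constant avoids division by zero
    classes = np.unique(y_train)
    scores = [weights[y_train[nearest] == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

def epsilon_ball_predict(X_train, y_train, x_new, eps=2.0):
    # Every training example with d(x, x_i) <= eps gets an (equal) vote
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    voters = y_train[d <= eps]
    if voters.size == 0:
        return None                                 # no training example inside the ball
    values, counts = np.unique(voters, return_counts=True)
    return values[int(np.argmax(counts))]
```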
Issues with KNN – Effect of K
32
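The figure for this slide did not survive extraction; as a stand-in, a small experiment (scikit-learn and the toy data are my choices, not the lecture's) shows how k can flip a prediction when a noisy point sits nearby:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# A Category 1 cluster, a Category 2 cluster, and one noisy Category 2 point inside cluster 1
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [1.6, 1.6],
              [8, 8], [8, 9], [9, 8], [9, 9]])
y = np.array([1, 1, 1, 1, 2,
              2, 2, 2, 2])

x_new = [[1.5, 1.5]]
for k in (1, 3, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict(x_new)[0])
# k = 1 follows the single noisy neighbour and predicts 2;
# k = 3 and k = 5 smooth the decision and predict 1
```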
Evaluation
Supervised Learning: Training and Testing
In supervised learning, the task is to learn a mapping from inputs to outputs given a training dataset D = {(x_1, y_1), ..., (x_n, y_n)} of n input-output pairs.
Let f(x; θ) be our ML algorithm that maps input examples x to output labels y, parameterized by θ, where θ captures all the learnable parameters of our ML algorithm.
35
Classification Loss Function
The classification loss is the average misclassification rate:
L(θ) = (1/n) · Σ_{i=1..n} 1[ f(x_i; θ) ≠ y_i ]
Our goal is to minimize the loss function, i.e., find a set of parameters θ that makes the misclassification rate as close to zero as possible.
Remember that, for continuous labels or response variables, a common loss function is the Mean Square Error (MSE):
MSE(θ) = (1/n) · Σ_{i=1..n} ( f(x_i; θ) − y_i )²
36
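A small numerical sketch of these two losses (the labels and predictions below are toy numbers, assumed for illustration):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])      # f(x_i; theta) for some classifier

misclassification_rate = np.mean(y_pred != y_true)
print(misclassification_rate)               # 2 errors out of 6 -> 0.333...

# For continuous labels, the Mean Square Error plays the same role
t_true = np.array([2.0, 3.5, 5.0])
t_pred = np.array([2.5, 3.0, 4.0])
mse = np.mean((t_pred - t_true) ** 2)
print(mse)                                  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```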
Performance Measurement
• The quality of predictions from a learned model is often expressed in terms of a loss function. A loss function tells you how much you will be penalized for making a guess g when the answer is actually a.
• There are many possible loss functions. Here are some frequently used examples:
• 0-1 loss applies to predictions drawn from finite domains.
37
Performance Measurement
• Squared loss: L(g, a) = (g − a)²
• Linear (absolute) loss: L(g, a) = |g − a|
• Asymmetric loss: consider a situation in which you are trying to predict whether someone is having a heart attack. It might be much worse to predict “no” when the answer is really “yes” than the other way around.
38
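Using the guess g / answer a notation above, the three losses can be written as small functions (the factor of 5 in the asymmetric case is an arbitrary choice for illustration):

```python
def squared_loss(g, a):
    return (g - a) ** 2

def linear_loss(g, a):            # also called absolute loss
    return abs(g - a)

def asymmetric_loss(g, a):
    # Illustrative only: predicting "no heart attack" (g = 0) when the answer
    # is really "yes" (a = 1) is penalised five times more than a false alarm.
    if g == 0 and a == 1:
        return 5.0
    return 0.0 if g == a else 1.0

print(squared_loss(2.0, 5.0), linear_loss(2.0, 5.0))    # 9.0 3.0
print(asymmetric_loss(0, 1), asymmetric_loss(1, 0))     # 5.0 1.0
```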
Accuracy – the simple metric
Accuracy = (number of items in a class labeled correctly) / (number of items in that class)
39
What fraction of the examples are classified correctly?
Acc = ? = 9/10
[Figure: scatter plot of examples from classes C1 and C2 with a classifier's decision boundary; 9 of the 10 points are classified correctly.]
5
• Acc(M1) = ?
• Acc(M2) = ?
What’s the problem?
[Figure: the same C1/C2 data with two candidate decision boundaries, M1 and M2.]
6
Shortcomings of Accuracy
Let’s delve into the possible classification cases. Either the classifier got a positive
example labeled as positive, or it made a mistake and marked it as negative. Conversely, a
negative example may have been mis-labeled as positive, or correctly guessed negative.
So we define the following metrics:
True Positives (TP): number of positive examples, labeled as such.
False Positives (FP): number of negative examples, labeled as positive.
True Negatives (TN): number of negative examples, labeled as such.
False Negatives (FN): number of positive examples, labeled as negative.
42
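The four counts can be computed directly from a list of gold labels and predictions (a minimal sketch with made-up labels, where 1 = positive and 0 = negative):

```python
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
print(TP, FP, TN, FN)   # 2 1 3 2
```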
Example – Spam Classifier
In this case, accuracy = (10 + 100)/(10 + 100 + 25 + 15) = 73.3% (TP = 10, TN = 100, FP = 25, FN = 15). We may be tempted to think our classifier is pretty decent, since it classified about 73% of all messages correctly.
However, look what happens when we switch it for a dumb classifier that always says “no spam”:
43
A new dumb Spam Classifier
We get accuracy = (0 + 125)/(0 + 125 + 0 + 25) = 83.3%.
This looks crazy. We changed our model to a completely useless one, with exactly zero predictive power, and yet we got an increase in accuracy.
44
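Both accuracies can be reproduced from the counts on these slides (TP = 10, FN = 15, FP = 25, TN = 100 for the original classifier; the always-"no spam" classifier turns every prediction negative):

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# Original spam classifier
print(accuracy(tp=10, fp=25, tn=100, fn=15))   # 0.733...

# Dumb classifier that always predicts "no spam":
# all 125 non-spam messages become TN, all 25 spam messages become FN
print(accuracy(tp=0, fp=0, tn=125, fn=25))     # 0.833...
```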
• Imbalanced data (distribution of classes)!
• Some errors matter more than others …
− Given medical record, predict patient has COVID or not
− Given an email, detect spam
• When classes are highly imbalanced, we focus on one target class (usually the rare class), denoted as the “positive” class.
7
Accuracy paradox
This is called the accuracy paradox. When TP < FP, accuracy will always increase when we change the classification rule to always output the “negative” category. Conversely, when TN < FN, the same happens when we change the rule to always output “positive”.
So what can we do to avoid being tricked into thinking one classifier model is better than another when it really isn’t?
46
Confusion matrix (rows = Actual, columns = Predicted):

                    Predicted positive   Predicted negative
Actual positive            TP                    FN
Actual negative            FP                    TN
8
Fill in the confusion matrices for models M1 and M2 (the figure marks which class is treated as positive):

M1:                 Predicted positive   Predicted negative
Actual positive            ?                     ?
Actual negative            ?                     ?

M2:                 Predicted positive   Predicted negative
Actual positive            ?                     ?
Actual negative            ?                     ?

[Figure: the C1/C2 data with the two decision boundaries M1 and M2.]
9
M1:                 Predicted positive   Predicted negative
Actual positive            1                     1
Actual negative            2                     6

M2:                 Predicted positive   Predicted negative
Actual positive            0                     2
Actual negative            1                     7

[Figure: the C1/C2 data with the two decision boundaries M1 and M2.]
10
Precision
Out of all the examples predicted positive, what percentage is truly positive? (Correct positive predictions divided by the total positive predictions made by the model.)

Precision = TP / (TP + FP)

Precision(M1) = 1 / (1 + 2) = 1/3
Precision(M2) = 0 / (0 + 1) = 0

Precision: % of positive predictions that are correct.
Recall
What fraction of the actual positive examples are predicted as positive? (Out of the total actual positives, what percentage is predicted positive.)

Recall = TP / (TP + FN)

Recall(M1) = 1 / (1 + 1) = 1/2
Recall(M2) = 0 / (0 + 2) = 0

Recall: % of gold positive examples that are found.
12
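A quick check of the M1 and M2 numbers above, using the counts from their confusion matrices (minimal helper functions, written for this example):

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# M1: TP=1, FN=1, FP=2, TN=6        M2: TP=0, FN=2, FP=1, TN=7
print(precision(1, 2), recall(1, 1))   # 0.333... 0.5
print(precision(0, 1), recall(0, 2))   # 0.0 0.0
```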
Which is better: high precision and low recall, or vice versa?
[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1). High precision with low recall: detects few positive examples but misses many others. High recall with low precision: predicts everything as positive. The ideal classifier sits at the top-right corner.]
15
False Positive or False Negative?

In the medical example (sick / healthy), what is worse: a False Positive, or a False Negative?

                   Predicted Sick            Predicted Healthy
Actual Sick            1000                  200  (False Negatives)
Actual Healthy         800 (False Positives) 8000

In the spam detector example (spam / not spam), what is worse: a False Positive, or a False Negative?

                   Sent to Spam              Sent to Inbox
Actual Spam            1000                  200  (False Negatives)
Actual Not Spam        800 (False Positives) 8000
55
There should be a metric that combines both: the F-measure.

F1 = 2 / (1/Precision + 1/Recall) = 2 · Precision · Recall / (Precision + Recall)
− Harmonic mean of Precision and Recall

• Weighted F-measure: gives different weight to recall and precision
F_beta = (1 + beta²) · Precision · Recall / (beta² · Precision + Recall)
Beta represents how many times recall is more important than precision. If recall is twice as important as precision, the value of Beta is 2.
16
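Written out as code (illustrative helper functions; the numbers reuse M1's precision of 1/3 and recall of 1/2 from the next slides):

```python
def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

def f_beta(p, r, beta):
    # beta > 1 weights recall more heavily, beta < 1 weights precision more
    return (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0

p, r = 1/3, 1/2
print(f1(p, r))               # 0.4, the F1 of model M1 on the next slides
print(f_beta(p, r, beta=2))   # ~0.4545, recall counted as twice as important
```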
            M1     M2
Precision   ?      ?
Recall      ?      ?
F1          ?      ?

[Figure: the C1/C2 data with the two decision boundaries M1 and M2.]
17
            M1            M2
Precision   1/3 = 0.33    0/1 = 0
Recall      1/2 = 0.5     0/2 = 0
F1          0.4           0

[Figure: the C1/C2 data with the two decision boundaries M1 and M2.]
18
• Accuracy = ? = (3 + 3 + 1)/10 = 0.7
What’s the problem?
• Good measure when classes are nearly balanced!
[Figure: data with three classes C1, C2, C3.]
19
[Figure: layout of the multi-class confusion matrix: rows = Actual, columns = Predicted, classes C1, C2, C3.]
20
Confusion matrix (rows = Actual, columns = Predicted):

        C1   C2   C3
C1      3    0    1
C2      0    3    1
C3      1    0    1

        C1    C2    C3
P       ?     ?     ?
R       ?     ?     ?
F1      ?     ?     ?
21
Confusion matrix (rows = Actual, columns = Predicted):

        C1   C2   C3
C1      3    0    1
C2      0    3    1
C3      1    0    1

        C1     C2     C3
P       0.75   1      0.333
R       0.75   0.75   0.5
F1      0.75   0.86   0.4
Macro-F1 = (0.75 + 0.86 + 0.4)/3 = 0.67 (the average of the class-wise F1 scores)
22
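The same macro-F1 can be reproduced with scikit-learn (used here as a convenience; the label lists are reconstructed from the confusion-matrix counts above):

```python
from sklearn.metrics import f1_score, classification_report

# 4 examples of C1, 4 of C2, 2 of C3, with predictions matching the 3x3 matrix
y_true = ['C1'] * 4 + ['C2'] * 4 + ['C3'] * 2
y_pred = ['C1', 'C1', 'C1', 'C3',
          'C2', 'C2', 'C2', 'C3',
          'C1', 'C3']

print(f1_score(y_true, y_pred, average='macro'))      # ~0.67
print(classification_report(y_true, y_pred, digits=2))
```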
Overfitting and Underfitting
Overfitting:
It occurs when you pay too much attention to the specifics of the training data and are not able to generalize well.
• Often, this means that your model is fitting noise rather than whatever it is supposed to fit.
• Or you didn’t have sufficient data to learn from.
Underfitting:
The learning algorithm had the opportunity to learn more from the training data, but didn’t (too simple a model).
• Or it didn’t have sufficient data to learn from.
66
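Both failure modes can be seen with kNN itself (an illustrative experiment; the dataset, split, and k values are my choices, not the lecture's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 200):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
# k = 1 memorises the training set (training accuracy 1.0) yet generalises worse,
# while a very large k tends to underfit, pulling both scores down
```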
CSCI417
Machine Intelligence
Thank you
Spring 2024
68