
BookSlides 5A Similarity Based Learning

The document summarizes key concepts from sections 5.1-5.3 of chapter 5 on similarity-based learning from a textbook on machine learning. It discusses the big idea of classifying unknown instances based on similarity to known examples. It then covers fundamentals of the approach, including feature spaces to represent data as points in n-dimensional space and distance metrics like Euclidean and Manhattan distance to measure similarity between points. Standard algorithms like k-nearest neighbors that use these concepts are also mentioned.


Big Idea Fundamentals Standard Approach Epilogue Summary

Fundamentals of Machine Learning for


Predictive Data Analytics
Chapter 5: Similarity-based Learning
Sections 5.1, 5.2, 5.3

John Kelleher and Brian Mac Namee and Aoife D’Arcy

[email protected] [email protected] [email protected]



1 Big Idea

2 Fundamentals
Feature Space
Distance Metrics

3 Standard Approach: The Nearest Neighbor Algorithm


A Worked Example

4 Epilogue

5 Summary

Big Idea

The year is 1798 and you are Lieutenant-Colonel David Collins of HMS Calcutta, exploring the region around the Hawkesbury River in New South Wales.
After an expedition up the river, one of your men tells you that he saw a strange animal near the river.
You ask him to describe the animal and he explains that he didn't see it very well, but that he did notice that it had webbed feet and a duck-bill snout, and that it growled at him.
In order to plan the expedition for the next day, you decide that you need to classify the animal so that you can figure out whether or not it is dangerous to approach.
              Webbed Feet   Duck-Bill Snout   Grrrh!   Score
[animal 1]    ✓             ✗                 ✗        1
[animal 2]    ✗             ✓                 ✗        1
[animal 3]    ✗             ✓                 ✓        2
Figure: Matching animals you remember to the features of the unknown animal described by the sailor. Note: The images used in this figure were created by Jan Gillbank for the English for the Australian Curriculum website (http://www.e4ac.edu.au) and are used under the Creative Commons Attribution 3.0 Unported licence (http://creativecommons.org/licenses/by/3.0). The images were sourced via Wikimedia Commons.

The process of classifying an unknown animal by matching the features of the animal against the features of animals you can remember neatly encapsulates the big idea underpinning similarity-based learning:

    if you are trying to classify something then you should search your memory to find things that are similar and label it with the same class as the most similar thing in your memory.

One of the simplest and best-known machine learning algorithms for this type of reasoning is the nearest neighbor algorithm.

Fundamentals

The fundamentals of similarity-based learning are:
  Feature space
  Similarity metrics

Feature Space

Table: The speed and agility ratings for 20 college athletes labelled
with the decisions for whether they were drafted or not.
ID Speed Agility Draft ID Speed Agility Draft
1 2.50 6.00 No 11 2.00 2.00 No
2 3.75 8.00 No 12 5.00 2.50 No
3 2.25 5.50 No 13 8.25 8.50 No
4 3.25 8.25 No 14 5.75 8.75 Yes
5 2.75 7.50 No 15 4.75 6.25 Yes
6 4.50 5.00 No 16 5.50 6.75 Yes
7 3.50 5.25 No 17 5.25 9.50 Yes
8 3.00 3.25 No 18 7.00 4.25 Yes
9 4.00 4.00 No 19 7.50 8.00 Yes
10 4.25 3.75 No 20 7.25 5.75 Yes

Figure: A feature space plot of the data in Table 2. The triangles represent 'Non-draft' instances and the crosses represent the 'Draft' instances.

Feature Space

A feature space is an abstract n-dimensional space created by taking each of the descriptive features in an ABT (analytics base table) to be the axes of a reference space; each instance in the dataset is mapped to a point in the feature space based on the values of its descriptive features.

Distance Metrics

A similarity metric measures the similarity between two instances within a feature space.
Mathematically, a metric must conform to the following four
criteria:
1 Non-negativity: metric(a, b) ≥ 0
2 Identity: metric(a, b) = 0 ⇐⇒ a = b
3 Symmetry: metric(a, b) = metric(b, a)
4 Triangular Inequality:
metric(a, b) ≤ metric(a, c) + metric(b, c)
where metric(a, b) is a function that returns the distance
between two instances a and b.
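As an illustrative aside (not from the book's slides), these four criteria can be spot-checked in a few lines of Python for the Euclidean metric, using sample (Speed, Agility) points from the feature space discussed later; spot checks illustrate, but of course do not prove, the axioms:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points in an m-dimensional feature space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Sample (Speed, Agility) points.
a, b, c = (5.00, 2.50), (2.75, 7.50), (5.25, 9.50)

assert euclidean(a, b) >= 0                                  # 1. non-negativity
assert euclidean(a, a) == 0                                  # 2. identity
assert euclidean(a, b) == euclidean(b, a)                    # 3. symmetry
assert euclidean(a, b) <= euclidean(a, c) + euclidean(b, c)  # 4. triangular inequality
```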

Distance Metrics

One of the best-known metrics is Euclidean distance, which computes the length of the straight line between two points. The Euclidean distance between two instances a and b in an m-dimensional feature space is defined as:

    Euclidean(a, b) = \sqrt{\sum_{i=1}^{m} (a[i] - b[i])^2}    (1)


Distance Metrics

Example
The Euclidean distance between instances d12 (Speed = 5.00, Agility = 2.50) and d5 (Speed = 2.75, Agility = 7.50) in Table 2 is:

    Euclidean(⟨5.00, 2.50⟩, ⟨2.75, 7.50⟩) = \sqrt{(5.00 - 2.75)^2 + (2.50 - 7.50)^2}
                                          = \sqrt{30.0625} = 5.4829
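This arithmetic can be reproduced with a short Python sketch (the names d12 and d5 follow the slide's instance labels):

```python
import math

# Instances from the college-athletes table, as (Speed, Agility) pairs.
d12 = (5.00, 2.50)
d5 = (2.75, 7.50)

# Sum of squared per-feature differences, then the square root.
squared = sum((x - y) ** 2 for x, y in zip(d12, d5))
distance = math.sqrt(squared)

print(round(squared, 4))   # 30.0625
print(round(distance, 4))  # 5.4829
```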

Distance Metrics

Another, less well-known, distance measure is the Manhattan distance, or taxi-cab distance.
The Manhattan distance between two instances a and b in a feature space with m dimensions is:¹

    Manhattan(a, b) = \sum_{i=1}^{m} abs(a[i] - b[i])    (2)

¹ The abs() function surrounding the subtraction term indicates that we use the absolute value (i.e. the non-negative value) when summing the differences; this makes sense because distances can't be negative.

Distance Metrics


Figure: The Manhattan and Euclidean distances between two points.




Distance Metrics

Example
The Manhattan distance between instances d12 (Speed = 5.00, Agility = 2.50) and d5 (Speed = 2.75, Agility = 7.50) in Table 2 is:

    Manhattan(⟨5.00, 2.50⟩, ⟨2.75, 7.50⟩) = abs(5.00 - 2.75) + abs(2.50 - 7.50)
                                          = 2.25 + 5.00 = 7.25
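The same pair of instances gives the Manhattan result directly (an illustrative sketch):

```python
# Instances d12 and d5 as (Speed, Agility) pairs.
d12 = (5.00, 2.50)
d5 = (2.75, 7.50)

# Sum of absolute per-feature differences.
distance = sum(abs(x - y) for x, y in zip(d12, d5))

print(distance)  # 7.25
```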

Distance Metrics

The Euclidean and Manhattan distances are special cases of the Minkowski distance.
The Minkowski distance between two instances a and b in a feature space with m descriptive features is:

    Minkowski(a, b) = \left( \sum_{i=1}^{m} abs(a[i] - b[i])^p \right)^{1/p}    (3)

where different values of the parameter p result in different distance metrics.
The Minkowski distance with p = 1 is the Manhattan distance and with p = 2 is the Euclidean distance.
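A small sketch (the function name minkowski is ours, not the book's) can confirm that p = 1 and p = 2 recover the two earlier worked results:

```python
def minkowski(a, b, p):
    # Minkowski distance: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

d12, d5 = (5.00, 2.50), (2.75, 7.50)

assert minkowski(d12, d5, 1) == 7.25              # Manhattan distance
assert round(minkowski(d12, d5, 2), 4) == 5.4829  # Euclidean distance
```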

Distance Metrics

The larger the value of p, the more emphasis is placed on the features with large differences in values, because these differences are raised to the power of p.

Example

    Instance IDs   Manhattan (Minkowski p=1)   Euclidean (Minkowski p=2)
    12 and 5       7.25                        5.4829
    12 and 17      7.25                        7.0045

Figure: The Manhattan and Euclidean distances between instances d12 (Speed = 5.00, Agility = 2.50) and d5 (Speed = 2.75, Agility = 7.50), and between instances d12 and d17 (Speed = 5.25, Agility = 9.50).

Standard Approach: The Nearest Neighbor Algorithm

The Nearest Neighbor Algorithm

Require: a set of training instances
Require: a query to be classified
1: Iterate across the instances in memory and find the instance that is the shortest distance from the query position in the feature space.
2: Make a prediction for the query equal to the value of the target feature of the nearest neighbor.
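The two steps above can be sketched in Python, assuming Euclidean distance and training data stored as (features, label) pairs (a minimal illustration, not the book's reference implementation):

```python
import math

def nearest_neighbor(training, query):
    # training: list of (features, label) pairs; query: a feature tuple.
    # Step 1: find the training instance closest to the query in feature space.
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    features, label = min(training, key=lambda inst: euclidean(inst[0], query))
    # Step 2: predict the target feature value of that nearest neighbor.
    return label

# A tiny example using two athletes from the table (IDs 18 and 12).
training = [((7.00, 4.25), "yes"), ((5.00, 2.50), "no")]
print(nearest_neighbor(training, (6.75, 3.00)))  # yes
```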

A Worked Example

Table: The speed and agility ratings for 20 college athletes labelled
with the decisions for whether they were drafted or not.
ID Speed Agility Draft ID Speed Agility Draft
1 2.50 6.00 No 11 2.00 2.00 No
2 3.75 8.00 No 12 5.00 2.50 No
3 2.25 5.50 No 13 8.25 8.50 No
4 3.25 8.25 No 14 5.75 8.75 Yes
5 2.75 7.50 No 15 4.75 6.25 Yes
6 4.50 5.00 No 16 5.50 6.75 Yes
7 3.50 5.25 No 17 5.25 9.50 Yes
8 3.00 3.25 No 18 7.00 4.25 Yes
9 4.00 4.00 No 19 7.50 8.00 Yes
10 4.25 3.75 No 20 7.25 5.75 Yes

A Worked Example

Example
Should we draft an athlete with the following profile: Speed = 6.75, Agility = 3.00?



A Worked Example


Figure: A feature space plot of the data in Table 2, with the position in the feature space of the query represented by the ? marker. The triangles represent 'Non-draft' instances and the crosses represent the 'Draft' instances.

A Worked Example

Table: The distances (Dist.) between the query instance with Speed = 6.75 and Agility = 3.00 and each instance in Table 2.

ID  Speed  Agility  Draft  Dist.   ID  Speed  Agility  Draft  Dist.
18  7.00   4.25     yes    1.27    11  2.00   2.00     no     4.85
12  5.00   2.50     no     1.82    19  7.50   8.00     yes    5.06
10  4.25   3.75     no     2.61    3   2.25   5.50     no     5.15
20  7.25   5.75     yes    2.80    1   2.50   6.00     no     5.20
9   4.00   4.00     no     2.93    13  8.25   8.50     no     5.70
6   4.50   5.00     no     3.01    2   3.75   8.00     no     5.83
8   3.00   3.25     no     3.76    14  5.75   8.75     yes    5.84
15  4.75   6.25     yes    3.82    5   2.75   7.50     no     6.02
7   3.50   5.25     no     3.95    4   3.25   8.25     no     6.31
16  5.50   6.75     yes    3.95    17  5.25   9.50     yes    6.67
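The distance ranking above can be reproduced by computing and sorting the Euclidean distances in Python (a sketch; the athletes dict mirrors Table 2):

```python
import math

# College-athletes dataset: id -> (Speed, Agility, Draft).
athletes = {
    1: (2.50, 6.00, "no"),   2: (3.75, 8.00, "no"),   3: (2.25, 5.50, "no"),
    4: (3.25, 8.25, "no"),   5: (2.75, 7.50, "no"),   6: (4.50, 5.00, "no"),
    7: (3.50, 5.25, "no"),   8: (3.00, 3.25, "no"),   9: (4.00, 4.00, "no"),
    10: (4.25, 3.75, "no"),  11: (2.00, 2.00, "no"),  12: (5.00, 2.50, "no"),
    13: (8.25, 8.50, "no"),  14: (5.75, 8.75, "yes"), 15: (4.75, 6.25, "yes"),
    16: (5.50, 6.75, "yes"), 17: (5.25, 9.50, "yes"), 18: (7.00, 4.25, "yes"),
    19: (7.50, 8.00, "yes"), 20: (7.25, 5.75, "yes"),
}

query = (6.75, 3.00)

# Rank every instance by Euclidean distance to the query.
ranked = sorted(
    (math.dist((speed, agility), query), i, draft)
    for i, (speed, agility, draft) in athletes.items()
)

dist, nearest_id, draft = ranked[0]
print(nearest_id, draft, round(dist, 2))  # 18 yes 1.27
```

The nearest neighbor is instance 18, whose Draft value is yes, so the athlete is predicted to be drafted.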

A Worked Example

(a) Voronoi tessellation (b) Decision boundary (k = 1)

Figure: (a) The Voronoi tessellation of the feature space for the dataset in Table 2, with the position of the query represented by the ? marker; (b) the decision boundary created by aggregating the neighboring Voronoi regions that belong to the same target level.

A Worked Example

One of the great things about nearest neighbor algorithms is that we can add new data to update the model very easily.
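"Training" here amounts to storing labelled instances, so an update is just an append (an illustrative sketch; the query values are made up for the demonstration):

```python
import math

def nn_label(training, query):
    # 1-NN prediction: label of the closest stored instance (Euclidean distance).
    return min(training, key=lambda inst: math.dist(inst[0], query))[1]

training = [((7.00, 4.25), "yes"), ((5.00, 2.50), "no"), ((4.25, 3.75), "no")]
print(nn_label(training, (6.00, 2.75)))  # no  (instance 12 is closest)

# Updating the model is simply storing the new labelled instance; no retraining.
training.append(((6.75, 3.00), "yes"))
print(nn_label(training, (6.00, 2.75)))  # yes (the new instance is now closest)
```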

A Worked Example

Table: The extended version of the college athletes dataset.

ID  Speed  Agility  Draft        ID  Speed  Agility  Draft


1 2.50 6.00 no 12 5.00 2.50 no
2 3.75 8.00 no 13 8.25 8.50 no
3 2.25 5.50 no 14 5.75 8.75 yes
4 3.25 8.25 no 15 4.75 6.25 yes
5 2.75 7.50 no 16 5.50 6.75 yes
6 4.50 5.00 no 17 5.25 9.50 yes
7 3.50 5.25 no 18 7.00 4.25 yes
8 3.00 3.25 no 19 7.50 8.00 yes
9 4.00 4.00 no 20 7.25 5.75 yes
10 4.25 3.75 no 21 6.75 3.00 yes
11 2.00 2.00 no

A Worked Example

(a) Voronoi tessellation (b) Decision boundary (k = 1)

Figure: (a) The Voronoi tessellation of the feature space when the
dataset has been updated to include the query instance; (b) the
updated decision boundary reflecting the addition of the query
instance in the training set.

Epilogue

Returning to 1798 and HMS Calcutta: the next day you accompany your men on the expedition up the river and you encounter the strange animal the sailor had described to you.
This time, when you see the animal yourself, you realize that it definitely isn't a duck!
It turns out that you and your men are the first Europeans to encounter a platypus².

² The story recounted here of the discovery of the platypus is loosely based on real events.

Figure: A duck-billed platypus. The platypus image used here was created by Jan Gillbank for the English for the Australian Curriculum website (http://www.e4ac.edu.au) and is used under the Creative Commons Attribution 3.0 Unported licence (http://creativecommons.org/licenses/by/3.0). The image was sourced via Wikimedia Commons.

This epilogue illustrates two important, and related, aspects of supervised machine learning:
1 Supervised machine learning is based on the stationarity assumption, which states that the data doesn't change (remains stationary) over time.
2 In the context of classification, supervised machine learning creates models that distinguish between the classes that are present in the dataset they are induced from. So, if a classification model is trained to distinguish between lions, frogs, and ducks, the model will classify a query as being either a lion, a frog, or a duck, even if the query is actually a platypus.

Summary

Similarity-based prediction models attempt to mimic a very human way of reasoning by basing predictions for a target feature value on the most similar instances in memory; this makes them easy to interpret and understand.
This advantage should not be underestimated: being able to understand how a model works gives people more confidence in the model and, hence, in the insight that it provides.

The inductive bias underpinning similarity-based classification is that things that are similar (i.e., instances that have similar descriptive features) belong to the same class.
The nearest neighbor algorithm creates an implicit global predictive model by aggregating local models, or neighborhoods.
The definition of these neighborhoods is based on proximity within the feature space to the labelled training instances.
Queries are classified using the label of the training instance defining the neighborhood in the feature space that contains the query.
