Lecture 02: Supervised Learning (27/10/2022)
1. ML in a Nutshell
2. Representation, Evaluation, Optimization
3. Types of Learning
4. Trade-offs in Machine Learning
Supervised Learning
A model applies a function f() to an item x drawn from an input space X and returns an item y = f(x) drawn from an output space Y. An algorithm is required to design the model.
We consider models that apply a function f() to input items x and return an output y = f(x).
In (supervised) machine learning, we deal with systems whose f(x) is learned from examples.
We typically use machine learning when the function f(x) we want to apply is unknown to us, and we cannot simply "think" it up ourselves.
Supervised Learning Settings
Produce useful predictions (on unseen data)
Supervised Learning
Example of a labelled dataset (ID, sex, age, amount, binary attributes/label):
001  M  19  20,00  0  0
021  F  20  22,00  1  0
031  F  34  45,00  1  0
041  M  23  25,00  0  0
082  M  22  22,00  1  0
092  F  21  21,00  0  0
120  M  50  60,00  0  0
Supervised Learning: Examples
Disease diagnosis
x: Properties of patient (symptoms, lab tests)
f : Disease (or maybe: recommended therapy)
Part-of-Speech tagging
x: An English sentence (e.g., The can will rust)
f : The part of speech of a word in the sentence
Face recognition
x: Bitmap picture of person’s face
f : The name of the person (or maybe: a property of the person)
Automatic Steering
x: Bitmap picture of road surface in front of car
f : Degrees to turn the steering wheel
Good features are essential
Ranking:
Labels are ordinal
Learn an ordering f(x1) > f(x2) over inputs
Views of Learning
Learning is the removal of our remaining uncertainty:
Suppose we knew that the unknown function was an m-of-n Boolean
function, then we could use the training data to infer which function it is.
Learning requires guessing a good, small hypothesis class:
We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.
We could be wrong!
Our prior knowledge might be wrong.
Our guess of the hypothesis space could be wrong.
If our guess about the unknown function is wrong, then we will make errors when we are given new examples and are asked to predict the value of the function.
Spam Detection Example
Suppose there are 10,000 email messages, each with a label, either "spam" or "not_spam" (such labels could be added manually).
We convert each email message into a feature vector.
How to convert a real-world entity, such as an email message, into a feature vector?
One common way to convert a text into a feature vector, called bag of words, is to take a dictionary of English words (let's say it contains 20,000 alphabetically sorted words) and stipulate that, in the feature vector:
the first feature is equal to 1 if the email message contains the word "a"; otherwise, this feature is 0;
the second feature is equal to 1 if the email message contains the word "aaron"; otherwise, this feature equals 0;
...
the feature at position 20,000 is equal to 1 if the email message contains the word "zulu"; otherwise, this feature is equal to 0.
Repeating the above procedure for every email message in the collection gives us 10,000 feature vectors (each with a dimensionality of 20,000), each paired with a label ("spam"/"not_spam").
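A minimal Python sketch of this bag-of-words encoding; the tiny vocabulary and the two example messages below are made up for illustration (a real dictionary would hold about 20,000 alphabetically sorted words):

```python
# A binary bag-of-words encoding (hypothetical 6-word vocabulary for illustration).
vocabulary = ["a", "aaron", "buy", "cheap", "meeting", "zulu"]

def to_feature_vector(message, vocab):
    # 1 if the vocabulary word occurs in the message, otherwise 0
    words = set(message.lower().split())
    return [1 if w in words else 0 for w in vocab]

emails = [("buy cheap meds now", "spam"),
          ("meeting agenda for monday", "not_spam")]

# One (feature vector, label) pair per email; vector dimensionality = len(vocabulary).
dataset = [(to_feature_vector(text, vocabulary), label) for text, label in emails]
print(dataset)
```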
Spam Detection
Now the input data is ready, but the labels still need to be expressed in a form the learning algorithm can work with.
For example, some algorithms require numbers like 0 (to represent the label "not_spam") and 1 (to represent the label "spam").
The algorithm used here to illustrate supervised learning is called Support Vector Machine (SVM).
This algorithm requires that the positive label (in our case, "spam") has the numeric value of +1 (one), and the negative label ("not_spam") has the value of -1 (minus one).
Once we have a dataset and a learning algorithm, we apply the learning algorithm to the dataset to get the model.
SVM
SVM sees every feature vector as a point in a high-dimensional space (in our case, the space is 20,000-dimensional).
The algorithm puts all feature vectors on an imaginary 20,000- dimensional plot and draws an
imaginary 20,000-dimensional line (a hyperplane) that separates examples with positive
labels from examples with negative labels.
In machine learning, the boundary separating the examples of different classes is called the
decision boundary.
The equation of the hyperplane is given by two parameters: a real-valued vector w of the same dimensionality as our input feature vector x, and a real number b, like this:
wx − b = 0
where the expression wx means w(1)x(1) + w(2)x(2) + ... + w(D)x(D), and D is the number of dimensions of the feature vector x.
Now, the predicted label for some input feature vector x is given like this:
y = sign(wx − b)
where sign is a mathematical operator that takes any value as input and returns +1 if the input is a positive number or -1 if the input is a negative number.
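As a sketch, this prediction rule can be written in a few lines of Python; the weight vector, bias and input below are made-up toy values, not learned parameters:

```python
import numpy as np

def predict(w, b, x):
    # y = sign(wx - b): +1 means "spam", -1 means "not_spam"
    return 1 if np.dot(w, x) - b > 0 else -1

# Made-up 3-dimensional toy values (a real spam model would be 20,000-dimensional).
w = np.array([0.4, -1.2, 0.7])
b = 0.1
x = np.array([1, 0, 1])
print(predict(w, b, x))  # prints 1
```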
SVM
The goal of the learning algorithm (SVM in this case) is to leverage the dataset and find the optimal values w* and b* for the parameters w and b. Once the learning algorithm identifies these optimal values, the model f(x) is then defined as:
f(x) = sign(w*x − b*)
Now, how does the machine find w* and b*? It solves an optimization problem. Machines are good at optimizing functions under constraints.
So what are the constraints we want to satisfy here? First of all, we want the model to predict the labels of our 10,000 examples correctly. Remember that each example i = 1, ..., 10000 is given by a pair (xi, yi), where xi is the feature vector of example i and yi is its label, which takes the value -1 or +1. So the constraints are naturally:
wxi − b ≥ +1 if yi = +1, and
wxi − b ≤ −1 if yi = −1.
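A small illustrative check of these constraints (they can be written compactly as yi(wxi − b) ≥ 1), assuming NumPy; the parameters and the two examples are made up, not the result of training:

```python
import numpy as np

def satisfies_constraints(w, b, X, y):
    # Checks y_i * (w·x_i - b) >= 1 for every training example,
    # which is equivalent to the two constraints written above.
    return all(yi * (np.dot(w, xi) - b) >= 1 for xi, yi in zip(X, y))

# Made-up parameters and a two-example toy dataset.
X = np.array([[2.0, 1.0], [-1.5, -2.0]])
y = np.array([+1, -1])
print(satisfies_constraints(np.array([1.0, 1.0]), 0.0, X, y))  # True
```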
SVM
Preferably, the hyperplane should separate positive examples from negative ones with the largest margin.
The margin is the distance between the closest examples of the two classes, as defined by the decision boundary. A large margin contributes to better generalization, that is, how well the model will classify new examples in the future. To achieve a large margin, the SVM minimizes the Euclidean norm of w, denoted ||w||.
Why, by minimizing the norm of w, do we find the highest margin between the two classes? Geometrically, the equations wx − b = 1 and wx − b = −1 define two parallel hyperplanes, and the distance between them is given by 2/||w||, so the smaller the norm ||w||, the larger the distance between these two hyperplanes.
This particular version of the algorithm builds the so-called linear model. It’s called linear
because the decision boundary is a straight line (or a plane, or a hyperplane).
SVM can also incorporate kernels that can make the decision boundary arbitrarily non-linear. In some cases, it may be impossible to perfectly separate the two groups of points because of noise in the data, labeling errors, or outliers (examples very different from a "typical" example in the dataset).
Another version of SVM can also incorporate a penalty hyperparameter for misclassification of
training examples of specific classes.
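For illustration, a kernelized SVM with a misclassification penalty can be fit with scikit-learn's SVC (assuming scikit-learn is installed; the four training points below are made up):

```python
from sklearn.svm import SVC

# Made-up 2-dimensional training data with +1/-1 labels.
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [-1, +1, +1, -1]

# kernel="rbf" allows a non-linear decision boundary;
# C is the penalty hyperparameter for misclassified training examples.
model = SVC(kernel="rbf", C=1.0)
model.fit(X, y)
print(model.predict([[0.9, 0.8]]))
```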
Evaluating the Performance of Supervised Learning
Machines learn by means of a loss function, which is a method of evaluating how well a specific algorithm models the given data. If the predictions deviate too much from the actual results, the loss function produces a very large number. Gradually, with the help of an optimization function, the parameters are adjusted so as to reduce the error in prediction.
There is no one-size-fits-all loss function in machine learning. Various factors are involved in choosing a loss function for a specific problem, such as the type of machine learning algorithm chosen, the ease of calculating the derivatives and, to some degree, the percentage of outliers in the dataset.
Loss functions play an important role in any statistical model: they define an objective against which the performance of the model is evaluated, and the parameters learned by the model are determined by minimizing the chosen loss function.
Loss Functions
There are two major categories, depending on the type of learning task we are dealing with:
Regression losses: regression deals with predicting a continuous value. A few well-known loss functions are Mean Absolute Error (MAE), Mean Squared Error (MSE), Mean Bias Error (MBE) and Mean Squared Logarithmic Error (MSLE).
Classification losses: in classification, we deal with categorical values. A few well-known loss functions are Binary Cross Entropy Loss and Hinge Loss.
Mean Absolute Error (MAE) / L1 Loss
Regression problems may involve variables that are not strictly Gaussian in nature due to the presence of outliers (values that are very different from the rest of the data).
Mean Absolute Error can be an ideal option in such cases because it does not take the direction of the errors into account and is not dominated by unrealistically high positive or negative values.
MAE is the average of the absolute differences between the actual and the predicted values. For a true value yi, its predicted value ŷi, and n the total number of data points in the dataset, the mean absolute error is defined as:
MAE = (1/n) Σi |yi − ŷi|
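A minimal NumPy sketch of this definition (the values are made up for illustration):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # MAE = (1/n) * sum(|y_i - ŷ_i|)
    return np.mean(np.abs(np.array(y_true) - np.array(y_pred)))

print(mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # 0.8333...
```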
Mean Squared Error (MSE) / L2 Loss
Preferred by researchers, because most variables can be modelled by a Gaussian distribution.
Mean Squared Error is the average of the squared differences between the actual and the predicted values. For a true value yi, its predicted value ŷi, and n the total number of data points in the dataset, the mean squared error is defined as:
MSE = (1/n) Σi (yi − ŷi)²
It is only concerned with the average magnitude of the errors, irrespective of their direction. However, due to the squaring, predictions that are far away from the actual values are penalized heavily in comparison with less deviated predictions. MSE also has nice mathematical properties that make it easier to calculate gradients.
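The corresponding NumPy sketch for MSE, using the same made-up values as the MAE example so the heavier penalty on the larger error is visible:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # MSE = (1/n) * sum((y_i - ŷ_i)^2); squaring penalizes large errors heavily
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

print(mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # 1.4166...
```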
Mean Absolute Error, in contrast, is measured as the average of the absolute differences between predictions and actual observations. Like MSE, it measures the magnitude of the error without considering its direction. Unlike MSE, MAE needs more complicated tools, such as linear programming, to compute the gradients. On the other hand, MAE is more robust to outliers since it does not involve squaring.
Mean Bias Error
Mean Bias Error takes the actual difference between the target and the predicted value, not the absolute difference. One has to be cautious, as the positive and the negative errors can cancel each other out, which is why it is one of the lesser-used loss functions.
Mean Bias Error is used to calculate the average bias in the model. Bias, in a nutshell, is overestimating or underestimating a parameter. Corrective measures can be taken to reduce the bias after evaluating the model using MBE.
MBE = (1/n) Σi (yi − ŷi)
where yi is the true value, ŷi is the predicted value and n is the total number of data points in the dataset.
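A minimal NumPy sketch of MBE (made-up values; note how positive and negative errors partially cancel):

```python
import numpy as np

def mean_bias_error(y_true, y_pred):
    # MBE = (1/n) * sum(y_i - ŷ_i); signed errors can cancel each other out
    return np.mean(np.array(y_true) - np.array(y_pred))

print(mean_bias_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # -0.5 (over-prediction on average)
```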
Mean Squared Logarithmic Error (MSLE)
Sometimes, one may not want to penalize the model too much for predicting unscaled quantities directly. Relaxing the penalty on huge differences can be done with the help of Mean Squared Logarithmic Error.
Calculating the Mean Squared Logarithmic Error is the same as calculating the Mean Squared Error, except that the natural logarithms of the predicted and actual values are used rather than the raw values:
MSLE = (1/n) Σi (log(yi + 1) − log(ŷi + 1))²
where yi is the true value, ŷi is the predicted value and n is the total number of data points in the dataset.
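A minimal NumPy sketch of MSLE (made-up values; np.log1p computes log(1 + x)):

```python
import numpy as np

def mean_squared_log_error(y_true, y_pred):
    # MSLE = (1/n) * sum((log(1 + y_i) - log(1 + ŷ_i))^2)
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# A large absolute error (100 vs 90) contributes far less than it would under MSE.
print(mean_squared_log_error([100.0, 10.0], [90.0, 11.0]))
```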
Binary Cross Entropy Loss
Entropy is a measure of the randomness in the information being processed, and cross entropy is a measure of the difference in randomness between two random variables.
The cross-entropy loss increases as the predicted probability diverges from the actual label. For example, predicting a probability of 0.011 when the actual observation label is 1 would result in a high loss value. In an ideal situation, a "perfect" model would have a log loss of 0. Looking at the loss function makes things even clearer:
BCE = −(1/n) Σi [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]
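A minimal NumPy sketch of binary cross entropy, reproducing the 0.011-versus-1 example above:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # BCE = -(1/n) * sum(y_i*log(ŷ_i) + (1 - y_i)*log(1 - ŷ_i))
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Predicting 0.011 when the true label is 1 gives a large loss (≈ 4.51).
print(binary_cross_entropy([1.0], [0.011]))
```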
Hinge Loss / Multi-class SVM Loss
Hinge loss was primarily developed for support vector machines, for calculating the maximum margin from the hyperplane to the classes.
Loss functions penalize wrong predictions and do not penalize right predictions. So, the score of the target label should be greater than the score of each incorrect label by a margin of (at least) one:
SVM loss for example i = Σ over j ≠ yi of max(0, sj − syi + 1)
Hinge Loss / Multi-class SVM Loss
Consider an example where we have three training examples and three classes to predict: dog, cat and horse. Below are the values predicted by our algorithm for each of the classes. Computing the hinge losses for all 3 training examples:
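Since the table of predicted scores does not survive in these notes, the sketch below uses hypothetical scores for a single training example to show how the per-example hinge loss would be computed:

```python
import numpy as np

def multiclass_hinge_loss(scores, correct_index, margin=1.0):
    # Sum over the incorrect classes of max(0, s_j - s_correct + margin)
    s_correct = scores[correct_index]
    return sum(max(0.0, s - s_correct + margin)
               for j, s in enumerate(scores) if j != correct_index)

# Hypothetical scores for the classes (dog, cat, horse); the true class is "dog".
scores = np.array([-0.39, 1.49, 4.21])
print(multiclass_hinge_loss(scores, correct_index=0))  # 2.88 + 5.60 = 8.48
```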
Summary
Learning?