Lecture 02: Supervised Learning

LECTURER:

Humera Farooq, Ph.D.


Computer Sciences Department,
Bahria University (Karachi Campus)
SUPERVISED LEARNING
Outline

1. ML in a Nutshell
2. Representation, Evaluation, Optimization
3. Types of Learning
4. Trade-offs in Machine Learning
Supervised Learning

 The learning algorithm receives a set of inputs along with the corresponding correct outputs and uses them to train a model.

 [Diagram: Training Data (Labeled Data) → Model → Prediction]
Supervised Learning

 Input: an item x drawn from an input space X
 Model: y = f(x), an algorithm we are required to design
 Output: an item y drawn from an output space Y

 We consider models that apply a function f() to input items x and return an output y = f(x).
 In (supervised) machine learning, we deal with systems whose f(x) is learned from examples.
 We typically use machine learning when the function f(x) we want to apply is unknown to us, and we cannot simply "think" it up ourselves.
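A minimal sketch of learning y = f(x) from labeled examples; the toy data and the choice of scikit-learn's DecisionTreeClassifier are illustrative assumptions, not something specified in the lecture:

# A minimal sketch of learning y = f(x) from labeled examples.
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: each row of X is an input item x,
# each entry of y is the corresponding correct output.
X_train = [[19, 20000], [20, 22000], [34, 45000], [23, 25000]]
y_train = [0, 1, 1, 0]

model = DecisionTreeClassifier()   # the learning algorithm produces f
model.fit(X_train, y_train)        # learn f(x) from the examples

x_new = [[30, 40000]]              # a new, unlabeled item x
print(model.predict(x_new))        # the predicted y = f(x)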
Supervised Learning Settings

 Produce useful predictions (on unseen data).

Supervised Learning: Example Dataset

Features: User ID, Gender, Age, Salary    Label/Output: Purchased

User ID   Gender   Age   Salary   Purchased   Split
001       M        19    20,000   0           Training Data
021       F        20    22,000   1           Training Data
031       F        34    45,000   1           Training Data
041       M        23    25,000   0           Training Data
082       M        22    22,000   1           Training Data
092       F        21    21,000   0           Training Data
120       M        50    60,000   0           Training Data
920       M        32    34,000   0           Testing Data
125       F        33    35,000   0           Testing Data
874       M        45    55,000   1           Testing Data
Supervised Learning: Training

User ID   Gender   Age   Salary   Purchased
001       M        19    20,000   0
021       F        20    22,000   1
031       F        34    45,000   1
041       M        23    25,000   0
082       M        22    22,000   1
092       F        21    21,000   0
120       M        50    60,000   0
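As a sketch of how such a dataset might be split into training and testing data in code (assuming pandas; the values are copied from the table above, and the split follows the slide rather than being randomized):

# Sketch of the training/testing split shown in the table above.
import pandas as pd

data = pd.DataFrame({
    "UserID":    ["001", "021", "031", "041", "082", "092", "120", "920", "125", "874"],
    "Gender":    ["M", "F", "F", "M", "M", "F", "M", "M", "F", "M"],
    "Age":       [19, 20, 34, 23, 22, 21, 50, 32, 33, 45],
    "Salary":    [20000, 22000, 45000, 25000, 22000, 21000, 60000, 34000, 35000, 55000],
    "Purchased": [0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# First seven rows are used for training, the last three for testing,
# as in the slide (in practice the split is usually randomized).
train, test = data.iloc[:7], data.iloc[7:]
X_train, y_train = train[["Gender", "Age", "Salary"]], train["Purchased"]
X_test,  y_test  = test[["Gender", "Age", "Salary"]],  test["Purchased"]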
Supervised Learning : Examples
 Disease diagnosis
 x: Properties of patient (symptoms, lab tests)
 f : Disease (or maybe: recommended therapy)
 Part-of-Speech tagging
 x: An English sentence (e.g., The can will rust)
 f : The part of speech of a word in the sentence
 Face recognition
 x: Bitmap picture of person’s face
 f : Name of the person (or maybe: a property of the person)
 Automatic Steering
 x: Bitmap picture of road surface in front of car
 f : Degrees to turn the steering wheel
Good features are essential
 The choice of features is crucial for how well a task can be learned.
 In many application areas (language, vision, etc.), a lot of work goes into designing suitable features.
 This requires domain expertise.
The Label space Y
 The label space Y determines what kind of supervised learning task we are dealing with:
 Output labels Y are categorical (Classification)
  Binary classification: two possible labels, e.g. (0, 1), (-1, +1), or (1, 2)
  Multiclass classification: M possible labels (1, 2, ..., M)
 Output labels Y are structured objects (sequences of labels, parse trees, etc.)
 Output labels Y are numerical (Regression):
  Labels are continuous-valued
  Learn a linear/polynomial function
 Ranking:
  Labels are ordinal
  Learn an ordering f(x1) > f(x2) over inputs
Views of Learning
 Learning is the removal of our remaining uncertainty:
 Suppose we knew that the unknown function was an m-of-n Boolean function; then we could use the training data to infer which function it is.
 Learning requires guessing a good, small hypothesis class:
 We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.

 We could be wrong!
 Our prior knowledge might be wrong
 Our guess of the hypothesis space could be wrong
 If the true (unknown) function is not covered by our guess, then we will make errors when we are given new examples and asked to predict the value of the function.
Spam Detection Example
 Suppose there are 10,000 email messages
 Each with a label either “spam” or “not_spam” (could add those labels manually).
 Convert each email message into a feature vector.
 How to convert a real-world entity, such as an email message, into a feature vector?
 One common way to convert a text into a feature vector, called bag of words, is to take a
dictionary of English words (let’s say it contains 20,000 alphabetically sorted words) and
stipulate that in feature vector:
 the first feature is equal to 1 if the email message contains the word “a”; otherwise, this feature is 0;
 the second feature is equal to 1 if the email message contains the word “aaron”; otherwise, this feature equals
0;
 • ... •
 the feature at position 20,000 is equal to 1 if the email message contains the word “zulu”; otherwise, this
feature is equal to 0.
 Repeat the above procedure for every email message in the collection, which gives us 10,000 feature vectors (each vector having a dimensionality of 20,000) and a label ("spam"/"not_spam") for each.
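A small sketch of this bag-of-words encoding in plain Python; the six-word dictionary and the two example messages are illustrative stand-ins for the 20,000-word dictionary and the 10,000 emails:

# Sketch of the bag-of-words encoding described above.
dictionary = sorted(["a", "aaron", "cash", "free", "meeting", "zulu"])

def to_feature_vector(message):
    """Feature i is 1 if the i-th dictionary word occurs in the message, else 0."""
    words = set(message.lower().split())
    return [1 if word in words else 0 for word in dictionary]

emails = ["Free cash now", "Meeting at noon"]
labels = ["spam", "not_spam"]                  # assigned manually
vectors = [to_feature_vector(e) for e in emails]
print(vectors)   # [[0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 0]]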
Spam Detection
 Now the input data is ready.

 The output labels are still in the form of human-readable text.

 Some learning algorithms require transforming labels into numbers.

 For example, some algorithms require numbers like 0 (to represent the label "not_spam") and 1 (to represent the label "spam").

 The algorithm used here to illustrate supervised learning is called the Support Vector Machine (SVM).

 This algorithm requires that the positive label (in our case "spam") has the numeric value of +1 (one), and the negative label ("not_spam") has the value of -1 (minus one).

 Once we have a dataset and a learning algorithm, we apply the learning algorithm to the dataset to obtain the model.
SVM
 SVM sees every feature vector as a point in a high-dimensional space (in our case, space is
20,000-dimensional).
 The algorithm puts all feature vectors on an imaginary 20,000- dimensional plot and draws an
imaginary 20,000-dimensional line (a hyperplane) that separates examples with positive
labels from examples with negative labels.
 In machine learning, the boundary separating the examples of different classes is called the
decision boundary.
 The equation of the hyperplane is given by two parameters, a real-valued vector w of the same dimensionality as our input feature vector x, and a real number b, like this:
wx - b = 0
 where the expression wx means w(1)x(1) + w(2)x(2) + ... + w(D)x(D), and D is the number of dimensions of the feature vector x.
 Now, the predicted label for some input feature vector x is given like this:
y = sign(wx - b)
 where sign is a mathematical operator that takes any value as input and returns +1 if the input is a positive number or -1 if the input is a negative number.
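A NumPy sketch of this prediction rule; the parameter values w and b below are made up, standing in for learned values:

# Sketch of the SVM prediction rule y = sign(wx - b).
import numpy as np

w = np.array([0.0, 0.0, 1.5, 2.0, 0.0, 0.0])   # one weight per dictionary word (made up)
b = 1.0                                         # made-up bias
x = np.array([0, 0, 1, 1, 0, 0])                # feature vector of one email

score = np.dot(w, x) - b                        # wx - b
y = 1 if score > 0 else -1                      # +1 means "spam", -1 means "not_spam"
print(score, y)                                 # 2.5, 1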
SVM
 The goal of the learning algorithm — SVM in this case — is to leverage the dataset and find the
optimal values w* and b* for parameters w and b. Once the learning algorithm identifies these
optimal values, the model f(x) is then defined as:

f(x) = sign(w*x - b*)


 Therefore, to predict whether an email message is spam or not spam using an SVM model, you
have to take a text of the message, convert it into a feature vector, then multiply this vector by
w*, subtract b* and take the sign of the result. This will give us the prediction (+1 means “spam”,
-1 means “not_spam”).

 Now, how does the machine find w* and b*? It solves an optimization problem. Machines are
good at optimizing functions under constraints

 So what are the constraints we want to satisfy here? First of all, we want the model to predict the labels of our 10,000 examples correctly. Remember that each example i = 1, ..., 10000 is given by a pair (xi, yi), where xi is the feature vector of example i and yi is its label, taking the value either -1 or +1. So the constraints are naturally:
wxi - b >= 1 if yi = +1, and
wxi - b <= -1 if yi = -1
SVM
 Preferably, the hyperplane should separate positive examples from negative ones with the largest margin.
 The margin is the distance between the closest examples of the two classes, as defined by the decision boundary. A large margin contributes to better generalization, that is, how well the model will classify new examples in the future.
 To achieve that, we need to minimize the Euclidean norm of w, denoted by ||w||.
 So, the optimization problem that we want the machine to solve looks like this:
minimize ||w||, subject to yi(wxi - b) >= 1 for i = 1, ..., 10000
 [Figure: the blue and orange circles represent, respectively, positive and negative examples, and the line given by wx - b = 0 is the decision boundary.]
SVM
 The expression yi(wxi - b) >= 1 is just a compact way to write the above two constraints.

 Why, by minimizing the norm of w, do we find the highest margin between the two classes?

 Geometrically, the equations wx - b = 1 and wx - b = -1 define two parallel hyperplanes.

 The distance between these hyperplanes is given by 2/||w||, so the smaller the norm ||w||, the larger the distance between these two hyperplanes.

This particular version of the algorithm builds the so-called linear model. It’s called linear
because the decision boundary is a straight line (or a plane, or a hyperplane).
 SVM can also incorporate kernels that can make the decision boundary arbitrarily non-linear. In some cases, it could be impossible to perfectly separate the two groups of points because of noise in the data, labeling errors, or outliers (examples very different from a "typical" example in the dataset).
Another version of SVM can also incorporate a penalty hyperparameter for misclassification of
training examples of specific classes.
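A sketch of this training step using scikit-learn's SVC with a linear kernel on made-up 2-D data; note that scikit-learn parameterizes the decision function as wx + b, so its intercept_ corresponds to -b in the slides' wx - b notation:

# Sketch: training a linear SVM and recovering the learned parameters.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 0], [0, 1], [2, 2], [3, 2], [2, 3]])   # made-up feature vectors
y = np.array([-1, -1, -1, 1, 1, 1])                              # +1 "spam", -1 "not_spam"

model = SVC(kernel="linear", C=1e3)     # large C approximates a hard margin
model.fit(X, y)

w_star = model.coef_[0]
b_star = -model.intercept_[0]           # convert to the wx - b form used above
pred = np.sign(X @ w_star - b_star)     # sign(w*x - b*)
print(w_star, b_star, pred)             # predictions match y on this separable toy data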
Evaluate the Performance of
Supervised Learning
 Machines learn by means of a loss function. It is a method of evaluating how well a specific algorithm models the given data. If predictions deviate too much from the actual results, the loss function produces a very large number. Gradually, with the help of some optimization procedure, the model learns to reduce the error in its predictions.

 There is no one-size-fits-all loss function for machine learning algorithms. Various factors are involved in choosing a loss function for a specific problem, such as the type of machine learning algorithm chosen, the ease of calculating the derivatives, and, to some degree, the percentage of outliers in the data set.

 Loss functions play an important role in any statistical model: they define an objective against which the performance of the model is evaluated, and the parameters learned by the model are determined by minimizing the chosen loss function.
Loss Function
 There are two major categories, depending on the type of learning task we are dealing with:
 Regression losses: regression deals with predicting a continuous value. A few well-known loss functions are:
  Mean Absolute Error (MAE)
  Mean Squared Error (MSE)
  Mean Bias Error (MBE)
  Mean Squared Logarithmic Error (MSLE)
 Classification losses: in classification, we deal with categorical values. A few well-known loss functions are:
  Binary Cross Entropy Loss
  Hinge Loss
Mean Absolute Error (MAE) / L1 Loss

 Regression problems may have variables that are not strictly Gaussian in nature due to the
presence of outliers (values that are very different from the rest of the data).

 Mean Absolute Error would be an ideal option in such cases because it does not take into
account the direction of the outliers (unrealistically high positive or negative values).

 MAE is the average of the absolute differences between the actual and the predicted values. For a true value yi and its predicted value ŷi, with n being the total number of data points in the dataset, the mean absolute error is defined as:
MAE = (1/n) Σ |yi - ŷi|
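A short NumPy sketch of the MAE formula on made-up values:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # made-up true values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # made-up predictions

mae = np.mean(np.abs(y_true - y_pred))     # (1/n) * sum(|y_i - yhat_i|)
print(mae)   # 0.5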
Mean Squared Error (MSE)
/ L2 Loss
 Preferred by researchers because most variables can be modeled with a Gaussian distribution.
 Mean Squared Error is the average of the squared differences between the actual and the predicted values. For a true value yi and its predicted value ŷi, where n is the total number of data points in the dataset, the mean squared error is defined as:
MSE = (1/n) Σ (yi - ŷi)²

 It is only concerned with the average magnitude of the error, irrespective of direction. However, due to squaring, predictions that are far from the actual values are penalized heavily in comparison to less deviated predictions. MSE also has nice mathematical properties that make it easier to calculate gradients.

 Mean absolute error, by contrast, is measured as the average of the absolute differences between predictions and actual observations. Like MSE, it measures the magnitude of the error without considering its direction. Unlike MSE, MAE needs more complicated tools such as linear programming to compute the gradients. MAE is also more robust to outliers since it does not square the errors.
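The corresponding NumPy sketch for MSE, on the same made-up values used above:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)      # (1/n) * sum((y_i - yhat_i)^2)
print(mse)   # 0.375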
Mean Bias Error
 Mean Bias Error takes the actual difference between the target and the predicted value, and not the absolute
difference. One has to be cautious as the positive and the negative errors could cancel each other out, which
is why it is one of the lesser-used loss functions.

 Mean Bias Error is used to calculate the average bias in the model. Bias, in a nutshell, is overestimating or
underestimating a parameter. Corrective measures can be taken to reduce the bias post-evaluating the model
using MBE.

MBE = (1/n) Σ (yi - ŷi)
where yi is the true value, ŷi is the predicted value and n is the total number of data points in the dataset.
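A NumPy sketch of MBE on the same made-up values, showing how errors of opposite sign partially cancel:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mbe = np.mean(y_true - y_pred)             # (1/n) * sum(y_i - yhat_i)
print(mbe)   # -0.25  (the +0.5 and -0.5 errors cancel; the -1 error dominates)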
Mean Squared Logarithmic Error
(MSLE)
 Sometimes, one may not want to penalize the model too much for predicting unscaled quantities directly.
Relaxing the penalty on huge differences can be done with the help of Mean Squared Logarithmic Error.
 Calculating the Mean Squared Logarithmic Error is the same as calculating the Mean Squared Error, except that the natural logarithm of the values is used rather than the values themselves:
MSLE = (1/n) Σ (log(yi + 1) - log(ŷi + 1))²
where yi is the true value, ŷi is the predicted value and n is the total number of data points in the dataset.
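A NumPy sketch of MSLE, using the common log(1 + value) form so that zero values are allowed; the data is made up and must be non-negative:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
print(round(msle, 4))   # 0.0397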
Binary Cross Entropy Loss

 Entropy is a measure of the randomness in the information being processed, and cross-entropy is a measure of the difference between two probability distributions.
 As the predicted probability diverges further from the actual label, the cross-entropy loss increases. For example, predicting a probability of .011 when the actual observation label is 1 would result in a high loss value. In an ideal situation, a "perfect" model would have a log loss of 0. Looking at the loss function makes things even clearer:
BCE = -(1/n) Σ [yi log(ŷi) + (1 - yi) log(1 - ŷi)]
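A NumPy sketch of binary cross-entropy on made-up labels and predicted probabilities, including the 0.011-for-a-true-1 case mentioned above:

import numpy as np

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.1, 0.8, 0.011])   # predicting 0.011 for a true 1 is heavily penalized

bce = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(round(bce, 3))   # 1.236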
Hinge Loss/Multi class SVM Loss
 Hinge loss was primarily developed for support vector machines, for calculating the maximum margin from the hyperplane to the classes.
 Loss functions penalize wrong predictions but not correct ones. So, the score of the target label should be greater than the score of each incorrect label by a margin of (at least) one.
 This margin is the maximum margin from the hyperplane to the data points, which is why hinge loss is preferred for SVMs.
Hinge Loss/Multi class SVM Loss
 In simple terms, the score of the correct category should be greater than the score of each incorrect category by some safety margin (usually one). Hence hinge loss is used for maximum-margin classification, most notably for support vector machines. Although not differentiable everywhere, it is a convex function, which makes it easy to work with the usual convex optimizers used in the machine learning domain.
Hinge Loss/Multi class SVM Loss
Consider an example where we have three training examples and three classes to predict: dog, cat, and horse. Given the scores predicted by our algorithm for each of the classes, we compute the hinge loss for all three training examples, as in the sketch below.
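A sketch of the multi-class hinge loss computation; since the slide's table of predicted scores is not reproduced here, the class scores below are made-up illustrative values:

# Multi-class hinge loss: L = sum over j != target of max(0, s_j - s_target + 1).
import numpy as np

def hinge_loss(scores, target):
    margins = np.maximum(0.0, scores - scores[target] + 1.0)
    margins[target] = 0.0              # the target class contributes no loss
    return margins.sum()

# Rows: three training examples; columns: made-up scores for [dog, cat, horse].
scores = np.array([[ 3.2, 5.1, -1.7],
                   [ 1.3, 4.9,  2.0],
                   [ 2.2, 2.5, -3.1]])
targets = [0, 1, 2]                    # true class of each example

for s, t in zip(scores, targets):
    print(hinge_loss(s, t))            # 2.9, 0.0, 12.9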
Summary

