
Linear Classification

Introduction to Machine Learning: Lecture II


Nebu John Mathai
Recap
Linear Regression
Regression: Why?
• What problem is regression solving?
• Modelling a phenomenon
• Predicting the behavior of a phenomenon, based on this model
• Consider a phenomenon, P, with some I/O behavior that we can
investigate and tabulate: inputs (x), outputs (t)
• Regression: postulate a tunable model
• That is, a model with various adjustable parameters
• Adjust these parameters based on the data (fit)
• Validate the adjustments on novel data (generalization)
Regression: Setup
• Consider an N-element dataset containing input (data, x) and output
(target, t) characteristics for some phenomenon, P, of interest
• We want to develop a model, M, of P so that we can predict its
behavior
• That is, given existing or novel values of x, accurately measure what P would
output, given that x
• Recall: fit versus generalization
• Fit: Given values of x from our dataset, our model’s output (y) is an accurate
prediction of the associated t
• Generalization: Given new values of x that we have not seen yet, the model’s
output is an accurate prediction of the associated t
Regression: Plan
• Here is the supervised learning plan of attack of this problem
• 0. We separate the dataset into:
• Training Set
• Validation Set
• 1. We propose a tunable (i.e., parametrized) computational model, M, that has the same I/O
characteristics of P (i.e., the dataset)
• Input type and dimensionality of P and M are identical (ref: x \in R^D)
• Output type and dimensionality of P and M are identical (ref: t \in R)
• 2. We define the Loss function, L
• From this we can derive the performance surface E
• 3. Learning: We tune the parameters of M using (a) the Training Set, and (b) Gradient Descent
(i.e., we learn using the Training Set to achieve fit)
• 4. Validation: Once (3) has converged, we validate the model using the Validation set
• This is novel data: thus this step is to measure generalization
Linear Regression: Details
• Linear Regression is a special case of the regression problem
• M is chosen to be a linear model: the prediction is a weighted linear
combination of the input vector x:

   y = M(x; w, b) = w^T x + b

• The N-element dataset is partitioned into TrS (~70% of N) and VaS
(~30% of N)
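The ~70/30 partition can be sketched in code (the toy data, random seed, and exact split below are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: N scalar inputs x and targets t (names are illustrative).
N = 10
x = rng.uniform(-1.0, 1.0, size=N)
t = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=N)

# Shuffle indices, then take ~70% for training (TrS) and ~30% for validation (VaS).
idx = rng.permutation(N)
n_train = int(0.7 * N)
train_idx, val_idx = idx[:n_train], idx[n_train:]

x_train, t_train = x[train_idx], t[train_idx]
x_val, t_val = x[val_idx], t[val_idx]
```

Shuffling before splitting matters: if the dataset is ordered in any way, a contiguous split would bias both partitions.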
Linear Regression: Details
• Now we define the error function (loss function) that expresses our
measure of how “off” a prediction is:

   L(y, t) = (1/2) (y - t)^2

• With this loss function we can define a performance surface:

   E(w, b) = (1/N) sum_n L(M(x^(n); w, b), t^(n))

• This performance surface is a scalar-valued function over the hypothesis
space (i.e., the space of all hypotheses (w, b))
• It defines the average loss of M(x; w’, b’)
• M(x; w’, b’) is the particular instance of M with parameters set to (w’, b’)
Linear Regression: Details
• With E defined, we now just need to find the lowest point (the global
minimum) on this surface E, and use the value of (w, b) at this minimum
• For this we employ gradient descent
• Start with some setting of the hypothesis
• Calculate the gradient of E with respect to the hypothesis (w, b), at this value of
the hypothesis
• Move in the opposite direction of the gradient: (w, b) ← (w, b) - η ∇E, where η is
the learning rate
• For the linear regression problem, the performance surface E is
guaranteed to be convex, and thus the single extremum of this function
(i.e., the point on E where the gradient is zero) is the global minimum
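The descent loop just described can be sketched as follows (a minimal sketch assuming the squared-error loss; the synthetic data, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data generated from a known linear rule (for illustration).
N, D = 100, 2
X = rng.normal(size=(N, D))
true_w, true_b = np.array([1.5, -2.0]), 0.5
t = X @ true_w + true_b

# E(w, b) = (1/N) * sum_n (1/2) * (w.x_n + b - t_n)^2
w = np.zeros(D)
b = 0.0
eta = 0.1  # learning rate (a hyperparameter)

for _ in range(500):
    y = X @ w + b             # model predictions on TrS
    err = y - t
    grad_w = (X.T @ err) / N  # dE/dw
    grad_b = err.mean()       # dE/db
    w -= eta * grad_w         # step opposite the gradient
    b -= eta * grad_b
```

Because E is convex here, the loop approaches the global minimum for any small enough learning rate.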
Linear Regression: Details
• After identifying the minimum of E with respect to TrS, we have a
setting for the hypothesis that enables M to fit TrS
• We now need to validate M for its ability to generalize to new data
(i.e., data that it was not trained on)
• For this reason, we take VaS and calculate E
• If E(VaS) <= E(TrS) then we have an empirical indication that the tuned model
generalizes
Linear Regression: Details
• This is all empirical!
• It is of course possible that our dataset was biased, and also that our partitioning
of DS into TrS and VaS was done in a biased manner.
• We thus have NO theoretical guarantee that our LR tuned model will generalize to all
possible data
• We have no guarantee as to M’s validity
• Hence: in employing an ML model, we must always be vigilant, validating the model
continuously to see if there is any indication that our model is invalid
• ML is thus very different from automata theory
• In the latter we have theoretical guarantees that we have solved a problem per the theory
• Why is ML used? Sometimes we have to cope with the absence of a theory for
our problem.
Extensions
• LR allows us to model (or “learn”) functions of the form t ≈ w x + b using a
linear model that is tuned
• Can extend this to learning functions of the form t ≈ w_1 x + w_2 x^2 + ... + w_K x^K + b
using a polynomial (nonlinear) model that is tuned
• No new development was needed
• We used feature functions to transform the original scalar dataset into a
multidimensional one (powers of the original data were mapped to new
dimensions), and then used standard LR
• Exercise: Show how the above techniques can be applied to the
modelling of general multidimensional systems via linear or
polynomial regression
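The feature-function trick above can be sketched as follows (a sketch: the quadratic phenomenon and the `poly_features` helper are illustrative, and the fit here uses least squares rather than gradient descent purely for brevity):

```python
import numpy as np

def poly_features(x, degree):
    """Map a scalar dataset to [x, x^2, ..., x^degree] feature columns."""
    return np.stack([x ** k for k in range(1, degree + 1)], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
t = 1.0 - 2.0 * x + 3.0 * x ** 2   # a quadratic phenomenon (illustrative)

# Transform, then solve the ordinary linear-regression problem on the features.
Phi = poly_features(x, degree=2)
Phi1 = np.column_stack([Phi, np.ones_like(x)])   # extra column absorbs the bias
coef, *_ = np.linalg.lstsq(Phi1, t, rcond=None)
w, b = coef[:-1], coef[-1]
```

The model is still linear in its parameters; only the inputs were transformed, which is why no new development was needed.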
Hyperparameters
• We also covered the notion of hyperparameters: parameters in the
process that are outside of the optimization loop
• C.f. w,b which are parameters, and are being actively optimized
• The learning rate is one such hyperparameter
• The maximum degree for polynomial regression is another
• The degree is special, because we can put an additional loop around our
optimization loop so that when validation fails, we can adjust the
hyperparameter to increase model complexity
• We discussed the conflict between fit and generality
• The next slide is illustrative
[Figure: two candidate model fits to the dataset in the t-vs-x plane.
Which model is more appropriate? How would they both fare with new data
(in red)?]
Expectations
• At this point, you must know:
• The setup of the regression problem
• The full development of the LR solution
• How to extend LR to polynomial regression
• In multiple dimensions
• The role of the data set and its partitions { TrS, VaS, TeS }
• What hyper-parameters are and how they differ from parameters (i.e., the
hypothesis space)
• How to code an LR solver and how to use it
• Training vs Validation vs Testing
• Ask if any of this is unclear
Classification
Problem Setup; Linear Binary Classification
Classification
• What is the classification problem?
• Suppose we have medical case data that shows how various
symptoms and measurements of an organism relate to the presence
or absence of a medical condition
• A classification system tuned on this data would allow us to take in a new
vector (symptoms, measurements) and indicate whether it signals the
presence or absence of disease
• Suppose we have a set of pictures with labels that indicate the object
being pictured
• A classification system tuned on this data would allow us to take in a new
input vector (image pixels) and indicate what the imaged object is
Classification
• Like with regression, our input is a vector (possibly multidimensional)
• E.g., a vector of measurements, pixels, etc.
• Unlike with regression, our output (target) is not a real-valued
variable but rather a class: some discrete variable with a fixed number
of options
Classification vs Regression
• To understand classification (and indeed, any ML technique) we can
start by looking at the datasets involved
• Regression:
• Our goal was to design a model that can predict real-valued targets, t, given
the input x
• Classification:
• Our goal here is to also design a model that will predict targets
• Now, however, we call the target a class; the class is not real-valued but
drawn from a discrete space (i.e., an integer space)
Classification
• Classification:
• This change of the nature of the target set from real-valued to integral
makes a major difference
• You can recall Lecture 1’s discussion of the implications of our using an integer
computer to approximate real-valued computations to refresh in your mind
that real numbers and integers are very different spaces
• We will not be able to look at the t-vs-x plot in the same manner as for LR
• t is a discrete space, and so our real-variables calculus tools can not be used directly
• We will see in the next lecture why we can not simply rehash our LR development to
develop a classification ML system
Setup: Binary Classification
• We will start our development by choosing the most basic, non-trivial
classification space: the discrete set { -1, 1 }
• As stated previously, we will not visualize this dataset by plotting t-
versus-x, as we would normally be able to do if t were real-valued
• Instead, we will plot the data in x’s space and color the data points
based on the class of the point (i.e., either -1 or 1)
• We need to determine a method to classify input vectors x into either
the class -1 or the class 1: this is called Binary Classification
Setup: Binary Classification
• If we are not viewing the data as t-vs-x (as we did in LR), then we are not
able to use our LR development as-is to learn the classification function
• Instead, we will try to learn the coefficients of a hypersurface in R^D such that
the elements of the two classes are on different sides of the hypersurface
(if this is possible, we are said to have a linearly separable dataset)
• Recall from linear algebra that the equation for such a hypersurface has the
form: w^T x + b = 0
• This hypersurface splits the space into two half-spaces:
• Points u on one side of the hypersurface obey: w^T u + b > 0
• Points v on the other side of the hypersurface obey: w^T v + b < 0
Setup: Binary Classification
• Summarizing:
• Dataset: { (x^(n), t^(n)) }, n = 1..N, with x^(n) \in R^D and t^(n) \in { -1, 1 }
• Model: z = w^T x + b; y = 1 if z >= 0, else y = -1

• As with LR, y is the output of the model, the predictor of the class
• This prediction is computed by looking at z to determine which side of
the hypersurface we’re on
• Note: you may notice a deficiency, in that the predictor cannot
distinguish between being on the surface and being on the
positive side of the surface; we’ll address this soon.
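The model above can be written directly as code (a sketch; the convention of lumping z = 0 in with the positive side reflects exactly the deficiency just noted):

```python
import numpy as np

def predict(x, w, b):
    """Binary linear classifier: z = w.x + b; return +1 for points on or
    above the hypersurface, -1 for points below it."""
    z = np.dot(w, x) + b
    return 1 if z >= 0 else -1
```

For example, with w = (1, 1) and b = 0 the hypersurface is the line x1 + x2 = 0, and points are classified by which side of that line they fall on.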
Binary Linear Classifier
• Dataset: { (x^(n), t^(n)) }, n = 1..N, with t^(n) \in { -1, 1 }
• Model: z = w^T x + b; y = 1 if z >= 0, else y = -1

• The above model is parametrized by w and b


• We now discuss how to determine the values of the parameters
• As with LR, and indeed with all supervised learning techniques, we
will be adjusting these parameters based on the data and some
feedback as to our model’s instantaneous correctness
Perceptron
• To make the development simpler to present we are going to absorb
b into the w vector, by modifying the dataset as follows
• Original Dataset: { (x^(n), t^(n)) }, with x^(n) \in R^D
• Modified Dataset: { (x’^(n), t^(n)) }, with x’^(n) = [1, x^(n)] \in R^(D+1)
• That is, the modified dataset is the original dataset to which an extra
dummy dimension has been added, with a value of 1 for all data
vectors
• Since this transform can always be done, we will assume it has
already been done in the next slides
Perceptron
• With this modified dataset, we have the model

   z = w^T x; y = 1 if z >= 0, else y = -1

• The above model is parametrized by the (D+1)-dimensional w

• Note: w^T x = 0 still defines a hypersurface in the augmented space R^(D+1) (i.e., we have not altered
the dimensionality nor the solution by our absorption of b into the weight vector)
Perceptron
• If tuned properly, the model should classify data points from the data
set correctly:

   y^(j) = t^(j) for all j

• How do we achieve this proper tuning?

• It turns out there is a very simple algorithm that is guaranteed to
converge in finite time if the dataset is linearly separable
Perceptron Learning Rule (PLR)
• With thanks to Rosenblatt, 1958
• Let’s consider the j-th element of our dataset
• If the tuned model is producing a correct prediction then y^(j) = t^(j)
• For t^(j) = 1 we want z^(j) = w^T x^(j) >= 0
• We will strengthen this criterion and demand: z^(j) > 0
• This criterion prevents us from allowing data points on the hypersurface
• Similarly, we demand: for t^(j) = -1, z^(j) < 0
• If this were not the case, then we would have to adjust w
• How?
Perceptron Learning Rule
• Based on the prior development we only have two cases:

   t^(j) z^(j) > 0 (correct) or t^(j) z^(j) <= 0 (incorrect)

• If the product is strictly positive then we know:

   sign(z^(j)) = sign(t^(j)) and z^(j) ≠ 0

• Both target and prediction have the correct sense
• Prediction is not on the hypersurface (i.e., the z^(j) = 0 case is excluded)
Perceptron Learning Rule
• If t^(j) z^(j) <= 0:
• For positive targets, this means that z^(j) is non-positive and must be increased
• Let’s propose we add x^(j) to w (i.e., w ← w + x^(j))
• Then: z_new^(j) = (w + x^(j))^T x^(j) = z^(j) + ||x^(j)||^2
• Recall: ||x^(j)||^2 > 0 (the dummy dimension guarantees x^(j) ≠ 0), so z^(j) increases
• Similarly, for negative targets: w ← w - x^(j)
• Combining the two cases: w ← w + t^(j) x^(j)
Perceptron Learning Rule
• Estimate w (e.g., initialize w = 0)
• While (True):
• updated = False
• For j = 1..N (i.e., for all elements in our training set)
• Calculate z^(j) = w^T x^(j)
• If t^(j) z^(j) <= 0: set w ← w + t^(j) x^(j) and updated = True
• Break if updated == False (a full pass with no updates means every
point is classified correctly)
• Note: if the training set (TrS) is linearly separable then this is guaranteed to
converge
• Since we may not know TrS is linearly separable, you should augment this loop to
have a maximum iteration count (to prevent infinite loops). Do this as an exercise.
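The loop above, with the suggested maximum-iteration cap, can be sketched as follows (a reference sketch; the zero initialization and the `max_iter` default are arbitrary choices):

```python
import numpy as np

def perceptron(X, t, max_iter=1000):
    """Perceptron Learning Rule on a bias-absorbed dataset X (N x (D+1))
    with targets t in {-1, +1}. Returns (w, converged)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        updated = False
        for x_j, t_j in zip(X, t):
            if t_j * np.dot(w, x_j) <= 0:   # wrong side, or on the surface
                w = w + t_j * x_j
                updated = True
        if not updated:                      # a clean full pass: converged
            return w, True
    return w, False                          # hit the iteration cap

# Example: AND on {-1, 1}^2, with the dummy dimension already prepended.
X_and = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
t_and = np.array([-1, -1, -1, 1])
w_and, converged = perceptron(X_and, t_and)
```

AND is linearly separable, so the loop halts with a weight vector that puts every point strictly on the correct side.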
Exercises
• Consider the following Boolean functions:
• OR, AND, NOT and XOR
• Recall: for x and y in { -1, 1 }
• OR(x,y) returns a 1 for all cases except when x=y=-1
• AND(x,y) returns a -1 for all cases except when x=y=1
• NOT(x) returns a -1 when x=1, and a +1 otherwise
• XOR(x,y) returns a -1 when x=y, and a 1 otherwise
• What is the structure of the Perceptron that you would require to learn each of
those functions? (i.e., write the w and x vectors)
• Derive the weights by hand for each function.
• Now code up the PLR in Python and see what it computes for w
• Test what you got: were (a) your hand calculations and (b) the output of the PLR
correct?
Limits of the PLR
• The Perceptron Learning Rule is only guaranteed to converge if the
dataset is linearly separable
• Given a dataset with a binary target (in { -1, 1 }), let’s call the data
vectors with class = 1 the positive examples
• Likewise, the data vectors with class=-1 are the negative examples
• A linearly separable dataset is one where a hyperplane separates the
positive and negative examples
• In such a case, the PLR is guaranteed to converge to the hyperplane
coefficients
Limits of the PLR
• Exercise: graph the aforementioned Boolean functions and attempt to
draw a hyperplane (a line in 2D) between the positive and negative
examples. What do you observe? Compare this with the results of the
computation of w for each case that you did earlier.
• As you can see, a very simple function (the XOR) produces results that
are not linearly separable, and thus can not be learned via a
perceptron. This is a strong indictment of the Perceptron, given that
XOR is trivially constructed from two ANDs and one OR!
• What else can one not do with a Perceptron?
Limits of the PLR
• Let’s first introduce the concept of a convex set
• A convex set S is one wherein, given any two points in the set, the line
segment joining these two points (i.e., the set of points on the segment) is also
within the set, i.e.:

   for all u, v \in S and all λ \in [0, 1]: λu + (1 - λ)v \in S

• For λ = 1 we recover u; for λ = 0 we recover v
• For a linearly separable problem, the region of positive examples
must be convex and the region of negative examples must be convex
Limits of the PLR
• Let’s consider the XOR function.
• Let’s suppose it were linearly separable. Then we ought to be able to draw
convex sets for the regions of positive examples (and likewise for negative
examples).
• If we draw the line segment between the two -1s, we ought to have a convex set
(if XOR were linearly separable).
• Likewise for the two 1s.
• Note, however, that the two segments intersect, meaning that
the point of intersection must belong to both sets, which is nonsense.
• Hence the XOR is not linearly separable as the regions of positive examples
and that of negative examples are not convex
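The intersection argument can be checked numerically (a tiny sketch: taking λ = 1/2 in the convexity definition, both segments pass through the same point, the origin):

```python
import numpy as np

# XOR on {-1, 1}^2: positive examples where the inputs differ,
# negative examples where they agree.
positives = np.array([[-1.0, 1.0], [1.0, -1.0]])
negatives = np.array([[-1.0, -1.0], [1.0, 1.0]])

# Midpoint of each segment (lambda = 1/2 in the convexity definition).
mid_pos = positives.mean(axis=0)
mid_neg = negatives.mean(axis=0)
```

Both midpoints coincide at the origin, so that one point would have to lie in the positive region and the negative region at once.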
Limits of the Perceptron
• In looking at the limits of LC we see that the lack of linear separability
(LS) in the dataset is the problem
• Two cases:
• Dataset is LS but noisy
• Dataset is NOT LS
Addressing the Limits
• The PLR is great for linearly separable (LS) datasets (DS)
• What if our dataset is contaminated with noise?
• Then a fundamentally LS DS will not look LS!
• In this case we will need to optimize: find the best linear separator for most of
the data
• This will involve learning
• I.e., loss functions and gradient descent
• The next lecture will deal with this
• Recall our comments that classification, being a mapping from a real data space
to a discrete target space, is crucially different from the regression problem
• We’ll see why soon
Addressing the Limits
• Though useful at learning linearly separable datasets, the Perceptron
is limited and can’t learn even simple nonlinear functions
• Exercise: show that XOR is nonlinear
• Recall how LR was extended to polynomial regression via the feature
functions
• We’ll see that we can employ a similar tactic to make the Perceptron
learn the XOR
• But more than that, we’ll uncover a fundamental principle that will
guide us into more interesting ML territory: neural networks
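As a preview of that tactic, here is a sketch assuming the product feature x1*x2 (for inputs in { -1, 1 }, XOR(x1, x2) = -x1*x2, so a single linear rule suffices in the enlarged feature space):

```python
import numpy as np

# XOR on {-1, 1}^2: target is -1 when the inputs agree, +1 when they differ.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, -1])

# Feature map: append the product x1*x2 as a third dimension.
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# In feature space one linear rule works: weight -1 on the product feature,
# since XOR(x1, x2) = -x1*x2 for inputs in {-1, 1}.
w = np.array([0.0, 0.0, -1.0])
y = np.sign(phi @ w)
```

In the original 2D space XOR is not linearly separable, but after the feature map the two classes sit on opposite sides of a hyperplane, exactly as powers of x did for polynomial regression.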
Map so far
• Linear Regression (LR)
• Dataset characteristics
• Tunable model
• Optimization framework for the solution of LR: learning
• Linear Classification (LC)
• Dataset characteristics
• Tunable model (Perceptron)
• Perceptron Learning Rule: non-optimization framework for the solution of LC
• Next: Learning and Classification
• + Optimization framework for the solution of linear classification with noise
• Next next: Learning non-linear classification functions
• + Generalization of the model
• + How to optimize with this more general model
Linear Classification: Terminology
• Classification
• Linear Classification
• Binary Classification
• Model
• Perceptron Model
• Negative Examples
• Positive Examples
• Hypersurface
• Convex Subset
• Linearly Separable Problems
Required Reading
• N/A
Useful Reference Material
• An excellent reference for AI, ML, linear regression, and other related
topics is Artificial Intelligence: A Modern Approach (Third
Edition; by Russell and Norvig)
• Professor Grosse (Computer Science) has further excellent material:
• http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L03%20Linear%20Classifiers.pdf
• Neural Networks - A Systematic Introduction, Raul Rojas, Springer-
Verlag, Berlin, New York, 1996 (502 p., 350 illustrations)
• Chapters 3 and 4
