Lecture 02
Expectations
• At this point, you must know:
• The setup of the regression problem
• The full development of the LR solution
• How to extend LR to polynomial regression
• In multiple dimensions
• The role of the data set and its partitions { TrS, VaS, TeS }
• What hyper-parameters are and how they differ from parameters (i.e., the
hypothesis space)
• How to code an LR solver and how to use it
• Training vs Validation vs Testing
• Ask if any of this is unclear
Classification
Problem Setup; Linear Binary Classification
Classification
• What is the classification problem?
• Suppose we have medical case data that shows how various
symptoms and measurements of an organism relate to the presence
or absence of a medical condition
• A classification system tuned on this data would allow us to take in a new
input vector (symptoms, measurements) and indicate whether it indicates the
presence or absence of disease
• Suppose we have a set of pictures with labels that indicate the object
being pictured
• A classification system tuned on this data would allow us to take in a new
input vector (image pixels) and indicate what the imaged object is
Classification
• Like with regression, our input is a vector (possibly multidimensional)
• E.g., a vector of measurements, pixels, etc.
• Unlike with regression, our output (target) is not a real-valued
variable but rather a class: some discrete variable with a fixed number
of options
Classification vs Regression
• To understand classification (and indeed, any ML technique) we can
start by looking at the datasets involved
• Regression:
• Our goal was to design a model that can predict real-valued targets, t, given
the input x
• Classification:
• Our goal here is also to design a model that will predict targets
• Now, however, we call the target a class; the class is not real-valued but is
drawn from a discrete space (i.e., an integer space)
Classification
• Classification:
• This change of the nature of the target set from real-valued to integral
makes a major difference
• Recall Lecture 1’s discussion of the implications of using an integer (digital)
computer to approximate real-valued computations: real numbers and integers are
very different spaces
• We will not be able to look at the t-vs-x plot in the same manner as for LR
• t lives in a discrete space, and so our real-variables calculus tools can not be used directly
• We will see in the next lecture why we can not simply rehash our LR development to
develop a classification ML system
Setup: Binary Classification
• We will start our development by choosing the most basic, non-trivial
classification space: the discrete set { -1, 1 }
• Classification:
• As stated previously, we will not visualize this data set by plotting t-
versus-x, as we would normally be able to do if t were real-valued
• Instead, we will plot the data in x’s space and color the data points
based on the class of the point (i.e., either -1 or 1)
• We need to determine a method to classify input vectors x into either
the class -1 or the class 1: this is called Binary Classification
Setup: Binary Classification
• If we are not viewing the data as t-vs-x (as we did in LR), then we are not
able to use our LR development as is to learn the classification function
• Instead, we will try to learn the coefficients of a hypersurface in the input
space such that the elements of the two classes are on different sides of the
hypersurface (if this is possible, we are said to have a linearly separable dataset)
• Recall from linear algebra that the equation for such a hypersurface has the
form: w·x + b = 0
• This hypersurface will split the space into two half-spaces:
• Points u on one side of the hypersurface would obey: w·u + b > 0
• Points v on the other side of the hypersurface would obey: w·v + b < 0
(a small sketch of this test in code follows below)
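• To make the half-space test concrete, here is a minimal Python sketch; the
weight and bias values are made up purely for illustration and are not learned
from data:

import numpy as np

# Illustrative (not learned) hyperplane coefficients
w = np.array([2.0, -1.0])   # normal vector of the hypersurface w·x + b = 0
b = 0.5                     # offset

def classify(x, w, b):
    z = np.dot(w, x) + b            # which side of the hypersurface are we on?
    return 1 if z >= 0 else -1      # the surface itself is lumped with the positive side

print(classify(np.array([1.0, 1.0]), w, b))    # z = 1.5  -> class  1
print(classify(np.array([-1.0, 2.0]), w, b))   # z = -3.5 -> class -1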
Setup: Binary Classification
• Summarizing:
• Dataset: { (x^(n), t^(n)) }, n = 1..N, with targets t^(n) in { -1, 1 }
• Model: z = w·x + b; y = +1 if z ≥ 0, otherwise y = -1
• As with LR, y is the output of the model, the predictor of the class
• This prediction is computed by looking at z to determine which side of
the hypersurface we’re on
• Note: you may notice a deficiency: as written, the predictor can not distinguish
between being exactly on the surface (z = 0) and being on the positive side of the
surface; we’ll address this soon.
Binary Linear Classifier
• Dataset: { (x^(n), t^(n)) }, n = 1..N, with targets t^(n) in { -1, 1 }
• Model: z = w·x + b; y = +1 if z ≥ 0, otherwise y = -1
• Perceptron Learning Rule (PLR): sweep over the training examples; whenever an
example is misclassified (i.e., t^(n) z^(n) ≤ 0), update w ← w + t^(n) x^(n) and
b ← b + t^(n)
• Break out of the sweep loop if a full pass over the data makes no updates
• Note: if the training set (TrS) is linearly separable then this is guaranteed to
converge
• Since we may not know TrS is linearly separable, you should augment this loop to
have a maximum iteration count (to prevent infinite loops). Do this as an exercise;
a sketch in Python follows below.
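• Below is a minimal sketch of the PLR loop in Python, including the
maximum-iteration guard suggested above; the name perceptron_train and the
default max_iters value are illustrative choices, not part of the lecture:

import numpy as np

def perceptron_train(X, t, max_iters=1000):
    """X: (N, d) array of inputs; t: (N,) array of targets in { -1, +1 }.
    Returns (w, b) such that the sign of w·x + b predicts t; convergence is
    guaranteed only if the training set is linearly separable."""
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(max_iters):          # guard against non-separable data
        updated = False
        for n in range(N):
            z = np.dot(w, X[n]) + b
            if t[n] * z <= 0:           # example n is misclassified (or on the surface)
                w += t[n] * X[n]        # PLR update
                b += t[n]
                updated = True
        if not updated:                 # a full pass with no updates: converged
            break
    return w, b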
Exercises
• Consider the following Boolean functions:
• OR, AND, NOT and XOR
• Recall: for x and y in { -1, 1 }
• OR(x,y) returns a 1 for all cases except when x=y=-1
• AND(x,y) returns a -1 for all cases except when x=y=1
• NOT(x) returns a -1 when x=1, and a +1 otherwise
• XOR(x,y) returns a -1 when x=y, and a 1 otherwise
• What is the structure of the Perceptron that you would require to learn each of
those functions? (i.e., write the w and x vectors)
• Derive the weights by hand for each function.
• Now code up the PLR in Python and see what it computes for w (one way to encode
these functions as datasets is sketched after this list)
• Test what you got: were (a) your hand calculations and (b) the output of the PLR
correct?
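• One way to encode the Boolean functions above as datasets for the exercise,
assuming the perceptron_train sketch from earlier (the ±1 input and target
encoding follows the definitions given above):

import numpy as np

X2 = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)   # all (x, y) pairs
t_or  = np.array([-1,  1,  1,  1], dtype=float)                    # OR truth table
t_and = np.array([-1, -1, -1,  1], dtype=float)                    # AND truth table
t_xor = np.array([-1,  1,  1, -1], dtype=float)                    # XOR truth table

for name, t in [("OR", t_or), ("AND", t_and), ("XOR", t_xor)]:
    w, b = perceptron_train(X2, t)
    y = np.where(X2 @ w + b >= 0, 1.0, -1.0)
    print(name, "w =", w, "b =", b, "all correct:", bool(np.all(y == t)))
# OR and AND converge; XOR will not (see the discussion of linear separability below)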
Limits of the PLR
• The Perceptron Learning Rule is only guaranteed to converge if the
dataset is linearly separable
• Given a dataset with a binary target (in { -1, 1 }), let’s call the data
vectors with class = 1 the positive examples
• Likewise, the data vectors with class=-1 are the negative examples
• A linearly separable dataset is one where a hyperplane separates the
positive and negative examples
• In such a case, the PLR is guaranteed to converge to the hyperplane
coefficients
Limits of the PLR
• Exercise: graph the aforementioned Boolean functions and attempt to
draw a hyperplane (a line in 2D) between the positive and negative
examples. What do you observe? Compare this with the results of the
computation of w for each case that you did earlier.
• As you can see, a very simple function (the XOR) produces a dataset that is
not linearly separable, and thus can not be learned via a perceptron. This is
a strong indictment of the Perceptron, given that XOR is trivially constructed
from a few ANDs, ORs, and NOTs!
• What else can one not do with a Perceptron?
Limits of the PLR
• Let’s first introduce the concept of a convex set
• A convex set S is one wherein, given any two points in the set, every point on
the line segment joining them (i.e., every convex combination of the two points)
is also in the set, i.e.:
• For all u, v in S and all λ with 0 ≤ λ ≤ 1: λu + (1 − λ)v is in S
• For a linearly separable problem, the region containing the positive examples
(a half-space) is convex, and likewise for the negative examples; consequently,
the convex hulls of the positive and negative examples cannot intersect
Limits of the PLR
• Let’s consider the XOR function.
• Suppose it were linearly separable. Then the convex hulls of the positive
examples and of the negative examples could not intersect.
• Draw the line segment joining the two -1 points: it lies inside the convex hull
of the negative examples.
• Likewise, the segment joining the two +1 points lies inside the convex hull of
the positive examples.
• Note, however, that the two segments intersect, meaning that the point of
intersection would belong to both hulls, which is a contradiction (see the quick
numeric check below).
• Hence the XOR is not linearly separable: no hyperplane can place the positive
and negative examples in different half-spaces.
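• A quick numeric check of the argument above, taking λ = 1/2 in the
convex-combination definition (variable names are illustrative):

import numpy as np

neg_mid = 0.5 * np.array([-1, -1]) + 0.5 * np.array([1, 1])    # midpoint of the two XOR = -1 points
pos_mid = 0.5 * np.array([-1, 1]) + 0.5 * np.array([1, -1])    # midpoint of the two XOR = +1 points
print(neg_mid, pos_mid)   # both are [0. 0.]: the two segments cross at the origin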
Limits of the Perceptron
• In looking at the limits of LC we see that the lack of linear separability
(LS) in the dataset is the problem
• Two cases:
• Dataset is LS but noisy
• Dataset is NOT LS
Addressing the Limits
• The PLR is great for linearly separable (LS) datasets (DS)
• What if our dataset is contaminated with noise?
• Then a fundamentally LS DS will not look LS!
• In this case we will need to optimize: find the best linear separator for most of
the data
• This will involve learning
• I.e., loss functions and gradient descent
• The next lecture will deal with this
• Recall our comments that classification, being a mapping from a real data space
to a discrete target space, is crucially different from the regression problem
• We’ll see why soon
Addressing the Limits
• Though useful at learning linearly separable datasets, the Perceptron
is limited and can’t learn even simple nonlinear functions
• Exercise: show that XOR is nonlinear
• Recall how LR was extended to polynomial regression via the feature
functions
• We’ll see that we can employ a similar tactic to make the Perceptron
learn the XOR (a small preview sketch follows below)
• But more than that, we’ll uncover a fundamental principle that will
guide us into more interesting ML territory: neural networks
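• A small preview sketch of that tactic, assuming the perceptron_train, X2, and
t_xor names from the earlier sketches; the particular feature used (the product
x1*x2) is just one illustrative choice, and the full development comes later:

import numpy as np

# Augment each input with a feature function: phi(x) = (x1, x2, x1*x2)
Phi = np.hstack([X2, (X2[:, 0] * X2[:, 1]).reshape(-1, 1)])
w, b = perceptron_train(Phi, t_xor)
y = np.where(Phi @ w + b >= 0, 1.0, -1.0)
print(bool(np.all(y == t_xor)))   # True: in this feature space, XOR is linearly separable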
Map so far
• Linear Regression (LR)
• Dataset characteristics
• Tunable model
• Optimization framework for the solution of LR: learning
• Linear Classification (LC)
• Dataset characteristics
• Tunable model (Perceptron)
• Perceptron Learning Rule: non-optimization framework for the solution of LC
• Next: Learning and Classification
• + Optimization framework for the solution of linear classification with noise
• Next next: Learning non-linear classification functions
• + Generalization of the model
• + How to optimize with this more general model
Linear Classification: Terminology
• Classification
• Linear Classification
• Binary Classification
• Model
• Perceptron Model
• Negative Examples
• Positive Examples
• Hypersurface
• Convex Subset
• Linearly Separable Problems
Required Reading
• N/A
Useful Reference Material
• An excellent reference for AI, ML, linear regression, and other related
topics is found in Artificial Intelligence: A Modern Approach (Third
Edition; by Russell and Norvig)
• Professor Grosse (Computer Science) has further excellent material:
• http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L03%20Linear%20Classifiers.pdf
• Neural Networks - A Systematic Introduction, Raul Rojas, Springer-Verlag,
Berlin, New York, 1996 (502 p., 350 illustrations)
• Chapters 3 and 4