Lec 3

The document discusses linear models and their application in authentication through secret questions, focusing on the concept of large margin classifiers and Support Vector Machines (SVM). It explains the optimization problems associated with SVMs, including the use of slack variables and the hinge loss function to balance classification accuracy and margin size. Additionally, it covers optimization techniques such as gradient descent and coordinate descent for finding optimal classifiers.

The Best Linear Model?

Authentication by Secret Questions

SERVER: "Give me your device ID and answer the following questions."
DEVICE: "TS271828182845"

  Question (challenge)   Answer (response)
  1. 10111100            1
  2. 00110010            0
  3. 10001110            1
  4. 00010100            0
  5. …                    …
Arbiter PUFs

If the top signal reaches the finish line first, the "answer" to this question is 0; else, if the bottom signal reaches the finish line first, the "answer" is 1.

[Figure: an arbiter PUF circuit being fed the question (challenge) 1011, one bit per stage, producing the answer 1?]
What just happened?

[Figure: an embedding of the challenges as points 𝐱0, 𝐱1, …, 𝐱9 in a feature space, with the two response classes separable by a line]

  Challenge   Response
  10111100    1
  00110010    0
  10001110    1
  00010100    0
  01101111    1
  01010111    1
  10100110    0
  10101001    0
  11010111    0
  00001010    1
Linear Models

We have a timing difference $\Delta = \mathbf{w}^\top\mathbf{x} + b$, where $\mathbf{x}$ is a feature vector derived from the challenge bits and $(\mathbf{w}, b)$ is determined by the wire delays of the device.

If $\Delta > 0$, the upper signal wins and the answer is 0
If $\Delta < 0$, the lower signal wins and the answer is 1
Thus, the answer is simply determined by the sign of $\mathbf{w}^\top\mathbf{x} + b$

This is nothing but a linear classifier!
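As a tiny illustration (not from the lecture), here is what this thresholding looks like in code; the feature vector, weights and bias below are made up, and the exact mapping from challenge bits to the feature vector is left abstract.

```python
import numpy as np

# Hypothetical example: some feature vector derived from an 8-bit challenge
# and some delay-related parameters (w, b) of the device.
x = np.array([1., -1., 1., 1., -1., -1., 1., -1.])   # assumed feature map of the challenge
w = np.array([0.3, -0.1, 0.7, -0.2, 0.05, 0.4, -0.6, 0.1])
b = -0.05

delta = w @ x + b                 # signed "timing difference" of the two signals
answer = 0 if delta > 0 else 1    # upper signal wins -> 0, lower signal wins -> 1
print(answer)
```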
The "Best" Linear Classifier 6

"It seems infinitely many classifiers perfectly classify the data. Which one should I choose?"

Indeed! Such models would be very brittle and might misclassify test data (i.e. predict the wrong class), even test data that look very similar to the train data.

It is better not to select a model whose decision boundary passes very close to a training data point.
Large Margin Classifiers 7

Fact: the distance of the origin from the hyperplane $\{\mathbf{x} : \mathbf{w}^\top\mathbf{x} + b = 0\}$ is $|b| / \|\mathbf{w}\|_2$.
Fact: the distance of a point $\mathbf{x}$ from this hyperplane is $|\mathbf{w}^\top\mathbf{x} + b| / \|\mathbf{w}\|_2$.

Given train data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ for a binary classification problem, where $\mathbf{x}^i \in \mathbb{R}^d$ and $y^i \in \{-1, +1\}$, we want two things from a classifier.

Demand 1: classify every point correctly – how to ask this politely?
One way: demand that for all $i$, $\mathrm{sign}(\mathbf{w}^\top\mathbf{x}^i + b) = y^i$
Easier way: demand that for all $i$, $y^i(\mathbf{w}^\top\mathbf{x}^i + b) > 0$

Demand 2: do not let any data point come close to the boundary
Demand that the geometric margin $\min_i \frac{|\mathbf{w}^\top\mathbf{x}^i + b|}{\|\mathbf{w}\|_2}$ be as large as possible
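A small sketch (with made-up data and weights) of checking Demand 1 and computing the quantity in Demand 2:

```python
import numpy as np

# Made-up 2D training set: X is n x d, y in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.5], [-1.0, -2.0], [-2.5, -0.5]])
y = np.array([+1, +1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

scores = X @ w + b
perfectly_classified = np.all(y * scores > 0)                   # Demand 1
geometric_margin = np.min(np.abs(scores)) / np.linalg.norm(w)   # Demand 2
print(perfectly_classified, geometric_margin)
```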
Support Vector Machines 8

"Support Vector Machine" is just a fancy way of saying: "Please find me a linear classifier that perfectly classifies the train data while keeping the data points as far away from the hyperplane as possible."

The mathematical way of writing this request is the following:

$$\max_{\mathbf{w}, b} \ \min_i \frac{|\mathbf{w}^\top\mathbf{x}^i + b|}{\|\mathbf{w}\|_2} \quad \text{such that} \quad y^i(\mathbf{w}^\top\mathbf{x}^i + b) > 0 \ \text{for all}\ i$$

This is known as an optimization problem, with an objective (the quantity being maximized) and constraints (the conditions that must hold).

"This looks so complicated, how will I ever find a solution to this optimization problem?" Let us simplify this optimization problem.
Constrained Optimization 101 9

Constraints are usually specified using math equations. The set of points that satisfy all the constraints is called the feasible set of the optimization problem.

HOW WE MUST SPEAK TO MELBO (objective + constraints):
$$\min_x \ f(x) \quad \text{s.t.} \quad x \ge 3 \ \text{and}\ x \le 6$$
Here the feasible set is the interval $[3, 6]$, and for the specified constraints, the optimal (least) value of $f$ is achieved at a point of this feasible set.

HOW WE SPEAK TO A HUMAN: "I want to find an unknown $x$ that gives me the best value according to this function $f$. Oh! and btw, not any $x$ would do! It must satisfy these conditions. All I am saying is: of the values of $x$ that satisfy my conditions, find me the one that gives the best (least) value according to $f$."

If, on the other hand, no point satisfies all the constraints, the feasible set is empty and the optimization problem has no solution.
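As a minimal sketch of handing an objective and constraints to a solver (the objective $f(x) = (x-1)^2$ here is a made-up choice, and the interval constraints mirror the example above), one could use scipy:

```python
from scipy.optimize import minimize

# Hypothetical objective: f(x) = (x - 1)^2, minimized over the feasible set [3, 6]
objective = lambda x: (x[0] - 1.0) ** 2

constraints = [
    {"type": "ineq", "fun": lambda x: x[0] - 3.0},  # x >= 3
    {"type": "ineq", "fun": lambda x: 6.0 - x[0]},  # x <= 6
]

result = minimize(objective, x0=[4.0], constraints=constraints, method="SLSQP")
print(result.x)  # the optimum lies at the boundary x = 3 of the feasible set
```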
Back to SVMs 10

Assume there do exist params $(\mathbf{w}, b)$ that perfectly classify all the train data. (What if the train data is non-linearly separable, i.e. no linear classifier can perfectly classify it? We will remove this assumption later.)

Consider one such $(\mathbf{w}, b)$ which classifies the train data perfectly.
Now, as $\mathrm{sign}(\mathbf{w}^\top\mathbf{x}^i + b) = y^i$ for every $i$, we have $|\mathbf{w}^\top\mathbf{x}^i + b| = y^i(\mathbf{w}^\top\mathbf{x}^i + b)$ for every $i$.
Thus, the geometric margin $\min_i \frac{|\mathbf{w}^\top\mathbf{x}^i + b|}{\|\mathbf{w}\|_2}$ is the same as $\min_i \frac{y^i(\mathbf{w}^\top\mathbf{x}^i + b)}{\|\mathbf{w}\|_2}$, since the model has perfect classification!
We will use this useful fact to greatly simplify the optimization problem.
Support Vector Machines 11

Recall that all this discussion holds only for a perfect classifier.
Let $\mathbf{x}^m$ be the data point that comes closest to the hyperplane, i.e. $m = \arg\min_i y^i(\mathbf{w}^\top\mathbf{x}^i + b)$.
Let $\gamma = y^m(\mathbf{w}^\top\mathbf{x}^m + b)$ and consider $\tilde{\mathbf{w}} = \mathbf{w}/\gamma$, $\tilde{b} = b/\gamma$.
Note this gives us $y^i(\tilde{\mathbf{w}}^\top\mathbf{x}^i + \tilde{b}) \ge 1$ for all $i$, as well as $\min_i y^i(\tilde{\mathbf{w}}^\top\mathbf{x}^i + \tilde{b}) = 1$ (as $\gamma > 0$).
Thus, instead of searching for $(\mathbf{w}, b)$, it is easier to search for $(\tilde{\mathbf{w}}, \tilde{b})$:

$$\min_{\tilde{\mathbf{w}}, \tilde{b}} \ \|\tilde{\mathbf{w}}\|_2^2 \quad \text{such that} \quad y^i(\tilde{\mathbf{w}}^\top\mathbf{x}^i + \tilde{b}) \ge 1 \ \text{for all}\ i$$
The C-SVM Technique 12

For linearly separable cases where we suspect a perfect classifier exists:

$$\min_{\mathbf{w}, b} \ \|\mathbf{w}\|_2^2 \quad \text{s.t.} \quad y^i(\mathbf{w}^\top\mathbf{x}^i + b) \ge 1 \ \text{for all}\ i$$

If a linear classifier cannot perfectly classify the data, then find the model using

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \|\mathbf{w}\|_2^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y^i(\mathbf{w}^\top\mathbf{x}^i + b) \ge 1 - \xi_i \ \text{for all}\ i, \ \text{as well as}\ \xi_i \ge 0 \ \text{for all}\ i$$

The terms $\xi_i$ are called slack variables (recall the English phrase "cut me some slack"). They allow some data points to come close to the hyperplane or be misclassified altogether.

"What prevents me from misusing the slack variables to learn a model that misclassifies every data point?" The term $C\sum_i \xi_i$ prevents you from doing so. If we set $C$ to a large value (it is a hyper-parameter), then it will penalize solutions that misuse slack too much. Having the constraint $\xi_i \ge 0$ prevents us from misusing slack to artificially inflate the margin.
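In practice, an objective of this regularized hinge-loss form can be handed to an off-the-shelf solver. Below is a minimal sketch using scikit-learn's LinearSVC on made-up data; its objective matches the C-SVM above up to the exact scaling of the regularizer, and the value of C shown is arbitrary.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up linearly separable data: X is n x d, y in {-1, +1}
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])

# C is the hyper-parameter trading off margin size against slack usage
clf = LinearSVC(C=10.0, loss="hinge")
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # learned (w, b)
print(clf.predict([[1.5, 1.0]]))   # predicted label for a new point
```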
From C-SVM to Loss Functions 13

We can further simplify the previous optimization problem.
Note that $\xi_i$ basically allows us to have $y^i(\mathbf{w}^\top\mathbf{x}^i + b) < 1$ (even $< 0$).
Thus, the amount of slack we want is just $\xi_i = 1 - y^i(\mathbf{w}^\top\mathbf{x}^i + b)$.
However, recall that we must also satisfy $\xi_i \ge 0$. Another way of saying this: if you already have $y^i(\mathbf{w}^\top\mathbf{x}^i + b) \ge 1$, then you don't need any slack, i.e. you should have $\xi_i = 0$ in this case.
Thus, we need only set $\xi_i = \left[1 - y^i(\mathbf{w}^\top\mathbf{x}^i + b)\right]_+$, where $[x]_+ = \max\{x, 0\}$.
The above is nothing but the popular hinge loss function!
Hinge Loss 14

Captures how well a classifier classified a data point.
Suppose on a data point $(\mathbf{x}, y)$, a model gives a prediction score of $\hat{y}$ (for a linear model $(\mathbf{w}, b)$, we have $\hat{y} = \mathbf{w}^\top\mathbf{x} + b$).
We obviously want $y \cdot \hat{y} > 0$ for correct classification, but we also want $y \cdot \hat{y} \ge 1$ for a large margin – the hinge loss function
$$\ell_{\text{hinge}}(y, \hat{y}) = \left[1 - y \cdot \hat{y}\right]_+$$
captures both.
Note that hinge loss not only penalizes misclassification but also penalizes correct classification if the data point gets too close to the hyperplane (i.e. $0 < y \cdot \hat{y} < 1$).
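A one-function sketch of the hinge loss and how it scores a misclassified point, a correct-but-too-close point, and a safely classified point (the scores are made up):

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Hinge loss [1 - y * y_hat]_+ for labels y in {-1, +1} and prediction scores y_hat."""
    return np.maximum(0.0, 1.0 - y * y_hat)

print(hinge_loss(+1, -0.5))  # 1.5 : misclassified, penalized heavily
print(hinge_loss(+1, 0.3))   # 0.7 : correct, but within the margin, still penalized
print(hinge_loss(+1, 2.0))   # 0.0 : correct with a comfortable margin
```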
Final Form of C-SVM 15

Recall that the C-SVM optimization finds a model by solving
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \|\mathbf{w}\|_2^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y^i(\mathbf{w}^\top\mathbf{x}^i + b) \ge 1 - \xi_i \ \text{for all}\ i, \ \text{as well as}\ \xi_i \ge 0 \ \text{for all}\ i$$
Using the previous discussion, we can rewrite the above very simply as the unconstrained problem
$$\min_{\mathbf{w}, b} \ \|\mathbf{w}\|_2^2 + C\sum_{i=1}^n \left[1 - y^i(\mathbf{w}^\top\mathbf{x}^i + b)\right]_+$$
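A short sketch of evaluating this rewritten, unconstrained objective on made-up data and a made-up model:

```python
import numpy as np

def csvm_objective(w, b, X, y, C):
    """||w||_2^2 + C * sum_i [1 - y_i (w.x_i + b)]_+  (the rewritten C-SVM objective)."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return w @ w + C * hinge.sum()

# Made-up data and model
X = np.array([[2.0, 1.0], [-1.0, -2.0], [0.2, 0.1]])
y = np.array([+1, -1, +1])
print(csvm_objective(np.array([1.0, 1.0]), 0.0, X, y, C=1.0))
```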
Use Calculus for Optimization 16

Method 1: First-order optimality condition
Exploits the fact that the gradient must vanish at a local optimum.
Also exploits the fact that for convex functions, local minima are global.
Warning: works only for simple convex functions when there are no constraints.
To do: given a convex function that we wish to minimize, try finding all the stationary points of the function (set the gradient to zero).
If you find only one, that has to be the global minimum!
Example: for a convex function such as $f(x) = x^2 - 2x + 3$, $f'(x) = 2x - 2 = 0$ only at $x = 1$; $f$ is cvx, i.e. $x = 1$ is the global min.
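A tiny sketch of Method 1 on the convex example above, using sympy to find the stationary point by setting the derivative to zero:

```python
import sympy as sp

x = sp.symbols("x")
f = x**2 - 2*x + 3                       # a simple convex function (illustrative choice)

stationary_points = sp.solve(sp.diff(f, x), x)   # set f'(x) = 0
print(stationary_points)                 # [1] : the only stationary point
print(f.subs(x, stationary_points[0]))   # 2   : hence the global minimum value
```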
Use Calculus for Optimization 17

Method 2: Perform (sub)gradient descent
Recall that the direction opposite to the gradient offers the steepest descent.

(SUB)GRADIENT DESCENT
1. Given: obj. func. $f$ to minimize
2. Initialize $\mathbf{w}^0$  (how to initialize $\mathbf{w}^0$?)
3. For $t = 1, 2, \ldots$
   1. Obtain a (sub)gradient $\mathbf{g}^t$ of $f$ at $\mathbf{w}^{t-1}$
   2. Choose a step length $\eta_t$  (often called "step length" or "learning rate" – how to choose $\eta_t$?)
   3. Update $\mathbf{w}^t \leftarrow \mathbf{w}^{t-1} - \eta_t \cdot \mathbf{g}^t$
4. Repeat until convergence  (what is convergence? how to decide if we have converged?)
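A minimal sketch of this loop in code; the objective, initialization, fixed step length, and gradient-norm stopping rule below are all illustrative choices, not the only options:

```python
import numpy as np

def gradient_descent(grad, w0, step=0.1, tol=1e-6, max_iters=1000):
    """Generic GD loop: repeatedly move opposite to the gradient until convergence."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad(w)                      # 1. obtain a (sub)gradient
        if np.linalg.norm(g) < tol:      #    declare convergence when it is tiny
            break
        w = w - step * g                 # 2-3. choose a step length and update
    return w

# Example: minimize f(w) = ||w - [3, -1]||^2, whose gradient is 2 (w - [3, -1])
grad = lambda w: 2.0 * (w - np.array([3.0, -1.0]))
print(gradient_descent(grad, w0=[0.0, 0.0]))   # converges near [3, -1]
```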
Gradient Descent (GD) 18

Move opposite to the gradient at every step. Choose the step length carefully, else you may overshoot the minimum even with a great initialization.

Also, initialization may affect the result: one initialization may be such that we converge to a local minimum, while a really nice initialization takes us to the global minimum.

With convex functions, all local minima are global minima, so we can afford to be less careful with initialization. Still, we need to be careful with step lengths, otherwise we may overshoot the global minimum.
Behind the scenes in GD for SVM 19

So gradient descent, although a mathematical tool from calculus, actually tries very actively to make the model perform better on all data points.

Consider the C-SVM objective (ignore the bias $b$ for now):
$$f(\mathbf{w}) = \|\mathbf{w}\|_2^2 + C\sum_{i=1}^n \left[1 - y^i \cdot \mathbf{w}^\top\mathbf{x}^i\right]_+$$
A subgradient is $\mathbf{g} = 2\mathbf{w} + C\sum_{i=1}^n \mathbf{g}^i$, where $\mathbf{g}^i = -y^i\mathbf{x}^i$ if $y^i \cdot \mathbf{w}^\top\mathbf{x}^i < 1$ and $\mathbf{g}^i = \mathbf{0}$ otherwise.

Assume for a moment, for the sake of understanding, that the step length is $\eta = 1$.
If $\mathbf{w}$ does well on $\mathbf{x}^i$, say $y^i \cdot \mathbf{w}^\top\mathbf{x}^i \ge 1$, then $\mathbf{g}^i = \mathbf{0}$: no change to $\mathbf{w}$ due to the data point $\mathbf{x}^i$.
If $\mathbf{w}$ does badly on $\mathbf{x}^i$, say $y^i \cdot \mathbf{w}^\top\mathbf{x}^i < 1$, then the update adds $C \cdot y^i\mathbf{x}^i$ to $\mathbf{w}$, so the new $\mathbf{w}$ may get a much better margin on $\mathbf{x}^i$ than the old one.

Small $\eta$: do not change $\mathbf{w}$ too much! Large $\eta$: feel free to change $\mathbf{w}$ as much as the gradient dictates.
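A sketch of these updates for the (bias-free) C-SVM objective on made-up data; the decaying step length $\eta_t = 1/t$ and the iteration count are arbitrary choices:

```python
import numpy as np

def svm_subgradient(w, X, y, C):
    """Subgradient of ||w||^2 + C * sum_i [1 - y_i w.x_i]_+ (no bias term)."""
    margins = y * (X @ w)
    violating = margins < 1                               # points the model does badly on
    g_data = -(y[violating, None] * X[violating]).sum(axis=0)
    return 2.0 * w + C * g_data                           # well-classified points contribute nothing

# Made-up data: X is n x d, y in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-1.0, -2.5]])
y = np.array([+1, +1, -1, -1])

w = np.zeros(2)
for t in range(1, 201):
    w -= (1.0 / t) * svm_subgradient(w, X, y, C=1.0)      # decaying step length
print(w, y * (X @ w))    # final weights and the margins they achieve
```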
Stochastic Gradient Method 20
, where
Calculating each takes time since - total
At each time, choose a random data point
- only time!!
Warning: may have to perform several SGD steps
than we had to do with GD but each SGD step is much
cheaper than a GD step
We take Doawerandom data
really need point to
to spend avoidallbeing
Initially, we need unlucky
is a
(also it is
so cheap)
much time on just one general direction in which
update? No,toSGD
move
gives a
Especially in the beginning,
cheaper way to
when we are far away from
perform gradient
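A sketch of the corresponding stochastic update, where each step looks at a single randomly chosen data point and rescales its contribution by $n$ (this rescaling convention is an assumption made here so that the estimate matches the full gradient in expectation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, X, y, C, step):
    """One SGD step for ||w||^2 + C * sum_i hinge_i, using a single random data point."""
    n = len(y)
    i = rng.integers(n)                                   # pick one point at random
    g_i = -y[i] * X[i] if y[i] * (X[i] @ w) < 1 else np.zeros_like(w)
    g = 2.0 * w + C * n * g_i                             # O(d) work instead of O(nd)
    return w - step * g

# Made-up data, same shape conventions as before
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-1.0, -2.5]])
y = np.array([+1, +1, -1, -1])
w = np.zeros(2)
for t in range(1, 501):
    w = sgd_step(w, X, y, C=1.0, step=1.0 / t)
print(w)
```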
Mini-batch SGD 21

If the data is very diverse, the "stochastic" gradient may vary quite a lot depending on which random data point is chosen.
This is called variance (more on this later), and it can slow down the SGD process – make it jittery.
One solution: choose more than one random point.
At each step, choose $B$ random data points ($B$ = mini-batch size) without replacement, say $S_t \subset \{1, \ldots, n\}$ with $|S_t| = B$, and use
$$\mathbf{g}^t = 2\mathbf{w} + C \cdot \frac{n}{B} \sum_{i \in S_t} \mathbf{g}^i$$
Takes $O(Bd)$ time to execute MBSGD – more expensive than SGD, but still much cheaper than GD when $B \ll n$.
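A short sketch of the mini-batch gradient estimate, sampling $B$ indices without replacement and rescaling by $n/B$ as above:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gradient(w, X, y, C, B):
    """Mini-batch estimate of the gradient of ||w||^2 + C * sum_i hinge_i."""
    n = len(y)
    idx = rng.choice(n, size=B, replace=False)            # B random points, no repeats
    Xb, yb = X[idx], y[idx]
    violating = yb * (Xb @ w) < 1
    g_batch = -(yb[violating, None] * Xb[violating]).sum(axis=0)
    return 2.0 * w + C * (n / B) * g_batch                # rescale so the estimate is unbiased

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-1.0, -2.5]])
y = np.array([+1, +1, -1, -1])
print(minibatch_gradient(np.zeros(2), X, y, C=1.0, B=2))
```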
Coordinate Descent 22

Sometimes we are able to optimize completely along a given variable (even if constraints are there) – this is called coordinate minimization (CM).

Coordinate descent is similar to GD except only one coordinate is changed in a single step.
E.g. to minimize $f(\mathbf{w})$, with $g_j$ denoting the $j$-th partial derivative $\frac{\partial f}{\partial w_j}$:

COORDINATE DESCENT
1. For $t = 1, 2, \ldots$
   1. Select a coordinate $j_t$
   2. Let $w^t_{j_t} \leftarrow w^{t-1}_{j_t} - \eta_t \cdot g_{j_t}(\mathbf{w}^{t-1})$
   3. Let $w^t_j \leftarrow w^{t-1}_j$ for $j \ne j_t$
   4. Repeat until convergence

Ways to select the coordinate:
CCD: choose coordinates cyclically, i.e. cycle through $j = 1, \ldots, d$ in order
SCD: choose $j_t$ randomly
Randperm: permute the coordinates randomly and choose them in that order; once the list is over, choose a new random permutation
Block CD: choose a small set of coordinates to update at each step
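A minimal sketch of cyclic coordinate descent (CCD) on a made-up smooth objective; the step length and epoch count are arbitrary choices:

```python
import numpy as np

def cyclic_coordinate_descent(grad, w0, step=0.2, n_epochs=100):
    """CCD: in each step, update only one coordinate, cycling through them in order."""
    w = np.asarray(w0, dtype=float)
    d = len(w)
    for _ in range(n_epochs):
        for j in range(d):               # CCD chooses coordinates cyclically
            g = grad(w)                  # full gradient; only its j-th entry is used
            w[j] -= step * g[j]
    return w

# Example: minimize f(w) = (w1 - 2)^2 + 3 (w2 + 1)^2
grad = lambda w: np.array([2.0 * (w[0] - 2.0), 6.0 * (w[1] + 1.0)])
print(cyclic_coordinate_descent(grad, w0=[0.0, 0.0]))   # converges near [2, -1]
```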