SML_Lecture5

The document covers the theory and algorithms of supervised machine learning, focusing on linear classification, including concepts like generalization error, model selection, and specific algorithms such as perceptron and logistic regression. It explains the geometric interpretation of linear classifiers, their properties, and the learning process involved in training these models. Additionally, it discusses the convergence of the perceptron algorithm and the loss function associated with it.

Uploaded by

mohamnaf.b


CS-E4715 Supervised Machine Learning

Lecture 5: Linear classification

Course topics

• Part I: Theory
  • Introduction
  • Generalization error analysis & PAC learning
  • Rademacher Complexity & VC dimension
  • Model selection
• Part II: Algorithms and models
  • Linear models: perceptron, logistic regression
  • Support vector machines
  • Kernel methods
  • Boosting
  • Neural networks (MLPs)
• Part III: Additional topics
  • Feature learning, selection and sparsity
  • Multi-class classification
  • Preference learning, ranking

Linear classification

• Input space $X \subset \mathbb{R}^d$: each $x \in X$ is a $d$-dimensional real-valued vector; output space $Y = \{-1, +1\}$
• Training sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn from an unknown distribution $D$
• Hypothesis class $H = \{x \mapsto \operatorname{sgn}(\sum_{j=1}^{d} w_j x_j + w_0) \mid w \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$ consists of functions $h(x) = \operatorname{sgn}(\sum_{j=1}^{d} w_j x_j + w_0)$ that map each example to one of the two classes
• $\operatorname{sgn}(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$ is the sign function
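The hypothesis class above can be sketched in a few lines of NumPy. This is only an illustration: the function names `sgn` and `predict` and the toy weights and inputs are assumptions, not part of the lecture.

```python
import numpy as np

def sgn(a):
    """Sign function as defined on the slide: +1 for a >= 0, -1 for a < 0."""
    return np.where(np.asarray(a) >= 0, 1, -1)

def predict(X, w, w0):
    """Evaluate h(x) = sgn(w^T x + w0) for every row x of X."""
    return sgn(X @ w + w0)

# Hypothetical toy parameters and inputs, chosen only for illustration.
w = np.array([2.0, -1.0])
w0 = -1.0
X = np.array([[1.0, 0.0],   # g = 2 - 1 = 1   -> +1
              [0.0, 1.0],   # g = -1 - 1 = -2 -> -1
              [0.5, 0.0]])  # g = 1 - 1 = 0   -> +1 (a tie goes to +1)
labels = predict(X, w, w0)
```

Note that a score of exactly zero is assigned to the class +1, following the definition of sgn used here.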

Linear classifiers

Linear classifiers

$$h(x) = \operatorname{sgn}\Big(\sum_{j=1}^{d} w_j x_j + w_0\Big) = \operatorname{sgn}(w^T x + w_0)$$

have several attractive properties

• They are fast to evaluate and take little space to store ($O(d)$ time and space)
• Easy to understand: $|w_j|$ shows the importance of variable $x_j$ (when the variables are on comparable scales) and its sign tells if the effect is positive or negative
• Linear models have relatively low complexity (e.g. VCdim $= d + 1$) so they can be reliably estimated from limited data

Good practice is to try a linear model before something more complicated

The geometry of the linear classifier

• The points $\{x \in X \mid g(x) = w^T x + w_0 = 0\}$ define a hyperplane in $\mathbb{R}^d$, where $d$ is the number of variables in $x$
• The hyperplane $g(x) = w^T x + w_0 = 0$ splits the input space into two half-spaces. The linear classifier predicts $+1$ for points in the half-space $\{x \in X \mid g(x) = w^T x + w_0 \ge 0\}$ and $-1$ for points in $\{x \in X \mid g(x) = w^T x + w_0 < 0\}$


• $w$ is the normal vector of the hyperplane $g(x) = w^T x + w_0 = 0$
• The distance of the hyperplane from the origin is $|w_0| / \|w\|$, where $\|w\| = \sqrt{\sum_j w_j^2}$ denotes the Euclidean norm
• If $w_0 < 0$ the hyperplane lies in the direction of $w$ from the origin; if $w_0 > 0$ it lies in the direction of $-w$ (for $w_0 = 0$ it passes through the origin)


• The value $g(x_0)$ tells where $x_0$ lies in relation to the hyperplane:
  • $g(x_0) > 0$: $x_0$ lies in the half-space that is in the direction of $w$ from the hyperplane
  • $g(x_0) = 0$: $x_0$ lies on the hyperplane
  • $g(x_0) < 0$: $x_0$ lies in the direction of $-w$ from the hyperplane
• The distance of a point $x_0$ from the hyperplane $g(x) = 0$ is $|g(x_0)| / \|w\|$
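The distance formulas can be checked numerically. The sketch below assumes a hypothetical hyperplane $3x_1 + 4x_2 - 5 = 0$ (so $\|w\| = 5$); the helper name `signed_distance` is an illustrative choice, not something defined in the lecture.

```python
import numpy as np

def signed_distance(x, w, w0):
    """Return g(x) / ||w||: positive on the side w points to, negative on the other."""
    return (w @ x + w0) / np.linalg.norm(w)

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w = np.array([3.0, 4.0])
w0 = -5.0

# Distance of the hyperplane from the origin: |w0| / ||w|| = 1.
origin_dist = abs(w0) / np.linalg.norm(w)

# A point on the side that w points to: (9 + 16 - 5) / 5 = 4.
d_pos = signed_distance(np.array([3.0, 4.0]), w, w0)

# The origin lies on the other side: -5 / 5 = -1.
d_neg = signed_distance(np.array([0.0, 0.0]), w, w0)
```

The absolute value of the signed distance is the distance $|g(x_0)| / \|w\|$ from the slide, and its sign tells on which side of the hyperplane the point lies.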

Learning linear classifiers
Change of representation

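• Consider the parameters $(w, w_0)$ of the linear function $g(x) = w^T x + w_0$
• For presentation it is convenient to subsume the term $w_0$ into the weight vector

$$w \Leftarrow \begin{bmatrix} w \\ w_0 \end{bmatrix}$$

  and augment all inputs with a constant 1:

$$x \Leftarrow \begin{bmatrix} x \\ 1 \end{bmatrix}$$

• The models give the same value for $x$:

$$\begin{bmatrix} w \\ w_0 \end{bmatrix}^T \begin{bmatrix} x \\ 1 \end{bmatrix} = w^T x + w_0$$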
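A short NumPy sketch of the change of representation, assuming randomly generated inputs and weights purely for illustration: appending a constant 1 to every input and appending $w_0$ to the weight vector leaves the value of $g(x)$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # five hypothetical inputs with d = 3
w = rng.normal(size=3)
w0 = 0.7

# Augment every input with a constant 1 and append w0 to the weights.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, w0)

g_original = X @ w + w0       # w^T x + w0
g_augmented = X_aug @ w_aug   # [w; w0]^T [x; 1]
```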

Geometric interpretation

• Geometrically, the hyperplane in the changed representation now passes through the origin
• The positive points have an acute angle with $w$: $w^T x \ge 0$
• The negative points have an obtuse angle with $w$: $w^T x < 0$

Checking for prediction errors

• When the labels are $Y = \{-1, +1\}$, for a training example $(x, y)$ and $g(x) = w^T x$ we have

$$\operatorname{sgn}(g(x)) = \begin{cases} y & \text{if } x \text{ is correctly classified} \\ -y & \text{if } x \text{ is incorrectly classified} \end{cases}$$

• Alternatively, we can just multiply by the correct label to check for misclassification:

$$y\, g(x) \begin{cases} \ge 0 & \text{if } x \text{ is correctly classified} \\ < 0 & \text{if } x \text{ is incorrectly classified} \end{cases}$$

Margin

• The geometric margin of a labeled example $(x, y)$ is given by $\gamma(x) = y\, g(x) / \|w\|$
• It takes into account both the distance $|w^T x| / \|w\|$ from the hyperplane, and whether $x$ is on the correct side of the hyperplane
• The unnormalized version of the margin is sometimes called the functional margin $\gamma(x) = y\, g(x)$
• Often the term margin is used for both variants, assuming the context makes clear which one is meant
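Both margin variants, and the misclassification check $y\,g(x) < 0$, can be sketched as follows. The weight vector and the three examples are hypothetical and use the augmented representation, so $g(x) = w^T x$.

```python
import numpy as np

def functional_margin(X, y, w):
    """Unnormalized margin y * g(x) with g(x) = w^T x."""
    return y * (X @ w)

def geometric_margin(X, y, w):
    """Margin normalized by the Euclidean norm of w."""
    return functional_margin(X, y, w) / np.linalg.norm(w)

w = np.array([3.0, 4.0])       # ||w|| = 5
X = np.array([[1.0, 3.0],      # g = 3 + 12 = 15
              [2.0, -4.0],     # g = 6 - 16 = -10
              [1.0, 0.5]])     # g = 3 + 2 = 5
y = np.array([1, -1, -1])

fm = functional_margin(X, y, w)   # [15, 10, -5]
gm = geometric_margin(X, y, w)    # [3, 2, -1]
misclassified = fm < 0            # only the third example is on the wrong side
```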

Perceptron

• The perceptron algorithm by Frank Rosenblatt (1956) is perhaps the first machine learning algorithm
• Its purpose was to learn a linear function separating two classes
• It was built in hardware and shown to be capable of performing rudimentary pattern recognition tasks
• New York Times in 1958: “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” (Source: Wikipedia)

[Figure: Mark I perceptron ca. 1958 (Picture: Wikipedia)]

The perceptron algorithm

• The perceptron algorithm learns a hyperplane separating two classes, $g(x) = w^T x$
• It incrementally processes a set of training examples
  • At each step, it finds a training example $x_i$ that is incorrectly classified by the current model
  • It updates the model by adding the example, multiplied by its label, to the current weight vector: $w^{(t+1)} \leftarrow w^{(t)} + y_i x_i$
  • This process is continued until no incorrectly predicted training examples remain


Input: Training set $S = \{(x_i, y_i)\}_{i=1}^{m}$, $x \in \mathbb{R}^d$, $y \in \{-1, +1\}$
Initialize $w^{(1)} \leftarrow (0, \ldots, 0)$, $t \leftarrow 1$, stop ← FALSE
repeat
    if exists $i$, s.t. $y_i\, w^{(t)T} x_i \le 0$ then
        $w^{(t+1)} \leftarrow w^{(t)} + y_i x_i$
    else
        stop ← TRUE
    end if
    $t \leftarrow t + 1$
until stop
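The pseudocode above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `perceptron`, the `max_updates` safety cap (added so the loop also ends on non-separable data) and the toy data are assumptions. The inputs are taken to be already augmented with a constant 1.

```python
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """Perceptron on augmented inputs X (m x d) with labels y in {-1, +1}.

    max_updates is a safety cap that is not part of the pseudocode.
    Returns the weight vector, the number of updates, and whether all
    training examples end up with a strictly positive margin.
    """
    w = np.zeros(X.shape[1])
    n_updates = 0
    while n_updates < max_updates:
        # Indices of examples with y_i * w^T x_i <= 0 (treated as mistakes).
        mistakes = np.flatnonzero(y * (X @ w) <= 0)
        if mistakes.size == 0:
            break
        i = mistakes[0]              # any misclassified example would do
        w = w + y[i] * X[i]          # update rule w <- w + y_i x_i
        n_updates += 1
    converged = bool(np.all(y * (X @ w) > 0))
    return w, n_updates, converged

# Hypothetical linearly separable toy data; the last column is the constant 1.
X = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, n_updates, converged = perceptron(X, y)
```

Picking the first misclassified example is one arbitrary choice; the pseudocode only requires that some misclassified example is used.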

Understanding the update rule

• Let us examine the update rule

$$w^{(t+1)} \leftarrow w^{(t)} + y_i x_i$$

• We can see that the margin of the example $(x_i, y_i)$ increases after the update

$$\begin{aligned} y_i\, g^{(t+1)}(x_i) &= y_i\, w^{(t+1)T} x_i = y_i (w^{(t)} + y_i x_i)^T x_i \\ &= y_i\, w^{(t)T} x_i + y_i^2\, x_i^T x_i = y_i\, g^{(t)}(x_i) + \|x_i\|^2 \\ &\ge y_i\, g^{(t)}(x_i) \end{aligned}$$

• Note that this does not guarantee that $y_i\, g^{(t+1)}(x_i) > 0$ after the update; further updates may be required to achieve that
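A small numeric check of the derivation, with hypothetical values for $w^{(t)}$ and $(x_i, y_i)$: the margin grows by exactly $\|x_i\|^2$, yet the example is still misclassified after one update.

```python
import numpy as np

# Hypothetical current weights and a misclassified example.
w = np.array([1.0, -5.0, 0.5])
x_i = np.array([1.0, 1.0, 1.0])
y_i = 1

before = y_i * (w @ x_i)          # 1 - 5 + 0.5 = -3.5
w_new = w + y_i * x_i             # perceptron update
after = y_i * (w_new @ x_i)       # -3.5 + ||x_i||^2 = -0.5
gain = after - before             # equals ||x_i||^2 = 3
```

Here the margin improves from -3.5 to -0.5 but stays negative, which illustrates the note above that one update need not be enough.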

Perceptron animation

• Assume $w^{(t)}$ has been found by running the algorithm for $t$ steps
• We notice two misclassified examples
• Select the misclassified example $(\phi(x_i), -1)$
• Note: $\phi(x_i)$ is here some transformation of $x_i$, e.g. with some basis functions, but it could be the identity $\phi(x) = x$

[Figure: positive and negative training points, the weight vector $w^{(t)}$, the half-spaces $w^{(t)T}\phi > 0$ and $w^{(t)T}\phi < 0$, and the selected point $\phi(x_i)$]

• Update the weight vector: $w^{(t+1)} = w^{(t)} + y_i \phi(x_i)$

[Figure: the vector $-\phi(x_i)$ added to $w^{(t)}$]

• The update tilts the hyperplane to make the example “more correct”, i.e. $w^T \phi(x_i)$ becomes more negative
• We repeat the process by finding the next misclassified example $\phi(x_{i+1})$ and update: $w^{(t+2)} = w^{(t+1)} + y_{i+1} \phi(x_{i+1})$

[Figure: the updated weight vector $w^{(t+1)}$ and the next misclassified point $\phi(x_{i+1})$]

• Next iterations

[Figure: the weight vector $w^{(t+2)}$ and the hyperplanes after further updates]

• Finally we have found a hyperplane that correctly classifies the training points
• We can stop the iteration and output the final weight vector

[Figure: the final hyperplane separating the positive and negative points]
Convergence of the perceptron algorithm

• The perceptron algorithm can be shown to eventually converge to a consistent hyperplane if the two classes are linearly separable, that is, if there exists a hyperplane that separates the two classes
• Theorem (Novikoff):
  • Let $S = \{(x_i, y_i)\}_{i=1}^{m}$ be a linearly separable training set.
  • Let $R = \max_{x_i \in S} \|x_i\|$.
  • Let there exist a vector $w^*$ that satisfies $\|w^*\| = 1$ and $y_i (w^{*T} x_i + b_{opt}) \ge \gamma$ for $i = 1, \ldots, m$.
  • Then the perceptron algorithm will stop after at most $t \le \left(\frac{2R}{\gamma}\right)^2$ iterations and output a weight vector $w^{(t)}$ for which $y_i\, w^{(t)T} x_i \ge 0$ for all $i = 1, \ldots, m$


The number of iterations in the bound $t \le \left(\frac{2R}{\gamma}\right)^2$ depends on:

• $\gamma$: The largest achievable geometric margin so that all training examples have at least that margin
• $R$: The smallest radius of the $d$-dimensional ball (centered at the origin) that encloses the training data
• Intuitively: how large the margin is relative to the distances of the training points

[Figure: training data enclosed in a ball of radius $R$, separated with margin $\gamma$ by a hyperplane with normal vector $w$, $\|w\| = 1$]

However, the Perceptron algorithm does not stop on a non-separable training set, since there will always be a misclassified example that causes an update
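The bound can be compared with the actual number of updates on a small example. In this sketch the toy data, the unit-norm separator `w_star` (with $b_{opt} = 0$) and the helper `count_perceptron_updates` are all assumptions made for illustration; the margin of this particular `w_star` is not claimed to be the largest achievable one, but the bound holds for any separator that satisfies the conditions of the theorem.

```python
import numpy as np

def count_perceptron_updates(X, y, max_updates=100_000):
    """Run the perceptron from w = 0 and return the number of updates made."""
    w = np.zeros(X.shape[1])
    for n in range(max_updates):
        mistakes = np.flatnonzero(y * (X @ w) <= 0)
        if mistakes.size == 0:
            return n
        i = mistakes[0]
        w = w + y[i] * X[i]
    raise RuntimeError("no separating hyperplane found within max_updates")

# Hypothetical separable data (no bias term needed for this example).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0],
              [-2.0, -1.0], [0.5, 0.5], [-0.5, -0.5]])
y = np.array([1, 1, -1, -1, 1, -1])

w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)   # assumed unit-norm separator
R = np.max(np.linalg.norm(X, axis=1))          # radius of the data
gamma = np.min(y * (X @ w_star))               # margin achieved by w_star
bound = (2.0 * R / gamma) ** 2                 # Novikoff bound (2R / gamma)^2
updates = count_perceptron_updates(X, y)
```

On easy data such as this the algorithm typically needs far fewer updates than the worst-case bound allows.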
The loss function of the Perceptron algorithm

It can be shown that the Perceptron algorithm is using the following loss:

$$L_{Perceptron}(y, w^T x) = \max(0, -y\, w^T x)$$

• $y\, w^T x$ is the margin
• if $y\, w^T x < 0$, a loss of $-y\, w^T x$ is incurred, otherwise no loss is incurred
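A sketch of the Perceptron loss and of how it relates to the update rule. The function names and the example values are illustrative assumptions. For a misclassified example, $-y x$ is a subgradient of the loss with respect to $w$, so a unit step against it gives $w + y x$, which is the perceptron update.

```python
import numpy as np

def perceptron_loss(y, score):
    """max(0, -y * score), where score = w^T x."""
    return np.maximum(0.0, -y * score)

def perceptron_subgradient(w, x, y):
    """A subgradient of the loss w.r.t. w: -y*x if y * w^T x <= 0, else 0."""
    if y * (w @ x) <= 0:
        return -y * x
    return np.zeros_like(w)

# Hypothetical labels and scores: the margins are 2, -0.5, 3, -1.5.
losses = perceptron_loss(np.array([1, 1, -1, -1]),
                         np.array([2.0, -0.5, -3.0, 1.5]))

# One unit step against the subgradient reproduces w <- w + y x.
w = np.array([0.0, -1.0])
x = np.array([1.0, 2.0])
w_next = w - perceptron_subgradient(w, x, 1)
```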

Convexity of Perceptron loss

A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y$, and $0 \le \theta \le 1$, we have

$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y).$$

• Geometrical interpretation: the graph of a convex function lies below the line segment from $(x, f(x))$ to $(y, f(y))$
• It is easy to see that Perceptron loss is convex but zero-one loss is not convex
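The convexity inequality can be probed numerically by treating both losses as functions of the margin $z = y\, w^T x$. This is a sanity check under assumed random test points, not a proof; the helper `violates_convexity` and the tolerance are illustrative choices.

```python
import numpy as np

def perceptron_loss(z):
    """Perceptron loss as a function of the margin z."""
    return np.maximum(0.0, -z)

def zero_one_loss(z):
    """Zero-one loss as a function of the margin: 1 if z < 0, else 0."""
    return np.where(z < 0, 1.0, 0.0)

def violates_convexity(f, a, b, theta, tol=1e-12):
    """True where f(theta*a + (1-theta)*b) > theta*f(a) + (1-theta)*f(b)."""
    lhs = f(theta * a + (1 - theta) * b)
    rhs = theta * f(a) + (1 - theta) * f(b)
    return lhs > rhs + tol

rng = np.random.default_rng(0)
a = rng.uniform(-5, 5, size=1000)
b = rng.uniform(-5, 5, size=1000)
theta = rng.uniform(0, 1, size=1000)

# No violations are expected for the (convex) Perceptron loss.
perceptron_violations = int(np.sum(violates_convexity(perceptron_loss, a, b, theta)))

# Counterexample for zero-one loss: a = -2, b = 2, theta = 0.6 gives the
# point -0.4 with loss 1, while the chord value is 0.6 * 1 + 0.4 * 0 = 0.6.
zero_one_counterexample = bool(violates_convexity(zero_one_loss, -2.0, 2.0, 0.6))
```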


• The convexity of the Perceptron loss has an important consequence: every local minimum is also a global minimum
• In principle we can minimize it with incremental updates that gradually decrease the loss
• In contrast, finding a hyperplane that minimizes the zero-one loss is computationally hard (NP-hard to minimize training error)
• However, we need better algorithms than the Perceptron: ones that terminate when we are close to the optimum

Logistic regression

Logistic regression is a classification technique (despite the name)

• It gets its name from the logistic function

$$\phi_{logistic}(z) = \frac{1}{1 + \exp(-z)} = \frac{\exp(z)}{1 + \exp(z)}$$

  that maps a real-valued input $z$ onto the interval $0 < \phi_{logistic}(z) < 1$
• The function is an example of sigmoid (“S” shaped) functions
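The two equivalent forms of the logistic function are useful in code: picking the form by the sign of $z$ avoids overflow in `exp`. The sketch below is one common way to do this; the function name `logistic` is an illustrative choice.

```python
import numpy as np

def logistic(z):
    """Logistic function, evaluated without overflow for large |z|.

    Uses 1 / (1 + exp(-z)) for z >= 0 and exp(z) / (1 + exp(z)) for z < 0,
    so that exp is only ever called on non-positive arguments.
    """
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

values = logistic([-1000.0, 0.0, 1000.0])
```

Mathematically the output lies strictly between 0 and 1; in floating point it saturates to exactly 0 or 1 for very large $|z|$.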

Logistic function: a probabilistic interpretation

• The logistic function $\phi_{logistic}(z)$ is the inverse of the logit function
• The logit function is the logarithm of the odds ratio of the probability $p$ of an event happening vs. the probability of the event not happening, $1 - p$:

$$z = \operatorname{logit}(p) = \log \frac{p}{1 - p} = \log p - \log(1 - p)$$

• Thus the logistic function

$$\phi_{logistic}(z) = \operatorname{logit}^{-1}(z) = \frac{1}{1 + \exp(-z)}$$

  answers the question “what is the probability $p$ that gives the log odds ratio of $z$”
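The inverse relationship can be verified with a round trip. The probabilities used here are arbitrary illustrative values.

```python
import numpy as np

def logit(p):
    """Log odds: log p - log(1 - p)."""
    return np.log(p) - np.log1p(-p)

def logistic(z):
    """Inverse of logit: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)           # log odds; p = 0.5 gives z = 0
p_back = logistic(z)   # recovers the original probabilities
```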

Logistic regression

• The logistic regression model assumes an underlying conditional probability:

$$\Pr(y|x) = \frac{\exp(+\frac{1}{2} y\, w^T x)}{\exp(+\frac{1}{2} y\, w^T x) + \exp(-\frac{1}{2} y\, w^T x)}$$

  where the denominator normalizes the right-hand side to be between zero and one.
• Dividing the numerator and denominator by $\exp(+\frac{1}{2} y\, w^T x)$ reveals the logistic function

$$\Pr(y|x) = \phi_{logistic}(y\, w^T x) = \frac{1}{1 + \exp(-y\, w^T x)}$$

• The margin $z = y\, w^T x$ is thus interpreted as the log odds ratio of label $y$ vs. label $-y$ given input $x$: $y\, w^T x = \log \frac{\Pr(y|x)}{\Pr(-y|x)}$
• Note: these equations assume the labels $Y = \{-1, +1\}$. With labels $Y = \{0, 1\}$ the equations will be slightly different.
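Two properties of the model can be checked numerically: the probabilities of the two labels sum to one, and the log odds equal $w^T x$ (the margin for $y = +1$). The weight vector and input below are hypothetical.

```python
import numpy as np

def prob(y, x, w):
    """Pr(y | x) = 1 / (1 + exp(-y * w^T x)) for y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-y * (w @ x)))

# Hypothetical weights and input: w^T x = 1 - 0.5 + 0.25 = 0.75.
w = np.array([0.5, -1.0, 0.25])
x = np.array([2.0, 0.5, 1.0])

p_pos = prob(+1, x, w)
p_neg = prob(-1, x, w)
log_odds = np.log(p_pos / p_neg)   # should equal w^T x
```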

Logistic loss

• Consider the maximization of the likelihood of the observed input-output pairs in the training data:

$$w^* = \operatorname{argmax}_w \prod_{i=1}^{m} P(y_i|x_i) = \operatorname{argmax}_w \prod_{i=1}^{m} \frac{1}{1 + \exp(-y_i\, w^T x_i)}$$

• Since the logarithm is a monotonically increasing function, we can take the logarithm to obtain an equivalent objective:

$$\sum_{i=1}^{m} \log \Pr(y_i|x_i) = -\sum_{i=1}^{m} \log(1 + \exp(-y_i\, w^T x_i))$$

• Each term of the sum on the right-hand side is the logistic loss:

$$L_{logistic}(y, w^T x) = \log(1 + \exp(-y\, w^T x))$$

• Minimizing the logistic loss corresponds to maximizing the likelihood of the training data
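The equivalence between the log-likelihood and the negative sum of logistic losses can be checked on random data. The data and weights are generated only for illustration; `np.logaddexp(0, -z)` is used as a numerically stable way to evaluate $\log(1 + \exp(-z))$.

```python
import numpy as np

def logistic_loss(y, score):
    """log(1 + exp(-y * score)), computed stably via logaddexp."""
    return np.logaddexp(0.0, -y * score)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))           # hypothetical inputs
y = rng.choice([-1, 1], size=10)       # hypothetical labels
w = rng.normal(size=3)
scores = X @ w

# Likelihood as a product of per-example probabilities.
likelihood = np.prod(1.0 / (1.0 + np.exp(-y * scores)))

# Sum of logistic losses over the same examples.
total_loss = np.sum(logistic_loss(y, scores))
```

The product form underflows quickly as the number of examples grows, which is a practical reason for working with the sum of losses instead.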
Geometric interpretation of Logistic loss

$$L_{logistic}(y, w^T x) = \log(1 + \exp(-y\, w^T x))$$

• Logistic loss is convex and differentiable
• It is a monotonically decreasing function of the margin $y\, w^T x$
• The loss changes fast when the margin is highly negative ⟹ penalization of examples far in the incorrect half-space
• It changes slowly for highly positive margins ⟹ does not give extra bonus for being very far in the correct half-space
Logistic regression optimization problem

• To train a logistic regression model, we need to find the $w$ that minimizes the average logistic loss $J(w) = \frac{1}{m} \sum_{i=1}^{m} L_{logistic}(y_i, w^T x_i)$ over the training set:

$$\min_{w} J(w) = \frac{1}{m} \sum_{i=1}^{m} \log(1 + \exp(-y_i\, w^T x_i))$$

  w.r.t. parameters $w \in \mathbb{R}^d$
• The function to be minimized is continuous and differentiable
• However, it is a non-linear function so it is not easy to find the optimum directly (e.g. unlike in linear regression)
• We will use stochastic gradient descent to incrementally step in the direction where the objective decreases fastest, the negative gradient

Gradient

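• The gradient is the vector of partial derivatives of the objective function $J(w)$ with respect to all parameters $w_j$

$$\nabla J(w) = \frac{1}{m} \sum_{i=1}^{m} \nabla J_i(w) = \frac{1}{m} \sum_{i=1}^{m} \left[ \frac{\partial}{\partial w_1} J_i(w), \ldots, \frac{\partial}{\partial w_d} J_i(w) \right]^T$$

• Compute the gradient by using the regular rules for differentiation. For the logistic loss we have

$$\begin{aligned} \frac{\partial}{\partial w_j} J_i(w) &= \frac{\partial}{\partial w_j} \log(1 + \exp(-y_i\, w^T x_i)) = \frac{\exp(-y_i\, w^T x_i)}{1 + \exp(-y_i\, w^T x_i)} \cdot (-y_i x_{ij}) \\ &= -\frac{1}{1 + \exp(y_i\, w^T x_i)}\, y_i x_{ij} = -\phi_{logistic}(-y_i\, w^T x_i)\, y_i x_{ij} \end{aligned}$$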
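The analytic gradient can be compared against a central finite-difference approximation. This is a sketch on randomly generated data; the function names, the step `eps` and the data are assumptions made for the check.

```python
import numpy as np

def J(w, X, y):
    """Average logistic loss over the training set."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def grad_J(w, X, y):
    """Analytic gradient: mean over i of -phi(-y_i w^T x_i) * y_i * x_i."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))   # -phi(-margin) * y_i
    return (coeff[:, None] * X).mean(axis=0)

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.choice([-1, 1], size=20)
w = rng.normal(size=3)

analytic = grad_J(w, X, y)
numeric = numerical_grad(lambda v: J(v, X, y), w)
```

A check of this kind is a cheap way to catch sign or indexing errors before running the optimizer.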

Stochastic gradient descent

• We collect the partial derivatives with respect to a single training example into a vector:

$$\nabla J_i(w) = \begin{bmatrix} -(\phi_{logistic}(-y_i\, w^T x_i)\, y_i) \cdot x_{i1} \\ \vdots \\ -(\phi_{logistic}(-y_i\, w^T x_i)\, y_i) \cdot x_{ij} \\ \vdots \\ -(\phi_{logistic}(-y_i\, w^T x_i)\, y_i) \cdot x_{id} \end{bmatrix} = -\phi_{logistic}(-y_i\, w^T x_i)\, y_i \cdot x_i$$

• The vector $-\nabla J_i(w)$ gives the update direction that decreases the loss on training example $(x_i, y_i)$ fastest


• Evaluating the full gradient

$$\nabla J(w) = \frac{1}{m} \sum_{i=1}^{m} \nabla J_i(w) = -\frac{1}{m} \sum_{i=1}^{m} \phi_{logistic}(-y_i\, w^T x_i)\, y_i \cdot x_i$$

  is costly since we need to process all training examples
• Stochastic gradient descent instead uses a series of smaller updates that depend on a single randomly drawn training example $(x_i, y_i)$ at a time
  • The update direction is taken as $-\nabla J_i(w)$
  • Its expectation is the full negative gradient:

$$-E_{i=1,\ldots,m}[\nabla J_i(w)] = -\nabla J(w)$$

• Thus on average, the updates match those of using the full gradient

Stochastic gradient descent algorithm

Initialize $w = 0$; $t = 1$;
repeat
    Draw a training example $(x, y)$ uniformly at random;
    Compute the update direction corresponding to the training example: $\Delta w = -\nabla J_t(w)$;
    Determine a stepsize $\eta_t$;
    Update $w = w - \eta_t \nabla J_t(w)$;
    $t = t + 1$;
until stopping criterion satisfied
Output $w$;
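The algorithm above can be sketched for the logistic loss as follows. This is a minimal sketch under stated assumptions: the function name `sgd_logistic`, the fixed iteration budget `max_iter` used as the stopping criterion, the stepsize rule $\eta_t = \eta_0 / t$, the value of `eta0` and the toy data are all illustrative choices.

```python
import numpy as np

def sgd_logistic(X, y, eta0=1.0, max_iter=5000, seed=0):
    """SGD on the average logistic loss with labels y in {-1, +1}.

    Stops after max_iter iterations (an assumed stopping criterion) and
    uses the stepsize eta_t = eta0 / t (an assumed stepsize rule).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, max_iter + 1):
        i = rng.integers(X.shape[0])                  # draw one example uniformly
        margin = y[i] * (X[i] @ w)
        grad_i = -y[i] * X[i] / (1.0 + np.exp(margin))  # gradient of J_i at w
        eta_t = eta0 / t
        w = w - eta_t * grad_i
    return w

def average_logistic_loss(w, X, y):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

# Hypothetical separable toy data; the last column is the constant 1.
X = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

w = sgd_logistic(X, y)
final_loss = average_logistic_loss(w, X, y)   # loss at w = 0 is log(2)
train_error = np.mean(y * (X @ w) <= 0)
```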

Stepsize selection

Consider the SGD update: $w = w - \eta_t \nabla J_t(w)$

• The stepsize parameter $\eta_t$, also called the learning rate, is critical for convergence to the optimum value
• If one uses a small constant stepsize, the initial convergence may be unnecessarily slow
• Too large a stepsize may cause the method to continually overshoot the optimum

Source: https://dunglai.github.io/2017/12/21/gradient-descent/
Diminishing stepsize

• We can use a diminishing stepsize by starting with a larger initial stepsize, controlled by the hyperparameter $\eta_0 > 0$
• In each iteration, the initial stepsize is divided by the iteration counter $t > 0$:

$$\eta_t = \frac{\eta_0}{t}$$

• Caution: In practice, finding a good value for the hyperparameter $\eta_0$ requires experimenting with several values

Source: https://dunglai.github.io/2017/12/21/gradient-descent/
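The diminishing schedule is simple to write down; the sketch below uses a hypothetical helper name and an arbitrary $\eta_0 = 0.5$ to show the first few stepsizes.

```python
def diminishing_stepsize(eta0, t):
    """Return eta_t = eta0 / t for eta0 > 0 and iteration counter t >= 1."""
    if eta0 <= 0 or t < 1:
        raise ValueError("eta0 must be positive and t must be at least 1")
    return eta0 / t

# First four stepsizes for a hypothetical eta0 = 0.5:
# 0.5, 0.25, 0.1666..., 0.125
steps = [diminishing_stepsize(0.5, t) for t in range(1, 5)]
```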
Stopping criterion

When should we stop the algorithm? Some possible choices:

1. Set a maximum number of iterations, after which the algorithm terminates
   • This needs to be separately calibrated for each dataset to avoid premature termination
2. Gradient of the objective: if we are at an optimum point $w^*$ of $J(w)$, the gradient vanishes, $\nabla J(w^*) = 0$, so we can stop when $\|\nabla J(w)\| < \gamma$, where $\gamma$ is some user-defined parameter
3. It is usually sufficient to train until the zero-one error on the training data no longer changes
   • This usually happens before the logistic loss converges
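The first two criteria can be combined in one loop. For simplicity this sketch uses full-gradient descent rather than SGD, so that the gradient norm of $J$ is available at every step; the function names, the constant stepsize `eta`, the threshold `tol` (the parameter called $\gamma$ above) and the toy data are assumptions.

```python
import numpy as np

def grad_J(w, X, y):
    """Gradient of the average logistic loss."""
    coeff = -y / (1.0 + np.exp(y * (X @ w)))
    return (coeff[:, None] * X).mean(axis=0)

def zero_one_error(w, X, y):
    """Fraction of training examples with non-positive margin."""
    return np.mean(y * (X @ w) <= 0)

def train(X, y, eta=0.5, max_iter=1000, tol=1e-3):
    """Gradient descent that stops on a small gradient norm or at max_iter."""
    w = np.zeros(X.shape[1])
    for t in range(1, max_iter + 1):
        g = grad_J(w, X, y)
        if np.linalg.norm(g) < tol:      # criterion 2: gradient nearly vanishes
            return w, t, "gradient"
        w = w - eta * g
    return w, max_iter, "max_iter"       # criterion 1: iteration budget used up

# Hypothetical separable toy data; the last column is the constant 1.
X = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

w, iterations, reason = train(X, y)
error = zero_one_error(w, X, y)
```

On separable data the zero-one error reaches zero long before the gradient norm becomes small, which is the situation described in the third criterion.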

Summary

• Linear classification models are an important class of machine learning models: they are used as standalone models and appear as building blocks of more complicated, non-linear models
• Perceptron is a simple algorithm to train linear classifiers on linearly separable data
• Logistic regression is a classification method that can be interpreted as maximizing odds ratios of conditional class probabilities
• Stochastic gradient descent is an efficient optimization method for large data that is nowadays very widely used
