Chapter 4: Classification

4.1 Classification
Classification is a machine learning problem seeking to map from inputs R^d to outputs in an unordered set (in contrast to a continuous real-valued output, as we saw for linear regression). Examples of classification output sets could be {apples, oranges, pears} if we're trying to figure out what type of fruit we have, or {heart attack, no heart attack} if we're working in an emergency room and trying to give the best medical care to a new patient. We focus on an essential simple case, binary classification, where we aim to find a mapping from R^d to two outputs. While we should think of the outputs as not having an order, it's often convenient to encode them as {−1, +1}. As before, let the letter h (for hypothesis) represent a classifier, so the classification process looks like:

$$x \rightarrow h \rightarrow y .$$
We will assume that each x(i) is a d × 1 column vector. The intended meaning of this data is
that, when given an input x(i) , the learned hypothesis should generate output y(i) .
What makes a classifier useful? As in regression, we want it to work well on new data,
making good predictions on examples it hasn’t seen. But we don’t know exactly what
data this classifier might be tested on when we use it in the real world. So, we have to
assume a connection between the training data and testing data; typically, they are drawn
independently from the same probability distribution.
In classification, we will often use 0-1 loss for evaluation (as discussed in Section 1.3).
For that choice, we can write the training error and the testing error. In particular, given a
training set Dn and a classifier h, we define the training error of h to be
$$E_n(h) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1 & \text{if } h(x^{(i)}) \neq y^{(i)} \\ 0 & \text{otherwise} \end{cases} .$$
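As a concrete illustration, here is a minimal numpy sketch of this computation; the names (`training_error`, `h`, `X`, `Y`) are our own, and we assume the data is stored with one d × 1 column per example:

```python
import numpy as np

def training_error(h, X, Y):
    """Average 0-1 loss of classifier h on the training set.

    X is d x n (one column per example); Y is 1 x n with entries in
    {-1, +1}; h maps a d x 1 column vector to -1 or +1.
    """
    n = X.shape[1]
    mistakes = sum(h(X[:, i:i+1]) != Y[0, i] for i in range(n))
    return mistakes / n
```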
For now, we will try to find a classifier with small training error (later, with some added
criteria) and hope it generalizes well to new data, and has a small test error
$$E(h) = \frac{1}{n'} \sum_{i=n+1}^{n+n'} \begin{cases} 1 & \text{if } h(x^{(i)}) \neq y^{(i)} \\ 0 & \text{otherwise} \end{cases}$$
on n ′ new examples that were not used in the process of finding the classifier.
We begin by introducing the hypothesis class of linear classifiers (Section 4.2) and then
define an optimization framework to learn linear logistic classifiers (Section 4.3).
4.2 Linear classifiers

A linear classifier maps an input x ∈ R^d to a label via h(x; θ, θ0) = sign(θ^T x + θ0), where θ is a d × 1 parameter vector, θ0 is a scalar offset, and sign(z) = +1 if z > 0 and −1 otherwise. The equation θ^T x + θ0 = 0 defines the separator, a hyperplane with normal vector θ.

Example: Let h be the linear classifier defined by $\theta = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$, $\theta_0 = 1$.
The diagram below shows the θ vector (in green) and the separator it defines:
[Figure: the separator θ^T x + θ0 = 0 in the (x1, x2) plane, with the normal vector θ (in green) and its components θ1, θ2.]
What is θ0? We can solve for it by plugging a point on the line into the equation for the line. It is often convenient to choose a point on one of the axes, e.g., in this case, x = [0, 1]^T, for which $\theta^T \begin{bmatrix} 0 \\ 1 \end{bmatrix} + \theta_0 = 0$, giving θ0 = 1.
In this example, the separator divides Rd , the space our x(i) points live in, into two half-
spaces. The one that is on the same side as the normal vector is the positive half-space, and
we classify all points in that space as positive. The half-space on the other side is negative
and all points in it are classified as negative.
Note that we will call a separator a linear separator of a dataset if all of the data with one label falls on one side of the separator and all of the data with the other label falls on the other side. For instance, the separator in the next example is a linear separator for the illustrated data. If there exists a linear separator for a dataset, we call the dataset linearly separable.
Example: Let h be the linear classifier defined by $\theta = \begin{bmatrix} -1 \\ 1.5 \end{bmatrix}$, $\theta_0 = 3$.

The diagram below shows several points classified by h. In particular, let $x^{(1)} = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$ and $x^{(2)} = \begin{bmatrix} 4 \\ -1 \end{bmatrix}$.

$$h(x^{(1)}; \theta, \theta_0) = \text{sign}\left( \begin{bmatrix} -1 & 1.5 \end{bmatrix} \begin{bmatrix} 3 \\ 2 \end{bmatrix} + 3 \right) = \text{sign}(3) = +1$$

$$h(x^{(2)}; \theta, \theta_0) = \text{sign}\left( \begin{bmatrix} -1 & 1.5 \end{bmatrix} \begin{bmatrix} 4 \\ -1 \end{bmatrix} + 3 \right) = \text{sign}(-2.5) = -1$$

Thus, x^(1) and x^(2) are given positive and negative classifications, respectively.
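To make the example concrete, here is a small numpy sketch (names are our own) that reproduces these two computations:

```python
import numpy as np

def linear_classify(x, theta, theta_0):
    """Linear classifier h(x; theta, theta_0) = sign(theta^T x + theta_0).

    x and theta are d x 1 column vectors; theta_0 is a scalar.
    Following the convention above, the boundary case z = 0 maps to -1.
    """
    return 1 if (theta.T @ x + theta_0).item() > 0 else -1

theta, theta_0 = np.array([[-1.0], [1.5]]), 3.0
print(linear_classify(np.array([[3.0], [2.0]]), theta, theta_0))   # +1
print(linear_classify(np.array([[4.0], [-1.0]]), theta, theta_0))  # -1
```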
[Figure: the separator θ^T x + θ0 = 0, with x^(1) on the positive side and x^(2) on the negative side.]
Study Question: What is the green vector normal to the separator? Specify it as a
column vector.
Study Question: What change would you have to make to θ, θ0 if you wanted to
have the separating hyperplane in the same place, but to classify all the points la-
beled ’+’ in the diagram as negative and all the points labeled ’-’ in the diagram as
positive?
However, even for simple linear classifiers, it is very difficult to find values for θ, θ0 that minimize the simple 0-1 training error

$$J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} L_{01}(\text{sign}(\theta^T x^{(i)} + \theta_0), y^{(i)}) .$$
This problem is NP-hard, which probably implies that solving the most difficult instances of this problem would require computation time exponential in the number of training examples, n. (The "probably" here is not because we're too lazy to look it up, but actually because of a fundamental unsolved problem in computer-science theory, known as "P vs. NP.")

What makes this a difficult optimization problem is its lack of "smoothness":

• There can be two hypotheses, (θ, θ0) and (θ′, θ0′), where one is closer in parameter space to the optimal parameter values (θ*, θ0*), but they make the same number of misclassifications and so have the same J value.
• All predictions are categorical: the classifier can’t express a degree of certainty about
whether a particular input x should have an associated value y.
For these reasons, if we are considering a hypothesis θ, θ0 that makes five incorrect predictions, it is difficult to see how we might change θ, θ0 so that it will perform better, which makes it difficult to design an algorithm that searches in a sensible way through the space of hypotheses for a good one. Instead, we investigate another hypothesis class: linear logistic classifiers, providing their definition, then an approach for learning such classifiers using optimization.
4.3 Linear logistic classifiers

A linear logistic classifier (LLC) produces its output by applying the sigmoid function, σ(z) = 1/(1 + e^(−z)), to a linear function of the input, giving h(x; θ, θ0) = σ(θ^T x + θ0). The sigmoid is plotted below:

[Figure: the sigmoid σ(z), increasing smoothly from 0 toward 1 with σ(0) = 0.5, plotted for z from −4 to 4.]
Study Question: Convince yourself the output of σ is always in the interval (0, 1).
Why can’t it equal 0 or equal 1? For what value of z does σ(z) = 0.5?
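For reference, a one-line numpy sketch of the sigmoid (our own code, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)); output strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~0.99995, close to but never exactly 1
```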
What does an LLC look like? Let’s consider the simple case where d = 1, so our input
points simply lie along the x axis. Classifiers in this case have dimension 0, meaning that
they are points. The plot below shows LLCs for three different parameter settings: σ(10x +
1), σ(−2x + 1), and σ(2x − 3).
[Figure: the three LLC curves σ(θx + θ0) plotted against x for the three parameter settings above; each is a sigmoid with its own steepness, direction, and crossing point.]
Study Question: Which plot is which? What governs the steepness of the curve?
What governs the x value where the output is equal to 0.5?
But wait! Remember that the definition of a classifier is that it’s a mapping from Rd →
{−1, +1} or to some other discrete set. So, then, it seems like an LLC is actually not a
classifier!
Given an LLC, with an output value in (0, 1), what should we do if we are forced to
make a prediction in {+1, −1}? A default answer is to predict +1 if σ(θT x + θ0 ) > 0.5 and
−1 otherwise. The value 0.5 is sometimes called a prediction threshold.
In fact, for different problem settings, we might prefer to pick a different prediction
threshold. The field of decision theory considers how to make this choice. For example, if
the consequences of predicting +1 when the answer should be −1 are much worse than
the consequences of predicting −1 when the answer should be +1, then we might set the
prediction threshold to be greater than 0.5.
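As a sketch of how such a threshold might be exposed in code (our own illustration; `llc_predict` and its argument layout are assumptions, not from the notes):

```python
import numpy as np

def llc_predict(x, theta, theta_0, threshold=0.5):
    """Turn the LLC's output probability into a hard label in {-1, +1}.

    Raising threshold above 0.5 makes the classifier more reluctant
    to predict +1, as discussed above.
    """
    z = (theta.T @ x + theta_0).item()   # scalar linear score
    p = 1.0 / (1.0 + np.exp(-z))         # sigmoid
    return +1 if p > threshold else -1
```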
Study Question: Using a prediction threshold of 0.5, for what values of x do each of
the LLCs shown in the figure above predict +1?
When d = 2, our inputs x lie in a two-dimensional space with axes x1 and x2, and the output of the LLC is a surface, as shown below for θ = (1, 1)^T, θ0 = 2.

[Figure: the surface σ(θ^T x + θ0) plotted over the (x1, x2) plane.]
Study Question: Convince yourself that the set of points for which σ(θT x + θ0 ) = 0.5,
that is, the “boundary” between positive and negative predictions with prediction
threshold 0.5, is a line in (x1 , x2 ) space. What particular line is it for the case in the
figure above? How would the plot change for θ = (1, 1), but now with θ0 = −2? For
θ = (−1, −1), θ0 = 2?
To find a good LLC, we want parameters that assign high probability to the training labels. Using labels y^(i) ∈ {0, 1} and writing g^(i) = σ(θ^T x^(i) + θ0) for the classifier's "guess" on example i, the probability the hypothesis assigns to the whole training set is

$$\prod_{i=1}^{n} \begin{cases} g^{(i)} & \text{if } y^{(i)} = 1 \\ 1 - g^{(i)} & \text{otherwise} \end{cases} \;=\; \prod_{i=1}^{n} \left(g^{(i)}\right)^{y^{(i)}} \left(1 - g^{(i)}\right)^{1 - y^{(i)}} .$$

Study Question: Be sure you can see why these two expressions are the same.
The big product above is kind of hard to deal with in practice, though. So what can we
do? Because the log function is monotonic, the θ, θ0 that maximize the quantity above will
be the same as the θ, θ0 that maximize its log, which is the following:
$$\sum_{i=1}^{n} \left( y^{(i)} \log g^{(i)} + (1 - y^{(i)}) \log(1 - g^{(i)}) \right) .$$
Finally, we can turn the maximization problem above into a minimization problem by taking the negative of the above expression, and write it in terms of minimizing a loss:

$$\sum_{i=1}^{n} L_{\text{nll}}(g^{(i)}, y^{(i)}) ,$$

where $L_{\text{nll}}(g, y) = -\left( y \log g + (1 - y) \log(1 - g) \right)$ is the negative log-likelihood loss.
This loss function is also sometimes referred to as the log loss or cross entropy. (You can use any base for the logarithm and it won't make any real difference. If we ask you for numbers, use log base e.)

What is the objective function for linear logistic classification? We can finally put all these pieces together and develop an objective function for optimizing regularized negative log-likelihood for a linear logistic classifier. (That's a lot of fancy words!) In fact, this process is usually called "logistic regression," so we'll call our objective J_lr and define it as

$$J_{\text{lr}}(\theta, \theta_0; \mathcal{D}) = \left( \frac{1}{n} \sum_{i=1}^{n} L_{\text{nll}}(\sigma(\theta^T x^{(i)} + \theta_0), y^{(i)}) \right) + \lambda \, \|\theta\|^2 . \tag{4.1}$$
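As a sanity check on Eq. 4.1, here is a direct numpy transcription (a sketch under our own conventions: X is d × n with one column per example, Y is 1 × n with labels in {0, 1}, and `lam` stands for λ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_loss(g, y):
    """L_nll(g, y) = -(y log g + (1 - y) log(1 - g)), elementwise."""
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def J_lr(theta, theta_0, X, Y, lam):
    """Eq. 4.1: average NLL loss plus lam * ||theta||^2."""
    g = sigmoid(theta.T @ X + theta_0)   # 1 x n vector of guesses
    return float(np.mean(nll_loss(g, Y)) + lam * np.sum(theta ** 2))
```

Note that the NLL loss pairs naturally with {0, 1} labels; if your data uses {−1, +1}, convert first.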
Study Question: Consider the case of linearly separable data. What will the θ values
that optimize this objective be like if λ = 0? What will they be like if λ is very big?
Try to work out an example in one dimension with two data points.
What role does regularization play for classifiers? This objective function has the same
structure as the one we used for regression, Eq. 2.2, where the first term (in parentheses)
is the average loss, and the second term is for regularization. Regularization is needed
for building classifiers that can generalize well (just as was the case for regression). The
parameter λ governs the trade-off between the two terms as illustrated in the following
example.
Suppose we wish to obtain a linear logistic classifier for this one-dimensional dataset:
[Figure: a one-dimensional dataset with labels y ∈ {0, 1} plotted against x ∈ [−8, 8]; the negative examples sit well to the left of x = 0 and the positive examples well to the right, leaving a wide gap around x = 0.]
Clearly, this can be fit very nicely by a hypothesis h(x) = σ(θx), but what is the best value
for θ? Evidently, when there is no regularization (λ = 0), the objective function Jlr (θ) will
approach zero for large values of θ, as shown in the plot on the left, below. However, would
the best hypothesis really have an infinite (or very large) value for θ? Such a hypothesis
would suggest that the data indicate strong certainty that a sharp transition between y = 0
and y = 1 occurs exactly at x = 0, despite the actual data having a wide gap around x = 0.
[Figure: two plots of J_lr(θ) as a function of θ ∈ [−4, 4]. Left (λ = 0): the objective keeps decreasing toward zero as θ grows. Right (λ = 0.2): the objective has a minimum at a moderate value of θ.]
Study Question: Be sure this makes sense. When the θ values are very large, what
does the sigmoid curve look like? Why do we say that it has a strong certainty in
that case?
In the absence of other beliefs about the solution, we might prefer that our linear logistic classifier not be overly certain about its predictions, and so we might prefer a smaller θ over a large θ. By not being overconfident, we might expect a somewhat smaller θ to perform better on future examples drawn from this same distribution. This preference can be realized using a nonzero value of the regularization trade-off parameter, as illustrated in the plot on the right, above, with λ = 0.2. (To refresh some vocabulary: we say that, in this example, a very large θ would be overfit to the training data.)

Another nice way of thinking about regularization is that we would like to prevent our hypothesis from being too dependent on the particular training data that we were given: we would like it to be the case that if the training data were changed slightly, the hypothesis would not change by much.
Study Question: Use these last two results to verify our derivation above.
Putting everything together, our gradient descent algorithm for logistic regression be-
comes:
8: until $|J_{\text{lr}}(\theta^{(t)}, \theta_0^{(t)}) - J_{\text{lr}}(\theta^{(t-1)}, \theta_0^{(t-1)})| < \epsilon$
9: return $\theta^{(t)}, \theta_0^{(t)}$
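Here is a Python sketch of this procedure (our own, not the notes' pseudocode). We assume the standard gradients $\nabla_\theta J_{\text{lr}} = \frac{1}{n}\sum_i (g^{(i)} - y^{(i)}) x^{(i)} + 2\lambda\theta$ and $\partial J_{\text{lr}}/\partial \theta_0 = \frac{1}{n}\sum_i (g^{(i)} - y^{(i)})$, zero initialization, and a fixed step size `eta`; the loop stops with the criterion from step 8:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_gradient_descent(X, Y, lam, eta, eps, max_iters=100_000):
    """Gradient descent on J_lr. X is d x n; Y is 1 x n with labels in {0, 1}."""
    d, n = X.shape
    theta, theta_0 = np.zeros((d, 1)), 0.0

    def J(th, th0):
        g = sigmoid(th.T @ X + th0)
        return float(np.mean(-(Y * np.log(g) + (1 - Y) * np.log(1 - g)))
                     + lam * np.sum(th ** 2))

    prev = J(theta, theta_0)
    for _ in range(max_iters):
        g = sigmoid(theta.T @ X + theta_0)                       # 1 x n guesses
        theta = theta - eta * (X @ (g - Y).T / n + 2 * lam * theta)
        theta_0 = theta_0 - eta * float(np.mean(g - Y))
        cur = J(theta, theta_0)
        if abs(prev - cur) < eps:                                # step 8's criterion
            break
        prev = cur
    return theta, theta_0
```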
To see that L_nll is convex in z = θ^T x + θ0, write L_nll(σ(z), y) = y f1(z) + (1 − y) f2(z), with f1(z) = −log(σ(z)) and f2(z) = −log(1 − σ(z)). First, since f1(z) = log(1 + exp(−z)) and d/dz f1(z) = σ(z) − 1, the derivative of the function f1(z) is a monotonically increasing function, and therefore f1 is a convex function.
Second, we can see that, since

$$\frac{d}{dz} f_2(z) = \frac{d}{dz}\left[-\log\frac{\exp(-z)}{1+\exp(-z)}\right] = \frac{d}{dz}\left[\log(1+\exp(-z)) + z\right] = \sigma(z) ,$$

the derivative of the function f2(z) is also monotonically increasing, and therefore f2 is a convex function.
For multiclass classification with K classes, the hypothesis produces a vector g = softmax(z), where softmax maps a vector z ∈ R^K to a vector with entries

$$g_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} ,$$

which can be interpreted as a probability distribution over K items. To make the final prediction of the class label, we can then look at g, find the most likely probability over these K entries (i.e., find the largest entry in g), and return the corresponding index as the position of the 1 in our "one-hot" prediction.
Study Question: Convince yourself that the vector of g values will be non-negative
and sum to 1.
h(x; θ, θ0 ) = softmax(θT x + θ0 ) .
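A minimal numpy sketch of this multiclass hypothesis (our own illustration; we assume θ is d × K, θ0 is K × 1, and x is d × 1):

```python
import numpy as np

def softmax(z):
    """Map a K x 1 score vector to a probability distribution over K classes."""
    e = np.exp(z - np.max(z))    # subtracting the max avoids overflow
    return e / np.sum(e)

def multiclass_predict(x, theta, theta_0):
    """Return the index of the largest entry of g = softmax(theta^T x + theta_0)."""
    g = softmax(theta.T @ x + theta_0)
    return int(np.argmax(g))
```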
Now, we retain the goal of maximizing the probability that our hypothesis assigns to the correct output for each input x. Letting g stand for our "guess" h(x), we can write this probability, for a single example (x, y) with one-hot label y, as $\prod_{k=1}^{K} g_k^{y_k}$.
Study Question: How many elements that are not equal to 1 will there be in this
product?
The negative log of the probability that we are making a correct guess is, then, for one-hot vector y and probability distribution vector g,

$$L_{\text{nllm}}(g, y) = -\sum_{k=1}^{K} y_k \cdot \log(g_k) .$$
We'll call this loss NLLM, for negative log-likelihood multiclass. It is worth noting that the NLLM loss function is also convex; however, we will omit the proof.
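For completeness, a direct transcription of L_nllm (a sketch; `g` and `y` are K × 1 arrays as above):

```python
import numpy as np

def nllm_loss(g, y):
    """-sum_k y_k log(g_k) for one-hot y and probability vector g."""
    return float(-np.sum(y * np.log(g)))

# With one-hot y, this just picks out -log of the probability
# assigned to the true class:
g = np.array([[0.7], [0.2], [0.1]])
y = np.array([[0.0], [1.0], [0.0]])
print(nllm_loss(g, y))   # -log(0.2), about 1.609
```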
Study Question: Be sure you see that L_nllm is minimized when the guess assigns high probability to the true class.
where g(i) is the final guess for one class or the other that we make from h(x(i) ), e.g., after
thresholding. It’s noteworthy here that we use a different loss function for optimization
than for evaluation. This is a compromise we make for computational ease and efficiency.