Lecture01 Introduction Annotated

|section|Neural networks|

|video|https://www.youtube.com/embed/MrZvXcwQJdg?si=BRr6mIPzcjbE1N_a|

Lecture 1: Introduction
Peter Bloem
Deep Learning
dlvu.github.io

In this lecture, we will discuss the basics of neural networks. What they are, and how to use them.

THE PLAN

part 1: neural networks

part 2: classification and regression

part 3: autoencoders

PART ONE: NEURAL NETWORKS

RECAP: NEURAL NETWORKS

We’ll start with a quick recap of the basic principles behind neural networks. We expect you’ve seen most of this before, but it’s worth revisiting the most important parts, and setting up our names and notation for them.
The name neural network comes from neurons, the cells that make up most of our brain and nervous system. A neuron receives multiple different input signals from other cells through connections called dendrites. It processes these in a relatively simple way, deriving a single new output signal, which it sends out through its single axon. The axon branches out so that the single signal can reach multiple other cells.

image source: http://www.sciencealert.com/scientists-build-an-artificial-neuron-that-fully-mimics-a-human-brain-cell
In the very early days of AI (the late 1950s),
researchers decided to try a simple approach: the
brain is the only intelligent system we know, and the
brain is made of neurons, so why don’t we simply
model neurons in a computer?

1957: THE PERCEPTRON

The idea of a neuron needed to be radically simplified to work with computers of that age, but doing so yielded one of the first successful machine learning systems: the perceptron.

The perceptron has a number of inputs, each of which is multiplied by a weight. The result is summed over all weights and inputs, together with a bias parameter, to provide the output of the perceptron:

y = w1x1 + w2x2 + b

If we're doing binary classification, we can take the sign of the output as the class (if the output is bigger than 0 we predict class A, otherwise class B).

The bias parameter is often represented as a special input node, called a bias node, whose value is fixed to 1.
For most of you, this will be nothing new. This is simply
a linear classifier or linear regression model. It just
happens to be drawn as a network.
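As a concrete sketch (a NumPy illustration, not part of the original slides), the perceptron's computation looks like this:

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs plus a bias: y = w1*x1 + w2*x2 + b."""
    return float(np.dot(w, x) + b)

x = np.array([0.5, -1.0])      # two inputs
w = np.array([2.0, 1.0])       # one weight per input
b = 0.5                        # bias parameter
y = perceptron(x, w, b)        # 2*0.5 + 1*(-1.0) + 0.5 = 0.5
label = "A" if y > 0 else "B"  # binary classification by the sign of the output
```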
But the real power of the brain doesn't come from
single neurons, it comes from chaining a large number
of neurons together. Can we do the same thing with
perceptrons: link the outputs of one perceptron to the
inputs of the next in a large network, and so make the
whole more powerful than any single perceptron?

PROBLEM: COMPOSING NEURONS

This is where the perceptron turns out to be too simple an abstraction. Composing perceptrons (making the output of one perceptron the input of another) doesn’t make them more powerful.

As an example, we’ve chained three perceptrons together. We can write down the function computed by this network. Working out the brackets gives us a simple linear function of four arguments, or equivalently, a single perceptron with 4 inputs:

y = 1(1x1 + 3x2) + 2(2x3 + 1x4) = 1x1 + 3x2 + 4x3 + 2x4

This will always happen, no matter how we chain the perceptrons together. This is because perceptrons are linear functions. Composing linear functions will only ever give you another linear function. We’re not creating models that can learn non-linear functions.
We’ve removed the bias node here for clarity, but that
doesn’t affect our conclusions: any composition of
affine functions is itself an affine function.
If we're going to build networks of perceptrons that do
anything a single perceptron can't do, we need
another trick.
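We can check this collapse numerically; a small sketch (illustrative, not from the slides):

```python
import numpy as np

# Two linear layers with no activation collapse into one linear layer:
# W2 @ (W1 @ x) equals (W2 @ W1) @ x for every input x.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # first layer: 4 inputs -> 3 units
W2 = rng.normal(size=(1, 3))   # second layer: 3 units -> 1 output
W = W2 @ W1                    # the equivalent single perceptron layer

x = rng.normal(size=4)
assert np.allclose(W2 @ (W1 @ x), W @ x)
```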

NONLINEARITY

The simplest solution is to apply a nonlinear function to each neuron, called the activation function. This is a scalar function (a function from a number to another number) which we apply to the output of a perceptron after all the weighted inputs have been combined:

y = σ(w1x1 + w2x2 + b)

One popular option (especially in the early days) is the logistic sigmoid. The sigmoid takes the range of numbers from negative to positive infinity and squishes them down to the interval between 0 and 1:

sigmoid(x) = 1 / (1 + e^-x)

Another, more recent nonlinearity is the linear rectifier, or ReLU nonlinearity. This function just sets every negative input to zero, and keeps everything else the same:

r(x) = x if x > 0, 0 otherwise
Not using an activation function is also called using a
linear activation.
If you're familiar with logistic regression, you've seen
the sigmoid function already: it's stuck on the end of a
linear regression function (that is, a perceptron) to turn
the outputs into class probabilities. Now, we will take
these sigmoid outputs, and feed them as inputs to
other perceptrons.
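The two activation functions above can be sketched directly (a NumPy illustration, not part of the slides):

```python
import numpy as np

def sigmoid(x):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Sets negative inputs to zero and keeps the rest unchanged."""
    return np.maximum(0.0, x)

sigmoid(0.0)   # 0.5: the midpoint of the sigmoid
relu(-3.0)     # 0.0
relu(2.5)      # 2.5
```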

FEEDFORWARD NETWORK

Using these nonlinearities, we can arrange single neurons into neural networks. Any arrangement of perceptrons and nonlinearities makes a neural network, but for ease of training, the arrangement shown here was the most popular for a long time.

It’s called a feedforward network or multilayer perceptron (MLP). We arrange a layer of hidden units in the middle, each of which acts as a perceptron with a nonlinearity, connecting to all input nodes. Then we have one or more output nodes, connecting to all nodes in the hidden layer. Crucially:

• There are no cycles: the network “feeds forward” from input to output.
• Nodes in the same layer are not connected to each other, or to any layer other than the previous one.
• Each layer is fully connected to the previous layer: every node in one layer connects to every node in the layer before it.

In the 80s and 90s feedforward networks usually had just one hidden layer, because we hadn’t figured out how to train deeper networks. Later, we began to see neural networks with more hidden layers, but still following these basic rules.

Note: every orange and blue line in this picture represents one parameter of the model.
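A minimal sketch of the forward pass of such a network (illustrative NumPy, with a sigmoid on the hidden layer; the shapes and random weights are assumptions for the example):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W1, b1, W2, b2):
    """Feedforward network: sigmoid hidden layer, linear output layer."""
    h = sigmoid(W1 @ x + b1)   # hidden units: perceptrons + nonlinearity
    return W2 @ h + b2         # output node(s), fully connected to the hidden layer

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)  # 2 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)  # 3 hidden units -> 1 output
y = mlp_forward(np.array([1.0, -1.0]), W1, b1, W2, b2)
```

Every entry of W1, b1, W2, and b2 is one parameter of the model, matching the lines in the slide's diagram.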

REGRESSION

With that, let’s see how we can use such a feedforward network to attack some basic machine learning problems.

If we want to train a regression model (a model that predicts a numeric value), we put non-linearities on the hidden nodes, and no activation on the output node. That way, the output can range from negative to positive infinity, and the nonlinearities on the hidden layer ensure that we can learn functions that a single perceptron couldn’t learn.

We can think of the first layer as learning some nonlinear transformation of the inputs, the features in machine learning parlance, and we can think of the second layer as performing linear regression on these derived, nonlinear features.

LOSS

The next step is to figure out a loss function. This tells you how well your network is doing with its current weights. The lower the loss, the better you are doing.

Here’s what that looks like for a simple regression problem. We feed the network somebody’s age and weight, and we ask it to predict their blood pressure. We compare the predicted blood pressure y to the true blood pressure t (which we assume is given by the data). The loss should then be a value that is high if the prediction is very wrong, and that gets lower as the prediction gets closer to the truth.

A simple loss function for regression is the squared error:

loss = (y - t)²

We just take the difference between the prediction y and the truth t, and we square it. This gives us a value that is 0 for a perfect prediction and that gets larger as the difference between y and t gets bigger.

It’s nice if the loss is zero when the prediction is perfect, but this isn’t required.

The loss can be defined for a single instance (as it is here) or for all instances in the data. Usually, the loss over the whole data is just the average loss over all instances.
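In code, the per-instance squared error and the averaged loss over a dataset might look like this (a sketch; the blood-pressure numbers are made up for illustration):

```python
def squared_error(y, t):
    """Squared error for a single instance: 0 only for a perfect prediction."""
    return (y - t) ** 2

def average_loss(ys, ts):
    """Loss over the whole data: the average loss over all instances."""
    return sum(squared_error(y, t) for y, t in zip(ys, ts)) / len(ys)

squared_error(120.0, 125.0)   # 25.0: predicted bp 120, true bp 125
```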
HOW DO WE FIND GOOD WEIGHTS?

With the loss function decided, we can start looking for weights (i.e. model parameters) that result in a low loss over the data.

The better our model, the lower the loss. If we imagine a model with just two weights, then the set of all models, called the model space, forms a plane. For every point in this plane, our loss function defines a loss:

modelθ(x) = y
lossx,t(θ) = (modelθ(x) - t)²

We can draw this above the plane as a surface: the loss surface (sometimes also called, more poetically, the loss landscape).

Make sure you understand the difference between the model, a function from the inputs x to the outputs y in which the weights act as constants, and the loss function, a function from the weights to a loss value, in which the data acts as constants.

The symbol θ is a common notation referring to the set of all weights of a model (sometimes combined into a vector, sometimes just a set).

LOSS SURFACE

Here’s what a loss surface might look like for a model with just two parameters.

Our job is to search the loss surface for a low point. When the loss is low, the model predictions are close to the target labels, and we’ve found a model that does well.

This is a common way of summarizing this aim of machine learning. We have a large space of possible parameters, with θ representing a single choice, and in this space we want to find the θ for which the loss on our chosen dataset is minimized:

arg minθ lossdata(θ)

It turns out this is actually an oversimplification, and we don’t want to solve this particular problem too well. We’ll discuss this in a future lecture. For now, this serves as a reasonable summary of what we’re trying to do.
DERIVATIVE

So, how do we find the lowest point on a surface? This is where calculus comes in.

In one dimension, we can approximate a function at a particular point by finding the tangent line at that point: the line that just touches the function without crossing it. The slope of the tangent line tells us how much the line rises if we take one step to the right. This is a good indication of how much the function itself rises as well at the point at which we took the tangent.

The slope of the tangent line is called the derivative.

If you are entirely new to derivatives, you should brush up a little. See these slides for a place to start.

GRADIENT

If our input space has multiple dimensions, like our model space, we can simply take a derivative with respect to each input separately, treating the others as constants. This is called a partial derivative. The collection of all possible partial derivatives is called the gradient:

∇θ lossx,t(θ) = (∂loss/∂w1, ∂loss/∂w2)

The partial derivatives of the loss surface, one for each model weight, tell us how much the loss falls or rises if we increase each weight. Clearly, this information can help us decide how to change the weights.

If we interpret the gradient as a vector, we get an arrow in the model space. This arrow points in the direction in which the function grows the fastest. Taking a step in the opposite direction means we are descending the loss surface.

In our case, this means that if we can work out the gradient of the loss, then we can take a small step in the opposite direction and be sure that we are moving to a lower point on the loss surface. Or, to put it differently, be sure that we are improving our model.

The symbol for the gradient is a downward pointing triangle called a nabla. The subscript indicates the variable over which we are taking the derivatives. Note that in this case we are treating θ as a vector.
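To build intuition for what a partial derivative is, each one can be approximated numerically by nudging one weight and holding the others constant (a finite-difference sketch, not something from the slides; the toy loss surface is an assumption for the example):

```python
def numerical_gradient(loss, theta, eps=1e-6):
    """Approximate each partial derivative of loss at theta with a finite difference."""
    grad = []
    for i in range(len(theta)):
        bumped = list(theta)
        bumped[i] += eps              # nudge one weight, treat the others as constants
        grad.append((loss(bumped) - loss(theta)) / eps)
    return grad

# a toy loss surface over two weights: loss = w1^2 + 3*w2^2
loss = lambda th: th[0] ** 2 + 3 * th[1] ** 2
g = numerical_gradient(loss, [1.0, 1.0])   # close to [2.0, 6.0]
```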

DIRECTION OF STEEPEST ASCENT

To summarize: the gradient is an arrow that points in the direction of steepest ascent. That is, the gradient of our loss (at the dot) is the direction in which the loss surface increases the quickest.

More precisely, if we fit a tangent hyperplane to the loss surface at the dot, then the direction of steepest ascent on that hyperplane is the gradient. Since it’s a hyperplane, the opposite direction (the gradient with a minus in front of it) is the direction of steepest descent.

This is why we care about the gradient: it helps us find a downward direction on the loss surface. All we have to do is follow the negative gradient and we will end up lowering our loss.
STOCHASTIC GRADIENT DESCENT

This is the idea behind the gradient descent algorithm. We compute the gradient, take a small step in the opposite direction, and repeat:

pick some initial weights θ (for the whole model)
loop:                              # one pass of this loop is an epoch
    for x, t in Data:
        θ ← θ - α · ∇θ lossx,t(θ)  # α is the learning rate

Note that we’re subtracting the (scaled) gradient. Or rather, we’re adding the negative of the gradient. This results in taking a step in the opposite direction of the gradient.

The reason we take small steps is that the gradient is only the direction of steepest ascent locally. It’s a linear approximation to the nonlinear loss function. The further we move from our current position, the worse an approximation the tangent hyperplane will be for the function that we are actually trying to follow. That’s why we only take a small step, and then recompute the gradient in our new position.

We can compute the gradient of the loss with respect to a single example from our dataset, a small batch of examples, or the whole dataset:

stochastic gradient descent: loss over one example per step (loop over data)
minibatch gradient descent: loss over a few examples per step (loop over data)
full-batch gradient descent: loss over the whole data

These terms are used interchangeably in the literature, but we’ll try to stick to these definitions in the course. In deep learning, we almost always use minibatch gradient descent, but there are some cases where full-batch is used.

Training usually requires multiple passes over the data. One such pass is called an epoch.
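The loop above can be sketched end-to-end for a toy one-weight model y = w·x with squared error loss (an illustration, not from the slides; the gradient d(loss)/dw = 2(wx - t)x is worked out by hand here, since automatic differentiation is the topic of the next lecture):

```python
# (x, t) pairs generated by the relation t = 2*x, so the best weight is w = 2.0
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0          # some initial weight
alpha = 0.05     # learning rate
for epoch in range(100):        # multiple passes over the data
    for x, t in data:           # one example per step: stochastic gradient descent
        grad = 2.0 * (w * x - t) * x   # d/dw of (w*x - t)^2
        w = w - alpha * grad    # small step against the gradient

# w has now descended the loss surface to (very nearly) 2.0
```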

RECAP

perceptron: linear combination of inputs
neural network: network of perceptrons, with scalar nonlinearities
training: (minibatch) gradient descent

But, how do we compute the gradient of a complex neural network? Next lecture: backpropagation.

This is the basic idea of neural networks. We define a perceptron, a simplified model of a neuron, which we chain together into a neural network, with nonlinearities added. We then define a loss, and train by gradient descent to find good weights.

What we haven’t discussed is how to work out the gradient of a loss function over a neural network. For simple functions, like linear classifiers, this can be done by hand. For more complex functions, like very deep neural networks, this is no longer feasible, and we need some help.

This help comes in the form of the backpropagation algorithm. This is a complex and very important algorithm, so we will dive into it in the next lecture.

“NEURAL” NETWORKS

Before we move on, it’s important to note that the name neural network is not much more than a historical artifact. The original neural networks were very loosely inspired by the networks of neurons in our heads, but even then the artificial neural nets were so simplified that they had little to do with the real thing. Today’s neural nets are nothing like brain networks, and serve in no way as a realistic model of what happens in our head. In short, don’t read too much into the name.



|section|Classification and regression|

|video|https://www.youtube.com/embed/-JNj1legjHw?si=unWk3uoiBqGfVnmt|

PART TWO: CLASSIFICATION AND REGRESSION

Using neural networks for basic machine learning.

So, now we know how to build a neural network, and we know broadly how to train one by gradient descent (taking it as read for now that there’s a special way to work out a gradient, that we’ll talk about later). What can we do with this? We’ll start with the basic machine learning tasks of classification and regression. These will lead to some loss functions that we will build on a lot during the course.

It will also show how we use probability theory in deep learning, which is important to understand.

BINARY CLASSIFICATION

If we have a classification problem with two classes, which we’ll call positive and negative, we can place a sigmoid activation on the output layer, so that the output is between 0 and 1.

We can then interpret the output y as the probability p(C=Pos|x) that the input has the positive class (according to our network). The probability of the negative class is 1 minus this value.

LOG LOSS

So, what’s our loss here? The situation is a little different from the regression setting. Here, the neural network predicts a number between 0 and 1, and the data only gives us a value that is true or false. Broadly, what we want from the loss is that it is a low value if the probability of the true class is close to one, and high if the probability of the true class is low.

One popular function that does this for us is the log loss: the negative logarithm of the probability of the true class.

loss = -log p(t | x)   with p(t | x) = y if t = pos, 1 - y if t = neg

We’ll just give you the functional form for now. We’ll show where it comes from later. The log loss is also known as the cross-entropy. See this lecture to learn why.

We suggest you think a bit about the shape of the log function, and convince yourself that this is indeed a function with the correct properties.

The base of the logarithm can be anything. It’s usually e or 2.
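The formula above translates directly to code (a sketch using the natural logarithm; the string labels are an assumption for the example):

```python
import math

def log_loss(y, t):
    """Negative log-probability of the true class.
    y is the predicted probability of the positive class; t is 'pos' or 'neg'."""
    p = y if t == "pos" else 1.0 - y
    return -math.log(p)

log_loss(0.9, "pos")   # confident and correct: low loss
log_loss(0.9, "neg")   # confident and wrong: high loss
```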

MULTI-CLASS CLASSIFICATION: THE SOFTMAX ACTIVATION

What if we have more than one class? Then we want the network to somehow output a probability distribution over all classes. We can’t do this with a single node anymore. Instead, we’ll give the network one output node for every possible class, each computing a raw value oi = wT h + b.

We can then use the softmax activation. This is an activation function that ensures that all the output nodes are positive and that they always sum to one. In short, that together they form a probability vector:

yi = exp(oi) / Σj exp(oj)

We can then interpret this series of values as the class probabilities that our network predicts. That is, after the softmax we can interpret the output of node 3 as the probability that our input has class 3.

To compute the softmax, we simply take the exponent of each output node oi (to ensure that they are all positive) and then divide each by the total (to ensure that they sum to one). We could make the values positive in many other ways (taking the absolute value or the square), but the exponent is a common choice for this sort of thing in statistics and physics, and it seems to work well enough.

Note that the softmax is a little unusual for an activation function: it’s not element-wise like the sigmoid or the ReLU. To compute the value of one output node, it looks at the inputs of all the other nodes in the same layer.
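The softmax formula above can be sketched in NumPy (the max-subtraction is a common numerical-stability trick, not part of the definition; it leaves the result unchanged):

```python
import numpy as np

def softmax(o):
    """Turn raw output values o into a probability vector."""
    e = np.exp(o - np.max(o))  # shift by the max for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 0.5]))
# every entry is positive, the entries sum to one,
# and the largest raw output gets the largest probability
```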

LOG LOSS

The loss function for the softmax is the same as it was for the binary classification. We assume that the data tells us what the correct class is, and we take the negative log-probability of the correct class as the loss:

loss = -log p(t | x)   with p(t | x) = yt

This way, the higher the probability is that the model assigns to the correct class, the lower the loss.

Note that we can apply this trick also to a binary classification problem. That is, we don’t need to do binary classification with a single output node; we can also use two output nodes and a softmax. In practice, both options lead to almost the same gradients. The single-node option saves a little memory, but probably nothing noticeable. The multi-node option is a little more flexible when you want to use the same code on different classification tasks.
SOME COMMON LOSS FUNCTIONS

Here are some common loss functions for situations where we have examples t of what the model output y should be for a given input x, with y = modelθ(x).

The error losses are derived from basic regression:

squared errors (regression): ‖y - t‖² = Σi (yi - ti)²
absolute errors (regression): ‖y - t‖1 = Σi abs(yi - ti)
<latexit sha1_base64="Je3kvsHjCjHzn2B8FiDp37z3DyY=">AAAO6XicnZfbbts2AIbt7tR5nddsl7shFnRIBjeQEscHFAFqyTYKbG2zIKcuCgKKpmXBlEhIlGNX1UPsbtjtHmVPsUfY7fYCo2RF1tEX0xXB/+fPTzxIpM6I6XJJ+qv+6KOPP/n0s8efN7548mXzq6c7X1+61HMQvkCUUOdahy4mpo0vuMkJvmYOhpZO8JU+V0P9aoEd16T2OV8xfGtBwzanJoJcVN3t1LXvtQ++ZkE+06dgFQDwHGg6B9oHoL3QXgCN4yX3700+A0HadhIrFp1gEtxpgsTX+AxzGOxpy31Na1Tl3slhY9ezRCPDN4M4COpusLd6qBNuHpf3QRgmWhNqABb2lHTE9yPINKPowbSB5kstIGvB/22qkQnlbgv8GEfE7wqXwV4YHNKtqgOey1HnjUbj7umudCBFDygW5LiwW4uf07udJ7yhTSjyLGxzRKDr3sgS47c+dLiJCA4amudiBtEcGvjG49PerW/azOPYRgF4JrSpRwCnIJxrMDEdjDhZiQJEjikSAJpBByIuVkQjG+ViG1rYbU0WJnPXRXdhrAsciuV06y+j5RZkGvqGA9nMRMsMmQ8tN5z6QqW7svRsJfYIdhZWtjKkFIw55xI7yHTDMTgVA/OWhSvYPaensT5bsRm23cD3HBKkGwoBOw6eioZR0cXcY370MmLbzN0T7ni4FRajupMhdOZneNISOZmKLM6UUMizVboVDo6N7xG1LGhPfI2JTROtEK11EAjxGdCJMOsUOpOs8yzw4+2igzNhzYhvUuKbvDhKiaO4E0omYEodsBDzTx0XCCMQFsdE2M22vkhaT8FFPvoyJV7mxauUeJUXdS+legV1kVIXBfU+pd7n1WVKXObFVUpcFXJ5SuV59X1KfJ8Xr7d1+m5bp7/kYsXsiC2zErscT8U3O1pg/hwF/qvz1z8Ffj964pXiYSBnjUh/MB6NO91+N8jL5EFvj3uyMizqiaGrDGS1zJA4Bl1VGo42LIc5bwItSZ3RoJOPQmSj9/rquKhvYKVBd6iUGDa0Y7U96sbDh7GdsxqJT+132oVhMZKcvqIox/2inhiUtqr2DksMiUMdDocDNUJhnsMIznnZg7HTOZaKUSwJ6kmd9qBE30yA1FOUAixLsShjRR7KEQvHkOScPFksam/QH+WD+Gb8lYGqFiaQp4a/p8rDdolhw3o8PB4dRSTUgbaRHxWaDF+n2zvq5aNoEjTuihkssNBNT+O+InULLDTFMlZUJRzY7E4Um+xGvvXXH+TNvgO7MgiCvFlstLw53HsP5pyXlJhJtbvUvs1f3oBUwhffFKGqeFQSjiphUBkLqoZHpfBoG7xR9BtV8UZJuFEJY5SxGNXwRim8sQ2eFf2sKp6VhLNKGFbGwqrhWSk82wbPi35eFc9LwnklDC9j4dXwvBSeb4OnRT+tiqcl4bQShpax0Gp4WgpPs/Di5xGe6SEB4ZmYEiCuDuuDQeabxcLjwxyJk+TavX7xIRZ3Awe/FseKt+JAC8UZ7wdfg45hmXYg7gqG1gpL24ziLhMbRUncU+T8raRYuDw8kI8PpJ/buy+V+MbyuPZt7bvaXk2udWsva69qp7WLGqr/Wf+7/k/93+a8+Wvzt+bva+ujetzmm1rmaf7xH+Zec/s=</latexit>

absolute errors
ky - tk11 with
X
ky - tk = i abs(yi - ti )
model
(binary) cross entropy comes from logistic regression
abs(yi - ✓ti(x)
<latexit sha1_base64="Je3kvsHjCjHzn2B8FiDp37z3DyY=">AAAO6XicnZfbbts2AIbt7tR5nddsl7shFnRIBjeQEscHFAFqyTYKbG2zIKcuCgKKpmXBlEhIlGNX1UPsbtjtHmVPsUfY7fYCo2RF1tEX0xXB/+fPTzxIpM6I6XJJ+qv+6KOPP/n0s8efN7548mXzq6c7X1+61HMQvkCUUOdahy4mpo0vuMkJvmYOhpZO8JU+V0P9aoEd16T2OV8xfGtBwzanJoJcVN3t1LXvtQ++ZkE+06dgFQDwHGg6B9oHoL3QXgCN4yX3700+A0HadhIrFp1gEtxpgsTX+AxzGOxpy31Na1Tl3slhY9ezRCPDN4M4COpusLd6qBNuHpf3QRgmWhNqABb2lHTE9yPINKPowbSB5kstIGvB/22qkQnlbgv8GEfE7wqXwV4YHNKtqgOey1HnjUbj7umudCBFDygW5LiwW4uf07udJ7yhTSjyLGxzRKDr3sgS47c+dLiJCA4amudiBtEcGvjG49PerW/azOPYRgF4JrSpRwCnIJxrMDEdjDhZiQJEjikSAJpBByIuVkQjG+ViG1rYbU0WJnPXRXdhrAsciuV06y+j5RZkGvqGA9nMRMsMmQ8tN5z6QqW7svRsJfYIdhZWtjKkFIw55xI7yHTDMTgVA/OWhSvYPaensT5bsRm23cD3HBKkGwoBOw6eioZR0cXcY370MmLbzN0T7ni4FRajupMhdOZneNISOZmKLM6UUMizVboVDo6N7xG1LGhPfI2JTROtEK11EAjxGdCJMOsUOpOs8yzw4+2igzNhzYhvUuKbvDhKiaO4E0omYEodsBDzTx0XCCMQFsdE2M22vkhaT8FFPvoyJV7mxauUeJUXdS+legV1kVIXBfU+pd7n1WVKXObFVUpcFXJ5SuV59X1KfJ8Xr7d1+m5bp7/kYsXsiC2zErscT8U3O1pg/hwF/qvz1z8Ffj964pXiYSBnjUh/MB6NO91+N8jL5EFvj3uyMizqiaGrDGS1zJA4Bl1VGo42LIc5bwItSZ3RoJOPQmSj9/rquKhvYKVBd6iUGDa0Y7U96sbDh7GdsxqJT+132oVhMZKcvqIox/2inhiUtqr2DksMiUMdDocDNUJhnsMIznnZg7HTOZaKUSwJ6kmd9qBE30yA1FOUAixLsShjRR7KEQvHkOScPFksam/QH+WD+Gb8lYGqFiaQp4a/p8rDdolhw3o8PB4dRSTUgbaRHxWaDF+n2zvq5aNoEjTuihkssNBNT+O+InULLDTFMlZUJRzY7E4Um+xGvvXXH+TNvgO7MgiCvFlstLw53HsP5pyXlJhJtbvUvs1f3oBUwhffFKGqeFQSjiphUBkLqoZHpfBoG7xR9BtV8UZJuFEJY5SxGNXwRim8sQ2eFf2sKp6VhLNKGFbGwqrhWSk82wbPi35eFc9LwnklDC9j4dXwvBSeb4OnRT+tiqcl4bQShpax0Gp4WgpPs/Di5xGe6SEB4ZmYEiCuDuuDQeabxcLjwxyJk+TavX7xIRZ3Awe/FseKt+JAC8UZ7wdfg45hmXYg7gqG1gpL24ziLhMbRUncU+T8raRYuDw8kI8PpJ/buy+V+MbyuPZt7bvaXk2udWsva69qp7WLGqr/Wf+7/k/93+a8+Wvzt+bva+ujetzmm1rmaf7xH+Zec/s=</latexit>

= y = )
- log p✓ (t)X iwith t 2 {0, 1}
(as shown last lecture) and the hinge loss comes from
p1✓✓with
ky i y = model (x)
<latexit sha1_base64="Je3kvsHjCjHzn2B8FiDp37z3DyY=">AAAO6XicnZfbbts2AIbt7tR5nddsl7shFnRIBjeQEscHFAFqyTYKbG2zIKcuCgKKpmXBlEhIlGNX1UPsbtjtHmVPsUfY7fYCo2RF1tEX0xXB/+fPTzxIpM6I6XJJ+qv+6KOPP/n0s8efN7548mXzq6c7X1+61HMQvkCUUOdahy4mpo0vuMkJvmYOhpZO8JU+V0P9aoEd16T2OV8xfGtBwzanJoJcVN3t1LXvtQ++ZkE+06dgFQDwHGg6B9oHoL3QXgCN4yX3700+A0HadhIrFp1gEtxpgsTX+AxzGOxpy31Na1Tl3slhY9ezRCPDN4M4COpusLd6qBNuHpf3QRgmWhNqABb2lHTE9yPINKPowbSB5kstIGvB/22qkQnlbgv8GEfE7wqXwV4YHNKtqgOey1HnjUbj7umudCBFDygW5LiwW4uf07udJ7yhTSjyLGxzRKDr3sgS47c+dLiJCA4amudiBtEcGvjG49PerW/azOPYRgF4JrSpRwCnIJxrMDEdjDhZiQJEjikSAJpBByIuVkQjG+ViG1rYbU0WJnPXRXdhrAsciuV06y+j5RZkGvqGA9nMRMsMmQ8tN5z6QqW7svRsJfYIdhZWtjKkFIw55xI7yHTDMTgVA/OWhSvYPaensT5bsRm23cD3HBKkGwoBOw6eioZR0cXcY370MmLbzN0T7ni4FRajupMhdOZneNISOZmKLM6UUMizVboVDo6N7xG1LGhPfI2JTROtEK11EAjxGdCJMOsUOpOs8yzw4+2igzNhzYhvUuKbvDhKiaO4E0omYEodsBDzTx0XCCMQFsdE2M22vkhaT8FFPvoyJV7mxauUeJUXdS+legV1kVIXBfU+pd7n1WVKXObFVUpcFXJ5SuV59X1KfJ8Xr7d1+m5bp7/kYsXsiC2zErscT8U3O1pg/hwF/qvz1z8Ffj964pXiYSBnjUh/MB6NO91+N8jL5EFvj3uyMizqiaGrDGS1zJA4Bl1VGo42LIc5bwItSZ3RoJOPQmSj9/rquKhvYKVBd6iUGDa0Y7U96sbDh7GdsxqJT+132oVhMZKcvqIox/2inhiUtqr2DksMiUMdDocDNUJhnsMIznnZg7HTOZaKUSwJ6kmd9qBE30yA1FOUAixLsShjRR7KEQvHkOScPFksam/QH+WD+Gb8lYGqFiaQp4a/p8rDdolhw3o8PB4dRSTUgbaRHxWaDF+n2zvq5aNoEjTuihkssNBNT+O+InULLDTFMlZUJRzY7E4Um+xGvvXXH+TNvgO7MgiCvFlstLw53HsP5pyXlJhJtbvUvs1f3oBUwhffFKGqeFQSjiphUBkLqoZHpfBoG7xR9BtV8UZJuFEJY5SxGNXwRim8sQ2eFf2sKp6VhLNKGFbGwqrhWSk82wbPi35eFc9LwnklDC9j4dXwvBSeb4OnRT+tiqcl4bQShpax0Gp4WgpPs/Di5xGe6SEB4ZmYEiCuDuuDQeabxcLjwxyJk+TavX7xIRZ3Awe/FseKt+JAC8UZ7wdfg45hmXYg7gqG1gpL24ziLhMbRUncU+T8raRYuDw8kI8PpJ/buy+V+MbyuPZt7bvaXk2udWsva69qp7WLGqr/Wf+7/k/93+a8+Wvzt+bva+ujetzmm1rmaf7xH+Zec/s=</latexit>

ky
--logtk
- tk =
(t)X withabs(y - t✓.1}
2i {0,
{0, i.). , K}
- log p (t) with tt 2 support vector machine classification. You can find
log loss / binary cross-entropy - log p ✓ (t) iwith t 2 {0, 1}
ky
- -
log
max(0,tkp =
(t) withabs(y 2
✓ - ty) with t 2 {-1,
t -
{0, t. i.). , K}
classification

- log p✓✓1(t) 1}
1 i
with t 2 {0, .1}. . , K} their derivations in most machine learning books/
max(0, 1 - ty) i
with t 2 {-1, 1}
log loss / cross-entropy max(0,
- log pp1✓✓(t)
- log (t) with
- ty)with with
tt 2 2 .{-1,
2t{0,
{0, 1} 1}
. . , K} courses.
max(0,
- log p1✓ (t)
- ty) with
with 2 {-1,
t 2t {0, 1}
. . . , K}
hinge loss max(0, 1 - ty) with t 2 {-1, 1} The loss can be computed for a single example or for
multiple examples. In almost all cases, the loss for
26
multiple examples is just the sum or average over all
their individual losses.
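These definitions are small enough to spell out directly. The sketch below (plain Python, no deep learning framework; the function names are mine, not from the slides) computes each loss for a single example:

```python
import math

def squared_error(y, t):
    # ||y - t||^2 for vector-valued output y and target t
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t))

def absolute_error(y, t):
    # ||y - t||_1 = sum_i abs(y_i - t_i)
    return sum(abs(yi - ti) for yi, ti in zip(y, t))

def binary_cross_entropy(p, t):
    # -log p_theta(t), where p is the predicted probability of class 1
    # and t is in {0, 1}
    return -math.log(p if t == 1 else 1.0 - p)

def hinge(y, t):
    # max(0, 1 - ty), where y is a raw score and t is in {-1, 1}
    return max(0.0, 1.0 - t * y)
```

For multiple examples, as noted above, you would simply sum or average these per-example values.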

FROM FIRST PRINCIPLES: THE MAXIMUM LIKELIHOOD PRINCIPLE


“Choose the model for which the data you observed is most likely.”

argmax_θ p_θ(data)

To finish up, let’s have a look at where these loss functions come from. We didn’t just find them by trial and error; many of them were derived from first principles. And the principle at work is a very powerful one: the maximum likelihood principle.

This principle is often used in frequentist statistics to fit models to data. Put simply, it says that given some data and a class of models, a good way of picking your model is to see what the probability of the data is under each model, and then pick the model for which that probability is highest.

There are many problems with this approach, and many settings in which it fails, but it’s usually a simple and intuitive place to start.
In statistics you usually fill in the definition of p and
solve the maximization problem until you get an
explicit expression of your optimal model as a function
of your data.
For instance, say the data is numerical, and the model
class is the normal distribution. Then after a bit of
scribbling you can work out that the mean of your data
and the standard deviation of your data are the
parameters of the normal distribution that fit your
data best, according to the maximum likelihood
principle.
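We can check this claim numerically. In the toy sketch below (my own check, not part of the lecture), we draw data from a Normal distribution and verify that the log-likelihood under a Normal model with fixed standard deviation is highest at the sample mean:

```python
import math
import random

# Toy data: 1000 samples from N(5, 2). Seed fixed for reproducibility.
random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(1000)]

def log_likelihood(mu, sigma, xs):
    # Log-likelihood of the data under N(mu, sigma^2)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

sample_mean = sum(data) / len(data)

# The likelihood at the sample mean beats the likelihood at nearby means.
assert log_likelihood(sample_mean, 2.0, data) > log_likelihood(sample_mean + 0.1, 2.0, data)
assert log_likelihood(sample_mean, 2.0, data) > log_likelihood(sample_mean - 0.1, 2.0, data)
```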
In our setting, things aren’t so easy. The parameters we are allowed to change are the parameters of the neural network (which determine the distribution on the labels). For neural networks we usually cannot find a closed-form solution, but we can still start with the maximum likelihood principle and see what we can rewrite it into.
DERIVING A LOSS FROM MAX LIKELIHOOD

argmax_θ p_θ(data) = argmax_θ ∏_{instance ∈ data} p_θ(instance)
                   = argmax_θ ∏_{(x,t) ∈ data} p_θ(t | x)
                   = argmax_θ log ∏_{(x,t) ∈ data} p_θ(t | x)
                   = argmax_θ Σ_{(x,t) ∈ data} log p_θ(t | x)
                   = argmin_θ (1/N) Σ_{(x,t) ∈ data} −log p_θ(t | x)

Let’s try this for the binary classification network. Here we assume that the neural network with weights θ will somehow describe the probability p_θ(data) of seeing the whole dataset.

The first step is to note that our data consists of independent, identically distributed samples (x, t), where x is the feature vector and t is the corresponding class. This means that the probability of all the data is just the product of all the individual instances.
Because they are independently sampled, we may
multiply their probabilities together.
Next, we note that we are only modelling the
probabilities of the classes, not of the features by
themselves (in fancy words, we have a discriminative
classifier). This means that the probability of each
instance is just the probability of the class given the
features.
Next, we stick a logarithm in front. This is a slightly
arbitrary choice, but if you work with probability
distributions a lot, you will know that taking logarithms
of probabilities almost always makes your life easier:
you get simpler functions, better numerical stability,
and better behaved gradients. Crucially, because the
logarithm is a monotonic function, the position of the
maximum doesn’t change: the model that maximizes
the probability of the data is the same as the model
that maximizes the log-probability.
Taking the logarithm inside the product turns the
product into a sum. Finally, we want something to
minimize, not maximize, so we stick a minus in front
and change the argmax to an argmin. We can then
rescale by any constant without moving the minimum.
If we use 1/N, with N the size of our data, then we end
up with the average log loss over the whole dataset.
That is, if we start with the maximum likelihood
objective, we can show step by step that this is
equivalent to minimizing the log loss.
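The equivalence is easy to verify numerically. In the toy check below (the data and the grid of candidate parameters are my own assumptions), searching over Bernoulli parameters, maximizing the likelihood and minimizing the average log loss pick out the same model:

```python
import math

# Toy binary targets and a grid of candidate models: each theta is the
# probability the model assigns to t = 1.
data = [1, 0, 1, 1, 0, 1]
thetas = [0.1 * k for k in range(1, 10)]

def likelihood(theta):
    # p_theta(data) = product of per-instance probabilities (i.i.d.)
    prod = 1.0
    for t in data:
        prod *= theta if t == 1 else 1.0 - theta
    return prod

def avg_log_loss(theta):
    # (1/N) * sum of -log p_theta(t)
    return sum(-math.log(theta if t == 1 else 1.0 - theta)
               for t in data) / len(data)

best_ml = max(thetas, key=likelihood)      # maximum likelihood
best_ll = min(thetas, key=avg_log_loss)    # minimum average log loss
assert best_ml == best_ll                  # same optimum
```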
A deeper reason for the “− log” is that every probability distribution can be thought of as a compression algorithm, and the negative log₂-probability is the number of bits you need to encode the data with this compression algorithm. See this lecture for details.

We can generalize this idea by viewing the output of the neural network as the parameters of a probability distribution.

    input x → neural network → output y → prob dist p_y(t | x)

For instance, in the binary classification network, the output of the network is the probability of one of the two classes. This is the parameter θ for a Bernoulli distribution on the space of the two classes {positive, negative}.

In the softmax example, the network output is a probability vector. This is the parameter for a Categorical distribution on the space of all classes {1, …, K}.
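As a minimal sketch of this view (plain Python with made-up weights, not the lecture’s code): a sigmoid output y is read as the Bernoulli parameter of p_y(t | x), and the loss is the negative log-probability that this distribution assigns to the target:

```python
import math

def forward(x, w, b):
    # Linear layer followed by a sigmoid; the output is read as p(t=1 | x).
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_prob(y, t):
    # Negative log-probability of target t under Bernoulli(y).
    return -math.log(y if t == 1 else 1.0 - y)

# Made-up input and weights, for illustration only.
y = forward([1.0, 2.0], w=[0.5, -0.25], b=0.1)
loss = neg_log_prob(y, t=1)
```

Training then amounts to adjusting w and b so that this negative log-probability, averaged over the data, goes down.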
To finish up, let’s see what happens if we extend this to another distribution: the Normal distribution. The normal distribution is a distribution on the number line, so this fits best to a regression problem.

[Slide: a Normal density N(t | μ, 1) over the blood pressure axis; the network output is the predicted mean, which we move to maximize the density at the true value t.]

We’ll return to our earlier example. We give the network an age and a weight and our aim is to predict the blood pressure. However, instead of treating the output node as the predicted blood pressure directly, we treat it as the mean of a probability distribution on the space of all blood pressures.

To keep things simple, we fix the variance to 1 and predict only the mean. We can also give the neural network two outputs, and have it parametrize the whole normal distribution. We’ll see examples of this later in the course.

Note that in this picture, we are moving the mean around to maximize the probability density of the true value t. We move the mean by changing the weights of the neural network. Of course, for every instance we see, t will be in a new place, so the weights should give us a new mean for every input x we see.

So what does the maximum likelihood objective look like for this problem?

argmax_θ p_θ(data) = argmin_θ (1/N) Σ_{(x,t) ∈ data} −ln p_θ(t | x)
                   = argmin_θ (1/N) Σ_{(x,t) ∈ data} −ln N(t | μ = y, 1)
                   = argmin_θ (1/N) Σ_{(x,t) ∈ data} −ln ( (1/√(2π)) e^(−½(t−y)²) )
                   = argmin_θ (1/N) Σ_{(x,t) ∈ data} ( −ln (1/√(2π)) − ln e^(−½(t−y)²) )
                   = argmin_θ (1/N) Σ_{(x,t) ∈ data} ½(t − y)²
                   = argmin_θ (1/N) Σ_{(x,t) ∈ data} (t − y)²

We’re maximizing the probability density rather than the probability, but the approach remains the same. First, we follow the same steps as before. This turns our maximum likelihood into minimizing the negative log probability (density). We use the base-e logarithm to make our life easier.

Then we fill in the definition of the normal probability density. This is a complicated function, but it quickly simplifies because of the logarithm in front of it. The parts in grey are additive or multiplicative constants, which we can discard without changing the location of the minimum.

After we’ve stripped away everything we can, we find that the result is just the plain old squared error loss that we were using all along.
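The key step, that −ln N(t | y, 1) is just ½(t − y)² plus a constant that does not depend on y, can be checked numerically (a toy verification, not from the slides):

```python
import math

def neg_log_normal(t, y):
    # -ln of the N(t | mu=y, sigma=1) density
    density = (1.0 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * (t - y) ** 2)
    return -math.log(density)

# The additive constant that the derivation discards.
const = math.log(math.sqrt(2 * math.pi))

for t, y in [(0.0, 0.0), (1.5, -0.5), (3.0, 2.0)]:
    assert abs(neg_log_normal(t, y) - (0.5 * (t - y) ** 2 + const)) < 1e-9
```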
You may wonder if this is how the squared error loss
that we use so much was first discovered: by starting
with the famous normal distribution and then working
out the maximum likelihood solution for its
parameters. The truth is a little stranger.
In point of fact it was the other way around. Gauss,
who discovered the Normal distribution, started with
the idea of the mean of a set of measurements as a
good approximation of the true value. This was
something people had been using since antiquity. He
wanted to justify its use.
Gauss invented the maximum likelihood principle to do
so. Then he asked himself what kind of probability
density function would have the mean as its maximum
likelihood solution. He worked out that its logarithm
would have to correspond to the squares of the
residuals, and from that worked out the basic form of
the normal distribution, doing what we just did, but in
the opposite direction.

SOME COMMON LOSS FUNCTIONS


What we’ve shown is that almost all of the loss
functions we use regularly, can be derived from this
basic principle of maximum likelihood. This is
ky - tk with y = model✓ (x)
<latexit sha1_base64="Je3kvsHjCjHzn2B8FiDp37z3DyY=">AAAO6XicnZfbbts2AIbt7tR5nddsl7shFnRIBjeQEscHFAFqyTYKbG2zIKcuCgKKpmXBlEhIlGNX1UPsbtjtHmVPsUfY7fYCo2RF1tEX0xXB/+fPTzxIpM6I6XJJ+qv+6KOPP/n0s8efN7548mXzq6c7X1+61HMQvkCUUOdahy4mpo0vuMkJvmYOhpZO8JU+V0P9aoEd16T2OV8xfGtBwzanJoJcVN3t1LXvtQ++ZkE+06dgFQDwHGg6B9oHoL3QXgCN4yX3700+A0HadhIrFp1gEtxpgsTX+AxzGOxpy31Na1Tl3slhY9ezRCPDN4M4COpusLd6qBNuHpf3QRgmWhNqABb2lHTE9yPINKPowbSB5kstIGvB/22qkQnlbgv8GEfE7wqXwV4YHNKtqgOey1HnjUbj7umudCBFDygW5LiwW4uf07udJ7yhTSjyLGxzRKDr3sgS47c+dLiJCA4amudiBtEcGvjG49PerW/azOPYRgF4JrSpRwCnIJxrMDEdjDhZiQJEjikSAJpBByIuVkQjG+ViG1rYbU0WJnPXRXdhrAsciuV06y+j5RZkGvqGA9nMRMsMmQ8tN5z6QqW7svRsJfYIdhZWtjKkFIw55xI7yHTDMTgVA/OWhSvYPaensT5bsRm23cD3HBKkGwoBOw6eioZR0cXcY370MmLbzN0T7ni4FRajupMhdOZneNISOZmKLM6UUMizVboVDo6N7xG1LGhPfI2JTROtEK11EAjxGdCJMOsUOpOs8yzw4+2igzNhzYhvUuKbvDhKiaO4E0omYEodsBDzTx0XCCMQFsdE2M22vkhaT8FFPvoyJV7mxauUeJUXdS+legV1kVIXBfU+pd7n1WVKXObFVUpcFXJ5SuV59X1KfJ8Xr7d1+m5bp7/kYsXsiC2zErscT8U3O1pg/hwF/qvz1z8Ffj964pXiYSBnjUh/MB6NO91+N8jL5EFvj3uyMizqiaGrDGS1zJA4Bl1VGo42LIc5bwItSZ3RoJOPQmSj9/rquKhvYKVBd6iUGDa0Y7U96sbDh7GdsxqJT+132oVhMZKcvqIox/2inhiUtqr2DksMiUMdDocDNUJhnsMIznnZg7HTOZaKUSwJ6kmd9qBE30yA1FOUAixLsShjRR7KEQvHkOScPFksam/QH+WD+Gb8lYGqFiaQp4a/p8rDdolhw3o8PB4dRSTUgbaRHxWaDF+n2zvq5aNoEjTuihkssNBNT+O+InULLDTFMlZUJRzY7E4Um+xGvvXXH+TNvgO7MgiCvFlstLw53HsP5pyXlJhJtbvUvs1f3oBUwhffFKGqeFQSjiphUBkLqoZHpfBoG7xR9BtV8UZJuFEJY5SxGNXwRim8sQ2eFf2sKp6VhLNKGFbGwqrhWSk82wbPi35eFc9LwnklDC9j4dXwvBSeb4OnRT+tiqcl4bQShpax0Gp4WgpPs/Di5xGe6SEB4ZmYEiCuDuuDQeabxcLjwxyJk+TavX7xIRZ3Awe/FseKt+JAC8UZ7wdfg45hmXYg7gqG1gpL24ziLhMbRUncU+T8raRYuDw8kI8PpJ/buy+V+MbyuPZt7bvaXk2udWsva69qp7WLGqr/Wf+7/k/93+a8+Wvzt+bva+ujetzmm1rmaf7xH+Zec/s=</latexit>

squared errors Normal distribution (fixed important to understand, because it’s a principle we
regression

regression:
squared errors     ‖y − t‖² = Σᵢ (yᵢ − tᵢ)²   with y = model_θ(x)     Normal distribution (fixed variance)
absolute errors    ‖y − t‖₁ = Σᵢ |yᵢ − tᵢ|    with y = model_θ(x)     Laplace distribution (fixed variance)

classification:
log loss (binary cross-entropy)    − log p_θ(t)    with t ∈ {0, 1}          Bernoulli distribution
cross-entropy                      − log p_θ(t)    with t ∈ {0, . . . , K}  Categorical distribution
hinge loss                         max(0, 1 − ty)  with t ∈ {−1, 1}         (none)

We will return to these loss functions multiple times throughout the course, especially in the context of generative models.
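These losses are simple to write down directly. Here is a small numpy sketch of the table above; the function names and the scalar/array shapes are illustrative, not taken from any particular library:

```python
import numpy as np

def squared_error(y, t):
    # sum of squared differences between prediction y and target t
    return np.sum((y - t) ** 2)

def absolute_error(y, t):
    # sum of absolute differences
    return np.sum(np.abs(y - t))

def log_loss(p, t):
    # p: predicted probability of class 1, target t in {0, 1}
    return -np.log(p) if t == 1 else -np.log(1 - p)

def hinge_loss(y, t):
    # y: raw score, target t in {-1, 1}
    return max(0.0, 1 - t * y)
```

For example, `squared_error(np.array([1.0, 2.0]), np.zeros(2))` gives 5.0, and a confidently correct classification such as `hinge_loss(2.0, 1)` incurs no loss.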

|section|Autoencoders|
|video|https://www.youtube.com/embed/
M1hh8HcWEzk?si=8OlYpUyuibS7PJI5|
PART THREE: AUTOENCODERS

A simple neural architecture to give you a hint of what’s possible.

So, we’ve seen what neural networks are, and how to use them for regression and for classification. But the real power of neural networks is not in doing classical machine learning tasks. Rather, it’s in their flexibility to grow beyond that. Neural networks can be set up in a wide variety of different configurations, to solve all sorts of different tasks.
To give you a hint of that, we’ll finish up by looking at a simple example: the autoencoder.

AUTOENCODERS

bottleneck architecture for dimensionality reduction.
input should be as close as possible to the output
but: must pass through a small representation.

Here’s what an autoencoder looks like. It’s a particular type of neural network, shaped like an hourglass. Its job is just to make the output as close to the input as possible, but somewhere in the network there is a small layer that functions as a bottleneck.
We can set it up however we like, with one or many fully connected layers. The only requirements are that (a) one of the layers forms a bottleneck, and (b) the input is the same size as the output.
The idea is that we simply train the neural network to reconstruct the input. If we manage to train a network that does this successfully, then we know that whatever value the bottleneck layer takes for a particular input is a low-dimensional representation of that input, from which we can decode the input pretty well, so it must contain the relevant details.
Note how powerful this idea is. We don’t even need any labeled data. All we need is a large number of examples (images, sentences, etc.) and with that we can train an autoencoder.
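As a concrete, heavily simplified sketch, here is a linear autoencoder trained with the squared error loss by gradient descent, in plain numpy. The toy data, layer sizes, and learning rate are illustrative assumptions; a real autoencoder would use nonlinear layers, but the bottleneck-and-reconstruction principle is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 200 points in 8 dimensions that secretly lie on a 2D subspace
basis = rng.normal(size=(2, 8))
X = rng.normal(size=(200, 2)) @ basis

d_in, d_latent = 8, 2                       # bottleneck of 2 units
W_enc = rng.normal(scale=0.1, size=(d_in, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_in))

lr = 0.01
losses = []
for epoch in range(1000):
    Z = X @ W_enc                           # encode to the latent space
    Y = Z @ W_dec                           # decode back to input space
    R = Y - X                               # reconstruction residuals
    losses.append((R ** 2).sum() / len(X))  # mean squared error per example
    # gradients of the reconstruction loss w.r.t. both weight matrices
    dW_dec = Z.T @ (2 * R) / len(X)
    dW_enc = X.T @ (2 * R @ W_dec.T) / len(X)
    W_dec -= lr * dW_dec
    W_enc -= lr * dW_enc
```

Because the data really lives on a 2D subspace and the bottleneck has 2 units, the reconstruction loss drops close to zero; with a narrower bottleneck than the true data dimensionality, the network would be forced to keep only the most important directions.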

Here’s the picture in detail. We call the bottom half of the network the encoder and the top half the decoder.
We feed the autoencoder an instance from our dataset, and all it has to do is reproduce that instance in its output. We can use any loss that compares the output to the original input, and produces a lower loss the more similar they are. Then, we just backpropagate the loss and train by gradient descent.
To feed a neural network an image, we can just flatten the whole thing into a vector. Every color channel of every pixel becomes an input node, giving us, in this case, 128 × 128 × 3 inputs. This is a bit costly, but we’ll see some more efficient ways to feed images to neural networks soon.
Many loss functions would work here, but to keep things simple, we’ll stick with the squared error loss.
We call the blue layer the latent representation of the input. If we train an autoencoder with just two nodes in the bottleneck layer, we can plot in two dimensions what latent representation each input is assigned. If the autoencoder works well, we expect to see similar images clustered together (for instance smiling people vs frowning people, men vs women, etc.). This is often called the latent space of a network.
No need to read too much into that yet, but it’s a phrase that will come back often.
In a 2D space, we can’t cluster too many attributes together, but in higher dimensions it’s easier. To quote Geoff Hinton: “If there was a 30 dimensional

AFTER 5 EPOCHS (256 LATENT DIMENSIONS)

To show what this looks like, we’ve set up a relatively simple autoencoder. It uses a few tricks we haven’t discussed yet, but the basic principle is just a neural network with a bottleneck and a squared error loss. The size of the bottleneck layer is 256 nodes.
We train it on a low-resolution version of the FFHQ dataset, containing 70 000 images of faces at resolution 128 × 128.
Here are the reconstructions after 5 full passes over the data.

AFTER 25 EPOCHS

AFTER 100 EPOCHS


AFTER 300 EPOCHS

After 300 epochs, the autoencoder has pretty much converged. Here are the reconstructions next to the original data. Considering that we’ve reduced each image to just 256 numbers, it’s not too bad.

data | reconstructions

INTERPOLATION

So, now that we have an autoencoder, what can we do with it? One thing we can do is interpolation.
If we take two points in the latent space, and draw a line between them, we can pick evenly spaced points on that line and decode them. If the decoder is good, and all the points in the latent space decode to realistic faces, then this should give us a smooth transition from one point to the other, and each point should result in a convincing example of our output domain.
This is not a guaranteed property of an autoencoder trained like this. We’ve only asked it to find a way to represent the data in the latent space. Still, you usually get decent interpolations from a simple autoencoder. In a few weeks, we will see variational autoencoders, which enforce more explicitly that all points in the latent space should decode to realistic examples.

source: Sampling Generative Networks, Tom White
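The interpolation itself is just a convex combination in the latent space. A minimal sketch, where z_a and z_b stand in for the latent representations of two encoded images (each resulting point would then be passed through the decoder):

```python
import numpy as np

def interpolate(z_a, z_b, steps=7):
    """Evenly spaced points on the line from z_a to z_b in latent space."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1 - a) * z_a + a * z_b for a in alphas])

# 5 evenly spaced latent points between two 2D latent vectors
zs = interpolate(np.zeros(2), np.array([4.0, 4.0]), steps=5)
# zs[0] equals z_a, zs[-1] equals z_b, and the rest lie evenly in between
```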

FIND THE SMILING VECTOR

latent space (256 dim)

Another thing we can do is to study the latent space based on the examples that we have. For instance, we can see whether smiling and non-smiling people end up in distinct parts of the latent space.
We just label a small number of instances as smiling and non-smiling (just 20 each in this case). If we’re lucky, these form distinct clusters in our latent space. If we compute the means of these clusters, we can draw a vector between them. We can think of this as a “smiling” vector. The further we push people along this line, the more the decoded point will smile.
This is one big benefit of autoencoders: we can train them on unlabeled data (which is cheap) and then use only a very small number of labeled examples to “annotate” the latent space. In other words, autoencoders are a great way to do semi-supervised learning.
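Estimating such an attribute vector is nothing more than a difference of cluster means. A sketch with made-up latent representations standing in for real encoded faces:

```python
import numpy as np

def attribute_vector(Z_pos, Z_neg):
    # vector pointing from the mean of the negative cluster (non-smiling)
    # to the mean of the positive cluster (smiling)
    return Z_pos.mean(axis=0) - Z_neg.mean(axis=0)

rng = np.random.default_rng(1)
Z_smiling = rng.normal(loc=1.0, size=(20, 4))    # 20 labeled smiling faces
Z_neutral = rng.normal(loc=-1.0, size=(20, 4))   # 20 labeled non-smiling faces
v_smile = attribute_vector(Z_smiling, Z_neutral)
```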

MAKE SOMEONE SMILE/FROWN

encode to the latent space:
z = encode(x)
add/subtract some proportion of the smiling vector:
zsmile = z + vsmile * 0.2
decode to a smiling face:
xsmile = decode(zsmile)

Once we’ve worked out what the smiling vector is, we can manipulate photographs to make people smile. We just encode their picture into the latent space, add the smiling vector (times some small scalar to control the effect), and decode the manipulated latent representation. If the autoencoder understands "smiling" well enough, the result will be the same picture but manipulated so that the person will smile.
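This pseudocode can be made concrete as follows. The linear encode/decode functions here are stand-ins for a trained encoder and decoder network; any trained autoencoder would slot in the same way:

```python
import numpy as np

# stand-in linear encoder/decoder (hypothetical trained weight matrices)
def encode(x, W_enc):
    return x @ W_enc

def decode(z, W_dec):
    return z @ W_dec

def add_smile(x, v_smile, W_enc, W_dec, strength=0.2):
    z = encode(x, W_enc)                 # map the image to the latent space
    z_smile = z + strength * v_smile     # push it along the smiling vector
    return decode(z_smile, W_dec)        # map it back to image space

# with identity weights, pushing along v_smile just shifts the input
out = add_smile(np.array([1.0, 1.0]), np.array([1.0, 0.0]),
                np.eye(2), np.eye(2), strength=0.2)
# → array([1.2, 1. ])
```

A negative `strength` subtracts the vector instead, which would make the decoded face frown.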

Here is what that looks like for our (simple) example model. In the middle we have the decoding of the original data, and to the right we see what happens if we add an increasingly large multiple of the smiling vector.
To the left we subtract the smiling vector, which makes the person frown.

With a bit more powerful model, and some face detection, we can see what some famously moody celebrities might look like if they smiled.

source: https://blogs.nvidia.com/blog/2016/12/23/ai-flips-kanye-wests-frown-upside-down/
AUTOENCODERS

Keep the encoder and decoder: data manipulator.
Keep the encoder, ditch the decoder: dimensionality reduction.
Ditch the encoder, keep the decoder: generator network.

So, what we get out of an autoencoder depends on which part of the model we focus on.
If we keep the encoder and the decoder, we get a network that can help us manipulate data in this way. We map data to the latent space, tweak it there, and then map it back out of the latent space with the decoder.
If we keep just the encoder, we get a powerful dimensionality reduction method. We can use the latent space representation as the features for a model that does not scale well to high numbers of features.

One final thing we can do is to throw away the encoder and keep only the decoder. In that case the result is a generator network: a model that can generate fictional examples of the sort of thing we have in our dataset (in our case, pictures of people who don’t exist).

TURNING AN AUTOENCODER INTO A GENERATOR

train an autoencoder
encode the data to latent variables Z
fit a normal distribution to Z
sample from the normal distribution
“decode” the sample

Here’s how that works. First, we train an autoencoder on our data. The encoder will map the data to a point cloud in our latent space.
We don’t know what this point cloud will look like, but we’ll make a guess that a multivariate normal distribution will be a reasonable fit. We fit such a distribution to our data. If it fits well, then the regions of our latent space that get high probability density are also the regions that are likely to decode to realistic-looking instances of our data.
With that, we can sample a point from the normal distribution, pass it through the decoder, and get a new datapoint that looks like it could have come from our data.
This is a bit like the interpolation example. There, we assumed that the points directly in between two latent representations should decode to realistic examples. Here, we assume that all the points that are anywhere near points in our data (as captured by the normal distribution) decode to realistic examples.
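The fitting-and-sampling steps can be sketched in a few lines of numpy. Here Z is stand-in data for the encoded latent representations; in the full pipeline, each sampled point would then be passed through the decoder:

```python
import numpy as np

def fit_and_sample(Z, n_samples, rng):
    # fit a multivariate normal to the latent point cloud Z (one row per example)
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False)
    # sample new latent points from the fitted distribution
    return rng.multivariate_normal(mu, cov, size=n_samples)

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 3))            # stand-in for encoded data
samples = fit_and_sample(Z, 400, rng)     # 400 new latent points to decode
```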
This is the point cloud of the latent representations in our example. We plot the first two of the 256 dimensions, resulting in the blue point cloud.
To these points we fit a multivariate normal distribution (in 256 dimensions), and we sample 400 new points from it: the red dots.
In short, we sample points in the latent space that do not correspond to the data, but that are sufficiently near the data that we can expect the decoder to give us something realistic.

If we feed these sampled points to the decoder, this is what we get.
The results don’t look as good as the reconstructions, but clearly we are looking at approximate human faces.

How to control the shape of the latent space?
What are we optimizing? Can we optimize maximum likelihood directly?
Can we optimize for better interpolation directly?
Coming up soon: the Variational Autoencoder.

This has given us a generator, but we have little control over what the cloud of latent representations looks like. We just have to hope that it looks enough like a normal distribution that our normal distribution makes a good fit.
We’ve also seen that the interpolation works well, but it’s not something we’ve specifically trained the network to do.
In short, the autoencoder is not a very principled way of getting a generator network. You may ask if there is a way to train a generator from first principles, perhaps starting with the maximum likelihood objective.
The answer to all of these questions is the variational autoencoder.

RECAP

What are neural networks? How are they trained?
What is a loss function, and how are they derived?
First example of a complex architecture: the autoencoder.
Next lecture: How do we derive the gradient? Backpropagation.
