Lecture 1: Introduction (Annotated)
Peter Bloem
Deep Learning
dlvu.github.io

|video|https://www.youtube.com/embed/MrZvXcwQJdg?si=BRr6mIPzcjbE1N_a|

In this lecture, we will discuss the basics of neural networks: what they are, and how to use them.
THE PLAN
part 3: autoencoders
NONLINEARITY

The simplest solution is to apply a nonlinear function to each neuron, called the activation function. This is a scalar function (a function from a number to another number) which we apply to the output of a perceptron after all the weighted inputs have been combined:

    y = σ(w1 x1 + w2 x2 + b)

One popular option (especially in the early days) is the logistic sigmoid. The sigmoid takes the range of numbers from negative to positive infinity and squishes them down to the interval between 0 and 1:

    sigmoid(x) = 1 / (1 + e^-x)

Another, more recent nonlinearity is the linear rectifier, or ReLU. This function just sets every negative input to zero, and keeps everything else the same:

    r(x) = x if x > 0, and 0 otherwise

Not using an activation function is also called using a linear activation.

If you're familiar with logistic regression, you've seen the sigmoid function already: it's stuck on the end of a linear regression function (that is, a perceptron) to turn the outputs into class probabilities. Now, we will take these sigmoid outputs, and feed them as inputs to other perceptrons.
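As a concrete illustration (a minimal NumPy sketch of my own, not code from the slides), here are these two activation functions applied on top of a single perceptron with made-up weights:

import numpy as np

def sigmoid(x):
    # squishes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # sets negative inputs to zero, keeps positive inputs unchanged
    return np.maximum(0.0, x)

# a single neuron: weighted sum of the inputs plus a bias, then an activation
w = np.array([0.5, -1.2])   # w1, w2
b = 0.3
x = np.array([2.0, 1.0])    # x1, x2

y = sigmoid(w @ x + b)      # or relu(w @ x + b) for a ReLU unit
print(y)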
FEEDFORWARD NETWORK

Using these nonlinearities, we can arrange single neurons into neural networks. Any arrangement of perceptrons and nonlinearities makes a neural network, but for ease of training, the arrangement shown here (neurons organized into layers, each fully connected to the next, ending in an output layer y) was the most popular for a long time.
REGRESSION

With that, let's see how we can use such a feedforward network to attack some basic machine learning problems.

[Slide: inputs x1, x2 feed a hidden layer h1, h2, h3 (the feature extractor); the output y performs linear regression on these features.]

If we want to train a regression model (a model that predicts a numeric value), we put nonlinearities on the hidden nodes, and no activation on the output node. That way, the output can range from negative to positive infinity, and the nonlinearities on the hidden layer ensure that we can learn functions that a single perceptron couldn't learn.

We can think of the first layer as learning some nonlinear transformation of the inputs, the features in machine learning parlance, and we can think of the second layer as performing linear regression on these derived, nonlinear features.
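To make the arrangement concrete, here is a minimal NumPy sketch (my own example, with made-up weights) of such a two-layer regression network: a ReLU hidden layer acting as the feature extractor, followed by a linear output:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# hypothetical weights for a network with 2 inputs, 3 hidden units, 1 output
W1 = np.array([[ 0.3, -0.8],
               [ 1.1,  0.4],
               [-0.5,  0.9]])      # hidden layer: 3 x 2
b1 = np.zeros(3)
W2 = np.array([[0.7, -1.3, 0.2]])  # output layer: 1 x 3
b2 = np.zeros(1)

def model(x):
    h = relu(W1 @ x + b1)   # nonlinear feature extractor (h1, h2, h3)
    y = W2 @ h + b2         # linear regression on the learned features
    return y[0]             # no activation on the output: any real value

print(model(np.array([1.0, 2.0])))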
LOSS

The next step is to figure out a loss function. This tells you how well your network is doing with its current weights. The lower the loss, the better you are doing.

y: predicted blood pressure, as computed by the model: model_θ(x) = y
t: known blood pressure (given by the data)
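For regression, a common concrete choice is the squared error, which appears in the loss overview later in this lecture. A minimal sketch:

def squared_error(y, t):
    # y: predicted blood pressure, t: known blood pressure from the data
    return (y - t) ** 2

# lower is better: a prediction close to the target gives a small loss
print(squared_error(y=121.0, t=120.0))   # 1.0
print(squared_error(y=140.0, t=120.0))   # 400.0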
LOSS SURFACE

Here's what a loss surface might look like for a model with just two parameters. For every point in this plane of parameter values, our loss function defines a loss, and together these values form the surface.

Our job is to search the loss surface for a low point. When the loss is low, the model predictions are close to the target labels, and we've found a model that does well.

To find such a point, we look at the derivative. In one dimension, the derivative at a point is the slope of the tangent line: the line that just touches the curve at that point without crossing it. The slope of the tangent line tells us how quickly the loss rises or falls as we change that one weight.
GRADIENT

    ∇_θ loss_{x,t}(θ) = (∂loss/∂w1, ∂loss/∂w2, ...)

If our input space has multiple dimensions, like our model space, we can simply take a derivative with respect to each input separately, treating the others as constants. This is called a partial derivative. The collection of all partial derivatives is called the gradient.

The partial derivatives of the loss surface, one for each model weight, tell us how much the loss falls or rises if we increase each weight. Clearly, this information can help us find weights for which the loss is low: we repeatedly take a small step in the direction opposite to the gradient, which is the idea behind gradient descent.
RECAP

perceptron: linear combination of inputs
neural network: network of perceptrons, with scalar nonlinearities
training: (minibatch) gradient descent
But, how do we compute the gradient of a complex neural network? Next lecture: backpropagation.

This is the basic idea of neural networks. We define a perceptron, a simplified model of a neuron, which we chain together into a neural network, with nonlinearities added. We then define a loss, and train by gradient descent to find good weights.

What we haven't discussed is how to work out the gradient of a loss function over a neural network. For simple functions, like linear classifiers, this can be done by hand. For more complex functions, like very deep neural networks, this is no longer feasible, and we need some help.

This help comes in the form of the backpropagation algorithm. This is a complex and very important algorithm, so we will dive into it in the next lecture.
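To make the recap concrete, here is a minimal sketch of training by gradient descent on the same toy loss as above; the gradient is written out by hand here, which is exactly the step that backpropagation automates for real networks:

import numpy as np

def loss(w):
    # toy loss surface over two weights, minimum at (1.0, -0.5)
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def gradient(w):
    # for this toy loss we can write the gradient down by hand;
    # for a real network, backpropagation does this job
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([3.0, 2.0])      # start somewhere on the loss surface
learning_rate = 0.1
for step in range(100):
    w = w - learning_rate * gradient(w)   # small step against the gradient

print(w)   # close to the minimum at (1.0, -0.5)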
BINARY CLASSIFICATION

[Slide: the same network, now with a sigmoid on the output node; the top performs logistic regression on the extracted features, and the output y is read as p(C = Pos | x).]

If we have a classification problem with two classes, which we'll call positive and negative, we can place a sigmoid activation on the output layer, so that the output is between 0 and 1.

We can then interpret this as the probability that the input has the positive class (according to our network). The probability of the negative class is 1 minus this value.
LOG LOSS

So, what's our loss here? The situation is a little different from the regression setting. Here, the neural network predicts a number between 0 and 1, and the data only gives us a value that is true or false. Broadly, what we want from the loss is that it is a low value if the probability of the true class is close to one, and high if the probability of the true class is low.

y: predicted probability of heart disease, t: has heart disease? (pos or neg)

log loss: -log p(t | x), where p(Pos | x) = y and p(Neg | x) = 1 - y
LOG LOSS

The loss function for the softmax output is the same as it was for binary classification. We assume that the data tells us what the correct class is, and we take the negative log-probability of the correct class as the loss. This way, the higher the probability that the model assigns to the correct class, the lower the loss.

y1, y2, y3: predicted probabilities of the three classes, t: true class (1, 2, or 3)

log loss: loss = -log p(t | x), with p(t | x) = y_t
The loss functions we have seen, summarized:

regression:
    squared errors     ||y - t||² = Σᵢ (yᵢ - tᵢ)²     with y = model_θ(x)
    absolute errors    ||y - t||₁ = Σᵢ abs(yᵢ - tᵢ)    with y = model_θ(x)

classification:
    log loss / binary cross-entropy    -log p_θ(t)      with t ∈ {0, 1}
    log loss / cross-entropy           -log p_θ(t)      with t ∈ {0, ..., K}
    hinge loss                         max(0, 1 - ty)   with t ∈ {-1, 1}

The (binary) cross-entropy comes from logistic regression (as shown last lecture) and the hinge loss comes from support vector machine classification. You can find their derivations in most machine learning books/courses.

The loss can be computed for a single example or for multiple examples. In almost all cases, the loss for multiple examples is just the sum or average over all their individual losses.
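For example, averaging per-example losses into a batch loss is a one-liner:

import numpy as np

per_example_losses = np.array([0.26, 1.46, 0.12, 0.85])   # e.g. log losses of 4 instances
batch_loss = per_example_losses.mean()   # or .sum(), depending on convention
print(batch_loss)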
To see where the log loss comes from, we start with the maximum likelihood objective: we want to find the parameters θ that maximize the probability the model assigns to seeing the whole dataset.

argmax_θ p_θ(data)
    = argmax_θ ∏_{instance ∈ data} p_θ(instance)
    = argmax_θ ∏_{(x,t) ∈ data} p_θ(t | x)
    = argmax_θ log ∏_{(x,t) ∈ data} p_θ(t | x)
    = argmax_θ Σ_{(x,t) ∈ data} log p_θ(t | x)
    = argmin_θ (1/N) Σ_{(x,t) ∈ data} -log p_θ(t | x)

The first step is to note that our data consists of independent, identically distributed samples (x, t), where x is the feature vector and t is the corresponding class. This means that the probability of all the data is just the product of the probabilities of the individual instances.
Because they are independently sampled, we may
multiply their probabilities together.
Next, we note that we are only modelling the
probabilities of the classes, not of the features by
themselves (in fancy words, we have a discriminative
classifier). This means that the probability of each
instance is just the probability of the class given the
features.
Next, we stick a logarithm in front. This is a slightly
arbitrary choice, but if you work with probability
distributions a lot, you will know that taking logarithms
of probabilities almost always makes your life easier:
you get simpler functions, better numerical stability,
and better behaved gradients. Crucially, because the
logarithm is a monotonic function, the position of the
maximum doesn’t change: the model that maximizes
the probability of the data is the same as the model
that maximizes the log-probability.
Taking the logarithm inside the product turns the
product into a sum. Finally, we want something to
minimize, not maximize, so we stick a minus in front
and change the argmax to an argmin. We can then
rescale by any constant without moving the minimum.
If we use 1/N, with N the size of our data, then we end
up with the average log loss over the whole dataset.
That is, if we start with the maximum likelihood
objective, we can show step by step that this is
equivalent to minimizing the log loss.
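A small numerical illustration (my own) of why the logarithm helps in practice: the product of many probabilities underflows to zero in floating point, while the sum of their logarithms stays perfectly usable, and both have the same maximizer:

import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.4, 0.9, size=2000)   # p_theta(t | x) for 2000 instances

product = np.prod(probs)             # probability of the whole dataset
sum_of_logs = np.sum(np.log(probs))  # log-probability of the whole dataset

print(product)                    # 0.0: underflows in floating point
print(sum_of_logs)                # a perfectly usable (negative) number
print(-sum_of_logs / len(probs))  # the average log loss we minimize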
A deeper reason for the “- log” is that every probability
distribution can be thought of as a compression
algorithm, and the negative log2 probability of an outcome is the
number of bits you need to encode that outcome with this
compression algorithm. See this lecture for details.
[Slide: a neural network maps the input (age, weight) to an output y; y is used as the predicted mean of a Normal distribution over blood pressures, which we move toward the true value t.]

We can apply the same principle to regression. Instead of reading the output node as the predicted blood pressure directly, we treat it as the mean of a probability distribution on the space of all blood pressures.

To keep things simple, we fix the variance to 1 and predict only the mean. We can also give the neural network two outputs, and have it parametrize the whole normal distribution. We'll see examples of this later in the course.

Note that in this picture, we are moving the mean around to maximize the probability density of the true value t. We move the mean by changing the weights of the neural network. Of course, for every instance we see, t will be in a new place, so the weights should give us a new mean for every input x we see.
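To make the link with the squared error explicit (a sketch under the fixed-variance assumption): the negative log-density of a Normal distribution with mean y and variance 1 is half the squared error plus a constant, so maximizing the density is the same as minimizing the squared error:

import numpy as np

def neg_log_normal(t, mean, var=1.0):
    # negative log of the Normal density N(t ; mean, var)
    return 0.5 * np.log(2 * np.pi * var) + (t - mean) ** 2 / (2 * var)

t = 120.0                        # true blood pressure
for y in [110.0, 118.0, 120.0]:  # candidate predicted means
    # the two columns differ only by the constant 0.5*log(2*pi)
    print(y, neg_log_normal(t, mean=y), 0.5 * (y - t) ** 2)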
Each of our loss functions corresponds to maximizing the likelihood under a particular output distribution:

regression:
    squared errors    ||y - t||² = Σᵢ (yᵢ - tᵢ)²   with y = model_θ(x)    ↔  Normal distribution (fixed variance)

classification:
    log loss / binary cross-entropy    -log p_θ(t)      with t ∈ {0, 1}         ↔  Bernoulli distribution
    log loss / cross-entropy           -log p_θ(t)      with t ∈ {0, ..., K}    ↔  Categorical distribution
    hinge loss                         max(0, 1 - ty)   with t ∈ {-1, 1}        ↔  <none>

This correspondence between loss functions and probability distributions is important to understand, because it's a principle we will return to multiple times throughout the course.
Lecture 1: Introduction
Peter Bloem
Deep Learning
dlvu.github.io

|section|Autoencoders|
|video|https://www.youtube.com/embed/M1hh8HcWEzk?si=8OlYpUyuibS7PJI5|
PART THREE: AUTOENCODERS

A simple neural architecture to give you a hint of what's possible.

So, we've seen what neural networks are, and how to use them for regression and for classification. But the real power of neural networks is not in doing classical machine learning tasks. Rather, it's in their flexibility to grow beyond that. The idea is that neural networks can be set up in a wild variety of different configurations, to solve all sorts of different tasks.

To give you a hint of that, we'll finish up by looking at a simple example: the autoencoder.
AUTOENCODERS

bottleneck architecture for dimensionality reduction
the input should be as close as possible to the output
but: it must pass through a small representation

Here's what an autoencoder looks like. It's a particular type of neural network, shaped like an hourglass. Its job is just to make the output as close to the input as possible, but somewhere in the network there is a small layer that functions as a bottleneck.

We can set it up however we like, with one or many fully connected layers. The only requirements are (a) that one of the layers forms a bottleneck, and (b) that the input is the same size as the output.

The idea is that we simply train the neural network to reconstruct the input. If we manage to train a network that does this successfully, then we know that whatever value the bottleneck layer takes for a particular input is a low-dimensional representation of that input, from which we can pretty well decode the input, so it must contain the relevant details.

Note how powerful this idea is. We don't even need any labeled data. All we need is a large amount of examples (images, sentences, etc.) and with that we can train an autoencoder.
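A minimal sketch of the hourglass shape (made-up layer sizes, random untrained weights, biases omitted), just to show the structure and the reconstruction loss we would train on:

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# hourglass: 784 -> 128 -> 2 -> 128 -> 784 (e.g. for 28x28 images)
W_enc1, W_enc2 = rng.normal(0, 0.01, (128, 784)), rng.normal(0, 0.01, (2, 128))
W_dec1, W_dec2 = rng.normal(0, 0.01, (128, 2)),   rng.normal(0, 0.01, (784, 128))

def encode(x):
    return W_enc2 @ relu(W_enc1 @ x)   # z: the 2-dimensional bottleneck

def decode(z):
    return W_dec2 @ relu(W_dec1 @ z)   # reconstruction, same size as the input

x = rng.random(784)                            # a stand-in for one input image
x_rec = decode(encode(x))
reconstruction_loss = np.mean((x - x_rec) ** 2)  # train by minimizing this
print(x_rec.shape, reconstruction_loss)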
AFTER 25 EPOCHS

INTERPOLATION

So, now that we have an autoencoder, what can we do with it? One thing we can do is interpolation.

If we take two points in the latent space, and draw a line between them, we can pick evenly spaced points on that line and decode them. If the decoder is good, and all the points in the latent space decode to realistic examples, the result is a smooth transition from one input to the other.

source: https://blogs.nvidia.com/blog/2016/12/23/ai-lips-kanye-wests-frown-upside-down/
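A sketch of the interpolation step itself, assuming we have encode and decode functions like the ones sketched above (here with a stand-in decoder just to make it runnable):

import numpy as np

def interpolate(decode, z1, z2, steps=8):
    # evenly spaced points on the line from z1 to z2 in the latent space,
    # each decoded back to the input space
    return [decode((1 - a) * z1 + a * z2) for a in np.linspace(0.0, 1.0, steps)]

# usage with a stand-in decoder; in practice, decode comes from a trained autoencoder
decode = lambda z: np.concatenate([z, z ** 2])
frames = interpolate(decode, z1=np.array([0.0, 1.0]), z2=np.array([2.0, -1.0]))
print(len(frames), frames[0], frames[-1])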