
Lecture #19: Introduction to Deep Learning

Robert Yang
Department of Computer Science
Stanford University
Stanford, CA 94305
{bobyang9}@cs.stanford.edu

1 History of Deep Learning


Deep Learning is an important technique that can be used for numerous applications, including
computer vision. How were deep learning methods developed? Historically, the perceptron, the first
algorithm that led to deep learning models, was published in a paper by Rosenblatt in 1957. As we
will see later, multiple perceptrons could be used together to form a linear classifier. In the 1980s,
further improvements to the algorithm, such as backpropagation, were made, and researchers started
training these classifiers, albeit not very well. In the 1990s, researchers focused on graphical
models, and in the 2000s the support vector machine was very popular. In the 2010s, however, we
have revisited the ideas of the 1980s and their reliance on backpropagation to automatically learn our
models. Due to vastly improved computing power (e.g. GPUs) and large amounts of data, the formerly
clunky algorithms of the 1980s have become useful and accurate. But to understand today's advances in
deep learning, we must first understand its historical foundations: the perceptron model, the linear
classifier, and backpropagation.

2 The Perceptron
The structure of the perceptron model is defined by a few key aspects. First, the perceptron takes
in some input, which we can call x. x is a vector, because there can be multiple inputs (e.g. $x_1, x_2,
\ldots, x_n$). Second, the perceptron returns one output, which we can call y. In the original version
of the perceptron, y would be a scalar (one output only). Between the input and the output, the
perceptron contains a weighting function that combines the inputs of the perceptron ($x_1, x_2, \ldots,
x_n$) with weights w. w is a vector, because there is one weight for each input. Lastly, between the
weighting function and the output, there is a sign function, which is usually the following:

$$f(x) = \begin{cases} 1 & w \cdot x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
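To make this concrete, here is a minimal sketch of the perceptron's forward pass in NumPy (the variable names and the example values are illustrative, not from the lecture):

```python
import numpy as np

def perceptron_forward(x, w):
    """Perceptron forward pass: a weighted sum followed by the step function."""
    s = np.dot(w, x)           # weighted combination of the inputs
    return 1 if s >= 0 else 0  # step function from the formula above

# Illustrative example with 3 inputs.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, 0.1])
print(perceptron_forward(x, w))  # prints 0: the weighted sum is -0.1 < 0
```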

The perceptron algorithm was inspired by biology, but it is not how the brain works! The brain has
neurons that receive input through their dendrites, process that input in the cell body, and then send
output along the axon, which may reach other neurons. Similarly, the perceptron takes inputs, processes
that input, and then outputs the processed result. The weights of the perceptron are analogous to
the strength of the connections across the synapses between neurons. However, neurons are probably
much more complex than a simple weighting function, which is why the perceptron algorithm is only
loosely inspired by them.
We can gain intuition for the perceptron algorithm by looking at an application related to computer
vision. For example, the inputs to the perceptron might represent some sort of image (and we can encode
this in many different ways, such as using raw pixels, PCA, or BoW, among others). We can convert
this input into an output that signifies whether the input belongs to some class (e.g. airplane, bird,
etc.).

Computer Vision: Foundations and Applications (CS 131, 2017), Stanford University.
3 Linear Classifier
A linear classifier consists of multiple perceptrons put together. For example, you could use a linear
classifier to make predictions for multiple classes.
The structure of a linear classifier is similar to the structure of the perceptron. Continuing our
computer vision example, we see that the input is the same – a vector x that may perhaps contain
information about an image, whether that be a raw pixel representation or some other kind of feature
representation. However, the output is now a set of scores (not just one score). For example, you
could have a score for the airplane class, for the bird class, etc. We could classify by using the class
that has the highest score.
The conversion from input to output in a linear classifier is also defined by the weights w, similar
to those in a perceptron. w is a 2-D matrix with dimensions (number of classes, number of input
elements). We will show why in the example below.

Example 1.
For a certain linear classifier, let there be 10 classes, and let the input be a 32 x 32 x 3 image. Assume
that we are using a raw pixel representation. What is the shape of the weight matrix for the classifier?
Answer: First, we must find how many inputs there are. Because we are using a raw pixel representa-
tion, the number of inputs will be the number of pixels, which is 32 * 32 * 3 = 3072 inputs. We also
know that there are 10 outputs because there are 10 classes (1 output for each class). For each class,
we have a set of 3072 weights, and we have 10 sets in total. These will be organized in a matrix of
shape (10, 3072), following the (number of classes, number of input elements) convention above.
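As a sanity check, here is a small NumPy sketch of the score computation for this example (the random weights and input are placeholders):

```python
import numpy as np

num_classes, num_inputs = 10, 32 * 32 * 3  # 3072 raw pixel inputs

W = np.random.randn(num_classes, num_inputs)  # one row of weights per class
x = np.random.randn(num_inputs)               # a flattened 32x32x3 image

scores = W @ x                 # shape (10,): one score per class
prediction = scores.argmax()   # classify using the highest-scoring class
print(scores.shape, prediction)
```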

As in the example, each class has a set of $n_{input}$ weights, where $n_{input}$ is the dimension of
the input vector. You can actually visualize the weights as images.

Sometimes, you also add a bias vector to your results. Intuitively, this helps learn the bias for or
against one class in the dataset. For example, if there were more birds in the dataset, you would have
a big bias term for the element corresponding to the bird class.

4 Loss Function
We need to pick the weights so that they minimize the error our model makes on the dataset. Formally,
this is $\min_W \text{Loss}(y, \hat{y})$, where $y$ is the true label and $\hat{y}$ is the predicted label.
Suppose we are given several training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ and a
perceptron $\hat{y} = wx$, where $x_i$ is an image and $y_i$ is an integer label (0 for dog, 1 for cat, etc.).
A loss function $L_i(y_i, \hat{y}_i)$ tells us how good our current classifier is. When the classifier
predicts correctly ($\hat{y}_i$ is close to $y_i$), the loss should be low. When the classifier predicts
incorrectly ($\hat{y}_i$ is far from $y_i$), the loss should be high. We can average the loss over the
entire dataset to see how well our
model is doing in general.

4.1 Different Types of Loss Functions

L2 is the squared error loss function. The goal is to minimize the squared difference between the
ground truth labels and the predictions. However, the L2 loss is not very robust to outliers: a single
unusual example can increase the loss by a lot.
Formula for L2 loss: $L_i(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$
L1 is a loss function similar to L2 which aims to minimize the absolute value of the difference
between the ground truth labels and the predictions. The absolute value is there because you want the
loss to be nonnegative, with zero loss meaning that you classify your dataset perfectly.
Formula for L1 loss: $L_i(y_i, \hat{y}_i) = |y_i - \hat{y}_i|$
Zero-One is a loss function that only measures whether the predictions are correct or incorrect, not
how correct or incorrect they are. If the model makes an error, the loss is 1; if it does not, the loss
is 0. One weakness of this loss function is that models trained with it take longer to train, because
it is less informative about the correctness of the predictions.
Formula for Zero-One loss: $L_i(y_i, \hat{y}_i) = \mathbb{1}[y_i \ne \hat{y}_i]$
Hinge is another loss function that grows bigger as the difference between the prediction and the
ground truth increases.
Formula for Hinge loss: $L_i(y_i, \hat{y}_i) = \max(0, 1 - y_i \hat{y}_i)$
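These four losses are easy to write down in code; here is a brief sketch (function names are my own, and the hinge loss assumes $\pm 1$ labels as is conventional):

```python
def l2_loss(y, y_hat):
    return (y - y_hat) ** 2   # squared error

def l1_loss(y, y_hat):
    return abs(y - y_hat)     # absolute error

def zero_one_loss(y, y_hat):
    return float(y != y_hat)  # 1 on a mistake, 0 otherwise

def hinge_loss(y, y_hat):
    # Assumes labels y in {-1, +1}, as is conventional for the hinge loss.
    return max(0.0, 1.0 - y * y_hat)

print(l2_loss(1.0, 0.5), hinge_loss(1, 0.5))  # 0.25 0.5
```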

5 The KL Divergence Loss Function and the Softmax Linear Classifier


Sometimes we want the model to output probabilities instead of raw scores, because probabilities are
easier to interpret. We need to convert the vector of scores into a probability distribution.
There are no limits on the output space of the linear classifier's scores, which means that values
might appear which do not signify probabilities (values greater than 1 or less than 0). Softmax is a
method to convert the outputs into probabilities in the range [0, 1]. It does so by taking the exponential
of the outputs and then normalizing them, as in the formula below:
$$\text{Prob}[f(x_i, W) = k] = \frac{e^{\hat{y}_k}}{\sum_j e^{\hat{y}_j}}$$

The KL divergence allows you to calculate the distance between two probability distributions. First,
we can define P (y) as the ground truth distribution, and Q(y) as the model’s output score distribution
(the predictions). KL divergence is defined as the following:
$$D_{KL}(P \,\|\, Q) = \sum_y P(y) \log \frac{P(y)}{Q(y)}$$
Usually, $P(y)$ is probability 1 for one class and 0 for all other classes. In such cases, since $P(y)$
is 0 for all the other classes, the sum reduces to $\log(\frac{1}{Q(y)})$, which can be rewritten
as $-\log(Q(y))$, where $y$ is the class with probability 1 in $P(y)$.
Notice that softmax and KL divergence work together very well, because the exponential in the softmax
and the logarithm in the KL divergence cancel out. Furthermore, feeding the probabilities given by
softmax into the KL divergence gives a loss ranging from 0 to infinity, which is a good range.
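Here is a short, numerically stable sketch of this pipeline in NumPy (subtracting the max before exponentiating is a standard trick to avoid overflow; the scores are placeholders):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # shift by max for numerical stability
    return e / e.sum()

def kl_loss(probs, k):
    # With a one-hot ground truth, the KL divergence reduces to -log Q(k).
    return -np.log(probs[k])

scores = np.array([2.0, 1.0, 0.1])  # raw linear classifier scores
probs = softmax(scores)             # probabilities in [0, 1] summing to 1
print(probs, kl_loss(probs, k=0))
```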

Example 2.
Let C be the number of classes in the classifier. In the scenario where the classifier outputs the
same result for all the classes, and softmax is used to normalize the output, what would be the KL
divergence loss?
Answer: If the classifier outputs the same result for all the classes, then each class will have a
probability of $\frac{1}{C}$ after softmax. We know that the KL divergence loss would be $-\log(Q(y))$, and we
know that $Q(y)$ is $\frac{1}{C}$, because no matter which of the classes has a 1 in the ground truth, the
prediction would be $\frac{1}{C}$. Thus, the KL divergence loss would be $-\log(\frac{1}{C})$, which is $\log(C)$.
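This is easy to verify numerically (C = 10 here is an arbitrary choice):

```python
import numpy as np

C = 10
scores = np.zeros(C)                           # same score for every class
probs = np.exp(scores) / np.exp(scores).sum()  # softmax: each entry is 1/C
loss = -np.log(probs[0])                       # -log(1/C)
print(np.isclose(loss, np.log(C)))             # True
```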

A summary:
Softmax allows us to convert scores into probabilities.
KL Divergence allows us to calculate the distance between the predicted probabilities and the ground
truth.

6 Gradient Descent

Gradient Descent is a technique used to find optimal weights that minimize the loss.

6.1 Intuition of Gradient Descent

In Gradient Descent, you iterate through the following loop:


1. First, compute the derivative of the loss with respect to the weights; this tells you which
direction to move the weights in order to lower the loss.
2. Then, move in that direction by updating the weights.
You repeat these two steps until you reach a local minimum. In this way, the gradient descent
algorithm will slowly decrease the loss.

6.2 Theory

We can calculate what the gradient looks like given our loss function. First, we have a training data
point $(x, y)$, and we are using the linear classifier $\hat{y} = Wx$. Let us define $k$ as the correct
class. We know that the loss is $-\log(\text{softmax}(\hat{y}_k))$, where $\hat{y}_k$ is the score given
to the correct class by the linear classifier.
Evaluating that expression, we get
$$\text{Loss} = L(\hat{y}, y) = -\log \frac{e^{\hat{y}_k}}{\sum_j e^{\hat{y}_j}} = -\hat{y}_k + \log \sum_j e^{\hat{y}_j}.$$

Now that we have our loss function, we need the derivative of the loss with respect to each
weight, $\frac{dL}{dW}$. However, since the loss function is in terms of $\hat{y}$, we will need to use the chain rule:
$$\frac{dL}{dW} = \frac{dL}{d\hat{y}} \frac{d\hat{y}}{dW}.$$
First, we need to calculate $\frac{dL}{d\hat{y}}$. To do this, we need to consider 2 cases.
Case 1: We are calculating the derivative with respect to $\hat{y}_k$, the score of the correct class $k$.
In this case, the derivative of the loss function is
$$\frac{dL}{d\hat{y}_k} = \frac{d}{d\hat{y}_k}(-\hat{y}_k) + \frac{d}{d\hat{y}_k}\left(\log \sum_j e^{\hat{y}_j}\right) = -1 + \frac{e^{\hat{y}_k}}{\sum_j e^{\hat{y}_j}}.$$

Case 2: We are calculating the derivative with respect to $\hat{y}_l$ for a class $l \ne k$. In this case,
the first term vanishes, since $-\hat{y}_k$ does not depend on $\hat{y}_l$, and the derivative of the loss
function is
$$\frac{dL}{d\hat{y}_l} = \frac{d}{d\hat{y}_l}(-\hat{y}_k) + \frac{d}{d\hat{y}_l}\left(\log \sum_j e^{\hat{y}_j}\right) = \frac{e^{\hat{y}_l}}{\sum_j e^{\hat{y}_j}}.$$

We also know that since $\hat{y} = Wx$, $\frac{d\hat{y}}{dW} = x$. Thus, we see that
$$\frac{dL}{d\hat{y}} = \begin{pmatrix} \frac{e^{\hat{y}_0}}{\sum_j e^{\hat{y}_j}} \\[6pt] \frac{e^{\hat{y}_1}}{\sum_j e^{\hat{y}_j}} \\[6pt] \vdots \\[6pt] -1 + \frac{e^{\hat{y}_k}}{\sum_j e^{\hat{y}_j}} \\[6pt] \vdots \\[6pt] \frac{e^{\hat{y}_{n-2}}}{\sum_j e^{\hat{y}_j}} \\[6pt] \frac{e^{\hat{y}_{n-1}}}{\sum_j e^{\hat{y}_j}} \end{pmatrix}$$
where $n$ is the number of classes.
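In vector form, this derivative is simply softmax($\hat{y}$) minus a one-hot vector for the correct class, and the chain rule turns it into an outer product with $x$. A brief NumPy sketch (all names are illustrative):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def gradients(W, x, k):
    """Gradient of the softmax + KL loss with respect to scores and weights."""
    y_hat = W @ x
    dL_dy = softmax(y_hat)
    dL_dy[k] -= 1.0             # the -1 in the entry for the correct class k
    dL_dW = np.outer(dL_dy, x)  # chain rule: dL/dW = (dL/dy_hat) x^T
    return dL_dy, dL_dW

W = np.random.randn(10, 3072) * 0.01
x = np.random.randn(3072)
dL_dy, dL_dW = gradients(W, x, k=3)
print(dL_dy.shape, dL_dW.shape)  # (10,) (10, 3072)
```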

6.3 Pseudocode of Gradient Descent

Now, we can translate the theory into code. First, we calculate the loss by looping through the data
points and averaging their individual losses. Then, we calculate the derivative of the loss with
respect to each weight and update each weight, as in the sketch below.

α is a hyperparameter that represents the step size. You can tune this hyperparameter to find the optimal
step size. If α is too small, your algorithm will take a long time (many iterations) to finish, while
if α is too big, your algorithm might overshoot the local minimum.
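Since the original code figure is not reproduced here, the following is a minimal self-contained sketch of the loop just described (the dataset, α, and the iteration count are placeholders):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

alpha = 1e-3   # step size (hyperparameter)
W = np.random.randn(10, 3072) * 0.01
# Placeholder dataset: 50 random "images" with integer labels.
data = [(np.random.randn(3072), np.random.randint(10)) for _ in range(50)]

for it in range(100):
    grad = np.zeros_like(W)
    loss = 0.0
    for x, k in data:
        p = softmax(W @ x)
        loss += -np.log(p[k])       # KL / cross-entropy loss for this point
        dL_dy = p.copy()
        dL_dy[k] -= 1.0             # gradient with respect to the scores
        grad += np.outer(dL_dy, x)  # chain rule to the weights
    W -= alpha * grad / len(data)   # step downhill along the average gradient
```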

7 Backpropagation

Backpropagation is a method for computing gradients that visualizes the computation as a graph. For
example, you would first build a graph that computes the loss.

You would first apply the multiplication operator on W and x to get ŷ. Afterwards, you would
calculate the loss from ŷ and y. This step of going forward to find the loss is called the forward pass.
Afterwards, you go backwards to calculate the gradients.

You first calculate $\frac{dL}{d\hat{y}}$ using the methods described in section 6.2. Then, you can calculate values
such as $\frac{dL}{dW}$ by using the chain rule ($\frac{dL}{dW} = \frac{dL}{d\hat{y}} \frac{d\hat{y}}{dW}$). In the example above, because $\hat{y} = Wx$ with $x = 2$,
we have $\frac{d\hat{y}}{dW} = x = 2$. Thus, $\frac{dL}{dW} = \frac{dL}{d\hat{y}} \frac{d\hat{y}}{dW} = (1.2)(2) = 2.4$.
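A tiny sketch of this forward/backward pass on the graph, using the numbers quoted above ($x = 2$ and the upstream gradient $\frac{dL}{d\hat{y}} = 1.2$ come from the slide example; the value of W is an arbitrary placeholder):

```python
# Forward pass: multiply W and x to get y_hat (loss node omitted for brevity).
W, x = 3.0, 2.0       # x = 2 as in the text; W is an arbitrary placeholder
y_hat = W * x         # multiplication node

# Backward pass: start from dL/dy_hat and apply the chain rule.
dL_dy_hat = 1.2       # upstream gradient quoted in the text
dy_hat_dW = x         # local gradient of the multiplication node w.r.t. W
dL_dW = dL_dy_hat * dy_hat_dW
print(dL_dW)          # 2.4
```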

8 Neural Networks

8.1 Basics of Neural Networks

We have many ways of producing features for a dataset: using the raw pixels; injecting positional
information by using the raw pixels together with their (x, y) coordinates; using LDA or PCA to find
informative features in a supervised or unsupervised manner, respectively; or using a set of
interesting areas in an image, called a "Bag of Words". If we are using the dataset for classification,
we could then use a linear classifier to separate the data. However, sometimes the data isn't linearly
separable, leading the classifier to perform poorly.
Finding good, linearly separable features for use in classification is still a difficult and somewhat
finicky task. Namely, we don't know what data we are going to get, and each dataset might require a
different feature encoding method to produce the best results possible. However, there is a solution:
design a function that converts features which are not linearly separable into features that are. For
example, by applying a transformation to (r, θ) coordinates, you can make the data linearly separable,
as in the sketch below.
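As an illustration of such a transformation, here is a sketch that maps 2-D points from Cartesian (x, y) to polar (r, θ) coordinates; for data arranged in concentric rings (a common example of this idea, assumed here), the classes become separable by r alone:

```python
import numpy as np

def to_polar(points):
    """Map (x, y) points to (r, theta) features."""
    x, y = points[:, 0], points[:, 1]
    r = np.hypot(x, y)         # distance from the origin
    theta = np.arctan2(y, x)   # angle
    return np.stack([r, theta], axis=1)

# Two concentric rings are not linearly separable in (x, y) ...
angles = np.linspace(0, 2 * np.pi, 100)
inner = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # radius 1
outer = 3 * inner                                           # radius 3
# ... but in (r, theta) space a threshold on r separates them perfectly.
print(to_polar(inner)[:, 0].max(), to_polar(outer)[:, 0].min())  # 1.0 3.0
```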

In other words, the function will learn what features to feed into our linear classifier: the outputs
of that function become the inputs of the linear classifier.

This is a neural network with 2 layers, where the first layer generally works to learn better features
and the second layer classifies using those features. In a neural network with n layers, the first n − 1
layers learn better features, and the last layer, usually a softmax, performs the classification.
For the 2-layer network, we can calculate $\hat{y}$ using the following formula:
$$\hat{y} = W_2 \max(0, W_1 x).$$
Why is the max function necessary? Notice that one property of linear functions like matrix
multiplication is that no matter how many of them you compose, the transformation as a whole is still
linear. Thus, if you got rid of the max function, you would have $\hat{y} = W_2 W_1 x$, which would just
collapse to $\hat{y} = Wx$ where $W = W_2 W_1$.
We can also calculate the shape of the matrices that hold the weights.
Example 3. Let the shape of x, the input, be (3072, 1), let the shape of y, the output, be (10, 1), and
let the shape of a, the hidden units, be (h, 1). What is the shape of W1 and W2 ?
Answer: Since $a = W_1 x$, $W_1$ must have shape (h, 3072) so that $W_1 x$ has shape (h, 1). Also, since
$\hat{y} = W_2 a$, $W_2$ must have shape (10, h) so that $W_2 a$ has shape (10, 1).
Note: When designing the structure of the network, we determine the value of h, so h is another
hyperparameter.
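A quick sketch of this 2-layer forward pass with the shapes from Example 3 (h = 100 is an arbitrary choice):

```python
import numpy as np

h = 100                     # hidden size: a hyperparameter we pick
W1 = np.random.randn(h, 3072) * 0.01
W2 = np.random.randn(10, h) * 0.01
x = np.random.randn(3072)   # input of shape (3072,)

a = np.maximum(0, W1 @ x)   # first layer plus the max(0, .) nonlinearity
y_hat = W2 @ a              # second layer produces the class scores
print(a.shape, y_hat.shape) # (100,) (10,)
```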

8.2 Activation Function

The max function in the formula ŷ = W2 max(0, W1 x) for calculating the prediction of a 2-layer
neural network is one of many activation functions (also called nonlinearities because they prevent
the neural network from collapsing back into one linear combination). Such functions allow models
to learn more complex transformations for features. There are many activation functions, as seen
below:

The sigmoid function is very popular and widely used, but it is one of the slower functions, while tanh
is similar to sigmoid in its properties but is centered at 0. ReLU (the rectified linear unit) is the
function used in most neural networks today because of its double role in providing nonlinearity
and regularization, and it is one of the faster functions. Leaky ReLU is an improved version of ReLU.
Finally, Maxout and ELU are seen less often but are still used in some cases. The choice of activation
function is another hyperparameter, so try different ones to see which works best.
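For reference, here is a brief sketch of the activation functions named above (the 0.01 leak slope and α = 1 are typical defaults, not values from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes inputs to (0, 1)

def tanh(x):
    return np.tanh(x)                     # like sigmoid, but centered at 0

def relu(x):
    return np.maximum(0, x)               # rectified linear unit

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)  # small slope for negative inputs

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-2, 2, 5)
print(relu(x), leaky_relu(x))
```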

9 Conclusion
Understanding the history of deep learning and how the algorithms and models developed will be
crucial to understanding how deep learning works. The models of deep learning are made of stacks
of linked linear classifiers. Deep learning algorithms evaluate their performance using loss functions.
Softmax remains popular at the output of deep learning algorithms. Furthermore, backpropagation
and gradient descent (along with its more advanced variants), which give the model the ability to learn
its weights, are crucial for performance on increasingly complex tasks in computer vision and elsewhere.
