CS 188 Introduction To Artificial Intelligence Fall 2017 Note 10 Neural Networks: Motivation
We would like to separate the two colors, and clearly there is no way this can be done in a single dimension
(a single dimensional decision boundary would be a point, separating the axis into two regions).
To fix this problem, we can add additional (potentially nonlinear) features from which to construct a decision boundary. Consider the same dataset with the addition of $x^2$ as a feature:
With this additional piece of information, we are now able to construct a linear separator in the two di-
mensional space containing the points. In this case, we were able to fix the problem by mapping our data
to a higher dimensional space by manually adding useful features to datapoints. However, in many high-
dimensional problems, such as image classification, manually selecting features that are useful is a tedious
problem. This requires domain-specific effort and expertise, and works against the goal of generalization
across tasks. A natural desire is to learn these featurization or transformation functions as well, perhaps
using a nonlinear function class that is capable of representing a wider variety of functions.
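To make this concrete, here is a minimal sketch (the datapoints, labels, and separator weights are hypothetical, chosen only for illustration) of mapping a 1D dataset to 2D with the added $x^2$ feature:

```python
import numpy as np

# Hypothetical 1D dataset: points near the origin are one class, points
# far from it are the other -- not separable by any single threshold on x.
x = np.array([-2.0, -1.5, -0.5, 0.5, 1.5, 2.0])
y = np.array([1, 1, -1, -1, 1, 1])

# Map each point to the higher-dimensional feature vector (x, x^2).
features = np.stack([x, x ** 2], axis=1)

# In the new space, the linear separator x^2 = 1.0 classifies every
# point correctly: w = (0, 1), bias = -1.0.
w, b = np.array([0.0, 1.0]), -1.0
predictions = np.sign(features @ w + b)
print(predictions)               # [ 1.  1. -1. -1.  1.  1.]
print((predictions == y).all())  # True
```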
Multi-layer Perceptron
Let’s examine how we can derive a more complex function from our original perceptron architecture. Consider the following setup: a two-layer perceptron, which is a perceptron that takes as input the outputs of another perceptron.
With this additional structure and weights, we can express a much wider set of functions.
By increasing the complexity of our model, we in turn greatly increase its expressive power. Multi-layer perceptrons give us a generic way to represent a much wider set of functions. In fact, a multi-layer perceptron is a universal function approximator: it can approximate any continuous real-valued function to arbitrary accuracy, leaving us only with the problem of selecting the best set of weights to parameterize our network. This is formally stated below:
Theorem. (Universal Function Approximators) A two-layer neural network with a sufficient number
of neurons can approximate any continuous real-valued function to any desired accuracy.
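For a concrete taste of this added expressiveness, here is a minimal sketch of a two-layer perceptron computing XOR, a function no single perceptron can represent; the weights are hand-picked for illustration:

```python
import numpy as np

def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return (z > 0).astype(float)

def two_layer_perceptron(x):
    # Hidden layer: one unit fires on OR(x1, x2), the other on AND(x1, x2).
    W1 = np.array([[1.0, 1.0],    # OR unit
                   [1.0, 1.0]])   # AND unit
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output layer: OR minus AND gives XOR.
    w2 = np.array([1.0, -1.0])
    return step(w2 @ h - 0.5)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", two_layer_perceptron(np.array(x, dtype=float)))
# [0, 0] -> 0.0, [0, 1] -> 1.0, [1, 0] -> 1.0, [1, 1] -> 0.0
```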
Measuring Accuracy
The accuracy of the binary perceptron after making $m$ predictions can be expressed as:

$$l_{acc}(w) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\left(\operatorname{sgn}(w \cdot f(x^{(i)})) == y^{(i)}\right)$$

where $x^{(i)}$ is datapoint $i$, $w$ is our weight vector, $f$ is our function that derives a feature vector from a raw datapoint, and $y^{(i)}$ is the actual class label of $x^{(i)}$. In this context, $\operatorname{sgn}(x)$ represents the sign function, which outputs $+1$ when its argument is positive and $-1$ when it is negative, and $\mathbf{1}$ is the indicator function, which evaluates to $1$ when its condition is true and $0$ otherwise.
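This expression translates directly to code; here is a minimal sketch (the feature vectors, labels, and weights below are hypothetical, chosen only for illustration):

```python
import numpy as np

def accuracy(w, features, labels):
    """Fraction of points where sgn(w . f(x)) matches the +1/-1 label."""
    predictions = np.sign(features @ w)
    return np.mean(predictions == labels)

# Hypothetical feature vectors (one row per datapoint) and +1/-1 labels.
features = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
labels = np.array([1, 1, -1])
w = np.array([1.0, 0.5])
print(accuracy(w, features, labels))  # 1.0
```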
Given a vector $z$ that is output by our function $f$, softmax performs normalization to output a probability distribution over the classes:

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

To come up with a general loss function for our models, we can use this probability distribution to generate an expression for the likelihood of a set of weights:
$$\ell(w) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; w)$$
This expression denotes the likelihood of a particular set of weights explaining the observed labels and
datapoints. We would like to find the set of weights that maximizes this quantity. This is identical to finding
the maximum of the log-likelihood expression (since log is an increasing function, the maximizer of one
will be the maximizer of the other):
$$\ell\ell(w) = \log \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; w) = \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; w)$$
(Depending on the application, the formulation as a sum of log probabilities may be more useful – for
example in mini-batched or stochastic gradient descent; see the Neural Networks: Optimization section
below.) In the case where the log likelihood is differentiable with respect to the weights, we will discuss a
simple algorithm to optimize it.
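As a minimal sketch (the class scores and labels below are hypothetical), the softmax and log-likelihood computations look like this in code:

```python
import numpy as np

def softmax(z):
    """Normalize a score vector into a probability distribution."""
    exp = np.exp(z - z.max())  # subtract max for numerical stability
    return exp / exp.sum()

def log_likelihood(scores, labels):
    """Sum of log P(y_i | x_i; w), given per-datapoint class scores."""
    return sum(np.log(softmax(z)[y]) for z, y in zip(scores, labels))

# Hypothetical scores for 2 datapoints over 3 classes, with true labels.
scores = [np.array([2.0, 0.5, -1.0]), np.array([0.1, 1.2, 0.3])]
labels = [0, 1]
print(log_likelihood(scores, labels))
```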
Recall that the perceptron makes its predictions by passing the score $w \cdot f(x)$ through the step function $\operatorname{sgn}(x)$, graphed below:

[Figure: plot of the step function, which jumps from $-1$ to $1$ at $x = 0$]
This is difficult to optimize for a number of reasons which will hopefully become clearer when we address gradient descent. Firstly, it is not continuous, and secondly, it has a derivative of zero at every point where it is differentiable. Intuitively, this means that we cannot know in which direction to look for a local minimum of the function, which makes it difficult to minimize loss in a smooth way.
Instead of using a step function like the one above, a better solution is to select a continuous function. We have many options for such a function, including the sigmoid function (named for the Greek σ, or ’s’, as it looks like an ’s’) as well as the rectified linear unit (ReLU). Let’s look at their definitions and graphs below:
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$

[Figure: plot of the sigmoid function, rising smoothly from 0 to 1 over $x \in [-10, 10]$]
ReLU: $f(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}$

[Figure: plot of the ReLU function, flat at 0 for negative inputs and linear for nonnegative inputs]
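Both activation functions are one-liners in code; a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    """Squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Zero for negative inputs, identity for nonnegative inputs."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))  # smooth values strictly between 0 and 1
print(relu(x))     # [0.  0.  0.  0.5 2. ]
```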
Calculating the output of a multi-layer perceptron is done as before, with the difference that at the output of each layer we now apply one of our new nonlinearities (chosen as part of the architecture for the neural network) instead of the initial indicator function. In practice, the choice of nonlinearity is a design decision that typically requires some experimentation to find a good fit for each individual use case.
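For instance, here is a minimal sketch of a forward pass through a two-layer network with ReLU activations (the layer sizes and random weights are illustrative, not from the note):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: 3 input features, 4 hidden units, 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def relu(x):
    return np.maximum(0.0, x)

def forward(x):
    h = relu(W1 @ x + b1)  # nonlinearity applied at the hidden layer
    return W2 @ h + b2     # raw output scores (e.g. fed to softmax)

print(forward(np.array([1.0, -0.5, 2.0])))
```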
To maximize our log-likelihood function, we differentiate it to obtain a gradient vector consisting of its partial derivatives with respect to each parameter:

$$\nabla_w \ell\ell(w) = \left( \frac{\partial \ell\ell(w)}{\partial w_1}, \ldots, \frac{\partial \ell\ell(w)}{\partial w_n} \right)$$
This gradient vector gives the local direction of steepest ascent (or descent if we reverse the vector). Gradi-
ent ascent is a greedy algorithm that calculates this gradient for the current values of the weight parameters,
then updates the parameters along the direction of the gradient, scaled by a step size, α. Specifically the
algorithm looks as follows:
Initialize weights $w$
For $i = 0, 1, 2, \ldots$:
    $w \leftarrow w + \alpha \nabla_w \ell\ell(w)$

If rather than maximizing we instead wanted to minimize a function $f$, the update should subtract the scaled gradient ($w \leftarrow w - \alpha \nabla_w f(w)$) – this gives the gradient descent algorithm.
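Here is a minimal sketch of gradient ascent on a toy concave objective (the objective, step size, and iteration count are illustrative choices):

```python
import numpy as np

# Toy concave objective f(w) = -(w1 - 3)^2 - (w2 + 1)^2, maximized at (3, -1).
def grad_f(w):
    """Gradient of the toy objective at w."""
    return np.array([-2.0 * (w[0] - 3.0), -2.0 * (w[1] + 1.0)])

w = np.zeros(2)  # initialize weights
alpha = 0.1      # step size
for _ in range(100):
    w = w + alpha * grad_f(w)  # step along the gradient (ascent)

print(w)  # close to [ 3. -1.]
```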
To compute these gradients efficiently through the nested functions that make up a neural network, we use the chain rule. For a function $f(x_1, \ldots, x_n)$ whose inputs each depend on a variable $t_i$, the chain rule states:

$$\frac{\partial f}{\partial t_i} = \frac{\partial f}{\partial x_1} \cdot \frac{\partial x_1}{\partial t_i} + \frac{\partial f}{\partial x_2} \cdot \frac{\partial x_2}{\partial t_i} + \ldots + \frac{\partial f}{\partial x_n} \cdot \frac{\partial x_n}{\partial t_i}$$
In the context of computation graphs, this means that to compute the gradient of the output $z$ with respect to a given node $t_i$, we take a sum of $|\text{children}(t_i)|$ terms, one for each child of $t_i$ in the graph.
[Figure 1: computation graph for $(x + y) * z$ with $x = 2$, $y = 3$, $z = 4$; forward values (green) flow left to right, gradients (red) flow right to left]
Figure 1 shows an example computation graph for computing (x + y) ∗ z with the values x = 2, y = 3, z = 4.
We will write g = x + y and f = g ∗ z. Values in green are the outputs of each node, which we compute in
the forward pass, where we apply each node’s operation to its input values coming from its parent nodes.
Values in red after each node give gradients of the function computed by the graph, which are computed in the backward pass: the value after each node is the partial derivative of the final node's value $f$ with respect to the variable at that node. For example, the red value 4 after $g$ is $\frac{\partial f}{\partial g}$, and the red value 4 after $x$ is $\frac{\partial f}{\partial x}$. In our simple example, $f$ is just a multiplication node which outputs the product of its two input operands, but in a real neural network the final node will usually compute the loss value that we are trying to minimize. The backward pass computes gradients by starting at the final node (which has a gradient of 1, since $\frac{\partial f}{\partial f} = 1$) and passing and updating gradients backward through the graph. Intuitively, each node's gradient measures how much a small change at that node affects the final output:
1. Since $f$ is our final node, it has gradient $\frac{\partial f}{\partial f} = 1$. Then we compute the gradients for its children, $g$ and $z$. We have $\frac{\partial f}{\partial g} = \frac{\partial}{\partial g}(g \cdot z) = z = 4$, and $\frac{\partial f}{\partial z} = \frac{\partial}{\partial z}(g \cdot z) = g = 5$.
2. Now we can move on upstream to compute the gradients of $x$ and $y$. For these, we’ll use the chain rule and reuse the gradient we just computed for $g$, $\frac{\partial f}{\partial g}$.
3. For $x$, we have $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}$ by the chain rule – the product of the gradient coming from $g$ with the partial derivative for $x$ at this node. We have $\frac{\partial g}{\partial x} = \frac{\partial}{\partial x}(x + y) = \frac{\partial}{\partial x}x + \frac{\partial}{\partial x}y = 1 + 0 = 1$, so $\frac{\partial f}{\partial x} = 4 \cdot 1 = 4$. Intuitively, the amount that a change in $x$ contributes to a change in $f$ is the product of the amount that a change in $g$ contributes to a change in $f$ with the amount that a change in $x$ contributes to one in $g$.
4. The process for computing the gradient of the output with respect to $y$ is almost identical. For $y$ we have $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial y}$ by the chain rule – the product of the gradient coming from $g$ with the partial derivative for $y$ at this node. We have $\frac{\partial g}{\partial y} = \frac{\partial}{\partial y}(x + y) = \frac{\partial}{\partial y}x + \frac{\partial}{\partial y}y = 0 + 1 = 1$, so $\frac{\partial f}{\partial y} = 4 \cdot 1 = 4$.
Since the backward pass step for a node in general depends on the node’s inputs (which are computed in the
forward pass), and gradients computed “downstream” of the current node by the node’s children (computed
earlier in the backward pass), we cache all of these values in the graph for efficiency. Taken together, the
forward and backward pass over the graph make up the backpropagation algorithm.
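As a minimal sketch, the forward and backward passes for this exact graph can be written out by hand (variable names mirror the figure):

```python
# Forward pass: compute and cache each node's output.
x, y, z = 2.0, 3.0, 4.0
g = x + y  # g = 5
f = g * z  # f = 20

# Backward pass: start from df/df = 1 and move backward.
df_df = 1.0
df_dg = z * df_df    # multiplication node: d(g*z)/dg = z -> 4
df_dz = g * df_df    # d(g*z)/dz = g -> 5
df_dx = df_dg * 1.0  # addition node: dg/dx = 1 -> 4
df_dy = df_dg * 1.0  # dg/dy = 1 -> 4

print(f, df_dx, df_dy, df_dz)  # 20.0 4.0 4.0 5.0
```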
[Figure 2: computation graph for $((x + y) + (x \cdot y)) \cdot z$ with $x = 2$, $y = 3$, $z = 4$; forward values (green) flow left to right, gradients (red) flow right to left]
For an example of applying the chain rule for a node with multiple children, consider the graph in Figure 2, representing $((x + y) + (x \cdot y)) \cdot z$, with $x = 2$, $y = 3$, $z = 4$. $x$ and $y$ are each used in two operations, and so each has two children. By the chain rule, their gradient values are the sum of the gradients computed for them by each of their children.
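A minimal sketch of this gradient accumulation, again written out by hand for the graph in Figure 2:

```python
# Forward pass for ((x + y) + (x * y)) * z.
x, y, z = 2.0, 3.0, 4.0
g = x + y  # g = 5
i = x * y  # i = 6
h = g + i  # h = 11
f = h * z  # f = 44

# Backward pass.
df_dh = z            # d(h*z)/dh -> 4
df_dz = h            # d(h*z)/dz -> 11
df_dg = df_dh * 1.0  # dh/dg = 1 -> 4
df_di = df_dh * 1.0  # dh/di = 1 -> 4
# x and y each have two children (g and i), so their gradients are sums.
df_dx = df_dg * 1.0 + df_di * y  # 4*1 + 4*3 = 16
df_dy = df_dg * 1.0 + df_di * x  # 4*1 + 4*2 = 12

print(f, df_dx, df_dy, df_dz)  # 44.0 16.0 12.0 11.0
```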
Conclusion
In this note, we generalized the perceptron to neural networks, models which are powerful (and universal!)
function approximators but can be difficult to design and train. We talked about how the expressiveness of
neural networks comes from the activation functions they employ to introduce nonlinearities, as well as how
to optimize their parameters with backpropagation and gradient descent. Currently, a large portion of new
research in machine learning focuses on various aspects of neural network design such as:
1. Network Architectures - designing a network (choosing activation functions, number of layers, etc.)
that’s a good fit for a particular problem
2. Learning Algorithms - how to find parameters that achieve a low value of the loss function, a difficult
problem since gradient descent is a greedy algorithm and neural nets can have many local optima
3. Generalization and Transfer Learning - since neural nets have many parameters, it's often easy to overfit training data – how do you guarantee that they also have low loss on testing data you haven't seen before?