CS 188 Introduction To Artificial Intelligence Fall 2017 Note 10 Neural Networks: Motivation
We would like to separate the two colors, and clearly there is no way this can be done in a single dimension
(a single dimensional decision boundary would be a point, separating the axis into two regions).
To fix this problem, we can add additional (potentially nonlinear) features from which to construct a decision boundary. Consider the same dataset with the addition of $x^2$ as a feature:
With this additional piece of information, we are now able to construct a linear separator in the two di-
mensional space containing the points. In this case, we were able to fix the problem by mapping our data
to a higher dimensional space by manually adding useful features to datapoints. However, in many high-
dimensional problems, such as image classification, manually selecting features that are useful is a tedious
problem. This requires domain-specific effort and expertise, and works against the goal of generalization
across tasks. A natural desire is to learn these featurization or transformation functions as well, perhaps
using a nonlinear function class that is capable of representing a wider variety of functions.
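To make this concrete, here is a minimal sketch (the datapoints, labels, and separator weights are hypothetical, chosen only for illustration) of mapping a 1D dataset to 2D with the added $x^2$ feature:

```python
import numpy as np

# Hypothetical 1D dataset: points near the origin are one class, points
# far from it are the other -- not separable by any single threshold on x.
x = np.array([-2.0, -1.5, -0.5, 0.5, 1.5, 2.0])
y = np.array([1, 1, -1, -1, 1, 1])

# Map each point to the higher-dimensional feature vector (x, x^2).
features = np.stack([x, x ** 2], axis=1)

# In the new space, the linear separator x^2 = 1.0 classifies every
# point correctly: w = (0, 1), bias = -1.0.
w, b = np.array([0.0, 1.0]), -1.0
predictions = np.sign(features @ w + b)
print(predictions)               # [ 1.  1. -1. -1.  1.  1.]
print((predictions == y).all())  # True
```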
Multi-layer Perceptron
Let’s examine how we can derive a more complex function from our original perceptron architecture. Consider the following setup: a two-layer perceptron, which is a perceptron that takes as input the outputs of another perceptron.
With this additional structure and weights, we can express a much wider set of functions.
By increasing the complexity of our model, we in turn greatly increase its expressive power. Multi-layer perceptrons give us a generic way to represent a much wider set of functions. In fact, a multi-layer perceptron is a universal function approximator: it can approximate any continuous real-valued function to arbitrary accuracy, leaving us only with the problem of selecting the best set of weights to parameterize our network. This is formally stated below:
Theorem. (Universal Function Approximators) A two-layer neural network with a sufficient number
of neurons can approximate any continuous real-valued function to any desired accuracy.
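For a concrete taste of this added expressiveness, here is a minimal sketch of a two-layer perceptron computing XOR, a function no single perceptron can represent; the weights are hand-picked for illustration:

```python
import numpy as np

def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return (z > 0).astype(float)

def two_layer_perceptron(x):
    # Hidden layer: one unit fires on OR(x1, x2), the other on AND(x1, x2).
    W1 = np.array([[1.0, 1.0],    # OR unit
                   [1.0, 1.0]])   # AND unit
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output layer: OR minus AND gives XOR.
    w2 = np.array([1.0, -1.0])
    return step(w2 @ h - 0.5)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", two_layer_perceptron(np.array(x, dtype=float)))
# [0, 0] -> 0.0, [0, 1] -> 1.0, [1, 0] -> 1.0, [1, 1] -> 0.0
```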
Measuring Accuracy
The accuracy of the binary perceptron after making $m$ predictions can be expressed as:

$$l_{acc}(w) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\left(\operatorname{sgn}(w \cdot f(x^{(i)})) == y^{(i)}\right)$$

where $x^{(i)}$ is datapoint $i$, $w$ is our weight vector, $f$ is our function that derives a feature vector from a raw datapoint, and $y^{(i)}$ is the actual class label of $x^{(i)}$. In this context, $\operatorname{sgn}(x)$ represents the sign function, which outputs $+1$ when its argument is positive and $-1$ when it is negative, and $\mathbf{1}$ is the indicator function, which evaluates to $1$ when its condition is true and $0$ otherwise.
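This expression translates directly to code; here is a minimal sketch (the feature vectors, labels, and weights below are hypothetical, chosen only for illustration):

```python
import numpy as np

def accuracy(w, features, labels):
    """Fraction of points where sgn(w . f(x)) matches the +1/-1 label."""
    predictions = np.sign(features @ w)
    return np.mean(predictions == labels)

# Hypothetical feature vectors (one row per datapoint) and +1/-1 labels.
features = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
labels = np.array([1, 1, -1])
w = np.array([1.0, 0.5])
print(accuracy(w, features, labels))  # 1.0
```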
Given a vector $z$ that is output by our function $f$, softmax performs normalization to output a probability distribution over the classes:

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

To come up with a general loss function for our models, we can use this probability distribution to generate an expression for the likelihood of a set of weights:
$$\ell(w) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; w)$$
This expression denotes the likelihood of a particular set of weights explaining the observed labels and
datapoints. We would like to find the set of weights that maximizes this quantity. This is identical to finding
the maximum of the log-likelihood expression (since log is an increasing function, the maximizer of one
will be the maximizer of the other):
$$\ell\ell(w) = \log \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; w) = \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; w)$$
(Depending on the application, the formulation as a sum of log probabilities may be more useful – for
example in mini-batched or stochastic gradient descent; see the Neural Networks: Optimization section
below.) In the case where the log likelihood is differentiable with respect to the weights, we will discuss a
simple algorithm to optimize it.
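As a minimal sketch (the class scores and labels below are hypothetical), the softmax and log-likelihood computations look like this in code:

```python
import numpy as np

def softmax(z):
    """Normalize a score vector into a probability distribution."""
    exp = np.exp(z - z.max())  # subtract max for numerical stability
    return exp / exp.sum()

def log_likelihood(scores, labels):
    """Sum of log P(y_i | x_i; w), given per-datapoint class scores."""
    return sum(np.log(softmax(z)[y]) for z, y in zip(scores, labels))

# Hypothetical scores for 2 datapoints over 3 classes, with true labels.
scores = [np.array([2.0, 0.5, -1.0]), np.array([0.1, 1.2, 0.3])]
labels = [0, 1]
print(log_likelihood(scores, labels))
```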
Recall that the perceptron makes its predictions by passing the score $w \cdot f(x)$ through the step function $\operatorname{sgn}(x)$, graphed below:

[Figure: plot of the step function, which jumps from $-1$ to $1$ at $x = 0$]
This is difficult to optimize for a number of reasons which will hopefully become clearer when we address gradient descent. Firstly, it is not continuous, and secondly, it has a derivative of zero at every point where it is differentiable. Intuitively, this means that we cannot know in which direction to look for a local minimum of the function, which makes it difficult to minimize loss in a smooth way.
Instead of using a step function like the one above, a better solution is to select a continuous function. We have many options for such a function, including the sigmoid function (named for the Greek σ, or ’s’, as it looks like an ’s’) as well as the rectified linear unit (ReLU). Let’s look at their definitions and graphs below:
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$

[Figure: plot of the sigmoid function, rising smoothly from 0 to 1 over $x \in [-10, 10]$]
ReLU: $f(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}$

[Figure: plot of the ReLU function, flat at 0 for negative inputs and linear for nonnegative inputs]
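Both activation functions are one-liners in code; a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    """Squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Zero for negative inputs, identity for nonnegative inputs."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))  # smooth values strictly between 0 and 1
print(relu(x))     # [0.  0.  0.  0.5 2. ]
```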
Calculating the output of a multi-layer perceptron is done as before, with the difference that at the output of each layer we now apply one of our new nonlinearities (chosen as part of the architecture for the neural network) instead of the initial indicator function. In practice, the choice of nonlinearity is a design decision that typically requires some experimentation to find a good fit for each individual use case.
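For instance, here is a minimal sketch of a forward pass through a two-layer network with ReLU activations (the layer sizes and random weights are illustrative, not from the note):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: 3 input features, 4 hidden units, 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def relu(x):
    return np.maximum(0.0, x)

def forward(x):
    h = relu(W1 @ x + b1)  # nonlinearity applied at the hidden layer
    return W2 @ h + b2     # raw output scores (e.g. fed to softmax)

print(forward(np.array([1.0, -0.5, 2.0])))
```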
To maximize our log-likelihood function, we differentiate it to obtain a gradient vector consisting of its partial derivatives with respect to each parameter:

$$\nabla_w \ell\ell(w) = \left( \frac{\partial \ell\ell(w)}{\partial w_1}, \ldots, \frac{\partial \ell\ell(w)}{\partial w_n} \right)$$
This gradient vector gives the local direction of steepest ascent (or descent if we reverse the vector). Gradi-
ent ascent is a greedy algorithm that calculates this gradient for the current values of the weight parameters,
then updates the parameters along the direction of the gradient, scaled by a step size, α. Specifically the
algorithm looks as follows:
Initialize weights $w$
For $i = 0, 1, 2, \ldots$:
    $w \leftarrow w + \alpha \nabla_w \ell\ell(w)$

If rather than maximizing we instead wanted to minimize a function $f$, the update should subtract the scaled gradient ($w \leftarrow w - \alpha \nabla_w f(w)$) – this gives the gradient descent algorithm.
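Here is a minimal sketch of gradient ascent on a toy concave objective (the objective, step size, and iteration count are illustrative choices):

```python
import numpy as np

# Toy concave objective f(w) = -(w1 - 3)^2 - (w2 + 1)^2, maximized at (3, -1).
def grad_f(w):
    """Gradient of the toy objective at w."""
    return np.array([-2.0 * (w[0] - 3.0), -2.0 * (w[1] + 1.0)])

w = np.zeros(2)  # initialize weights
alpha = 0.1      # step size
for _ in range(100):
    w = w + alpha * grad_f(w)  # step along the gradient (ascent)

print(w)  # close to [ 3. -1.]
```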
To compute these gradients efficiently through the nested functions that make up a neural network, we use the chain rule. For a function $f(x_1, \ldots, x_n)$ whose inputs each depend on a variable $t_i$, the chain rule states:

$$\frac{\partial f}{\partial t_i} = \frac{\partial f}{\partial x_1} \cdot \frac{\partial x_1}{\partial t_i} + \frac{\partial f}{\partial x_2} \cdot \frac{\partial x_2}{\partial t_i} + \ldots + \frac{\partial f}{\partial x_n} \cdot \frac{\partial x_n}{\partial t_i}$$
In the context of computation graphs, this means that to compute the gradient of the output $z$ with respect to a given node $t_i$, we take a sum of $|\text{children}(t_i)|$ terms, one for each child of $t_i$ in the graph.
[Figure 1: computation graph for $(x + y) * z$ with $x = 2$, $y = 3$, $z = 4$; forward values (green) flow left to right, gradients (red) flow right to left]
Figure 1 shows an example computation graph for computing (x + y) ∗ z with the values x = 2, y = 3, z = 4.
We will write g = x + y and f = g ∗ z. Values in green are the outputs of each node, which we compute in
the forward pass, where we apply each node’s operation to its input values coming from its parent nodes.
Values in red after each node give gradients of the function computed by the graph, which are computed in the backward pass: the value after each node is the partial derivative of the final node's value $f$ with respect to the variable at that node. For example, the red value 4 after $g$ is $\frac{\partial f}{\partial g}$, and the red value 4 after $x$ is $\frac{\partial f}{\partial x}$. In our simple example, $f$ is just a multiplication node which outputs the product of its two input operands, but in a real neural network the final node will usually compute the loss value that we are trying to minimize. The backward pass computes gradients by starting at the final node (which has a gradient of 1, since $\frac{\partial f}{\partial f} = 1$) and passing and updating gradients backward through the graph. Intuitively, each node's gradient measures how much a small change at that node affects the final output:
1. Since $f$ is our final node, it has gradient $\frac{\partial f}{\partial f} = 1$. Then we compute the gradients for its children, $g$ and $z$. We have $\frac{\partial f}{\partial g} = \frac{\partial}{\partial g}(g \cdot z) = z = 4$, and $\frac{\partial f}{\partial z} = \frac{\partial}{\partial z}(g \cdot z) = g = 5$.
2. Now we can move on upstream to compute the gradients of $x$ and $y$. For these, we’ll use the chain rule and reuse the gradient we just computed for $g$, $\frac{\partial f}{\partial g}$.
3. For $x$, we have $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}$ by the chain rule – the product of the gradient coming from $g$ with the partial derivative for $x$ at this node. We have $\frac{\partial g}{\partial x} = \frac{\partial}{\partial x}(x + y) = \frac{\partial}{\partial x}x + \frac{\partial}{\partial x}y = 1 + 0 = 1$, so $\frac{\partial f}{\partial x} = 4 \cdot 1 = 4$. Intuitively, the amount that a change in $x$ contributes to a change in $f$ is the product of the amount that a change in $g$ contributes to a change in $f$ with the amount that a change in $x$ contributes to one in $g$.
4. The process for computing the gradient of the output with respect to $y$ is almost identical. For $y$ we have $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial y}$ by the chain rule – the product of the gradient coming from $g$ with the partial derivative for $y$ at this node. We have $\frac{\partial g}{\partial y} = \frac{\partial}{\partial y}(x + y) = \frac{\partial}{\partial y}x + \frac{\partial}{\partial y}y = 0 + 1 = 1$, so $\frac{\partial f}{\partial y} = 4 \cdot 1 = 4$.
Since the backward pass step for a node in general depends on the node’s inputs (which are computed in the
forward pass), and gradients computed “downstream” of the current node by the node’s children (computed
earlier in the backward pass), we cache all of these values in the graph for efficiency. Taken together, the
forward and backward pass over the graph make up the backpropagation algorithm.
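As a minimal sketch, the forward and backward passes for this exact graph can be written out by hand (variable names mirror the figure):

```python
# Forward pass: compute and cache each node's output.
x, y, z = 2.0, 3.0, 4.0
g = x + y  # g = 5
f = g * z  # f = 20

# Backward pass: start from df/df = 1 and move backward.
df_df = 1.0
df_dg = z * df_df    # multiplication node: d(g*z)/dg = z -> 4
df_dz = g * df_df    # d(g*z)/dz = g -> 5
df_dx = df_dg * 1.0  # addition node: dg/dx = 1 -> 4
df_dy = df_dg * 1.0  # dg/dy = 1 -> 4

print(f, df_dx, df_dy, df_dz)  # 20.0 4.0 4.0 5.0
```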
[Figure 2: computation graph for $((x + y) + (x \cdot y)) \cdot z$ with $x = 2$, $y = 3$, $z = 4$; forward values (green) flow left to right, gradients (red) flow right to left]
For an example of applying the chain rule for a node with multiple children, consider the graph in Figure 2, representing $((x + y) + (x \cdot y)) \cdot z$, with $x = 2$, $y = 3$, $z = 4$. $x$ and $y$ are each used in two operations, and so each has two children. By the chain rule, their gradient values are the sum of the gradients computed for them by each of their children.
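A minimal sketch of this gradient accumulation, again written out by hand for the graph in Figure 2:

```python
# Forward pass for ((x + y) + (x * y)) * z.
x, y, z = 2.0, 3.0, 4.0
g = x + y  # g = 5
i = x * y  # i = 6
h = g + i  # h = 11
f = h * z  # f = 44

# Backward pass.
df_dh = z            # d(h*z)/dh -> 4
df_dz = h            # d(h*z)/dz -> 11
df_dg = df_dh * 1.0  # dh/dg = 1 -> 4
df_di = df_dh * 1.0  # dh/di = 1 -> 4
# x and y each have two children (g and i), so their gradients are sums.
df_dx = df_dg * 1.0 + df_di * y  # 4*1 + 4*3 = 16
df_dy = df_dg * 1.0 + df_di * x  # 4*1 + 4*2 = 12

print(f, df_dx, df_dy, df_dz)  # 44.0 16.0 12.0 11.0
```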
Conclusion
In this note, we generalized the perceptron to neural networks, models which are powerful (and universal!)
function approximators but can be difficult to design and train. We talked about how the expressiveness of
neural networks comes from the activation functions they employ to introduce nonlinearities, as well as how
to optimize their parameters with backpropagation and gradient descent. Currently, a large portion of new
research in machine learning focuses on various aspects of neural network design such as:
1. Network Architectures - designing a network (choosing activation functions, number of layers, etc.)
that’s a good fit for a particular problem
2. Learning Algorithms - how to find parameters that achieve a low value of the loss function, a difficult
problem since gradient descent is a greedy algorithm and neural nets can have many local optima
3. Generalization and Transfer Learning - since neural nets have many parameters, it's often easy to overfit training data – how do you guarantee that they also have low loss on testing data you haven't seen before?