Unit 3: AIML


Historical Trends in Deep Learning

• Deep learning has had a long and rich history, but has gone by many names reflecting different
philosophical viewpoints, and has waxed and waned in popularity.

• Deep learning has become more useful as the amount of available training data has increased.

• Deep learning models have grown in size over time as computer hardware and software
infrastructure for deep learning has improved.

• Deep learning has solved increasingly complicated applications with increasing accuracy over time.

Deep learning was known as cybernetics in the 1940s–1960s and as connectionism in the 1980s–1990s; the current resurgence under the name deep learning began in 2006.

The first wave started with cybernetics in the 1940s–1960s, with the development of theories of
biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of the first
models such as the perceptron (Rosenblatt, 1958) allowing the training of a single neuron.

The second wave started with the connectionist approach of the 1980–1995 period, with back-
propagation (Rumelhart et al., 1986a) to train a neural network with one or two hidden layers.

The current and third wave, deep learning, started around 2006 (Hinton et al., 2006; Bengio et
al., 2007; Ranzato et al., 2007a).

Back-Propagation and Other Differentiation Algorithms

The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows
information from the cost to flow backward through the network in order to compute the gradient.

Strictly speaking, back-propagation refers only to the method for computing the gradient; another
algorithm, such as stochastic gradient descent, uses this gradient to perform learning.

We will describe how to compute the gradient ∇x f(x, y) for an arbitrary function f, where x is a
set of variables whose derivatives are desired, and y is an additional set of variables that are
inputs to the function but whose derivatives are not required.

In learning algorithms, the gradient we most often require is the gradient of the cost function
with respect to the parameters, ∇θ J(θ).
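To make this concrete, here is a small sketch (my own illustration, not from the text) that computes ∇θ J(θ) for a mean-squared-error cost of a linear model; the arrays X, y and the parameter vector theta are made-up placeholders.

```python
import numpy as np

# Hypothetical data and parameters, used only for illustration.
X = np.random.randn(100, 3)   # minibatch of inputs
y = np.random.randn(100)      # targets
theta = np.zeros(3)           # model parameters

def J(theta):
    """Mean-squared-error cost of the linear model X @ theta."""
    return np.mean((X @ theta - y) ** 2)

def grad_J(theta):
    """Analytic gradient of J with respect to theta: (2/m) X^T (X theta - y)."""
    m = X.shape[0]
    return (2.0 / m) * X.T @ (X @ theta - y)

# A learning algorithm such as gradient descent then uses this gradient.
theta = theta - 0.1 * grad_J(theta)
```

This separation mirrors the point above: grad_J only computes the gradient, while the update line is the learning algorithm that uses it.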
Computational Graphs

To describe the back-propagation algorithm more precisely, it is helpful to have a more precise
computational graph language.

We use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix,
tensor, or even a variable of another type.

Examples of computational graphs:

(a) The graph using the × operation to compute z = xy.

(b) The graph for the logistic regression prediction ŷ = σ(xᵀw + b). Some of the intermediate
expressions do not have names in the algebraic expression but need names in the graph. We simply
name the i-th such variable u(i).

(c) The computational graph for the expression H = max{0, XW + b}, which computes a design
matrix of rectified linear unit activations H given a design matrix containing a minibatch of
inputs X.
(d) Examples (a)–(c) applied at most one operation to each variable, but it is possible to apply
more than one operation. Here we show a computational graph that applies more than one operation
to the weights w of a linear regression model. The weights are used to make both the prediction ŷ
and the weight decay penalty λ Σᵢ wᵢ².
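As a rough sketch of the idea (my own illustration; the Node class and multiply helper below are hypothetical, not from the text), the graph of example (a) can be represented by nodes that record their value, the operation that produced them, and their parent nodes:

```python
class Node:
    """A variable in a computational graph: its value, the operation that
    produced it, and the parent nodes feeding that operation."""
    def __init__(self, value=None, op=None, inputs=()):
        self.value = value
        self.op = op          # None for leaf variables such as x and y
        self.inputs = inputs  # parent Nodes

def multiply(a, b):
    """Apply the multiplication operation, creating a new node in the graph."""
    return Node(value=a.value * b.value, op="multiply", inputs=(a, b))

# Graph (a): z = x * y
x = Node(value=2.0)
y = Node(value=3.0)
z = multiply(x, y)

print(z.value)                              # 6.0
print(z.op, [p.value for p in z.inputs])    # multiply [2.0, 3.0]
```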

Chain Rule of Calculus

Back-propagation is an algorithm that computes the chain rule, with a specific order of
operations that is highly efficient.

Let x be a real number, and let f and g both be functions mapping from a real number to a real
number. Suppose that y = g(x) and z = f (g(x)) = f (y).

Then the chain rule states that dz/dx = (dz/dy) · (dy/dx).
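As a quick numeric check of this rule (my own example, with f = sin and g(x) = x² chosen arbitrarily), the chain-rule value cos(g(x)) · 2x agrees with a finite-difference estimate:

```python
import math

def g(x): return x ** 2
def f(y): return math.sin(y)

x = 1.3
y = g(x)

# Chain rule: dz/dx = (dz/dy) * (dy/dx) = cos(y) * 2x
dz_dx_chain = math.cos(y) * 2 * x

# Finite-difference estimate of dz/dx for comparison
eps = 1e-6
dz_dx_numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(dz_dx_chain, dz_dx_numeric)  # the two values agree closely
```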


The back-propagation algorithm is very simple. To compute the gradient of some scalar z with
respect to one of its ancestors x in the graph, we begin by observing that the gradient with
respect to z itself is given by dz/dz = 1.

We can then compute the gradient with respect to each parent of z in the graph by multiplying
the current gradient by the Jacobian of the operation that produced z.

We continue multiplying by Jacobians traveling backwards through the graph in this way until
we reach x.
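A small hand-worked sketch of this backward sweep (my own illustration; the network z = sum(relu(Wx)) is an arbitrary example, not from the text): the gradient starts at dz/dz = 1 and is multiplied by the Jacobian of each operation while travelling back to x.

```python
import numpy as np

W = np.array([[1.0, -2.0],
              [0.5,  3.0]])
x = np.array([1.0, 2.0])

# Forward pass through the graph
y = W @ x                # linear transformation
h = np.maximum(y, 0.0)   # rectified linear units
z = h.sum()              # scalar output

# Backward pass: begin with dz/dz = 1, then multiply by each Jacobian in turn.
grad_z = 1.0
grad_h = grad_z * np.ones_like(h)   # Jacobian of sum is a row of ones
grad_y = grad_h * (y > 0)           # Jacobian of relu is diagonal with 0/1 entries
grad_x = W.T @ grad_y               # Jacobian of y = Wx with respect to x is W

print(grad_x)  # gradient of z with respect to x
```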

More formally, each node in the graph G corresponds to a variable. To achieve maximum
generality, we describe this variable as being a tensor V.

We assume that each variable V is associated with the following subroutines:

• get_operation(V): This returns the operation that computes V, represented by the edges coming
into V in the computational graph. For example, there may be a Python or C++ class representing
the matrix multiplication operation. Suppose we have a variable that is created by matrix
multiplication, C = AB. Then get_operation(C) returns a pointer to an instance of the
corresponding class.

• get_consumers(V, G): This returns the list of variables that are children of V in the
computational graph G.

• get_inputs(V, G): This returns the list of variables that are parents of V in the computational
graph G (a small sketch of these three subroutines follows this list).
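A minimal sketch of how these three subroutines could look on a toy graph structure (the Variable class and the list-based graph G below are my own placeholders, not a particular library's API):

```python
class Variable:
    """A node in the computational graph."""
    def __init__(self, name, op=None, inputs=()):
        self.name = name
        self.op = op                # operation that computed this variable (None for leaves)
        self.inputs = list(inputs)  # parent Variables

def get_operation(V):
    """Return the operation that computes V (the edges coming into V)."""
    return V.op

def get_inputs(V, G):
    """Return the list of parents of V in graph G."""
    return V.inputs

def get_consumers(V, G):
    """Return the list of children of V in graph G, i.e. variables that use V as an input."""
    return [node for node in G if V in node.inputs]

# Example: C = AB created by a matrix multiplication operation.
A = Variable("A")
B = Variable("B")
C = Variable("C", op="matmul", inputs=(A, B))
G = [A, B, C]

print(get_operation(C))                        # 'matmul'
print([v.name for v in get_inputs(C, G)])      # ['A', 'B']
print([v.name for v in get_consumers(A, G)])   # ['C']
```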

For example, we might use a matrix multiplication operation to create a variable C = AB.

Suppose that the gradient of a scalar z with respect to C is given by G.

The matrix multiplication operation is responsible for defining two back-propagation rules, one
for each of its input arguments.

If we call the bprop method to request the gradient with respect to A, given that the gradient on
the output is G, then the bprop method of the matrix multiplication operation must state that the
gradient with respect to A is given by GBᵀ.

Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix
multiplication operation is responsible for implementing the bprop method and specifying that the
desired gradient is given by AᵀG.
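These two rules can be written directly in NumPy; the short sketch below (my own illustration) implements them exactly as stated above, using G for the gradient on the output C = AB:

```python
import numpy as np

def matmul_bprop_wrt_A(A, B, G):
    """Gradient of a scalar z with respect to A, given G = dz/dC for C = A @ B: G @ B.T"""
    return G @ B.T

def matmul_bprop_wrt_B(A, B, G):
    """Gradient of a scalar z with respect to B, given G = dz/dC: A.T @ G"""
    return A.T @ G

A = np.random.randn(2, 3)
B = np.random.randn(3, 4)
# Take z = sum(C), so the gradient on the output C is a matrix of ones.
G = np.ones((2, 4))

print(matmul_bprop_wrt_A(A, B, G))  # gradient of sum(A @ B) with respect to A
print(matmul_bprop_wrt_B(A, B, G))  # gradient of sum(A @ B) with respect to B
```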

The back-propagation algorithm itself does not need to know any differentiation rules. It only
needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs,
X, G) must return Σᵢ (∇X op.f(inputs)ᵢ) Gᵢ, which is just an implementation of the chain rule,
where i runs over the entries of the operation's output and Gᵢ is the gradient on the i-th output
entry.

Computing a gradient in a graph with n nodes will never execute more than O(n²) operations or
store the output of more than O(n²) operations.

The back-propagation algorithm adds one Jacobian-vector product, which should be expressed
with O(1) nodes, per edge in the original graph. Because the computational graph is a directed
acyclic graph, it has at most O(n²) edges.

Most neural network cost functions are roughly chain-structured, causing back-propagation to
have O(n) cost.
