Demystifying Deep Learning
AGENDA
Quick refresher on Gradient Descent and Probabilistic Perspectives
Differentiation Methods and Autodiff
Computational Graphs
Multi-layer Perceptrons – “Traditional Way”
Deep Networks – “Computational Graph” Architecture
Publicly available Jupyter Notebook – Build your own TensorFlow!
SIMPLE REGRESSION
Hypothesis Function:
Cost Function:
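For reference, the standard linear-regression forms of these two functions are:

    h_θ(x) = θ₀ + θ₁x₁ + … + θₙxₙ = θᵀx
    J(θ) = (1/2) Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )²

where m is the number of training examples.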
GRADIENT DESCENT
Model training:
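Training means repeatedly nudging θ downhill on the cost surface. The batch gradient descent update, with learning rate α, is:

    θ_j := θ_j − α · ∂J(θ)/∂θ_j = θ_j − α · Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) · x_j^(i)

applied to every parameter θ_j and repeated until convergence.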
PROBABILISTIC INTERPRETATION
Let us assume that the target variables and the inputs are related via the equation

    y^(i) = θᵀx^(i) + ε^(i),

where the error term ε^(i) captures either unmodeled effects or random noise. Let us further
assume that the error terms ε^(i) are distributed IID according to a Gaussian distribution
with mean zero and some variance σ², i.e. ε^(i) ~ N(0, σ²).
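Equivalently, the density of each target given its input is Gaussian:

    p( y^(i) | x^(i); θ ) = (1/√(2πσ²)) · exp( −( y^(i) − θᵀx^(i) )² / (2σ²) )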
PROBABILISTIC INTERPRETATION
The probability of the data is given by p( y⃗ | X; θ ). This quantity is typically viewed as a
function of y⃗ (and perhaps X), for a fixed value of θ. When we wish to explicitly view it as a
function of θ, we will instead call it the likelihood function:

    L(θ) = p( y⃗ | X; θ )

The principle of maximum likelihood says that we should choose θ so as to make the data as
probable as possible, i.e., we should choose θ to maximize L(θ).
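Under the IID assumption the likelihood factors over the m training examples:

    L(θ) = Π_{i=1..m} p( y^(i) | x^(i); θ )
         = Π_{i=1..m} (1/√(2πσ²)) · exp( −( y^(i) − θᵀx^(i) )² / (2σ²) )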
PROBABILISTIC INTERPRETATION
The derivation is simpler if we instead maximize the log likelihood ℓ(θ) = log L(θ):
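    ℓ(θ) = Σ_{i=1..m} log [ (1/√(2πσ²)) · exp( −( y^(i) − θᵀx^(i) )² / (2σ²) ) ]
         = m · log(1/√(2πσ²)) − (1/σ²) · (1/2) Σ_{i=1..m} ( y^(i) − θᵀx^(i) )²

The first term does not depend on θ, so only the squared-error term matters when we maximize over θ.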
Hence, maximizing ℓ(θ) gives the same answer as minimizing

    (1/2) Σ_{i=1..m} ( y^(i) − θᵀx^(i) )²,

which is exactly our least-squares cost J(θ). Least-squares regression therefore corresponds
to finding the maximum likelihood estimate of θ under the Gaussian noise assumption.
DIFFERENTIATION METHODS - AUTODIFF
In mathematics and computer algebra, automatic differentiation (AD), also called
algorithmic differentiation or computational differentiation, is a set of techniques
to numerically evaluate the derivative of a function specified by a computer
program.
Backpropagation refers to the whole process of training an artificial neural network
using multiple backpropagation steps, each of which computes gradients and uses
them to perform a Gradient Descent step. In contrast, autodiff is simply a
technique for computing gradients efficiently, and it happens to be the one used by
backpropagation.
TensorFlow uses automatic differentiation, and more specifically reverse-mode automatic
differentiation.
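As a quick illustration, here is a minimal sketch using the TensorFlow 2.x eager API (the toy function is an illustrative assumption, not from the original slides):

    import tensorflow as tf

    x = tf.Variable(3.0)

    # Operations are recorded on the "tape" so TensorFlow can replay them
    # backwards (reverse mode) to obtain gradients.
    with tf.GradientTape() as tape:
        y = x * x + 2.0 * x        # y = x^2 + 2x

    dy_dx = tape.gradient(y, x)    # analytically 2x + 2 = 8 at x = 3
    print(dy_dx.numpy())           # -> 8.0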
NUMERICAL DIFFERENTIATION
The simplest solution is to compute an approximation of the derivatives, numerically.
Recall the following derivative equations:
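The relevant definition is the limit definition of the derivative together with its finite-difference approximation for a small step h:

    f′(x) = lim_{h→0} ( f(x + h) − f(x) ) / h  ≈  ( f(x + h) − f(x) ) / h      for small h > 0

and, for a function of several variables, e.g. ∂f/∂x ≈ ( f(x + h, y) − f(x, y) ) / h.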
NUMERICAL DIFFERENTIATION
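A minimal sketch of numerical differentiation in Python (the test function is an illustrative assumption):

    def numerical_derivative(f, x, h=1e-6):
        """Central-difference approximation of df/dx."""
        return (f(x + h) - f(x - h)) / (2 * h)

    f = lambda x: x**2 + 2*x                 # exact derivative: 2x + 2
    print(numerical_derivative(f, 3.0))      # ~8.0

This is trivial to implement but only approximate, and it costs extra function evaluations for every parameter we differentiate with respect to, which is what makes autodiff attractive for large models.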
COMPUTATIONAL GRAPHS
A computational graph is a directed graph where the nodes correspond to
operations or variables. Variables can feed their value into operations, and
operations can feed their output into other operations. This way, every node in the
graph defines a function of the variables.
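A minimal sketch of this idea in Python (class names are illustrative; the deepideas tutorial linked at the end develops a fuller version):

    class Variable:
        """A leaf node that simply holds a value."""
        def __init__(self, value):
            self.value = value
        def evaluate(self):
            return self.value

    class Add:
        """An operation node: its value is a function of its input nodes."""
        def __init__(self, x, y):
            self.x, self.y = x, y
        def evaluate(self):
            return self.x.evaluate() + self.y.evaluate()

    class Multiply:
        def __init__(self, x, y):
            self.x, self.y = x, y
        def evaluate(self):
            return self.x.evaluate() * self.y.evaluate()

    # Every node defines a function of the variables, e.g. f = (x + y) * y
    x, y = Variable(2), Variable(3)
    f = Multiply(Add(x, y), y)
    print(f.evaluate())    # 15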
COMPUTATIONAL GRAPHS AND DERIVATIVES
Consider the following computational graph, built from the expression e = (a + b) * (b + 1) with intermediate nodes c = a + b and d = b + 1:
We can evaluate the expression by setting the input variables to certain values and
computing nodes up through the graph. For example, let’s set a=2 and b=1:
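Tracing the values forward through the graph, in plain Python:

    # Forward pass: evaluate each node from the inputs upward
    a, b = 2, 1
    c = a + b     # c = 3
    d = b + 1     # d = 2
    e = c * d     # e = 6
    print(e)      # 6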
COMPUTATIONAL GRAPHS AND DERIVATIVES
To understand derivatives in a computational graph, the key is to understand
derivatives on the edges. If a directly affects c, we want to know how it affects c: if a
changes a little bit, by what factor does c change? This is the partial derivative of c with respect to a.
To evaluate the partial derivatives in this graph, we need the sum rule and the product rule:
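For the addition and multiplication nodes in this graph, these rules give:

    ∂(a + b)/∂a = ∂a/∂a + ∂b/∂a = 1          (sum rule; b does not depend on a)
    ∂(u · v)/∂u = u · ∂v/∂u + v · ∂u/∂u = v  (product rule; v does not depend on u)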
COMPUTATIONAL GRAPHS AND DERIVATIVES
Below, the graph has the derivative on each edge labelled.
What if we want to understand how nodes that aren't directly connected affect each
other? Let's consider how e is affected by a. If we change a at a speed of 1, c also
changes at a speed of 1. In turn, c changing at a speed of 1 causes e to change at a
speed of 2. So e changes at a rate of 1 · 2 = 2 with respect to a.
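Concretely, with a = 2 and b = 1 (so c = 3 and d = 2), the per-edge derivatives follow from the sum and product rules:

    ∂c/∂a = 1,  ∂c/∂b = 1,  ∂d/∂b = 1,  ∂e/∂c = d = 2,  ∂e/∂d = c = 3

and chaining along the path a → c → e gives ∂e/∂a = (∂e/∂c) · (∂c/∂a) = 2 · 1 = 2.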
COMPUTATIONAL GRAPHS AND DERIVATIVES
The general rule is to sum over all possible paths from one node to the other,
multiplying together the derivatives on each edge of the path. For example, the
derivative of e with respect to b is obtained by summing over the two paths
b → c → e and b → d → e:
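    ∂e/∂b = (∂e/∂c) · (∂c/∂b) + (∂e/∂d) · (∂d/∂b) = 2 · 1 + 3 · 1 = 5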
FACTORING PATHS
The problem with just “summing over the paths” is that it’s very easy to get a
combinatorial explosion in the number of possible paths.
Factoring:
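One standard way to illustrate factoring (the concrete edge labels below are an illustrative assumption): suppose X feeds a node Y through three edges with derivatives α, β, γ, and Y feeds Z through three edges with derivatives δ, ε, ζ. Naively summing over paths from X to Z adds nine terms, while factoring groups the paths at Y:

    ∂Z/∂X = αδ + αε + αζ + βδ + βε + βζ + γδ + γε + γζ = (α + β + γ)(δ + ε + ζ)

Forward-mode and reverse-mode differentiation are algorithms for computing this factored sum efficiently.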
FORWARD AND REVERSE MODE DIFFERENTIATION
FORWARD MODE DIFFERENTIATION
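Forward-mode differentiation starts at one input and applies the operator ∂/∂b to every node as we move forward through the graph. For the example graph this yields:

    ∂b/∂b = 1,  ∂c/∂b = 1,  ∂d/∂b = 1,  ∂e/∂b = 2 · 1 + 3 · 1 = 5

One forward sweep gives the derivative of every node with respect to a single chosen input.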
REVERSE MODE DIFFERENTIATION
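Reverse-mode differentiation instead starts at an output and applies ∂e/∂ to every node moving backwards through the graph, so a single backward sweep yields the derivative of e with respect to every input. That is exactly the shape of neural network training: one loss, many parameters. A minimal sketch of scalar reverse-mode autodiff in Python (class and method names are illustrative, not TensorFlow's internals), applied to the example graph:

    class Var:
        """A scalar node in the computational graph."""
        def __init__(self, value, parents=()):
            self.value = value
            self.parents = parents   # pairs of (parent_node, local_gradient)
            self.grad = 0.0

        def __add__(self, other):
            other = other if isinstance(other, Var) else Var(other)
            # sum rule: d(x + y)/dx = 1, d(x + y)/dy = 1
            return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

        def __mul__(self, other):
            other = other if isinstance(other, Var) else Var(other)
            # product rule: d(x * y)/dx = y, d(x * y)/dy = x
            return Var(self.value * other.value,
                       [(self, other.value), (other, self.value)])

        def backward(self, upstream=1.0):
            # Push the upstream derivative back along each edge, summing the
            # contributions over all paths (chain rule / sum over paths).
            self.grad += upstream
            for parent, local_grad in self.parents:
                parent.backward(upstream * local_grad)

    a, b = Var(2.0), Var(1.0)
    c = a + b              # c = 3
    d = b + 1              # d = 2
    e = c * d              # e = 6
    e.backward()           # one backward sweep computes all gradients
    print(e.value, a.grad, b.grad)   # 6.0 2.0 5.0

The recursive backward() revisits a shared node once per path, which is fine for this tiny graph; a production implementation would process nodes in reverse topological order so each is visited exactly once.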
http://www.deepideas.net/deep-learning-from-scratch-i-computational-graphs/