

AD3501 DEEP LEARNING L T P C 3 0 0 3


COURSE OBJECTIVES:
· To understand the need for and principles of deep neural networks
· To understand CNN and RNN architectures of deep neural networks
· To comprehend advanced deep learning models
· To learn the evaluation metrics for deep learning models

UNIT I DEEP NETWORKS BASICS 9


Linear Algebra: Scalars -- Vectors -- Matrices and tensors; Probability Distributions –
Gradient based Optimization – Machine Learning Basics: Capacity -- Overfitting and
underfitting -- Hyperparameters and validation sets -- Estimators -- Bias and variance --
Stochastic gradient descent -- Challenges motivating deep learning; Deep Networks: Deep
feedforward networks; Regularization -- Optimization.
UNIT II CONVOLUTIONAL NEURAL NETWORKS 9
Convolution Operation -- Sparse Interactions -- Parameter Sharing -- Equivariance -- Pooling
-- Convolution Variants: Strided -- Tiled -- Transposed and dilated convolutions; CNN
Learning: Nonlinearity Functions -- Loss Functions -- Regularization -- Optimizers --Gradient
Computation.
UNIT III RECURRENT NEURAL NETWORKS 10
Unfolding Graphs -- RNN Design Patterns: Acceptor -- Encoder --Transducer; Gradient
Computation -- Sequence Modelling Conditioned on Contexts -- Bidirectional RNN --
Sequence to Sequence RNN – Deep Recurrent Networks -- Recursive Neural Networks --
Long Term Dependencies; Leaky Units: Skip connections and dropouts; Gated Architecture:
LSTM.
UNIT IV MODEL EVALUATION 8
Performance metrics -- Baseline Models -- Hyperparameters: Manual Hyperparameter --
Automatic Hyperparameter -- Grid search -- Random search -- Debugging strategies.
UNIT V AUTOENCODERS AND GENERATIVE MODELS 9
Autoencoders: Undercomplete autoencoders -- Regularized autoencoders -- Stochastic
encoders and decoders -- Learning with autoencoders; Deep Generative Models: Variational
autoencoders – Generative adversarial networks.
TOTAL: 45 PERIODS
COURSE OUTCOMES
CO1: Explain the basics in deep neural networks
CO2: Apply Convolution Neural Network for image processing
CO3: Apply Recurrent Neural Network and its variants for text analysis
CO4: Apply model evaluation for various applications
CO5: Apply auto encoders and generative models for suitable applications


TEXT BOOKS
1. Ian Goodfellow, Yoshua Bengio, Aaron Courville, ``Deep Learning'', MIT Press, 2016.
2. Andrew Glassner, ``Deep Learning: A Visual Approach'', No Starch Press, 2021.

REFERENCES

1. Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, Mohammed Bennamoun, ``A Guide
to Convolutional Neural Networks for Computer Vision'', Synthesis Lectures on Computer
Vision, Morgan & Claypool publishers, 2018.
2. Yoav Goldberg, ``Neural Network Methods for Natural Language Processing'', Synthesis
Lectures on Human Language Technologies, Morgan & Claypool publishers, 2017.
3. Francois Chollet, ``Deep Learning with Python'', Manning Publications Co, 2018.
4. Charu C. Aggarwal, ``Neural Networks and Deep Learning: A Textbook'', Springer
International Publishing, 2018.
5. Josh Patterson, Adam Gibson, ``Deep Learning: A Practitioner's Approach'', O'Reilly
Media, 2017.


UNIT I DEEP NETWORKS BASICS 9


Linear Algebra: Scalars -- Vectors -- Matrices and tensors; Probability Distributions –
Gradient based Optimization – Machine Learning Basics: Capacity -- Overfitting and
underfitting -- Hyperparameters and validation sets -- Estimators -- Bias and variance --
Stochastic gradient descent -- Challenges motivating deep learning; Deep Networks: Deep
feedforward networks; Regularization -- Optimization.

1. Linear Algebra
1. It is a branch of Mathematics.
2. It is widely used throughout science and engineering.
3. Because it is continuous rather than discrete, many computer scientists have
little experience with it.
4. A good understanding of Linear Algebra is essential for understanding and
working with many Machine Learning algorithms, especially Deep Learning.
1.1 Scalars, Vectors, Matrices and Tensors
Scalar: A scalar is just a single number, in contrast to most of the other objects
studied in linear algebra, which are usually arrays of multiple numbers. We write
scalars in italics. We usually give scalars lower-case variable names. When we
introduce them, we specify what kind of number they are.
For example, we might say “Let s ∈ R be the slope of the line,” while defining
a real-valued scalar, or “Let n ∈ N be the number of units,” while defining a natural
number scalar.

Vectors: A vector is an array of numbers. The numbers are arranged in order. We can
identify each individual number by its index in that ordering. Typically, we give
vectors lower case names written in bold typeface, such as x. The elements of the
vector are identified by writing its name in italic typeface, with a subscript.

Matrices: A matrix is a 2-D array of numbers, so each element is identified by two


indices instead of just one. We usually give matrices upper-case variable names with
bold typeface, such as A. If a real-valued matrix A has a height of m and a width of n,
then we say that A ∈ Rm×n. We usually identify the elements of a matrix using its
name in italic but not bold font, and the indices are listed with separating commas.

Tensors: In some cases we will need an array with more than two axes. In the
general case, an array of numbers arranged on a regular grid with a variable number


of axes is known as a tensor. We denote a tensor named “A” with this typeface: A.
We identify the element of A at coordinates (i, j, k) by writing Ai,j,k.

One important operation on matrices is the transpose. The transpose of a matrix is the
mirror image of the matrix across its main diagonal: (AT)i,j = Aj,i
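
As a quick illustration of these objects, here is a minimal numpy sketch (numpy and the
variable names are our own choices, not part of the syllabus) showing a scalar, a vector,
a matrix, a tensor, and the transpose identity above:

import numpy as np

s = 3.5                          # scalar: a single number
x = np.array([1.0, 2.0, 3.0])    # vector: a 1-D array of numbers
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # matrix: a 2-D array, A in R^(2x3)
T = np.zeros((2, 3, 4))          # tensor: an array with more than two axes

# The transpose mirrors a matrix across its main diagonal: (A^T)[i, j] = A[j, i]
assert A.T[2, 1] == A[1, 2]
print(A.shape, A.T.shape)        # (2, 3) (3, 2)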

1.2 Probability Distributions

A probability distribution is a description of how likely a random variable or set of
random variables is to take on each of its possible states. The way we describe probability
distributions depends on whether the variables are discrete or continuous.

1.2.1 Discrete Variables and Probability Mass Functions:

A probability distribution over discrete variables may be described using a
probability mass function (PMF). We typically denote probability mass functions with a
capital P. Often we associate each random variable with a different probability mass
function and the reader must infer which probability mass function to use based on the
identity of the random variable, rather than the name of the function; P(x) is usually not
the same as P(y).
The probability mass function maps from a state of a random variable to the
probability of that random variable taking on that state. The probability that x = x is
denoted as P(x), with a probability of 1 indicating that x = x is certain and a probability of
0 indicating that x = x is impossible. Sometimes to disambiguate which PMF to use, we
write the name of the random variable explicitly: P (x = x). Sometimes we define a
variable first, then use ∼ notation to specify which distribution it follows later: x ∼ P(x).
Probability mass functions can act on many variables at the same time. Such a
probability distribution over many variables is known as a joint probability distribution. P
(x = x, y = y ) denotes the probability that x = x and y = y simultaneously. We may also
write P(x, y) for brevity.
To be a probability mass function on a random variable x, a function P must satisfy
the following properties
● The domain of P must be the set of all possible states of x.
● ∀x ∈ x,0 ≤ P(x) ≤ 1. An impossible event has probability 0 and no state can be
less probable than that. Likewise, an event that is guaranteed to happen has
probability 1, and no state can have a greater chance of occurring.
● ∑x∈x P(x) = 1. We refer to this property as being normalized. Without this
property, we could obtain probabilities greater than one by computing the
probability of one of many events occurring.
For example, consider a single discrete random variable x with k different states. We
can place a uniform distribution on x—that is, make each of its states equally likely—by
setting its probability mass function to

P(x = xi) = 1/k

for all i. We can see that this fits the requirements for a probability mass function. The
value 1/k is positive because k is a positive integer. We also see that

∑i P(x = xi) = ∑i 1/k = k/k = 1,

so the distribution is properly normalized.
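
A small sketch can verify these PMF properties numerically; a fair six-sided die (our
example, with k = 6) gives a uniform PMF:

import numpy as np

k = 6                                # e.g., a fair die with k equally likely states
P = np.full(k, 1.0 / k)              # uniform PMF: P(x = x_i) = 1/k

assert np.all((0 <= P) & (P <= 1))   # every state has a valid probability
assert np.isclose(P.sum(), 1.0)      # normalization: the probabilities sum to 1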

1.2.2 Continuous Variables and Probability Density Functions:

When working with continuous random variables, we describe probability
distributions using a probability density function (PDF) rather than a probability mass
function. To be a probability density function, a function p must satisfy the following
properties:
● The domain of p must be the set of all possible states of x.
● ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
● ∫ p(x)dx = 1.

A probability density function p(x) does not give the probability of a specific
state directly, instead the probability of landing inside an infinitesimal region with
volume δx is given by p(x)δx.

We can integrate the density function to find the actual probability mass of a set
of points. Specifically, the probability that x lies in some set S is given by the integral of
p(x) over that set. In the univariate example, the probability that x lies in the interval [a,
b] is given by ∫ [a,b] p(x)dx.

For an example of a probability density function corresponding to a specific
probability density over a continuous random variable, consider a uniform distribution
on an interval of the real numbers. We can do this with a function u(x; a, b), where a
and b are the endpoints of the interval, with b > a. The “;” notation means
“parametrized by”; we consider x to be the argument of the function, while a and b are
parameters that define the function. To ensure that there is no probability mass outside
the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b−a). We
can see that this is nonnegative everywhere. Additionally, it integrates to 1. We often
denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).
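
The following sketch (with endpoints a = 0 and b = 2 chosen purely for illustration)
checks the defining properties of u(x; a, b) numerically:

import numpy as np

def u(x, a, b):
    # 1/(b - a) inside [a, b], zero outside, so no mass lies outside the interval
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

xs = np.linspace(-1.0, 3.0, 400001)
dx = xs[1] - xs[0]
density = u(xs, 0.0, 2.0)
assert np.all(density >= 0)          # nonnegative everywhere
print(np.sum(density) * dx)          # ~1.0: the density integrates to one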

1.3 Gradient-based Optimization

Most deep learning algorithms involve optimization of some sort. Optimization
refers to the task of either minimizing or maximizing some function f (x) by altering x. We
usually phrase most optimization problems in terms of minimizing f (x). Maximization may
be accomplished via a minimization algorithm by minimizing −f(x).
The function we want to minimize or maximize is called the objective function or
criterion. When we are minimizing it, we may also call it the cost function, loss function, or
error function. Here, we use these terms interchangeably, though some machine
learning publications assign special meaning to some of these terms.
We often denote the value that minimizes or maximizes a function with a superscript
∗. For example, we might say x∗ = arg min f(x).
We assume the reader is already familiar with calculus, but provide a brief review of
how calculus concepts relate to optimization here.


Suppose we have a function y = f(x), where both x and y are real numbers. The
derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x) gives the
slope of f(x) at the point x. In other words, it specifies how to scale a small change in the
input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + εf′(x).
The derivative is therefore useful for minimizing a function because it tells us how to
change x in order to make a small improvement in y. For example, we know that
f(x − ε sign(f′(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in
small steps with the opposite sign of the derivative. This technique is called Gradient Descent
(Cauchy, 1847). See figure 1.1 for an example of this technique.

Figure 1.1: An illustration of how the Gradient Descent algorithm uses the derivatives of a
function to follow the function downhill to a minimum.

When f′(x) = 0, the derivative provides no information about which direction to
move. Points where f′(x) = 0 are known as critical points or stationary points. A local
minimum is a point where f(x) is lower than at all neighboring points, so it is no longer
possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where
f(x) is higher than at all neighboring points, so it is not possible to increase f(x) by making
infinitesimal steps. Some critical points are neither maxima nor minima. These are known
as saddle points. See figure 1.2 for examples of each type of critical point.

Figure 1.2: Examples of each of the three types of critical points in 1-D. A critical point is a
point with zero slope. Such a point can either be a local minimum, which is lower than the


neighboring points, a local maximum, which is higher than the neighboring points, or a
saddle point, which has neighbors that are both higher and lower than the point itself.

A point that obtains the absolute lowest value of f (x) is a global minimum. It is
possible for there to be only one global minimum or multiple global minima of the function.
It is also possible for there to be local minima that are not globally optimal. In the context of
deep learning, we optimize functions that may have many local minima that are not
optimal, and many saddle points surrounded by very flat regions. All of this makes
optimization very difficult, especially when the input to the function is multidimensional.
We therefore usually settle for finding a value of f that is very low, but not necessarily
minimal in any formal sense. See figure 1.3 for an example

Figure 1.3: Optimization algorithms may fail to find a global minimum when there
are multiple local minima or plateaus present. In the context of deep learning, we
generally accept such solutions even though they are not truly minimal, so long as they
correspond to significantly low values of the cost function.
The gradient points directly uphill, and the negative gradient points directly
downhill. We can decrease f by moving in the direction of the negative gradient. This is
known as the method of steepest descent or gradient descent.

1.3.1 Beyond the Gradient: Jacobian and Hessian Matrices:


Sometimes we need to find all of the partial derivatives of a function whose input
and output are both vectors. The matrix containing all such partial derivatives is known as a
Jacobian matrix. Specifically, if we have a function f : Rm → Rn, then the Jacobian matrix
J ∈ Rn×m of f is defined such that Ji,j = ∂f(x)i/∂xj.
We are also sometimes interested in a derivative of a derivative. This is known as a
second derivative. For example, for a function f : Rn → R, the derivative with respect to xi of
the derivative of f with respect to xj is denoted as ∂²f/(∂xi ∂xj).


Figure 1.4: The second derivative determines the curvature of a function. Here we show quadratic
functions with various curvature. The dashed line indicates the value of the cost function we would expect
based on the gradient information alone as we make a gradient step downhill. In the case of negative
curvature, the cost function actually decreases faster than the gradient predicts. In the case of no curvature,
the gradient predicts the decrease correctly. In the case of positive curvature, the function decreases slower
than expected and eventually begins to increase, so steps that are too large can actually increase the
function inadvertently.

Figure 1.4 shows how different forms of curvature affect the relationship between
the value of the cost function predicted by the gradient and the true value.
When our function has multiple input dimensions, there are many second
derivatives. These derivatives can be collected together into a matrix called the Hessian
matrix. The Hessian matrix H(f)(x) is defined such that

H(f)(x)i,j = ∂²f(x)/(∂xi ∂xj).
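
The Hessian can be approximated numerically with finite differences; the sketch below
(the test function and step size h are illustrative choices of ours) estimates the Hessian of
f(x) = x1² + 3·x1·x2:

import numpy as np

def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def hessian(f, x, h=1e-4):
    # Finite-difference estimate of H(f)(x)[i, j] = d^2 f / (dx_i dx_j)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h ** 2
    return H

print(hessian(f, np.array([1.0, 2.0])))   # approximately [[2, 3], [3, 0]]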

2. Machine Learning Basics


The central challenge in machine learning is that we must perform well on
new, previously unseen inputs—not just those on which our model was trained. The
ability to perform well on previously unobserved inputs is called generalization.
We typically estimate the generalization error of a machine learning model by
measuring its performance on a test set of examples that were collected separately
from the training set.

2.1 Capacity -- Overfitting and underfitting:

How well a machine learning algorithm performs is determined by its ability to make
the training error small and to make the gap between training and test error small. These
two factors correspond to the two central challenges in machine learning: underfitting and
overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low
error value on the training set. Overfitting occurs when the
gap between the training error and test error is too large. We can control whether a
model is more likely to overfit or underfit by altering its capacity. Informally, a model’s
capacity is its ability to fit a wide variety of functions. Models with low capacity may
struggle to fit the training set. Models with high capacity can overfit by memorizing
properties of the training set that do not serve them well on the test set. One way to
control the capacity of a learning algorithm is by choosing its hypothesis space, the
set of functions that the learning algorithm is allowed to select as being the solution.
For example, the linear regression algorithm has the set of all linear functions of its
input as its hypothesis space. We can generalize linear regression to include
polynomials, rather than just linear functions, in its hypothesis space. Doing so
increases the model’s capacity.

A polynomial of degree one gives us the linear regression model with which
we are already familiar, with prediction
yˆ = b + wx.

By introducing x² as another feature provided to the linear regression model,
we can learn a model that is quadratic as a function of x:
yˆ = b + w1x + w2x²

Though this model implements a quadratic function of its input, the output is
still a linear function of the parameters, so we can still use the normal equations to
train the model in closed form. We can continue to add more powers of x as
additional features, for example to obtain a polynomial of degree 9:

yˆ = b + w1x + w2x² + · · · + w9x⁹
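
Because the model is linear in its parameters, the normal equations give a closed-form
fit. A minimal sketch (the synthetic data and the quadratic feature map are our own
illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0.0, 0.1, 20)   # true quadratic

X = np.vander(x, 3, increasing=True)    # feature columns: 1, x, x^2
w = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations: (X^T X) w = X^T y
print(w)                                # ~[1, 2, -3]: linear in w, quadratic in x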

Machine learning algorithms will generally perform best when their capacity is
appropriate for the true complexity of the task they need to perform and the amount
of training data they are provided with. Models with insufficient capacity are unable to
solve complex tasks. Models with high capacity can solve complex tasks, but when
their capacity is higher than needed to solve the present task they may overfit.
Figure 1.5 shows this principle in action. We compare a linear, quadratic and
degree-9 predictor attempting to fit a problem where the true underlying function is
quadratic. The linear function is unable to capture the curvature in the true
underlying problem, so it underfits. The degree-9 predictor is capable of representing
the correct function, but it is also capable of representing infinitely many other
functions that pass exactly through the training points, because we have more
parameters than training examples. We have little chance of choosing a solution that
generalizes well when so many wildly different solutions exist. In this example, the
quadratic model is perfectly matched to the true structure of the task so it
generalizes well to new data.
The model specifies which family of functions the learning algorithm can
choose from when varying the parameters in order to reduce a training objective.
This is called the representational capacity of the model.


Figure 1.5: We fit three models to this example training set. The training
data was generated synthetically, by randomly sampling x values and
choosing y deterministically by evaluating a quadratic function. (Left)A linear
function fit to the data suffers from underfitting—it cannot capture the
curvature that is present in the data. (Center)A quadratic function fit to the
data generalizes well to unseen points. It does not suffer from a significant
amount of overfitting or underfitting. (Right)A polynomial of degree 9 fit to the
data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse
to solve the underdetermined normal equations. The solution passes through
all of the training points exactly, but we have not been lucky enough for it to
extract the correct structure. It now has a deep valley in between two training
points that does not appear in the true underlying function. It also increases
sharply on the left side of the data, while the true function decreases in this
area.

2.1.1 The No Free Lunch Theorem


Unfortunately, even choosing an appropriate capacity does not resolve the entire problem. The no free
lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating
distributions, every classification algorithm has the same error rate when classifying previously unobserved
points. In other words, in some sense, no machine learning algorithm is universally any better than any other.
The most sophisticated algorithm we can conceive of has the same average performance (over all possible
tasks) as merely predicting that every point belongs to the same class.

2.2 Hyperparameters and Validation Sets


Most machine learning algorithms have several settings that we can use to control the behavior of the learning
algorithm. These settings are called hyperparameters. The values of hyperparameters are not adapted by the
learning algorithm itself (though we can design a nested learning procedure where one learning algorithm
learns the best hyperparameters for another learning algorithm).

In the polynomial regression example we saw in figure 1.5, there is a single hyperparameter: the degree of the
polynomial, which acts as a capacity hyperparameter. The λ value used to control the strength of weight decay
is another example of a hyperparameter.


Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is
difficult to optimize. More frequently, the setting must be a hyperparameter because it is not appropriate to
learn that hyperparameter on the training set. This applies to all hyperparameters that control model capacity.
If learned on the training set, such hyperparameters would always choose the maximum possible model
capacity, resulting in overfitting (refer to figure 1.6). For example, we can always fit the training set better with
a higher degree polynomial and a weight decay setting of λ = 0 than we could with a lower degree polynomial
and a positive weight decay setting.
To solve this problem, we need a validation set of examples that the training algorithm does not observe.

Figure 1.6: Typical relationship between capacity and error. Training and test error behave
differently. At the left end of the graph, training error and generalization error are both
high. This is the underfitting regime. As we increase capacity, training error decreases, but
the gap between training and generalization error increases. Eventually, the size of this
gap outweighs the decrease in training error, and we enter the overfitting regime, where
capacity is too large, above the optimal capacity.

Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the
training set, can be used to estimate the generalization error of a learner, after the learning process has
completed. It is important that the test examples are not used in any way to make choices about the model,
including its hyperparameters. For this reason, no example from the test set can be used in the validation set.
Therefore, we always construct the validation set from the training data. Specifically, we split the training data
into two disjoint subsets. One of these subsets is used to learn the parameters. The other subset is our
validation set, used to estimate the generalization error during or after training, allowing for the
hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically
called the training set, even though this may be confused with the larger pool of data used for the entire
training process. The subset of data used to guide the selection of hyperparameters is called the validation set.
Typically, one uses about 80% of the training data for training and 20% for validation. Since the validation set is
used to “train” the hyperparameters, the validation set error will underestimate the generalization error,
though typically by a smaller amount than the training error. After all hyperparameter optimization is
complete, the generalization error may be estimated using the test set.
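
The split-then-select procedure can be sketched as follows (the 80/20 split, the use of
polynomial degree as the capacity hyperparameter, and the synthetic data are all
illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 1 + 2 * x - 3 * x ** 2 + rng.normal(0, 0.1, 100)

idx = rng.permutation(100)
train, val = idx[:80], idx[80:]          # 80% training, 20% validation

def val_error(degree):
    X = np.vander(x[train], degree + 1, increasing=True)
    w = np.linalg.lstsq(X, y[train], rcond=None)[0]   # fit on the training subset
    Xv = np.vander(x[val], degree + 1, increasing=True)
    return np.mean((Xv @ w - y[val]) ** 2)            # score on the validation subset

errors = {d: val_error(d) for d in range(1, 10)}
print(min(errors, key=errors.get))       # degree with the lowest validation error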

In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms
over many years, and especially if we consider all the attempts from the scientific community at beating the
reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test
set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained
system. Thankfully, the community tends to move on to new (and usually more ambitious and larger)
benchmark datasets.

2.2.1 Cross-Validation:
Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set
being small. A small test set implies statistical uncertainty around the estimated average test error, making it
difficult to claim that algorithm A works better than algorithm B on the given task.
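
The standard remedy is k-fold cross-validation: the dataset is split into k non-overlapping
folds, each fold serves once as the held-out set, and the k errors are averaged. A generic
sketch (the fit/score callables are placeholders supplied by the user, not a fixed API):

import numpy as np

def k_fold_error(x, y, fit, score, k=5, seed=0):
    # Average held-out error over k non-overlapping validation folds
    folds = np.array_split(np.random.default_rng(seed).permutation(len(x)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errs.append(score(fit(x[train], y[train]), x[val], y[val]))
    return np.mean(errs)

# Example usage with a constant (mean) predictor and a squared-error score
fit = lambda x, y: y.mean()
score = lambda m, x, y: np.mean((y - m) ** 2)
print(k_fold_error(np.arange(10.0), np.arange(10.0), fit, score))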

2.3 Estimators

Estimation is a statistical term for finding some estimate of an unknown parameter,
given some data.
Point estimation is the attempt to provide the single best prediction of some
quantity of interest.

Quantity of interest can be:


● A single parameter

● A vector of parameters — e.g., weights in linear regression

● A whole function

Point estimator
To distinguish estimates of parameters from their true value, a point estimate of a
parameter θ is represented by θˆ. Let {x(1), x(2), . . . , x(m)} be a set of m independent
and identically distributed data points. Then a point estimator is any function of the data:

θˆm = g(x(1), . . . , x(m)).

This definition of a point estimator is very general and allows the designer of an
estimator great flexibility. While almost any function thus qualifies as an estimator,
a good estimator is a function whose output is close to the true underlying θ that
generated the training data.

Point estimation can also refer to the estimation of the relationship between input
and target variables, referred to as function estimation.

Function Estimation
Here we are trying to predict a variable y given an input vector x.
We assume that there is a function f(x) that describes the
approximate relationship between y and x. For example,

we may assume that y = f(x) + ε, where ε stands for the part of y that is
not predictable from x. In function estimation, we are interested in
approximating f with a model or estimate fˆ. Function estimation is
really just the same as estimating a parameter θ; the function
estimator fˆ is simply a point estimator in function space. For example, in
polynomial regression we are either estimating a parameter w or estimating a
function mapping from x to y.

Bias and Variance


Bias and variance measure two different sources of error in an estimator.
Bias measures the expected deviation from the true value of the function or
parameter.
Variance, on the other hand, provides a measure of the deviation from the
expected estimator value that any particular sampling of the data is likely
to cause.

Bias
The bias of an estimator is defined as:

bias(θˆm) = E(θˆm) − θ

where the expectation is over the data (seen as samples from a random variable)
and θ is the true underlying value of θ used to define the data generating
distribution.

An estimator θˆm is said to be unbiased if bias(θˆm) = 0, which implies that
E(θˆm) = θ.

Variance and Standard Error


The variance of an estimator is simply Var(θˆ), where the random variable is the
training set. Alternately, the square root of the variance is called the standard
error, denoted SE(θˆ). The variance or the standard error of an estimator provides
a measure of how we would expect the estimate we compute from data to vary as
we independently re-sample the dataset from the underlying data generating
process.


Just as we might like an estimator to exhibit low bias, we would also like it to
have relatively low variance.

Having discussed the definition of an estimator, let us now discuss some commonly
used estimators.

Maximum Likelihood Estimator (MLE)


Maximum Likelihood Estimation can be defined as a method for estimating
parameters (such as the mean or variance) from sample data such that the
probability (likelihood) of obtaining the observed data is maximized.

Consider a set of m examples X = {x(1), . . . , x(m)} drawn independently from the
true but unknown data generating distribution Pdata(x). Let Pmodel(x; θ) be a
parametric family of probability distributions over the same space indexed by θ.
In other words, Pmodel(x; θ) maps any configuration x to a real number estimating
the true probability Pdata(x). The maximum likelihood estimator for θ is then
defined as:

θML = arg maxθ Pmodel(X; θ)

Since we assumed the examples to be i.i.d., the above equation can be written in
the product form as:

θML = arg maxθ Πi Pmodel(x(i); θ), with the product running over i = 1, . . . , m.


This product over many probabilities can be inconvenient for a variety of reasons.
For example, it is prone to numerical underflow. Also, to find the maxima/minima
of this function, we can take the derivative of this function w.r.t. θ and equate it
to 0. Since we have terms in a product here, we would need to apply the product
rule, which is quite cumbersome with many factors. To obtain a more convenient
but equivalent optimization problem, we observe that taking the logarithm of the
likelihood does not change its arg max but does conveniently transform a product
into a sum; since log is a strictly increasing function (the natural log function is
a monotone transformation), it does not impact the resulting value of θ:

θML = arg maxθ Σi log Pmodel(x(i); θ)
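
A concrete sketch for a Gaussian model (the true parameters 2.0 and 1.5 are our own
example values) shows the log-likelihood as a sum over examples and the closed-form
maximizer:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.5, 10000)       # i.i.d. samples from p_data

def log_likelihood(mu, sigma, x):
    # Sum of per-example log densities (a sum, thanks to the log transform)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

mu_ml, sigma_ml = data.mean(), data.std()   # the Gaussian MLE has a closed form
print(mu_ml, sigma_ml)                      # close to the true 2.0 and 1.5
print(log_likelihood(mu_ml, sigma_ml, data) >= log_likelihood(1.0, 1.0, data))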

Two important properties: Consistency & Efficiency


Consistency: As the number of training examples approaches
infinity, the maximum likelihood estimate of a parameter converges
to the true value of the parameter.

Efficiency: A way to measure how close we are to the true parameter is by the
expected mean squared error, computing the squared difference between the
estimated and true parameter values, where the expectation is over m training
samples from the data generating distribution.

For the reasons of consistency and efficiency, maximum likelihood is often
considered the preferred estimator to use for machine learning.


2.4 Bias and variance


The bias of an estimator is defined as:

bias(θˆm) = E(θˆm) − θ

An estimator θˆm is unbiased if bias(θˆm) = 0.

Variance:
Another property of the estimator that we might want to consider is how much we expect it to vary as a
function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we
can compute its variance. The variance of an estimator is simply the variance Var(θˆ),
where the random variable is the training set. Alternately, the square root of the variance
is called the standard error, denoted SE(θˆ).

The variance or the standard error of an estimator provides a measure of how we would expect the estimate
we compute from data to vary as we independently resample the dataset from the underlying data generating
process. Just as we might like an estimator to exhibit low bias we would also like it to have relatively low
variance.

When we compute any statistic using a finite number of samples, our estimate of the true underlying
parameter is uncertain, in the sense that we could have obtained other samples from the same distribution
and their statistics would have been different. The expected degree of variation in any estimator is a source of
error that we want to quantify.
The standard error of the mean is given by

SE(μˆm) = √( Var[ (1/m) Σi x(i) ] ) = σ/√m

where σ² is the true variance of the samples x(i). The standard error is often estimated by using an estimate of
σ. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator
of the variance provide an unbiased estimate of the standard deviation. Both approaches tend to
underestimate the true standard deviation, but are still used in practice. The square root of the unbiased
estimator of the variance is less of an underestimate. For large m, the approximation is quite reasonable.
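
A quick simulation (the sample size m = 100 and σ = 2.0 are arbitrary choices) confirms
that the spread of the sample mean matches σ/√m:

import numpy as np

rng = np.random.default_rng(0)
m, sigma = 100, 2.0
means = np.array([rng.normal(0.0, sigma, m).mean() for _ in range(20000)])

print(means.std())           # empirical standard error of the mean
print(sigma / np.sqrt(m))    # theoretical value: sigma / sqrt(m) = 0.2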


2.5 Stochastic Gradient Descent


Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent or SGD.
Stochastic gradient descent is an extension of the gradient descent algorithm.

A recurring problem in machine learning is that large training sets are necessary for good generalization, but
large training sets are also more computationally expensive.

The cost function used by a machine learning algorithm often decomposes as a sum over training examples of
some per-example loss function. For example, the negative conditional log-likelihood of the training data can
be written as

J(θ) = (1/m) Σ(i=1..m) L(x(i), y(i), θ)

where L is the per-example loss L(x, y, θ) = −log p(y | x; θ).
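
SGD estimates this sum with a small minibatch at each step and updates the parameters
against the estimated gradient. A sketch on linear regression with squared error (the
problem sizes, learning rate, and batch size are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + rng.normal(0.0, 0.1, 10000)

w = np.zeros(5)
lr, batch = 0.1, 64
for _ in range(2000):
    idx = rng.integers(0, len(X), batch)          # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch * Xb.T @ (Xb @ w - yb)     # minibatch gradient of the MSE
    w -= lr * grad                                # SGD update: w <- w - lr * g
print(w)                                          # close to true_w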


2.6 Challenges Motivating Deep Learning


The simple machine learning algorithms described in this chapter work very well on a wide variety of
important problems. However, they have not succeeded in solving the central problems in AI, such as
recognizing speech or recognizing objects. The development of deep learning was motivated in part by the
failure of traditional algorithms to generalize well on such AI tasks. This section is about how the challenge of
generalizing to new examples becomes exponentially more difficult when working with high-dimensional data,
and how the mechanisms used to achieve generalization in traditional machine learning are insufficient to
learn complicated functions in high-dimensional spaces. Such spaces also often impose high computational
costs. Deep learning was designed to overcome these and other obstacles.

2.6.1 The Curse of Dimensionality


Many machine learning problems become exceedingly difficult when the number of dimensions in the data is
high. This phenomenon is known as the curse of dimensionality. Of particular concern is that the number of
possible distinct configurations of a set of variables increases exponentially as the number of variables
increases.

Figure 1.7: As the number of relevant dimensions of the data increases (from left to right), the number of
configurations of interest may grow exponentially. (Left)In this one-dimensional example, we have one
variable for which we only care to distinguish 10 regions of interest. With enough examples falling within
each of these regions (each region corresponds to a cell in the illustration), learning algorithms can easily
generalize correctly. A straightforward way to generalize is to estimate the value of the target function
within each region (and possibly interpolate between neighboring regions). (Center)With 2 dimensions it is
more difficult to distinguish 10 different values of each variable. We need to keep track of up to 10×10=100
regions, and we need at least that many examples to cover all those regions. (Right)With 3 dimensions this
grows to 10³ = 1000 regions and at least that many examples. For d dimensions and v values to be
distinguished along each axis, we seem to need O(v^d) regions and examples. This is an instance of the curse
of dimensionality. Figure graciously provided by Nicolas Chapados.

The curse of dimensionality arises in many places in computer science, and especially so in machine learning.
One challenge posed by the curse of dimensionality is a statistical challenge. As illustrated in figure 1.7, a
statistical challenge arises because the number of possible configurations of x is much larger than the number
of training examples. To understand the issue, let us consider that the input space is organized into a grid, like
in the figure. We can describe low-dimensional space with a low number of grid cells that are mostly occupied
by the data. When generalizing to a new data point, we can usually tell what to do simply by inspecting the
training examples that lie in the same cell as the new input. For example, if estimating the probability density
at some point x, we can just return the number of training examples in the same unit volume cell as x, divided
by the total number of training examples. If we wish to classify an example, we can return the most common
class of training examples in the same cell. If we are doing regression, we can average the target values
observed over the examples in that cell. But what about the cells for which we have seen no example?
Because in high-dimensional spaces the number of configurations is huge, much larger than our number of
examples, a typical grid cell has no training example associated with it. How could we possibly say something
meaningful about these new configurations? Many traditional machine learning algorithms simply assume that
the output at a new point should be approximately the same as the output at the nearest training point.

2.6.2 Local Constancy and Smoothness Regularization


In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of
function they should learn. Previously, we have seen these priors incorporated as explicit beliefs in the form of
probability distributions over parameters of the model. More informally, we may also discuss prior beliefs as
directly influencing the function itself and only indirectly acting on the parameters via their effect on the
function. Additionally, we informally discuss prior beliefs as being expressed implicitly, by choosing algorithms
that are biased toward choosing some class of functions over another, even though these biases may not be
expressed (or even possible to express) in terms of a probability distribution representing our degree of belief
in various functions.
Among the most widely used of these implicit “priors” is the smoothness prior or local constancy prior. This
prior states that the function we learn should not change very much within a small region.

Figure 1.8: Illustration of how the nearest neighbor algorithm breaks up the input space into regions. An
example (represented here by a circle) within each region defines the region boundary (represented here by
the lines). The value associated with each example defines what the output should be for all points within
the corresponding region. The regions defined by nearest neighbor matching form a geometric pattern
called a Voronoi diagram. The number of these contiguous regions cannot grow faster than the number of
training examples. While this figure illustrates the behavior of the nearest neighbor algorithm specifically,
other machine learning algorithms that rely exclusively on the local smoothness prior for generalization
exhibit similar behaviors: each training example only informs the learner about how to generalize in some
neighborhood immediately surrounding that example.
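
The nearest neighbor rule itself is a few lines; the sketch below (the toy points are our
own) returns, for a query point, the label of the closest training example:

import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    # Local constancy in its purest form: copy the closest example's label
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
y_train = np.array([0, 1, 1])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.9, 0.8])))   # 1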

2.6.3 Manifold Learning

An important concept underlying many ideas in machine learning is that of a manifold.
A manifold is a connected region. Mathematically, it is a set of points,
associated with a neighborhood around each point. From any given point, the
manifold locally appears to be a Euclidean space. In everyday life, we
experience the surface of the world as a 2-D plane, but it is in fact a spherical
manifold in 3-D space.
The definition of a neighborhood surrounding each point implies the existence
of transformations that can be applied to move on the manifold from one


position to a neighboring one. In the example of the world's surface as a manifold,
one can walk north, south, east, or west.
Although there is a formal mathematical meaning to the term “manifold,” in
machine learning it tends to be used more loosely to designate a connected set
of points that can be approximated well by considering only a small number of
degrees of freedom, or dimensions, embedded in a higher-dimensional space.
Each dimension corresponds to a local direction of variation. See figure 1.9
for an example of training data lying near a one-dimensional manifold
embedded in two-dimensional space. In the context of machine learning, we
allow the dimensionality of the manifold to vary from one point to another.
This often happens when a manifold intersects itself. For example, a figure
eight is a manifold that has a single dimension in most places but two
dimensions at the intersection at the center.

Figure 1.9 Data sampled from a distribution in a two-dimensional space that is actually concentrated near a
one-dimensional manifold, like a twisted string. The solid line indicates the underlying manifold that the
learner should infer.
Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn
functions with interesting variations across all of Rn. Manifold learning algorithms surmount this obstacle by
assuming that most of Rn consists of invalid inputs, and that interesting inputs occur only along a collection of
manifolds containing a small subset of points, with interesting variations in the output of the learned function
occurring only along directions that lie on the manifold, or with interesting variations happening only when we
move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data
and the unsupervised learning setting, although this probability concentration idea can be generalized to both
discrete data and the supervised learning setting: the key assumption remains that probability mass is highly
concentrated.

Deep Networks:
Deep Networks, also known as Deep Neural Networks (DNNs), are a class of
artificial neural networks that are characterized by having multiple layers of
interconnected nodes or neurons. These networks are designed to process complex
data by learning hierarchical representations of the input data at different levels of
abstraction.

The key features of deep networks are:



1. Multiple layers: Deep networks consist of multiple layers, typically including an input
layer, one or more hidden layers, and an output layer. Each layer contains a set of
neurons that perform specific computations on the input data.
2. Non-linearity: Each neuron in a deep network uses a non-linear activation function to
introduce non-linearity into the model. This non-linearity enables the network to learn
complex patterns and relationships in the data.
3. Learning through training: Deep networks learn from data through a process called
training. During training, the network adjusts its internal parameters (weights and
biases) using optimization algorithms like gradient descent to minimize a predefined
loss function, which measures the difference between the predicted outputs and the
true labels.
4. Hierarchical representations: Deep networks learn to extract hierarchical
representations of the input data. The initial layers capture low-level features, and as
the information flows through deeper layers, more abstract and high-level features
are learned.
5. Feature learning: One of the main advantages of deep networks is their ability to
automatically learn relevant features from the raw input data. This feature learning
reduces the need for handcrafted features, which were commonly used in traditional
machine learning approaches.

Deep networks have shown remarkable success in various domains, including
computer vision, natural language processing, speech recognition, and many others.
Some popular deep network architectures include Convolutional Neural Networks
(CNNs) for image processing, Recurrent Neural Networks (RNNs) for sequential
data, and Transformer-based architectures for natural language understanding.

It is important to note that training deep networks can be computationally intensive
and may require large amounts of data. However, the significant breakthroughs in
hardware and the development of advanced training techniques (e.g., transfer
learning) have made it possible to train deeper and more powerful models
effectively.

Deep feedforward networks:

Deep feedforward networks, also known as feedforward neural networks or
multilayer perceptrons (MLPs), are a class of artificial neural networks where
information flows in one direction, from the input layer through the hidden layers to
the output layer. These networks are called "feedforward" because there are no
feedback connections or loops in the architecture.

Key characteristics of deep feedforward networks:

1. Architecture: Deep feedforward networks consist of multiple layers of interconnected
nodes, each layer having its own set of neurons. The first layer is the input layer,
which receives the raw input data. The intermediate layers are known as hidden
layers, and the final layer is the output layer, which produces the network's
predictions.


2. Neurons and Activation Functions: Neurons in deep feedforward networks are
computational units that take weighted inputs, apply an activation function to produce
an output, and pass it on to the next layer. The activation functions introduce
non-linearity to the model, allowing it to learn complex relationships in the data.
Common activation functions include sigmoid, tanh, and ReLU (Rectified Linear Unit).
3. Forward Propagation: Information flows in a single direction during inference or
prediction. The input data is passed through the layers sequentially, and
computations are performed to generate the output. This process is called forward
propagation.
4. Training: Deep feedforward networks are trained using a supervised learning
approach. They learn from labeled data by adjusting their internal parameters
(weights and biases) to minimize a predefined loss function. The optimization is
typically performed through backpropagation, where the gradients of the loss with
respect to the network's parameters are calculated and used to update the weights.
5. Depth and Expressiveness: The term "deep" in deep feedforward networks refers to
the presence of multiple hidden layers. Deeper networks have the potential to
capture more intricate and abstract representations of the data, allowing them to
learn complex patterns and improve their expressiveness.
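
A minimal forward-propagation sketch of such a network (the layer sizes, the ReLU
activation, and the random weights are illustrative choices, not a prescribed
architecture):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)              # non-linear activation

def forward(x, params):
    # Forward propagation: information flows input -> hidden layers -> output
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)                # affine transform + non-linearity
    W_out, b_out = params[-1]
    return h @ W_out + b_out               # linear output layer

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 1]                     # input, two hidden layers, output
params = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
print(forward(rng.normal(size=(8, 4)), params).shape)   # (8, 1)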

Deep feedforward networks are versatile and have been successfully applied in
various tasks, including classification, regression, pattern recognition, and function
approximation. However, as the number of layers increases, training deep
feedforward networks can become more challenging due to issues like vanishing or
exploding gradients. Techniques like weight initialization, batch normalization, and
skip connections (e.g., ResNet) have been introduced to help alleviate these
challenges and enable the successful training of deep networks.

Regularization in Deep Neural Networks

Regularization is a set of techniques used to prevent overfitting in deep neural networks and
improve their generalization performance. Overfitting occurs when a model becomes too
specialized in the training data, capturing noise or random fluctuations, and fails to generalize
well to unseen data. Regularization methods aim to reduce overfitting by introducing additional
constraints during the training process.

Here are some common regularization techniques used in deep neural networks:

1. L1 and L2 Regularization:
● L1 regularization adds a penalty term to the loss function proportional to the absolute
values of the network's weights. It encourages sparsity in the model by pushing some
weights to exactly zero, effectively removing less important features.
● L2 regularization, also known as weight decay, adds a penalty term to the loss function
proportional to the squared values of the network's weights. It penalizes large weights
and encourages smaller, more distributed weights. This helps prevent the network from
relying too heavily on a few dominant features. (A sketch of this penalty appears after
this list.)
2. Dropout: Dropout is a technique where, during training, random neurons are temporarily dropped
or ignored with a certain probability. This forces the network to learn more robust and redundant
representations, as it cannot rely on specific neurons always being present. Dropout helps
prevent overfitting and can lead to better generalization.
3. Batch Normalization: Batch normalization is a technique that normalizes the activations of each
layer during training. It helps stabilize and speed up training by reducing internal covariate shift.

Additionally, batch normalization acts as a regularization technique, as it introduces a small
amount of noise to the model during training.
4. Data Augmentation: Data augmentation involves applying random transformations (such as
rotation, translation, flipping, etc.) to the training data to create additional training samples. This
technique increases the effective size of the training dataset and helps the model generalize
better to variations in the data.
5. Early Stopping: Early stopping is a simple regularization technique where the training process is
halted before the model starts to overfit. It involves monitoring the validation loss during training
and stopping the training process when the validation loss starts to increase.
6. Ensemble Methods: Ensemble methods involve training multiple models and combining their
predictions. Ensemble methods, such as bagging and boosting, can improve generalization
performance and reduce overfitting.

Regularization techniques should be chosen and tuned carefully, as their effectiveness can vary
depending on the specific problem and dataset. A combination of different regularization
techniques can often lead to the best results in practice.

Optimization :
For a single training example, the Backpropagation algorithm calculates the gradient of
the error function. Backpropagation can be written as a function of the neural
network. Backpropagation algorithms are a set of methods used to efficiently train
artificial neural networks following a gradient descent approach which exploits the
chain rule.

The main features of Backpropagation are the iterative, recursive and efficient
method through which it calculates the updated weights to improve the network until it
can perform the task for which it is being trained. Backpropagation requires the
derivatives of the activation functions to be known at network design time.

Now, how is the error function used in Backpropagation, and how does Backpropagation
work? Let us start with an example and work through it mathematically to understand
exactly how Backpropagation updates the weights.


Input values
X1=0.05
X2=0.10

Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55

Bias Values
b1=0.35 b2=0.60

Target Values
T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass
To find the value of H1, we multiply the input values by the corresponding weights and add the bias b1:

H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775

To calculate the final output of H1, we apply the sigmoid function:

H1final = 1/(1 + e^(−H1)) = 1/(1 + e^(−0.3775)) = 0.593269992

We will calculate the value of H2 in the same way as H1:

H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925


To calculate the final output of H2, we apply the sigmoid function:

H2final = 1/(1 + e^(−H2)) = 1/(1 + e^(−0.3925)) = 0.596884378

Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.

To find the value of y1, we multiply the inputs to the output layer, i.e., the final
outcomes of H1 and H2, by the weights and add the bias b2:

y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597

To calculate the final output of y1, we apply the sigmoid function:

y1final = 1/(1 + e^(−y1)) = 1/(1 + e^(−1.10590597)) = 0.75136507

We will calculate the value of y2 in the same way as y1

y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214

To calculate the final output of y2, we apply the sigmoid function:

y2final = 1/(1 + e^(−y2)) = 1/(1 + e^(−1.2249214)) = 0.772928465


Our target values are 0.01 and 0.99. Our y1final and y2final values do not match the
target values T1 and T2.

Now, we will find the total error, which is simply the squared difference between the
outputs and the target outputs, summed over the output neurons:

Etotal = Σ ½(target − output)²

So, the total error is

Etotal = ½(0.01 − 0.75136507)² + ½(0.99 − 0.772928465)² = 0.274811083 + 0.023560026 = 0.298371109

Now, we will backpropagate this error to update the weights using a backward pass.

Backward pass at the output layer


To update a weight, we calculate the error corresponding to that weight with the help
of the total error. The error on weight w is calculated by differentiating the total error
with respect to w.

We perform the backward pass, so we first consider the output-layer weight w5. By the
chain rule,

∂Etotal/∂w5 = (∂Etotal/∂y1final) × (∂y1final/∂y1) × (∂y1/∂w5)

From equation (2), it is clear that we cannot partially differentiate it with respect to
w5 because there is no w5 term in it. We split equation (1) into multiple terms so that
we can easily differentiate it with respect to w5 as

Now, we calculate each term one by one to differentiate Etotal with respect to w5 as

Putting the value of e^(−y1) in equation (5)


So, we put these values into equation (3) to find the final result.

Now, we will calculate the updated weight w5new with the help of the following formula
(where η is the learning rate):

w5new = w5 − η × (∂Etotal/∂w5)

In the same way, we calculate w6new, w7new, and w8new, and this gives us the
following values

w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121

Backward pass at Hidden layer


Now, we will backpropagate to the hidden layer and update the weights w1, w2, w3,
and w4, as we did for the weights w5, w6, w7, and w8.

We will calculate the error at w1 as

From equation (2), it is clear that we cannot partially differentiate it with respect to w1
because there is no w1 term in it. We split equation (1) into multiple terms so that we can
easily differentiate it with respect to w1 as

Now, we calculate each term one by one to differentiate Etotal with respect to w1 as

We again split this because there is no H1final term in Etotal:

We will split again because there is no H1final term in E1 and E2. Splitting is done as

We again split both because there is no y1 or y2 term in E1 and E2. We split them as


Now, we find the values of these terms by putting values into equations (18) and (19) as

From equation (18)

From equation (8)

From equation (19)


Putting the value of e^(−y2) in equation (23)

From equation (21)

Now from equations (16) and (17)


Put this value into equation (15) as

We have ∂Etotal/∂H1final; we still need to figure out ∂H1final/∂H1 and ∂H1/∂w1 as


Putting the value of e^(−H1) in equation (30)

We calculate the partial derivative of the total net input to H1 with respect to w1 the
same as we did for the output neuron:

So, we put these values into equation (13) to find the final result.


Now, we will calculate the updated weight w1new with the help of the same formula

w1new = w1 − η × (∂Etotal/∂w1)

In the same way, we calculate w2new, w3new, and w4new, and this gives us the
following values

w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229

We have updated all the weights. We found the error 0.298371109 on the network
when we fed forward the inputs 0.05 and 0.1. In the first round of Backpropagation,
the total error comes down to 0.291027924. After repeating this process 10,000 times,
the total error comes down to 0.0000351085. At this point, the output neurons generate
0.015912196 and 0.984065734, i.e., close to our target values, when we feed forward
the inputs 0.05 and 0.1.
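
The whole worked example can be checked with a short script. This sketch reproduces
the forward pass, the total error, and the w5 update, assuming a learning rate of 0.5
(the rate is not stated above, but this value reproduces the w5new figure):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4, w5, w6, w7, w8 = 0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99

# Forward pass, exactly as in the worked example above
H1 = sigmoid(x1 * w1 + x2 * w2 + b1)     # 0.593269992
H2 = sigmoid(x1 * w3 + x2 * w4 + b1)     # 0.596884378
y1 = sigmoid(H1 * w5 + H2 * w6 + b2)     # 0.75136507
y2 = sigmoid(H1 * w7 + H2 * w8 + b2)     # 0.772928465

E_total = 0.5 * (T1 - y1) ** 2 + 0.5 * (T2 - y2) ** 2
print(E_total)                            # 0.298371109

# One backward step for w5: chain rule through the sigmoid output
dE_dw5 = (y1 - T1) * y1 * (1.0 - y1) * H1
print(w5 - 0.5 * dE_dw5)                  # 0.35891648, matching w5new above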
