Unit 1
TEXT BOOK
1. Ian Goodfellow, Yoshua Bengio, Aaron Courville, "Deep Learning", MIT Press, 2016.
2. Andrew Glassner, "Deep Learning: A Visual Approach", No Starch Press, 2021.
REFERENCES
1. Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, Mohammed Bennamoun, "A Guide to Convolutional Neural Networks for Computer Vision", Synthesis Lectures on Computer Vision, Morgan & Claypool Publishers, 2018.
2. Yoav Goldberg, "Neural Network Methods for Natural Language Processing", Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers, 2017.
3. Francois Chollet, "Deep Learning with Python", Manning Publications Co., 2018.
4. Charu C. Aggarwal, "Neural Networks and Deep Learning: A Textbook", Springer International Publishing, 2018.
5. Josh Patterson, Adam Gibson, "Deep Learning: A Practitioner's Approach", O'Reilly Media, 2017.
1. Linear Algebra
1. Linear algebra is a branch of mathematics.
2. It is widely used throughout science and engineering.
3. However, because it is continuous rather than discrete mathematics, many computer scientists have little experience with it.
4. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning.
1.1 Scalars, Vectors, Matrices and Tensors
Scalar: A scalar is just a single number, in contrast to most of the other objects
studied in linear algebra, which are usually arrays of multiple numbers. We write
scalars in italics. We usually give scalars lower-case variable names. When we
introduce them, we specify what kind of number they are.
For example, we might say “Let s ∈ R be the slope of the line,” while defining
a real-valued scalar, or “Let n ∈ N be the number of units,” while defining a natural
number scalar.
Vectors: A vector is an array of numbers. The numbers are arranged in order. We can
identify each individual number by its index in that ordering. Typically, we give
vectors lower case names written in bold typeface, such as x. The elements of the
vector are identified by writing its name in italic typeface, with a subscript.
Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one. We usually give matrices upper-case variable names written in bold typeface, such as A.
Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number
of axes is known as a tensor. We denote a tensor named “A” with this typeface: A.
We identify the element of A at coordinates (i, j, k) by writing A_{i,j,k}.
One important matrix operation is the transpose. The transpose of a matrix is the mirror image of the matrix across its main diagonal:
(A^T)_{i,j} = A_{j,i} for all i and j.
As an example of a probability mass function, we can place a uniform distribution on a discrete variable x with k different states by setting P(x = x_i) = 1/k for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive because k is a positive integer. We also see that Σ_i P(x = x_i) = Σ_i 1/k = k/k = 1, so the distribution is properly normalized.
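The following minimal numpy sketch illustrates these objects and the transpose (array contents are illustrative assumptions):

```python
import numpy as np

s = 3.5                                  # a scalar: a single number
x = np.array([1.0, 2.0, 3.0])            # a vector: elements indexed as x[0], x[1], ...
A = np.array([[1, 2], [3, 4], [5, 6]])   # a 3x2 matrix: elements A[i, j]
T = np.zeros((2, 3, 4))                  # a tensor with three axes: elements T[i, j, k]

# the transpose mirrors a matrix across its main diagonal: (A^T)[i, j] == A[j, i]
print(A.T.shape)             # (2, 3)
print(A.T[0, 2] == A[2, 0])  # True
```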
A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set. In the univariate example, the probability that x lies in the interval [a, b] is given by ∫[a,b] p(x) dx.
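As a quick numerical check of this idea, the sketch below integrates a density over an interval (the uniform density on [0, 1] is an assumption chosen for illustration):

```python
import numpy as np

# For a uniform density p(x) = 1 on [0, 1], the probability that x lies in
# [0.2, 0.5] should be exactly 0.3.
p = lambda x: np.where((x >= 0.0) & (x <= 1.0), 1.0, 0.0)

a, b = 0.2, 0.5
xs = np.linspace(a, b, 100001)
prob = np.trapz(p(xs), xs)   # numerical approximation of the integral
print(prob)                  # ~0.3
```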
Suppose we have a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted as f'(x) or as dy/dx. The derivative f'(x) gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + εf'(x).
The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f'(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847). See figure 1.1 for an example of this technique, and the short sketch that follows it.
Figure 1.1: An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the function downhill to a minimum.
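A minimal gradient-descent sketch in one dimension (the function, step size and starting point are illustrative assumptions, not from the text):

```python
# Minimize f(x) = (x - 3)^2 using only its derivative f'(x) = 2(x - 3).
f = lambda x: (x - 3.0) ** 2
df = lambda x: 2.0 * (x - 3.0)

x = 0.0          # initial guess
epsilon = 0.1    # step size (the small ε above)
for _ in range(100):
    x = x - epsilon * df(x)   # move against the sign of the derivative

print(x, f(x))   # x approaches 3, the minimum, where f(x) approaches 0
```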
Figure 1.2: Examples of each of the three types of critical points in 1-D. A critical point is a
point with zero slope. Such a point can either be a local minimum, which is lower than the
neighboring points, a local maximum, which is higher than the neighboring points, or a
saddle point, which has neighbors that are both higher and lower than the point itself.
A point that obtains the absolute lowest value of f (x) is a global minimum. It is
possible for there to be only one global minimum or multiple global minima of the function.
It is also possible for there to be local minima that are not globally optimal. In the context of
deep learning, we optimize functions that may have many local minima that are not
optimal, and many saddle points surrounded by very flat regions. All of this makes
optimization very difficult, especially when the input to the function is multidimensional.
We therefore usually settle for finding a value of f that is very low, but not necessarily
minimal in any formal sense. See figure 1.3 for an example.
Figure 1.3: Optimization algorithms may fail to find a global minimum when there
are multiple local minima or plateaus present. In the context of deep learning, we
generally accept such solutions even though they are not truly minimal, so long as they
correspond to significantly low values of the cost function.
The gradient points directly uphill, and the negative gradient points directly
downhill. We can decrease f by moving in the direction of the negative gradient. This is
known as the method of steepest descent or gradient descent.
Figure 1.4: The second derivative determines the curvature of a function. Here we show quadratic
functions with various curvature. The dashed line indicates the value of the cost function we would expect
based on the gradient information alone as we make a gradient step downhill. In the case of negative
curvature, the cost function actually decreases faster than the gradient predicts. In the case of no curvature,
the gradient predicts the decrease correctly. In the case of positive curvature, the function decreases slower
than expected and eventually begins to increase, so steps that are too large can actually increase the
function inadvertently.
See figure 1.4 to see how different forms of curvature affect the relationship between the value of the cost function predicted by the gradient and the true value.
When our function has multiple input dimensions, there are many second
derivatives. These derivatives can be collected together into a matrix called the Hessian
matrix. The Hessian matrix H(f)(x) is defined such that
H(f)(x)_{i,j} = ∂²f(x) / (∂x_i ∂x_j).
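A small sketch of this definition, approximating the Hessian with finite differences (the test function and helper name are illustrative assumptions):

```python
import numpy as np

# H[i, j] approximates d^2 f / (dx_i dx_j) via central differences.
def hessian(f, x, h=1e-5):
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h * h)
    return H

# For f(x) = x0^2 + 3*x0*x1, the true Hessian is [[2, 3], [3, 0]].
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
print(hessian(f, np.array([1.0, 2.0])))
```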
Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large. We can control whether a model is more likely to overfit or underfit by altering its capacity. Informally, a model's
capacity is its ability to fit a wide variety of functions. Models with low capacity may
struggle to fit the training set. Models with high capacity can overfit by memorizing
properties of the training set that do not serve them well on the test set. One way to
control the capacity of a learning algorithm is by choosing its hypothesis space, the
set of functions that the learning algorithm is allowed to select as being the solution.
For example, the linear regression algorithm has the set of all linear functions of its
input as its hypothesis space. We can generalize linear regression to include
polynomials, rather than just linear functions, in its hypothesis space. Doing so
increases the model’s capacity.
A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction
ŷ = b + wx.
By introducing x² as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of its input:
ŷ = b + w₁x + w₂x².
Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of x as additional features, for example to obtain a polynomial of degree 9:
ŷ = b + w₁x + w₂x² + ... + w₉x⁹.
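The sketch below fits polynomials of increasing degree with the Moore-Penrose pseudoinverse, mirroring the comparison in figure 1.5 (the synthetic data and helper name are illustrative assumptions; the true function is quadratic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10)
y = 1.5 * x ** 2 - 0.5 * x + 0.2       # quadratic ground truth

def fit_poly(x, y, degree):
    # design matrix with columns 1, x, x^2, ..., x^degree
    X = np.vander(x, degree + 1, increasing=True)
    return np.linalg.pinv(X) @ y        # pseudoinverse / normal-equations solution

w_linear = fit_poly(x, y, 1)   # underfits: cannot capture curvature
w_quad = fit_poly(x, y, 2)     # matches the true structure, generalizes well
w_deg9 = fit_poly(x, y, 9)     # passes through all 10 points, likely overfits
print(w_quad)                  # ~[0.2, -0.5, 1.5]
```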
Machine learning algorithms will generally perform best when their capacity is
appropriate for the true complexity of the task they need to perform and the amount
of training data they are provided with. Models with insufficient capacity are unable to
solve complex tasks. Models with high capacity can solve complex tasks, but when
their capacity is higher than needed to solve the present task they may overfit.
Figure 1.5 shows this principle in action. We compare a linear, quadratic and
degree-9 predictor attempting to fit a problem where the true underlying function is
quadratic. The linear function is unable to capture the curvature in the true
underlying problem, so it underfits. The degree-9 predictor is capable of representing
the correct function, but it is also capable of representing infinitely many other
functions that pass exactly through the training points, because we have more
parameters than training examples. We have little chance of choosing a solution that
generalizes well when so many wildly different solutions exist. In this example, the
quadratic model is perfectly matched to the true structure of the task so it
generalizes well to new data.
The model specifies which family of functions the learning algorithm can
choose from when varying the parameters in order to reduce a training objective.
This is called the representational capacity of the model.
Figure 1.5: We fit three models to this example training set. The training data was generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting—it cannot capture the curvature that is present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve the underdetermined normal equations. The solution passes through all of the training points exactly, but we have not been lucky enough for it to extract the correct structure. It now has a deep valley in between two training points that does not appear in the true underlying function. It also increases sharply on the left side of the data, while the true function decreases in this area.
In the polynomial regression example we saw in figure 1.5, there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The λ value used to control the strength of weight decay is another example of a hyperparameter.
Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is
difficult to optimize. More frequently, the setting must be a hyperparameter because it is not appropriate to
learn that hyperparameter on the training set. This applies to all hyperparameters that control model capacity.
If learned on the training set, such hyperparameters would always choose the maximum possible model
capacity, resulting in overfitting (refer to figure 1.6). For example, we can always fit the training set better with
a higher degree polynomial and a weight decay setting of λ = 0 than we could with a lower degree polynomial
and a positive weight decay setting.
To solve this problem, we need a validation set of examples that the training algorithm does not observe.
Figure 1.6: Typical relationship between capacity and error. Training and test error behave differently. At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the
training set, can be used to estimate the generalization error of a learner, after the learning process has
completed. It is important that the test examples are not used in any way to make choices about the model,
including its hyperparameters. For this reason, no example from the test set can be used in the validation set.
Therefore, we always construct the validation set from the training data. Specifically, we split the training data
into two disjoint subsets. One of these subsets is used to learn the parameters. The other subset is our
validation set, used to estimate the generalization error during or after training, allowing for the
hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically
called the training set, even though this may be confused with the larger pool of data used for the entire
training process. The subset of data used to guide the selection of hyperparameters is called the validation set.
Typically, one uses about 80% of the training data for training and 20% for validation. Since the validation set is
used to “train” the hyperparameters, the validation set error will underestimate the generalization error,
though typically by a smaller amount than the training error. After all hyperparameter optimization is
complete, the generalization error may be estimated using the test set.
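A minimal sketch of the 80/20 split described above (the array names and random data are illustrative assumptions):

```python
import numpy as np

X = np.random.rand(1000, 5)   # 1000 examples, 5 features
y = np.random.rand(1000)

perm = np.random.permutation(len(X))   # shuffle before splitting
split = int(0.8 * len(X))
train_idx, val_idx = perm[:split], perm[split:]

X_train, y_train = X[train_idx], y[train_idx]   # learn parameters here
X_val, y_val = X[val_idx], y[val_idx]           # tune hyperparameters here
# the held-out test set is touched only once, after all tuning is complete
```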
In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms
over many years, and especially if we consider all the attempts from the scientific community at beating the
reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test
set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained
system. Thankfully, the community tends to move on to new (and usually more ambitious and larger)
benchmark datasets.
2.2.1 Cross-Validation:
Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small. A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task. The k-fold cross-validation procedure addresses this by splitting the dataset into k non-overlapping subsets and averaging the test error across k trials: on trial i, the i-th subset is used as the test set while the rest of the data is used for training.
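A minimal sketch of k-fold cross-validation (evaluate() is an assumed stand-in for training a model on one split and returning its test error):

```python
import numpy as np

def k_fold_errors(X, y, k, evaluate):
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]                                    # i-th fold: test
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # rest: train
        errors.append(evaluate(X[train_idx], y[train_idx],
                               X[test_idx], y[test_idx]))
    return np.mean(errors)   # average test error over the k trials
```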
2.3 Estimators
Point estimation is the attempt to provide the single "best" prediction of some quantity of interest. The quantity of interest can be:
● A single parameter
● A whole function
Point estimator
To distinguish estimates of parameters from their true value, a point estimate of a parameter θ is represented by θ̂. Let {x^(1), ..., x^(m)} be a set of m independent and identically distributed data points. A point estimator or statistic is any function of the data: θ̂_m = g(x^(1), ..., x^(m)).
Function Estimation
Here we are trying to predict a variable y given an input vector x.
We assume that there is a function f(x) that describes the
approximate relationship between y and x. For example,
we may assume that y = f(x) + ε, where ε stands for the part of y that is
not predictable from x. In function estimation, we are interested in approximating f with a model or estimate f̂. Function estimation is really just the same as estimating a parameter θ; the function estimator f̂ is simply a point estimator in function space.
Bias
The bias of an estimator is defined as:
bias(θ̂_m) = E(θ̂_m) − θ
where the expectation is over the data (seen as samples of a random variable) and θ is the true underlying value of the parameter. An estimator θ̂_m is said to be unbiased if bias(θ̂_m) = 0, which implies that E(θ̂_m) = θ.
Variance:
Another property of the estimator that we might want to consider is how much we expect it to vary as a
function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we
can compute its variance. The variance of an estimator is simply the variance
Var(θ̂)
where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted SE(θ̂).
The variance or the standard error of an estimator provides a measure of how we would expect the estimate
we compute from data to vary as we independently resample the dataset from the underlying data generating
process. Just as we might like an estimator to exhibit low bias we would also like it to have relatively low
variance.
When we compute any statistic using a finite number of samples, our estimate of the true underlying
parameter is uncertain, in the sense that we could have obtained other samples from the same distribution
and their statistics would have been different. The expected degree of variation in any estimator is a source of
error that we want to quantify.
The standard error of the mean is given by
SE(μ̂_m) = sqrt(Var[(1/m) Σ_i x^(i)]) = σ/√m
where σ² is the true variance of the samples x^(i). The standard error is often estimated by using an estimate of σ. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provides an unbiased estimate of the standard deviation. Both approaches tend to underestimate the true standard deviation, but are still used in practice. The square root of the unbiased estimator of the variance is less of an underestimate. For large m, the approximation is quite reasonable.
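A small sketch estimating the standard error of the mean from a finite sample (the standard-normal data is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10000
x = rng.normal(loc=0.0, scale=1.0, size=m)

mean_hat = x.mean()
# ddof=1 gives the square root of the unbiased variance estimator; as noted
# above, it still slightly underestimates the true standard deviation.
se_hat = x.std(ddof=1) / np.sqrt(m)
print(mean_hat, se_hat)   # mean near 0, standard error near 1/sqrt(10000) = 0.01
```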
A recurring problem in machine learning is that large training sets are necessary for good generalization, but
large training sets are also more computationally expensive.
The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as
J(θ) = (1/m) Σ_{i=1}^{m} L(x^(i), y^(i), θ)
where L is the per-example loss L(x, y, θ) = −log p(y | x; θ).
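Because the cost is an average of per-example losses, its gradient can be estimated from a small minibatch instead of the full training set, which is the idea behind stochastic gradient descent. A minimal sketch (grad_loss() is an assumed per-example gradient function for illustration):

```python
import numpy as np

def minibatch_gradient(grad_loss, X, y, theta, batch_size=32):
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    grads = [grad_loss(X[i], y[i], theta) for i in idx]
    return np.mean(grads, axis=0)   # unbiased estimate of the full gradient

# One step of stochastic gradient descent with learning rate eta:
# theta = theta - eta * minibatch_gradient(grad_loss, X, y, theta)
```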
Figure 1.7: As the number of relevant dimensions of the data increases (from left to right), the number of configurations of interest may grow exponentially. (Left) In this one-dimensional example, we have one variable for which we only care to distinguish 10 regions of interest. With enough examples falling within each of these regions (each region corresponds to a cell in the illustration), learning algorithms can easily generalize correctly. A straightforward way to generalize is to estimate the value of the target function within each region (and possibly interpolate between neighboring regions). (Center) With 2 dimensions it is more difficult to distinguish 10 different values of each variable. We need to keep track of up to 10×10 = 100 regions, and we need at least that many examples to cover all those regions. (Right) With 3 dimensions this grows to 10³ = 1000 regions and at least that many examples. For d dimensions and v values to be distinguished along each axis, we seem to need O(v^d) regions and examples. This is an instance of the curse of dimensionality. Figure graciously provided by Nicolas Chapados.
The curse of dimensionality arises in many places in computer science, and especially so in machine learning.
One challenge posed by the curse of dimensionality is a statistical challenge. As illustrated in figure 1.7, a
statistical challenge arises because the number of possible configurations of x is much larger than the number
of training examples. To understand the issue, let us consider that the input space is organized into a grid, like
in the figure. We can describe low-dimensional space with a low number of grid cells that are mostly occupied
by the data. When generalizing to a new data point, we can usually tell what to do simply by inspecting the
training examples that lie in the same cell as the new input. For example, if estimating the probability density
at some point x, we can just return the number of training examples in the same unit volume cell as x, divided
by the total number of training examples. If we wish to classify an example, we can return the most common
class of training examples in the same cell. If we are doing regression, we can average the target values
observed over the examples in that cell. But what about the cells for which we have seen no example?
Because in high-dimensional spaces the number of configurations is huge, much larger than our number of
examples, a typical grid cell has no training example associated with it. How could we possibly say something
meaningful about these new configurations? Many traditional machine learning algorithms simply assume that
the output at a new point should be approximately the same as the output at the nearest training point.
Figure 1.8: Illustration of how the nearest neighbor algorithm breaks up the input space into regions. An example (represented here by a circle) within each region defines the region boundary (represented here by the lines). The y value associated with each example defines what the output should be for all points within the corresponding region. The regions defined by nearest neighbor matching form a geometric pattern called a Voronoi diagram. The number of these contiguous regions cannot grow faster than the number of training examples. While this figure illustrates the behavior of the nearest neighbor algorithm specifically, other machine learning algorithms that rely exclusively on the local smoothness prior for generalization exhibit similar behaviors: each training example only informs the learner about how to generalize in some neighborhood immediately surrounding that example.
Figure 1.9 Data sampled from a distribution in a two-dimensional space that is actually concentrated near a
one-dimensional manifold, like a twisted string. The solid line indicates the underlying manifold that the
learner should infer.
Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn
functions with interesting variations across all of Rn. Manifold learning algorithms surmount this obstacle by
assuming that most of Rn consists of invalid inputs, and that interesting inputs occur only along a collection of
manifolds containing a small subset of points, with interesting variations in the output of the learned function
occurring only along directions that lie on the manifold, or with interesting variations happening only when we
move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data
and the unsupervised learning setting, although this probability concentration idea can be generalized to both
discrete data and the supervised learning setting: the key assumption remains that probability mass is highly
concentrated.
Deep Networks:
Deep Networks, also known as Deep Neural Networks (DNNs), are a class of
artificial neural networks that are characterized by having multiple layers of
interconnected nodes or neurons. These networks are designed to process complex
data by learning hierarchical representations of the input data at different levels of
abstraction.
1. Multiple layers: Deep networks consist of multiple layers, typically including an input
layer, one or more hidden layers, and an output layer. Each layer contains a set of
neurons that perform specific computations on the input data.
2. Non-linearity: Each neuron in a deep network uses a non-linear activation function to
introduce non-linearity into the model. This non-linearity enables the network to learn
complex patterns and relationships in the data.
3. Learning through training: Deep networks learn from data through a process called
training. During training, the network adjusts its internal parameters (weights and
biases) using optimization algorithms like gradient descent to minimize a predefined
loss function, which measures the difference between the predicted outputs and the
true labels.
4. Hierarchical representations: Deep networks learn to extract hierarchical
representations of the input data. The initial layers capture low-level features, and as
the information flows through deeper layers, more abstract and high-level features
are learned.
5. Feature learning: One of the main advantages of deep networks is their ability to
automatically learn relevant features from the raw input data. This feature learning
reduces the need for handcrafted features, which were commonly used in traditional
machine learning approaches. (A minimal sketch of such a network follows this list.)
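A minimal sketch putting these pieces together: an input layer, two hidden layers with a non-linear activation, and an output layer (all sizes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 2]   # input -> hidden -> hidden -> output

weights = [rng.normal(0, 0.1, (m, n))
           for m, n in zip(layer_sizes[1:], layer_sizes[:-1])]
biases = [np.zeros(m) for m in layer_sizes[1:]]

relu = lambda z: np.maximum(0.0, z)   # non-linearity enables complex patterns

def forward(x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)               # hidden layers build up representations
    return weights[-1] @ h + biases[-1]   # linear output layer

print(forward(np.array([0.5, -0.1, 0.3, 0.9])))
```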
Deep feedforward networks are versatile and have been successfully applied in
various tasks, including classification, regression, pattern recognition, and function
approximation. However, as the number of layers increases, training deep
feedforward networks can become more challenging due to issues like vanishing or
exploding gradients. Techniques like weight initialization, batch normalization, and
skip connections (e.g., ResNet) have been introduced to help alleviate these
challenges and enable the successful training of deep networks.
Regularization is a set of techniques used to prevent overfitting in deep neural networks and
improve their generalization performance. Overfitting occurs when a model becomes too
specialized in the training data, capturing noise or random fluctuations, and fails to generalize
well to unseen data. Regularization methods aim to reduce overfitting by introducing additional
constraints during the training process.
Here are some common regularization techniques used in deep neural networks; a minimal code sketch of each follows the list.
1. L1 and L2 Regularization:
L1 regularization adds a penalty term to the loss function proportional to the absolute
values of the network's weights. It encourages sparsity in the model by pushing some
weights to exactly zero, effectively removing less important features.
L2 regularization, also known as weight decay, adds a penalty term to the loss function
proportional to the squared values of the network's weights. It penalizes large weights
and encourages smaller, more distributed weights. This helps prevent the network from
relying too heavily on a few dominant features.
2. Dropout: Dropout is a technique where, during training, random neurons are temporarily dropped
or ignored with a certain probability. This forces the network to learn more robust and redundant
representations, as it cannot rely on specific neurons always being present. Dropout helps
prevent overfitting and can lead to better generalization.
3. Batch Normalization: Batch normalization is a technique that normalizes the activations of each
layer during training. It helps stabilize and speed up training by reducing internal covariate shift.
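Minimal sketches of the three techniques above (all function names and defaults are illustrative assumptions, not a specific library's API):

```python
import numpy as np

def l2_penalty(weights, lam):
    # weight decay: add lam * sum(w^2) to the loss; an L1 penalty would use
    # lam * np.sum(np.abs(weights)) instead, encouraging exact zeros
    return lam * np.sum(weights ** 2)

def dropout(activations, keep_prob=0.8):
    # inverted dropout: randomly zero activations during training and rescale
    # so the expected activation is unchanged; disabled at test time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each feature over the batch, then apply a learnable scale
    # (gamma) and shift (beta)
    mean, var = X.mean(axis=0), X.var(axis=0)
    return gamma * (X - mean) / np.sqrt(var + eps) + beta
```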
Regularization techniques should be chosen and tuned carefully, as their effectiveness can vary
depending on the specific problem and dataset. A combination of different regularization
techniques can often lead to the best results in practice.
Optimization:
For a single training example, the backpropagation algorithm calculates the gradient of the error function. Backpropagation can be written as a function of the neural network. Backpropagation algorithms are a set of methods used to efficiently train artificial neural networks following a gradient descent approach that exploits the chain rule.
Backpropagation is an iterative, recursive and efficient method for calculating the updated weights that improve the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.
Consider a network with two inputs, two hidden neurons (H1, H2), two output neurons (y1, y2), and a bias at each layer, where every neuron uses the sigmoid (logistic) activation function.
Input values
X1=0.05
X2=0.10
Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1 we first multiply the input value from the weights as
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
We then pass H1 and H2 through the sigmoid activation function f(z) = 1/(1 + e^(−z)):
out H1 = 1/(1 + e^(−0.3775)) = 0.593269992
out H2 = 1/(1 + e^(−0.3925)) = 0.596884378
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.
To find the value of y1, we multiply the outputs of H1 and H2 by the corresponding weights:
y1 = out H1 × w5 + out H2 × w6 + b2
y1 = 0.593269992 × 0.40 + 0.596884378 × 0.45 + 0.60
y1 = 1.10590597
y2 = out H1 × w7 + out H2 × w8 + b2
y2 = 0.593269992 × 0.50 + 0.596884378 × 0.55 + 0.60
y2 = 1.2249214
Passing y1 and y2 through the sigmoid activation gives the final outputs:
out y1 = 1/(1 + e^(−1.10590597)) = 0.75136507
out y2 = 1/(1 + e^(−1.2249214)) = 0.772928465
Our target values are 0.01 and 0.99, but the network outputs 0.75136507 and 0.772928465, which do not match the targets T1 and T2.
Now, we will find the total error, which is simply the sum of the squared differences between the targets and the outputs, halved for each output neuron:
E_total = Σ ½(target − output)²
E1 = ½(0.01 − 0.75136507)² = 0.274811083
E2 = ½(0.99 − 0.772928465)² = 0.023560026
E_total = E1 + E2 = 0.274811083 + 0.023560026 = 0.298371109
Now, we will backpropagate this error to update the weights using a backward pass.
To update w5, we need ∂E_total/∂w5. Since E_total does not depend on w5 directly, we cannot differentiate it with respect to w5 immediately; instead, we split the derivative into terms using the chain rule so that it can be differentiated easily with respect to w5:
∂E_total/∂w5 = (∂E_total/∂out y1) × (∂out y1/∂y1) × (∂y1/∂w5)
Now, we calculate each term one by one to differentiate E_total with respect to w5:
Now, we will calculate the updated weight w5new with the help of the following formula, using learning rate η = 0.5 (consistent with the updated values below):
w5new = w5 − η × ∂E_total/∂w5 = 0.40 − 0.5 × 0.082167041 = 0.35891648
In the same way, we calculate w6new, w7new, and w8new, and this gives us the following values:
w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121
Now, we will backpropagate to our hidden layer and update the weight w1, w2, w3,
and w4 as we have done with w5, w6, w7, and w8 weights.
Since E_total does not depend on w1 directly, we again split the derivative into terms using the chain rule so that it can be differentiated easily with respect to w1:
∂E_total/∂w1 = (∂E_total/∂out H1) × (∂out H1/∂H1) × (∂H1/∂w1)
Now, we calculate each term one by one to differentiate E_total with respect to w1:
Now, we find the value of ∂E_total/∂out H1 by substituting the values from equations (18) and (19):
We calculate the partial derivative of the total net input to H1 with respect to w1 the
same as we did for the output neuron:
Now, we will calculate the updated weight w1new with the help of the same update formula:
w1new = w1 − η × ∂E_total/∂w1
In the same way, we calculate w2new, w3new, and w4new, and this gives us the following values:
w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229
We have updated all the weights. We found the error 0.298371109 on the network when we fed forward the 0.05 and 0.1 inputs. In the first round of backpropagation, the total error comes down to 0.291027924. After repeating this process 10,000 times, the total error comes down to 0.0000351085. At this point, the output neurons generate 0.015912196 and 0.984065734, i.e., values close to our targets, when we feed forward the 0.05 and 0.1 inputs.
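The following sketch reproduces this worked example end to end (the learning rate η = 0.5 and sigmoid activations are assumptions consistent with the numbers above; the biases are held fixed, as in the walkthrough):

```python
import numpy as np

x = np.array([0.05, 0.10])          # inputs x1, x2
t = np.array([0.01, 0.99])          # targets T1, T2
W1 = np.array([[0.15, 0.20],        # w1, w2 (into H1)
               [0.25, 0.30]])       # w3, w4 (into H2)
W2 = np.array([[0.40, 0.45],        # w5, w6 (into y1)
               [0.50, 0.55]])       # w7, w8 (into y2)
b1, b2, eta = 0.35, 0.60, 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(10000):
    # forward pass: on the first step, net_h = [0.3775, 0.3925],
    # out_o = [0.75136507, 0.772928465], E = 0.298371109
    out_h = sigmoid(W1 @ x + b1)
    out_o = sigmoid(W2 @ out_h + b2)
    E = 0.5 * np.sum((t - out_o) ** 2)

    # backward pass via the chain rule
    delta_o = (out_o - t) * out_o * (1 - out_o)       # dE/dnet_o
    dW2 = np.outer(delta_o, out_h)                    # dE/dw5..w8
    delta_h = (W2.T @ delta_o) * out_h * (1 - out_h)  # dE/dnet_h
    dW1 = np.outer(delta_h, x)                        # dE/dw1..w4

    W2 -= eta * dW2   # first step gives w5new = 0.35891648, ...
    W1 -= eta * dW1   # first step gives w1new = 0.149780716, ...

print(E, out_o)   # error ~0.0000351, outputs near [0.0159, 0.9841]
```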