
Module 2

Uploaded by

sharanyarb534
Copyright
© © All Rights Reserved

DEEP LEARNING

MODULE 2

Artificial Neuron
• It takes a set of inputs, each with an associated weight.
• It computes the dot product of the inputs and the weights, i.e. the weighted sum of the inputs.
• It applies an activation function to that sum.
• It fires (emits) the result as its output.
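The steps above can be sketched in a few lines of Python. This is only an illustration: the sigmoid activation, the optional bias argument, and the input and weight values are assumptions chosen for the example, not values from the slides.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here as an example activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical inputs and weights, chosen only for illustration.
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.0])
```

Here the weighted sum is 0.5 - 0.5 + 0 = 0, so the neuron outputs sigmoid(0).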
Feedforward Neural Network
• Deep feedforward networks (DFNs) are neural networks in which the input is only ever passed forward, layer by layer, to compute a function f that approximates some target function f∗. There is no feedback mechanism in a DFN. Networks in which the output is fed back into the model are called recurrent neural networks.
APPLICATIONS
How Feedforward Networks Work:
• These networks are called feedforward because information flows forward through the network:
• The input x goes through several layers of computation in the network (called hidden layers).
• The network processes this information layer by layer and finally produces an output y.
• The term "feedforward" emphasizes that there are no cycles or loops in this flow of information.
Input Layer
• This layer consists of the input data that is given to the neural network.
• Its units are drawn like neurons, but they are not artificial neurons with the computational capabilities discussed above; they only pass the input values on.
• Each input neuron represents one feature of the data. For a data set with three attributes (Age, Salary, City), the input layer has 3 neurons, one per attribute. For an image of 1024×768 pixels, the input layer has 1024 × 768 = 786,432 neurons, one per pixel (assuming a single value per pixel).
Hidden Layer
• This is the layer that consists of the actual artificial neurons.
• If there is exactly one hidden layer, the network is known as a shallow neural network.
• If there is more than one hidden layer, it is known as a deep neural network.
• In a deep neural network, the output of the neurons in one hidden layer is the input to the next hidden layer.
• There is no fixed rule for how many hidden layers, or how many neurons per hidden layer, a network should have. Practitioners often describe arriving at a good number of hidden layers and neurons as an art that depends mostly on the data at hand.
• In most cases, every neuron in one layer is connected to every neuron in the next layer; such a network is known as a fully connected neural network.
• In a convolutional neural network, however, not all neurons in adjacent layers are connected to each other.
Output Layer
• This layer represents the output of the neural network.
• The number of output neurons depends on the number of outputs expected for the problem at hand.
Weights and Bias
• The neurons in the neural network are connected to each other by weights.
• Apart from weights, each neuron also has its own bias.
• A key point is that information flows in the forward direction only. Hence the network is known as a feedforward neural network.
• If information does not flow in one direction only, and the output of a neuron is fed back into an earlier neuron in a cycle, the network is known as a recurrent neural network, the counterpart of the feedforward neural network.
❖For example, we might have three functions f(1), f(2), and f(3) connected in a chain to form f(x) = f(3)(f(2)(f(1)(x))), where the number in parentheses is a layer index, not an argument. These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and so on.
❖The goal of a feedforward network is to approximate some function f∗. For example, for a classifier, y = f∗(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation. The overall length of the chain gives the depth of the model. It is from this terminology that the name “deep learning” arises. The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f∗(x).
❖The training data provides us with noisy, approximate examples of f∗(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f∗(x).
❖The training examples specify directly what the output layer must do at each point x: it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do.
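The chain structure f(x) = f(3)(f(2)(f(1)(x))) can be illustrated as plain function composition. The three layer functions below are hypothetical stand-ins chosen for readability; real layers would be learned affine transformations followed by activations.

```python
def f1(x):
    """First layer (hypothetical): double every component."""
    return [2.0 * v for v in x]

def f2(h):
    """Second layer (hypothetical): clip negative components to zero."""
    return [max(0.0, v) for v in h]

def f3(h):
    """Third (output) layer (hypothetical): sum the components."""
    return sum(h)

def f(x):
    """The full network is the composition f3(f2(f1(x))); its depth is 3."""
    return f3(f2(f1(x)))

y = f([1.0, -2.0, 0.5])
```

Information only moves from f1 to f2 to f3, which is what makes the network feedforward.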
Example: Learning XOR
To make the idea of a feedforward network more concrete, we begin with an
example of a fully functioning feedforward network on a very simple task: learning
the XOR function.
• The XOR function (“exclusive or”) is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f∗(x) that we want to learn. Our model provides a function y = f(x; θ), and our learning algorithm will adapt the parameters θ to make f as similar as possible to f∗.
• In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0], [0, 1], [1, 0], [1, 1]}. We will train the network on all four of these points. The only challenge is to fit the training set.
• We can treat this problem as a regression problem and use a mean squared error (MSE) loss function. We choose this loss function to simplify the math for this example as much as possible. In practical applications, MSE is usually not an appropriate cost function for modeling binary data; more appropriate approaches, such as cross-entropy, are discussed under Cost Functions.
• Evaluated on our whole training set, the MSE loss function is
• J(θ) = (1/4) Σ_{x∈X} (f∗(x) − f(x; θ))².
• Now we must choose the form of our model f(x; θ). Suppose we choose a linear model, with θ consisting of w and b. Our model is defined to be
• f(x; w, b) = xᵀw + b.
• We can minimize J(θ) in closed form with respect to w and b using the normal equations.
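To see what the normal equations give for the four XOR points, the following sketch solves them directly. The small `solve` helper is a hypothetical Gauss-Jordan routine written for this example; in practice a linear algebra library would be used instead.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
# Append a constant 1 to each input so the bias b is learned as a third weight.
design = [[x1, x2, 1.0] for x1, x2 in X]
# Normal equations: (D^T D) theta = D^T y
DtD = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
Dty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(3)]
w1, w2, b = solve(DtD, Dty)
predictions = [w1 * x1 + w2 * x2 + b for x1, x2 in X]
```

The solution is w = 0 and b = 1/2 (up to floating-point rounding), so the best linear model outputs 0.5 for every input and cannot represent XOR.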
Figure: The bold numbers printed on the plot indicate the value that the learned function must output at each point.
(Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model’s output must increase as x2 increases. When x1 = 1, the model’s output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem.
(Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0] and x = [0, 1] to a single point in feature space, h = [1, 0]. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
• Understanding the Problem with XOR:
• The XOR function returns 1 if the two binary inputs are different, and 0 if they
are the same.
• Example:
• XOR(0, 0) = 0
• XOR(0, 1) = 1
• XOR(1, 0) = 1
• XOR(1, 1) = 0
• Nonlinear Activation Function:
• In the hidden layer, we use a nonlinear activation function called the Rectified Linear Unit (ReLU): g(z) = max{0, z}.
• In this example, we manually specified the solution. However, in real-world
applications, we don’t know the solution in advance.
• Instead, we use gradient-based optimization algorithms (like gradient descent)
to find the parameters W, c, w, and b that minimize the error.
• The solution we found is at a global minimum of the loss function, meaning it
perfectly solves the XOR problem. Gradient descent could converge to this
solution or similar ones, depending on the starting point of the parameters.
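A hand-specified solution of this kind can be checked directly. The parameter values below are an assumption: they are one well-known exact solution for a network with two ReLU hidden units, and they are consistent with the figure description (both [1, 0] and [0, 1] map to h = [1, 0]), but the values on the original slide were not preserved in the text.

```python
def relu(z):
    return max(0.0, z)

# Assumed parameter values (one known exact solution, not taken from the slide).
W = [[1.0, 1.0], [1.0, 1.0]]   # input-to-hidden weights
c = [0.0, -1.0]                # hidden biases
w = [1.0, -2.0]                # hidden-to-output weights
b = 0.0                        # output bias

def hidden(x):
    """h = ReLU(W^T x + c), computed component by component."""
    return [relu(sum(W[i][j] * x[i] for i in range(2)) + c[j]) for j in range(2)]

def xor_net(x):
    """f(x) = w . h + b"""
    h = hidden(x)
    return sum(wj * hj for wj, hj in zip(w, h)) + b

outputs = {tuple(x): xor_net(x) for x in [[0, 0], [0, 1], [1, 0], [1, 1]]}
```

With these values the network reproduces XOR exactly on all four points, so the mean squared error is zero.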
Gradient-Based Learning
Similarities with Other Machine Learning Models:
• Training a neural network with gradient descent is similar to training
other machine learning models like linear regression or logistic
regression.
• In both cases, we need to specify three things:
• Optimization procedure (like gradient descent).
• Cost function (to measure how good the model is).
• Model family (type of model, e.g., neural networks, linear models).
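The three ingredients can be seen side by side in a small sketch. It uses a linear model rather than a neural network so that the result is easy to verify; the data, learning rate, and step count are assumptions chosen for the example.

```python
# Model family: a line y_hat = w * x + b.
def predict(w, b, x):
    return w * x + b

# Cost function: mean squared error over the data set.
def mse(w, b, xs, ys):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Optimization procedure: plain (batch) gradient descent.
def fit(xs, ys, lr=0.05, steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [predict(w, b, x) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 2x + 1
w, b = fit(xs, ys)
```

Because this cost is convex, gradient descent recovers w close to 2 and b close to 1 from any starting point; a neural network would swap in a different model family while keeping the same overall recipe.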
Key Difference with Neural Networks:
• The key difference between neural networks and linear models is nonlinearity.
• Neural networks contain nonlinear components (like activation functions), which makes the loss function non-convex (i.e., it can have many local minima rather than a single global one).
• In contrast, linear models (e.g., linear regression) often have convex loss functions, which can be
solved using simple optimization methods that guarantee a global minimum.
Training Neural Networks: Using Gradient-Based Optimizers:
• Because of the non-convex nature of neural networks, we use iterative, gradient-based optimizers
to minimize the cost function. These optimizers may not find the global minimum, but they drive
the cost function to a low value.
• For simpler models like linear regression or logistic regression, we can often use methods that
guarantee finding the optimal solution.
• In contrast, neural networks require more complex methods, like stochastic gradient descent
(SGD), due to their non-convex loss functions.
Importance of Weight Initialization:
• For feedforward neural networks, it is crucial to initialize the weights with small random values.
• The biases can be initialized to zero or small positive values. Proper initialization helps avoid
problems during training (such as getting stuck in poor local minima).
Gradient-Based Optimization:
• Neural networks use iterative gradient-based algorithms to train the model by minimizing the cost
function.
• These algorithms are improvements on the basic idea of gradient descent. The most common one
is stochastic gradient descent (SGD), which is particularly useful for large datasets.
Choosing the Cost Function and Model Representation:
• Just like in other machine learning models, we must choose a cost function that measures how
far the model's predictions are from the true values.
• We also need to decide how to represent the output of the model, depending on whether it's a
classification or regression problem.
Cost Functions
Importance of Cost Function:
• When designing a deep neural network, choosing the right cost function is crucial. The cost
function helps measure how well the neural network’s predictions match the actual results.
• The cost functions used for neural networks are mostly the same as those used for other models,
like linear models.
Maximum Likelihood and Cross-Entropy:
• In most neural networks, the model predicts a probability distribution for the target variable y,
given the input x.
• The principle of maximum likelihood is often used to choose the best model. This means the
model tries to make its predictions match the true data as closely as possible.
• In practice, this is done by using the cross-entropy loss function, which measures the difference
between the true distribution (actual data) and the predicted distribution (from the model).
Learning Conditional Distributions with Maximum Likelihood
Maximum Likelihood Training:
• In most modern neural networks, we train them using maximum likelihood estimation (MLE).
This means the goal of training is to make the model's predicted probabilities match the actual
data as closely as possible.
• The cost function in this case is the negative log-likelihood or cross-entropy, which is a
measure of how well the model’s predicted probabilities align with the true data.
Model-Specific Cost Functions:
•The exact form of the cost function can vary depending on the type of model you’re using.
•For example, in some cases, this cost function can simplify to mean squared error (MSE),
which measures the average squared difference between predicted and actual values.
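One case where this simplification happens can be worked out explicitly. The sketch below assumes the model predicts the mean of a Gaussian with a fixed variance σ² (an assumption made for this derivation), over m training examples.

```latex
\begin{aligned}
p(y \mid x) &= \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(y - f(x;\theta))^2}{2\sigma^2}\right) \\
-\log p(y \mid x) &= \frac{(y - f(x;\theta))^2}{2\sigma^2}
  + \frac{1}{2}\log(2\pi\sigma^2) \\
J(\theta) &= \frac{1}{m}\sum_{i=1}^{m} -\log p\!\left(y^{(i)} \mid x^{(i)}\right)
  = \frac{1}{2\sigma^2}\cdot\frac{1}{m}\sum_{i=1}^{m}
    \left(y^{(i)} - f(x^{(i)};\theta)\right)^2 + \text{const}
\end{aligned}
```

Since σ² and the constant do not depend on θ, minimizing this negative log-likelihood is the same as minimizing the mean squared error.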

Benefit of Maximum Likelihood:


•One of the main benefits of using maximum likelihood is that you don't have to manually
design a cost function for each model. Once you specify a probability distribution p(y|x), the
cost function is automatically determined by the model.
Mean Absolute Error:
• Another cost function is mean absolute error (MAE), which minimizes the absolute
difference between the predicted and true values of y.
• If you minimize MAE, the function you learn will predict the median of y for each x. The
median is the middle value, different from the mean in how it deals with outliers.
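The difference between the two losses can be checked numerically for the simplest possible predictor, a single constant c. The target values and the candidate grid below are hypothetical, with one deliberate outlier.

```python
def mae(c, ys):
    return sum(abs(y - c) for y in ys) / len(ys)

def mse(c, ys):
    return sum((y - c) ** 2 for y in ys) / len(ys)

ys = [1.0, 2.0, 3.0, 4.0, 100.0]                 # hypothetical targets with one outlier
candidates = [i / 10 for i in range(0, 1001)]    # 0.0, 0.1, ..., 100.0

# Brute-force search for the constant that minimizes each loss.
best_mae = min(candidates, key=lambda c: mae(c, ys))
best_mse = min(candidates, key=lambda c: mse(c, ys))
```

The MAE minimizer lands on the median (3.0), while the MSE minimizer lands on the mean (22.0), which the outlier has pulled far from most of the data.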
Why Cross-Entropy is Preferred:
• Even though MSE and MAE are simple and intuitive, they sometimes perform poorly in
neural networks, especially when combined with certain activation functions.
• Some activation functions (like sigmoid) saturate, meaning that their gradients become
very small, making it hard for the model to learn. This is why cross-entropy is often
preferred—because it avoids these small gradients and helps the model learn better.
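The effect can be seen by comparing the gradient with respect to z for both losses when a sigmoid output is confidently wrong. This is a sketch for a single output unit; the value z = -10 is an assumption chosen to put the sigmoid deep in its saturated region.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    """d/dz of (sigmoid(z) - y)^2."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)

def grad_cross_entropy(z, y):
    """d/dz of -[y log s + (1 - y) log(1 - s)], with s = sigmoid(z)."""
    return sigmoid(z) - y

# A confidently wrong prediction: the target is 1 but z is very negative.
z, y = -10.0, 1.0
g_mse = grad_mse(z, y)
g_ce = grad_cross_entropy(z, y)
```

With MSE the factor s(1 - s) shrinks the gradient to almost nothing, whereas the cross-entropy gradient stays close to -1, so the model still receives a strong signal to correct its mistake.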
Output Units
❖ The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. (Cross-entropy measures how well the predicted probabilities match the actual distribution of the labels.)
❖ The choice of how to represent the output then determines the form of the cross-entropy function.
❖ Any kind of neural network unit that may be used as an output can also be used as a hidden unit. Here, we focus on the use of these units as outputs of the model, but in principle they can be used internally as well.
❖ We suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.
Linear Units for Gaussian Output Distributions
Modeling Covariance:
•One advantage of using a maximum likelihood framework is that it simplifies the process of
learning the covariance of the Gaussian distribution. This allows the model to adapt the
output's uncertainty based on the input features.
Sigmoid Units for Bernoulli Output Distributions
Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form. The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x.
Softmax Units for Multinoulli Output Distributions
• Sigmoid and Binary Classification:
• In binary classification (where we only have two possible outcomes, like 0
and 1), we use the sigmoid function to represent the probability of one of the
outcomes.
• The sigmoid function ensures that the predicted probability (let’s call it ŷ) is
between 0 and 1. If we know the probability of class 1 (let's say, P(y = 1 |
x)), then the probability of class 0 is simply 1 - P(y = 1 | x).
• Sigmoid formula: σ(z) = 1 / (1 + e^{-z})

This gives us a probability for class 1 (when y = 1), where z is the output of a linear function (like z = Wx + b).
Extending to Multiple Classes:
• For more than two possible classes (let’s say n classes), we use the softmax
function to handle this.
Softmax Formula:
• The softmax function works as follows: softmax(z)_i = e^{z_i} / Σ_j e^{z_j}, with the sum over j running over all n classes.

Where:
•z_i is the unnormalized log probability (or score) for class i.
•The sum in the denominator ensures that all output probabilities sum up to 1.
Each z_i represents a raw score, and the softmax exponentiates (e^{z_i}) and
normalizes these scores to convert them into probabilities.
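A minimal sketch of softmax follows. Subtracting the largest score before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result. The score values are hypothetical.

```python
import math

def softmax(z):
    """Exponentiate and normalize; subtracting max(z) avoids overflow
    and does not change the result."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probs = softmax([2.0, 1.0, 0.1])     # hypothetical scores for 3 classes
two_class = softmax([0.7, 0.0])      # softmax over the two scores (z, 0)
```

The outputs sum to 1 and preserve the ordering of the scores. With two classes and scores (z, 0), the first output equals sigmoid(z), which is how softmax extends the binary case.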
Hidden Units
Hidden Units in Neural Networks:
•Hidden units are neurons that exist in the hidden layers of a neural network (the layers
between the input and output layers).
•These units apply a transformation to their inputs, helping the network learn complex
patterns.
•Each hidden unit transforms the input using an activation function, which determines how
the unit behaves.
Choosing the Right Activation Function:
•Choosing the right activation function (which controls the output of a neuron) is important
but not straightforward.
•Different types of activation functions exist, and it's hard to predict in advance which one
will work best for your specific task. The best approach is often trial and error, where you
test different activation functions and evaluate their performance.
Rectified Linear Units (ReLU):
• ReLU (Rectified Linear Unit) is one of the most popular choices for hidden units. It is
simple and usually works well.
• The formula for ReLU is: g(z) = max(0, z)
• This means that if the input z is positive, it returns z, and if z is zero or negative, it returns 0.
• Example: If z = 2, then g(z) = 2. If z = -3, then g(z) = 0.

Derivatives of ReLU:
•Even though ReLU is not differentiable at z = 0, it's differentiable at almost all other
points.
•The left derivative of ReLU (when z is just to the left of 0) is 0, and the right derivative
(when z is just to the right of 0) is 1.

For example:
• For z = -0.1, the derivative of ReLU is 0 (the same value as the left derivative at 0).
• For z = 0.1, the derivative of ReLU is 1 (the same value as the right derivative at 0).
• At z = 0 itself, software implementations typically return one of these one-sided derivatives, and this works fine for training.
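A small sketch of ReLU and its derivative makes the convention at z = 0 explicit. Returning 0 at z = 0 is an assumption of this example; returning 1 would be an equally valid choice.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU. At z == 0 this sketch returns 0 (the left
    derivative); returning 1 there would be an equally valid convention."""
    return 1.0 if z > 0 else 0.0

values = [-3.0, -0.1, 0.0, 0.1, 2.0]
outputs = [relu(z) for z in values]
grads = [relu_grad(z) for z in values]
```

Negative inputs give output 0 with gradient 0, and positive inputs pass through unchanged with gradient 1.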
General Form of Hidden Units:
• Most hidden units in a neural network take a vector of inputs x and apply an affine transformation to it (a linear transformation followed by a shift): z = Wx + b. An element-wise nonlinear function g(z) is then applied to the result.

Where:
•W is a matrix of weights,
•x is the input vector,
•b is a bias term, and
•z is the result of this transformation.

Rectified Linear Units and Their Generalizations


• Why is ReLU useful?
• It is easy to optimize (i.e., easy to adjust during learning) because
it behaves similarly to a linear function. The only difference is that it
outputs zero for negative inputs.
• The gradients (used in optimization) are large and consistent when
ReLU is active (i.e., when z > 0). This is important because it helps
the model learn better and faster.
• ReLU is usually applied after an affine transformation: h = g(Wx + b)

Where:
•W is the weight matrix,
•x is the input vector,
•b is the bias, and
•g(z) is the ReLU function applied element-wise to z

Initialization tip: It's often good to initialize the bias b to a small positive value
(like 0.1), ensuring that ReLU units are active for most inputs early in training.
This allows gradients to flow during the learning process.
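The initialization tip can be sketched for a single ReLU layer. The weight range of ±0.01, the fixed seed, and the layer sizes are assumptions made for this example; only the bias value of 0.1 comes from the text above.

```python
import random

def relu(z):
    return max(0.0, z)

def init_layer(n_in, n_out, seed=0):
    """Small random weights and a small positive bias (0.1) for each unit."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.01, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.1] * n_out
    return W, b

def layer(W, b, x):
    """h = g(Wx + b) with g = ReLU applied element-wise."""
    return [relu(sum(w * v for w, v in zip(row, x)) + bi) for row, bi in zip(W, b)]

W, b = init_layer(n_in=3, n_out=4)
h = layer(W, b, [1.0, -2.0, 0.5])
```

Because the weights are tiny compared with the bias, every pre-activation is positive for this input, so all four units start out active and can pass gradients.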
Logistic Sigmoid and Hyperbolic Tangent
Logistic Sigmoid Activation Function:
The logistic sigmoid activation function is given by: σ(z) = 1 / (1 + e^{-z})

This function maps any input z to a value between 0 and 1. It’s often used in binary classification tasks because it can
represent a probability.
Hyperbolic Tangent (tanh) Activation Function:
The tanh function is similar to the sigmoid function but maps inputs to a range between -1 and 1. The two are closely related: tanh(z) = 2σ(2z) − 1.
Saturation Problem:
Both the sigmoid and tanh functions have a saturation problem:
• When the input z becomes very positive (large), the output saturates (or flattens) to 1 for both sigmoid and tanh.
• When z becomes very negative, the output saturates to 0 for sigmoid and to -1 for tanh.
This saturation makes learning difficult because:
• In the saturated regions (where z is either very positive or very negative), the gradients become very small (almost zero). Gradient-based optimization then makes very little progress, so the network effectively stops learning in these regions.
Why Sigmoid and tanh are Discouraged in Hidden Layers:
Because of this saturation problem, using sigmoid or tanh in hidden layers of
neural networks can make learning very slow and difficult, especially when
training deep networks. The gradients become very small, which makes it
hard for the network to adjust its parameters and learn effectively. That’s why
these functions are now mostly avoided in hidden layers.
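The saturation described above can be checked numerically by evaluating the derivatives of both functions as z grows. The sample values of z are arbitrary choices for illustration.

```python
import math

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.2e}  tanh'={tanh_grad(z):.2e}")
```

Both derivatives are largest at z = 0 (0.25 for sigmoid, 1 for tanh) and shrink rapidly toward zero as |z| increases, which is why stacked saturating units pass back very little gradient.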
Other Hidden Units
In neural networks, the hidden units are the neurons that apply
some sort of transformation to the input data before passing it to the
next layer. While some activation functions like ReLU (Rectified
Linear Unit) are very common, many other types of hidden units
exist. Let's explain some of these less common hidden units and the
concepts mentioned.
No Activation (Linear Units)
• In some cases, neural networks use linear units, meaning they
don’t apply any activation function at all.
• The formula for a hidden unit in this case is: h = Wx + b
Why So Many Activation Functions?
Researchers test different types of hidden units to see if they can improve performance on specific tasks. Many new activation functions are proposed, but they only become widely adopted if they show significant improvement over existing methods. For example, ReLU became popular because it is far less prone to the vanishing gradient problem seen with sigmoid and tanh, making it easier to train deep networks.
