
Neural Networks

PERCEPTRON
What is Perceptron?
● A simple model of a biological neuron used in artificial neural networks
● Introduced by Frank Rosenblatt in 1957
● The simplest form of a neural network for binary classification
● Functions as a linear classifier that separates input space with a hyperplane

How Does the Perceptron Work?


Structure and Operation

● Inputs: Feature values (x₁, x₂, ..., xₙ)
● Weights: Values (w₁, w₂, ..., wₙ) representing connection strength
● Bias: Shifts the decision boundary (b)
● Formula: f(∑(wᵢ×xᵢ) + b), where f is the activation function
● Activation: Uses the Heaviside step function - outputs 1 if the sum ≥ threshold, 0 otherwise
The Perceptron Learning Algorithm
Training Process
● Initialization: Set weights and bias to zero or small
random values
● Forward Pass: Calculate weighted sum and apply
activation function
● Error Calculation: Compare output to target value
● Weight Update Rule: wᵢ(new) = wᵢ(old) + η(yᵢ − ŷᵢ)xᵢ,
where η is the learning rate, yᵢ is the expected output, ŷᵢ is the calculated output, and xᵢ is the input
● Iteration: Repeat until convergence or the maximum number of epochs is reached
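As an illustration, here is a minimal NumPy sketch of the learning rule above; the data, learning rate, and epoch count are arbitrary choices for the example:

```python
import numpy as np

def step(z):
    # Heaviside step activation: 1 if z >= 0, else 0
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, lr=0.1, epochs=20):
    w = np.zeros(X.shape[1])   # initialize weights to zero
    b = 0.0                    # initialize bias to zero
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = step(np.dot(w, xi) + b)   # forward pass
            error = yi - y_hat                # error calculation
            w += lr * error * xi              # weight update rule
            b += lr * error                   # bias update
    return w, b

# Example: learn the logical AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(step(X @ w + b))  # expected: [0 0 0 1]
```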
Activation Function
Several different types of activation functions are used in deep learning. Some of them are explained below:
Step Function: The step function is one of the simplest kinds of activation functions. We choose a threshold value, and if the net input y is greater than the threshold, the neuron is activated. Mathematically, f(y) = 1 if y ≥ threshold, 0 otherwise.

Sigmoid Function: The sigmoid function is a widely used activation function. It is defined as σ(x) = 1 / (1 + e⁻ˣ). This is a smooth, continuously differentiable function. Its biggest advantage over the step and linear functions is that it is non-linear: when multiple neurons use the sigmoid as their activation function, the combined output is non-linear as well. The function ranges from 0 to 1 and has an S shape.
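The following sketch implements the two activation functions above with NumPy; the threshold value is an assumption for the example:

```python
import numpy as np

def step(x, threshold=0.0):
    # Step function: 1 if the input exceeds the threshold, else 0
    return np.where(x >= threshold, 1.0, 0.0)

def sigmoid(x):
    # Smooth, continuously differentiable; output ranges from 0 to 1
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 5)
print(step(x))     # hard 0/1 outputs
print(sigmoid(x))  # smooth S-shaped outputs between 0 and 1
```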
Softmax function is a mathematical function that converts a vector of raw prediction scores (often called
logits) from the neural network into probabilities. These probabilities are distributed across different
classes such that their sum equals 1. Essentially, Softmax helps in transforming output values into a format
that can be interpreted as probabilities, which makes it suitable for classification tasks.
In a multi-class classification neural network, the final layer outputs a set of values, each corresponding
to a different class. These values, before Softmax is applied, can be any real numbers, and may not provide
meaningful information directly. The Softmax function processes these values into probabilities, which
indicate the likelihood of each class being the correct one.

Formula of the Softmax function

Softmax(zᵢ) = e^(zᵢ) / ∑ⱼ e^(zⱼ), for j = 1, ..., K

Where:
● zᵢ is the logit (the output of the previous layer in the network) for the i-th class
● K is the number of classes
● e^(zᵢ) represents the exponential of the logit
● ∑ⱼ e^(zⱼ) is the sum of exponentials across all classes
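A minimal NumPy sketch of the formula above; subtracting the maximum logit is a common numerical-stability trick and not part of the slide:

```python
import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability (does not change the result)
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])  # raw scores from the final layer
probs = softmax(logits)
print(probs)        # e.g. [0.659 0.242 0.099]
print(probs.sum())  # 1.0 -- probabilities across classes sum to one
```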
Error Function
In linear space, logistic regression finds a linear decision boundary, a line or a hyperplane, to differentiate between the classes. Primarily, logistic regression estimates the probability for binary outcomes only. For this, it uses the sigmoid transformation function (the "magic function") as default.
[Figure: two classes (Red: 1, Black: 0) separated by candidate decision boundaries A and B]
Logistic regression is not restricted to binary problems; it can also be used for multi-class classification, in which case it uses a different transformation function such as the Softmax function.
There are many transformation functions that can be used to map these values between 0 and 1. By default, however, the sigmoid function (the inverse of the logit) is used to transform them into a probability space. The output of the sigmoid is interpretable as the probability of an input belonging to the positive class in binary classification.
The odds are a measure of the relative likelihood of two events. Taking the log of the odds gives the log-odds, which is typically represented as a number between -infinity and infinity. The regression values are also continuous and range between -infinity and infinity, so in that sense too, logistic regression is related to the probability.

log-odds = log(p / (1 − p)), or p = sigmoid(log-odds)
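A small numeric check of the relationship above; the probability value 0.8 is an arbitrary example:

```python
import numpy as np

p = 0.8
odds = p / (1 - p)        # 4.0 -- the event is four times as likely as its complement
log_odds = np.log(odds)   # ~1.386, unbounded in (-inf, inf)

# Applying the sigmoid to the log-odds recovers the probability
p_recovered = 1 / (1 + np.exp(-log_odds))
print(p_recovered)        # 0.8
```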


Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best explains the observed data. To do so, MLE maximizes the likelihood function. MLE is not a classification algorithm; rather, it is a parameter-estimation technique.
Log Likelihood
Probability values lie between 0 and 1, so taking the product of many of them is neither numerically convenient nor easy to work with. Taking the logarithm transforms the product of probabilities in the likelihood function into a sum of logarithms, which is generally easier to work with; we call this the log-likelihood.
Negative Log Likelihood

However, taking the logarithm of probability values, which lie between 0 and 1, always returns negative values. So, in order to have positive values, we multiply by −1, and by doing so we obtain the negative log-likelihood.

We then try to minimize this negative log-likelihood and, in return, expect the best parameter values.
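A quick sketch of why the log is used: the raw likelihood (a product of many probabilities) underflows numerically, while the log-likelihood (a sum of logs) stays well behaved. The simulated probabilities are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are the predicted probabilities of the true class for 1000 samples
probs = rng.uniform(0.4, 0.9, size=1000)

likelihood = np.prod(probs)             # underflows to 0.0 for long products
log_likelihood = np.sum(np.log(probs))  # a manageable negative number
neg_log_likelihood = -log_likelihood    # positive value that we minimize

print(likelihood, log_likelihood, neg_log_likelihood)
```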
Cost Function

In logistic regression, we consider the negative log-likelihood as the cost function; it is also called the cross-entropy function.


Convexity of the Negative Log-Likelihood

Convexity is a desirable property for optimization algorithms because it guarantees that any local minimum is also a global minimum. For plain logistic regression (a linear model passed through the sigmoid), the negative log-likelihood is in fact convex, so gradient-based optimization converges to the global minimum. Once the same loss is placed on top of a multi-layer network, however, the overall objective is generally non-convex and can have multiple local minima, which makes finding the optimal set of parameters harder. In either case there is no closed-form solution, so to minimize the loss we take the partial derivatives of the log-loss (negative log-likelihood) and descend along them.


Gradient Descent

The coefficients are updated in the direction of the negative gradient of the loss: wⱼ := wⱼ − η · ∂L/∂wⱼ. Similarly for the bias, b := b − η · ∂L/∂b.

Now, using gradient descent, we can update the values of the coefficients, and at each iteration we calculate the cost until the stopping criterion is met.
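A minimal sketch of these updates for logistic regression with NumPy; the synthetic data, learning rate, and iteration count are assumptions for the example:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Tiny synthetic binary-classification dataset
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

for _ in range(1000):
    p = sigmoid(X @ w + b)            # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)   # ∂L/∂w for the cross-entropy loss
    grad_b = np.mean(p - y)           # ∂L/∂b
    w -= lr * grad_w                  # update coefficients
    b -= lr * grad_b                  # update bias

cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(w, b, cost)
```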
Multi Layer Perceptron
Artificial Neural Network
Neural Network Architecture

Input Layer → Hidden Layer → Output Layer


Deep Neural Network
Loss Function
Cross-Entropy Loss:
Cross-entropy is the go-to loss function for classification tasks. It measures the dissimilarity between predicted probabilities and actual labels. For binary classification:

L = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]

When to Use Cross-Entropy

● Binary or multi-class classification problems.

● Tasks where the model outputs probabilities.


Mean Squared Error (MSE)
MSE is the most common loss function for regression problems. It calculates the average squared difference between predicted and actual values:

MSE = (1/n) ∑ᵢ (yᵢ − ŷᵢ)²

The derivative of MSE is smooth, enabling efficient optimization with gradient descent.

When to Use MSE

● Regression problems where large errors are unacceptable.

● Scenarios where you want to emphasize minimizing large deviations.
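Putting the two loss formulas above into code; the example predictions and targets are arbitrary:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip to avoid log(0); measures dissimilarity between probabilities and labels
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mse(y_true, y_pred):
    # Average squared difference; large deviations are penalized quadratically
    return np.mean((y_true - y_pred) ** 2)

y_cls = np.array([1, 0, 1, 1])
p_cls = np.array([0.9, 0.2, 0.7, 0.6])
print(binary_cross_entropy(y_cls, p_cls))

y_reg = np.array([3.0, -0.5, 2.0])
p_reg = np.array([2.5, 0.0, 2.1])
print(mse(y_reg, p_reg))
```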


Gradient Descent
Gradient descent is a fundamental optimization algorithm in machine learning used to
minimize functions by iteratively moving towards the minimum. It's important for
training models by fine-tuning parameters to reduce prediction errors.
Batch Gradient Descent
Batch Gradient Descent is a variant of the gradient descent algorithm where the entire dataset is used to compute the gradient of the loss function with respect to the parameters. In each iteration, the algorithm calculates the average gradient of the loss function over all training examples and updates the model parameters accordingly. The update rule for batch gradient descent is:

θ := θ − η · (1/N) ∑ᵢ ∇θ L(xᵢ, yᵢ), where θ are the parameters, η the learning rate, and N the dataset size
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm where the model parameters are updated using the gradient of the loss function with respect to a single training example at each iteration. Unlike batch gradient descent, which uses the entire dataset, SGD updates the parameters more frequently, leading to faster convergence. The update rule for SGD is:

θ := θ − η · ∇θ L(xᵢ, yᵢ), for a single example (xᵢ, yᵢ)
Mini-Batch Gradient Descent
Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic Gradient Descent. Instead of using the entire dataset or a single training example, Mini-Batch Gradient Descent updates the model parameters using a small, random subset of the training data called a mini-batch. The update rule for Mini-Batch Gradient Descent is:

θ := θ − η · (1/|B|) ∑ᵢ∈B ∇θ L(xᵢ, yᵢ), for a mini-batch B
Comparison between the variants of Gradient Descent
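The slide's comparison table is not reproducible from the text, but the following sketch illustrates the practical difference between the three variants, namely how much data each update step sees; the linear-regression setup and hyperparameters are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def gradient(w, Xb, yb):
    # Gradient of the MSE loss for a linear model on the batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, lr=0.05, epochs=50):
    w = np.zeros(3)
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            w -= lr * gradient(w, X[b], y[b])
    return w

print(train(batch_size=len(X)))  # batch GD: one precise update per epoch
print(train(batch_size=1))       # SGD: many noisy updates per epoch
print(train(batch_size=16))      # mini-batch: a compromise between the two
```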
Forward propagation and evaluation
The final step in a forward pass is to evaluate the
predicted output s against an expected output y.
The output y is part of the training dataset (x, y) where
x is the input (as we saw in the previous section).
Evaluation between s and y happens through a cost
function. This can be as simple as MSE (mean squared
error) or more complex like cross-entropy.
We name this cost function C and denote it as follows:

C = cost(s, y)

where cost can be equal to MSE, cross-entropy, or any other cost function.
Based on C’s value, the model “knows” how much to adjust its parameters in order to get closer to the
expected output y. This happens using the backpropagation algorithm.
Backpropagation and computing gradients
Backpropagation aims to minimize the cost function by adjusting the network's
weights and biases. The level of adjustment is determined by the gradients
of the cost function with respect to those parameters.
•Gradient of a function C(x_1, x_2, …, x_m) in point x is a vector of the
partial derivatives of C in x.
•The derivative of a function C measures the sensitivity to change of the
function value (output value) with respect to a change in its argument x
(input value). In other words, the derivative tells us the direction C is going.
•The gradient shows how much the parameter x needs to change (in
positive or negative direction) to minimize C.
For a single weight, the gradient is ∂C/∂w; for a single bias, it is ∂C/∂b. Both are obtained by applying the chain rule backwards through the layers.
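A minimal sketch of these gradients for a one-hidden-layer network with sigmoid activations and MSE cost; the shapes, data, and learning rate are assumptions for the example:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 2))            # 4 samples, 2 features
y = rng.normal(size=(4, 1))            # regression targets

W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # hidden layer
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # output layer

# Forward pass
h = sigmoid(x @ W1 + b1)
s = h @ W2 + b2
C = np.mean((s - y) ** 2)              # MSE cost

# Backpropagation: chain rule from the cost back to each parameter
dC_ds = 2 * (s - y) / len(y)           # ∂C/∂s
dC_dW2 = h.T @ dC_ds                   # gradient for output weights
dC_db2 = dC_ds.sum(axis=0)             # gradient for output bias
dC_dh = dC_ds @ W2.T
dC_dz1 = dC_dh * h * (1 - h)           # sigmoid derivative
dC_dW1 = x.T @ dC_dz1                  # gradient for hidden weights
dC_db1 = dC_dz1.sum(axis=0)            # gradient for hidden bias

# One gradient-descent step
lr = 0.1
W2 -= lr * dC_dW2; b2 -= lr * dC_db2
W1 -= lr * dC_dW1; b1 -= lr * dC_db1
print(C)
```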
How to Improve Neural Network
Fine Tuning Hyperparameters
Shallow vs. Deep Networks
•A single hidden layer often suffices for many tasks (e.g.,
MNIST: >97% accuracy with a few hundred neurons).
•Deep networks (e.g., 2+ hidden layers) improve parameter
efficiency, enabling exponential reductions in neuron count for
equivalent performance.
•Hierarchical learning: Lower layers detect simple patterns
(lines), intermediate layers combine them (shapes), and upper
layers model complex structures (faces).
Transfer Learning & Practical Benefits
•Reuse lower layers of pretrained networks (e.g., face
detection → hairstyle recognition) to accelerate training and
reduce data needs.
•Generalization: Deep architectures inherently capture
real-world hierarchical data structures, improving convergence
and adaptability.
•Complex tasks (e.g., image/speech recognition) often require
dozens of layers but benefit from pretrained models rather
than training from scratch.
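A minimal sketch of the layer-reuse idea, assuming Keras is available; the pretrained model file, layer sizes, and new task (5-class classification) are all hypothetical:

```python
from tensorflow import keras

# Hypothetical pretrained network, e.g. one trained earlier on face detection
base_model = keras.models.load_model("pretrained_face_model.keras")

# Reuse the lower layers (they detect generic patterns such as edges and shapes)
lower_layers = keras.Sequential(base_model.layers[:-1])
lower_layers.trainable = False          # freeze reused layers to speed up training

# Stack a new output layer for the new task (e.g. hairstyle recognition)
model = keras.Sequential([
    lower_layers,
    keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_task_images, new_task_labels, epochs=10)  # needs far less data
```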
Leaky ReLU
Formula: f(x) = max(αx, x), with α ≈ 0.01
Allows a small gradient for x < 0; reduces the dying-ReLU problem.
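A one-line NumPy version of the formula above; the test values are arbitrary:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Identity for x >= 0, small slope alpha for x < 0 (keeps gradients alive)
    return np.maximum(alpha * x, x)

print(leaky_relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [-0.02 -0.005 0. 1.5]
```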
Convolutional Neural Networks
Why CNNs Over ANNs for Image Tasks?
CNNs exploit the spatial structure of images: local receptive fields and shared convolutional filters drastically reduce the number of parameters compared with fully connected ANNs and provide a degree of translation invariance.
Convolution
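A minimal NumPy sketch of a 2D convolution (technically cross-correlation, as in most deep-learning libraries); the input image and kernel values are arbitrary:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image: elementwise product + sum (valid padding, stride 1)
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])  # responds to vertical edges
print(conv2d(image, edge_kernel))  # 3x3 feature map
```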
