Neural Networks
PERCEPTRON
What is a Perceptron?
● A simple model of a biological neuron used in artificial neural networks
● Introduced by Frank Rosenblatt in 1957
● The simplest form of a neural network for binary classification
● Functions as a linear classifier that separates input space with a hyperplane
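A minimal perceptron sketch in Python (the toy AND data, learning rate, and epoch count are my own assumptions, not from the slides): it predicts with the sign of w·x + b and nudges the separating hyperplane only on misclassified points.

```python
import numpy as np

# Illustrative perceptron: w and b define the hyperplane w·x + b = 0.
def perceptron_predict(x, w, b):
    return 1 if np.dot(w, x) + b >= 0 else 0

def perceptron_train(X, y, lr=0.1, epochs=10):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = yi - perceptron_predict(xi, w, b)
            # Rosenblatt's update rule: adjust only when the prediction is wrong.
            w += lr * error * xi
            b += lr * error
    return w, b

# Toy, linearly separable data (AND-like problem).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([perceptron_predict(xi, w, b) for xi in X])  # expected: [0, 0, 0, 1]
```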
Softmax(z_i) = exp(z_i) / Σ_{j=1}^{K} exp(z_j)
Where:
● z_i is the logit (the output of the previous layer in the network) for the i-th class.
● K is the number of classes.
● exp(z_i) represents the exponential of the logit.
● Σ_{j=1}^{K} exp(z_j) is the sum of exponentials across all classes.
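A small Python sketch of the SoftMax formula above (the example logits are made up); subtracting the maximum logit is a common numerical-stability trick and does not change the result.

```python
import numpy as np

# Softmax over a 1-D array of K logits.
def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max logit for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))        # ~[0.659, 0.242, 0.099] — probabilities for the 3 classes
print(softmax(z).sum())  # 1.0
```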
Error Function
In linear space, logistic regression finds a linear decision boundary, a line or a hyperplane, to differentiate between the classes.
[Figure: two panels, A and B, showing red points (class 1) and black points (class 0) separated by the decision boundary.]
Primarily, logistic regression estimates the probability for binary outcomes only. For this, it uses the sigmoid transformation function (the "magic function") by default.
Logistic regression is not restricted to binary problems; it can also be used for multi-class classification. To do that, it uses a different transformation function, such as the SoftMax function, in place of the sigmoid.
There are many transformation functions that can be used to map these values between 0 and
1. However, by default, the sigmoid function (the inverse of the logit function) is used to transform these values into a
probability space. The output of the sigmoid function is interpretable as the probability of an input
belonging to the positive class in binary classification.
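A quick sketch of the sigmoid transformation described above (the example scores are illustrative): any linear score in (−∞, +∞) is squashed into (0, 1) and read as the probability of the positive class.

```python
import numpy as np

# Default sigmoid transformation used by logistic regression.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A linear score w·x + b anywhere on the real line becomes P(y = 1 | x).
for z in [-5.0, 0.0, 5.0]:
    print(z, sigmoid(z))   # ~0.0067, 0.5, ~0.9933
```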
The odds is a measure of the relative likelihood of an event: the probability that it occurs divided by the probability that it does not, p / (1 − p). Its logarithm, the log-odds (logit), is
represented as a number between −infinity and +infinity. The regression values are also continuous
and range between −infinity and +infinity, so in that sense too the linear part of logistic regression can be read as predicting the log-odds, which links it back to
the probability.
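A small sketch of the probability–odds–log-odds relationship (the value p = 0.8 is just an example); the sigmoid maps the log-odds back to the original probability.

```python
import numpy as np

p = 0.8
odds = p / (1 - p)        # 4.0  -> odds lie in (0, +inf)
log_odds = np.log(odds)   # ~1.386 -> log-odds lie in (-inf, +inf), like a regression output

# The sigmoid inverts the logit, recovering the probability from the log-odds.
p_back = 1.0 / (1.0 + np.exp(-log_odds))
print(odds, log_odds, p_back)   # 4.0 1.386... 0.8
```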
Maximum Likelihood Estimation (MLE)
MLE finds the parameters of a probability distribution that best explain the observed data. To do so, MLE
maximizes the likelihood function. MLE is not a classification algorithm; rather, it is a parameter
estimation technique.
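A minimal MLE sketch (the coin-flip data and grid search are assumptions for illustration): for a Bernoulli distribution, maximizing the likelihood recovers the sample mean.

```python
import numpy as np

# Hypothetical observations: 1 = heads, 0 = tails.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def log_likelihood(p, x):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Evaluate the log-likelihood on a grid and pick the maximizer;
# analytically, the MLE of a Bernoulli parameter is simply the sample mean.
grid = np.linspace(0.01, 0.99, 99)
p_hat = grid[np.argmax([log_likelihood(p, flips) for p in grid])]
print(p_hat, flips.mean())   # both close to 0.75
```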
Log Likelihood
Because probability values lie between 0 and 1, multiplying many of them together yields a vanishingly small product that is neither mathematically convenient nor numerically feasible to work with. Taking the logarithm turns the product into a sum of logarithms, which is generally easier to work with, and we call this the Log
Likelihood.
Negative Log Likelihood
However, taking the logarithm of probability values that lie between 0 and 1 will always
return negative values. So, in order to have positive values, we multiply by −1, and by doing so we obtain the Negative Log Likelihood.
Under this, we try to minimize the negative log-likelihood and, in return, expect the best parameter
values.
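A short sketch contrasting the raw likelihood product with the log-likelihood sum and its negation (the labels and predicted probabilities are invented for illustration).

```python
import numpy as np

y      = np.array([1, 0, 1, 1])          # true labels
p_pred = np.array([0.9, 0.2, 0.8, 0.7])  # model's P(y = 1 | x)

# Likelihood of the data: a product of numbers in (0, 1) that quickly underflows.
likelihood = np.prod(np.where(y == 1, p_pred, 1 - p_pred))

# Log-likelihood: the same information as a sum of logarithms (values <= 0).
log_likelihood = np.sum(y * np.log(p_pred) + (1 - y) * np.log(1 - p_pred))

# Negative log-likelihood: multiply by -1 to get a positive quantity to minimize.
nll = -log_likelihood
print(likelihood, log_likelihood, nll)
```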
Cost Function
In logistic regression, we consider the negative log-likelihood as the cost function, and it is also convex in the model coefficients.
Convexity is a desirable property for optimization algorithms because it guarantees that there is a unique
global minimum, and any local minimum is also a global minimum, so a gradient-based method cannot get stuck in a worse solution. For plain logistic regression the negative log-likelihood satisfies this property. Once non-linear hidden layers are stacked on top, as in neural networks, the cost function is generally no longer convex and can have multiple local minima, which makes it
challenging to find the optimal set of model parameters. In either case, to minimize the cost we take its gradient with respect to the parameters.
Now, using gradient descent, we can update the values of the coefficients and, at each
iteration, calculate the cost until the stopping criterion is met.
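A minimal gradient-descent sketch for the logistic-regression cost above (the toy data, learning rate, and iteration count are assumptions); each step moves the coefficients against the gradient of the mean negative log-likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data with a clear threshold around x = 2.
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0, 0, 1, 1])
X_b = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias column
theta = np.zeros(2)

lr, n_iters = 0.1, 1000
for _ in range(n_iters):
    p = sigmoid(X_b @ theta)
    grad = X_b.T @ (p - y) / len(y)          # gradient of the mean NLL
    theta -= lr * grad
    cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # A simple stopping criterion could check whether the cost change is tiny.

print(theta, cost)
```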
Multi Layer Perceptron
Artificial Neural Network
Neural Network Architecture
The derivative of MSE is smooth, enabling efficient optimization with gradient descent.
The network's error is summarized by a single scalar C = cost(ŷ, y), where cost can be equal to MSE, cross-entropy, or any other cost function.
Based on C’s value, the model “knows” how much to adjust its parameters in order to get closer to the
expected output y. This happens using the backpropagation algorithm.
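A brief sketch computing the two cost choices mentioned above on made-up outputs; either scalar can play the role of C that backpropagation then minimizes.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # expected outputs
y_pred = np.array([0.9, 0.2, 0.7])   # network outputs (illustrative)

mse = np.mean((y_true - y_pred) ** 2)
cross_entropy = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Either scalar can serve as the cost C whose gradient backpropagation distributes
# over the network's weights and biases.
print(mse, cross_entropy)
```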
Backpropagation and computing gradients
Backpropagation aims to minimize the cost function by adjusting the network's
weights and biases. The level of adjustment is determined by the gradients
of the cost function with respect to those parameters.
•Gradient of a function C(x_1, x_2, …, x_m) in point x is a vector of the
partial derivatives of C in x.
•The derivative of a function C measures the sensitivity to change of the
function value (output value) with respect to a change in its argument x
(input value). In other words, the derivative tells us the direction C is going.
•The gradient shows how much the parameter x needs to change (in
positive or negative direction) to minimize C.
For a single weight, the gradient is ∂C/∂w = (∂C/∂a) · (∂a/∂z) · (∂z/∂w).
For a single bias, the gradient is ∂C/∂b = (∂C/∂a) · (∂a/∂z) · (∂z/∂b),
where z = w·x + b is the neuron's weighted input and a is its activation.
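A small worked example of these gradients for a single sigmoid neuron with a squared-error cost (all concrete values are assumptions); each factor in the product is one step of the chain rule.

```python
import numpy as np

# One neuron: z = w*x + b, a = sigmoid(z), cost C = (a - y)^2.
x, y = 1.5, 1.0
w, b = 0.4, 0.1

z = w * x + b
a = 1.0 / (1.0 + np.exp(-z))

dC_da = 2 * (a - y)      # how the cost changes with the activation
da_dz = a * (1 - a)      # derivative of the sigmoid
dz_dw = x                # pre-activation w.r.t. the weight
dz_db = 1.0              # pre-activation w.r.t. the bias

grad_w = dC_da * da_dz * dz_dw   # dC/dw for a single weight
grad_b = dC_da * da_dz * dz_db   # dC/db for a single bias

# Gradient-descent style update: move against the gradient.
lr = 0.5
w -= lr * grad_w
b -= lr * grad_b
print(grad_w, grad_b, w, b)
```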
How to Improve Neural Network
Fine Tuning Hyperparameters
Shallow vs. Deep Networks
•Single hidden layer often suffices for many tasks (e.g.,
MNIST: >97% accuracy with ~100s of neurons).
•Deep networks (e.g., 2+ hidden layers) improve parameter
efficiency, enabling exponential reductions in neuron count for
equivalent performance.
•Hierarchical learning: Lower layers detect simple patterns
(lines), intermediate layers combine them (shapes), and upper
layers model complex structures (faces).
Transfer Learning & Practical Benefits
•Reuse lower layers of pretrained networks (e.g., face
detection → hairstyle recognition) to accelerate training and
reduce data needs.
•Generalization: Deep architectures inherently capture
real-world hierarchical data structures, improving convergence
and adaptability.
•Complex tasks (e.g., image/speech recognition) often require
dozens of layers but benefit from pretrained models rather
than training from scratch.
Leaky ReLU
Formula:
f(x)=max(αx,x), α≈0.01
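A one-line sketch of Leaky ReLU with the default α = 0.01 (the sample inputs are illustrative); negative inputs are scaled by α instead of being zeroed out, so they keep a small gradient.

```python
import numpy as np

# Leaky ReLU: f(x) = max(alpha * x, x), valid for alpha < 1.
def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))   # [-0.03, -0.005, 0., 2.]
```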