Deep Learning UNIT 1
Neural Networks are a core technology in Artificial Intelligence (AI) that mimic the human brain's ability to
recognize patterns, learn from data, and make decisions. They are used in various AI applications, from image
and speech recognition to autonomous systems and natural language processing.
What is a Neural Network?
A Neural Network is a computational model inspired by the way biological neural networks in the human
brain process information. It consists of layers of interconnected nodes (also known as neurons) that work
together to process input data and produce an output.
UNIT 1 Page 1
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs the calculations that find
hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally results in output
that is conveyed using this layer.
The artificial neural network takes the inputs, computes their weighted sum, and adds a
bias. This computation is represented in the form of a transfer function.
The weighted total is then passed as input to an activation function to produce the output.
Activation functions determine whether a node should fire or not. Only the nodes that fire pass their signal on
toward the output layer. Different activation functions are available, and the choice depends on the sort of task
we are performing.
weights to minimize the loss.
Deep Learning is a subset of machine learning that employs neural networks with multiple layers (deep
networks) to model complex patterns in data. Unlike traditional machine learning, which often requires
manual feature extraction, deep learning automates this process through hierarchical representation learning.
b. Activation Functions
Activation functions introduce non-linearity, enabling the model to learn complex relationships. Common
activation functions include:
characteristics of neural networks terminology
11 October 2024 11:28
Neural networks, inspired by the human brain, are designed to recognize patterns and relationships in data.
Here's a detailed breakdown of key neural network terminology:
1. Neuron (Node or Perceptron)
• Definition: The basic unit of a neural network. It receives inputs, applies a weight to each, sums them,
and passes the result through an activation function.
• Function: Similar to biological neurons, it processes inputs and produces an output, which can be fed
into other neurons in the next layer.
• Mathematics: Output = Activation Function (∑ (Input × Weight) + Bias).
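The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the choice of sigmoid as the activation, and the input values are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Illustrative values (not from the notes): the weighted sum is
# 1.0*0.5 + 2.0*(-0.25) + 0.0 = 0.0, and sigmoid(0) is 0.5.
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5
```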
2. Input Layer
• Definition: The first layer of the network that receives the input data. Each neuron in this layer
corresponds to one feature in the dataset.
• Function: Transmits the input data to the subsequent layers without applying any computation.
3. Hidden Layer
• Definition: Layers between the input and output layers where the actual computation occurs. There can
be one or more hidden layers depending on the complexity of the problem.
• Function: These layers extract and learn features from the input data through weight adjustments.
• Deep Learning: Neural networks with many hidden layers are called deep neural networks (DNNs),
enabling them to model more complex patterns.
4. Output Layer
• Definition: The final layer that produces the network's prediction or classification. Its neurons represent
the possible classes (for classification tasks) or continuous outputs (for regression tasks).
• Function: The outputs of this layer are interpreted as the network’s final decision or prediction based on
the processed data.
5. Weights
• Definition: Parameters that are applied to inputs to adjust their influence on the output. Each
connection between neurons has a weight associated with it.
• Function: Weights are learned through training and determine the importance of input features.
• Gradient Descent: Weights are updated iteratively during training using optimization algorithms like
gradient descent to minimize the error.
6. Bias
• Definition: An additional parameter added to the sum of inputs and weights, allowing the model to fit
the data better.
• Function: Bias shifts the output of the activation function, enabling the network to handle patterns that
don't pass through the origin.
7. Activation Function
• Definition: A function applied to the weighted sum of inputs in a neuron, determining whether the
neuron should be activated (i.e., produce an output).
• Types:
○ Sigmoid: Maps input values to a range between 0 and 1, making it useful for probability-based
outputs.
○ Tanh: Similar to sigmoid but maps values between -1 and 1, giving a zero-centered output.
○ ReLU (Rectified Linear Unit): Outputs the input directly if positive, otherwise outputs zero. It is
widely used for its simplicity and effectiveness in deep networks.
○ Softmax: Converts logits (raw prediction scores) into probabilities for multi-class classification.
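As a rough sketch, the four activation functions listed above could be written as follows. The function names and sample inputs are illustrative assumptions; tanh is taken directly from Python's standard math module.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def softmax(logits):
    # Subtracting the largest logit avoids overflow and leaves the result unchanged.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))             # 0.5
print(math.tanh(0.0))           # 0.0
print(relu(-3.0), relu(2.0))    # 0.0 2.0
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))        # three probabilities whose sum is approximately 1
```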
8. Loss Function (Cost Function)
• Definition: A function that measures how far the predicted output is from the actual output. It
quantifies the error of the network’s predictions.
• Types:
○ Mean Squared Error (MSE): Used in regression tasks to measure the average squared difference
between predicted and actual values.
○ Cross-Entropy Loss: Used in classification tasks to measure the difference between predicted and
actual probabilities.
9. Backpropagation
• Definition: An algorithm used for training neural networks by updating weights. It involves propagating
the error backward through the network from the output to the input layer.
• Function: Backpropagation calculates the gradient of the loss function with respect to each weight and
updates them using gradient descent to minimize the loss.
Neurons, perceptron, backpropagation
11 October 2024 17:44
1. Neuron
• Definition: A neuron in a neural network is the fundamental unit that mimics a biological neuron. It
processes input data and passes on the information to other neurons.
• Function: Each neuron receives multiple inputs, applies weights to them, sums them up, adds a bias, and
passes the result through an activation function to produce an output.
• Structure:
○ Inputs: Data or signals coming from other neurons or directly from the input layer.
○ Weights: Each input is multiplied by a weight, which represents the strength of that connection.
○ Bias: An extra term added to the weighted sum to help the neuron adjust its output.
○ Activation Function: The result of the weighted sum and bias is passed through a nonlinear function
(like ReLU or Sigmoid) to determine if the neuron should be "activated."
2. Perceptron
• Definition: The perceptron is the simplest form of a neural network, consisting of a single neuron with
adjustable weights and a bias. It is the foundational unit of more complex neural networks.
• History: Introduced by Frank Rosenblatt in 1958, the perceptron is an early model of how a neural network
might function, mimicking how a biological neuron processes data.
• Structure:
○ Input Layer: The perceptron takes several binary or real-valued inputs.
○ Weights: Each input is associated with a weight that determines the input’s importance.
○ Weighted Sum: The inputs are multiplied by their respective weights and summed up, along with a
bias term.
○ Activation Function: The perceptron uses a step activation function that produces an output of either
0 or 1, depending on whether the weighted sum exceeds a certain threshold.
3. Backpropagation
• Definition: Backpropagation (short for "backward propagation of errors") is a key algorithm used to train
neural networks by updating the weights of the neurons through the calculation of gradients.
• Role: It helps minimize the error by propagating it backward from the output layer to the input layer,
adjusting weights to reduce the overall error.
• Steps of Backpropagation:
1. Forward Pass: Input data is passed through the network layer by layer, with each neuron producing an
output. The final output is compared with the actual output (label) using a loss function (like cross-
entropy for classification or mean squared error for regression).
2. Loss Calculation: The loss function computes the error between the predicted and actual outputs.
3. Backward Pass (Error Propagation): The error is propagated backward through the network. Using
calculus (chain rule), the algorithm calculates the gradient of the loss function with respect to each
weight. This is crucial because it tells the network how much each weight contributes to the total
error.
4. Weight Update: Weights are updated by subtracting a fraction of the gradient from each weight. This
fraction is controlled by the learning rate. This step is done using an optimization technique like
gradient descent.
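The four steps above can be traced on the smallest possible network: one sigmoid neuron with a single input. This is only a sketch under stated assumptions; the squared-error loss, the learning rate of 0.5, and the training example (x = 1, y = 1) are illustrative choices, not values from the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, y, lr):
    # 1. Forward pass
    z = w * x + b
    a = sigmoid(z)
    # 2. Loss calculation (squared error for a single example)
    loss = (a - y) ** 2
    # 3. Backward pass: chain rule, dL/dw = dL/da * da/dz * dz/dw
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)
    grad_w = dL_da * da_dz * x
    grad_b = dL_da * da_dz
    # 4. Weight update: subtract the gradient scaled by the learning rate
    return w - lr * grad_w, b - lr * grad_b, loss

w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.0, y=1.0, lr=0.5)
    losses.append(loss)
print(losses[0], losses[-1])  # the loss starts at 0.25 and shrinks with training
```

In a multi-layer network the same chain-rule product simply gains one factor per layer, which is why the error is said to flow backward.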
Basic learning laws
11 October 2024 17:46
The Basic Learning Laws describe the principles or rules by which a neural network learns and adjusts its
weights during the training process. These rules guide how the neural network modifies its parameters
(weights and biases) to reduce the error between its predicted output and the actual output, allowing it to
improve its performance over time.
1. Hebbian Learning
• Overview: Hebbian learning is based on the principle that "neurons that fire together, wire together." It
was introduced by Donald Hebb in 1949 and is one of the earliest learning laws.
• Concept: If two neurons are frequently activated together, the connection between them becomes
stronger. In other words, if a presynaptic neuron (input neuron) contributes to firing a postsynaptic
neuron (output neuron), the weight of the connection between them is increased.
• Rule: The weight between two connected neurons increases if both are activated at the same time.
• Applications: Hebbian learning is used in unsupervised learning systems, where the network learns from
patterns in the input without explicit labeled data. It has been used in models of associative memory and
in self-organizing networks like Kohonen's Self-Organizing Maps (SOMs).
• Limitations: Pure Hebbian learning can lead to instability, as weights may grow indefinitely if not
controlled. Variants like Oja’s rule introduce constraints to prevent this issue.
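A small experiment can make the instability concrete. The sketch below assumes a single linear neuron (output y = w · x) and the usual textbook forms of the two updates: Hebbian, Δw = η · y · x, and Oja's rule, Δw = η · y · (x − y · w). The input vector, starting weights, and learning rate are illustrative.

```python
import math

def hebbian_update(w, x, lr):
    y = sum(wi * xi for wi, xi in zip(w, x))        # postsynaptic activity
    return [wi + lr * y * xi for wi, xi in zip(w, x)]

def oja_update(w, x, lr):
    y = sum(wi * xi for wi, xi in zip(w, x))
    # The extra "- y * wi" term acts as a decay that keeps the weights bounded.
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

x = [1.0, 0.5]
w_hebb = [0.1, 0.1]
w_oja = [0.1, 0.1]
for _ in range(500):
    w_hebb = hebbian_update(w_hebb, x, lr=0.05)
    w_oja = oja_update(w_oja, x, lr=0.05)
print(norm(w_hebb))  # keeps growing with every presentation of x
print(norm(w_oja))   # settles close to 1
```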
2. Perceptron Learning Rule
• Overview: This rule is used in perceptrons, which are simple neural networks consisting of a single
neuron. It’s designed to adjust the weights of the network in response to errors in the prediction.
• Concept: If the perceptron makes a mistake in classifying a data point, the weights are updated in such a
way that the prediction improves. The perceptron learning rule aims to reduce the error between the
predicted and actual output by changing the weights.
• Applications: This learning rule is used in supervised learning, particularly in linear classifiers. However, it
is limited to linearly separable data (where classes can be separated by a straight line or, in higher dimensions, a hyperplane).
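As an illustration, the rule is commonly written as w ← w + η (target − prediction) · x, with the bias updated the same way. The sketch below applies it to the logical AND function, which is linearly separable. The function names, the learning rate of 1, and the epoch count are assumptions chosen to keep the arithmetic exact.

```python
def step(z):
    return 1 if z >= 0 else 0

def predict(w, b, x):
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(data, lr=1, epochs=20):
    w, b = [0, 0], 0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(w, b, x)   # 0 when the prediction is correct
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(and_gate)
print([predict(w, b, x) for x, _ in and_gate])  # [0, 0, 0, 1]
```

The weights change only on misclassified points, so once every point is classified correctly the updates stop.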
3. Delta Rule (Widrow-Hoff Rule)
• Gradient Descent Approach: It continuously adjusts the weights to reduce the error between the desired
and actual output.
• Modification of Weight: The change in a weight is proportional to the product of the input and the error
(which is the difference between the expected and predicted output).
• Applications: The Delta Rule is commonly used in training linear regression models and simple neural
networks. It is a form of gradient descent and forms the basis of backpropagation, which generalizes it to
multi-layer networks.
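A minimal sketch of the Delta Rule, assuming a single linear neuron with one weight and no bias: the update is Δw = η · (target − output) · input. The data, learning rate, and function name are illustrative.

```python
def delta_rule_epoch(w, samples, lr):
    for x, target in samples:
        y = w * x                     # linear output (no step function)
        w += lr * (target - y) * x    # change proportional to input * error
    return w

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # targets follow t = 2x
w = 0.0
for _ in range(50):
    w = delta_rule_epoch(w, samples, lr=0.05)
print(w)  # close to 2.0
```

Unlike the perceptron rule, the error here is computed on the continuous output, so the size of each correction shrinks as the prediction approaches the target.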
4. Competitive Learning
• Overview: In competitive learning, neurons compete to be the most active (or the "winner") for a given
input. Only the winning neuron updates its weights, and the others remain unchanged.
• Concept: Neurons are trained to specialize in recognizing different features or patterns in the input. This
form of learning is unsupervised and is often used for clustering.
• Rule: Only the neuron with the highest activation (the "winner") updates its weights to strengthen its
response to the current input. This is often achieved using winner-takes-all strategies.
• Applications: Competitive learning is used in clustering algorithms, vector quantization, and self-organizing
maps. It’s a common technique in unsupervised learning where labels are not provided.
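A winner-takes-all update can be sketched as follows. One assumption is made loudly here: the "most active" neuron is taken to be the one whose weight vector is closest to the input in Euclidean distance, which is a common convention but not the only one. The data points, starting weights, and learning rate are illustrative.

```python
def winner_index(weights, x):
    # Assumption: the "most active" neuron is the one whose weights are closest to x.
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))

def competitive_step(weights, x, lr):
    k = winner_index(weights, x)
    # Only the winner moves toward the input; all other neurons stay unchanged.
    weights[k] = [wi + lr * (xi - wi) for wi, xi in zip(weights[k], x)]
    return k

weights = [[0.2, 0.2], [0.8, 0.8]]           # two competing neurons
data = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
for _ in range(100):
    for x in data:
        competitive_step(weights, x, lr=0.1)
print(weights)  # each neuron has drifted toward one of the two clusters
```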
○ If two neurons operate in opposite phases, the weight between them decreases.
○ Unlike Hebbian learning, this rule uses a targeted response to calculate weight changes.
Activation and Loss function
11 October 2024 17:58
A. Identity (Linear) Function
• Behavior:
○ This function outputs the same value as the input.
○ It is a linear function, meaning there is no transformation or change applied to the input.
• Use Case:
○ Often used in the input layer of a neural network where no transformation is required.
○ Rarely used in hidden layers since it does not introduce non-linearity, which is essential for complex learning.
• Advantages:
○ Simple and easy to compute.
○ Useful in some regression models.
• Disadvantages:
○ Does not capture any complex patterns in data due to its linearity.
○ Its derivative is constant (equal to 1), so the gradient carries no information about the input, and stacked linear layers collapse into a single linear map, leading to limited learning capabilities.
B. Threshold/Step Function
• Behavior:
○ This function outputs 1 if the input is 0 or positive, and 0 if the input is negative.
○ It transforms input signals into binary outputs (either 0 or 1).
○ It can be viewed as a binary classifier, distinguishing between classes.
• Use Case:
○ Primarily used for binary classification tasks where the output is either true or false.
○ Commonly applied in single-layer perceptrons.
• Variant: Threshold function with a threshold value θ:
○ Instead of 0 as the decision boundary, a threshold θ is introduced: the output is 1 if the input is at least θ, and 0 otherwise.
• Advantages:
○ Simple and fast.
○ Effective for binary decisions.
• Disadvantages:
○ Not differentiable, meaning it's not suitable for gradient-based optimization methods.
○ Can't handle multiple classes or complex relationships.
C. ReLU (Rectified Linear Unit) Function
• Behavior:
○ ReLU is one of the most widely used activation functions in deep learning, including convolutional neural networks (CNNs).
○ For all positive input values, the output is the same as the input. For negative inputs, the output is 0.
○ It introduces non-linearity but still allows for simple, efficient computations.
• Advantages:
○ Efficient computation: Due to its simplicity, it’s computationally fast, making it suitable for large-scale models.
○ Sparse activation: It activates only a portion of the neurons, which reduces computational overhead.
○ Gradient propagation: Helps mitigate the vanishing gradient problem by maintaining gradients when input is positive.
• Disadvantages:
○ Non-differentiability at 0: ReLU is not differentiable at x=0.
○ Dying ReLU problem: Sometimes, neurons can stop responding to any input (when they output 0 all the time) during
training. This can cause parts of the network to "die" or become inactive, leading to performance issues.
D. Sigmoid Function
• Sigmoid functions are S-shaped (logistic) curves and are useful when output values need to be squashed into a
specific range; the binary sigmoid maps any real input into (0, 1).
• Use Case:
○ Common in binary classification problems.
○ It’s used in the output layer when the desired output is between 0 and 1.
• Advantages:
○ Provides smooth, non-linear output.
○ Useful in probability-based outputs (since the output lies between 0 and 1).
• Disadvantages:
○ Vanishing gradient problem: When the input values are very large in magnitude (strongly positive or strongly negative), the gradient approaches zero, which slows down
the learning process.
○ Output is not zero-centered, which can affect optimization algorithms.
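The vanishing gradient can be checked numerically. The sketch below uses the standard identity that the sigmoid's derivative is σ(z)(1 − σ(z)); the sample inputs are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)     # derivative of the sigmoid

for z in (0.0, 5.0, 10.0):
    print(z, sigmoid_grad(z))
# The gradient peaks at 0.25 for z = 0 and is already below 1e-4 at z = 10.
```

Because backpropagation multiplies such factors layer by layer, several saturated sigmoid layers in a row can shrink the gradient to almost nothing.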
E. Bipolar Sigmoid Function
• Behavior:
○ Similar to the binary sigmoid but squashes the input to a range between -1 and +1.
○ It can model outputs that span a negative-to-positive range.
• Use Case:
○ Common in networks where the output needs to represent a bipolar decision, or values ranging between -1 and +1 are
required.
• Advantages:
○ Retains the properties of the binary sigmoid while providing an expanded output range (-1, +1).
• Disadvantages:
○ Suffers from vanishing gradients for large-magnitude inputs, like the binary sigmoid. Unlike the binary sigmoid, however, its output is zero-centered.
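The formula for this function is missing from these notes, so the sketch below assumes the common textbook definition f(x) = (1 − e^(−x)) / (1 + e^(−x)). Under that assumption the function is a rescaled binary sigmoid and coincides with tanh(x / 2), which the loop checks numerically.

```python
import math

def bipolar_sigmoid(x):
    # Assumed definition; equal to 2 * sigmoid(x) - 1.
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    # The two printed values agree, and the output is symmetric around 0.
    print(x, bipolar_sigmoid(x), math.tanh(x / 2.0))
```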
F. Tanh (Hyperbolic Tangent) Function
• Behavior:
○ The Tanh function is an S-shaped curve similar to the sigmoid but squashes input values to the range (-1, +1).
○ When input values are large in magnitude, the Tanh function saturates to -1 (for negative inputs) or +1 (for positive inputs).
○ It is zero-centered, unlike the sigmoid function, meaning the output is symmetrically distributed around zero.
• Use Case:
○ Frequently used in backpropagation networks, particularly in hidden layers.
○ Suitable for models where negative values are important for learning.
• Advantages:
○ Avoids the non-zero-centered output issue of the sigmoid function.
○ Works well for data that has strong negative and positive relationships.
• Disadvantages:
○ Similar to the sigmoid function, Tanh suffers from the vanishing gradient problem, especially for inputs of large magnitude.
LOSS FUNCTION
A loss function (also known as a cost function or objective function) is a critical part of a neural network, determining how well
the model is performing during training. It quantifies the difference between the predicted output and the actual output, guiding
the optimization process by updating weights and biases to minimize this error.
The primary goal of training a neural network is to minimize the loss function so that the model predictions are as close as
possible to the actual values. Different types of loss functions are used depending on the type of problem—classification,
regression, or others.
1. Mean Squared Error (MSE)
• Explanation:
○ MSE measures the average squared difference between the actual and predicted values.
○ The errors are squared to ensure that positive and negative errors do not cancel each other out, and also to penalize larger
errors more severely.
• Use Case:
○ Primarily used in regression problems, where the task is to predict continuous values.
• Advantages:
○ Easy to compute and differentiable, which is important for gradient descent.
○ Penalizes larger errors more heavily, which is useful when large mistakes are especially costly.
• Disadvantages:
○ The squaring can over-penalize large errors, making MSE sensitive (less robust) to outliers.
2. Mean Absolute Error (MAE)
• Explanation:
○ MAE measures the average absolute difference between the actual and predicted values.
○ Unlike MSE, it takes the absolute value of the error, so it does not penalize larger errors as severely.
• Use Case:
○ Also commonly used in regression problems, especially when outliers are present, as it is more robust than MSE.
• Advantages:
○ MAE gives a more natural, linear measure of error, treating all errors equally without squaring them.
○ Less sensitive to outliers compared to MSE.
• Disadvantages:
○ Since it’s not differentiable at zero, it may be difficult to optimize in some cases, though this issue can often be handled using
sub-gradients.
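The different reactions of MSE and MAE to an outlier can be seen on a toy example. The data below are made up for illustration; the two functions follow the definitions given above.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 2.0, 3.0, 4.0]
good = [1.0, 2.0, 3.0, 5.0]       # one error of size 1
outlier = [1.0, 2.0, 3.0, 14.0]   # one error of size 10
print(mse(actual, good), mae(actual, good))         # 0.25 0.25
print(mse(actual, outlier), mae(actual, outlier))   # 25.0 2.5
```

Making the single error ten times larger multiplies MAE by 10 but MSE by 100, which is the sensitivity to outliers described above.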
3. Cross-Entropy Loss
• Explanation:
○ Cross-Entropy Loss (or log loss) is widely used in classification problems, particularly in binary classification.
○ It measures the difference between the actual class label and the predicted probability. If the predicted probability is close to
the true label, the loss is small, and if it's far from the true label, the loss increases.
• Use Case:
○ Used for binary classification (binary cross-entropy) and multi-class classification (categorical cross-entropy).
• Advantages:
○ Well-suited for probability-based outputs and works well with models that predict probabilities, such as those using sigmoid
or softmax activation functions.
• Disadvantages:
○ Sensitive to poorly estimated probabilities. If the model is very confident but wrong, the loss becomes very large.
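A sketch of binary cross-entropy, using the standard form −[y · log(p) + (1 − y) · log(1 − p)] averaged over the examples. The clipping constant eps is an implementation assumption added to keep the logarithm finite; the example probabilities are illustrative.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(binary_cross_entropy([1, 0], [0.9, 0.1]))   # small loss: confident and right
print(binary_cross_entropy([1, 0], [0.1, 0.9]))   # large loss: confident and wrong
```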
4. Hinge Loss
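• Formula: L(y, f(x)) = max(0, 1 − y · f(x)), where y ∈ {−1, +1} is the true label and f(x) is the raw model output.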
• Explanation:
○ Hinge loss is primarily used in support vector machines (SVMs).
○ It encourages a large margin between the data points and the decision boundary. The loss is zero when the
prediction is correct and lies beyond the margin (that is, the label in {−1, +1} times the model's raw score is at least 1).
Otherwise, it increases linearly as the prediction moves further onto the wrong side of the margin.
• Use Case:
○ Commonly used in binary classification tasks for SVMs, where the goal is to create a margin between classes.
• Advantages:
○ It emphasizes margin maximization, which often leads to better generalization in classification tasks.
• Disadvantages:
○ Only applicable to classification problems; it does not produce probability estimates and is not differentiable at the margin point.
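A minimal sketch of hinge loss, assuming the standard form max(0, 1 − y · score) with labels in {−1, +1}. The scores below are illustrative raw classifier outputs.

```python
def hinge_loss(y_true, scores):
    # y_true holds labels in {-1, +1}; scores are raw (unsquashed) classifier outputs.
    return sum(max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)) / len(y_true)

print(hinge_loss([1], [2.0]))    # 0.0  correct and beyond the margin
print(hinge_loss([1], [0.5]))    # 0.5  correct but inside the margin
print(hinge_loss([1], [-1.0]))   # 2.0  wrong side of the boundary
```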
5. Huber Loss
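• Formula: L_δ(e) = ½ e² if |e| ≤ δ, and δ (|e| − ½ δ) otherwise, where e is the prediction error (actual minus predicted) and δ is the threshold.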
• Explanation:
○ Huber Loss combines the best of both MSE and MAE. For smaller errors, it behaves like MSE (quadratic), and for larger
errors, it behaves like MAE (linear).
○ This allows Huber Loss to be more robust to outliers compared to MSE while still providing smooth and differentiable error
feedback.
• Use Case:
○ Commonly used in regression problems where the data contain outliers but smooth behavior is still required for small
errors.
• Advantages:
○ More robust to outliers than MSE.
○ Provides smoother gradient feedback than MAE.
• Disadvantages:
○ The threshold δ must be tuned, and improper tuning can affect performance.
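The two regimes of Huber loss can be sketched as below, assuming the standard piecewise definition: ½ e² when |e| ≤ δ, and δ (|e| − ½ δ) otherwise. The default δ = 1.0 and the sample errors are illustrative.

```python
def huber(error, delta=1.0):
    a = abs(error)
    if a <= delta:
        return 0.5 * a * a                # quadratic region, like MSE
    return delta * (a - 0.5 * delta)      # linear region, like MAE

for e in (0.5, 1.0, 3.0, 10.0):
    print(e, huber(e))
# 0.5 -> 0.125, 1.0 -> 0.5, 3.0 -> 2.5, 10.0 -> 9.5
```

The two pieces meet at |e| = δ with the same value and slope, which is what keeps the loss smooth there.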
6. Kullback-Leibler (KL) Divergence
• Explanation:
○ KL Divergence measures the difference between two probability distributions: the true distribution P(x) and the
predicted distribution Q(x).
○ It is useful when you are dealing with probability distributions rather than specific class labels or continuous values.
• Use Case:
○ Used in variational autoencoders (VAEs) and other models that learn probability distributions.
• Advantages:
○ Suitable for comparing two distributions and is commonly used in unsupervised learning and probabilistic models.
• Disadvantages:
○ It is asymmetric, meaning it does not treat differences between distributions P and Q in the same way, which might not
be desirable in certain applications.
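The asymmetry is easy to see on a small discrete example. The sketch assumes the standard definition, the sum over x of P(x) · log(P(x) / Q(x)), and that Q(x) is nonzero wherever P(x) is; the two distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    # Sum of p(x) * log(p(x) / q(x)); terms with p(x) = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))   # two different values
print(kl_divergence(p, p))                         # 0.0
```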
Function approximation
14 October 2024 10:04
Function approximation refers to the process of estimating a target function, which maps input data to output
predictions, using a model that best captures the underlying relationship between the inputs and outputs. This
concept is fundamental to machine learning, where we seek to build models that generalize well to unseen data
based on training data.
In simpler terms, the goal of function approximation is to find a mathematical function f(x) that can approximate
the true underlying function f*(x), which governs the relationship between inputs and outputs.
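As a toy illustration, suppose the unknown function is f*(x) = 3x + 1 (a hypothetical choice made for this example) and the model family is straight lines. Fitting the line by ordinary least squares recovers an approximation f(x) from a handful of samples, which can then be used on inputs it has not seen.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def target(x):
    # Hypothetical "true" function f*(x); in practice it is unknown.
    return 3.0 * x + 1.0

xs = [0.0, 1.0, 2.0, 3.0]
slope, intercept = fit_line(xs, [target(x) for x in xs])
print(slope, intercept)             # 3.0 1.0
print(slope * 10.0 + intercept)     # 31.0, a prediction for an unseen input
```

A neural network plays the same role as the straight line here, but with a far more flexible family of functions and weights found by gradient descent rather than a closed-form formula.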
Applications
14 October 2024 10:07
4. Financial Services
• Fraud Detection: ANNs are used to identify fraudulent transactions by recognizing patterns that deviate
from typical user behavior, helping banks and financial institutions mitigate fraud risks.
• Stock Market Prediction: Neural networks, particularly recurrent architectures like LSTMs, are employed
in forecasting stock prices by analyzing historical data, news, and other influencing factors.
• Credit Scoring: ANNs assist in predicting the creditworthiness of individuals by assessing various factors
like credit history, loan applications, and spending patterns.
• Virtual Reality (VR) and Augmented Reality (AR): ANNs are used to enhance the realism of VR/AR
environments by improving object recognition, scene reconstruction, and interactive user interfaces.
7. Recommendation Systems
• Content Recommendations: Platforms like Netflix, YouTube, and Spotify use ANNs to recommend
content based on user preferences and behavior by analyzing viewing, listening, or searching patterns.
• Product Recommendations: E-commerce sites like Amazon use neural networks to recommend products
by learning from users' purchase history, browsing behavior, and other user data.