
Machine Learning Unit 2

Unit 2
DECISION TREE LEARNING – Decision tree learning algorithm, Inductive bias, Issues in Decision
tree learning; ARTIFICIAL NEURAL NETWORKS – Perceptrons, Gradient descent and the Delta
rule, Adaline, Multilayer networks, Derivation of backpropagation rule, Backpropagation Algorithm,
Convergence, Generalization.

Machine Learning at present:


The field of machine learning has made significant strides in recent years, and its applications are
numerous, including self-driving cars, Amazon Alexa, chatbots, and recommender systems. It
incorporates clustering, classification, decision tree, and SVM algorithms, as well as supervised,
unsupervised, and reinforcement learning.

Introduction to Machine Learning Approaches


Machine learning approaches are the methodologies and techniques used to create models that learn
from data. These approaches vary based on the nature of the data, the task at hand, and the desired
outcome. Broadly, machine learning approaches can be categorized into three main types: supervised
learning, unsupervised learning, and reinforcement learning. There are also other approaches,
such as semi-supervised learning and self-supervised learning, which blend elements of these
primary categories.

1. Supervised Learning
Supervised learning is a type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as a supervisor that teaches
the machine to predict the output correctly. It applies the same concept as a student learning under
the supervision of a teacher.
Supervised learning is the process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function that maps
the input variable (x) to the output variable (y).
In the real world, supervised learning can be used for risk assessment, image classification, fraud
detection, spam filtering, etc.

• Key Characteristics:
o Labeled Data: Requires a dataset where the correct output (label) is known.
o Goal: Learn a function that maps inputs to outputs, minimizing the difference between
the predicted and actual outputs.
• Common Tasks:
o Classification: Assigning inputs to one of several predefined categories (e.g., email
spam detection, digit recognition).
o Regression: Predicting a continuous output value (e.g., house price prediction, stock
market forecasting).
• Algorithms:
o Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Decision
Trees, Random Forests, k-Nearest Neighbors (k-NN), Neural Networks.

In supervised learning, models are trained using a labelled dataset, where the model learns about each
type of data. Once the training process is completed, the model is tested on the basis of test data (a
held-out subset of the dataset), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and diagram:

Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles,
and hexagons. The first step is to train the model on each shape:

o If the given shape has four sides, and all the sides are equal, it will be labelled as a Square.
o If the given shape has three sides, it will be labelled as a Triangle.
o If the given shape has six equal sides, it will be labelled as a Hexagon.
After training, we test our model using the test set, and the task of the model is to identify the
shape.
The machine is already trained on all types of shapes, so when it finds a new shape, it classifies the
shape on the basis of its number of sides and predicts the output. A minimal sketch of this follows.
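
The shape example can be written in a few lines. Below is a minimal sketch assuming scikit-learn is available; the two-feature encoding (number of sides, all-sides-equal flag) is invented here purely for illustration.

# Minimal supervised-learning sketch of the shape example, assuming scikit-learn.
# Each shape is encoded with two illustrative features: number of sides and
# whether all sides are equal (1 = yes, 0 = no).
from sklearn.tree import DecisionTreeClassifier

X_train = [
    [4, 1],  # square: four sides, all equal
    [3, 0],  # triangle: three sides
    [6, 1],  # hexagon: six equal sides
    [4, 0],  # rectangle: four sides, not all equal
]
y_train = ["square", "triangle", "hexagon", "rectangle"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)      # training on labelled data

# A new, unseen shape with four equal sides should be classified as a square.
print(model.predict([[4, 1]]))   # ['square']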

2. Unsupervised Learning
As the name suggests, unsupervised learning is a machine learning technique in which models are not
supervised using a training dataset. Instead, the model itself finds the hidden patterns and insights in
the given data. It can be compared to the learning that takes place in the human brain when learning
new things. It can be defined as:
"Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision."
Example: Suppose an unsupervised learning algorithm is given an input dataset containing images
of different types of cats and dogs. The algorithm is never trained on the given dataset, which means
it has no idea about the features of the dataset. The task of the unsupervised learning algorithm is to
identify the image features on its own. It will perform this task by clustering the image dataset into
groups according to the similarities between images.

Unsupervised learning involves training a model on data without labeled outputs. The goal is to
discover the underlying structure or distribution in the data.
• Key Characteristics:
o Unlabeled Data: The model works with data where the correct output is unknown.
o Goal: Identify patterns, groupings, or representations of the data.
• Common Tasks:
o Clustering: Grouping similar data points together (e.g., customer segmentation, image
compression).
o Dimensionality Reduction: Reducing the number of variables under consideration
(e.g., Principal Component Analysis (PCA)).
o Anomaly Detection: Identifying unusual or rare items in data (e.g., fraud detection).
• Algorithms:
o k-Means, Hierarchical Clustering, DBSCAN, PCA, Autoencoders, Gaussian Mixture
Models (GMMs).

Below are some main reasons which describe the importance of unsupervised learning:
o Unsupervised learning is helpful for finding useful insights in data.
o Unsupervised learning is similar to how a human learns to think from their own experiences,
which makes it closer to true AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it all the
more important.
o In the real world, we do not always have input data with corresponding outputs, and to solve
such cases we need unsupervised learning.
Working of unsupervised learning can be understood by the below diagram:

Here, we take unlabeled input data, which means it is not categorized and corresponding outputs are
not given. This unlabeled input data is fed to the machine learning model in order to train it. The model
first interprets the raw data to find the hidden patterns in it and then applies a suitable algorithm
such as k-means clustering, PCA, or autoencoders.
Once the suitable algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.
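
Below is a minimal sketch of this workflow using k-means, assuming scikit-learn is available; the two blobs of 2-D points are made-up data, and no labels are ever supplied to the algorithm.

# Minimal unsupervised-learning sketch using k-means, assuming scikit-learn.
# No labels are given; the algorithm groups the points purely by similarity.
import numpy as np
from sklearn.cluster import KMeans

# Two illustrative blobs of 2-D points (made-up data).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])   # discovered group assignments
print(kmeans.cluster_centers_)                   # discovered group centres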

Supervised Learning vs. Unsupervised Learning

o Training data: Supervised learning algorithms are trained using labeled data, whereas
unsupervised learning algorithms are trained using unlabeled data.
o Feedback: A supervised learning model takes direct feedback to check whether it is predicting
the correct output; an unsupervised learning model does not take any feedback.
o Output: A supervised learning model predicts the output; an unsupervised learning model finds
the hidden patterns in the data.
o Inputs: In supervised learning, input data is provided to the model along with the output; in
unsupervised learning, only input data is provided to the model.
o Goal: The goal of supervised learning is to train the model so that it can predict the output when
given new data; the goal of unsupervised learning is to find the hidden patterns and useful
insights in an unknown dataset.
o Supervision: Supervised learning needs supervision to train the model; unsupervised learning
does not need any supervision.
o Problem types: Supervised learning can be categorized into Classification and Regression
problems; unsupervised learning can be classified into Clustering and Association problems.
o Use cases: Supervised learning is used where we know the inputs as well as the corresponding
outputs; unsupervised learning is used where we have only input data and no corresponding
output data.
o Accuracy: A supervised learning model generally produces more accurate results; an
unsupervised learning model may give less accurate results in comparison.
o Relation to AI: Supervised learning is not close to true artificial intelligence, since we first train
the model on each type of data and only then can it predict the correct output; unsupervised
learning is closer to true artificial intelligence, as it learns similarly to how a child learns daily
routine things from experience.
o Algorithms: Supervised learning includes algorithms such as Linear Regression, Logistic
Regression, Support Vector Machines, Multi-class Classification, Decision Trees, and Bayesian
Logic; unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori
algorithm.

3. Reinforcement Learning
Reinforcement learning (RL) involves training an agent to make a sequence of decisions by
interacting with an environment. The agent receives feedback in the form of rewards or penalties
based on the actions it takes, and the goal is to learn a policy that maximizes cumulative rewards.
(A minimal tabular sketch appears at the end of this subsection.)
o Reinforcement Learning is a feedback-based machine learning technique in which an agent
learns to behave in an environment by performing actions and seeing the results of those
actions. For each good action, the agent gets positive feedback, and for each bad action, the
agent gets negative feedback or a penalty.
o In Reinforcement Learning, the agent learns automatically from feedback, without any
labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential and the goal is
long-term, such as game playing, robotics, etc.
o The agent learns which actions lead to positive feedback or rewards and which actions lead
to negative feedback or penalties. For a good action the agent receives a positive point, and
as a penalty it receives a negative point.

• Key Characteristics:
o Environment Interaction: The model learns by interacting with an environment and
receiving feedback.
o Goal: Learn a policy that maximizes long-term rewards.
• Common Tasks:
o Game Playing: Training agents to play games like chess, Go, or video games.
o Robotics: Enabling robots to perform tasks like navigation or manipulation.
o Self-driving Cars: Teaching autonomous vehicles to make driving decisions.
• Algorithms:
o Q-Learning, Deep Q-Networks (DQN), Policy Gradients
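
The feedback loop described above can be made concrete with tabular Q-learning. Below is a minimal sketch on a made-up 5-state corridor environment; the states, rewards, and hyperparameters are all invented for illustration, not a production RL implementation.

# Minimal tabular Q-learning sketch on a made-up 5-state corridor:
# the agent starts at state 0 and earns a reward of +1 on reaching state 4.
import random

N_STATES = 5                            # corridor states 0..4; state 4 is the goal
ACTIONS = [0, 1]                        # 0 = move left, 1 = move right
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    # Made-up environment dynamics: move left/right, reward +1 on reaching the goal.
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        nxt, reward, done = step(state, action)
        # Q-learning update: nudge Q toward reward + discounted best future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt

print([round(max(q), 2) for q in Q])    # state values grow toward the goal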

4. Semi-Supervised Learning
Semi-supervised learning is an important category that lies between supervised and unsupervised
machine learning. It is the middle ground between the two: it operates on data that contains a few
labels but mostly consists of unlabeled examples.

• Key Characteristics:
o Combination of Labeled and Unlabeled Data: Leverages both types of data to
improve learning efficiency.
o Goal: Achieve better performance than purely supervised learning with limited labeled
data.
• Common Tasks:
o Often used in scenarios where obtaining labels is difficult, such as text classification,
image classification, and speech recognition.
• Algorithms:
o Semi-Supervised Support Vector Machines (S3VM), Generative models, Label
Propagation, Self-training.

Working of Semi-Supervised Learning


Semi-supervised learning uses pseudo-labeling to train the model with less labeled training data than
supervised learning. The process can combine various neural network models and training methods.
The working of semi-supervised learning is explained in the points below (a minimal sketch follows
the note):
o First, the model is trained on a small amount of labeled data, as in supervised learning.
Training continues until the model gives accurate results.
o Next, the algorithm applies the model to the unlabeled dataset to produce pseudo-labels; at
this point the results may not be accurate.
o The labels from the labeled training data and the pseudo-labels are then linked together.
o The input data from the labeled and unlabeled training data are also linked.
o Finally, the model is trained again on the new combined input, as in the first step. This
reduces errors and improves the accuracy of the model.
Note: Pseudo-labeling is the process of adding confidently predicted unlabeled data to the training
data to increase the amount of training data.
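
Below is a minimal self-training sketch of the pseudo-labeling loop described above, assuming scikit-learn is available; the data, the 0.95 confidence threshold, and the single retraining round are all made-up choices for illustration.

# Minimal self-training (pseudo-labeling) sketch, assuming scikit-learn.
# A classifier is trained on a few labelled points, then its most confident
# predictions on unlabelled data are added as pseudo-labels and it is retrained.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_labeled = np.array([0, 0, 1, 1])                      # only 4 real labels
X_unlabeled = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
                         rng.normal([5, 5], 0.5, (50, 2))])

model = LogisticRegression().fit(X_labeled, y_labeled)  # step 1: small labelled set

proba = model.predict_proba(X_unlabeled)
confident = proba.max(axis=1) > 0.95                    # step 2: confident predictions
pseudo_y = proba.argmax(axis=1)[confident]

# steps 3-5: link real labels with pseudo-labels and retrain on the combined data
X_combined = np.vstack([X_labeled, X_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_y])
model = LogisticRegression().fit(X_combined, y_combined)
print(f"{confident.sum()} pseudo-labelled points added")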

DECISION TREE LEARNING:


A decision tree is a flowchart-like structure used to make decisions or predictions. It consists of nodes
representing decisions or tests on attributes, branches representing the outcome of these decisions,
and leaf nodes representing final outcomes or predictions. Each internal node corresponds to a test on
an attribute, each branch corresponds to the result of the test, and each leaf node corresponds to a class
label or a continuous value.

Decision Tree is a supervised learning technique that can be used for both classification and
regression problems, but it is mostly preferred for solving classification problems.

Decision Tree Terminologies


• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which
is further divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further after reaching
a leaf node.
• Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub-Tree: A subtree formed by splitting a node of the tree.
• Parent/Child Node: The node that is split is called the parent node, and the nodes produced
from it are called the child nodes.
• Pruning: The process of removing branches or nodes from a decision tree to improve its
generalisation and prevent overfitting.

How is a Decision Tree formed?


The process of forming a decision tree involves recursively partitioning the data based on the values
of different attributes. The algorithm selects the best attribute to split the data at each internal node,
based on criteria such as information gain or Gini impurity. This splitting process continues until a
stopping criterion is met, such as reaching a maximum depth or having a minimum number of
instances in a leaf node.

Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not. To
solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM).
The root node splits further into the next decision node (distance from the office) and one leaf node
based on the corresponding labels. The next decision node is further split into one decision node
(cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accept offer
and Decline offer). Consider the diagram below.

Attribute Selection Measures


While implementing a decision tree, the main issue is how to select the best attribute for the root
node and for the sub-nodes. To solve such problems there is a technique called the Attribute
Selection Measure, or ASM. With this measure, we can easily select the best attribute for the nodes
of the tree. The two popular ASM techniques are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of the change in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the
node/attribute having the highest information gain is split first. It can be calculated using the
formula below:

Information Gain = Entropy(S) - [(Weighted Avg.) * Entropy(each feature)]

Entropy: Entropy is the measurement of disorder or impurity in the information being processed. It
determines how a decision tree chooses to split the data. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,
o S = the set of samples
o P(yes) = probability of yes
o P(no) = probability of no
2. Gini Index:
o The Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It creates only binary splits, and the CART algorithm uses the Gini index to create them.
(A sketch of both measures follows below.)
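
Both measures can be computed in a few lines. Below is a minimal sketch of the entropy and information-gain formulas above for a binary (yes/no) target, together with the standard Gini index definition used by CART; the example counts are made up.

# Minimal sketch of the two attribute selection measures for a binary target.
import math

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no); 0*log(0) taken as 0.
    terms = [p for p in (p_yes, p_no) if p > 0]
    return -sum(p * math.log2(p) for p in terms)

def gini(p_yes, p_no):
    # Standard CART Gini index: 1 minus the sum of squared class probabilities.
    return 1 - (p_yes ** 2 + p_no ** 2)

def information_gain(parent, children):
    # IG = Entropy(S) - weighted average entropy of the child subsets.
    # parent and children are (n_yes, n_no) counts; weights are subset sizes.
    def ent(counts):
        total = sum(counts)
        return entropy(counts[0] / total, counts[1] / total)
    n = sum(parent)
    weighted = sum(sum(c) / n * ent(c) for c in children)
    return ent(parent) - weighted

# Example: 9 yes / 5 no, split into subsets (6,2) and (3,3) -- made-up counts.
print(round(information_gain((9, 5), [(6, 2), (3, 3)]), 3))   # ~0.048
print(round(gini(9 / 14, 5 / 14), 3))                          # ~0.459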

Pruning: Getting an Optimal Decision tree


Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal decision
tree.
A tree that is too large increases the risk of overfitting, while a tree that is too small may not capture
all the important features of the dataset. Pruning is therefore the technique of decreasing the size of
the learned tree without reducing accuracy.

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process a human follows when making a
decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think through all the possible outcomes for a problem.
o It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
o A decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o With more class labels, the computational complexity of the decision tree may increase.
• Overfitting: Decision trees can easily overfit the training data, especially if they are deep with
many nodes.
Applications of Decision Trees
• Business Decision Making: Used in strategic planning and resource allocation.
• Healthcare: Assists in diagnosing diseases and suggesting treatment plans.
• Finance: Helps in credit scoring and risk assessment.
• Marketing: Used to segment customers and predict customer behavior.

Issues in Decision tree learning


• How deep to grow the tree?
• How to handle continuous attributes?
• How to choose an appropriate attribute selection measure?
• How to handle data with missing attribute values?
• How to handle attributes with different costs?
• How to improve computational efficiency?
• ID3 has been extended to handle most of these issues.

Artificial Neural Networks


Artificial Neural Networks contain artificial neurons, which are called units. These units are arranged
in a series of layers that together constitute the whole Artificial Neural Network. A layer can have a
dozen units or millions of units, depending on how complex the network must be to learn the hidden
patterns in the dataset.
An Artificial Neural Network has an input layer, an output layer, and hidden layers. The input layer
receives data from the outside world which the neural network needs to analyze or learn about. This
data then passes through one or more hidden layers that transform the input into data that is valuable
to the output layer. Finally, the output layer provides an output in the form of the network's response
to the input data provided.
In the majority of neural networks, units are interconnected from one layer to another. Each of these
connections has a weight that determines the influence of one unit on another. As data transfers
from one unit to another, the neural network learns more and more about the data, which eventually
results in an output from the output layer.

The term "Artificial Neural Network" is derived from Biological neural networks that develop the
structure of a human brain. Similar to the human brain that has neurons interconnected to one another,
artificial neural networks also have neurons that are interconnected to one another in various layers
of the networks. These neurons are known as nodes.

Neural Networks Architecture

Artificial neurons vs Biological neurons


The concept of artificial neural networks comes from biological neurons found in animal brains.

Biological Neuron -> Artificial Neuron
Dendrites -> Inputs
Cell nucleus or soma -> Nodes
Synapses -> Weights
Axon -> Output

Structure: The structure of artificial neural networks is inspired by biological neurons. A biological
neuron has a cell body or soma to process impulses, dendrites to receive them, and an axon that
transfers them to other neurons.
Synapses: Synapses are the links between biological neurons that enable the transmission of impulses
from the dendrites to the cell body. In artificial neurons, the weights that join the nodes of one layer
to the nodes of the next layer play the role of synapses.
Learning: In biological neurons, learning happens in the cell body or soma, which has a nucleus that
helps to process the impulses. An action potential is produced and travels through the axon if the
impulses are powerful enough to reach the threshold.

Activation: In biological neurons, activation is the firing rate of the neuron, which happens when the
impulses are strong enough to reach the threshold. In artificial neural networks, a mathematical
function known as an activation function maps the input to the output.

Applications of Artificial Neural Networks


Social Media: Artificial Neural Networks are used heavily in social media. For example, take the
'People you may know' feature on Facebook, which suggests people you might know in real life so
that you can send them friend requests.

Marketing and Sales: When you log onto e-commerce sites like Amazon and Flipkart, they
recommend products to buy based on your previous browsing history. Similarly, if you love pasta,
then Zomato, Swiggy, etc. will show you restaurant recommendations based on your tastes and
previous order history. This is true across all new-age marketing segments like book sites, movie
services, hospitality sites, etc., and it is done by implementing personalized marketing.

Personal Assistants: You have surely heard of Siri, Alexa, Cortana, etc., and perhaps used them,
depending on the phone you have. These personal assistants are an example of speech recognition:
they use Natural Language Processing to interact with users and formulate a response accordingly.

Perceptron
The perceptron is one of the simplest artificial neural network architectures. It was introduced by
Frank Rosenblatt in 1957. It is the simplest type of feedforward neural network, consisting of a single
layer of input nodes that are fully connected to a layer of output nodes.

Types of Perceptron
• Single-Layer Perceptron: This type of perceptron is limited to learning linearly separable
patterns. It is effective for tasks where the data can be divided into distinct categories by a
straight line.
• Multilayer Perceptron: Multilayer perceptrons possess enhanced processing capabilities, as
they consist of two or more layers and are adept at handling more complex patterns and
relationships within the data.

Components of a Perceptron
A perceptron, the basic unit of a neural network, comprises essential components that collaborate in
processing information.
• Input Features: The perceptron takes multiple input features; each input feature represents a
characteristic or attribute of the input data.
• Weights: Each input feature is associated with a weight, determining the significance of each
input feature in influencing the perceptron’s output. During training, these weights are adjusted
to learn the optimal values.
• Summation Function: The perceptron calculates the weighted sum of its inputs using the
summation function. The summation function combines the inputs with their respective
weights to produce a weighted sum.
• Activation Function: The weighted sum is then passed through an activation function. The
perceptron uses the Heaviside step function, which takes the weighted sum as input, compares
it with a threshold, and provides an output of 0 or 1.
Note: The Heaviside step function H(x), also called the unit step function, is a discontinuous
function, whose value is zero for negative arguments x < 0 and one for positive arguments x > 0.
Machine Learning Unit 2

• Output: The final output of the perceptron is determined by the activation function's result.
For example, in binary classification problems, the output might represent a predicted class (0
or 1).
• Bias: A bias term is often included in the perceptron model. The bias allows the model to make
adjustments that are independent of the input. It is an additional parameter that is learned
during training.
• Learning Algorithm (Weight Update Rule): During training, the perceptron learns by
adjusting its weights and bias based on a learning algorithm. A common approach is the
perceptron learning algorithm, which updates weights based on the difference between the
predicted output and the true output.

These components work together to enable a perceptron to learn and make predictions (a minimal
sketch follows below). While a single perceptron can perform binary classification, more complex
tasks require multiple perceptrons organized into layers, forming a neural network.
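
The sketch below ties these components together: a single perceptron with a Heaviside step activation, trained with the perceptron learning rule on the logical AND function. The data, learning rate, and epoch count are illustrative choices.

# Minimal perceptron sketch: inputs, weights, bias, summation, Heaviside step,
# and the perceptron weight update rule, trained on the AND function.
def heaviside(x):
    return 1 if x >= 0 else 0

# Made-up training data: logical AND.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

w = [0.0, 0.0]   # one weight per input feature
b = 0.0          # bias term, learned alongside the weights
lr = 0.1         # learning rate

for epoch in range(20):
    for x, target in data:
        # Summation function: weighted sum of inputs plus bias.
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        y = heaviside(s)                 # activation function
        error = target - y
        # Weight update rule: adjust each weight by error times its input.
        w = [wi + lr * error * xi for wi, xi in zip(w, x)]
        b += lr * error

print(w, b)
print([heaviside(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in data])
# -> [0, 0, 0, 1], since AND is linearly separable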

Adaline (Adaptive Linear Neuron):


A network with a single linear unit is called Adaline (Adaptive Linear Neuron). A unit with a linear
activation function is called a linear unit. In Adaline there is only one output unit, and the output
values are bipolar (+1, -1). The weights between the input units and the output unit are adjustable.

The learning rule minimizes the mean squared error between the activation and the target values.
Adaline has trainable weights: it compares the actual output with the calculated output, and based on
the error a training algorithm is applied.
Workflow:

Adaline
First, calculate the net input to the Adaline network, then apply the activation function to obtain the
output, and compare it with the target output. If the two are equal, emit the output; otherwise send
the error back through the network and update the weights according to the error, which is calculated
by the delta learning rule.

Architecture:

Adaline
In Adaline, all the input neurons are directly connected to the output neuron through weighted
connections. A bias b with a fixed activation of 1 is also present.
Note: Bias in a neural network plays a key role in helping the network learn and improves accuracy.
(A minimal sketch of the Adaline update follows below.)
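
Below is a minimal sketch of Adaline's delta (LMS) rule as described above: the error is taken between the target and the net input (linear activation), and the net input is thresholded to a bipolar value only at prediction time. The bipolar AND data and hyperparameters are illustrative.

# Minimal Adaline sketch with bipolar (+1/-1) targets and the delta (LMS) rule.
def train_adaline(data, lr=0.1, epochs=30):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in data:
            net = sum(wi * xi for wi, xi in zip(w, x)) + b   # net input (linear)
            error = t - net          # delta rule: error against the net input
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Made-up bipolar AND data: inputs and targets are +1 / -1.
data = [([-1, -1], -1), ([-1, 1], -1), ([1, -1], -1), ([1, 1], 1)]
w, b = train_adaline(data)

# At prediction time the net input is thresholded to a bipolar output.
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
print([predict(x) for x, _ in data])   # [-1, -1, -1, 1]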

Multi-Layer Neural Network


To be precise, a fully connected multi-layered neural network is known as a Multi-Layer Perceptron
(MLP). A multi-layered neural network consists of multiple layers of artificial neurons or nodes.
Unlike single-layer networks, most networks in recent times are multi-layered. The following
diagram is a visualization of a multi-layer neural network.

Explanation: Here the nodes marked "1" are known as bias units. The leftmost layer, Layer 1, is the
input layer; the middle layer, Layer 2, is the hidden layer; and the rightmost layer, Layer 3, is the
output layer. The diagram has 3 input units (leaving out the bias unit), 1 output unit, and 4 hidden
units (the bias unit is not included).

A multi-layered neural network is a typical example of a feed-forward neural network. The number
of neurons and the number of layers are hyperparameters of the network and need tuning. To find
ideal values for these hyperparameters, one must use cross-validation techniques. Weight adjustment
during training is carried out using the back-propagation technique.

Gradient Descent
Gradient descent is an optimization algorithm used to minimize some function by iteratively moving
in the direction of steepest descent, as defined by the negative of the gradient.

Let us understand gradient descent in a more practical way.
Consider the figure below in the context of a cost function. Our goal is to move from the mountain
in the top corner (high cost) to the dark blue sea at the bottom (low cost). The arrows denote the
direction of steepest descent (negative gradient) from any given point, the direction that decreases
the cost function as quickly as possible.

Starting from the top of the mountain, we take a step downwards in the direction specified by the
negative gradient, towards the minimum point. We then recalculate the negative gradient and take
another step in the direction it specifies. We continue this process iteratively until we get to the
bottom of the graph, which is a local minimum.

Learning rate
The size of these steps towards the minimum point is called the learning rate. With a high learning
rate we can cover more ground with each step, but we risk overshooting the lowest point, since the
slope of the hill is constantly changing. With a very low learning rate, we can confidently move in
the direction of the negative gradient, since we recalculate it very frequently. A low learning rate is
more accurate, but calculating the gradient so often is time-consuming, so it will take a very long
time to reach the bottom. (A minimal sketch follows below.)
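
Below is a minimal one-dimensional sketch using the made-up cost f(w) = (w - 3)^2, whose gradient is 2(w - 3) and whose minimum is at w = 3; it shows how the learning rate scales each step.

# Minimal 1-D gradient descent sketch on the made-up cost f(w) = (w - 3)^2.
def gradient(w):
    return 2 * (w - 3)     # derivative of (w - 3)^2

w = 10.0                   # start high up the "mountain"
learning_rate = 0.1        # step size toward the minimum

for step in range(50):
    w = w - learning_rate * gradient(w)   # step along the negative gradient

print(round(w, 4))         # close to 3.0, the bottom of the curve
# With learning_rate = 1.1 the steps overshoot and w diverges;
# with learning_rate = 0.001 convergence is accurate but very slow.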

Cost function
A loss function, or cost function, tells us how accurate the model's predictions are for a given set of
parameters. The cost function has its own curve and its own gradients. The slope of this curve tells
us how to update the parameters to make the model more accurate.

Delta Rule

If a set of data points can be separated into two groups using a straight line, the data is said to be
linearly separable. Non-linearly separable data is defined as data points that cannot be split into two
groups using a straight line.

Figure (a) -> Training Set is Linearly Separable


Figure (b) -> Training Set is non-linearly Separable

When the training instances are linearly separable, the perceptron algorithm finds a successful weight
vector; however, if the examples are not linearly separable, it may fail to converge.
The delta rule, a second training rule, is meant to address this challenge.
The delta rule converges toward a best-fit approximation to the target concept even if the training
instances are not linearly separable.

Delta Rule’s Main Idea:


The Delta rule’s main idea is to explore the hypothesis space of potential weight vectors using
gradient descent to discover the weights that best suit the training instances.

This idea is significant because the BACKPROPAGATION algorithm, which can train networks
with many interconnected units, is based on gradient descent.

Backpropagation:
• In machine learning, backpropagation is an effective algorithm used to train artificial neural
networks, especially feed-forward neural networks.
• Backpropagation is an iterative algorithm that helps to minimize the cost function by
determining which weights and biases should be adjusted. During every epoch, the model
learns by adapting the weights and biases to minimize the loss, moving down along the
gradient of the error. It is therefore paired with an optimization algorithm such as gradient
descent or stochastic gradient descent.
• Computing the gradient in the backpropagation algorithm helps to minimize the cost function,
and it is implemented using the chain rule from calculus to navigate through the layers of the
neural network.

A simple illustration of how backpropagation works by adjusting weights

Advantages of Using the Backpropagation Algorithm in Neural Networks


Backpropagation, a fundamental algorithm in training neural networks, offers several advantages that
make it a preferred choice for many machine learning tasks. Here, we discuss some key advantages
of using the backpropagation algorithm:
1. Ease of Implementation: Backpropagation does not require prior knowledge of neural
networks, making it accessible to beginners. Its straightforward nature simplifies the
programming process, as it primarily involves adjusting weights based on error derivatives.
2. Simplicity and Flexibility: The algorithm’s simplicity allows it to be applied to a wide range
of problems and network architectures. Its flexibility makes it suitable for various scenarios,
from simple feedforward networks to complex recurrent or convolutional neural networks.
3. Efficiency: Backpropagation accelerates the learning process by directly updating weights
based on the calculated error derivatives. This efficiency is particularly advantageous in
training deep neural networks, where learning features of a function can be time-consuming.
4. Generalization: Backpropagation enables neural networks to generalize well to unseen data
by iteratively adjusting weights during training. This generalization ability is crucial for
developing models that can make accurate predictions on new, unseen examples.
5. Scalability: Backpropagation scales well with the size of the dataset and the complexity of
the network. This scalability makes it suitable for large-scale machine learning tasks, where
training data and network size are significant factors.
In conclusion, the backpropagation algorithm offers several advantages that contribute to its
widespread use in training neural networks. Its ease of implementation, simplicity, efficiency,
generalization ability, and scalability make it a valuable tool for developing and training neural
network models for various machine learning applications.
Working of Backpropagation Algorithm
The Backpropagation algorithm works in two passes:
• Forward pass
• Backward pass

How does the forward pass work?


• In the forward pass, the input is initially fed into the input layer. The inputs, as raw data, form
the starting point for training the neural network.
• The inputs and their corresponding weights are passed to the hidden layer. The hidden layer
performs computation on the data it receives. If there are two hidden layers in the network, as
in the illustration figure with hidden layers h1 and h2, the output of h1 is used as the input of
h2. The bias is added before applying the activation function.
• In each hidden layer, the activation function is applied to the weighted sum of inputs for each
of its neurons. One commonly used activation function is ReLU, which returns the input if it
is positive and zero otherwise. This introduces non-linearity into the model, which enables the
network to learn complex relationships in the data. Finally, the weighted outputs from the last
hidden layer are fed into the output layer to compute the final prediction; this layer often uses
the softmax activation function, which converts the weighted outputs into probabilities for
each class.

The forward pass using weights

How does the backward pass work?


• In the backward pass, the error is transmitted back through the network, which helps the
network improve its performance by learning and adjusting its internal weights.
• To find the error generated by the forward pass, we can use one of the most commonly used
methods, the mean squared error, which is based on the difference between the predicted
output and the desired output:

Mean squared error = (predicted output - actual output)^2

• Once we have done the calculation at the output layer, we then propagate the error backward
through the network, layer by layer.
• The key calculation during the backward pass is determining the gradient for each weight and
bias in the network. This gradient tells us how much each weight/bias should be adjusted to
minimize the error in the next forward pass. The chain rule is applied iteratively to calculate
these gradients efficiently.

• In addition to gradient calculation, the activation function plays a crucial role in
backpropagation: the gradients are calculated with the help of the derivative of the activation
function. (A minimal sketch of one forward and backward pass follows below.)
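
Below is a minimal sketch of one forward pass and one backward pass for a tiny 2-2-1 network with sigmoid activations and the squared error above; all weights and inputs are made-up numbers, and bias terms are omitted for brevity.

# Minimal sketch of one forward and backward pass for a tiny 2-2-1 network
# with sigmoid activations and squared error (biases omitted for brevity).
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x = [0.5, 0.1]                 # input example (made-up)
target = 1.0
w_hidden = [[0.4, -0.2],       # weights into hidden neuron h1
            [0.3, 0.8]]        # weights into hidden neuron h2
w_out = [0.6, -0.5]            # weights from hidden layer to output
lr = 0.5                       # learning rate

# ---- forward pass ----
h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
y = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))
error = (target - y) ** 2      # squared error, as in the formula above

# ---- backward pass (chain rule) ----
# dE/dy = 2(y - target); dy/dnet = y(1 - y) for the sigmoid.
delta_out = 2 * (y - target) * y * (1 - y)
grad_out = [delta_out * hi for hi in h]          # gradient per output weight
# Propagate the error one layer back through w_out.
delta_hidden = [delta_out * w * hi * (1 - hi) for w, hi in zip(w_out, h)]
grad_hidden = [[d * xi for xi in x] for d in delta_hidden]

# ---- weight updates (one gradient descent step) ----
w_out = [w - lr * g for w, g in zip(w_out, grad_out)]
w_hidden = [[w - lr * g for w, g in zip(row, grow)]
            for row, grow in zip(w_hidden, grad_hidden)]
print(f"error before update: {error:.4f}")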

Convergence:
• The BACKPROPAGATION technique uses gradient descent search to reduce the error E
between the training example target values and the network outputs by iteratively adjusting
the network weights.
• For multilayer networks, gradient descent can become caught in any of the many local minima
that exist on the error surface. As a result, BACKPROPAGATION over multilayer networks
is only guaranteed to converge to a local minimum of E, not to the global minimum error.
• Consider that networks with many weights correspond to error surfaces in very high-
dimensional spaces (one dimension per weight).
• When gradient descent reaches a local minimum with respect to one of these weights, it is not
necessarily at a local minimum with respect to the other weights.
• In fact, the more dimensions the network has, the more "escape routes" there are for gradient
descent to fall away from a local minimum with respect to any single weight.
• For a second perspective on local minima, consider how the network weights evolve as the
number of training iterations rises.
• If the network weights are initialized to values near zero, then during the early gradient descent
steps the network will represent a very smooth, nearly linear function of its inputs.
• This is because when the weights are close to zero, the sigmoid threshold function is
approximately linear.
• Only after the weights have had time to grow will they be able to represent highly nonlinear
network functions.

Backpropagation – Generalization
The goal of backpropagation is to obtain the partial derivatives of the cost function C with respect to
each weight w and bias b in the network. Multilayer Perceptrons (Artificial Neural Networks) use
this supervised learning approach.

Why Generalization?
What is an appropriate condition for terminating the weight update loop?

– One option is to keep training until the error E on the training examples drops below a certain level.
– This is a poor strategy, since BACKPROPAGATION is prone to overfitting the training instances
at the expense of generalization accuracy over unseen cases.

The figure below depicts this difference for two typical BACKPROPAGATION applications.
Consider the top plot in this diagram.

The lower of the two lines shows the error E over the training set decreasing monotonically as the
number of gradient descent iterations increases. The upper line shows the error E measured over a
different validation set of examples, distinct from the training examples.

This line measures the network's generalization accuracy: how well it fits examples beyond the
training data.

The graph:
For two independent robot perception tasks, the plots show the error E as a function of the number
of weight updates. In both learning cases, the error E over the training examples decreases
monotonically as gradient descent minimizes this measure of error.

The error over the separate "validation" set of examples typically decreases at first, then may
increase afterward due to overfitting of the training examples.

The network most likely to generalize correctly to unseen data is the one with the lowest error over
the validation set. However, one must be careful not to terminate training too soon, when the
validation set error first begins to increase, as shown in the second plot.

How does it work?


• Even while the error over the training instances continues to decrease, the error assessed over
the validation examples initially declines, then climbs.
• This happens because the weights are being adjusted to fit idiosyncrasies of the training
instances that are not representative of the whole distribution.
• Overfitting is therefore more of a problem in later iterations than in earlier ones.
• The weights of the network are initially set to small random values. With such small weights,
only very smooth decision surfaces can be described.
• As training progresses, some weights grow in order to reduce the error over the training data,
and the complexity of the learned decision surface increases with the number of weight-tuning
iterations.

Convergence and local minima


Backpropagation is a multi-layer algorithm: in multi-layer neural networks it can go back and adjust
the weights of earlier layers.
All neurons are interconnected, and information converges so that it is passed on to every neuron in
the network.
Using the backpropagation algorithm, we minimize the errors by modifying the weights. This
minimization of errors is guaranteed only locally, not globally.
