Unit 2 Machine Learning Aktu
Unit 2
DECISION TREE LEARNING - Decision tree learning algorithm- Inductive bias- Issues in Decision
tree learning; ARTIFICIAL NEURAL NETWORKS – Perceptrons, Gradient descent and the Delta
rule, Adaline, Multilayer networks, Derivation of backpropagation rule, Backpropagation Algorithm,
Convergence, Generalization.
1. Supervised Learning
Supervised learning is a type of machine learning in which machines are trained using well
"labelled" training data and, on the basis of that data, predict the output. Labelled data means
that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as the supervisor that teaches
the machine to predict the output correctly. It applies the same concept as a student learning under the
supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function to map
the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.
• Key Characteristics:
o Labeled Data: Requires a dataset where the correct output (label) is known.
o Goal: Learn a function that maps inputs to outputs, minimizing the difference between
the predicted and actual outputs.
• Common Tasks:
o Classification: Assigning inputs to one of several predefined categories (e.g., email
spam detection, digit recognition).
o Regression: Predicting a continuous output value (e.g., house price prediction, stock
market forecasting).
• Algorithms:
o Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Decision
Trees, Random Forests, k-Nearest Neighbors (k-NN), Neural Networks.
In supervised learning, models are trained using a labelled dataset, from which the model learns about each
type of data. Once the training process is completed, the model is tested on test data (a labelled
portion of the dataset that is held out and not used for training), and then it predicts the output.
The working of supervised learning can be understood through the following example:
Machine Learning Unit 2
Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and
other polygons. The first step is to train the model on each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
After training, we test our model using the test set, and the task of the model is to identify the
shape.
The machine has already been trained on all types of shapes, so when it finds a new shape, it classifies the
shape on the basis of its number of sides and predicts the output.
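The shape example above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the notes: each shape is described by two made-up features (number of sides, and 1 or 0 for whether all sides are equal), and the "model" is a simple 1-nearest-neighbour classifier that returns the label of the closest labelled example.

```python
# Labelled training data: (number_of_sides, all_sides_equal) -> label.
# The feature encoding is an assumption made for this sketch.
training_data = [
    ((4, 1), "square"),
    ((4, 0), "rectangle"),
    ((3, 1), "triangle"),
    ((3, 0), "triangle"),
    ((6, 1), "hexagon"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(features):
    """Return the label of the closest labelled training example (1-NN)."""
    _, label = min(training_data, key=lambda pair: squared_distance(pair[0], features))
    return label


# Test phase: shapes presented to the model without their labels.
test_shapes = [(4, 1), (3, 0), (6, 1)]
predictions = [predict(shape) for shape in test_shapes]
print(predictions)
```

The labels in `training_data` play the role of the supervisor: without them, `predict` would have nothing to return.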
2. Unsupervised Learning
As the name suggests, unsupervised learning is a machine learning technique in which models are not
supervised using a labelled training dataset. Instead, the models themselves find the hidden patterns and
insights in the given data. It can be compared to the learning that takes place in the human brain while
learning new things. It can be defined as:
“Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision”.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images
of different types of cats and dogs. The algorithm is never given labels for this dataset, which means
it has no prior idea which images show cats and which show dogs. The task of the unsupervised learning
algorithm is to identify the image features on its own. The algorithm will perform
this task by clustering the image dataset into groups according to the similarities between images.
Unsupervised learning involves training a model on data without labeled outputs. The goal is to
discover the underlying structure or distribution in the data.
• Key Characteristics:
o Unlabeled Data: The model works with data where the correct output is unknown.
o Goal: Identify patterns, groupings, or representations of the data.
• Common Tasks:
o Clustering: Grouping similar data points together (e.g., customer segmentation, image
compression).
o Dimensionality Reduction: Reducing the number of variables under consideration
(e.g., Principal Component Analysis (PCA)).
o Anomaly Detection: Identifying unusual or rare items in data (e.g., fraud detection).
• Algorithms:
o k-Means, Hierarchical Clustering, DBSCAN, PCA, Autoencoders, Gaussian Mixture
Models (GMMs).
Below are some of the main reasons why unsupervised learning is important:
o Unsupervised learning is helpful for finding useful insights in the data.
o Unsupervised learning is similar to the way a human learns to think from their own experiences,
which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it applicable
where labels are not available.
o In the real world, we do not always have input data with the corresponding output, so to solve such
cases, we need unsupervised learning.
The working of unsupervised learning can be understood as follows:
Here, we have taken unlabeled input data, which means it is not categorized and the corresponding
outputs are not given. This unlabeled input data is fed to the machine learning model in
order to train it. First, the model interprets the raw data to find the hidden patterns in it, and then
a suitable algorithm is applied, such as k-means clustering, PCA, or an autoencoder.
Once applied, the algorithm divides the data objects into groups according to
the similarities and differences between the objects.
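The grouping step described above can be illustrated with a small k-means sketch. The data points, the choice of two clusters, and the way the initial centroids are picked (first and last point) are all assumptions made for this example; practical implementations usually choose initial centroids more carefully.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(c)  # keep the centroid of an empty cluster
        centroids = new_centroids
    return centroids, clusters


# Unlabelled data: two visibly separate groups (made-up values).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
# Initial centroids are simply the first and last points (an assumption).
centroids, clusters = kmeans(points, centroids=[points[0], points[-1]])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge only from the distances between the points.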
Comparison of supervised and unsupervised learning:
• Closeness to true Artificial Intelligence:
o Supervised learning: Not close to true Artificial Intelligence, as we first train the model on
each piece of data, and only then can it predict the correct output.
o Unsupervised learning: Closer to true Artificial Intelligence, as it learns in the way
a child learns daily routine things from experience.
• Algorithms:
o Supervised learning: Linear Regression, Logistic Regression, Support Vector Machine,
Multi-class Classification, Decision Tree, Bayesian Logic, etc.
o Unsupervised learning: Clustering (e.g., k-means) and the Apriori algorithm.
3. Reinforcement Learning
Reinforcement learning (RL) involves training an agent
to make a sequence of decisions by interacting with an
environment. The agent receives feedback in the form
of rewards or penalties based on the actions it takes, and
the goal is to learn a policy that maximizes cumulative
rewards.
o Reinforcement Learning is a feedback-based
Machine learning technique in which an agent
learns to behave in an environment by
performing the actions and seeing the results of
actions. For each good action, the agent gets
positive feedback, and for each bad action, the agent gets negative feedback or penalty.
o In Reinforcement Learning, the agent learns automatically from feedback without any
labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential and the goal is
long-term, such as game-playing, robotics, etc.
o The agent learns which actions lead to positive feedback or rewards and which actions lead
to negative feedback or penalties. As a positive reward, the agent gets a positive point, and as a
penalty, it gets a negative point.
• Key Characteristics:
o Environment Interaction: The model learns by interacting with an environment and
receiving feedback.
o Goal: Learn a policy that maximizes long-term rewards.
• Common Tasks:
o Game Playing: Training agents to play games like chess, Go, or video games.
o Robotics: Enabling robots to perform tasks like navigation or manipulation.
o Self-driving Cars: Teaching autonomous vehicles to make driving decisions.
• Algorithms:
o Q-Learning, Deep Q-Networks (DQN), Policy Gradients
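As a rough illustration of how an agent learns a policy from rewards alone, here is a minimal tabular Q-learning sketch. The environment (five states on a line, with a reward of 1 for reaching the rightmost state) and all parameter values are assumptions invented for this example.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = [0, 1]      # 0 = move left, 1 = move right


def step(state, action):
    """Environment: return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    if next_state == N_STATES - 1:
        return next_state, 1.0, True     # reward for reaching the goal
    return next_state, 0.0, False


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(100):                    # cap on episode length
            # Epsilon-greedy: explore sometimes, otherwise act greedily
            # (ties between the two actions are broken at random).
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) toward r + gamma * max Q(s', a').
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy action in each non-terminal state (1 means "move right").
policy = [0 if q[0] > q[1] else 1 for q in q_table[:-1]]
print(policy)
```

The agent is never told that moving right is correct; it only receives the reward at the goal, and the update rule propagates that reward back to earlier states.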
4. Semi-Supervised Learning
Semi-supervised learning is an important category that lies between supervised and
unsupervised machine learning. It operates on data that contains a few labels but
consists mostly of unlabeled data.
• Key Characteristics:
o Combination of Labeled and Unlabeled Data: Leverages both types of data to
improve learning efficiency.
o Goal: Achieve better performance than purely supervised learning with limited labeled
data.
• Common Tasks:
o Often used in scenarios where obtaining labels is difficult, such as text classification,
image classification, and speech recognition.
• Algorithms:
o Semi-Supervised Support Vector Machines (S3VM), Generative models, Label
Propagation, Self-training.
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems.
o The CART algorithm creates only binary splits, and it uses the Gini index to choose those
splits.
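To make the Gini index concrete, the sketch below computes the Gini impurity of a set of labels, 1 minus the sum of squared class proportions, and uses it to pick a binary split on one numeric feature. The data (`ages`, `buys`) and the helper names are made up for illustration; this is not a full CART implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_of_split(left, right):
    """Weighted Gini impurity of a binary split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)


def best_threshold(values, labels):
    """Try each midpoint between sorted feature values; keep the lowest Gini."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = gini_of_split(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up data: one numeric feature and a binary class label.
ages = [22, 25, 30, 35, 40, 50]
buys = ["no", "no", "no", "yes", "yes", "yes"]
print(gini(buys))                  # impurity before splitting
print(best_threshold(ages, buys))  # (threshold, weighted Gini after the split)
```

A lower weighted Gini value means purer child nodes, so the split with the lowest value is preferred.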
The term "Artificial Neural Network" is derived from the biological neural networks that make up the
structure of the human brain. Just as the human brain has neurons interconnected with one another,
artificial neural networks have neurons that are interconnected with one another in the various layers
of the network. These neurons are known as nodes.
Biological Neuron        Artificial Neuron
Dendrite                 Inputs
Cell nucleus or soma     Nodes
Synapses                 Weights
Axon                     Output
Structure: The structure of artificial neural networks is inspired by biological neurons. A biological
neuron has a cell body or soma to process the impulses, dendrites to receive them, and an axon that
transfers them to other neurons.
Synapses: Synapses are the links between biological neurons that enable the transmission of impulses
from the axon of one neuron to the dendrites of the next. In artificial neural networks, the weights
that join the nodes of one layer to the nodes of the next layer play the role of synapses.
Learning: In biological neurons, learning happens in the cell body (soma), whose
nucleus helps to process the impulses. If the impulses are powerful enough to reach the threshold,
an action potential is produced and travels along the axon.
Activation: In biological neurons, activation is the firing rate of the neuron, which happens when the
impulses are strong enough to reach the threshold. In artificial neural networks, a mathematical
function known as an activation function maps the input to the output.
Marketing and Sales: When you log onto e-commerce sites like Amazon and Flipkart, they
recommend products for you to buy based on your previous browsing history. Similarly, if you
love pasta, then Zomato, Swiggy, etc. will show you restaurant recommendations based on your tastes
and previous order history. The same is true across new-age marketing segments such as book sites, movie
services, and hospitality sites, and it is done by implementing personalized marketing.
Personal Assistants: You have probably heard of Siri, Alexa, Cortana, etc., and may have used
one of them, depending on the phone you have. These are personal assistants and an example of speech
recognition that uses Natural Language Processing to interact with the users and formulate a response
accordingly.
Perceptron
Perceptron is one of the simplest Artificial neural network architectures. It was introduced by Frank
Rosenblatt in 1957. It is the simplest type of feedforward neural network, consisting of a single layer
of input nodes that are fully connected to a layer of output nodes.
Types of Perceptron
• Single-Layer Perceptron: This type of perceptron is limited to learning linearly separable
patterns. It is effective for tasks where the data can be divided into two categories by a
straight line.
• Multilayer Perceptron: Multilayer perceptrons possess enhanced processing capabilities as
they consist of two or more layers, adept at handling more complex patterns and relationships
within the data.
Perceptron
A perceptron, the basic unit of a neural network, comprises essential components that collaborate in
information processing.
• Input Features: The perceptron takes multiple input features; each input feature represents a
characteristic or attribute of the input data.
• Weights: Each input feature is associated with a weight, determining the significance of each
input feature in influencing the perceptron’s output. During training, these weights are adjusted
to learn the optimal values.
• Summation Function: The perceptron calculates the weighted sum of its inputs using the
summation function. The summation function combines the inputs with their respective
weights to produce a weighted sum.
• Activation Function: The weighted sum is then passed through an activation function.
The perceptron uses the Heaviside step function, which takes the summed value as input,
compares it with the threshold, and provides the output as 0 or 1.
Note: The Heaviside step function H(x), also called the unit step function, is a discontinuous
function, whose value is zero for negative arguments x < 0 and one for positive arguments x > 0.
• Output: The final output of the perceptron is determined by the activation function’s result.
For example, in binary classification problems, the output might represent a predicted class (0
or 1).
• Bias: A bias term is often included in the perceptron model. The bias allows the model to make
adjustments that are independent of the input. It is an additional parameter that is learned
during training.
• Learning Algorithm (Weight Update Rule): During training, the perceptron learns by
adjusting its weights and bias based on a learning algorithm. A common approach is the
perceptron learning algorithm, which updates weights based on the difference between the
predicted output and the true output.
These components work together to enable a perceptron to learn and make predictions. While a single
perceptron can perform binary classification, more complex tasks require the use of multiple
perceptrons organized into layers, forming a neural network.
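The components listed above (weights, bias, summation, step activation, weight update rule) can be put together in a short sketch. It trains a single perceptron on the AND gate, which is linearly separable. The learning rate, the number of epochs, and the convention that a net input of exactly 0 gives output 0 are assumptions chosen for this example.

```python
def heaviside(z):
    """Step activation: 1 for positive net input, otherwise 0.
    Treating z == 0 as 0 is a convention chosen for this sketch."""
    return 1 if z > 0 else 0


def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return heaviside(net)


def train_perceptron(samples, learning_rate=1.0, epochs=20):
    """Perceptron learning rule: w <- w + lr * (target - output) * x."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# AND gate: output is 1 only when both inputs are 1.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
print(weights, bias)
print([predict(weights, bias, x) for x, _ in and_gate])
```

Weights change only when the prediction is wrong, so once every example is classified correctly the updates stop by themselves.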
Adaline's learning rule minimizes the mean squared error between the activation and the target values.
Adaline consists of trainable weights; it compares the calculated output with the target output, and the
training algorithm is applied based on the error.
Workflow:
Adaline
First, calculate the net input to the Adaline network, then apply the activation function to it,
and then compare the result with the target output. If the two are equal, give the output; otherwise
send the error back to the network and update the weights according to the error, as calculated by
the delta learning rule.
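The workflow above can be sketched as follows. This version follows the usual formulation of the delta rule, in which the error is measured on the linear net input and every example produces an update proportional to that error (the update is simply zero when the error is zero). The OR gate with bipolar values, the learning rate, and the epoch count are assumptions made for the example.

```python
def net_input(weights, bias, inputs):
    """Linear activation: weighted sum of the inputs plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def mean_squared_error(weights, bias, samples):
    """Average squared difference between target and net input."""
    return sum((t - net_input(weights, bias, x)) ** 2 for x, t in samples) / len(samples)


def train_adaline(samples, learning_rate=0.01, epochs=200):
    """Delta rule: w <- w + lr * (target - net) * x, applied per example."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - net_input(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def classify(weights, bias, inputs):
    """Threshold the linear output to obtain a bipolar class label."""
    return 1 if net_input(weights, bias, inputs) >= 0 else -1


# OR gate with bipolar inputs and targets (-1 = false, +1 = true).
or_gate = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), 1)]
weights, bias = train_adaline(or_gate)
print(weights, bias)
print(mean_squared_error([0.0, 0.0], 0.0, or_gate), mean_squared_error(weights, bias, or_gate))
print([classify(weights, bias, x) for x, _ in or_gate])
```

Unlike the perceptron rule, the error here is computed before thresholding, which is what makes the mean squared error a smooth function of the weights.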
Architecture:
Adaline
In Adaline, every input neuron is directly connected to the output neuron through a weighted
path. A bias b is also present, whose input activation is always 1.
Note: Bias in a neural network plays a key role in helping the network learn and in improving accuracy.
Explanation: Here the nodes marked as “1” are known as bias units. The leftmost layer, or Layer 1,
is the input layer; the middle layer, or Layer 2, is the hidden layer; and the rightmost layer, or Layer 3,
is the output layer. We can say that the diagram has 3 input units (not counting the bias unit), 1
output unit, and 4 hidden units (again not counting the bias unit).
A Multi-layered Neural Network is a typical example of the Feed Forward Neural Network. The
number of neurons and the number of layers are hyperparameters of the neural network
that need tuning. To find good values for these hyperparameters, one typically uses
cross-validation. The weights are adjusted during training using the Back-Propagation technique.
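A feed-forward pass through a network with 3 input units, 4 hidden units, and 1 output unit can be sketched as below. The weight and bias values are arbitrary numbers chosen for illustration, and the sigmoid activation is an assumption; in practice the weights would be learned by backpropagation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# 3 inputs -> 4 hidden units -> 1 output unit; the numbers are arbitrary.
hidden_weights = [
    [0.2, -0.4, 0.1],
    [0.5, 0.3, -0.2],
    [-0.3, 0.8, 0.6],
    [0.1, 0.1, 0.1],
]
hidden_biases = [0.0, 0.1, -0.1, 0.2]
output_weights = [[0.7, -0.5, 0.2, 0.4]]
output_biases = [0.05]


def forward(inputs):
    """Information flows in one direction only: input -> hidden -> output."""
    hidden = layer_forward(inputs, hidden_weights, hidden_biases)
    output = layer_forward(hidden, output_weights, output_biases)
    return hidden, output


hidden, output = forward([1.0, 0.5, -1.0])
print(len(hidden), len(output))
print(output)
```

The layer sizes are fixed here by the shapes of the weight lists; changing the number of rows in `hidden_weights` is one of the hyperparameter choices mentioned above.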
Gradient Descent
Gradient descent is an optimization algorithm used to minimize or reduce some function by iteratively
moving in the direction of steepest descent or minimum point as defined by the negative of
the gradient.
Let us look at gradient descent in a more practical way.
Consider the figure below in the context of a cost
function. Our goal is to move from the mountain in
the top corner (high cost) to the dark blue sea at the
bottom (low cost). The arrows denote the direction
of steepest descent (negative gradient) from any
given point, the direction that decreases the cost
function as quickly as possible.
Learning rate
The size of these steps towards the minimum point is
called the learning rate. With a high learning rate,
we can cover more ground each step,
but we risk overshooting the lowest point since the slope
of the hill is constantly changing. With a very low
learning rate, we can confidently move
in the direction of the negative gradient since we
are recalculating it so frequently. A low learning
rate is more precise, but calculating the gradient
is time-consuming, so it will take us a very long
time to reach the bottom.
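The effect of the learning rate can be seen on a very simple cost function. The sketch below minimizes cost(w) = (w - 3)^2, whose minimum is at w = 3, with three different learning rates. The function, the starting point, and the step count are all assumptions chosen to keep the example small.

```python
def cost(w):
    """A simple bowl-shaped cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the cost with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(learning_rate, steps=25, start=0.0):
    """Repeatedly step against the gradient: w <- w - lr * dcost/dw."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


for lr in (0.01, 0.1, 1.1):
    w = gradient_descent(lr)
    print(lr, w, cost(w))
```

With the small rate the result is still far from 3 after 25 steps, with the moderate rate it is close to 3, and with the large rate every step overshoots the minimum by more than the previous one, so the value moves away from 3.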
Cost function
A loss function, or cost function, tells us how good the model's predictions are for a given set of
parameters. The cost function has its own curve and its own gradients. The slope of this curve tells us
how to update the parameters to make the model more accurate.
Delta Rule
If a set of data points can be separated into two groups using a straight line, the data is said to be
linearly separable. Non-linearly separable data is defined as data points that cannot be split into two
groups using a straight line.
When the training instances are linearly separable, the perceptron algorithm finds a successful weight
vector; however, if the examples are not linearly separable, it may fail to converge.
The delta rule, a second training rule, is meant to address this challenge.
The delta rule converges toward a best-fit approximation to the target concept even if the training
instances are not linearly separable.
This rule is significant because it uses gradient descent, which is also the basis of the
BACKPROPAGATION algorithm for training networks with many interconnected units.
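The link between the delta rule and gradient descent can be written out for a single linear unit. The notation here is an assumption introduced for this sketch: D is the set of training examples, t_d and o_d are the target and the unit's output for example d, x_{id} is the i-th input of example d, and \eta is the learning rate.

```latex
% Squared error of a linear unit with output o_d = \mathbf{w} \cdot \mathbf{x}_d
E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

% Gradient with respect to one weight
\frac{\partial E}{\partial w_i}
  = \sum_{d \in D} (t_d - o_d)\,
    \frac{\partial}{\partial w_i}\bigl(t_d - \mathbf{w} \cdot \mathbf{x}_d\bigr)
  = -\sum_{d \in D} (t_d - o_d)\, x_{id}

% Gradient descent step: move against the gradient
\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}
           = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}
```

Because E is a smooth function of the weights even when the examples are not linearly separable, this update always has a well-defined direction of descent.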
Backpropagation:
• In machine learning, backpropagation is an effective algorithm used to train artificial neural
networks, especially in feed-forward neural networks.
• Backpropagation is an iterative algorithm that helps to minimize the cost function by
determining how the weights and biases should be adjusted. During every epoch, the model
learns by adapting the weights and biases to reduce the loss, moving in the direction of the
negative gradient of the error. It is therefore used together with an optimization algorithm such
as gradient descent or stochastic gradient descent.
• Computing the gradient in the backpropagation algorithm helps to minimize the cost
function and it can be implemented by using the mathematical rule called chain rule from
calculus to navigate through complex layers of the neural network.
• Once we have done the calculation at the output layer, we then propagate the error backward
through the network, layer by layer.
• The key calculation during the backward pass is determining the gradients for each weight and
bias in the network. This gradient is responsible for telling us how much each weight/bias
should be adjusted to minimize the error in the next forward pass. The chain rule is used
iteratively to calculate this gradient efficiently.
• In addition to gradient calculation, the activation function also plays a crucial role in
backpropagation: the gradients are calculated with the help of the derivative of the
activation function.
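The forward pass, backward pass, and weight update described above can be sketched for a tiny network trained on the XOR problem. Everything specific here is an assumption for illustration: the 2-3-1 layout, sigmoid activations, squared-error loss, per-example (stochastic) updates, the learning rate, and the random seed. The sketch only records how the loss changes; it does not guarantee that any particular run reaches zero error.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_xor(hidden_units=3, learning_rate=0.5, epochs=5000, seed=1):
    rng = random.Random(seed)
    data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
    # w_hidden[j] = [weight from x1, weight from x2, bias] for hidden unit j
    w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden_units)]
    # w_out = one weight per hidden unit, followed by the output bias
    w_out = [rng.uniform(-1, 1) for _ in range(hidden_units + 1)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for (x1, x2), target in data:
            # Forward pass
            h = [sigmoid(w[0] * x1 + w[1] * x2 + w[2]) for w in w_hidden]
            y = sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + w_out[-1])
            total += 0.5 * (target - y) ** 2
            # Backward pass (chain rule); the sigmoid derivative is s * (1 - s)
            delta_out = (y - target) * y * (1.0 - y)
            delta_hidden = [delta_out * w_out[j] * h[j] * (1.0 - h[j])
                            for j in range(hidden_units)]
            # Gradient descent updates for every weight and bias
            for j in range(hidden_units):
                w_out[j] -= learning_rate * delta_out * h[j]
                w_hidden[j][0] -= learning_rate * delta_hidden[j] * x1
                w_hidden[j][1] -= learning_rate * delta_hidden[j] * x2
                w_hidden[j][2] -= learning_rate * delta_hidden[j]
            w_out[-1] -= learning_rate * delta_out
        losses.append(total / len(data))
    return losses


losses = train_xor()
print(losses[0], losses[-1])  # average loss in the first and last epoch
```

Note that `delta_hidden` is computed from the output-layer error before any weights are changed, which is the "propagate the error backward, layer by layer" step.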
Convergence:
• The BACKPROPAGATION technique uses a gradient descent search through the space of possible
network weights, iteratively reducing the error E between the training example target values and
the network outputs.
• Gradient descent can become caught in any of the many possible local minima that exist on
the error surface for multilayer networks. As a result, BACKPROPAGATION over multilayer
networks is only guaranteed to converge toward some local minimum in E, not necessarily to the
global minimum error.
• Consider that networks with large numbers of weights correspond to error surfaces in very
high-dimensional spaces (one dimension per weight).
• When gradient descent falls into a local minimum with respect to one of these weights, it is not
necessarily in a local minimum with respect to the other weights.
• In fact, the more weights in the network, the more dimensions that might provide “escape routes”
for gradient descent to fall away from the local minimum with respect to this single weight.
• For a second viewpoint on local minima, consider how the network weights evolve as the number
of training iterations increases.
• If the network weights are initialized to values near zero, then during the early gradient descent
steps the network will represent a very smooth function that is approximately linear in its inputs.
• This is because the sigmoid threshold function is itself approximately linear when the weights
are close to zero.
• Only after the weights have had time to grow will they be able to represent highly nonlinear
network functions.
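The claim that the sigmoid is almost linear for small inputs can be checked numerically. Near zero the sigmoid is well approximated by its first-order Taylor expansion, 0.5 + z/4; the test values of z below are arbitrary choices for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_approximation(z):
    """First-order Taylor expansion of the sigmoid around z = 0."""
    return 0.5 + z / 4.0


for z in (0.01, 0.1, 0.5, 2.0, 5.0):
    gap = abs(sigmoid(z) - linear_approximation(z))
    print(z, round(sigmoid(z), 4), round(linear_approximation(z), 4), round(gap, 4))
```

With small weights the net input z stays small, so each unit operates in the region where the gap is negligible; as the weights grow, z moves into the region where the sigmoid bends away from the straight line.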
Backpropagation – Generalization
The goal of backpropagation is to obtain the partial derivatives of the cost function C with respect to
each weight w and bias b in the network. Multilayer Perceptrons (Artificial Neural Networks) are
trained with this supervised learning approach.
Why Generalization?
What is an appropriate condition for the weight update loop to be terminated?
– One option is to keep training until the error E on the training examples drops below a certain level.
– This is a poor strategy, since BACKPROPAGATION is prone to overfitting the training instances at
the expense of generalization accuracy over unseen examples.
The figure below depicts this danger for two typical BACKPROPAGATION applications. Consider
the top plot in this figure.
As the number of gradient descent iterations increases, the lower of the two lines shows the error E
decreasing monotonically over the training set. The error E measured over a different validation set
of examples, distinct from the training examples, is shown on the top line.
This line represents the network’s generalization accuracy, or how well it matches examples outside
of the training data.
The graph:
For two independent robot perception tasks, the plots show the error E as a function of the number of
weight updates. As gradient descent minimizes this measure of error, the error E over the training
examples decreases monotonically in both learning cases.
The error over the separate “validation” set of examples typically decreases at first, then may
increase later due to overfitting the training examples.
The network with the lowest error over the validation set is the most likely to generalize appropriately
to unseen input. When the validation set error begins to increase, one must be careful not to terminate
training too soon, as shown in the second plot.
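One common way to act on this advice is early stopping with a "patience" setting: keep the weights that gave the lowest validation error so far, and stop only after the validation error has failed to improve for several checks in a row. The sketch below shows just the stopping logic; the list of validation errors is made up, and the `patience` parameter name is an assumption of this example.

```python
def best_stopping_point(validation_errors, patience=3):
    """Return (best_iteration, best_error) using early stopping with patience.

    Training is considered stopped once the validation error has failed to
    improve for `patience` consecutive checks; the weights saved at the best
    check are the ones to keep.
    """
    best_iteration, best_error = 0, float("inf")
    checks_without_improvement = 0
    for iteration, error in enumerate(validation_errors):
        if error < best_error:
            best_iteration, best_error = iteration, error
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break
    return best_iteration, best_error


# Made-up validation errors: a small early bump, a later true minimum,
# then a steady rise as the network starts to overfit.
validation_errors = [0.90, 0.70, 0.72, 0.71, 0.60, 0.50, 0.45, 0.47, 0.52, 0.58, 0.65]
print(best_stopping_point(validation_errors, patience=1))
print(best_stopping_point(validation_errors, patience=3))
```

With a patience of 1 the procedure stops at the first small bump and misses the later, lower minimum, which is the "terminating too soon" problem; a larger patience rides through the bump and finds the lower validation error.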