
UNIT-II

SUPERVISED LEARNING: Classification, Decision Trees – Univariate Tree – Multivariate Tree – Pruning, Bayesian Decision Theory, Parametric Methods – Maximum Likelihood Estimation – Evaluating an Estimator: Bias and Variance – The Bayes Estimator, Linear Discrimination – Gradient Descent – Logistic Discrimination – Logistic Regression, Multilayer Perceptron – Back Propagation Algorithm

Supervised Learning
• In supervised learning, the machine is trained on a set of labelled data, which means
that the input data is paired with the desired output. The machine then learns to predict
the output for new input data.
• Supervised learning is often used for tasks such as classification and regression.
• In supervised learning, each data point in the training data contains input variables (also
known as independent variables or features), and an output variable, or label.

1. Regression
Regression algorithms are used when the output variable is a continuous value. They model the relationship between the input variables and the output variable and are used for tasks such as weather forecasting and predicting market trends. Some popular regression algorithms that come under supervised learning are:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, meaning it takes one of a set of discrete classes such as Yes/No, Male/Female, or True/False. Popular classification algorithms include:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
For example, a labelled dataset of images of Elephant, Camel and Cow would have each image tagged with either "Elephant", "Camel", or "Cow".

Classification:
• The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and then classifies new observations into one of a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes are also called targets, labels, or categories.

• Classification: The process of sorting data into categories based on specific features or
characteristics.

• There are different types of classification problems depending on how many categories (or classes) we are working with and how they are organized. The main classification types in machine learning are:
1. Binary Classification
• This is the simplest kind of classification. In binary classification, the goal is to sort the
data into two distinct categories. Think of it like a simple choice between two options.
• Imagine a system that sorts emails into either spam or not spam. It works by looking
at different features of the email like certain keywords or sender details, and decides
whether it’s spam or not. It only chooses between these two options.
2. Multiclass Classification
• Here, instead of just two categories, the data needs to be sorted into more than two
categories. The model picks the one that best matches the input.
• Think of an image recognition system that sorts pictures of animals into categories
like cat, dog, and bird.
Basically, the machine looks at the features in the image (like shape, color, or texture) and chooses which animal the picture is most likely to show, based on the training it received.

3. Multi-Label Classification
In multi-label classification, a single piece of data can belong to multiple categories at once. Unlike multiclass classification, where each data point belongs to only one class, multi-label classification allows a data point to belong to multiple classes. A movie recommendation system could tag a movie as both action and comedy. The system checks various features (like the movie plot, actors, or genre tags) and assigns multiple labels to a single piece of data, rather than just one.

Learners in Classification Problems:


In the classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner case, classification is done on the basis of the most closely related data stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, case-based reasoning
2. Eager Learners: Eager Learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to Lazy learners, Eager Learner takes
more time in learning, and less time in prediction. Example: Decision Trees, Naive
Bayes, ANN.

Property | Lazy Learning | Eager Learning
Training Speed | Fast, simply stores the data while training | Slow, tries to learn from the data while training
Prediction Speed | Slow, tries to apply functions and learnings in the prediction stage | Faster, predicts very fast as there are pre-defined functions
Learning Scope | Medium, it can learn from the data while testing | Medium, it can learn from the data while training
Example | KNN | Linear Regression

Classification algorithms are widely used in many real-world applications across various
domains, including:
• Email spam filtering
• Credit risk assessment
• Medical diagnosis
• Sentiment analysis
• Fraud detection
• Recommendation systems

Classification Algorithms
For the implementation of any classification model it is essential to understand Logistic Regression, which is one of the most fundamental and widely used algorithms in machine learning for classification tasks. There are various types of classifier algorithms. Some of them are:
Linear Classifiers: Linear classifier models create a linear decision boundary between classes.
They are simple and computationally efficient. Some of the linear classification models are as
follows:
• Logistic Regression
• Support Vector Machines having kernel = ‘linear’
• Single-layer Perceptron
• Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers: Non-linear models create a non-linear decision boundary between classes. They can capture more complex relationships between the input features and the target variable. Some of the non-linear classification models are as follows:
• K-Nearest Neighbours
• Kernel SVM
• Naive Bayes
• Decision Tree Classification
• Ensemble learning classifiers (e.g., Random Forests)
• Multi-layer Artificial Neural Networks
Decision Trees:
• Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
• The decisions or the test are performed on the basis of features of the given dataset.

Below diagram explains the general structure of a decision tree:

Decision Tree Terminologies:

• Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further once a leaf node is reached.
• Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub-Tree: A subtree formed by splitting a node of the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: A node that splits into sub-nodes is called a parent node, and the sub-nodes are its child nodes.

How does the Decision Tree algorithm Work?


• In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree.
• The algorithm compares the value of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such final nodes are called leaf nodes.
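As a minimal illustration of this procedure, the sketch below fits a decision tree with scikit-learn on a tiny, made-up version of the job-offer data; the feature names and values are illustrative assumptions, not part of the original example.

# A minimal sketch: fitting and querying a decision tree with scikit-learn.
# The toy dataset (salary, distance, cab facility) is illustrative only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [salary_in_lakhs, distance_km, cab_facility (1/0)]
X = [[12, 5, 1], [6, 20, 0], [15, 25, 1], [7, 4, 0], [14, 18, 0], [5, 30, 1]]
y = ["Accept", "Decline", "Accept", "Decline", "Decline", "Decline"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Inspect the learned splits (the root attribute is chosen by the attribute selection measure)
print(export_text(tree, feature_names=["salary", "distance", "cab"]))

# Predict the class for a new candidate's offer
print(tree.predict([[13, 10, 1]]))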

Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accept offer and Decline offer). Consider the diagram below:
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). Using this measure, we can easily select the best attribute for the nodes of the tree. The two popular techniques for ASM are:

• Information Gain
• Gini Index
Information Gain (IG) is a metric used in decision tree algorithms (like ID3, C4.5, and
CART) to evaluate the effectiveness of a particular attribute in classifying data. It measures
the reduction in entropy (uncertainty or disorder) after a dataset is split based on a particular
attribute.
Entropy: A measure of uncertainty or disorder. If a dataset is perfectly pure (all
instances belong to the same class), the entropy is 0. If the classes are equally
distributed, the entropy is maximized (which is 1 for binary classification).
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
o S = the set of samples
o P(yes) = the proportion of "yes" samples in S
o P(no) = the proportion of "no" samples in S

Information Gain (IG): It measures the change in entropy after a dataset is split based on a
particular attribute. It is the difference between the original entropy of the dataset and the
weighted average entropy of the subsets resulting from the split.
The formula for Information Gain is:
Information Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
where S is the dataset, A is the attribute used for the split, and Sv is the subset of S for which attribute A takes value v; the second term is the weighted average entropy of the subsets created by the split.
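A short sketch of how entropy and information gain can be computed in plain Python; the 9-yes/5-no example counts and the split are illustrative.

import math

def entropy(labels):
    """Entropy of a list of class labels (0 for a pure set, 1 for a 50/50 binary split)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent_labels, subsets):
    """Entropy of the parent minus the weighted average entropy of the subsets."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Example: 9 "yes" / 5 "no" samples split by some attribute into two subsets
parent = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(entropy(parent))                   # about 0.940
print(information_gain(parent, split))   # about 0.048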

Gini Index
Decision trees are a popular machine learning algorithm used for both classification and
regression tasks. In classification, the goal is to predict which category a data point belongs to.
The Gini Index plays a crucial role in how decision trees decide how to split the data at each
step.
Here's the basic idea:
1. Measuring Impurity: The Gini Index measures the "impurity" of a set of data points.
A pure set means all data points belong to the same category. An impure set means the
data points are mixed across different categories.
2. Finding the Best Split: When building a decision tree, the algorithm needs to decide
which feature to use for splitting the data at each node. It calculates the Gini Index for
each possible split and chooses the one that results in the greatest reduction in impurity.
This means the split that creates the most "pure" subsets of data.
How it's calculated:
The Gini Index for a set of data points is calculated as:
Gini = 1 − (p1)² − (p2)² − … − (pc)²
where c is the number of categories and pi is the proportion of data points belonging to category i.
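A corresponding short sketch for the Gini Index; the label lists are illustrative.

def gini_index(labels):
    """Gini impurity: 0 for a pure set, larger for more evenly mixed classes."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini_index(["cat"] * 10))               # 0.0  (pure set)
print(gini_index(["cat"] * 5 + ["dog"] * 5))  # 0.5  (evenly mixed, two classes)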

Types of Decision Tree Algorithms


1.CART (Classification and Regression Trees)
• Used for both classification & regression.
• Uses Gini Index for classification.
• Uses Mean Squared Error (MSE) for regression.
• Forms binary splits (only two child nodes at each step).
2. ID3 (Iterative Dichotomiser 3)
• Used only for classification.
• Uses Entropy & Information Gain to decide splits.
• Splits can have multiple child nodes (not just binary).
• Works best with categorical data.

3. C4.5 (Successor of ID3)

• Improves ID3 by handling:


• Numerical data (by discretizing it).
• Missing values (by assigning probabilities).
• Pruning (to reduce overfitting).
• Uses Entropy & Information Gain Ratio to split nodes.
4. C5.0 (Improved C4.5)
• Faster and more memory-efficient than C4.5.
• Uses Boosting (combining multiple trees).
• Generates smaller trees for better generalization.
Used in: Commercial decision tree software.

Univariate Decision Trees


A univariate decision tree splits nodes based on a single feature at each step. This means that
each decision rule involves only one attribute (one dimension) of the data.
Characteristics:
• Splits are parallel to feature axes.

• Uses conditions like Xi ≤ V (for numerical data) or categorical rules (e.g., "Color =
Red").
• Simpler to interpret and computationally efficient.
• Commonly used in CART, ID3, and C4.5 algorithms.
Example:
If we have features (Age, Income, Credit Score), a univariate decision tree may use only one
feature at a time to make a split:
• If Age ≤ 30 → Go left
• If Age > 30 → Go right

Multivariate Decision Trees


A multivariate decision tree considers multiple features in a single split. Instead of using a
simple condition like Xi ≤ V it uses linear combinations of features to make decisions.

Characteristics:
• More flexible than univariate trees.
• Decision boundaries are not restricted to be parallel to axes (can be diagonal or
curved).
• Uses linear models (like Logistic Regression, SVM, PCA-based splits) to determine
the best split.
• More computationally expensive than univariate trees.
Example (Multivariate Split Rule):
A multivariate split uses a weighted combination of features, of the general form w1·X1 + w2·X2 + … + wd·Xd ≤ c. This allows complex decision boundaries like:

• If (0.3 × Age + 0.7 × Income) ≤ 50 → Go left
• If (0.3 × Age + 0.7 × Income) > 50 → Go right
Pruning

• Pruning is a technique used in decision trees to reduce overfitting and improve


generalization. It removes unnecessary branches from the tree, making it simpler and
more robust.
• Decision trees tend to overfit the training data by creating overly complex trees.
• Overfitting means the tree captures noise rather than true patterns.
• Pruning helps by reducing tree size and improving accuracy on unseen data.

Types of Pruning
1. Pre-Pruning (Early Stopping):
o Pre-pruning involves stopping the growth of a decision tree before it becomes
too complex and overfits the training data. This can be done by setting a
maximum tree depth, a minimum number of data points in a leaf node, or a
threshold for the information gain at each decision node. Pre-pruning is simple
and computationally efficient, but it may not capture complex relationships in
the data.
• Stops the tree from growing before it becomes too complex.
• Sets a limit on conditions like:
o Max depth (max_depth)
o Min samples per split (min_samples_split)
o Min samples per leaf (min_samples_leaf)
• Advantages:
o Faster training.
o Prevents excessive growth.
• Disadvantages:
o Might stop too early, missing some important patterns.

2. Post-Pruning (Pruning After Tree Growth):


o Post-pruning involves growing a decision tree to its maximum depth and then
removing the unnecessary branches that do not improve the model's
performance on the validation data. This can be done by calculating a measure
of impurity reduction or error rate reduction for each subtree and removing the
subtree that does not meet a certain criterion. Post-pruning is more
computationally intensive than pre-pruning, but it can capture complex
relationships and improve the accuracy of the model.
• First, grow the full tree (allowing overfitting).
• Then, remove unnecessary branches based on performance.
• Uses cost complexity pruning (CCP) in Scikit-learn.
• Advantages:
o More accurate than pre-pruning.
o Uses real performance to decide cuts.
• Disadvantages:
o Slower, since it grows a full tree first.
o Needs tuning.
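A hedged sketch of both pruning styles with scikit-learn, using the built-in breast-cancer dataset purely as stand-in data; the particular ccp_alpha picked from the pruning path is an arbitrary illustrative choice.

# A minimal sketch of pre-pruning and cost complexity (post) pruning with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: limit depth and minimum samples per split before the tree is grown
pre = DecisionTreeClassifier(max_depth=4, min_samples_split=10, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, then refit with a chosen penalty
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # an arbitrary middle value for illustration
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post.fit(X_train, y_train)

print("pre-pruned accuracy :", pre.score(X_test, y_test))
print("post-pruned accuracy:", post.score(X_test, y_test))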

Bayesian Decision Theory

• Bayesian Decision Theory is a probabilistic approach to decision-making under


uncertainty. It helps in minimizing risk and making optimal decisions by using
probability distributions and cost functions.
• Bayes' theorem is a fundamental concept in probability theory that plays a crucial role
in various machine learning algorithms, especially in the fields of Bayesian statistics
and probabilistic modelling.
• It provides a way to update probabilities based on new evidence or information. In the
context of machine learning, Bayes' theorem is often used in Bayesian inference and
probabilistic models.
• Bayes' theorem may be derived from the definition of conditional probability.
• Bayes' theorem is stated mathematically as the following equation:

P(A∣B) = [P(B∣A) × P(A)] / P(B)

where P(A∣B) is the posterior probability, P(B∣A) is the likelihood, P(A) is the prior probability, and P(B) is the evidence.
Example: Spam Email Classification
Consider a binary classification problem where we classify an email as Spam (S) or Not
Spam (N).
Step 1: Define Prior Probabilities
Assume from past data:
• P(S)=0.3 (30% of emails are spam)
• P(N)=0.7 (70% of emails are not spam)

Step 2: Compute Likelihood

• Suppose we analyze a word, say "discount", in an email:


• Probability of "discount" appearing in a spam email: P(W∣S) =0.8
• Probability of "discount" appearing in a non-spam email: P(W∣N) =0.2

Step 3: Compute Posterior Using Bayes' Theorem

P(S∣W) = P(W∣S) P(S) / [P(W∣S) P(S) + P(W∣N) P(N)]
= (0.8 × 0.3) / (0.8 × 0.3 + 0.2 × 0.7)
= 0.24 / 0.38 ≈ 0.63

Thus, if an email contains the word "discount," there is roughly a 63% probability that it is spam.
Step 4: Decision Making Using Loss Function
• A loss function in machine learning tells us how far our predictions are from the actual
values.
• In Bayesian learning, we make predictions based on probability distributions rather
than fixed numbers, so our loss function helps decide which predictions are best.
• Suppose misclassifying spam as non-spam has a higher cost (e.g., 5 points) than misclassifying non-spam as spam (2 points).

• The expected loss for labelling an email as non-spam (dN) is 5 × P(S∣W) ≈ 5 × 0.63 = 3.15, while the expected loss for labelling it as spam (dS) is 2 × P(N∣W) ≈ 2 × 0.37 = 0.74.

• Choose the decision that minimizes expected loss (Bayes Risk); here, the email is labelled as spam.
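The same spam example can be worked through numerically; a minimal sketch using the numbers above.

# Posterior via Bayes' theorem, then the decision that minimizes expected loss.
p_spam, p_not = 0.3, 0.7            # priors P(S), P(N)
p_w_spam, p_w_not = 0.8, 0.2        # likelihoods P(W|S), P(W|N)

evidence = p_w_spam * p_spam + p_w_not * p_not     # P(W)
post_spam = p_w_spam * p_spam / evidence           # P(S|W), about 0.63
post_not = 1 - post_spam

loss_miss_spam, loss_false_alarm = 5, 2            # costs of the two kinds of mistake
risk_label_not_spam = loss_miss_spam * post_spam   # expected loss of deciding "not spam"
risk_label_spam = loss_false_alarm * post_not      # expected loss of deciding "spam"

print(round(post_spam, 2))                         # 0.63
print("label as spam" if risk_label_spam < risk_label_not_spam else "label as not spam")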


Conclusion
Bayesian Decision Theory allows us to:
• Integrate prior knowledge with observed data.
• Make probabilistic decisions to minimize risk.
• Apply in real-world scenarios like email filtering, medical diagnosis, and fraud
detection.
Parametric methods
• Parametric methods simplify the learning process by assuming a specific functional
form for the relationship between inputs and outputs.
• This means we're essentially trying to fit the data to a pre-defined equation with a fixed
number of parameters.
Key Characteristics
1. Assumptions: Parametric methods rely on pre-defined assumptions about the data
distribution or the relationship between variables. These assumptions might include:
o Data follows a normal distribution
o Relationship between variables is linear
2. Fixed Parameters: The model has a fixed number of parameters that are learned from
the training data. This number does not change with the size of the dataset.
3. Two-Step Process:
o Select a functional form for the mapping function based on assumptions.
o Learn the coefficients (parameters) of the function from the training data.
Examples with Detailed Explanations
1. Linear Regression
o Assumption: The relationship between the input features (X) and the output (Y)
is linear.
o Functional Form: Y = wX + b
▪ w: Weight vector (parameters to be learned)
▪ b: Bias (parameter to be learned)
o Example: Predicting house prices based on size (in square feet). We assume a
linear relationship: larger houses cost more.
o Learning: We use the training data to find the best values for w and b that
minimize the difference between predicted and actual prices.
2. Logistic Regression: Used for classification tasks, assumes a logistic function to model
the probability of a class.
3. Linear Discriminant Analysis (LDA): Assumes that the data for each class follows a
normal distribution with the same covariance matrix.
4. Naive Bayes: Assumes that features are independent of each other given the class label.
5. Perceptron: A simple neural network with a single layer, makes assumptions about the
separability of data.
Advantages of Parametric Methods
• Simplicity: Easier to understand and interpret due to the strong assumptions.
• Speed: Often faster to train compared to non-parametric methods.
• Less Data: Can work reasonably well with smaller datasets.
Limitations of Parametric Methods
• Constrained: The pre-defined functional form limits the model's flexibility to capture
complex relationships in the data.
• Limited Complexity: May not be suitable for problems with highly intricate patterns.
• Poor Fit: If the assumptions are incorrect, the model might not accurately represent
the underlying data, leading to poor performance.
Maximum Likelihood Estimation
• Maximum Likelihood Estimation (MLE) is a method used in statistics and machine
learning to find the best parameters for a model by maximizing the likelihood of the
observed data.
How it Works
1. Assume a Probability Distribution: First, you assume that your data follows a specific
probability distribution. This could be a normal distribution, a binomial distribution, or
any other distribution that you think is appropriate for your data.
2. The Likelihood Function: The likelihood function measures how likely it is to observe
your data given a particular set of parameters for the chosen distribution. In simpler
terms, it tells you how well the distribution with those parameters fits your data.

• Likelihood is a measure of how well a given set of parameters explains the observed
data.

If we have a dataset D = {x1, x2, ..., xn} and a model with parameters θ, the
likelihood function is:

L(θ)=P(x1∣θ) P(x2∣θ) P(x3∣θ) … P(xn∣θ)

or

L(θ) =P (D ∣ θ)
This function tells us how probable the observed data is, given the model
parameters.

3. Maximize the Likelihood: The goal of MLE is to find the parameters that maximize
the likelihood function. This means finding the parameters that make your observed
data the most probable.
• MLE finds the parameter value θ̂ that maximizes the likelihood function:

θ̂ = arg max_θ L(θ)

• θ̂ (theta-hat): This represents the estimated value of the parameter: θ̂ is the value of θ that maximizes the likelihood function L(θ).

• arg max: This is short for "argument of the maximum." It means "find the value of θ that maximizes the following function."

• L(θ): This is the likelihood function. It represents how likely it is to observe the data
we have, given a particular value of the parameter θ.

Since the likelihood is usually a very small number (it is a product of the probabilities of individual data points), we often take the log-likelihood instead, to avoid numerical issues and simplify calculations:

log L(θ) = log P(x1∣θ) + log P(x2∣θ) + … + log P(xn∣θ)
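A minimal sketch of MLE for a Gaussian, where the closed-form estimates (the sample mean and standard deviation) maximize the log-likelihood; the data values are illustrative.

import math

data = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4]

def log_likelihood(mu, sigma, xs):
    """Sum of log N(x | mu, sigma^2) over the data (log of the product of densities)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2) for x in xs)

# Closed-form maximum likelihood estimates for a Gaussian
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat)**2 for x in data) / len(data))

print(mu_hat, sigma_hat)
print(log_likelihood(mu_hat, sigma_hat, data))      # maximal over (mu, sigma)
print(log_likelihood(mu_hat + 1, sigma_hat, data))  # lower: a worse fit to the data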

Evaluating an Estimator: Bias and Variance


• In machine learning and statistics, we evaluate an estimator based on bias and variance, which describe how well it generalizes to new data.
• Bias is the error introduced by the simplifying assumptions of the estimator: the difference between its expected prediction and the true value. Variance measures how much its predictions change when it is trained on different training sets drawn from the same distribution.
Bias Variance Tradeoff
• If the algorithm is too simple (e.g., a hypothesis with a linear equation), it tends to have high bias and low variance and is therefore error-prone (underfitting). If the algorithm is too complex (e.g., a hypothesis with a high-degree equation), it tends to have high variance and low bias (overfitting).
• In the latter condition, the model will not perform well on new data. There is a middle ground between these two conditions, known as the Bias-Variance Trade-off.
• This trade-off in complexity is why there is a trade-off between bias and variance: an algorithm cannot be both more complex and less complex at the same time. The aim is to choose a model complexity at which the combined error from bias and variance is smallest.
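A small simulation sketch of the trade-off, assuming a quadratic ground-truth function and comparing an overly simple (degree-1) and an overly complex (degree-9) polynomial fit at a single test point; all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: 0.5 * x**2
x = np.linspace(-3, 3, 20)
x0 = 2.0

preds = {1: [], 9: []}
for _ in range(200):                       # 200 independently drawn noisy training sets
    y = true_f(x) + rng.normal(0, 1, x.size)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coeffs, x0))

for degree, p in preds.items():
    p = np.array(p)
    bias = p.mean() - true_f(x0)           # systematic error of the fitted models at x0
    variance = p.var()                      # spread of the fitted models at x0
    print(f"degree {degree}: bias^2 = {bias**2:.3f}, variance = {variance:.3f}")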
The Bayes Estimator
• The Bayes Estimator is a fundamental concept in Bayesian statistics used for making
decisions or estimations when dealing with uncertainty.
• In machine learning, it's a method for estimating parameters of a model or making
predictions by combining prior knowledge (beliefs) with observed data.
How it Works
1. Start with a prior: Begin with a prior probability distribution representing your initial
beliefs about the parameters.
2. Observe data: Gather data relevant to the problem.
3. Calculate the likelihood: Determine how likely the observed data is for different
values of the parameters.
4. Apply Bayes' theorem: Use Bayes' theorem to update the prior distribution with the
likelihood, resulting in the posterior distribution.
5. Choose an estimator: Select an estimator (a specific value or decision) that minimizes
the expected loss under the posterior distribution. This estimator is the Bayes Estimator.
Mathematical Expression (Bayes' Theorem)

P(θ∣D) = [P(D∣θ) × P(θ)] / P(D)

where P(θ∣D) is the posterior, P(D∣θ) is the likelihood, P(θ) is the prior, and P(D) is the evidence.
Types of Bayes estimators:

• Minimum Mean Square Error (MMSE) estimator


• Maximum A Posteriori (MAP) estimator.
• Posterior Median estimator.
Bayes estimators are used in various machine learning tasks, including:
• Classification: Naive Bayes classifier, Bayes optimal classifier
• Regression: Bayesian linear regression, Bayesian logistic regression
• Model selection: Bayesian model comparison
• Hyperparameter optimization: Bayesian optimization
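A minimal sketch of Bayes estimation for a coin's head probability with a Beta prior; the prior and the observed counts are illustrative. The posterior mean and mode correspond to the MMSE and MAP estimators listed above.

a, b = 2, 2                  # prior belief: roughly a fair coin, Beta(a, b)
heads, tails = 7, 3          # observed data

post_a, post_b = a + heads, b + tails                 # posterior is Beta(post_a, post_b)
posterior_mean = post_a / (post_a + post_b)           # MMSE estimate (posterior mean)
map_estimate = (post_a - 1) / (post_a + post_b - 2)   # MAP estimate (posterior mode)
mle = heads / (heads + tails)                          # for comparison: estimate with no prior

print(posterior_mean, map_estimate, mle)   # about 0.643, 0.667, 0.7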

Linear regression:
• Linear regression is one of the simplest statistical regression techniques used for predictive analysis in machine learning.
• It models the linear relationship between the independent (predictor) variable, plotted on the X-axis, and the dependent (output) variable, plotted on the Y-axis; hence the name linear regression. If there is a single input variable X (independent variable), it is called simple linear regression.
• In a simple linear regression, there is one independent variable and one dependent
variable. The model estimates the slope and intercept of the line of best fit, which
represents the relationship between the variables.
• The slope represents the change in the dependent variable for each unit change in the
independent variable, while the intercept represents the predicted value of the
dependent variable when the independent variable is zero.

• The graph above presents the linear relationship between the output(y) and predictor(X)
variables. The blue line is referred to as the best-fit straight line. Based on the given
data points, we attempt to plot a line that fits the points the best.
Simple Regression Calculation
• To calculate best-fit line linear regression uses a traditional slope-intercept form which
is given below,
Yi = β0 + β1Xi
where Yi = dependent variable, β0 = constant/intercept, β1 = slope coefficient, Xi = independent variable.
• This algorithm explains the linear relationship between the dependent (output) variable y and the independent (predictor) variable X using a straight line Y = B0 + B1·X.
But how does the regression find out which is the best-fit line?
• The goal of the linear regression algorithm is to get the best values for B 0 and B 1 to
find the best-fit line. The best-fit line is a line that has the least error which means the
error between predicted values and actual values should be minimum.
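A minimal sketch of computing the best-fit line in closed form from the least-squares conditions (B1 = cov(X, Y) / var(X), B0 = mean(Y) − B1·mean(X)); the data points are illustrative.

xs = [1, 2, 3, 4, 5]            # e.g. house size (arbitrary units)
ys = [2.1, 4.3, 6.2, 8.1, 9.9]  # e.g. price

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum((x - x_mean) ** 2 for x in xs)
b0 = y_mean - b1 * x_mean

print(f"Y = {b0:.2f} + {b1:.2f} * X")   # slope about 1.94, intercept about 0.3
predicted = [b0 + b1 * x for x in xs]
mse = sum((y - p) ** 2 for y, p in zip(ys, predicted)) / len(ys)
print("MSE of the best-fit line:", round(mse, 4))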

Gradient Descent
• Gradient Descent is defined as one of the most commonly used iterative optimization
algorithms of machine learning to train the machine learning and deep learning models
by means of minimizing errors between actual and expected results. It helps in finding
the local minimum of a function.
• The best way to define the local minimum or local maximum of a function using gradient descent is as follows:
• If we move in the direction of the negative gradient, i.e. away from the gradient of the function at the current point, we will reach the local minimum of that function. This procedure is known as gradient descent, also called steepest descent.
• Whenever we move in the direction of the positive gradient, i.e. towards the gradient of the function at the current point, we will reach the local maximum of that function. This procedure is known as gradient ascent.
The main objective of using a gradient descent algorithm is to minimize the cost
function using iteration.

The cost function is defined as the measurement of difference or error between actual values
and expected values at the current position and present in the form of a single real number.

To achieve this goal, it performs two steps iteratively:

• Calculates the first-order derivative of the function to compute the gradient or slope of that function.
• Moves a step in the direction opposite to the gradient; the step is the gradient scaled by alpha, where alpha is defined as the learning rate. It is a tuning parameter in the optimization process that decides the length of the steps.
To minimize the cost function, two things are required: the direction of the step and the learning rate.
Learning Rate: It is defined as the step size taken to reach the minimum or lowest point.
How Does Gradient Descent Work in Linear Regression?
Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name linear regression. Gradient descent finds the line's parameters as follows:

• Initialize Parameters: Start with random initial values for the slope (m) and intercept (b).
• Calculate the Cost Function: Compute the error using a cost function such as the Mean Squared Error (MSE):
J(m, b) = (1/n) Σ (yi − (m·xi + b))²
• Compute the Gradient: Find the gradient of the cost function with respect to m and b. These gradients indicate how the cost changes when the parameters are adjusted.
• Update Parameters: Adjust m and b in the direction that reduces the cost:
m = m − α · ∂J/∂m,  b = b − α · ∂J/∂b
• Repeat: Iterate until the cost function converges, i.e. further updates make little or no difference.
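A minimal sketch of these steps for simple linear regression; the data, learning rate, and epoch count are illustrative choices.

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]           # true relationship: y = 2x + 1
m, b = 0.0, 0.0                 # step 1: initialization
alpha = 0.01                    # learning rate
n = len(xs)

for epoch in range(5000):       # step 5: repeat until convergence
    preds = [m * x + b for x in xs]
    # step 3: gradients of the MSE cost with respect to m and b
    dm = (-2 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
    db = (-2 / n) * sum(y - p for y, p in zip(ys, preds))
    # step 4: move against the gradient
    m -= alpha * dm
    b -= alpha * db

print(round(m, 3), round(b, 3))  # approaches m = 2, b = 1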

Types of Gradient Descent


Based on the error in various training models, the Gradient Descent learning algorithm can be
divided into Batch gradient descent, stochastic gradient descent, and mini-batch gradient
descent. Let's understand these different types of gradient descent:

1. Batch Gradient Descent:


Batch gradient descent (BGD) is used to find the error for each point in the training set and
update the model after evaluating all training examples. This procedure is known as the
training epoch. In simple words, it is a greedy approach where we have to sum over all
examples for each update.
Advantages of Batch gradient descent:
o It produces less noise in comparison to other gradient descent.
o It produces stable gradient descent convergence.
o It is computationally efficient as all resources are used for all training samples.

2. Stochastic gradient descent


Stochastic gradient descent (SGD) is a type of gradient descent that runs one training
example per iteration. Or in other words, it processes a training epoch for each example
within a dataset and updates each training example's parameters one at a time. As it
requires only one training example at a time, hence it is easier to store in allocated memory.
Advantages of Stochastic gradient descent:
In Stochastic gradient descent (SGD), learning happens on every example, and it consists of a
few advantages over other gradient descent.
o It is easier to allocate in desired memory.
o It is relatively fast to compute than batch gradient descent.
o It is more efficient for large datasets.
3. Mini Batch Gradient Descent:
Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and then performs an update on each batch separately. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.
Advantages of Mini Batch gradient descent:
o It is easier to fit in allocated memory.
o It is computationally efficient.
o It produces stable gradient descent convergence.
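A small sketch of the only difference between the three variants: how many samples feed each parameter update. The data and batch sizes are illustrative, and the actual gradient computation is omitted.

import random

def updates(data, batch_size):
    """Yield the chunks of data used for one parameter update each."""
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

data = list(range(10))
print(list(updates(data[:], len(data))))  # batch GD: one update per epoch, all samples
print(list(updates(data[:], 1)))          # SGD: one update per training example
print(list(updates(data[:], 4)))          # mini-batch GD: a few samples per update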

Linear Discrimination
• Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality
reduction techniques in machine learning to solve more than two-class classification
problems. It is also known as Normal Discriminant Analysis (NDA) or Discriminant
Function Analysis (DFA).
• It is also considered a pre-processing step for modelling differences in ML and
applications of pattern classification.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-
dimensional plane as shown below image:
When we classify them using a single feature, the classes may show overlapping.

To overcome the overlapping issue in the classification process, we keep increasing the number of features.
Here, LDA uses the X-Y axes to create a new axis, separating the classes with a straight line and projecting the data onto that new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane to 1-D.

To create a new axis, Linear Discriminant Analysis uses the following criteria:
o It maximizes the distance between means of two classes.
o It minimizes the variance within the individual class.
LDA can be performed in 5 steps:

Step 1: Compute the mean vectors for the different classes from the dataset.

Step 2: Compute the scatter matrices (in-between-class and within-class scatter matrices).

Step 3: Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.

Step 4: Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest
eigenvalues.

Step 5: Use this eigenvector matrix to transform the samples onto the new subspace.

1. Compute the Mean Vectors:


Calculate the mean vector for each class and the overall mean of the entire dataset.

2. Calculate Scatter Matrices:
Compute the within-class scatter matrix S_W = Σ_c Σ_{x∈c} (x − m_c)(x − m_c)^T and the between-class scatter matrix S_B = Σ_c N_c (m_c − m)(m_c − m)^T, where m_c is the mean of class c, N_c its number of samples, and m the overall mean.

3. Objective of LDA:
LDA tries to maximize the ratio of between-class scatter to within-class scatter to achieve maximum class separation. This is represented mathematically as:
J(W) = |W^T S_B W| / |W^T S_W W|

4. Solve for the Optimal Projection Vector:
The optimal projection directions are the eigenvectors of S_W⁻¹ S_B with the largest eigenvalues.

5. Project Data onto the New Axis:
Y = W^T X

This transformation reduces the dimensionality of the data while maintaining the class-
related information.
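The five LDA steps above can be carried out with scikit-learn, which computes the scatter matrices and eigenvectors internally; a minimal sketch using the built-in Iris dataset purely as stand-in data.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)              # 4 features, 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)          # project 4-D data onto 2 discriminant axes

print(X_projected.shape)                       # (150, 2)
print(lda.explained_variance_ratio_)           # class separation captured by each new axis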
Logistic regression
• Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except in how it is used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
• Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The below image is showing the logistic function:
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to probabilities.

o It maps any real value into another value within the range 0 to 1:
σ(z) = 1 / (1 + e^(−z))
Logistic Regression Equation:
• The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:
We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + … + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 − y):

y / (1 − y); 0 for y = 0, and infinity for y = 1

o But we need a range between −[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:

log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for Logistic Regression.
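A minimal sketch of the sigmoid mapping a linear score to a probability P(y = 1), with a 0.5 decision threshold; the coefficients b0 and b1 are assumed, illustrative values rather than fitted ones.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = -4.0, 1.5              # assumed (not fitted) model parameters

def predict_proba(x):
    return sigmoid(b0 + b1 * x)  # P(y = 1 | x)

for x in [1, 3, 5]:
    p = predict_proba(x)
    print(x, round(p, 3), "class 1" if p >= 0.5 else "class 0")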


Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".

Multilayer Perceptron:

• A Multi-Layer Perceptron (MLP) is a type of artificial neural network that consists of


multiple layers of neurons, or nodes, arranged in a hierarchical structure. It is one
of the simplest and most widely used types of neural networks, particularly for
supervised learning tasks such as classification and regression.
• MLP consists of fully connected dense layers that transform input data from one
dimension to another. It is called “multi-layer” because it contains an input layer, one
or more hidden layers, and an output layer. The purpose of an MLP is to model complex
relationships between inputs and outputs, making it a powerful tool for various machine
learning tasks.
Key Components of Multi-Layer Perceptron (MLP)
• Input Layer: Each neuron (or node) in this layer corresponds to an input feature.
For instance, if you have three input features, the input layer will have three neurons.
• Hidden Layers: An MLP can have any number of hidden layers, with each layer
containing any number of nodes. These layers process the information received from
the input layer.
• Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.
Every connection in the diagram is a representation of the fully connected nature of an MLP.
This means that every node in one layer connects to every node in the next layer. As the
data moves through the network, each layer transforms it until the final output is
generated in the output layer.

Working of Multi-Layer Perceptron


Let's delve into the working of the multi-layer perceptron and its key mechanisms: forward propagation, the loss function, backpropagation, and optimization.
Step 1: Forward Propagation
In forward propagation, the data flows from the input layer to the output layer, passing
through any hidden layers. Each neuron in the hidden layers processes the input as follows:
• Weighted Sum: The neuron computes the weighted sum of the inputs:

z = w1x1 + w2x2 + … + wnxn + b

• Activation Function: The weighted sum z is passed through an activation function to introduce non-linearity. Common activation functions include:
• Sigmoid: σ(z) = 1 / (1 + e^(−z))
• ReLU (Rectified Linear Unit): ReLU(z) = max(0, z)
• Tanh (Hyperbolic Tangent): tanh(z)
Step 2: Loss Function
Once the network generates an output, the next step is to calculate the loss using
a loss function. In supervised learning, this compares the predicted output to the
actual label.
For regression problems, the mean squared error (MSE) is often used:

MSE = (1/n) Σ (yi − ŷi)²

Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the
network’s weights and biases. This is achieved through backpropagation:
Gradient Descent: The network updates the weights and biases by moving in the opposite direction of the gradient to reduce the loss:

w = w − η · ∂L/∂w,  b = b − η · ∂L/∂b

where η is the learning rate and L is the loss.

Step 4: Optimization
MLPs rely on optimization algorithms to iteratively refine the weights and biases
during training. Popular optimization methods include:
• Stochastic Gradient Descent (SGD): Updates the weights based on a single sample or a small batch of data, applying the same update rule w = w − η · ∂L/∂w computed on that sample or batch.

• Adam Optimizer: An extension of SGD that incorporates momentum and adaptive


learning rates for more efficient training.
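A minimal sketch of one forward pass with NumPy, assuming a 3-feature input, a 4-node ReLU hidden layer, and one sigmoid output node; the weights are random stand-ins, and training would adjust them via backpropagation as described below.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])                  # one input sample (3 features)

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # hidden -> output weights and biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.maximum(0, W1 @ x + b1)    # hidden layer: weighted sum + ReLU activation
y_hat = sigmoid(W2 @ h + b2)       # output layer: weighted sum + sigmoid -> probability

print(y_hat)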
Example:
Consider a fully connected MLP with a 3-dimensional input, one hidden layer of 30 nodes, a single output node for binary classification, and no bias nodes. How many weight parameters does the network have? Let's break down how to solve this problem:

Understanding the Network Architecture


• Input Layer: The input is a 3D feature vector, meaning it has 3 input nodes.
• Hidden Layer: The hidden layer has 30 nodes.
• Output Layer: Since it's a binary classification problem, the output layer has 1 node
(representing the probability of belonging to one of the two classes).
• Fully Connected: Each node in one layer is connected to every node in the next layer.
• No Bias Nodes: This simplifies the calculation as we don't need to account for bias
terms.
Calculating the Number of Parameters (Weights)
1. Weights between Input and Hidden Layer:
o Each of the 3 input nodes is connected to each of the 30 hidden nodes.
o Number of weights = (Number of input nodes) * (Number of hidden nodes) =
3 * 30 = 90
2. Weights between Hidden and Output Layer:
o Each of the 30 hidden nodes is connected to the 1 output node.
o Number of weights = (Number of hidden nodes) * (Number of output nodes) =
30 * 1 = 30
3. Total Number of Weights:
o Add the weights from the input-to-hidden layer and the hidden-to-output layer.
o Total weights = 90 + 30 = 120
Answer
The network has a total of 120 weight parameters.
Back Propagation Algorithm:
• Backpropagation, also known as the "Backward Propagation of Errors," is a method used to train neural networks. Its goal is to reduce the difference between the model's predicted output and the actual output by adjusting the weights and biases in the network.
• Backpropagation is a technique used in deep learning to train artificial neural
networks particularly feed-forward networks. It works iteratively to adjust weights
and bias to minimize the cost function.
• In each epoch the model adapts these parameters, reducing the loss by following the error gradient. Backpropagation often uses optimization algorithms like gradient descent or stochastic gradient descent. The algorithm computes the gradient using the chain rule from calculus, allowing it to efficiently propagate the error back through the layers of the network and minimize the cost function.

The training procedure follows the same four steps described for the multilayer perceptron above:
Step 1: Forward Propagation. Each neuron computes the weighted sum z = Σ wi·xi + b of its inputs and passes it through an activation function (sigmoid, ReLU, or tanh), so the data flows from the input layer through the hidden layers to the output layer.
Step 2: Loss Function. The network's output is compared with the actual label using a loss function, e.g. the mean squared error MSE = (1/n) Σ (yi − ŷi)² for regression problems.
Step 3: Backpropagation. The gradient of the loss with respect to every weight and bias is computed with the chain rule, layer by layer from the output back to the input, and the parameters are moved in the opposite direction of the gradient: w = w − η · ∂L/∂w.
Step 4: Optimization. The updates are applied iteratively by an optimizer such as Stochastic Gradient Descent (one sample or a small batch per update) or Adam (which adds momentum and adaptive learning rates for more efficient training).
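A compact sketch of the full loop (forward pass, backpropagation via the chain rule, gradient-descent updates) for a tiny network learning XOR; the architecture, learning rate, and epoch count are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> 4 hidden units
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> 1 output unit
lr = 0.5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    # forward propagation
    h = sigmoid(X @ W1 + b1)            # hidden activations
    out = sigmoid(h @ W2 + b2)          # network output

    # backpropagation (chain rule; the derivative of sigmoid(a) is a * (1 - a))
    d_out = (out - y) * out * (1 - out)        # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)         # error signal at the hidden layer

    # gradient descent updates
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 2))   # should approach [[0], [1], [1], [0]]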
