ML Unit-II
Supervised Learning
• In supervised learning, the machine is trained on a set of labelled data, which means
that the input data is paired with the desired output. The machine then learns to predict
the output for new input data.
• Supervised learning is often used for tasks such as classification and regression.
• In supervised learning, each data point in the training data contains input variables (also
known as independent variables or features), and an output variable, or label.
1. Regression
Regression algorithms are used when there is a relationship between the input variable and the
output variable and the value to be predicted is continuous, as in weather forecasting, market
trends, etc. Below are some popular regression algorithms which come under supervised
learning (a short code sketch follows the list):
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
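The following minimal sketch shows one of these algorithms (linear regression) in practice, assuming scikit-learn is available; the tiny dataset is made up for illustration:

```python
# A minimal sketch of simple linear regression with scikit-learn
# (the dataset here is hypothetical, for illustration only).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable (feature)
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])   # continuous output variable

model = LinearRegression()
model.fit(X, y)                           # learn slope and intercept
print(model.predict([[6]]))               # predict a continuous value for new input
```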
2. Classification
Classification algorithms are used when the output variable is categorical, meaning it takes
discrete classes, such as Yes-No, Male-Female, True-False, etc. Below are some popular
classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
For example, a labelled dataset of images of Elephant, Camel and Cow would have each
image tagged with either "Elephant", "Camel", or "Cow".
Classification:
• The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and then
classifies new observations into a number of classes or groups, such as Yes or No, 0 or
1, Spam or Not Spam, cat or dog, etc. Classes can be called targets, labels, or
categories.
• Classification: The process of sorting data into categories based on specific features or
characteristics.
• There are different types of classification problems depending on how many categories
(or classes) we are working with and how they are organized. There are two main
classification types in machine learning:
1. Binary Classification
• This is the simplest kind of classification. In binary classification, the goal is to sort the
data into two distinct categories. Think of it like a simple choice between two options.
• Imagine a system that sorts emails into either spam or not spam. It works by looking
at different features of the email, like certain keywords or sender details, and deciding
whether it's spam or not. It only chooses between these two options.
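A hedged sketch of such a binary spam filter, using a bag-of-words Naive Bayes model from scikit-learn (the four-email corpus and labels are hypothetical):

```python
# A minimal sketch of binary classification (spam / not spam)
# based on keyword features; the tiny corpus is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free discount offer now", "meeting agenda for monday",
          "free discount click now", "project report attached"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(emails)              # keyword (bag-of-words) features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free offer now"])))  # -> [1] (spam)
```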
2. Multiclass Classification
• Here, instead of just two categories, the data needs to be sorted into more than two
categories. The model picks the one that best matches the input.
• Think of an image recognition system that sorts pictures of animals into categories
like cat, dog, and bird.
Basically, the machine looks at the features in the image (like shape, color, or texture) and
chooses which animal the picture most likely shows, based on the training it received.
3. Multi-Label Classification
In multi-label classification, a single piece of data can belong to multiple categories at once.
Unlike multiclass classification, where each data point belongs to only one class, multi-label
classification allows data points to belong to multiple classes. A movie recommendation
system could tag a movie as both action and comedy. The system checks various features (like
movie plot, actors, or genre tags) and assigns multiple labels to a single piece of data, rather
than just one.
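A minimal multi-label sketch along these lines, assuming scikit-learn; the movie features and genre tags below are invented for illustration:

```python
# A minimal sketch of multi-label classification: one movie can
# receive several genre labels at once (data is hypothetical).
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

X = np.array([[0.9, 0.1], [0.8, 0.7], [0.1, 0.9], [0.7, 0.8]])  # plot features
genres = [["action"], ["action", "comedy"], ["comedy"], ["action", "comedy"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)              # one binary column per label
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(np.array([[0.85, 0.75]]))
print(mlb.inverse_transform(pred))         # may return both labels at once
```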
Lazy learners vs. eager learners:
• Prediction speed: lazy learners are too slow, since they apply functions and learning at the
prediction stage; eager learners predict very fast, as their functions are pre-defined (already fitted).
• Learning scope: medium for both; eager learners learn from the data while training, lazy
learners while testing.
Classification algorithms are widely used in many real-world applications across various
domains, including:
• Email spam filtering
• Credit risk assessment
• Medical diagnosis
• Sentiment analysis
• Fraud detection
• Recommendation systems
Classification Algorithms
Now, for the implementation of any classification model, it is essential to understand Logistic
Regression, which is one of the most fundamental and widely used algorithms in machine
learning for classification tasks. There are various types of classification algorithms. Some of
them are:
Linear Classifiers: Linear classifier models create a linear decision boundary between classes.
They are simple and computationally efficient. Some of the linear classification models are as
follows:
• Logistic Regression
• Support Vector Machines having kernel = ‘linear’
• Single-layer Perceptron
• Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers: Non-linear models create a non-linear decision boundary between
classes. They can capture more complex relationships between input features and target
variable. Some of the non-linear classification models are as follows:
• K-Nearest Neighbours
• Kernel SVM
• Naive Bayes
• Decision Tree Classification
• Ensemble learning classifiers (e.g., Random Forests)
• Multi-layer Artificial Neural Networks
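To make the linear vs. non-linear distinction concrete, here is a rough scikit-learn comparison on a toy dataset with a curved class boundary (make_moons); the RBF-kernel SVM stands in for the non-linear classifiers listed above:

```python
# Comparing a linear and a non-linear classifier on data whose
# true class boundary is curved (a standard toy dataset).
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_tr, y_tr)   # linear decision boundary
nonlin = SVC(kernel="rbf").fit(X_tr, y_tr)      # non-linear decision boundary
print("linear:", linear.score(X_te, y_te))
print("kernel:", nonlin.score(X_te, y_te))      # usually higher on this data
```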
Decision Trees:
• Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
• The decisions or the test are performed on the basis of features of the given dataset.
• Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
• Branch/Sub-tree: A subtree formed by splitting a node of the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: A node that is divided into sub-nodes is called the parent node of
those sub-nodes, and the sub-nodes are called its child nodes.
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot further classify
the nodes; the final nodes are called leaf nodes.
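A minimal sketch of these steps using scikit-learn's CART-based DecisionTreeClassifier, which handles attribute selection internally (criterion="entropy" corresponds to information gain, "gini" to the Gini index):

```python
# Growing and printing a small decision tree; the recursive splitting
# described in Steps 1-5 happens inside fit().
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["sepal_len", "sepal_wid",
                                       "petal_len", "petal_wid"]))
```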
Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or not. To solve this problem, the decision tree starts with
the root node (the Salary attribute, chosen by ASM). The root node splits further into the next
decision node (Distance from the office) and one leaf node based on the corresponding
labels. The next decision node further splits into one decision node (Cab facility)
and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer
and Declined offer).
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. Using this measure, we can easily select the
best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
• Information Gain
• Gini Index
Information Gain (IG) is a metric used in decision tree algorithms (like ID3 and C4.5) to
evaluate the effectiveness of a particular attribute in classifying data. It measures
the reduction in entropy (uncertainty or disorder) after a dataset is split based on a particular
attribute.
Entropy: A measure of uncertainty or disorder. If a dataset is perfectly pure (all
instances belong to the same class), the entropy is 0. If the classes are equally
distributed, the entropy is maximized (which is 1 for binary classification).
Entropy(S) = -P(yes)·log2(P(yes)) - P(no)·log2(P(no))
Where,
o S = the set of training samples
o P(yes) = probability of yes
o P(no) = probability of no
Information Gain (IG): It measures the change in entropy after a dataset is split based on a
particular attribute. It is the difference between the original entropy of the dataset and the
weighted average entropy of the subsets resulting from the split.
The formula for Information Gain is:
Information Gain(Dataset, Attribute) = Entropy(Dataset) - Weighted Average
Entropy(Subsets created by splitting on Attribute)
Equivalently, IG(S, A) = Entropy(S) - Σv (|Sv| / |S|) · Entropy(Sv), where Sv is the subset
of S having value v for attribute A.
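As a small worked example of these formulas, the snippet below computes the entropy of a hypothetical parent set (9 "yes" / 5 "no") and the information gain of one candidate split; all counts are illustrative:

```python
# A worked entropy / information-gain computation on made-up counts.
import math

def entropy(p_yes, p_no):
    terms = [p * math.log2(p) for p in (p_yes, p_no) if p > 0]
    return -sum(terms)

parent = entropy(9/14, 5/14)                    # ≈ 0.940 (9 yes, 5 no)
# Candidate split into two subsets: (6 yes, 2 no) and (3 yes, 3 no)
left, right = entropy(6/8, 2/8), entropy(3/6, 3/6)
weighted = (8/14) * left + (6/14) * right       # weighted average entropy
print("Information Gain ≈", parent - weighted)  # ≈ 0.048
```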
Gini Index
Decision trees are a popular machine learning algorithm used for both classification and
regression tasks. In classification, the goal is to predict which category a data point belongs to.
The Gini Index plays a crucial role in how decision trees decide how to split the data at each
step.
Here's the basic idea:
1. Measuring Impurity: The Gini Index measures the "impurity" of a set of data points.
A pure set means all data points belong to the same category. An impure set means the
data points are mixed across different categories.
2. Finding the Best Split: When building a decision tree, the algorithm needs to decide
which feature to use for splitting the data at each node. It calculates the Gini Index for
each possible split and chooses the one that results in the greatest reduction in impurity.
This means the split that creates the most "pure" subsets of data.
How it's calculated:
The Gini Index for a set of data points is calculated as:
Gini Index = 1 - Σi (pi)², for i = 1, ..., c
where pi is the probability (proportion) of category i and c is the number of categories.
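A minimal helper matching this formula, applied to a hypothetical two-class split (the counts are invented for illustration):

```python
# Gini impurity from class counts, and the weighted Gini of a split.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([10, 10]))     # 0.5 -> maximally impure for two classes
print(gini([20, 0]))      # 0.0 -> pure node
# Weighted Gini of a split into subsets [8, 2] and [2, 8]:
split = (10/20) * gini([8, 2]) + (10/20) * gini([2, 8])
print(split)              # 0.32 -> lower impurity, so a useful split
```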
Univariate Decision Trees
A univariate decision tree tests a single feature at each split:
• Uses conditions like Xi ≤ V (for numerical data) or categorical rules (e.g., "Color =
Red").
• Simpler to interpret and computationally efficient.
• Commonly used in CART, ID3, and C4.5 algorithms.
Example:
If we have features (Age, Income, Credit Score), a univariate decision tree may use only one
feature at a time to make a split:
• If Age ≤ 30 → Go left
• If Age > 30 → Go right
Multivariate Decision Trees
Characteristics:
• More flexible than univariate trees.
• Decision boundaries are not restricted to be parallel to axes (can be diagonal or
curved).
• Uses linear models (like Logistic Regression, SVM, PCA-based splits) to determine
the best split.
• More computationally expensive than univariate trees.
Example (Multivariate Split Rule): a multivariate tree can split on a linear combination of
features, e.g. If 0.4·Age + 0.6·Income ≤ 50 → Go left, otherwise → Go right (the weights
here are purely illustrative).
Types of Pruning
1. Pre-Pruning (Early Stopping):
o Pre-pruning involves stopping the growth of a decision tree before it becomes
too complex and overfits the training data. This can be done by setting a
maximum tree depth, a minimum number of data points in a leaf node, or a
threshold for the information gain at each decision node. Pre-pruning is simple
and computationally efficient, but it may not capture complex relationships in
the data.
• Stops the tree from growing before it becomes too complex.
• Sets a limit on conditions like:
o Max depth (max_depth)
o Min samples per split (min_samples_split)
o Min samples per leaf (min_samples_leaf)
• Advantages:
Faster training.
Prevents excessive growth.
• Disadvantages:
Might stop too early, missing some important patterns.
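A brief sketch of pre-pruning in practice, using the scikit-learn parameters named in the list above (the limit values are arbitrary illustrations):

```python
# Pre-pruning: cap the tree's growth via constructor parameters
# instead of letting it grow until every leaf is pure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pruned = DecisionTreeClassifier(max_depth=3,
                                min_samples_split=10,
                                min_samples_leaf=5,
                                random_state=0).fit(X, y)
print("depth:", pruned.get_depth(), "leaves:", pruned.get_n_leaves())
```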
Bayesian Learning
Bayes' theorem combines the prior probability of a class (e.g., spam) with the likelihood of
the observed features (e.g., words in the email) to give a posterior probability. Thus, if an
email contains the word "discount," there's a 63% probability it's spam.
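The intermediate steps behind the 63% figure are not worked out above; the sketch below applies Bayes' theorem with hypothetical probabilities chosen so the posterior matches it:

```python
# Bayes' theorem for P(spam | "discount"); all numbers are assumed.
p_spam = 0.5                # prior P(spam)                     (assumed)
p_word_spam = 0.63          # P("discount" | spam)              (assumed)
p_word_ham = 0.37           # P("discount" | not spam)          (assumed)

evidence = p_word_spam * p_spam + p_word_ham * (1 - p_spam)
posterior = p_word_spam * p_spam / evidence
print(posterior)            # 0.63 -> P(spam | "discount")
```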
Step 4: Decision Making Using Loss Function
• A loss function in machine learning tells us how far our predictions are from the actual
values.
• In Bayesian learning, we make predictions based on probability distributions rather
than fixed numbers, so our loss function helps decide which predictions are best.
• Suppose misclassifying spam as non-spam has a higher cost (e.g., 5 points) than
misclassifying non-spam as spam (2 points).
• The expected loss for labelling an email as spam (ds) or non-spam (dN) is computed.
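A short worked version of this expected-loss comparison, reusing the 63% posterior and the costs of 5 and 2 given above:

```python
# Expected loss of each decision; choose the decision with lower loss.
p_spam = 0.63

# Labelling "spam" costs 2 whenever the email is actually non-spam.
loss_label_spam = 2 * (1 - p_spam)        # ≈ 0.74
# Labelling "non-spam" costs 5 whenever the email is actually spam.
loss_label_nonspam = 5 * p_spam           # ≈ 3.15

print(min(("spam", loss_label_spam), ("non-spam", loss_label_nonspam),
          key=lambda t: t[1]))            # -> ('spam', ≈0.74)
```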
Maximum Likelihood Estimation (MLE)
1. Define the Likelihood Function: Likelihood is a measure of how well a given set of
parameters explains the observed data. If we have a dataset D = {x1, x2, ..., xn} and a
model with parameters θ, the likelihood function is:
L(θ) = P(D | θ)
This function tells us how probable the observed data is, given the model parameters.
2. Form the Likelihood of the Sample: Assuming the data points are independent, the
likelihood of the whole dataset is the product of the individual probabilities:
L(θ) = ∏i P(xi | θ).
3. Maximize the Likelihood: The goal of MLE is to find the parameters that maximize
the likelihood function. This means finding the parameters that make your observed
data the most probable. MLE finds the estimate θ̂ that maximizes the likelihood function:
θ̂ = arg maxθ L(θ)
• arg max: This is short for "argument of the maximum." It means "find the value of
θ that maximizes the following function."
• L(θ): This is the likelihood function. It represents how likely it is to observe the data
we have, given a particular value of the parameter θ.
Since probabilities are usually very small (because they involve multiplying
probabilities of individual data points), we often take the log-likelihood instead (to
avoid numerical issues and simplify calculations):
log L(θ) = Σi log P(xi | θ)
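A minimal MLE sketch: estimating the bias θ of a coin from hypothetical flips by scanning the log-likelihood over a grid of candidate values (a grid search stands in here for analytic maximization):

```python
# MLE for a Bernoulli parameter via the log-likelihood; data is made up.
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # 1 = heads
thetas = np.linspace(0.01, 0.99, 99)

# log L(θ) = Σ log P(x_i | θ) for Bernoulli observations
log_lik = [np.sum(flips * np.log(t) + (1 - flips) * np.log(1 - t))
           for t in thetas]
print("MLE θ ≈", thetas[np.argmax(log_lik)])   # ≈ 6/8 = 0.75
```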
Linear regression:
• Linear regression is one of the simplest statistical regression techniques, used for
predictive analysis in machine learning.
• It models the linear relationship between the independent (predictor) variable, i.e. the
X-axis, and the dependent (output) variable, i.e. the Y-axis; hence the name linear
regression. If there is a single input variable X (independent variable), it is called
simple linear regression.
• In a simple linear regression, there is one independent variable and one dependent
variable. The model estimates the slope and intercept of the line of best fit, which
represents the relationship between the variables.
• The slope represents the change in the dependent variable for each unit change in the
independent variable, while the intercept represents the predicted value of the
dependent variable when the independent variable is zero.
• A plot of such data shows the linear relationship between the output (y) and predictor (X)
variables, with a best-fit straight line drawn through the points. Based on the given
data points, we attempt to plot the line that fits the points best.
Simple Regression Calculation
• To calculate the best-fit line, linear regression uses the traditional slope-intercept form,
which is given below:
Yi = β0 + β1·Xi
where Yi = dependent variable, β0 = constant/intercept, β1 = slope/coefficient,
Xi = independent variable.
• This algorithm explains the linear relationship between the dependent (output) variable
y and the independent (predictor) variable X using the straight line Y = β0 + β1·X.
But how does the regression find out which is the best-fit line?
• The goal of the linear regression algorithm is to find the best values for β0 and β1 that
give the best-fit line. The best-fit line is the line with the least error, meaning the
error between the predicted values and the actual values should be minimum.
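A minimal least-squares sketch using numpy's polyfit, which returns the β1 (slope) and β0 (intercept) minimizing the squared error; the data points are made up:

```python
# Fitting the best-fit straight line Y = β0 + β1·X by least squares.
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

beta1, beta0 = np.polyfit(X, Y, deg=1)    # degree-1 fit: a straight line
print("Y ≈", beta0, "+", beta1, "* X")    # roughly Y ≈ 0.09 + 1.99 * X
```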
Gradient Descent
• Gradient Descent is one of the most commonly used iterative optimization
algorithms in machine learning, used to train machine learning and deep learning models
by minimizing the error between actual and predicted results. It helps in finding
the local minimum of a function.
• The behaviour of gradient-based optimization near a local minimum or maximum of a
function can be described as follows:
• If we move towards the negative gradient, i.e. away from the gradient of the function at
the current point, we will approach the local minimum of that function; this is gradient
descent, also known as steepest descent.
• Whenever we move towards the positive gradient, i.e. towards the gradient of the function
at the current point, we will approach the local maximum of that function; this
procedure is known as gradient ascent.
The main objective of using a gradient descent algorithm is to minimize the cost
function using iteration.
The cost function is defined as the measurement of difference or error between actual values
and expected values at the current position and present in the form of a single real number.
• Calculate the first-order derivative of the function to compute the gradient, or slope, of
that function at the current point.
• Move in the direction opposite to the gradient, stepping from the current point by alpha
times the gradient (the update rule θ = θ - α·∇J(θ)), where alpha (α) is the learning rate:
a tuning parameter in the optimization process which helps to decide the length of the steps.
To minimize the cost function, two data points are required: Direction & Learning Rate
Learning Rate: It is defined as the step size taken to reach the minimum or lowest point.
How Does Gradient Descent Work in Linear Regression?
Linear regression performs the task of predicting a dependent variable value (y) based on a
given independent variable (x); hence the name Linear Regression.
• Initialize Parameters: Start with random initial values for the slope (m) and intercept
(b).
• Calculate the Cost Function: Compute the error using a cost function such as the Mean
Squared Error (MSE): J(m, b) = (1/n) Σi (yi - (m·xi + b))²
• Compute the Gradient: Find the gradient of the cost function with respect to m and b.
These gradients indicate how the cost changes when the parameters are adjusted.
• Update Parameters: Adjust m and b in the direction that reduces the cost:
m = m - α·∂J/∂m and b = b - α·∂J/∂b.
• Repeat: Iterate until the cost function converges i.e. further updates make little or no
difference.
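A compact gradient-descent loop implementing these steps with MSE as the cost; the data, learning rate, and iteration count are illustrative choices:

```python
# Gradient descent for simple linear regression, following the steps above.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)
m, b, alpha = 0.0, 0.0, 0.01                       # initialize parameters

for _ in range(2000):
    pred = m * x + b
    grad_m = (-2 / len(x)) * np.sum(x * (y - pred))  # ∂MSE/∂m
    grad_b = (-2 / len(x)) * np.sum(y - pred)        # ∂MSE/∂b
    m -= alpha * grad_m                              # step against the gradient
    b -= alpha * grad_b
print("m ≈", round(m, 3), " b ≈", round(b, 3))       # approaches m ≈ 2, b ≈ 0
```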
Linear Discrimination
• Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality
reduction techniques in machine learning, and it can be used to solve two-class and
multi-class classification problems. It is also known as Normal Discriminant Analysis
(NDA) or Discriminant Function Analysis (DFA).
• It is also considered a pre-processing step for modelling class differences in ML and
in applications of pattern classification.
Example:
Let's assume we have to classify two different classes, each having a set of data points in a 2-
dimensional plane.
When we classify them using a single feature, the classes may show overlapping.
To overcome the overlapping issue in the classification process, we must keep increasing the
number of features.
Here, LDA uses the X-Y plane to create a new axis, separating the classes with a straight line
and projecting the data onto that new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane into
1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
o It maximizes the distance between means of two classes.
o It minimizes the variance within the individual class.
LDA can be performed in 5 steps:
Step 1: Compute the mean vectors for the different classes from the dataset.
Step 2: Compute the scatter matrices (in-between-class and within-class scatter matrices).
Step 3: Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
Step 4: Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest
eigenvalues.
Step 5: Use this eigenvector matrix to transform the samples onto the new subspace.
3. Objective of LDA:
LDA tries to maximize the ratio of between-class scatter to within-class scatter to
achieve maximum class separation. This is represented mathematically as:
J(w) = (wᵀ SB w) / (wᵀ SW w)
where SB is the between-class scatter matrix, SW is the within-class scatter matrix,
and w is the projection vector.
4. Solve for the Optimal Projection Vector:
The optimal w is found by solving the generalized eigenvalue problem SW⁻¹·SB·w = λ·w;
for two classes this reduces to w = SW⁻¹(μ1 - μ2), where μ1 and μ2 are the class means.
This transformation reduces the dimensionality of the data while maintaining the class-
related information.
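A minimal LDA sketch with scikit-learn's LinearDiscriminantAnalysis, projecting synthetic two-class 2-D data onto a single discriminant axis:

```python
# LDA as dimensionality reduction: 2-D, two-class data down to 1-D.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
class0 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))  # synthetic class 0
class1 = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))  # synthetic class 1
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)     # project onto the new discriminant axis
print(X_1d.shape)                  # (100, 1)
```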
Logistic regression
• Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
• Logistic Regression is much like Linear Regression except in how the two are used:
Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
• Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1:
σ(x) = 1 / (1 + e^(-x))
Logistic Regression Equation:
• The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:
We know the equation of the straight line can be written as:
y = b0 + b1·x1 + b2·x2 + ... + bn·xn
o In Logistic Regression, y (the probability P(y=1)) can be between 0 and 1 only, so let's
divide the above equation by (1 - y):
y / (1 - y); this is 0 for y = 0 and infinity for y = 1.
o But we need a range between -[infinity] and +[infinity]; taking the logarithm of the
equation, it becomes:
log[ y / (1 - y) ] = b0 + b1·x1 + b2·x2 + ... + bn·xn
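A minimal sketch tying these equations to code: scikit-learn's LogisticRegression fits the coefficients b0, b1 of the log-odds equation above (the pass/fail data is hypothetical):

```python
# Logistic regression: probabilistic outputs between 0 and 1,
# then a 0/1 class decision.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])   # e.g., hours studied
y = np.array([0, 0, 0, 1, 1, 1])               # fail / pass labels

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))  # probabilities lying between 0 and 1
print(clf.predict([[3.5]]))        # final 0/1 class decision
```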
Multilayer Perceptron:
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the
network’s weights and biases. This is achieved through backpropagation:
Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss:
w = w - η·∂L/∂w and b = b - η·∂L/∂b, where η is the learning rate and L is the loss.
Step 4: Optimization
MLPs rely on optimization algorithms to iteratively refine the weights and biases
during training. Popular optimization methods include:
• Stochastic Gradient Descent (SGD): Updates the weights based on a single sample
or a small batch of data, using the gradient of the loss on that sample (or mini-batch)
rather than on the full dataset: w = w - η·∂L(xi)/∂w.
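A minimal end-to-end sketch: scikit-learn's MLPClassifier carries out the forward pass, backpropagation, and SGD updates described above on a toy dataset (the hyperparameters are illustrative):

```python
# Training a small MLP; backpropagation and SGD happen inside fit().
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(16,), solver="sgd",
                    learning_rate_init=0.1, max_iter=1000, random_state=0)
mlp.fit(X_tr, y_tr)                 # forward pass + backprop + SGD updates
print(mlp.score(X_te, y_te))        # accuracy on held-out data
```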