ML4_ML_Algorithms

LO3 Develop a machine learning application using an appropriate programming language or machine learning tool for solving a real-world problem

RAJAD SHAKYA
KNN Algorithm
● A popular machine learning algorithm used mostly for solving classification problems.
● It compares a new data entry to the values in a given data set.
● Based on its closeness or similarity to a given number (K) of neighbors, the algorithm assigns the new data entry to a class or category in the data set.
KNN Algorithm
● 1 - Assign a value to K
● 2 - Calculate the distance between the new data
entry and all other existing data entries
● 3 - Find the K nearest neighbors to the new entry
based on the calculated distances.
● 4 - Assign the new data entry to the majority class among its K nearest neighbors, as in the sketch below.
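A minimal sketch of these four steps in plain Python; the points, labels, and k=3 are illustrative assumptions, not data from the slides.

```python
# Minimal KNN sketch of the four steps above.
from collections import Counter
import math

def knn_predict(train_points, train_labels, new_point, k=3):
    # Step 2: distance between the new entry and every existing entry.
    dists = [math.dist(p, new_point) for p in train_points]
    # Step 3: the K nearest neighbors by distance.
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # Step 4: majority class among those neighbors.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

points = [(1, 1), (2, 1), (4, 5), (5, 4), (1, 2)]  # hypothetical data
labels = ["A", "A", "B", "B", "A"]
print(knn_predict(points, labels, (3, 3), k=3))    # -> "B"
```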
K-Means Clustering
● popular unsupervised machine learning algorithm
used to partition a dataset into a set of distinct,
non-overlapping groups (or clusters) based on their
features.
● Centroids: Each cluster is represented by a centroid,
which is the average of all points in the cluster.
● Clusters: Groups of data points that are assigned to
the nearest centroid.
Steps of K-Means Algorithm
● Initialization: Select k initial centroids randomly from
the dataset.
● Assignment: Assign each data point to the nearest
centroid, forming k clusters.
● Update: Recompute the centroid of each cluster as the
mean of all points assigned to that cluster.
● Repeat: Repeat the assignment and update steps until
convergence (i.e., centroids no longer change or change
very little).
Example 1
{2, 4, 10, 12, 3, 20, 30, 11, 25}

Let's assume we want to divide these data points into k=2 clusters.

We randomly select two data points as initial cluster centroids. Suppose we choose 4 and 20.

Assign each data point to the nearest centroid:
Example 1
Data Point   Dist. to C1 (4)   Dist. to C2 (20)   Assigned Cluster
2            2                 18                 Cluster 1
4            0                 16                 Cluster 1
10           6                 10                 Cluster 1
12           8                 8                  Cluster 1
3            1                 17                 Cluster 1
20           16                0                  Cluster 2
30           26                10                 Cluster 2
11           7                 9                  Cluster 1
25           21                5                  Cluster 2

(12 is equidistant from both centroids; the tie is broken toward Cluster 1.)
Example 1
Recompute the centroids:

New Centroid 1: (2+4+10+12+3+11)/6 = 42/6 = 7
New Centroid 2: (20+30+25)/3 = 75/3 = 25

Reassign data points based on the new centroids:
Example 1
Data Point   Dist. to C1 (7)   Dist. to C2 (25)   Assigned Cluster
2            5                 23                 Cluster 1
4            3                 21                 Cluster 1
10           3                 15                 Cluster 1
12           5                 13                 Cluster 1
3            4                 22                 Cluster 1
20           13                5                  Cluster 2
30           23                5                  Cluster 2
11           4                 14                 Cluster 1
25           18                0                  Cluster 2
Example 1
Recompute the centroids (repeat until convergence):

New Centroid 1: (2+4+10+12+3+11)/6 = 42/6 = 7
New Centroid 2: (20+30+25)/3 = 75/3 = 25

The centroids are unchanged, so the algorithm has converged.
Example 1
Final Clusters

Cluster 1: {2, 4, 10, 12, 3, 11} (Centroid: 7)
Cluster 2: {20, 30, 25} (Centroid: 25)
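The whole loop fits in a few lines of Python; a minimal 1-D sketch that reproduces Example 1, assuming ties are broken toward the first centroid (as the table above does for the point 12).

```python
# 1-D k-means sketch on the Example 1 data (k=2, initial centroids 4 and 20).
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
centroids = [4.0, 20.0]

while True:
    # Assignment step: each point goes to its nearest centroid.
    clusters = [[], []]
    for x in data:
        j = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[j].append(x)
    # Update step: each centroid becomes the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # convergence: centroids unchanged
        break
    centroids = new_centroids

print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(centroids)  # [7.0, 25.0]
```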
Example 2
{2, 3, 8, 10, 15, 18}

Let's assume k=2 clusters.

Randomly select two data points as initial cluster centroids. Suppose we choose 3 and 15.
Example 2
Data Point   Dist. to C1 (3)   Dist. to C2 (15)   Assigned Cluster
2            1                 13                 Cluster 1
3            0                 12                 Cluster 1
8            5                 7                  Cluster 1
10           7                 5                  Cluster 2
15           12                0                  Cluster 2
18           15                3                  Cluster 2
Example 2
Recompute the centroids:

New Centroid 1: (2+3+8)/3 = 13/3 ≈ 4.33
New Centroid 2: (10+15+18)/3 = 43/3 ≈ 14.33
Example 2
Data Point   Dist. to C1 (4.33)   Dist. to C2 (14.33)   Assigned Cluster
2            2.33                 12.33                 Cluster 1
3            1.33                 11.33                 Cluster 1
8            3.67                 6.33                  Cluster 1
10           5.67                 4.33                  Cluster 2
15           10.67                0.67                  Cluster 2
18           13.67                3.67                  Cluster 2
Example 2
Recompute the centroids:

New Centroid 1: (2+3+8)/3 = 13/3 ≈ 4.33
New Centroid 2: (10+15+18)/3 = 43/3 ≈ 14.33

The centroids have not changed, so the algorithm has converged.
Example 2
Cluster 1: {2, 3, 8} (Centroid: 4.33)

Cluster 2: {10, 15, 18} (Centroid: 14.33)


Example 3

Point   X   Y
A       1   1
B       2   1
C       4   5
D       5   4

Using k=2 clusters.

Suppose we choose A (1, 1) and C (4, 5) as the initial centroids.
Advantages
● Simple to implement and computationally efficient.
● Easy to interpret results.

Disadvantages
● The number of clusters (k) must be specified beforehand.
● Outliers can skew the centroids and affect the clustering results.
Question
● Cluster the following eight points (with (x, y) representing locations) into three clusters:
● A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
● Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
Question
● Apply the K-Means algorithm (K=2) to the data (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77) for two iterations and show the clusters. Initially choose the first two objects as the centroids.
Elbow Method
● Used to determine the number of clusters in a dataset.
● This method involves running the K-Means algorithm on the dataset for a range of values of k (the number of clusters) and, for each value of k, computing the within-cluster sum of squares (WCSS).
● The k at the "elbow" of the WCSS-versus-k curve, where adding more clusters stops reducing WCSS substantially, is the one chosen, as in the sketch below.
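A minimal sketch with scikit-learn, which exposes the WCSS of a fitted model as its `inertia_` attribute; the data reuses Example 1 from earlier.

```python
# Elbow-method sketch: run K-Means for a range of k and inspect WCSS.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float).reshape(-1, 1)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # WCSS drops sharply, then flattens

# Choose the k at the "elbow" where the curve flattens; for this data
# the big drops happen by k = 2 or 3.
```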

Decision Tree
● A supervised machine learning algorithm that can be used for both classification and regression tasks.
● It splits the data into subsets based on the most significant attribute, making decisions at each node to reach the final prediction.
● Root Node: the top node representing the entire dataset, which is split into two or more homogeneous sets.
Decision Tree
● Decision Node: nodes that represent a decision or test on an attribute.
● Leaf/Terminal Node: nodes that represent the final class label or value (in the case of regression).
● Branch/Sub-tree: a subsection of the entire tree.
Steps to Build a Decision Tree
● Select the Best Attribute
○ Choose the attribute that best splits the dataset.
■ using a measure such as the Gini Index or Information Gain
● Splitting
○ Divide the dataset into subsets based on the best
attribute. This creates decision nodes and leaf
nodes.
● Repeat Step 1 and Step 2
○ Recursively split the nodes until one of the
stopping conditions is met
Steps to Build a Decision Tree
● Stopping Criteria
○ Maximum tree depth is reached.
○ The minimum number of samples in a node is reached.
○ No further information gain can be achieved.
Entropy
● Used to measure the impurity in a given attribute.
● It specifies the randomness in the data.
● Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no), where:
○ S = the set of samples
○ P(yes) = probability of yes
○ P(no) = probability of no
Information Gain
● A measurement of the change in entropy after segmenting a dataset on an attribute.
● It calculates how much information a feature provides about a class.
● We aim to maximize information gain; the node/attribute with the highest information gain is split first.

For a dataset D with k classes: Entropy(D) = −Σ_{i=1..k} p_i · log2(p_i)

Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v), where v ranges over the values of attribute A.
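As a concrete check, this sketch computes Entropy(S) and the information gain of the Outlook attribute, assuming the standard 14-record play-golf dataset (9 "yes" / 5 "no") that the Naive Bayes example later in these slides also uses.

```python
# Entropy and information gain for the Outlook split (play-golf data).
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

parent = entropy([9, 5])  # Entropy(S) for 9 yes / 5 no ≈ 0.940

# (yes, no) counts within each Outlook value.
splits = {"Sunny": (2, 3), "Overcast": (4, 0), "Rainy": (3, 2)}
n = 14
remainder = sum(sum(c) / n * entropy(c) for c in splits.values())
gain = parent - remainder
print(round(parent, 3), round(gain, 3))  # 0.94 0.247
```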
Advantages of the Decision Tree
● Simple to understand, as it follows the same process a human follows when making a decision in real life.
● Helps to think through all the possible outcomes for a problem.
Disadvantages of the DT
● A deep tree contains many layers, which makes it complex.
● With more class labels, the computational complexity of the decision tree may increase.
Numerical of the DT
https://medium.datadriveninvestor.com/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38

https://www.cs.cmu.edu/~aarti/Class/10315_Fall20/recs/DecisionTreesBoostingExampleProblem.pdf
Naive Bayes Algorithm
● A probabilistic machine learning algorithm based on Bayes' Theorem: P(A | B) = P(B | A) · P(A) / P(B).
● Called "naive" because it assumes that the features (or predictors) are independent of each other given the class label.
Naive Bayes Algorithm
● The worked example below uses the standard 14-record "play golf" weather dataset (attributes: Outlook, Temperature, Humidity, Windy; class: Play).
● Calculate Prior Probabilities
○ Out of 14 records, 9 are "yes", so P(yes) = 9/14 and P(no) = 5/14.
● Calculate Likelihoods
○ Out of 14 records: 5 Sunny, 4 Overcast, 5 Rainy.
○ From the dataset, the number of sunny days on which we can play is 2, and the total number of days on which we can play is 9.
○ So P(Sunny | yes) = 2/9.
● Let's predict: given the conditions Sunny, Mild, Normal, False, can he/she play golf?

P(yes | Sunny, Mild, Normal, False)
∝ P(Sunny, Mild, Normal, False | yes) · P(yes)
= P(Sunny | yes) · P(Mild | yes) · P(Normal | yes) · P(False | yes) · P(yes)
= 2/9 · 4/9 · 6/9 · 6/9 · 9/14
≈ 0.0282
● Let's now calculate the corresponding score for "no":

P(no | Sunny, Mild, Normal, False)
∝ P(Sunny, Mild, Normal, False | no) · P(no)
= P(Sunny | no) · P(Mild | no) · P(Normal | no) · P(False | no) · P(no)
= 3/5 · 2/5 · 1/5 · 2/5 · 5/14
≈ 0.0068

Since 0.0282 > 0.0068 (the "yes" score exceeds the "no" score), play is predicted as yes for the conditions Sunny, Mild, Normal, False.
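The same calculation takes only a few lines of Python; the probabilities below are taken straight from the slides.

```python
# Naive Bayes score for (Sunny, Mild, Normal, False) on the play-golf data.
p_yes, p_no = 9 / 14, 5 / 14

# P(feature | class) factors: Sunny, Mild, Normal, False.
likelihood_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)
likelihood_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)

score_yes = likelihood_yes * p_yes  # ≈ 0.0282
score_no = likelihood_no * p_no     # ≈ 0.0069
print("play" if score_yes > score_no else "don't play")  # -> play

# Normalizing the scores gives a proper posterior probability:
print(score_yes / (score_yes + score_no))  # ≈ 0.80
```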
Advantages of Naïve Bayes
● One of the fastest and easiest ML algorithms for predicting a class.
● Can be used for binary as well as multi-class classification.
● A popular choice for text classification problems.
Disadvantages of Naïve Bayes
● assumes that all features are independent or
unrelated, so it cannot learn the relationship
between features.
Types of Naïve Bayes Model
● Gaussian

● Multinomial
Random Forest
● A collaborative team of decision trees that work together to produce a single output.
● It works by creating a number of decision trees during the training phase.
● Each tree is constructed using a random subset of the data set and considers a random subset of the features at each split.
Random Forest
● This randomness introduces variability among individual trees, reducing the risk of overfitting and improving overall prediction performance.
● The forest aggregates the results of all trees, by voting for classification tasks (or averaging for regression tasks).
● The final prediction is therefore supported by multiple trees and their insights.
Random Forest
● Step 1: Select K random data points from the training set.
● Step 2: Build the decision trees associated with the selected data points (subsets).
● Step 3: Choose the number N of decision trees that you want to build.
Random Forest
● Step 4: Repeat Steps 1 and 2 until N trees are built.
● Step 5: For a new data point, find the prediction of each decision tree, and assign the new data point to the category that wins the majority of votes, as in the sketch below.
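A compact scikit-learn sketch of this procedure; the iris dataset, n_estimators=100, and max_features="sqrt" are illustrative assumptions (scikit-learn performs the bootstrap sampling of Step 1 and the per-split feature subsets internally).

```python
# Random-forest classification sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# N = 100 trees; each tree sees a bootstrap sample of the rows and a
# random subset of features (max_features) at every split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)

# Prediction = majority vote across the 100 trees.
print(rf.score(X_test, y_test))
```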
Key Features of Random Forest
● High Predictive Accuracy

● Resistance to Overfitting

● Large Datasets Handling

● Parallelization for Speed


Linear Regression
● type of supervised machine learning algorithm that
computes the linear relationship between the
dependent variable and one or more independent
features by fitting a linear equation to observed
data.
● When there is only one independent feature, it is known as Simple Linear Regression;
● when there is more than one feature, it is known as Multiple Linear Regression.
Linear Regression
● The interpretability of linear regression is a notable strength.
● It is transparent, easy to implement, and serves as a foundational concept for more complex algorithms.
Simple Linear Regression
● It involves only one independent variable and one dependent variable: Y = β0 + β1·X
● Y is the dependent variable
● X is the independent variable
● β0 is the intercept
● β1 is the slope
Multiple Linear Regression
● Involves more than one independent variable and one dependent variable: Y = β0 + β1·X1 + β2·X2 + … + βp·Xp
● where:
● Y is the dependent variable
● X1, X2, …, Xp are the independent variables
● β0 is the intercept
● β1, β2, …, βp are the slopes
Simple Linear Regression
● The goal of the algorithm is to find the best-fit line equation that can predict values based on the independent variables.
● What is the best-fit line?
○ The straight line that best represents the relationship between the dependent and independent variables.
Simple Linear Regression
● The slope of the line indicates how much the dependent variable changes for a unit change in the independent variable(s).
● Here Y is called the dependent or target variable, and X is called the independent variable, also known as the predictor of Y.
Simple Linear Regression
● Performs the task of predicting a dependent variable value (y) based on a given independent variable (x).
● Since different values for the weights (the coefficients of the line) produce different regression lines, we use the cost function to compute the best values and obtain the best-fit line.
How to update θ1 and θ2 values to get the best-fit line?

● To achieve the best-fit regression line, the model aims to predict the target value Y_pred such that the error between the predicted value Y_pred and the true value Y is minimum.
● So, it is very important to update the θ1 and θ2 values to reach the values that minimize the error between the predicted y value (Y_pred) and the true y value (y).
Cost function for LR
● The cost function (or loss function) is nothing but the error or difference between the predicted value Y_pred and the true value Y.
● In linear regression, the Mean Squared Error (MSE) cost function is employed, which calculates the average of the squared errors between the predicted values ŷ_i and the actual values y_i: J = (1/n) · Σ_{i=1..n} (ŷ_i − y_i)²
Cost function for LR
● Utilizing the MSE function, the iterative process of gradient descent is applied to update the values of θ1 and θ2.
● This ensures that the MSE value converges to the global
minima, signifying the most accurate fit of the linear
regression line to the dataset.
● The final result is a linear regression line that minimizes
the overall squared differences between the predicted
and actual values, providing an optimal representation of
the underlying relationship in the data.
Numerical
● Data: (1, 2), (2, 3), (3, 5), (4, 4), (5, 6)
● Least squares gives slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = 9/10 = 0.9 and intercept = ȳ − 0.9·x̄ = 4 − 2.7 = 1.3, so the best-fit line is y = 0.9x + 1.3.
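A quick numpy check of this result using the least-squares formulas for the slope and intercept.

```python
# Ordinary least squares on the slide's data; expect y = 0.9x + 1.3.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 5, 4, 6], dtype=float)

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # 0.9 1.3
```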
Gradient Descent for LR
● The model can be trained using the optimization algorithm gradient descent, by iteratively modifying the model's parameters to reduce the mean squared error (MSE) of the model on a training dataset.
● To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE value) and achieve the best-fit line, the model uses gradient descent.
Gradient Descent for LR
● The idea is to start with random θ1 and θ2 values and then iteratively update them, reaching the minimum cost.
● A gradient is nothing but a derivative that describes the effect on the output of a function of a small variation in its inputs.
Gradient Descent for LR
● Finding the coefficients of a linear equation that best fits the training data is the objective of linear regression.
● The coefficients are changed by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients.
● If α is the learning rate, the intercept and the coefficient of X are updated as θ1 := θ1 − α · ∂J/∂θ1 and θ2 := θ2 − α · ∂J/∂θ2, as in the sketch below.
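A minimal batch gradient-descent sketch on the earlier numerical example; the learning rate α = 0.05 and the iteration count are assumptions, not values from the slides.

```python
# Batch gradient descent for simple linear regression (MSE cost).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 5, 4, 6], dtype=float)

theta1, theta2 = 0.0, 0.0  # intercept and slope, starting from zero
alpha, n = 0.05, len(x)

for _ in range(5000):
    error = theta1 + theta2 * x - y
    # Gradients of MSE with respect to the intercept and the slope.
    grad1 = (2 / n) * np.sum(error)
    grad2 = (2 / n) * np.sum(error * x)
    theta1 -= alpha * grad1
    theta2 -= alpha * grad2

print(theta1, theta2)  # converges toward 1.3 and 0.9
```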
Alpha – The Learning Rate
● It must be chosen carefully so that gradient descent converges to the minimum.
● If the learning rate is too high, we might OVERSHOOT the minimum and keep bouncing around without reaching it.
● If the learning rate is too small, training might take too long.
Types of gradient descent
● Batch gradient descent
○ Works by updating the parameter values based on the average gradient of the loss function over the entire training dataset.
○ This is the standard form of gradient descent; it takes a long time on large datasets because it computes the gradient over all the training data at each iteration while searching for the optimal parameter values.
Types of gradient descent
● Stochastic gradient descent (SGD)

○ Works one training example per iteration, which gives a higher modelling speed.
○ Here, the model's parameter values are updated using the gradient of the loss function on a single training example at a time.
Types of gradient descent
● Mini-batch gradient descent

○ We can think of it as a compromise between the two types above: the model parameters are updated based on the average gradient over a small random subset of the training data.
○ This subset is called a mini-batch, and this variant runs at a medium speed, as contrasted in the sketch below.
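All three variants differ only in how many examples feed each parameter update; in the sketch below, batch_size = len(x) gives batch GD, batch_size = 1 gives SGD, and anything in between gives mini-batch GD (the learning rate, batch size, and iteration count are assumptions).

```python
# One update step that works for batch, stochastic, and mini-batch GD.
import numpy as np

def gd_step(theta1, theta2, xb, yb, alpha):
    # Gradient of MSE on the batch (xb, yb), whatever its size.
    error = theta1 + theta2 * xb - yb
    m = len(xb)
    return (theta1 - alpha * (2 / m) * error.sum(),
            theta2 - alpha * (2 / m) * (error * xb).sum())

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 5, 4, 6], dtype=float)
rng = np.random.default_rng(0)

theta1 = theta2 = 0.0
batch_size = 2  # mini-batch; 1 -> SGD, len(x) -> batch GD
for _ in range(3000):
    idx = rng.choice(len(x), size=batch_size, replace=False)
    theta1, theta2 = gd_step(theta1, theta2, x[idx], y[idx], alpha=0.01)

print(theta1, theta2)  # hovers near 1.3 and 0.9
```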
Evaluation Metrics for LR
● Mean Squared Error (MSE)
○ An evaluation metric that calculates the average of the squared differences between the actual and predicted values over all the data points:
○ MSE = (1/n) · Σ_{i=1..n} (y_i − ŷ_i)²
Evaluation Metrics for LR
● Mean Squared Error (MSE)
○ Here,
■ n is the number of data points,
■ y_i is the actual or observed value for the i-th data point,
■ ŷ_i is the predicted value for the i-th data point.
○ MSE is a way to quantify the accuracy of a model's predictions.
○ MSE is sensitive to outliers, as large errors contribute significantly to the overall score.
Evaluation Metrics for LR
● Coefficient of Determination (R-squared)
○ A statistic that indicates how much of the variation the developed model can explain or capture.
○ It is always in the range of 0 to 1.
○ In general, the better the model fits the data, the greater the R-squared value.
Evaluation Metrics for LR
● Coefficient of Determination (R-squared): R² = 1 − RSS / TSS
Evaluation Metrics for LR
● Coefficient of Determination (R-squared)
○ Residual Sum of Squares (RSS):
■ The sum of the squared residuals over the data points: RSS = Σ_{i=1..n} (y_i − ŷ_i)²
■ It measures the difference between the observed output and what was predicted.
Evaluation Metrics for LR
● Coefficient of Determination (R-squared)
○ Total Sum of Squares (TSS):
■ The sum of the squared differences between each observed value and the mean of the response variable: TSS = Σ_{i=1..n} (y_i − ȳ)²
Evaluation Metrics for LR
● Coefficient of Determination (R-squared)
○ The R-squared metric measures the proportion of variance in the dependent variable that is explained by the independent variables in the model.
Evaluation Metrics for LR
● Adjusted R-Squared Error
○ The proportion of variance in the dependent variable that is explained by the independent variables in a regression model.
○ It accounts for the number of predictors in the model and penalizes the model for including irrelevant predictors that do not contribute significantly to explaining the variance in the dependent variable.
Evaluation Metrics for LR
● Adjusted R-Squared Error
○ Adjusted R² is expressed as: Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
○ Here,
○ n is the number of observations,
○ k is the number of predictors in the model,
○ R² is the coefficient of determination.
Evaluation Metrics for LR
● Adjusted R-Squared Error
○ Helps to prevent overfitting: it penalizes the model for additional predictors that do not contribute significantly to explaining the variance in the dependent variable.
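These three metrics can be checked numerically on the earlier fit y = 0.9x + 1.3; a small numpy sketch with n = 5 observations and k = 1 predictor.

```python
# MSE, R², and adjusted R² for the fitted line y = 0.9x + 1.3.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 5, 4, 6], dtype=float)
y_pred = 0.9 * x + 1.3

n, k = len(y), 1
mse = np.mean((y - y_pred) ** 2)
rss = np.sum((y - y_pred) ** 2)    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(mse, r2, adj_r2)  # 0.38 0.81 ≈ 0.747
```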
Train Test Validation
● Fundamental in machine learning and data analysis, particularly during model development.
● It involves dividing a dataset into three subsets: training, validation, and testing.
● The train-test split is a model validation process that lets you check how your model would perform with a new data set.
Train Test Validation
● The train-validation-test split helps assess how well a machine learning model will generalize to new, unseen data.
● It also prevents overfitting, where a model performs
well on the training data but fails to generalize to
new instances.
● By using a validation set, practitioners can iteratively
adjust the model’s parameters to achieve better
performance on unseen data.
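A minimal scikit-learn sketch: two successive train_test_split calls produce the three subsets (the iris data and 60/20/20 proportions are assumptions).

```python
# Train / validation / test split via two successive splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out the 20% test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# ...then split the remainder: 0.25 of the remaining 80% is another 20%.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```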
Train Test Validation
● Performance Evaluation
○ After training and validation, the model faces the testing set, which simulates real-world scenarios.
○ A model that performs well on the testing set has successfully adapted to new, unseen data.
Train Test Validation
● Bias and Variance Assessment
○ Helps in understanding the bias-variance trade-off.
○ The training set provides information about the model's bias, capturing inherent patterns,
○ while the validation and testing sets help assess variance, indicating the model's sensitivity to fluctuations in the dataset.
○ Striking the right balance between bias and variance is vital for achieving a model that generalizes well across different datasets.
Confusion Matrix
● A matrix that summarizes the performance of a machine learning model on a set of test data.
● It is a means of displaying the number of accurate and inaccurate instances based on the model's predictions.
● It is often used to measure the performance of classification models, which aim to predict a categorical label for each input instance.
Confusion Matrix

For a binary problem, the matrix has four cells:

                  Predicted Positive     Predicted Negative
Actual Positive   True Positive (TP)     False Negative (FN)
Actual Negative   False Positive (FP)    True Negative (TN)
● True Positive:
○ Interpretation: you predicted positive and it's true.
○ You predicted that a woman is pregnant, and she actually is.
Confusion Matrix
● True Negative:
○ Interpretation: you predicted negative and it's true.
○ You predicted that a man is not pregnant, and he actually is not.
Confusion Matrix
● False Positive (Type 1 Error):
○ Interpretation: you predicted positive and it's false.
○ You predicted that a man is pregnant, but he actually is not.
Confusion Matrix
● False Negative (Type 2 Error):
○ Interpretation: you predicted negative and it's false.
○ You predicted that a woman is not pregnant, but she actually is.
Confusion Matrix
● Just remember: "Positive/Negative" describes the predicted value, and "True/False" describes whether that prediction matches the actual value.
Confusion Matrix
● Recall
○ Recall = TP / (TP + FN)
○ This can be read as: out of all the actual positive classes, how many did we predict correctly?
○ Recall should be as high as possible.
Confusion Matrix
● Precision
○ Precision = TP / (TP + FP)
○ Out of all the classes we predicted as positive, how many are actually positive?
○ Precision should be as high as possible.
Confusion Matrix
● Accuracy
○ Accuracy = (TP + TN) / (TP + TN + FP + FN)
○ Out of all the classes (positive and negative), how many did we predict correctly?
○ Accuracy should be as high as possible.
Confusion Matrix
● F-measure
○ It is difficult to compare two models when one has low precision and high recall, or vice versa.
○ So, to make them comparable, we use the F-score, which measures recall and precision at the same time: F1 = 2 · (Precision · Recall) / (Precision + Recall)
○ It uses the harmonic mean in place of the arithmetic mean, punishing extreme values more.
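The four formulas above translate directly into code; a small sketch using illustrative counts (deliberately not the values from the exercises on the next slides).

```python
# Confusion-matrix metrics from raw TP / FN / FP / TN counts.
def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration only.
print(metrics(tp=40, fn=10, fp=20, tn=30))
# -> approximately (0.70, 0.667, 0.80, 0.727)
```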
Confusion Matrix
                  Predicted Positive   Predicted Negative
Actual Positive   50                   10
Actual Negative   5                    35

● Calculate the following metrics:
a) Accuracy
b) Precision
c) Recall
d) F1 Score
Confusion Matrix
                  Predicted Spam   Predicted Not Spam
Actual Spam       90               10
Actual Not Spam   20               80

● Calculate the following metrics:
a) Accuracy
b) Precision
c) Recall
d) F1 Score
Support Vector Machine (SVM)
● A supervised machine learning algorithm in which we try to find a hyperplane that best separates two classes.
● SVM is a powerful supervised algorithm that works best on smaller but complex datasets.
Support Vector Machine (SVM)
● Support Vectors:
○ These are the points that are closest to the hyperplane. The separating line is defined with the help of these data points.
Support Vector Machine (SVM)
● Margin:
○ The distance between the hyperplane and the observations closest to it (the support vectors). In SVM, a large margin is considered a good margin. There are two types of margins: hard margin and soft margin.
How Does SVM Work?
● The best hyperplane is the plane that has the maximum distance from both classes; this is the main aim of SVM.
● It is found by comparing different candidate hyperplanes and choosing the one that classifies the labels best, as in the sketch below.
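A minimal scikit-learn sketch; the six points and C=1.0 are illustrative assumptions.

```python
# Linear SVM: fit a maximum-margin hyperplane between two classes.
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable classes.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 5], [5, 4], [5, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)  # larger C -> harder margin
clf.fit(X, y)

print(clf.support_vectors_)   # the points closest to the hyperplane
print(clf.predict([[3, 3]]))  # classify a new point
```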
Thank You

RAJAD SHAKYA
