
MACHINE LEARNING NOTES:

NOTES BY: ARBAB LAIQ AHMED

SUPERVISED LEARNING:

 Definition: Supervised learning involves teaching a computer to perform tasks by providing examples, similar to a teacher guiding a student.

 Example: An analogy is made to teaching a child, showing images of an apple, ball, cat, and dog to create mental associations based on shape, color, and texture.

 Labeled Data: Supervised learning uses labeled data, where the system is told what each piece of data represents, so it can understand and classify information.

Steps in Supervised Learning:

 Data Collection: Gathering labeled data, such as emails marked as spam or not spam.

 Model Training: Using methods like classification (e.g., K-Nearest Neighbors, decision trees)
to train the model to understand the data.

 Testing: Evaluating the model's accuracy by testing it with a portion of the data to see if it
can correctly classify information.
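
A minimal sketch of this collect → train → test workflow, assuming scikit-learn is available (the feature vectors and spam labels below are hypothetical placeholders, not data from the notes):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Collected, labeled data: 1 = spam, 0 = not spam (hypothetical toy values).
X = [[0, 1], [1, 3], [2, 0], [3, 5], [4, 1], [5, 4]]
y = [0, 0, 0, 1, 1, 1]

# Hold out a portion of the data for testing the trained model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)    # model training
print(accuracy_score(y_test, model.predict(X_test)))      # testing: fraction classified correctly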

Techniques Used:

 Classification: Categorizing data into predefined classes, such as identifying emails as spam
or not spam.

 Regression: Working with continuous numerical data to predict values, like predicting house
prices based on size.

Regression Model Creation:

 Data Collection: Gathering data on house sizes and their corresponding prices.

 Training: Training the model to understand the relationship between size and price, using
methods like linear regression.

 Testing: Providing the model with a house size to predict its price.

Common Algorithms:

 Classification: K-Nearest Neighbors (KNN), Logistic Regression, Decision Trees, Support Vector Machines (SVM), Neural Networks.

 Regression: Linear Regression, Polynomial Regression.
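
As a small illustration of the split between the two families, here is a sketch assuming scikit-learn (the toy numbers are made up): one classifier from the first list predicts a class label, while a regressor from the second predicts a continuous value.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a discrete class (e.g., spam = 1, not spam = 0).
X_cls = [[0, 1], [1, 1], [2, 0], [3, 0]]
y_cls = [0, 0, 1, 1]
clf = KNeighborsClassifier(n_neighbors=3).fit(X_cls, y_cls)
print(clf.predict([[2, 1]]))        # a class label

# Regression: predict a continuous number (e.g., house price from size).
X_reg = [[800], [1000], [1200], [1500]]
y_reg = [100.0, 130.0, 160.0, 200.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[1100]]))        # a numeric value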

The video emphasizes the importance of understanding these concepts theoretically before moving
on to practical applications, highlighting the interdisciplinary nature and growing popularity of the
field.

REGRESSION INTRODUCTION:
o Regression is a statistical method used to understand and predict the relationship
between variables.

o It's used to understand the relationships between multiple variables and make
predictions.

o The concept of regression, particularly linear regression, has been around since the
19th century.

 Variables: Dependent and Independent

o A variable is a quantitative metric or value, like height or weight.

o Dependent Variable: The value we want to predict.

o Independent Variable: The value used to make predictions.

o Examples:

 Predicting salary based on years of experience. (Years of experience is independent, salary is dependent.)

 Predicting exam scores based on study hours. (Study hours are independent, exam score is dependent.)

 Predicting resale price of a vehicle based on its age. (Vehicle age is independent, resale price is dependent.)

 Representing Relationships Graphically

o Relationships between variables can be shown graphically.

o The independent variable is usually placed on the x-axis.

o The dependent variable is placed on the y-axis.

o The relationship can be positive (both variables increase together).

o Or negative (one variable increases as the other decreases).


POSITIVE REGRESSION
(Chart: Price on the y-axis against Day 1–Day 4 on the x-axis; both increase together.)

NEGATIVE REGRESSION
(Chart: Price on the y-axis against Day 1–Day 4 on the x-axis; price decreases as the days increase.)

 Types of Regression

o Linear Regression: Finds the relationship between two variables.

o Multiple Linear Regression: Involves multiple independent variables and one dependent variable.

o Polynomial Regression: Used when the relationship between variables is non-linear.

o The line may not be straight, and can take various curved forms.
POLYNOMIAL REGRESSION
(Chart: Price on the y-axis against Hour 1–Hour 4 on the x-axis; the fitted line is curved rather than straight.)

 Importance of Understanding Basic Mathematics

o It's crucial to understand the underlying math before using tools and libraries.

o The next video will cover linear regression with examples and numerical problems.

LINEAR REGRESSION:

o Regression shows the relationship between variables.

o Linear regression shows the relationship between two variables, one dependent and
the other independent, to make predictions.

 Linear Regression Equation

o The equation for linear regression is given as:

 y = mx + b

 Where:

 y is the dependent variable.

 x is the independent variable.

 m is the slope of the line.

 b is the intercept.

 Project Example: Predicting Pizza Prices

o The project involves predicting pizza prices based on their diameter.

o Data is collected, cleaned, and then used for calculations.

 Mean = Sum Of Values / Number Of Values, e.g., 13 + 23 + 24 = 60, Mean = 60/3 = 20.
 Deviation = Value – Mean, e.g., 13 – 20 = -7, so Deviation = -7.
Diameter (X) In Inches   Price (Y)   Mean(X)   Mean(Y)   Deviation(X)   Deviation(Y)   Product Of Deviations   Square Of Deviation For X
8                        10          10        13        -2             -3             (-2) x (-3) = 6         (-2) x (-2) = 4
10                       13          10        13        0              0              0 x 0 = 0               0 x 0 = 0
12                       16          10        13        2              3              2 x 3 = 6               2 x 2 = 4

Sum Of Product Of Deviations = 6 + 0 + 6 = 12          Sum Of Squared Deviations For X = 4 + 0 + 4 = 8

 Calculations

o Calculate the mean of x (diameters) and y (prices).

o Determine the deviation of each x and y value from their respective means.

o Calculate the product of these deviations.

o Find the sum of the product of deviations.

o Square the deviations of x.

 Calculating Slope (m) and Intercept (b)

o Slope (m) is calculated as:

 m = (Sum of Product of Deviations) / (Sum of Squared Deviations of x).

o Intercept (b) is calculated as:

 b = (Mean of y) - m * (Mean of x).

m = 12 / 8 = 1.5

It Means If We Change The ‘X’ By 1 Unit, The Change In ‘Y’ Will Be 1.5.

Now,

b = 13 – 1.5 * 10 = -2

 Visualization and Prediction

o The calculated values are used to visualize the data on a graph.

o Using the equation, predictions can be made. For example, for a 20-inch pizza:

 y = 1.5 * 20 - 2 = 28.
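
The hand calculation above can be checked with a few lines of plain Python; this sketch simply re-applies the mean/deviation formulas from the notes to the pizza data:

# Pizza data from the notes: diameter in inches (x) and price (y).
diameters = [8, 10, 12]
prices = [10, 13, 16]

mean_x = sum(diameters) / len(diameters)   # 10
mean_y = sum(prices) / len(prices)         # 13

# Sum of products of deviations and sum of squared deviations of x.
sum_prod = sum((x - mean_x) * (y - mean_y) for x, y in zip(diameters, prices))  # 12
sum_sq_x = sum((x - mean_x) ** 2 for x in diameters)                            # 8

m = sum_prod / sum_sq_x      # slope = 1.5
b = mean_y - m * mean_x      # intercept = -2.0

print(m, b)                  # 1.5 -2.0
print(m * 20 + b)            # predicted price of a 20-inch pizza: 28.0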

 Real-Life Data and Errors

o In real life, data points may not perfectly align on a straight line, leading to errors or
outliers.

o These errors represent the difference between actual values and predictions.

1. Scatter Plot of Data Points:

 This is the primary graph used to visualize the relationship between the independent and
dependent variables.
 In the video's pizza example, the x-axis represents the diameter of the pizzas (independent
variable), and the y-axis represents the price (dependent variable).

 Each data point on the graph is plotted based on the diameter and price of a pizza from the
dataset. So, if a pizza is 12 inches in diameter and costs $14, a point would be plotted at (12,
14).

 The overall pattern of these points visually suggests whether there's a linear relationship (or
not) between the variables.

2. Linear Regression Line:

 This is the "best-fit" straight line drawn through the scatter plot.

 The line represents the linear equation (y = mx + b) that the regression model has calculated.

 It aims to minimize the overall distance between the line and all the data points.

 This line is crucial because it's what we use to make predictions. For example, to predict the
price of a pizza with a diameter not in the original dataset, you'd find that diameter on the x-
axis, trace upwards to the regression line, and then across to the y-axis to read the predicted
price.

3. Table of Data:

 The video uses a table to organize the data used for the calculations. This table typically
includes:

o Diameter (x-values)

o Price (y-values)

o Columns for intermediate calculations, such as:

 Mean of x

 Mean of y

 Deviation of x from its mean (x - mean of x)

 Deviation of y from its mean (y - mean of y)

 Product of the deviations of x and y

 Squared deviations of x

How to Recreate These:

1. Gather the Data: Watch the video and note down the specific diameters and prices of the
pizzas used in the example. This is your (x, y) data pairs.

2. Scatter Plot:

o Draw x and y axes.

o Label the x-axis "Diameter" and the y-axis "Price."

o Plot each (x, y) data pair as a point on the graph.


3. Calculate the Regression Line:

o Follow the formulas explained in the video to calculate the slope (m) and intercept
(b) of the linear regression line.

4. Draw the Regression Line:

o Use the equation y = mx + b.

o Choose two diameter values (x-values) within the range of your graph.

o Plug those x-values into the equation to calculate the corresponding y-values.

o Plot these two (x, y) points on your graph.

o Draw a straight line through these two points. This is your regression line.

5. Create the Table:

o Set up the table with the columns mentioned above.

o Fill in the diameter and price data.

o Calculate the values for the other columns step-by-step.
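
A possible sketch of steps 2–4, assuming matplotlib is installed and reusing the slope (1.5) and intercept (-2) calculated earlier:

import matplotlib.pyplot as plt

diameters = [8, 10, 12]     # example data from the notes
prices = [10, 13, 16]
m, b = 1.5, -2              # slope and intercept from the earlier calculation

plt.scatter(diameters, prices, label="Data points")        # step 2: scatter plot

xs = [min(diameters), max(diameters)]                       # step 4: pick two x-values
ys = [m * x + b for x in xs]                                 # corresponding y-values
plt.plot(xs, ys, color="red", label="y = 1.5x - 2")          # draw the regression line

plt.xlabel("Diameter")
plt.ylabel("Price")
plt.legend()
plt.show()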

LOGISTIC REGRESSION:

o Logistic regression is used in supervised learning for classification models.

o It predicts classes, such as whether an email is spam or not spam.

o It helps in deciding if something belongs to a certain category.

 Core Concepts

o Similar to linear regression, it uses independent and dependent variables.

o The dependent variable is what you predict.

o Independent variables help in making predictions and can be multiple.

o The dependent variable is categorical (e.g., binary data like 0 or 1, yes or no).

o The goal is to predict whether something will happen or not.

 Example: Predicting Exam Results

o Using study hours to predict exam results (pass or fail).

o If a student studies for 5 hours, predict if they will pass or fail.

o Zero represents fail, and one represents pass.

Study Hours Exam Result


2 0
3 0
4 0
5 1
6 1
7 1
8 1

o The model finds the probability of passing (e.g., 0.8 means likely to pass).

 Sigmoid Function

o The sigmoid function is used:

o Formula: y = 1 / (1 + e^(-z))

 y is the predicted variable.

 The sigmoid function ensures values are between 0 and 1.

 e is a mathematical constant.

 z is calculated as a0 + a1 * x.

 a0 is the intercept.

 a1 is the coefficient.

 x is the independent variable (e.g., study hours).

 Understanding Intercept and Coefficient

o Intercept (a0): The value of y when x is zero.


o Coefficient (a1): The effect on y for a one-unit increase in x.

 Applying the Model

o Exam result is either 0 or 1 (pass or fail).

o 0.5 acts as a threshold to divide the output into two classes.

o Values above 0.5 indicate a higher chance of passing.

o Values below 0.5 indicate a higher chance of failing.

 Example Calculation

o Training data is used to predict outcomes.

o If a student studies 5 hours (x = 5).

o Values for a0 and a1 are needed.

o These values are found using Maximum Likelihood Estimation and cost function
optimization.

o Example values: a0 = -1.5, a1 = 0.6.

o Calculation: z = -1.5 + 0.6 * 5 = 1.5.

o y = 1 / (1 + e^(-1.5)) ≈ 0.818.

o Result: Approximately 81.8% chance of passing.

 Another Example

o If a student studies 1.5 hours (x = 1.5).

o Calculation: z = -1.5 + 0.6 * 1.5 = -0.6.

o y = 1 / (1 + e^(0.6)) ≈ 0.35.

o Result: Approximately 35% chance of passing, likely to fail.
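
Both worked examples can be reproduced with a short sigmoid sketch in Python (a0 = -1.5 and a1 = 0.6 are the example values from the notes; in practice they would come from maximum likelihood estimation):

import math

def sigmoid(z):
    # Squashes any real z into the range (0, 1).
    return 1 / (1 + math.exp(-z))

a0, a1 = -1.5, 0.6   # example intercept and coefficient from the notes

for hours in (5, 1.5):
    z = a0 + a1 * hours
    p = sigmoid(z)
    print(hours, round(p, 3), "pass" if p >= 0.5 else "fail")
# 5 hours   -> z = 1.5,  p ≈ 0.818 -> pass
# 1.5 hours -> z = -0.6, p ≈ 0.354 -> fail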

 Key Difference from Linear Regression

o Logistic regression predicts categorical outcomes.

o Linear regression predicts continuous values (e.g., price of a house).

o Example: House size vs. house price in linear regression.

o House price is continuous, not a yes/no category.

 Conclusion

o Logistic regression classifies data into categories.

o It uses the sigmoid function to predict probabilities.

LINEAR REGRESSION VS LOGISTIC REGRESSION:


 Both Linear and Logistic Regression fall under supervised learning, but they serve different
purposes.

 Linear Regression is used for predicting numeric values, while Logistic Regression is used for
classification.

 Linear Regression

 It is used to predict numeric values.

 An example is predicting stock prices or house prices based on size.

 Both independent and dependent variables are continuous.

 The equation for Linear Regression is: y = a_0 + a_1x_1, where:

o y is the dependent variable.

o x_1 is the independent variable.

o a_0 is the intercept.

o a_1 is the coefficient indicating how much y increases for a unit increase in x.

 Logistic Regression

 It is used for classification, predicting the class an item belongs to.

 Examples include email spam detection or predicting if a student will pass or fail.

 Independent variables are continuous, but the dependent variable is discrete (categorical).

 Values are category-based, often in the form of yes/no or 0/1.

 It uses a sigmoid function to bring values between 0 and 1.

 The sigmoid function transforms continuous data into a 0 or 1 category.

 Key Differences Summarized

 Linear Regression predicts continuous numeric values; Logistic Regression classifies data into
categories.

 Dependent variables in Linear Regression are continuous; in Logistic Regression, they are
discrete.

 Linear Regression uses a linear equation; Logistic Regression uses a sigmoid function.

k-NN INTRODUCTION

 The k-NN algorithm is used for predicting classes and is like a "Hello World" program for
classification

 It classifies data points based on the classes of their nearest neighbors.

 Core Concept
 k-NN involves taking a vote from the nearest neighbors to make a decision.

 The distance between data points is calculated to determine proximity.

 Euclidean distance formula: Used to calculate the distance between data points.

o The formula is not explicitly shown in the video, but it's a standard formula: √((x₂ -
x₁)² + (y₂ - y₁)²).

 Example: Predicting Movie Genre

 The video uses an example of predicting the genre of a movie (Barbie) based on its IMDB
rating and duration, using a dataset of other movies.

IMDB Rating Duration Genre


8.0 (Mission Impossible) 160 Action
6.2 (Gadar 2) 170 Action
7.2 (Rocky & Rani) 168 Comedy
8.2 (OMG 2) 155 Comedy

Predicting The Genre Of The ‘Barbie’ Movie With IMDB Rating 7.4 And Duration 114 Minutes.

 Steps Explained

 Data Collection: Gathering data with IMDB ratings, durations, and genres for existing movies.

 Distance Calculation: Using the Euclidean distance formula to find the distance between the
Barbie movie and other movies in the dataset.

 Selecting Nearest Neighbors: Choosing the 'k' nearest neighbors based on the calculated
distances.

 Majority Voting: Determining the genre of the Barbie movie based on the majority genre
among its nearest neighbors.

STEP I: CALCULATE DISTANCES:

Calculate the Euclidean Distance Between the new movie and each movie in the dataset.


Distance To (8.0, 160) = √((7.4 − 8.0)² + (114 − 160)²) = √(0.36 + 2116) ≈ 46.00

Distance To (6.2, 170) = √((7.4 − 6.2)² + (114 − 170)²) = √(1.44 + 3136) ≈ 56.01

Distance To (7.2, 168) = √((7.4 − 7.2)² + (114 − 168)²) = √(0.04 + 2916) ≈ 54.00

Distance To (8.2, 155) = √((7.4 − 8.2)² + (114 − 155)²) = √(0.64 + 1681) ≈ 41.01

STEP II: SELECT NEAREST NEIGHBORS:

Now if we set K to 1:

K = 1 means we look at only one neighbor, the nearest one. The nearest neighbor is the movie at distance 41.01, which is OMG 2, whose genre is Comedy. So we can say that the Barbie movie has the genre "Comedy", which is correct.

STEP III: MAJORITY VOTING (CLASSIFICATION):


Now if we set K to 3:

K = 3

In that case the three nearest distances are 41.01, 46.00, and 54.00. Two of these movies have the Comedy genre and one has the Action genre, which means the Barbie movie is most likely Comedy.
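
A small Python sketch of the three steps (distances, nearest neighbours, majority vote) on the movie table above might look like this:

import math
from collections import Counter

# (IMDB rating, duration in minutes, genre) from the table above.
movies = [
    (8.0, 160, "Action"),   # Mission Impossible
    (6.2, 170, "Action"),   # Gadar 2
    (7.2, 168, "Comedy"),   # Rocky & Rani
    (8.2, 155, "Comedy"),   # OMG 2
]
barbie = (7.4, 114)

def euclidean(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

k = 3
# Step I + II: sort by distance to Barbie and keep the k nearest.
neighbours = sorted(movies, key=lambda m: euclidean(barbie, (m[0], m[1])))[:k]
# Step III: majority vote among their genres.
genre, votes = Counter(m[2] for m in neighbours).most_common(1)[0]
print(genre, votes)   # Comedy, with 2 of the 3 nearest neighbours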

NAIVE BAYES ALGORITHM:

 Definition: Naive Bayes is a supervised learning method used for classification, where the
goal is to predict which class a particular item belongs to. A common example is classifying
emails as spam or not spam.

 Based on Bayes' Theorem: The method is based on Bayes' Theorem, which relies heavily on
probability.

 Key Assumption: Naive Bayes assumes that the variables used for classification are
independent of each other.

Bayes' Theorem Explained with an Example:

 Basic Probability: If a bag contains 2 red and 3 black balls, the probability of picking a red
ball is 2/5.

 Total Probability: If there are two bags, Bag 1 (2 red, 3 black) and Bag 2 (4 red, 3 black), the
probability of picking a red ball involves considering the probability of choosing each bag
(1/2 for each) and then the probability of a red ball from that bag (2/5 from Bag 1, 4/7 from
Bag 2).
Bag 1: 2 RED, 3 BLACK   P(E1) = ½   Probability Of Red Ball = 2/5
Bag 2: 4 RED, 3 BLACK   P(E2) = ½   Probability Of Red Ball = 4/7

Total Probability Of Red Ball = ½ x 2/5 + ½ x 4/7

E1 and E2 are the events of choosing Bag 1 and Bag 2, each with probability ½.

 Reverse Probability: Bayes' Theorem comes into play when you know the outcome (e.g., a
red ball was picked) and want to find the probability of it coming from a specific source (e.g.,
Bag 1).

 Formula: If you want to calculate the probability of event A given event X, you use P(A|X) =
P(X|A) * P(A) / P(X).

MATHEMATICS:

P(Y|X) = P(X|Y) × P(Y) / P(X)

Consider 'Y' as the source (Bag 1) in this equation, and 'X' as the red ball.

Now this was for a single variable. In a real-life scenario you might have multiple variables, for which the equation becomes:

P(Y|X1, X2, X3, ..., Xn) = P(X1|Y) × P(X2|Y) × P(X3|Y) × ... × P(Xn|Y) × P(Y) / (P(X1) × P(X2) × P(X3) × ... × P(Xn))

P(N|X) = P(X|N) × P(N) / P(X)

Here consider 'N' as the source (Bag 2) in this equation, and 'X' as the red ball.

Now for multiple variables:

P(N|X1, X2, X3, ..., Xn) = P(X1|N) × P(X2|N) × P(X3|N) × ... × P(Xn|N) × P(N) / (P(X1) × P(X2) × P(X3) × ... × P(Xn))
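
The two-bag example can be checked numerically; this short sketch computes the total probability of drawing a red ball and then the reverse probability that it came from Bag 1:

# Two bags, each chosen with probability 1/2.
p_bag1, p_bag2 = 1/2, 1/2
p_red_given_bag1 = 2/5   # Bag 1: 2 red, 3 black
p_red_given_bag2 = 4/7   # Bag 2: 4 red, 3 black

# Total probability of drawing a red ball.
p_red = p_bag1 * p_red_given_bag1 + p_bag2 * p_red_given_bag2

# Bayes' theorem: probability the red ball came from Bag 1.
p_bag1_given_red = p_red_given_bag1 * p_bag1 / p_red
print(round(p_red, 3), round(p_bag1_given_red, 3))   # ≈ 0.486 and ≈ 0.412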

Naive Bayes in Practice:

 Multiple Variables: In real-life scenarios, multiple variables may influence the outcome.
Naive Bayes considers all these variables, assuming they are independent.

 Classification: To classify whether something belongs to a class (e.g., whether an email is spam), Naive Bayes calculates the probability of it belonging to that class and not belonging to it, considering all variables.

 Formula with Multiple Variables: If you have variables x1, x2, etc., the formula becomes
more complex but follows the same principle as Bayes' Theorem.

 Simplification: Since the denominator in the Bayes' Theorem formula remains constant
when comparing probabilities, it can be ignored. The focus is on the numerator to determine
which class has a higher probability.

Example: Predicting Fever

 Scenario: Given data about whether people have COVID or the flu (x1, x2) and whether they
have a fever, predict if a new person with the flu and COVID will have a fever.

 Steps:

o Calculate the probability of having a fever (Yes) and not having a fever (No) based on
the existing data.

o Calculate conditional probabilities, like the probability of having COVID and a fever.

o Use these probabilities in the Naive Bayes formula to calculate the probability of
fever (Yes) and fever (No) for a person with the flu and COVID.

o Compare the probabilities. The higher probability indicates the predicted class.

 Prediction: Based on the calculations, if the probability of "fever: Yes" is higher, the model
predicts that the person will have a fever.
Important Points:

 Naive Bayes classifies based on the highest probability, but real-world data might have
outliers and errors.

DECISION TREE ALGORITHM:

 Definition: Decision Tree is an algorithm mainly used for classification tasks, but it can also
be used for regression.

 Real-life example: Deciding whether to pursue higher studies as a fourth-year student.

 Process:

o The model is given a sample (training) dataset.

o Based on the data, the model becomes capable of making decisions.

 Example scenario:

o Check placement status: Placed or Not Placed.

o If placed: No (don't go for higher studies).

o If not placed: Check if the student gave the GATE exam.

o If GATE exam given: Check if qualified.

o If qualified: Yes (go for higher studies).

o If not qualified: No (don't go for higher studies).


 Further elaboration: You can add more attributes like package value (high or low) to make
more informed decisions.

 Tree Structure:

o Consists of nodes (vertices) and edges.

o The root node is chosen based on mathematical principles.

o Splitting and selection are based on Entropy and Information Gain.

 Entropy: Measures the impurity of the dataset.

 Information Gain: Measures how much information a feature provides about the class.

 Higher Information Gain means less impurity, so that attribute is chosen as the root node.

Weather
    Sunny → Temperature
        Hot → No
        Mild → Yes
        Cold → Yes
    Cloudy → Yes
    Rain → Yes

 Example:

o Check the weather (Sunny, Cloudy, Rain).

o If Cloudy: Yes (play football).

o If Rain: Yes (play football).

o If Sunny: Check temperature (Hot, Mild, Cold).

o If Hot: No (don't play).

o If Mild or Cold: Yes (play).

 Outcomes: Outcomes (Yes/No) are represented on the leaf nodes.

 Pruning: Removing unnecessary branches to reduce complexity.


 Edges: Help in decision-making.

 Further Learning: The next video will explain the mathematics behind splitting, Entropy, and
Information Gain.

DECISION TREE ID3 ALGORITHM

 Overview of Decision Trees and ID3 Algorithm

o Decision trees are primarily used for classification to determine the class to which
something belongs.

o The ID3 algorithm is used to create a decision tree, explaining the process step by
step with mathematical calculations.

 Key Concepts: Entropy and Information Gain

o The two main calculations needed are entropy and information gain, which are used
to draw the decision tree.

o Entropy: Measures the impurity of the dataset.

o Information Gain: Measures how much information a feature provides about the
class. Higher Information Gain means less impurity.

 Data Set Example

o The example data set includes weather (sunny, cloudy, rain), temperature (hot, mild,
cold), humidity (high, normal), and wind (strong, weak).

o Based on these attributes, the goal is to predict whether a game (football or tennis)
should be played (yes/no).

DAY Weather Temperature Humidity Wind Play Football?


Day 1 Sunny Hot High Weak No
Day 2 Sunny Hot High Strong No
Day 3 Cloudy Hot High Weak Yes
Day 4 Rain Mild High Weak Yes
Day 5 Rain Cool Normal Weak Yes
Day 6 Rain Cool Normal Strong No
Day 7 Cloudy Cool Normal Strong Yes
Day 8 Sunny Mild High Weak No
Day 9 Sunny Cool Normal Weak Yes
Day 10 Rain Mild Normal Weak Yes
Day 11 Sunny Mild Normal Strong Yes
Day 12 Cloudy Mild High Strong Yes
Day 13 Cloudy Hot Normal Weak Yes
Day 14 Rain Mild High Strong No

 Steps to Create the Decision Tree


1. Root Node Selection: Calculate the entropy and information gain for each attribute
(weather, temperature, humidity, wind). The attribute with the highest information
gain (lowest impurity) becomes the root node.

2. Entropy Calculation:

 First, calculate the entropy of the entire dataset. For example, if there are
nine 'yes' and five 'no' outcomes in the dataset, the entropy is calculated
using a specific formula.

 Formula: The formula involves ratios of positive and negative outcomes to the total number of rows, using logarithmic calculations.

 Entropy(S) = - Σ pᵢ log₂ pᵢ, where S is the dataset, and pᵢ is the proportion of each class in the dataset.

3. Information Gain Calculation: Calculate the information gain for each attribute:

 For the weather attribute, calculate entropy for each value (sunny, cloudy,
rain).

 Count the occurrences of 'yes' and 'no' for each weather condition.

 Use these counts to calculate the entropy for each weather condition using
the entropy formula.

 Calculate the information gain for the weather attribute using the formula:
Entropy of the entire dataset minus the weighted sum of entropies for each
weather condition.

 Gain(S, A) = Entropy(S) - Σ [ (|Sᵥ| / |S|) * Entropy(Sᵥ) ], where A is the attribute, Sᵥ are the subsets of S for each value v of A, and | | denotes the number of elements.

 Repeat this process for all other attributes (temperature, humidity, and
wind).

 Select the attribute with the highest information gain as the root node.

MATHEMATICALLY:

CALCULATING THE INFORMATION GAIN OF WEATHER:

STEP I: ENTROPY OF THE ENTIRE DATASET:

Entropy(S) {+9, -5} = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.94

STEP II: ENTROPY OF EACH WEATHER VALUE:

Entropy(Sunny) {+2, -3} = -(2/5) log₂(2/5) - (3/5) log₂(3/5) = 0.97
Entropy(Cloudy) {+4, -0} = -(4/4) log₂(4/4) - (0/4) log₂(0/4) = 0
Entropy(Rain) {+3, -2} = -(3/5) log₂(3/5) - (2/5) log₂(2/5) = 0.97

Information Gain = Entropy(Entire Dataset) - (5/14)·Ent(Sunny) - (4/14)·Ent(Cloudy) - (5/14)·Ent(Rain)
= 0.94 - 0.346 - 0 - 0.346 = 0.246

CALCULATING THE INFORMATION GAIN OF TEMPERATURE:

STEP I: ENTROPY OF THE ENTIRE DATASET:

Entropy(S) {+9, -5} = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.94

STEP II: ENTROPY OF EACH TEMPERATURE VALUE:

Entropy(Hot) {+2, -2} = -(2/4) log₂(2/4) - (2/4) log₂(2/4) = 1.0
Entropy(Mild) {+4, -2} = -(4/6) log₂(4/6) - (2/6) log₂(2/6) = 0.918
Entropy(Cool) {+3, -1} = -(3/4) log₂(3/4) - (1/4) log₂(1/4) = 0.811

Information Gain = Entropy(Entire Dataset) - (4/14)·Ent(Hot) - (6/14)·Ent(Mild) - (4/14)·Ent(Cool)
= 0.94 - 0.286 - 0.393 - 0.232 = 0.029

CALCULATING THE INFORMATION GAIN OF HUMIDITY:

STEP I: ENTROPY OF THE ENTIRE DATASET:

Entropy(S) {+9, -5} = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.94

STEP II: ENTROPY OF EACH HUMIDITY VALUE:

Entropy(High) {+3, -4} = -(3/7) log₂(3/7) - (4/7) log₂(4/7) = 0.985
Entropy(Normal) {+6, -1} = -(6/7) log₂(6/7) - (1/7) log₂(1/7) = 0.592

Information Gain = Entropy(Entire Dataset) - (7/14)·Ent(High) - (7/14)·Ent(Normal)
= 0.94 - 0.493 - 0.296 = 0.15

CALCULATING THE INFORMATION GAIN OF WIND:

STEP I: ENTROPY OF THE ENTIRE DATASET:

Entropy(S) {+9, -5} = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.94

STEP II: ENTROPY OF EACH WIND VALUE:

Entropy(Strong) {+3, -3} = -(3/6) log₂(3/6) - (3/6) log₂(3/6) = 1.0
Entropy(Weak) {+6, -2} = -(6/8) log₂(6/8) - (2/8) log₂(2/8) = 0.811

Information Gain = Entropy(Entire Dataset) - (6/14)·Ent(Strong) - (8/14)·Ent(Weak)
= 0.94 - 0.429 - 0.464 ≈ 0.048

COMPARING THE INFORMATION GAIN:

Gain (S, Weather) = 0.246

Gain (S, Temp) = 0.029

Gain (S, Humidity) = 0.15

Gain (S, Wind) = 0.0478
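
A compact Python sketch of the same entropy and information-gain calculation over the 14-day table (the results match the hand calculations above, up to rounding):

import math
from collections import Counter

# (weather, temperature, humidity, wind, play football?) for Days 1-14.
data = [
    ("Sunny",  "Hot",  "High",   "Weak",   "No"),
    ("Sunny",  "Hot",  "High",   "Strong", "No"),
    ("Cloudy", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",   "Mild", "High",   "Weak",   "Yes"),
    ("Rain",   "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",   "Cool", "Normal", "Strong", "No"),
    ("Cloudy", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",  "Mild", "High",   "Weak",   "No"),
    ("Sunny",  "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",   "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",  "Mild", "Normal", "Strong", "Yes"),
    ("Cloudy", "Mild", "High",   "Strong", "Yes"),
    ("Cloudy", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",   "Mild", "High",   "Strong", "No"),
]
ATTRIBUTES = {"Weather": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    # Entropy(S) = - sum over classes of p * log2(p), using the yes/no column.
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, index):
    # Gain(S, A) = Entropy(S) - sum over values v of (|Sv| / |S|) * Entropy(Sv).
    remainder = 0.0
    for value in set(row[index] for row in rows):
        subset = [row for row in rows if row[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for name, index in ATTRIBUTES.items():
    print(name, round(information_gain(data, index), 3))
# Weather ≈ 0.247, Temperature ≈ 0.029, Humidity ≈ 0.152, Wind ≈ 0.048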

We have found the root node (Weather). Now, which attribute becomes the next node under the Sunny branch? Again, information gain decides, this time calculated on the Sunny subset shown below.

DAY     WEATHER  TEMPERATURE  HUMIDITY  WIND    PLAY FOOTBALL?
Day 1   Sunny    Hot          High      Weak    No
Day 2   Sunny    Hot          High      Strong  No
Day 8   Sunny    Mild         High      Weak    No
Day 9   Sunny    Cool         Normal    Weak    Yes
Day 11  Sunny    Mild         Normal    Strong  Yes

Now let's calculate the information gain within the Sunny subset:

CALCULATING THE INFORMATION GAIN OF TEMPERATURE:

STEP I: ENTROPY OF THE SUNNY DATASET:

Entropy(S_Sunny) {+2, -3} = -(2/5) log₂(2/5) - (3/5) log₂(3/5) = 0.97

STEP II: ENTROPY OF EACH TEMPERATURE VALUE:

Entropy(Hot) {+0, -2} = -(0/2) log₂(0/2) - (2/2) log₂(2/2) = 0
Entropy(Mild) {+1, -1} = -(1/2) log₂(1/2) - (1/2) log₂(1/2) = 1.0
Entropy(Cool) {+1, -0} = -(1/1) log₂(1/1) - (0/1) log₂(0/1) = 0

Information Gain = Entropy(Sunny) - (2/5)·Ent(Hot) - (2/5)·Ent(Mild) - (1/5)·Ent(Cool) = 0.97 - 0 - 0.4 - 0 = 0.57

CALCULATING THE INFORMATION GAIN OF HUMIDITY:

STEP I: ENTROPY OF THE SUNNY DATASET:

Entropy(S_Sunny) {+2, -3} = -(2/5) log₂(2/5) - (3/5) log₂(3/5) = 0.97

STEP II: ENTROPY OF EACH HUMIDITY VALUE:

Entropy(High) {+0, -3} = -(0/3) log₂(0/3) - (3/3) log₂(3/3) = 0
Entropy(Normal) {+2, -0} = -(2/2) log₂(2/2) - (0/2) log₂(0/2) = 0

Information Gain = Entropy(Sunny) - (3/5)·Ent(High) - (2/5)·Ent(Normal) = 0.97 - 0 - 0 = 0.97

CALCULATING THE INFORMATION GAIN OF WIND:

STEP I: ENTROPY OF THE SUNNY DATASET:

Entropy(S_Sunny) {+2, -3} = -(2/5) log₂(2/5) - (3/5) log₂(3/5) = 0.97

STEP II: ENTROPY OF EACH WIND VALUE:

Entropy(Strong) {+1, -1} = -(1/2) log₂(1/2) - (1/2) log₂(1/2) = 1.0
Entropy(Weak) {+1, -2} = -(1/3) log₂(1/3) - (2/3) log₂(2/3) = 0.918

Information Gain = Entropy(Sunny) - (2/5)·Ent(Strong) - (3/5)·Ent(Weak) = 0.97 - 0.4 - 0.551 = 0.019

COMPARING THE INFORMATION GAIN:

Gain (Ssunny, Temp) = 0.57

Gain (Ssunny, Humidity) = 0.97

Gain (Ssunny, Wind) = 0.019

So, Our Next Node Is Going To Be Humidity.

Now for the Rain branch, let's calculate which attribute is going to be the next node:

DAY     WEATHER  TEMPERATURE  HUMIDITY  WIND    PLAY FOOTBALL?
Day 4   Rain     Mild         High      Weak    Yes
Day 5   Rain     Cool         Normal    Weak    Yes
Day 6   Rain     Cool         Normal    Strong  No
Day 10  Rain     Mild         Normal    Weak    Yes
Day 14  Rain     Mild         High      Strong  No

CALCULATING THE INFORMATION GAIN OF TEMPERATURE:

Entropy(S_Rain) {+3, -2} = -(3/5) log₂(3/5) - (2/5) log₂(2/5) = 0.97
Entropy(Mild) {+2, -1} = 0.918, Entropy(Cool) {+1, -1} = 1.0
Information Gain = Entropy(Rain) - (3/5)·Ent(Mild) - (2/5)·Ent(Cool) = 0.97 - 0.551 - 0.4 = 0.019

CALCULATING THE INFORMATION GAIN OF HUMIDITY:

Entropy(High) {+1, -1} = 1.0, Entropy(Normal) {+2, -1} = 0.918
Information Gain = Entropy(Rain) - (2/5)·Ent(High) - (3/5)·Ent(Normal) = 0.97 - 0.4 - 0.551 = 0.019

CALCULATING THE INFORMATION GAIN OF WIND:

Entropy(Strong) {+0, -2} = 0, Entropy(Weak) {+3, -0} = 0
Information Gain = Entropy(Rain) - (2/5)·Ent(Strong) - (3/5)·Ent(Weak) = 0.97 - 0 - 0 = 0.97
COMPARING THE INFORMATION GAIN:

Gain (Srain, Temp) = 0.019

Gain (Srain, Humidity) = 0.019

Gain (Srain, Wind) = 0.97

So, the Next Node Will Be Wind.

So, The Tree Becomes Something Like This As Shown Below:

Weather
    Sunny → Humidity
        High → No
        Normal → Yes
    Cloudy → Yes
    Rain → Wind
        Strong → No
        Weak → Yes

 Building the Tree


o In the example, weather has the highest information gain and becomes the root
node.

o The branches from the weather node are sunny, cloudy, and rain.

o If all outcomes for a branch are the same (e.g., all 'yes' for cloudy), that branch leads
to a leaf node with that outcome.

o For branches with mixed outcomes (e.g., sunny), further calculations are needed.

 Further Steps for Mixed Outcomes

o For the 'sunny' branch, consider the subset of data where the weather is sunny.

o Repeat the entropy and information gain calculations for the remaining attributes
(temperature, humidity, wind) within this subset.

o Select the attribute with the highest information gain to create the next node under
the 'sunny' branch.

o Continue this process until all branches lead to leaf nodes with 'yes' or 'no'
outcomes.

o The same process is repeated for the 'rain' branch.

 Final Decision Tree

o The final decision tree will have the weather as the root node, with branches for
sunny, cloudy, and rain.

o Further nodes and branches are created based on calculations, leading to leaf nodes
indicating whether to play the game (yes/no).

 Prediction

o To use the tree, input the weather conditions, and follow the branches to reach a
leaf node, which provides the prediction.

CONDITIONAL PROBABILITY:

 Basic Probability: The likelihood of a particular event occurring. For example, when tossing a coin,
the probability of getting heads is 1/2.

 Conditional Probability: The chances of an event happening given that another event has already
occurred.

 Example: Tossing a coin three times. We Get The Sample Space as:

S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}

Event E: at least two tails appear. Event F: the first coin shows heads.

o Probability of event E: 1/2.

o Probability of event F: 1/2.


 Conditional probability of E given F (E/F) means finding the probability of getting at least two
tails, knowing the first toss was heads. i.e. HTT

 Formula for Conditional Probability:

 P(E/F) = P(E ∩ F) / P(F)

o P(E ∩ F): Probability of both E and F occurring. In the example, it's 1/8.

o P(F): Probability of F occurring, which is 1/2 .

o Therefore, P(E/F) = (1/8) / (1/2) = 1/4.
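
The same result can be verified by brute-force enumeration of the sample space:

from itertools import product

sample_space = list(product("HT", repeat=3))           # the 8 equally likely outcomes
E = [s for s in sample_space if s.count("T") >= 2]      # at least two tails
F = [s for s in sample_space if s[0] == "H"]            # first coin shows heads
E_and_F = [s for s in E if s in F]                      # only ('H', 'T', 'T')

p_F = len(F) / len(sample_space)                        # 1/2
p_E_and_F = len(E_and_F) / len(sample_space)            # 1/8
print(p_E_and_F / p_F)                                  # P(E|F) = 0.25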

ENSEMBLE LEARNING INTRODUCTION:

 Definition of Ensemble Learning: Ensemble learning refers to a group of predictors or prediction algorithms.

 Core Concept: Instead of relying on a single expert for predictions, ensemble learning
involves multiple experts. Data is provided to these experts, their outputs are combined, and
a voting system determines the most likely outcome.

 Advantages:

o Improved Accuracy: Ensemble learning enhances the accuracy of predictions.

 Disadvantages:

o Increased Computational Cost: Using multiple predictors increases the computational cost.

 Applications: Ensemble learning is widely used in various fields, including:

o Remote sensing

o Medicine

o Cybersecurity

o Fraud detection

o Image analysis

 Techniques:

o Bootstrapping: Creating random subsets from a large dataset.

o Boosting: A method used in ensemble learning.

o Stacking: A method used in ensemble learning.

o Voting: A method used in ensemble learning.

 Implementation:

o Random Forest: Combines multiple decision trees for more accurate predictions,
especially with large datasets.
 Process:

o A sample dataset is given to multiple classifiers (e.g., KNN, Logistic Regression). Each
provides an output.

o Instead of relying on a single output, ensemble learning combines these outputs and
uses voting to determine the most frequent outcome.

 Data Handling: Ensemble learning can involve giving the entire dataset to multiple predictors
or using random subsets for each predictor.
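
As a rough sketch of the voting idea, assuming scikit-learn (the tiny dataset here is hypothetical), several classifiers can be combined so that the most frequent prediction wins:

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [2, 2], [3, 3], [0, 1], [3, 2]]   # hypothetical toy data
y = [0, 0, 1, 1, 0, 1]

# Several "experts" each predict a class; hard voting keeps the most frequent answer.
voter = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("logreg", LogisticRegression()),
        ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
    ],
    voting="hard",
)
voter.fit(X, y)
print(voter.predict([[2, 1]]))   # the majority vote of the three classifiers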


K-MEANS CLUSTERING:

 Concept of Clustering: The core idea is to group similar data points into clusters. Data
points within the same cluster share similarities, while those in different clusters have dissimilar
characteristics.

 Steps for K-Means Clustering:

 Define Clusters: Decide on the number of clusters you want to create (e.g., two clusters, k1
and k2).

 Initialize Centroids: Assign initial centroid values. You can randomly assign any value from
the dataset to each cluster. For example, customer c1 (20, 500) is assigned to k1, and c2 (40,
1000) to k2.

 Calculate Distance: Use the Euclidean distance formula to determine how close each data
point is to the centroids

o Euclidean Distance Formula: √((x₂ - x₁)² + (y₂ - y₁)²)

 Where (x₂, y₂) is the observed value, and (x₁, y₁) is the centroid value.
 Assign to Cluster: Assign each data point to the nearest cluster based on the calculated
distances. For example, if c3 is closer to k2, it's assigned there.

 Update Centroids: After assigning a data point, update the centroid of the affected cluster by
calculating the mean of the values in that cluster.

o Centroid Update Formula: New Centroid = (Sum of values in the cluster) / (Number
of values in the cluster)

 For example, if c3 (30, 800) joins k2 (40, 1000), the new centroid for k2 is
((40+30)/2, (1000+800)/2) = (35, 900).

 Iterate: Repeat the distance calculation and centroid update steps until the clusters stabilize
and no more changes occur.

 Real-life Example: The video uses customer segmentation based on online shopping behavior to
illustrate the K-means clustering method.

SR. NO AGE AMOUNT


C1 20 500
C2 40 1000
C3 30 800
C4 18 300
C5 28 1200
C6 35 1400
C7 45 1800

K1 K2

C1(20, 500) C2(40, 1000)

Now we have two clusters, K1 And K2, I randomly threw C1 and C2 Into it. Now For C3, We Will
Calculate Using Distance Formula:

Distance (Cluster K1): √((30 - 20)² + (800 - 500)²) = √90,100 ≈ 300.17

Distance (Cluster K2): √((30 - 40)² + (800 - 1000)²) = √40,100 ≈ 200.25

So, C3 Should Be Assigned To Cluster K2.

K1                               K2
C1(20, 500)                      C2(40, 1000)
                                 C3(30, 800)
                                 Mean(X) = (40 + 30)/2 = 35
                                 Mean(Y) = (1000 + 800)/2 = 900
                                 New centroid of K2 = (35, 900)

Now Lets See C4:


Distance (Cluster K1): √((18 - 20)² + (300 - 500)²) = √40,004 ≈ 200.01

Distance (Cluster K2): √((18 - 35)² + (300 - 900)²) = √360,289 = 600.24

So, C4 Should Be Assigned To K1 As:

K1                               K2
C1(20, 500)                      C2(40, 1000)
C4(18, 300)                      C3(30, 800)
Mean(X) = (20 + 18)/2 = 19       Mean(X) = (40 + 30)/2 = 35
Mean(Y) = (500 + 300)/2 = 400    Mean(Y) = (1000 + 800)/2 = 900
New centroid of K1 = (19, 400)   Centroid of K2 stays at (35, 900)
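
A short Python sketch of this assign-then-update loop, using the customer table above and continuing past where the notes stop (C5–C7):

import math

customers = {
    "C1": (20, 500), "C2": (40, 1000), "C3": (30, 800), "C4": (18, 300),
    "C5": (28, 1200), "C6": (35, 1400), "C7": (45, 1800),
}

def euclidean(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def centroid(points):
    # Mean of the x values and mean of the y values.
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

# Initialise the two clusters with C1 and C2, as in the notes.
clusters = {"K1": [customers["C1"]], "K2": [customers["C2"]]}
centroids = {name: centroid(points) for name, points in clusters.items()}

# Assign each remaining customer to the nearest centroid, then update that centroid.
for name in ["C3", "C4", "C5", "C6", "C7"]:
    point = customers[name]
    nearest = min(centroids, key=lambda k: euclidean(point, centroids[k]))
    clusters[nearest].append(point)
    centroids[nearest] = centroid(clusters[nearest])
    print(name, "->", nearest, "new centroid:", centroids[nearest])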
