Machine Learning Notes
SUPERVISED LEARNING:
Example: An analogy is made to teaching a child by showing images of an apple, a ball, a cat, and
a dog to create mental associations based on shape, color, and texture.
Labeled Data: Supervised learning uses labeled data, where the system is told what each
piece of data represents, so it can learn to understand and classify information.
Data Collection: Gathering labeled data, such as emails marked as spam or not spam.
Model Training: Using methods like classification (e.g., K-Nearest Neighbors, decision trees)
to train the model to understand the data.
Testing: Evaluating the model's accuracy by testing it with a portion of the data to see if it
can correctly classify information.
Techniques Used:
Classification: Categorizing data into predefined classes, such as identifying emails as spam
or not spam.
Regression: Working with continuous numerical data to predict values, like predicting house
prices based on size.
Data Collection: Gathering data on house sizes and their corresponding prices.
Training: Training the model to understand the relationship between size and price, using
methods like linear regression.
Testing: Providing the model with a house size to predict its price.
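A minimal sketch of this collect → train → test workflow, assuming scikit-learn is available; the house sizes and prices below are made-up placeholder values, not data from the video:

```python
# Minimal supervised-learning workflow: collect, train, test (assumed scikit-learn API).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Data collection: hypothetical house sizes (sq. ft.) and prices (in $1000s).
sizes = np.array([[600], [800], [1000], [1200], [1500], [1800], [2000], [2400]])
prices = np.array([150, 190, 240, 280, 350, 420, 460, 560])

# Hold back part of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(sizes, prices, test_size=0.25, random_state=0)

# Model training: fit a linear regression to learn the size -> price relationship.
model = LinearRegression().fit(X_train, y_train)

# Testing: check how well the model predicts the held-out prices (R^2 score).
print("R^2 on test data:", model.score(X_test, y_test))

# Prediction: estimate the price of a 1,300 sq. ft. house.
print("Predicted price:", model.predict([[1300]])[0])
```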
Common Algorithms:
The video emphasizes the importance of understanding these concepts theoretically before moving
on to practical applications, highlighting the interdisciplinary nature and growing popularity of the
field.
REGRESSION INTRODUCTION:
o Regression is a statistical method used to understand and predict the relationship
between variables.
o It's used to understand the relationships between multiple variables and make
predictions.
o The concept of regression, particularly linear regression, has been around since the
19th century.
o Examples:
Predicting exam scores based on study hours. (Study hours are the independent
variable; the exam score is the dependent variable.)
[Chart: regression example — Price plotted over Day 1 to Day 4]
NEGATIVE REGRESSION
[Chart: negative regression — Price plotted over Day 1 to Day 4]
Types of Regression
o The line may not be straight, and can take various curved forms.
POLYNOMIAL REGRESSION
[Chart: polynomial regression — Price plotted over Hour 1 to Hour 4, fitted with a curved line]
o It's crucial to understand the underlying math before using tools and libraries.
o The next video will cover linear regression with examples and numerical problems.
LINEAR REGRESSION:
o Linear regression shows the relationship between two variables, one dependent and
the other independent, to make predictions.
y = mx + b
Where:
m is the slope, and
b is the intercept.
Mean = Sum of Values / Number of Values, e.g. 13 + 23 + 24 = 60, Mean = 60 / 3 = 20.
Deviation = Value − Mean, e.g. 13 − 20 = −7, so the deviation is −7.
Diameter (X, inches) | Price (Y) | Deviation(X) | Deviation(Y) | Product of Deviations | Square of Deviation(X)
8                    | 10        | −2           | −3           | −2 × −3 = 6           | −2 × −2 = 4
10                   | 13        | 0            | 0            | 0 × 0 = 0             | 0 × 0 = 0
12                   | 16        | 2            | 3            | 2 × 3 = 6             | 2 × 2 = 4
Mean(X) = 10, Mean(Y) = 13
Sum of Products of Deviations = 6 + 0 + 6 = 12
Sum of Squares of Deviations for X = 4 + 0 + 4 = 8
Calculations
o Determine the deviation of each x and y value from their respective means.
m = Sum of Products of Deviations / Sum of Squares of Deviations for X = 12 / 8 = 1.5
This means that if x changes by 1 unit, y changes by 1.5.
Now,
b = Mean(Y) − m × Mean(X) = 13 − 1.5 × 10 = −2
o Using the equation, predictions can be made. For example, for a 20-inch pizza:
y = 1.5 × 20 − 2 = 28 (see the code sketch after this list).
o In real life, data points may not perfectly align on a straight line, leading to errors or
outliers.
o These errors represent the difference between actual values and predictions.
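A short sketch of these same calculations in plain Python, using the three pizza data points from the table above; it reproduces m = 1.5, b = −2, and the prediction of 28 for a 20-inch pizza:

```python
# Slope and intercept of a simple linear regression, computed by hand
# from the pizza example: diameters (x) and prices (y).
xs = [8, 10, 12]
ys = [10, 13, 16]

mean_x = sum(xs) / len(xs)          # 10
mean_y = sum(ys) / len(ys)          # 13

# Deviations from the means, their products, and squared x-deviations.
sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # 12
sum_squares_x = sum((x - mean_x) ** 2 for x in xs)                       # 8

m = sum_products / sum_squares_x    # 1.5
b = mean_y - m * mean_x             # -2.0

print(f"m = {m}, b = {b}")
print("Predicted price of a 20-inch pizza:", m * 20 + b)   # 28.0
```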
1. Scatter Plot:
This is the primary graph used to visualize the relationship between the independent and
dependent variables.
In the video's pizza example, the x-axis represents the diameter of the pizzas (independent
variable), and the y-axis represents the price (dependent variable).
Each data point on the graph is plotted based on the diameter and price of a pizza from the
dataset. So, if a pizza is 12 inches in diameter and costs $14, a point would be plotted at (12,
14).
The overall pattern of these points visually suggests whether there's a linear relationship (or
not) between the variables.
This is the "best-fit" straight line drawn through the scatter plot.
The line represents the linear equation (y = mx + b) that the regression model has calculated.
It aims to minimize the overall distance between the line and all the data points.
This line is crucial because it's what we use to make predictions. For example, to predict the
price of a pizza with a diameter not in the original dataset, you'd find that diameter on the x-
axis, trace upwards to the regression line, and then across to the y-axis to read the predicted
price.
3. Table of Data:
The video uses a table to organize the data used for the calculations. This table typically
includes:
o Diameter (x-values)
o Price (y-values)
o Mean of x
o Mean of y
o Squared deviations of x
1. Gather the Data: Watch the video and note down the specific diameters and prices of the
pizzas used in the example. These are your (x, y) data pairs.
2. Scatter Plot:
o Follow the formulas explained in the video to calculate the slope (m) and intercept
(b) of the linear regression line.
o Choose two diameter values (x-values) within the range of your graph.
o Plug those x-values into the equation to calculate the corresponding y-values.
o Draw a straight line through these two points. This is your regression line.
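One possible way to draw both the scatter plot and the regression line, assuming matplotlib is installed and reusing the pizza data and the m = 1.5, b = −2 line from above:

```python
import matplotlib.pyplot as plt

# Pizza data: diameters (x) and prices (y).
xs = [8, 10, 12]
ys = [10, 13, 16]
m, b = 1.5, -2          # slope and intercept from the earlier calculation

# Scatter plot of the raw data points.
plt.scatter(xs, ys, label="Data points")

# Regression line: pick two x-values, compute y = m*x + b, draw a straight line through them.
line_x = [6, 14]
line_y = [m * x + b for x in line_x]
plt.plot(line_x, line_y, color="red", label="Regression line (y = 1.5x - 2)")

plt.xlabel("Diameter (inches)")
plt.ylabel("Price")
plt.legend()
plt.show()
```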
LOGISTIC REGRESSION:
Core Concepts
o The dependent variable is categorical (e.g., binary data like 0 or 1, yes or no).
o The model finds the probability of passing (e.g., 0.8 means likely to pass).
Sigmoid Function
o Formula: y = 1 / (1 + e^(−z))
e is a mathematical constant (Euler's number, ≈ 2.718).
z is calculated as a0 + a1 * x.
a0 is the intercept.
a1 is the coefficient.
Example Calculation
o These values are found using Maximum Likelihood Estimation and cost function
optimization.
o For z = 1.5: y = 1 / (1 + e^(−1.5)) ≈ 0.818.
Another Example
o For z = −0.6: y = 1 / (1 + e^(0.6)) ≈ 0.35.
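A tiny sketch of the sigmoid calculation; it simply evaluates y = 1 / (1 + e^(−z)) for the two z values implied by the examples above (z = 1.5 and z = −0.6):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# z = a0 + a1 * x would normally come from the fitted intercept and coefficient.
print(round(sigmoid(1.5), 3))    # ≈ 0.818 -> likely to pass (probability > 0.5)
print(round(sigmoid(-0.6), 3))   # ≈ 0.354 -> likely to fail (probability < 0.5)
```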
Conclusion
Linear Regression is used for predicting numeric values, while Logistic Regression is used for
classification.
Linear Regression
o a_1 is the coefficient indicating how much y increases for a unit increase in x.
Logistic Regression
Examples include email spam detection or predicting if a student will pass or fail.
Independent variables are continuous, but the dependent variable is discrete (categorical).
Linear Regression predicts continuous numeric values; Logistic Regression classifies data into
categories.
Dependent variables in Linear Regression are continuous; in Logistic Regression, they are
discrete.
Linear Regression uses a linear equation; Logistic Regression uses a sigmoid function.
k-NN INTRODUCTION
The k-NN algorithm is used for predicting classes and is often described as the "Hello World" of
classification algorithms.
Core Concept
k-NN involves taking a vote from the nearest neighbors to make a decision.
Euclidean distance formula: Used to calculate the distance between data points.
o The formula is not explicitly shown in the video, but it's a standard formula: √((x₂ -
x₁)² + (y₂ - y₁)²).
The video uses an example of predicting the genre of a movie (Barbie) based on its IMDB
rating and duration, using a dataset of other movies.
Predicting The Genre Of The ‘Barbie’ Movie With IMDB Rating 7.4 And Duration 114 Minutes.
Steps Explained
Data Collection: Gathering data with IMDB ratings, durations, and genres for existing movies.
Distance Calculation: Using the Euclidean distance formula to find the distance between the
Barbie movie and other movies in the dataset.
Selecting Nearest Neighbors: Choosing the 'k' nearest neighbors based on the calculated
distances.
Majority Voting: Determining the genre of the Barbie movie based on the majority genre
among its nearest neighbors.
Calculate the Euclidean Distance Between the new movie and each movie in the dataset.
Distance to (8.0, 160) = √((7.4 − 8.0)² + (114 − 160)²) = √(0.36 + 2116) ≈ 46.00
K = 1 means we look at only the single nearest neighbor. The nearest movie is the one at distance
41.00, which is OMG 2, whose genre is Comedy. So we can say that the Barbie movie has the genre
"Comedy", which is correct.
K = 3
In that case the three nearest distances are 41.00, 46.00, and 54.00. Two of these movies have the
Comedy genre and one has the Action genre, so by majority vote the Barbie movie is most likely a Comedy.
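A rough k-NN sketch in plain Python. The full movie table from the video is not reproduced in these notes, so the entries below are placeholders chosen only so that the three nearest distances come out near the 41.00, 46.00, and 54.00 values mentioned above; the Barbie query point (7.4, 114) is from the notes:

```python
import math
from collections import Counter

# Hypothetical training data: (IMDB rating, duration in minutes, genre).
movies = [
    (7.6, 155, "Comedy"),   # ≈ 41.0 away from the query point
    (8.0, 160, "Comedy"),   # ≈ 46.0 away
    (7.0, 168, "Action"),   # ≈ 54.0 away
    (6.5, 190, "Action"),   # farther away
]

def euclidean(a, b):
    """Straight-line distance between two (rating, duration) points."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def knn_predict(query, data, k):
    # Sort movies by distance to the query, take the k nearest, and vote.
    neighbors = sorted(data, key=lambda m: euclidean(query, (m[0], m[1])))[:k]
    genres = [genre for _, _, genre in neighbors]
    return Counter(genres).most_common(1)[0][0]

barbie = (7.4, 114)
print(knn_predict(barbie, movies, k=3))   # majority genre among the 3 nearest neighbors
```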
NAIVE BAYES:
Definition: Naive Bayes is a supervised learning method used for classification, where the
goal is to predict which class a particular item belongs to. A common example is classifying
emails as spam or not spam.
Based on Bayes' Theorem: The method is based on Bayes' Theorem, which relies heavily on
probability.
Key Assumption: Naive Bayes assumes that the variables used for classification are
independent of each other.
Basic Probability: If a bag contains 2 red and 3 black balls, the probability of picking a red
ball is 2/5.
Total Probability: If there are two bags, Bag 1 (2 red, 3 black) and Bag 2 (4 red, 3 black), the
probability of picking a red ball involves considering the probability of choosing each bag
(1/2 for each) and then the probability of a red ball from that bag (2/5 from Bag 1, 4/7 from
Bag 2).
Total Probability: P(Red) = P(Bag 1) × P(Red | Bag 1) + P(Bag 2) × P(Red | Bag 2) = (1/2)(2/5) + (1/2)(4/7) = 17/35 ≈ 0.486.
Reverse Probability: Bayes' Theorem comes into play when you know the outcome (e.g., a
red ball was picked) and want to find the probability of it coming from a specific source (e.g.,
Bag 1).
Formula: If you want to calculate the probability of event A given event X, you use P(A|X) =
P(X|A) * P(A) / P(X).
MATHEMATICS:
P(Y | X) = P(X | Y) × P(Y) / P(X)
In this equation, consider ‘Y’ as the source (Bag 1) and ‘X’ as the event of drawing a red ball.
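A quick numeric check of this formula on the two-bag example, computing the total probability of a red ball and then the reverse probability P(Bag 1 | Red):

```python
from fractions import Fraction as F

# Priors: each bag is equally likely to be chosen.
p_bag1, p_bag2 = F(1, 2), F(1, 2)

# Likelihoods: probability of a red ball from each bag.
p_red_given_bag1 = F(2, 5)   # Bag 1: 2 red, 3 black
p_red_given_bag2 = F(4, 7)   # Bag 2: 4 red, 3 black

# Total probability of drawing a red ball.
p_red = p_bag1 * p_red_given_bag1 + p_bag2 * p_red_given_bag2
print(p_red)                                        # 17/35

# Bayes' theorem: probability the red ball came from Bag 1.
p_bag1_given_red = p_red_given_bag1 * p_bag1 / p_red
print(p_bag1_given_red, float(p_bag1_given_red))    # 7/17 ≈ 0.412
```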
This was for a single variable. In real-life scenarios you might have multiple variables, and the equation becomes:
P(Y | x1, x2, …, xn) = P(x1 | Y) × P(x2 | Y) × … × P(xn | Y) × P(Y) / P(x1, x2, …, xn)
Multiple Variables: In real-life scenarios, multiple variables may influence the outcome.
Naive Bayes considers all these variables, assuming they are independent.
Formula with Multiple Variables: If you have variables x1, x2, etc., the formula becomes
more complex but follows the same principle as Bayes' Theorem.
Simplification: Since the denominator in the Bayes' Theorem formula remains constant
when comparing probabilities, it can be ignored. The focus is on the numerator to determine
which class has a higher probability.
Scenario: Given data about whether people have COVID or the flu (x1, x2) and whether they
have a fever, predict if a new person with the flu and COVID will have a fever.
Steps:
o Calculate the probability of having a fever (Yes) and not having a fever (No) based on
the existing data.
o Calculate conditional probabilities, like the probability of having COVID and a fever.
o Use these probabilities in the Naive Bayes formula to calculate the probability of
fever (Yes) and fever (No) for a person with the flu and COVID.
o Compare the probabilities. The higher probability indicates the predicted class.
Prediction: Based on the calculations, if the probability of "fever: Yes" is higher, the model
predicts that the person will have a fever.
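A rough sketch of these steps in plain Python. The actual table from the video is not reproduced in these notes, so the rows below are made-up placeholder records with the same columns (covid, flu, fever); the point is the structure of the calculation, not the numbers:

```python
# Hypothetical records: (has_covid, has_flu, has_fever) — placeholder values only.
data = [
    ("yes", "yes", "yes"),
    ("yes", "no",  "yes"),
    ("no",  "yes", "yes"),
    ("no",  "no",  "no"),
    ("yes", "yes", "yes"),
    ("no",  "yes", "no"),
    ("no",  "no",  "no"),
    ("yes", "no",  "no"),
]

def naive_bayes_score(fever_class, covid, flu):
    """Numerator of Bayes' theorem: P(covid | fever) * P(flu | fever) * P(fever)."""
    rows = [r for r in data if r[2] == fever_class]
    prior = len(rows) / len(data)
    p_covid = sum(r[0] == covid for r in rows) / len(rows)
    p_flu = sum(r[1] == flu for r in rows) / len(rows)
    return prior * p_covid * p_flu

# New person has both covid and flu — compare the two numerators.
score_yes = naive_bayes_score("yes", covid="yes", flu="yes")
score_no = naive_bayes_score("no", covid="yes", flu="yes")
print("fever=yes score:", score_yes)
print("fever=no  score:", score_no)
print("Prediction:", "fever" if score_yes > score_no else "no fever")
```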
Important Points:
Naive Bayes classifies based on the highest probability, but real-world data might have
outliers and errors.
DECISION TREE:
Definition: Decision Tree is an algorithm mainly used for classification tasks, but it can also
be used for regression.
Process:
Example scenario:
Tree Structure:
[Tree diagram: root node 'Weather' with branches leading to leaf nodes No / Yes / Yes]
Example:
Further Learning: The next video will explain the mathematics behind splitting, Entropy, and
Information Gain.
o Decision trees are primarily used for classification to determine the class to which
something belongs.
o The ID3 algorithm is used to create a decision tree, explaining the process step by
step with mathematical calculations.
o The two main calculations needed are entropy and information gain, which are used
to draw the decision tree.
o Entropy: Measures the impurity (randomness) of the data; an entropy of 0 means the data is completely pure.
o Information Gain: Measures how much information a feature provides about the
class. Higher Information Gain means less impurity.
o The example data set includes weather (sunny, cloudy, rain), temperature (hot, mild,
cold), humidity (high, normal), and wind (strong, weak).
o Based on these attributes, the goal is to predict whether a game (football or tennis)
should be played (yes/no).
2. Entropy Calculation:
First, calculate the entropy of the entire dataset. For example, if there are
nine 'yes' and five 'no' outcomes in the dataset, the entropy is calculated
using the formula Entropy = −p(yes)·log₂ p(yes) − p(no)·log₂ p(no).
3. Information Gain Calculation: Calculate the information gain for each attribute:
For the weather attribute, calculate entropy for each value (sunny, cloudy,
rain).
Count the occurrences of 'yes' and 'no' for each weather condition.
Use these counts to calculate the entropy for each weather condition using
the entropy formula.
Calculate the information gain for the weather attribute using the formula:
Entropy of the entire dataset minus the weighted sum of entropies for each
weather condition.
Repeat this process for all other attributes (temperature, humidity, and
wind).
Select the attribute with the highest information gain as the root node.
MATHEMATICALLY:
STEP I: ENTROPY OF THE ENTIRE DATASET:
S{+9, −5} = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.94
STEP II: ENTROPY OF EACH VALUE OF 'WEATHER':
Entropy of Sunny {+2, −3} = −(2/5)·log₂(2/5) − (3/5)·log₂(3/5) = 0.97
Entropy of Cloudy {+4, −0} = −(4/4)·log₂(4/4) − (0/4)·log₂(0/4) = 0
Entropy of Rain {+3, −2} = −(3/5)·log₂(3/5) − (2/5)·log₂(2/5) = 0.97
Information Gain (Weather) = Entropy(Entire Dataset) − (5/14)·Ent(Sunny) − (4/14)·Ent(Cloudy) − (5/14)·Ent(Rain) = 0.246
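A small sketch that reproduces these numbers from the yes/no counts alone (9/5 overall; sunny 2/3, cloudy 4/0, rain 3/2), assuming plain Python with the math module:

```python
import math

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

# Entropy of the whole dataset: 9 'yes', 5 'no'.
dataset_entropy = entropy(9, 5)                     # ≈ 0.94

# Yes/no counts for each value of the 'weather' attribute.
weather = {"sunny": (2, 3), "cloudy": (4, 0), "rain": (3, 2)}

# Information gain = dataset entropy minus the weighted entropy of each value.
total = 14
weighted = sum((p + n) / total * entropy(p, n) for p, n in weather.values())
print("Information gain (weather):", round(dataset_entropy - weighted, 3))   # ≈ 0.246
```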
S{+9, −5} = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.94
STEP II: ENTROPY OF EACH VALUE OF 'TEMPERATURE':
Entropy of Hot {+2, −2} = −(2/4)·log₂(2/4) − (2/4)·log₂(2/4) = 1.0
Entropy of Mild {+4, −2} = −(4/6)·log₂(4/6) − (2/6)·log₂(2/6) = 0.91
Entropy of Cold {+3, −1} = −(3/4)·log₂(3/4) − (1/4)·log₂(1/4) = 0.81
Information Gain (Temperature) = Entropy(Entire Dataset) − (4/14)·Ent(Hot) − (6/14)·Ent(Mild) − (4/14)·Ent(Cold) = 0.029
S{+9, −5} = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.94
STEP II: ENTROPY OF EACH VALUE OF 'HUMIDITY':
Entropy of High {+3, −4} = −(3/7)·log₂(3/7) − (4/7)·log₂(4/7) = 0.98
Entropy of Normal {+6, −1} = −(6/7)·log₂(6/7) − (1/7)·log₂(1/7) = 0.59
Information Gain (Humidity) = Entropy(Entire Dataset) − (7/14)·Ent(High) − (7/14)·Ent(Normal) = 0.15
S{+9, −5} = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.94
STEP II: ENTROPY OF EACH VALUE OF 'WIND':
Entropy of Strong {+3, −3} = −(3/6)·log₂(3/6) − (3/6)·log₂(3/6) = 1.0
Entropy of Weak {+6, −2} = −(6/8)·log₂(6/8) − (2/8)·log₂(2/8) = 0.81
Information Gain (Wind) = Entropy(Entire Dataset) − (6/14)·Ent(Strong) − (8/14)·Ent(Weak) = 0.0478
Weather has the highest information gain (0.246), so it becomes the root node. Which attribute
becomes the next node under the 'sunny' branch? Again, we compare information gains, this time
within the sunny subset.
S{+2, −3} = −(2/5)·log₂(2/5) − (3/5)·log₂(3/5) = 0.97 (entropy of the 'sunny' subset)
STEP II: ENTROPY OF EACH VALUE OF 'TEMPERATURE' WITHIN THE SUNNY SUBSET:
Entropy of Hot {+0, −2} = −(0/2)·log₂(0/2) − (2/2)·log₂(2/2) = 0
Entropy of Mild {+1, −1} = −(1/2)·log₂(1/2) − (1/2)·log₂(1/2) = 1.0
Entropy of Cool {+1, −0} = −(1/1)·log₂(1/1) − (0/1)·log₂(0/1) = 0
Information Gain (Temperature) = Entropy(Sunny) − (2/5)·Ent(Hot) − (2/5)·Ent(Mild) − (1/5)·Ent(Cool) = 0.57
S{+2, −3} = −(2/5)·log₂(2/5) − (3/5)·log₂(3/5) = 0.97
STEP II: ENTROPY OF EACH VALUE OF 'HUMIDITY' WITHIN THE SUNNY SUBSET:
Entropy of High {+0, −3} = −(0/3)·log₂(0/3) − (3/3)·log₂(3/3) = 0
Entropy of Normal {+2, −0} = −(2/2)·log₂(2/2) − (0/2)·log₂(0/2) = 0
Information Gain (Humidity) = Entropy(Sunny) − (3/5)·Ent(High) − (2/5)·Ent(Normal) = 0.97
S{+2, −3} = −(2/5)·log₂(2/5) − (3/5)·log₂(3/5) = 0.97
STEP II: ENTROPY OF EACH VALUE OF 'WIND' WITHIN THE SUNNY SUBSET:
Entropy of Strong {+1, −1} = −(1/2)·log₂(1/2) − (1/2)·log₂(1/2) = 1.0
Entropy of Weak {+1, −2} = −(1/3)·log₂(1/3) − (2/3)·log₂(2/3) = 0.918
Information Gain (Wind) = Entropy(Sunny) − (2/5)·Ent(Strong) − (3/5)·Ent(Weak) = 0.019
Humidity has the highest information gain under the 'sunny' branch (0.97), so it becomes the next node there. The same calculations are then repeated for the 'rain' branch to decide which attribute becomes its next node:
[Tree diagram: 'Weather' at the root, with further splits leading to leaf nodes No / Yes / No / Yes]
o The branches from the weather node are sunny, cloudy, and rain.
o If all outcomes for a branch are the same (e.g., all 'yes' for cloudy), that branch leads
to a leaf node with that outcome.
o For branches with mixed outcomes (e.g., sunny), further calculations are needed.
o For the 'sunny' branch, consider the subset of data where the weather is sunny.
o Repeat the entropy and information gain calculations for the remaining attributes
(temperature, humidity, wind) within this subset.
o Select the attribute with the highest information gain to create the next node under
the 'sunny' branch.
o Continue this process until all branches lead to leaf nodes with 'yes' or 'no'
outcomes.
o The final decision tree will have the weather as the root node, with branches for
sunny, cloudy, and rain.
o Further nodes and branches are created based on calculations, leading to leaf nodes
indicating whether to play the game (yes/no).
Prediction
o To use the tree, input the weather conditions, and follow the branches to reach a
leaf node, which provides the prediction.
CONDITIONAL PROBABILITY:
Basic Probability: The likelihood of a particular event occurring. For example, when tossing a coin,
the probability of getting heads is 1/2.
Conditional Probability: The chances of an event happening given that another event has already
occurred.
Example: Tossing a coin three times. We get the sample space: S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.
Event E: at least two tails appear. Event F: the first coin shows heads.
o P(E ∩ F): Probability of both E and F occurring. In the example, only HTT satisfies both events, so P(E ∩ F) = 1/8.
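A quick way to verify this by enumerating all eight outcomes; it also computes the conditional probability P(E | F) = P(E ∩ F) / P(F):

```python
from itertools import product

# All 8 equally likely outcomes of tossing a coin three times.
sample_space = list(product("HT", repeat=3))        # ('H','H','H'), ('H','H','T'), ...

E = [o for o in sample_space if o.count("T") >= 2]  # at least two tails
F = [o for o in sample_space if o[0] == "H"]        # first coin shows heads
E_and_F = [o for o in sample_space if o in E and o in F]

n = len(sample_space)
print("P(E ∩ F) =", len(E_and_F), "/", n)           # 1 / 8
print("P(F)     =", len(F), "/", n)                 # 4 / 8
print("P(E | F) =", len(E_and_F) / len(F))          # 0.25
```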
ENSEMBLE LEARNING:
Core Concept: Instead of relying on a single expert for predictions, ensemble learning
involves multiple experts. Data is provided to these experts, their outputs are combined, and
a voting system determines the most likely outcome.
Advantages:
Disadvantages:
Applications:
o Remote sensing
o Medicine
o Cybersecurity
o Fraud detection
o Image analysis
Techniques:
Implementation:
o Random Forest: Combines multiple decision trees for more accurate predictions,
especially with large datasets.
Process:
o A sample dataset is given to multiple classifiers (e.g., KNN, Logistic Regression). Each
provides an output.
o Instead of relying on a single output, ensemble learning combines these outputs and
uses voting to determine the most frequent outcome.
Data Handling: Ensemble learning can involve giving the entire dataset to multiple predictors
or using random subsets for each predictor.
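A minimal sketch of this voting idea, assuming scikit-learn's VotingClassifier with k-NN and logistic regression as the individual "experts"; the tiny two-class dataset is synthetic placeholder data:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-feature dataset with two well-separated classes (placeholder values).
X = np.array([[1, 2], [2, 1], [2, 3], [3, 2], [6, 7], [7, 6], [7, 8], [8, 7]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Each classifier is an "expert"; hard voting picks the majority prediction.
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("logreg", LogisticRegression()),
    ],
    voting="hard",
)
ensemble.fit(X, y)

print(ensemble.predict([[2, 2], [7, 7]]))   # expected: [0 1]
```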
K-MEANS CLUSTERING:
Concept of Clustering: The core idea is to group similar data points into clusters. Data
points within the same cluster share similarities, while those in different clusters have dissimilar
characteristics.
Define Clusters: Decide on the number of clusters you want to create (e.g., two clusters, k1
and k2).
Initialize Centroids: Assign initial centroid values. You can randomly assign any value from
the dataset to each cluster. For example, customer c1 (20, 500) is assigned to k1, and c2 (40,
1000) to k2.
Calculate Distance: Use the Euclidean distance formula, d = √((x₂ − x₁)² + (y₂ − y₁)²), to
determine how close each data point is to each centroid, where (x₂, y₂) is the observed value
and (x₁, y₁) is the centroid value.
Assign to Cluster: Assign each data point to the nearest cluster based on the calculated
distances. For example, if c3 is closer to k2, it's assigned there.
Update Centroids: After assigning a data point, update the centroid of the affected cluster by
calculating the mean of the values in that cluster.
o Centroid Update Formula: New Centroid = (Sum of values in the cluster) / (Number
of values in the cluster)
For example, if c3 (30, 800) joins k2 (40, 1000), the new centroid for k2 is
((40+30)/2, (1000+800)/2) = (35, 900).
Iterate: Repeat the distance calculation and centroid update steps until the clusters stabilize
and no more changes occur.
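A small sketch of one assign-and-update step, using the customer values already given in these notes (C1 = (20, 500) and C2 = (40, 1000) as initial centroids, C3 = (30, 800) as the new point):

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Initial centroids, taken directly from the first two customers.
clusters = {"k1": [(20, 500)], "k2": [(40, 1000)]}
centroids = {"k1": (20, 500), "k2": (40, 1000)}

# New customer c3: assign it to the nearest centroid.
c3 = (30, 800)
nearest = min(centroids, key=lambda k: distance(c3, centroids[k]))
clusters[nearest].append(c3)

# Update the affected centroid as the mean of the points now in that cluster.
points = clusters[nearest]
centroids[nearest] = (
    sum(x for x, _ in points) / len(points),
    sum(y for _, y in points) / len(points),
)

print("c3 assigned to:", nearest)                # k2 (distance ≈ 200.2 vs ≈ 300.2)
print("Updated centroid:", centroids[nearest])   # (35.0, 900.0)
```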
Real-life Example: The video uses customer segmentation based on online shopping behavior to
illustrate the K-means clustering method.
Now we have two clusters, K1 and K2; C1 and C2 were assigned to them at random as the initial
centroids. For C3 (30, 800), we calculate its distance to each centroid using the distance formula,
assign it to the nearer cluster (K2), and update that cluster's centroid.
[Diagram: clusters K1 and K2, with C3 (30, 800) being assigned to K2]