Unit 4 - KVR
○ The positive class usually refers to the category we are more interested in predicting
correctly compared to the negative class (e.g., the spam category in email classification
problems).
○ If there are more than two possible labels available, then the technique is known as a
multiclass classifier.
○ For example, in the case of identifying different types of fruits, “Shape”, “Color”, and “Radius” can be the features, and “Apple”, “Orange”, and “Banana” can be the different class labels.
Types of Classifiers
● Deterministic versus Probabilistic (e.g., weather forecasting)
○ A deterministic classifier assigns a discrete-valued label to each data instance it classifies (its predictions are consistent and repeatable), whereas a probabilistic classifier assigns a score between 0 and 1 indicating how likely the instance is to belong to each class.
● Global versus Local
○ A global classifier fits a single model to the entire data set. Unless the model is highly nonlinear, this one-size-fits-all strategy may not be effective when the relationship between the attributes and the class labels varies over the input space.
○ In contrast, a local classifier partitions the input space into smaller regions and fits a distinct model to the training instances in each region.
○ While local classifiers are more flexible in terms of fitting complex decision boundaries, they are also more susceptible to the model overfitting problem, especially when the local regions contain few training examples.
KNN - K Nearest Neighbors
● Predicting whether a person enjoys a movie which he/she has been recommended (as
in the Netflix challenge)
● Identifying patterns in genetic data, for use in detecting specific proteins or diseases
● An object (instance) is classified by a majority vote of the classes of its neighbors
KNN - Calculating distance
● The distance formula involves comparing the values of each feature. For example, to calculate the distance between the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7), the Euclidean formula gives dist = sqrt((6 − 3)² + (4 − 7)²) = sqrt(18) ≈ 4.24.
KNN - Calculating distance
● Manhattan distance
● Euclidean distance
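A minimal sketch of the two distance measures, using the sweetness/crunchiness values from the tomato and green-bean example above (the function names and data layout are illustrative):

```python
import math

# Feature vectors from the example: (sweetness, crunchiness)
tomato = (6, 4)
green_bean = (3, 7)

def euclidean(a, b):
    # square root of the sum of squared feature differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute feature differences
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean(tomato, green_bean))  # sqrt((6-3)^2 + (4-7)^2) = sqrt(18) ≈ 4.24
print(manhattan(tomato, green_bean))  # |6-3| + |4-7| = 6
```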
KNN - Algorithm steps
Step #1 - Assign a value to K.
Step #2 - Calculate the distance between the new data entry and all other
existing data entries, using one of the distance measures above. Arrange them in
ascending order.
Step #3 - Find the K nearest neighbors to the new entry based on the
calculated distances.
Step #4 - Assign the new data entry to the majority class among its K nearest
neighbors (see the sketch below).
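A minimal Python sketch of these four steps (the training examples and class labels here are illustrative assumptions):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(new_point, training_data, k):
    """training_data is a list of (feature_vector, class_label) pairs."""
    # Step 2: order all existing entries by distance to the new entry
    neighbours = sorted(training_data, key=lambda item: euclidean(new_point, item[0]))
    # Step 3: keep the k nearest neighbours
    k_labels = [label for _, label in neighbours[:k]]
    # Step 4: majority vote among their class labels
    return Counter(k_labels).most_common(1)[0][0]

training = [((6, 4), "fruit"), ((3, 7), "vegetable"), ((8, 5), "fruit")]
print(knn_predict((7, 4), training, k=3))   # Step 1: K was set to 3
```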
KNN - Choosing appropriate k
● Deciding how many neighbors to use for kNN determines how well the model will generalize to
future data.
● Choosing a large k reduces the impact of variance caused by noisy data, but can bias the learner
such that it runs the risk of ignoring small but important patterns.
KNN - Choosing appropriate k
● One common practice is to set k equal to the square root of the number of
training examples.
● Once the nearest neighbor list is obtained, the test instance is classified based
on the majority class of its nearest neighbors:
● where v is a class label, yi is the class label for one of the nearest neighbors,
and I(·) is an indicator function that returns the value 1 if its argument is true and
0 otherwise.
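Written out with the symbols defined above, the majority-vote rule takes the standard form:

```latex
y' = \operatorname*{argmax}_{v} \sum_{i=1}^{k} I\!\left(v = y_i\right)
```

where y' is the predicted class label of the test instance.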
Characteristics of Nearest Neighbour
● Lazy learners such as KNN do not need model building; the stored training examples themselves are used to classify test instances
● They require a classification function that returns the predicted class of a test instance based
on its proximity to other instances
● Classifying a test instance can be quite expensive because we need to compute the
proximity values individually between the test and training examples.
Characteristics of Nearest Neighbour
● The decision boundaries have high variability because they depend on the
composition of training examples in the local neighborhood.
● Have difficulty handling missing values in both the training and test sets since
proximity computations normally require the presence of all attributes
Characteristics of Nearest Neighbour
● The presence of irrelevant attributes (adds noise) can distort commonly used
proximity measures, especially when the number of irrelevant attributes is
large
● If there are a large number of redundant attributes that are highly correlated with each
other, the proximity measure can be overly biased toward such attributes, resulting in
improper estimates of distance.
● Nearest-neighbor classifiers can produce wrong predictions unless an appropriate proximity
measure is chosen and suitable data preprocessing steps are taken
KNN - Example numericals
Basics of Probability Theory
● Random event: an event that occurs and will have several outcomes.
○ Ex: Flip a coin – {Tails, Head} Roll the dice – {1,2,3,4,5,6}
Example: Roll two dice and calculate the probability of getting a total of 10.
Solution:
The sample space of all possible outcomes is {(1,1), (1,2), …, (1,6), …, (6,6)}, which contains 36 outcomes.
The favourable events are {(4,6), (5,5), (6,4)}, the outcomes that add up to 10.
So the probability is 3/36 = 1/12.
● Random Experiment: a trial that is repeated multiple times in order to get a well-defined
set of possible outcomes (e.g., tossing a coin).
● A random experiment is a physical situation whose outcome cannot be predicted
until it is observed.
● Sample Space: set of all possible outcomes that result from conducting a random
experiment.
○ For example, the sample space of tossing a fair coin is {heads, tails}.
● Event: set of outcomes of an experiment that forms a subset of the sample space.
○ Types:
■ Independent events
■ Dependent events
■ Mutually exclusive
■ Equally likely
Basics of Probability Theory-- Terminologies
•Independent Events: The events whose outcomes are not affected by the outcomes of other future and/or
past events are called independent events.
•For example, the output of tossing a coin in repetition is not affected by its previous outcome.
•Dependent Events: The events whose outcomes are affected by the outcome of other events are called
dependent events.
•For example, picking oranges from a bag that contains 100 oranges without replacement.
•Mutually Exclusive Events: The events that cannot occur simultaneously are called mutually exclusive
events.
•For example, obtaining a head or a tail in tossing a coin, because both (head and tail) cannot be
obtained together.
•Equally likely Events: The events that have an equal chance or probability of happening are known as
equally likely events.
•For example, each face of a fair die has an equal probability of 1/6 when the die is rolled.
Basics of Probability Theory
1. Discrete random variable: This is a variable that takes a countable number of distinct values.
Discrete random variables are usually counts.
2. Continuous random variable: This is a variable that takes an infinite number of possible values.
Continuous random variables are usually measurements.
Basics of Probability Theory
● Probability simply talks about how likely is the event to occur, and its value
always lies between 0 and 1
● Further, the sum of the probability values of all possible outcomes of a variable X is equal to 1.
○ For example: consider that you have two bags, named A and B, each containing 10 red balls and
10 black balls.
○ In the above case, Bag is a random variable that can take possible values as Bag A and Bag B.
Ball is also a random variable that can take values red and black.
Basics of Probability Theory
● Now, let us consider two random variables, X and Y , that can each take k
discrete values.
● For joint probabilities, it is useful to consider their sum with respect to one of the
random variables, which gives the marginal probability: P(X = xi) = Σj P(X = xi, Y = yj).
● Now imagine the same situation, but the question is: what is the probability that the ball was drawn
from bag A, given that the ball is red?
● Notice that in this question we are already given the colour of the ball and we have to find the
probability that the red ball came from bag A. In the earlier questions, we instead found the
probability of drawing a red ball from bag A.
Basics of Probability Theory
● When we find the probability of the bag (the cause) given the observed evidence (e.g., that the
ball is red), this kind of probability is called the posterior probability.
● The probability of the bag before any evidence is observed is known as the prior probability.
● So far we have considered probabilities of discrete events; for continuous variables, the
corresponding concept is the probability density.
Basics of Probability Theory
● Suppose you have invited two of your friends Alex and Martha to a dinner party.
You know that Alex attends 40% of the parties he is invited to. Further, if Alex is
going to a party, there is an 80% chance of Martha coming along. On the other
hand, if Alex is not going to the party, the chance of Martha coming to the party
is reduced to 30%. If Martha has responded that she will be coming to your
party, what is the probability that Alex will also be coming?
● Bayes theorem presents the statistical principle for answering questions like the
previous one, where evidence from multiple sources has to be combined with
prior beliefs to arrive at predictions.
Bayes Theorem
● Let P(Y |X) denote the conditional probability of observing the random variable
Y whenever the random variable X takes a particular value.
P(Y | X) = P(X, Y) / P(X)
P(X | Y) = P(X, Y) / P(Y)
Bayes Theorem
● Bayes theorem: P(Y | X) = P(X | Y) P(Y) / P(X)
● Bayes theorem provides a relationship between the conditional probabilities P(Y
|X) and P(X|Y ).
● Using the previous expression for P(X), we can obtain the following equation
for P(Y |X) solely in terms of P(X|Y ) and P(Y ):
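Expanding P(X) with the law of total probability, the equation referred to above can be written as:

```latex
P(Y \mid X) \;=\; \frac{P(X \mid Y)\,P(Y)}{\sum_{y} P(X \mid Y = y)\,P(Y = y)}
```

Applied to the dinner-party question above, a short sketch using only the numbers given in the problem statement:

```python
p_alex = 0.4                   # P(Alex attends)
p_martha_given_alex = 0.8      # P(Martha | Alex attends)
p_martha_given_not_alex = 0.3  # P(Martha | Alex does not attend)

# P(Martha) by the law of total probability
p_martha = p_martha_given_alex * p_alex + p_martha_given_not_alex * (1 - p_alex)

# Bayes theorem: P(Alex | Martha)
print(p_martha_given_alex * p_alex / p_martha)   # 0.32 / 0.50 = 0.64
```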
Naive Bayes
● It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Some popular
applications of the Naive Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
● Bayes’ theorem is also known as Bayes’ Rule or Bayes’ law, which is used to determine the probability of a
hypothesis with prior knowledge. It depends on the conditional probability.
● P(B|A) is the Likelihood: the probability of the evidence given that the hypothesis is true.
● P(A) is the Prior Probability: the probability of the hypothesis before observing the evidence.
● P(B) is the Marginal Probability: the probability of the evidence.
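These terms fit together in Bayes’ rule, with P(A|B) as the posterior probability of the hypothesis A given the evidence B:

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\,P(A)}{P(B)}
```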
Naive Bayes
P(Long | Mango) = 0, so
P(X | Mango) = 0 (the product of the three likelihoods contains a zero factor)
Naive Bayes
2.a) P(Yellow | Banana) = (P( Banana | Yellow ) * P(Yellow) )/ P (Banana) = ((400/800) * (800/1200)) / (400/1200)
P(Yellow | Banana) = 1
2.b) P(Sweet | Banana) = (P( Banana | Sweet) * P(Sweet) )/ P (Banana) = ((300/850) * (850/1200)) / (400/1200)
2.c) P(Long | Banana) = (P( Banana | Long ) * P(Long) )/ P (Banana) = ((350/400) * (400/1200)) / (400/1200)
3.a) P(Yellow | Others) = (P( Others| Yellow ) * P(Yellow) )/ P (Others) = ((50/800) * (800/1200)) / (150/1200)
3.b) P(Sweet | Others) = (P( Others| Sweet ) * P(Sweet) )/ P (Others) = ((100/850) * (850/1200)) / (150/1200)
3.c) P(Long | Others) = (P( Others| Long) * P(Long) )/ P (Others) = ((50/400) * (400/1200)) / (150/1200)
So finally, P(X | Mango) = 0, P(X | Banana) ≈ 0.65, and P(X | Others) ≈ 0.074, so the new fruit is classified as Banana.
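A short sketch that reproduces the three class-conditional likelihoods above; the frequency counts are inferred from the fractions shown in the worked example and are treated here as the example’s frequency table:

```python
# counts[class] = {"total": class count, feature: number of fruits of that class with the feature}
counts = {
    "Mango":  {"total": 650, "Yellow": 350, "Sweet": 450, "Long": 0},
    "Banana": {"total": 400, "Yellow": 400, "Sweet": 300, "Long": 350},
    "Others": {"total": 150, "Yellow": 50,  "Sweet": 100, "Long": 50},
}

def likelihood(fruit, features=("Yellow", "Sweet", "Long")):
    # Naive Bayes assumption: features are conditionally independent given the class,
    # so P(X | fruit) is the product of the per-feature conditional probabilities.
    p = 1.0
    for f in features:
        p *= counts[fruit][f] / counts[fruit]["total"]
    return p

for fruit in counts:
    print(fruit, round(likelihood(fruit), 5))   # Mango 0.0, Banana 0.65625, Others 0.07407
```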
• Then the frequency tables are transformed into likelihood tables, and finally the
Naive Bayes equation is used to calculate the posterior probability for each class.
• The class with the highest posterior probability is the outcome of the prediction.
Naive Bayes
In this example we have 4 inputs (predictors). The final posterior probabilities can be normalized
to values between 0 and 1.
Naive Bayes
Now suppose you want to calculate the probability of playing when the weather is overcast.
Probability of playing: P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P(Overcast) = (0.44 × 0.64) / 0.29 ≈ 0.98
So you can determine here that if the weather is overcast, the players will play the sport.
Naive Bayes
2. Calculate the conditional probabilities: P(Overcast | Yes) = 4/9 = 0.44, P(Mild | Yes) = 4/9 = 0.44
3. Put these probabilities in equation (2): P(Weather=Overcast, Temp=Mild | Play=Yes) = 0.44 * 0.44 = 0.1936 (higher)
4. Put the prior and conditional probabilities in equation (1): P(Play=Yes | Weather=Overcast, Temp=Mild) = 0.1936 * 0.64 = 0.124
For the 'No' class:
3. Put the conditional probabilities in equation (4): P(Weather=Overcast, Temp=Mild | Play=No) = 0 * 0.4 = 0
4. Put the prior and conditional probabilities in equation (3): P(Play=No | Weather=Overcast, Temp=Mild) = 0 * 0.36 = 0
The probability of the 'Yes' class is higher, so if the weather is overcast and the temperature is mild, the players will play the sport.
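A small sketch of the same calculation in Python, using the priors and conditional probabilities from the worked example above (9 'Yes' and 5 'No' instances in the standard play-tennis data):

```python
p_yes, p_no = 9 / 14, 5 / 14                  # prior probabilities

p_overcast_yes, p_mild_yes = 4 / 9, 4 / 9     # conditionals for the "Yes" class
p_overcast_no,  p_mild_no  = 0 / 5, 2 / 5     # conditionals for the "No" class

# Naive Bayes scores (proportional to the posteriors; the common
# denominator P(Overcast, Mild) is dropped because it cancels out)
score_yes = p_overcast_yes * p_mild_yes * p_yes   # ≈ 0.127 (≈ 0.124 with the rounded slide values)
score_no  = p_overcast_no  * p_mild_no  * p_no    # 0
print("Play" if score_yes > score_no else "Don't play")
```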
Bayesian Belief Networks
[Figure: a Bayesian belief network with nodes A, B, C and D: D is the parent of C, A is a child of C, B is a descendant of D, and D is an ancestor of A. Each node X in the network has an associated probability table:]
● If X does not have any parents, table contains prior probability P(X)
● If X has only one parent (Y), table contains conditional probability P(X|Y)
● If X has multiple parents (Y1, Y2,…, Yk), table contains conditional probability
P(X|Y1, Y2,…, Yk)
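One simple way to hold such tables in code, sketched for the abstract network above (the probability values are placeholders, not taken from the slides):

```python
# Node D has no parents, so it stores a prior table P(D)
P_D = {True: 0.6, False: 0.4}                 # placeholder values

# Node C has a single parent D, so it stores a conditional table P(C | D)
P_C_given_D = {
    True:  {True: 0.7, False: 0.3},           # P(C | D=True)
    False: {True: 0.2, False: 0.8},           # P(C | D=False)
}

# A node with parents Y1, ..., Yk stores P(X | Y1, ..., Yk):
# one probability distribution per combination of parent values.
```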
Example of Bayesian Belief Network
[Figure: an example Bayesian belief network with nodes Exercise, Diet, Blood Pressure, and Chest Pain]
[Figure: example graph of a logistic regression curve fitted to data. The curve shows the probability of
passing an exam (binary dependent variable) versus hours studying (scalar independent variable).]
Logistic Regression
where w and b are the parameters of the model and a^T denotes the transpose of
a vector a. Note that if w^T x + b > 0, then x belongs to class 1, since its odds are
greater than 1. Otherwise, x belongs to class 0.
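The model behind this decision rule is the standard logistic (sigmoid) form, which is consistent with the odds-based statement above:

```latex
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^{T}\mathbf{x} + b)}},
\qquad
\text{odds} = \frac{P(y = 1 \mid \mathbf{x})}{P(y = 0 \mid \mathbf{x})} = e^{\mathbf{w}^{T}\mathbf{x} + b}
```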
Cont…
The parameters of logistic regression, (w, b), are estimated during training using a
statistical approach known as the maximum likelihood estimation (MLE) method.
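A minimal sketch of fitting such a model with scikit-learn, using made-up hours-studied/pass data in the spirit of the figure caption above (the data values are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: hours studied (feature) and pass = 1 / fail = 0 (label)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()              # (w, b) are estimated by maximum likelihood
model.fit(hours, passed)

print(model.coef_, model.intercept_)      # the learned w and b
print(model.predict_proba([[3.0]]))       # [P(fail), P(pass)] for 3 hours of study
```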
How does Logistic regression work
Cont…
● Types of Logistic Regression
● On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial logistic regression, there can be only two possible types of the dependent
variable, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: the dependent variable can have three or more possible unordered categories.
3. Ordinal: the dependent variable can have three or more possible ordered categories.
● Assumptions of Logistic Regression:
1. Binary dependent variable: the dependent variable is assumed to be binary or dichotomous,
meaning it can take only two values.
2. Linear relationship between the independent variables and the log odds: the relationship between the
independent variables and the log odds of the dependent variable should be linear.
● In the realm of machine learning we can categorize models into two fundamental
groups: Base Models and Ensemble Models.
● Base Models: These are individual machine learning models that are used
independently to make predictions.
○ For example, decision trees, logistic regression, K-nearest neighbors (KNN), Support Vector Machines
(SVM), linear regression, and so on.
● Ensemble models: The idea behind ensemble models is to combine many weak
learners (simple “building block” models) in order to obtain a strong learner (a single
and potentially very powerful model).
Ensemble Methods
● Technically : Construct a set of base classifiers learned from the training data
● Predict class label of test records by combining the predictions made by multiple
classifiers (e.g., by taking majority vote)
Ensemble Methods
Suppose there are 25 base classifiers. Each classifier has error rate, ε = 0.35
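The point of this setup: if the 25 base classifiers err independently and the ensemble takes a majority vote, the ensemble is wrong only when 13 or more of them are wrong. A short check of that number:

```python
from math import comb

eps, n = 0.35, 25
ensemble_error = sum(
    comb(n, i) * eps**i * (1 - eps) ** (n - i)   # P(exactly i of the 25 are wrong)
    for i in range(13, n + 1)                    # 13 or more wrong -> majority vote is wrong
)
print(round(ensemble_error, 3))                  # ≈ 0.06, far below the base error rate of 0.35
```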
• Unstable classifiers: classifiers that are sensitive to minor perturbations (noise) in the training set,
due to high model complexity.
Bias-Variance Decomposition
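For squared error, the expected generalization error of a model can be decomposed in the standard way (a common textbook form, stated here under that assumption; the noise term is the irreducible error of the problem itself):

```latex
\mathbb{E}\left[(y - \hat{f}(x))^{2}\right]
= \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^{2}\right]}_{\text{variance}}
+ \underbrace{\sigma^{2}}_{\text{noise}}
```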
Bias-Variance Trade-off and Overfitting
[Figure: bias-variance trade-off, with the overfitting and underfitting regions labelled]
● Ensemble methods try to reduce the variance of complex models (with low bias)
by aggregating responses of multiple base classifiers
General Approach of Ensemble Learning
Constructing Ensemble Classifiers
● The bootstrap method is used to create multiple subsets of the training data set; multiple base
models are then trained separately on these bootstrapped training sets to obtain different
predictions from the base models.
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
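A short sketch of how bootstrap rounds like the ones above are drawn: each round samples the ten record indices uniformly at random, with replacement, so some records repeat and others are left out.

```python
import random

original = list(range(1, 11))            # the ten record indices 1..10

def bootstrap_sample(data):
    # draw len(data) records with replacement
    return [random.choice(data) for _ in data]

for round_no in range(1, 4):
    print(f"Bagging (Round {round_no}):", bootstrap_sample(original))

# Each base classifier is trained on its own bootstrap sample, and the
# ensemble prediction is the majority vote over the base classifiers.
```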
Bagging Algorithm
Bagging Example
[Figure: each base classifier is a decision stump that tests x ≤ k and predicts y_left on the True branch and y_right on the False branch]
Bagging Example
Bagging Round 1:
x: 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y: 1 1 1 1 -1 -1 -1 -1 1 1
Stump: y = 1 if x <= 0.35, y = -1 if x > 0.35
Bagging Round 2:
x: 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y: 1 1 1 -1 -1 -1 1 1 1 1
Stump: y = 1 if x <= 0.7, y = 1 if x > 0.7
Bagging Round 3:
x: 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y: 1 1 1 -1 -1 -1 -1 -1 1 1
Stump: y = 1 if x <= 0.35, y = -1 if x > 0.35
Bagging Round 4:
x: 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y: 1 1 1 -1 -1 -1 -1 -1 1 1
Stump: y = 1 if x <= 0.3, y = -1 if x > 0.3
Bagging Round 5:
x: 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y: 1 1 1 -1 -1 -1 -1 1 1 1
Stump: y = 1 if x <= 0.35, y = -1 if x > 0.35
Bagging Example
Bagging Round 6:
x: 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1
y: 1 -1 -1 -1 -1 -1 -1 1 1 1
Stump: y = -1 if x <= 0.75, y = 1 if x > 0.75
Bagging Round 7:
x: 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1
y: 1 -1 -1 -1 -1 1 1 1 1 1
Stump: y = -1 if x <= 0.75, y = 1 if x > 0.75
Bagging Round 8:
x: 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1
y: 1 1 -1 -1 -1 -1 -1 1 1 1
Stump: y = -1 if x <= 0.75, y = 1 if x > 0.75
Bagging Round 9:
x: 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1
y: 1 1 -1 -1 -1 -1 -1 1 1 1
Stump: y = -1 if x <= 0.75, y = 1 if x > 0.75
Bagging Example
● Use majority vote (sign of sum of predictions) to determine class of ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted Class (sign of sum) 1 1 1 -1 -1 -1 -1 1 1 1
Boosting
● Boosting uses the entire training data set in a sequential manner to improve the performance of weak
learners.
● First, the whole training data set is used to fit a weak learner.
● Then, the weight of each observation in the training data set is adjusted according to the prediction results.
● Lower weights are given to observations that were classified correctly, while higher weights are given to those
that were classified incorrectly.
● The sequential process of adjusting weights and predictions continues until a stopping criterion is
reached.
● An iterative procedure to adaptively change distribution of training data by focusing more on previously
misclassified records
○ Initially, all N records are assigned equal weights (for being selected for training)
○ Unlike bagging, weights may change at the end of each boosting round
● Records that are wrongly classified will have their weights increased in the next round.
● Records that are classified correctly will have their weights decreased in the next round
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
AdaBoost
If there is one incorrect prediction out of five instances in our dataset, the total error is 1/5,
and the alpha (the performance of the stump) is α = ½ ln((1 − 1/5) / (1/5)) = ½ ln 4 ≈ 0.693.
● Importance of a classifier:
αi = ½ ln( (1 − εi) / εi )
AdaBoost Algorithm
● Weight update: w_i^(j+1) = (w_i^(j) / Z_j) × exp(−α_j) if record i is correctly classified, and
(w_i^(j) / Z_j) × exp(+α_j) if it is misclassified, where Z_j is a normalization factor.
• When a sample is correctly classified, the exponent (−α) is negative, so its weight decreases;
when it is misclassified, the exponent is positive, so its weight increases.
● If any intermediate rounds produce error rate higher than 50%, the weights
are reverted back to 1/n and the resampling procedure is repeated
● Classification:
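A small sketch of one AdaBoost round in NumPy, following the update described above: the importance α is computed from the weighted error, correctly classified instances have their weights scaled down by exp(−α), misclassified ones scaled up by exp(+α), and the final ensemble prediction is the α-weighted vote of the weak classifiers.

```python
import numpy as np

def adaboost_round(weights, correct):
    """weights: current instance weights (summing to 1).
    correct: boolean array, True where the weak classifier was right."""
    eps = weights[~correct].sum()                    # weighted error rate
    alpha = 0.5 * np.log((1 - eps) / eps)            # importance of this classifier
    new_w = weights * np.exp(np.where(correct, -alpha, alpha))
    return alpha, new_w / new_w.sum()                # renormalize (the factor Z)

w = np.full(5, 0.2)                                  # 5 instances, equal initial weights
alpha, w = adaboost_round(w, np.array([True, True, True, True, False]))
print(alpha)   # error 1/5 -> alpha = 0.5 * ln(4) ≈ 0.693
print(w)       # the misclassified instance now carries weight 0.5
```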
AdaBoost Algorithm
AdaBoost Example
[Figure: each weak learner is a decision stump that tests x ≤ k and predicts y_left on the True branch and y_right on the False branch]
AdaBoost Example
Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1
Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1
● Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
AdaBoost Example
● Weights
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009
● Classification
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted Class (sign of sum) 1 1 1 -1 -1 -1 -1 1 1 1
Random Forest Algorithm
Gradient Boosting
Gradient Boosting Algorithm
● Step 1:
● Let’s assume X and Y are the input and target, having N samples. Our goal is to
learn a function f(x) that maps the input features X to the target variable y. The model
is a sum of boosted trees.
● The loss function measures the difference between the actual and the predicted values.
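One common way to write this, assuming the squared-error loss that matches the "difference between actual and predicted" description (an assumption; the slides do not name a specific loss):

```latex
F_{M}(x) = \sum_{m=1}^{M} f_{m}(x),
\qquad
L = \sum_{i=1}^{N} \left(y_{i} - F_{M}(x_{i})\right)^{2}
```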
Hours Studied | Pass (1) / Fail (0)
29 | 0
15 | 0
• Given the class, how can we calculate the probability of passing for a student who studied 33 hours?
Training data for the example (CGPA, three other attributes, and the class label):
<9 | N | G | Moderate | Y
>=9 | N | Avg | M | N
<9 | N | Avg | G | N
>=9 | Y | G | M | Y
>=9 | Y | G | M | Y
● Step 2: Iterate over each weak classifier (there are 4 stumps, i.e., 4 weak learners):
○ A. Train the decision stump HCGPA with a random bootstrap sample from
training dataset.
○ B. Compute the weighted error
○ C. Compute the weight of each weak classifier
○ D. Calculate the normalization factor ZCGPA
○ E. Update the weights of all data instances
● Step 3: Compute the final predicted value for each data instance