Unit 4 - KVR

AIML notes

UNIT IV

● Nearest Neighbour Classifiers - Characteristics

● Naive Bayes Classifier – Theory, formula and problems

● Logistic Regression – as Linear Regression, model parameters, and its


characteristics

● Ensemble Methods – Methods for constructing ensemble classifier,


Bagging, Boosting and Random Forests.
Types of Classifiers
● Binary versus Multiclass
○ Binary classifiers assign each data instance to one of two possible labels, typically
denoted as +1 and −1.

○ The positive class usually refers to the category we are more interested in predicting
correctly compared to the negative class (e.g., the spam category in email classification
problems).

○ If there are more than two possible labels available, then the technique is known as a
multiclass classifier.

○ For example, in the case of identifying different types of fruits, “Shape”, “Color”, and
“Radius” can be features, and “Apple”, “Orange”, and “Banana” can be the different class labels.
Types of Classifiers
● Deterministic versus Probabilistic (weather forecasting)
○ A deterministic classifier produces a discrete-valued label for each data instance it classifies (consistent, accurate and precise).

○ A probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, where the probability scores for all the classes sum up to 1.

● Linear versus Nonlinear


○ A linear classifier uses a linear separating hyperplane to discriminate
instances from different classes whereas a nonlinear classifier enables the
construction of more complex, nonlinear decision surfaces
Types of Classifiers

● Global versus Local


○ A global classifier fits a single model to the entire data set.

○ Unless the model is highly nonlinear, this one-size-fits-all strategy may not be
effective when the relationship between the attributes and the class labels
varies over the input space.

○ In contrast, a local classifier partitions the input space into smaller regions and
fits a distinct model to training instances in each region.

○ While local classifiers are more flexible in terms of fitting complex decision
boundaries, they are also more susceptible to the model overfitting problem,
especially when the local regions contain few training examples
KNN - K Nearest Neighbors

● In a single sentence, nearest neighbor classifiers are defined by their characteristic of


classifying unlabeled examples by assigning them the class of the most similar labeled
examples.

● A powerful classification algorithm used for pattern recognition.

● Computer vision applications, including optical character recognition and facial


recognition in both images and video.

● Predicting whether a person will enjoy a movie that has been recommended to them (as
in the Netflix challenge)

● Identifying patterns in genetic data, for use in detecting specific proteins or diseases

● An object (instance) is classified by a majority vote of its neighbors’ classes
KNN - Calculating distance

● Locating the tomato's nearest neighbors


requires a distance function, or a
formula that measures the similarity
between two instances.

● There are many different ways to


calculate distance.

● Traditionally, the kNN algorithm uses Euclidean distance, which is the distance you would measure if you could use a ruler to connect two points, illustrated in the previous figure by the dotted lines connecting the tomato to its neighbors.
KNN - Calculating distance

● The distance formula involves comparing the values of each feature. For example, to calculate the
distance between the tomato (sweetness = 6, crunchiness = 4), and the green bean (sweetness = 3,
crunchiness = 7), we can use the formula as follows:
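The formula itself appears only as an image in the original slides; a minimal sketch of the same calculation in Python (feature values taken from the text above, variable names are illustrative assumptions):

```python
import math

# Euclidean distance between the tomato (sweetness=6, crunchiness=4)
# and the green bean (sweetness=3, crunchiness=7)
tomato = (6, 4)
green_bean = (3, 7)

dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(tomato, green_bean)))
print(dist)  # sqrt((6-3)^2 + (4-7)^2) = sqrt(18) ≈ 4.24
```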
KNN - Calculating distance

● Manhattan distance: the sum of the absolute differences between the feature values, Σ|xi − yi|

● Euclidean distance: the square root of the sum of squared differences, √( Σ (xi − yi)² )
KNN - Calculating distance
Step #1 - Assign a value to K.

Step #2 - Calculate the distance between the new data entry and all other
existing data entries (you'll learn how to do this shortly). Arrange them in
ascending order.

Step #3 - Find the K nearest neighbors to the new entry based on the
calculated distances.

Step #4 - Assign the new data entry to the majority class in the nearest
neighbors.
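A minimal Python sketch of these four steps (illustrative only; the helper names and the tiny training set are assumptions, not part of the original slides):

```python
from collections import Counter
import math

def euclidean(a, b):
    # Step 2 (distance function): Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Step 2: distance from the new entry to every existing entry, sorted ascending
    distances = sorted(
        ((euclidean(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda d: d[0],
    )
    # Step 3: keep the k nearest neighbours
    neighbours = distances[:k]
    # Step 4: assign the majority class among those neighbours
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical (sweetness, crunchiness) training data in the spirit of the food example
train_X = [(6, 4), (3, 7), (8, 5), (2, 3)]
train_y = ["fruit", "vegetable", "fruit", "protein"]
print(knn_predict(train_X, train_y, query=(6, 4), k=3))  # -> "fruit"
```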
KNN - Choosing appropriate k

● Deciding how many neighbors to use for kNN determines how well the model will generalize to
future data.

● Choosing a large k reduces the impact or variance caused by noisy data, but can bias the learner
such that it runs the risk of ignoring small, but important patterns.
KNN - Choosing appropriate k

● In practice, choosing k depends on the difficulty of the concept to be learned


and the number of records in the training data.

● Typically, k is set somewhere between 3 and 10.

● One common practice is to set k equal to the square root of the number of
training examples.

● In the classifier, we might set k = 4, because there were 15 example


ingredients in the training data and the square root of 15 is 3.87.
KNN - Strengths and weaknesses
Algorithm – K nearest Neighbour
How to Determine the class label of a Test Sample?

● Once the nearest-neighbor list is obtained, the test instance is classified based
on the majority class of its nearest neighbors:

Majority voting: y′ = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi)

● where v is a class label, yi is the class label for one of the nearest neighbors, Dz is the set of nearest neighbors of the test instance, and I(·) is an indicator function that returns the value 1 if its argument is true and 0 otherwise.
Characteristics of Nearest Neighbour

● Instance-based learning: it requires specific training instances to make predictions without having to maintain an abstraction (or model) derived from the data.

● Lazy learners: kNN does not need model building – the work is deferred until a test example has to be classified.

● Require a proximity measure to determine the similarity or distance between instances

● Requires classification function that returns the predicted class of a test instance based
on its proximity to other instances

● Classifying a test instance can be quite expensive because we need to compute the
proximity values individually between the test and training examples.
Characteristics of Nearest Neighbour

● Nearest neighbor classifiers make their predictions based on local information

● Produces decision boundaries of arbitrary shape.

● The decision boundaries have high variability because they depend on the
composition of training examples in the local neighborhood.

● Increasing the number of nearest neighbors may reduce such variability.

● Have difficulty handling missing values in both the training and test sets since
proximity computations normally require the presence of all attributes
Characteristics of Nearest Neighbour

● The presence of irrelevant attributes (adds noise) can distort commonly used
proximity measures, especially when the number of irrelevant attributes is
large

● If there is a large number of redundant attributes that are highly correlated with each
other, the proximity measure can be overly biased toward such attributes, resulting in
improper estimates of distance.

● Can produce wrong predictions unless the appropriate proximity measure and
data preprocessing steps are taken
KNN - Example numericals
KNN

Why is the kNN algorithm called a lazy learner?


● Eager learners follow the general steps of machine learning, i.e.
perform an abstraction of the information obtained from the input data
and then follow it through by a generalization step.

● However, as we have seen in the case of the kNN algorithm, these


steps are completely skipped.

● It stores the training data and directly applies the philosophy of


nearest neighbourhood finding to arrive at the classification.

● So, for kNN, there is no learning happening in the real sense.


Therefore, kNN falls under the category of lazy learner.
Naive Bayes Classifier
Basics of Probability Theory

● Probability theory is a field of mathematics and statistics concerned with finding
the probabilities associated with random events.

● Random event: an event that can result in one of several possible outcomes.
○ Ex: flip a coin – {Heads, Tails}; roll a die – {1, 2, 3, 4, 5, 6}

● Probability theory deals with 1. Random variables and 2. Probability distributions.

● Probability of an event = number of favorable outcomes / total number of possible
outcomes. (Ex: when you roll a die and want a 3, there is one favorable outcome out of
the possible outcomes {1, 2, 3, 4, 5, 6}.)

● So the probability of rolling a 3 is 1/6 ≈ 0.167.


Basics of Probability Theory – Terminologies

Example: Roll two dice. What is the probability of getting a total of 10?

Solution:

The sample space of possible outcomes is {(1,1), (1,2), …, (1,6), …, (6,6)}, which contains 36 outcomes in total.

The favourable events are {(4,6), (5,5), (6,4)}, which all add up to 10.

So the probability of getting a total of 10 is 3/36 = 1/12.


Basics of Probability Theory – Terminologies

● Random Experiment: a trial that is repeated multiple times in order to get a well-defined set of possible outcomes (e.g., tossing a coin).
● A random experiment is a physical situation whose outcome cannot be predicted
until it is observed.

● Sample Space: set of all possible outcomes that result from conducting a random
experiment.
○ For example, the sample space of tossing a fair coin is {heads, tails}.

● Event: set of outcomes of an experiment that forms a subset of the sample space.
○ Types:
■ Independent events
■ Dependent events
■ Mutually exclusive
■ Equally likely
Basics of Probability Theory – Terminologies
•Independent Events: The events whose outcomes are not affected by the outcomes of other future and/or
past events are called independent events.

•For example, the output of tossing a coin in repetition is not affected by its previous outcome.

•Dependent Events: The events whose outcomes are affected by the outcome of other events are called
dependent events.

•For example, picking oranges from a bag that contains 100 oranges without replacement.

•Mutually Exclusive Events: events that cannot occur simultaneously are called mutually exclusive events.

•For example, obtaining a head or a tail in tossing a coin, because both (head and tail) cannot be obtained together.

•Equally likely Events: events that have an equal chance or probability of happening are known as equally likely events.

•For example, observing any face when rolling a die has an equal probability of 1/6.
Basics of Probability Theory

● Random Variables: A random variable is a variable whose possible values are


numerical outcomes of a random experiment.
● There are two types of random variables:
1. Discrete random variable: This is a variable that may take on only a countable number of
distinct values, such as zero, one, two, three, four, etc.

2. Continuous random variable: This is a variable that takes an infinite number of possible values.
Continuous random variables are usually measurements.
Basics of Probability Theory

● Probability simply talks about how likely is the event to occur, and its value
always lies between 0 and 1

● Further, the sum of probability values of all possible events, e.g., outcomes of a
variable X is equal to 1.

● Random variables - variable whose possible values are numerical outcomes of a


random phenomenon.

○ For example: consider that you have two bags, named A and B, each containing 10 red balls and
10 black balls.

○ In the above case, Bag is a random variable that can take possible values as Bag A and Bag B.
Ball is also a random variable that can take values red and black.
Basics of Probability Theory

● Now, let us consider two random variables, X and Y , that can each take k
discrete values.

● Let nij be the number of times we observe X = xi and Y = yj , out of a total


number of N occurrences.

● The joint probability of observing X = xi and Y = yj together can be estimated as P(X = xi, Y = yj) = nij / N.

● Joint probabilities are symmetric, i.e., P(X = x, Y = y) = P(Y = y, X = x)


Basics of Probability Theory

● For joint probabilities, it is useful to consider their sum with respect to one of the
random variables (the marginal probability), e.g., P(X = x) = Σ_y P(X = x, Y = y).

● Now, returning to the two-bag example, suppose we ask: what is the probability that the ball was drawn from bag A, given that the ball is red?

● Notice that in this question we are already given the colour of the ball and have to find the probability that the red ball was drawn from bag A. In earlier questions we instead found the probability of drawing a red ball from bag A.
Basics of Probability Theory

● The probability of a cause (e.g., the bag) computed after observing the evidence (e.g., the colour of the ball) is called the posterior probability.

● The probability of the evidence given the cause (e.g., the probability of drawing a red ball from bag A) is called the likelihood, and the unconditional probability of the cause is called the prior probability.

● These quantities are related by what we call

“Bayes' Theorem”: posterior = (likelihood × prior) / evidence
Basics of Probability Theory

● Until now we have considered probabilities of discrete events, but we also need to handle
continuous variables; for these we work with a “probability density”.
Basics of Probability Theory

● How does this relate to machine learning?

● Consider a case where we’ve got a dataset for different


temperatures over a region for different dates.

● Here we can make predictions on how many water bottles


are required to be stored in that region with the help of a
model.
Bayes Theorem

● Suppose you have invited two of your friends Alex and Martha to a dinner party.
You know that Alex attends 40% of the parties he is invited to. Further, if Alex is
going to a party, there is an 80% chance of Martha coming along. On the other
hand, if Alex is not going to the party, the chance of Martha coming to the party
is reduced to 30%. If Martha has responded that she will be coming to your
party, what is the probability that Alex will also be coming?

● Bayes theorem presents the statistical principle for answering questions like the
previous one, where evidence from multiple sources has to be combined with
prior beliefs to arrive at predictions.
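The slide does not show the numeric answer; a minimal sketch of the calculation (values taken from the statement above):

```python
# Bayes' theorem applied to the dinner-party question
p_alex = 0.4                  # P(Alex attends)
p_martha_given_alex = 0.8     # P(Martha attends | Alex attends)
p_martha_given_no_alex = 0.3  # P(Martha attends | Alex does not attend)

# Law of total probability for the evidence
p_martha = p_martha_given_alex * p_alex + p_martha_given_no_alex * (1 - p_alex)

# Posterior probability that Alex attends, given that Martha is coming
p_alex_given_martha = p_martha_given_alex * p_alex / p_martha
print(p_martha, p_alex_given_martha)  # 0.5 and 0.64
```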
Bayes Theorem

● Let P(Y |X) denote the conditional probability of observing the random variable
Y whenever the random variable X takes a particular value.

● probability of observing Y conditioned on the outcome of X.

● Conditional probabilities of X and Y are related to their joint probability in the


following way:

P(Y | X) = P(X, Y) / P(X)

P(X | Y) = P(X, Y) / P(Y)
Bayes Theorem

● Bayes theorem:

P(Y | X) = P(X | Y) P(Y) / P(X)
● Bayes theorem provides a relationship between the conditional probabilities P(Y
|X) and P(X|Y ).

● The denominator in the above equation involves the marginal probability of X, which
can also be represented as

P(X) = Σi P(X | Y = yi) P(Y = yi)
Bayes Theorem

● Using the previous expression for P(X), we can obtain the following equation
for P(Y | X) solely in terms of P(X | Y) and P(Y):

P(Y | X) = P(X | Y) P(Y) / Σi P(X | Y = yi) P(Y = yi)
Naive Bayes

● It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Some popular
examples of Naive Bayes Algorithm are spam filtration, Sentimental analysis, and classifying articles.

● Bayes’ theorem is also known as Bayes’ Rule or Bayes’ law, which is used to determine the probability of a
hypothesis with prior knowledge. It depends on the conditional probability.

P(A|B) = (P(B|A) * P(A)) / P(B), where P(A|B) is the Posterior probability: probability of hypothesis A given the observed event B.

● P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis is true.
● P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
● P(B) is Marginal Probability: Probability of Evidence.
Naive Bayes

Example: consider 1200 fruits described by three features (Yellow, Sweet, Long). The counts (recovered from the calculations below) are:

Fruit   | Yellow | Sweet | Long | Total
Mango   | 350    | 450   | 0    | 650
Banana  | 400    | 300   | 350  | 400
Others  | 50     | 100   | 50   | 150
Total   | 800    | 850   | 400  | 1200

Question: given a fruit X that is Yellow, Sweet and Long, which class is most likely?

Solution: P(A|B) = (P(B|A) * P(A)) / P(B)

1. Mango: P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango)

1.a) P(Yellow | Mango) = (P(Mango | Yellow) * P(Yellow)) / P(Mango) = ((350/800) * (800/1200)) / (650/1200)

P(Yellow | Mango) = 0.53 → 1

1.b) P(Sweet | Mango) = (P(Mango | Sweet) * P(Sweet)) / P(Mango) = ((450/850) * (850/1200)) / (650/1200)

P(Sweet | Mango) = 0.69 → 2

1.c) P(Long | Mango) = (P(Mango | Long) * P(Long)) / P(Mango) = ((0/400) * (400/1200)) / (650/1200)

P(Long | Mango) = 0 → 3

On multiplying eq 1, 2, 3 ==> P(X | Mango) = 0.53 * 0.69 * 0

P(X | Mango) = 0
Naive Bayes

2. Banana: P(X | Banana) = P(Yellow | Banana) * P(Sweet | Banana) * P(Long | Banana)

2.a) P(Yellow | Banana) = (P( Banana | Yellow ) * P(Yellow) )/ P (Banana) = ((400/800) * (800/1200)) / (400/1200)

P(Yellow | Banana) = 1 → 4

2.b) P(Sweet | Banana) = (P( Banana | Sweet) * P(Sweet) )/ P (Banana) = ((300/850) * (850/1200)) / (400/1200)

P(Sweet | Banana) = .75→ 5

2.c) P(Long | Banana) = (P(Banana | Long) * P(Long)) / P(Banana) = ((350/400) * (400/1200)) / (400/1200)

P(Long | Banana) = 0.875 → 6

On multiplying eq 4,5,6 ==> P(X | Banana) = 1 * .75 * 0.875

P(X | Banana) = 0.6562


Naive Bayes

3. Others: P(X | Others) = P(Yellow | Others) * P(Sweet | Others) * P(Long | Others)

3.a) P(Yellow | Others) = (P( Others| Yellow ) * P(Yellow) )/ P (Others) = ((50/800) * (800/1200)) / (150/1200)

P(Yellow | Others) = 0.34→ 7

3.b) P(Sweet | Others) = (P( Others| Sweet ) * P(Sweet) )/ P (Others) = ((100/850) * (850/1200)) / (150/1200)

P(Sweet | Others) = 0.67 → 8

3.c) P(Long | Others) = (P( Others| Long) * P(Long) )/ P (Others) = ((50/400) * (400/1200)) / (150/1200)

P(Long | Others) = 0.34 → 9

On multiplying eq 7,8,9 ==> P(X | Others) = 0.34 * 0.67* 0.34

P(X | Others) = 0.07742

So finally from P(X | Mango) == 0 , P(X | Banana) == 0.65 and P(X| Others) == 0.07742.

We can conclude Fruit{Yellow,Sweet,Long} is Banana.
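The three class scores can be reproduced directly from the counts; a minimal sketch (counts as implied by the fractions in the calculations above; the dictionary layout is only for illustration):

```python
# Per-class counts for each attribute value and the class totals
counts = {
    "Mango":  {"Yellow": 350, "Sweet": 450, "Long": 0,   "Total": 650},
    "Banana": {"Yellow": 400, "Sweet": 300, "Long": 350, "Total": 400},
    "Others": {"Yellow": 50,  "Sweet": 100, "Long": 50,  "Total": 150},
}

scores = {}
for fruit, c in counts.items():
    # P(Yellow | fruit) * P(Sweet | fruit) * P(Long | fruit), as on the slides
    scores[fruit] = (c["Yellow"] / c["Total"]) * (c["Sweet"] / c["Total"]) * (c["Long"] / c["Total"])

print(scores)                       # Mango: 0.0, Banana: ~0.656, Others: ~0.074
print(max(scores, key=scores.get))  # Banana
```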


Naive Bayes

• The posterior probability can be calculated by first constructing a frequency
table for each attribute against the target.

• Then, transforming the frequency tables into likelihood tables and finally using the
naive Bayes equation to calculate the posterior probability for each class.

• The class with the highest posterior probability is the outcome of prediction.
Naive Bayes

In this example we have 4 inputs (predictors). The final posterior probabilities can be standardized
between 0 and 1.
Naive Bayes

Now suppose you want to calculate the probability of playing when the weather is overcast.

Probability of playing:

P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P (Overcast) .....................(1)

1. Calculate the prior and evidence probabilities: P(Yes) = 9/14 = 0.64, P(Overcast) = 4/14 = 0.29

2. Calculate the likelihood: P(Overcast | Yes) = 4/9 = 0.44

3. Put these values in equation (1)

4. P(Yes | Overcast) = 0.44 * 0.64 / 0.29 ≈ 0.97 (higher)

Similarly, you can calculate the probability of not playing:


Naive Bayes

Probability of not playing:

P(No | Overcast) = P(Overcast | No) P(No) / P (Overcast) .....................(2)

1. Calculate the prior and evidence probabilities: P(No) = 5/14 = 0.36, P(Overcast) = 4/14 = 0.29

2. Calculate the likelihood: P(Overcast | No) = 0/5 = 0

3. Put these values in equation (2)

4. P(No | Overcast) = 0 * 0.36 / 0.29 = 0

The probability of the 'Yes' class is higher.

So you can determine here that if the weather is overcast, then the players will play the sport.
Naive Bayes

Second Approach (In case of multiple features)


Now suppose you want to calculate the probability of playing when the weather is overcast, and the
temperature is mild.
Probability of playing: P(Play= Yes | Weather=Overcast, Temp=Mild) = P(Weather=Overcast, Temp=Mild |
Play= Yes)P(Play=Yes) ..........(1)

P(Weather=Overcast, Temp=Mild | Play= Yes)= P(Overcast |Yes) P(Mild |Yes) ………..(2)

1. Calculate the prior probability: P(Yes) = 9/14 = 0.64

2. Calculate the likelihoods: P(Overcast | Yes) = 4/9 = 0.44, P(Mild | Yes) = 4/9 = 0.44

3. Put the likelihoods in equation (2): P(Weather=Overcast, Temp=Mild | Play=Yes) = 0.44 * 0.44 = 0.1936 (higher)

4. Put the prior and likelihood in equation (1): P(Play=Yes | Weather=Overcast, Temp=Mild) = 0.1936 * 0.64 = 0.124

Similarly, you can calculate the probability of not playing:


Naive Bayes

Probability of not playing:


P(Play= No | Weather=Overcast, Temp=Mild) = P(Weather=Overcast, Temp=Mild | Play=
No)P(Play=No) ..........(3)

P(Weather=Overcast, Temp=Mild | Play= No)= P(Weather=Overcast |Play=No) P(Temp=Mild | Play=No)


………..(4)

1. Calculate the prior probability: P(No) = 5/14 = 0.36

2. Calculate the likelihoods: P(Weather=Overcast | Play=No) = 0/5 = 0, P(Temp=Mild | Play=No) = 2/5 = 0.4

3. Put the likelihoods in equation (4): P(Weather=Overcast, Temp=Mild | Play=No) = 0 * 0.4 = 0

4. Put the prior and likelihood in equation (3): P(Play=No | Weather=Overcast, Temp=Mild) = 0 * 0.36 = 0

The probability of the 'Yes' class is higher.

So you can say here that if the weather is overcast, then the players will play the sport.
Naive Bayes

• Robust to isolated noise points

• Handle missing values by ignoring the instance during


probability estimate calculations

• Robust to irrelevant attributes

• Redundant and correlated attributes will violate the class-conditional

independence assumption
– Use other techniques such as Bayesian Belief Networks
(BBN)
Bayesian Belief Networks

● Provides graphical representation of probabilistic relationships among a set of


random variables
● Consists of:
– A directed acyclic graph (DAG)
  · Each node corresponds to a variable
  · Each arc corresponds to a dependence relationship between a pair of variables
– A probability table associating each node with its immediate parents
Conditional Independence

(Figure: a DAG in which D is the parent of C, and C is the parent of A and B; so A is a child of C, B is a descendant of D, and D is an ancestor of A.)

● A node in a Bayesian network is conditionally independent

of all of its nondescendants, if its parents are known
Probability Tables

● If X does not have any parents, table contains prior probability P(X)

● If X has only one parent (Y), table contains conditional probability P(X|Y)

● If X has multiple parents (Y1, Y2,…, Yk), table contains conditional probability
P(X|Y1, Y2,…, Yk)
Example of Bayesian Belief Network

Priors:
Exercise = Yes: 0.7    Exercise = No: 0.3
Diet = Healthy: 0.25   Diet = Unhealthy: 0.75

(Structure: Exercise and Diet are parents of Heart Disease (HD); HD is the parent of Chest Pain (CP) and Blood Pressure (BP).)

P(HD | Exercise, Diet):
           E=Yes, D=Healthy   E=No, D=Healthy   E=Yes, D=Unhealthy   E=No, D=Unhealthy
HD=Yes     0.25               0.45              0.55                 0.75
HD=No      0.75               0.55              0.45                 0.25

P(Chest Pain | HD):              P(Blood Pressure | HD):
           HD=Yes   HD=No                    HD=Yes   HD=No
CP=Yes     0.8      0.01         BP=High     0.85     0.2
CP=No      0.2      0.99         BP=Low      0.15     0.8
Example of Inferencing using BBN

● Given: X = (E=No, D=Yes, CP=Yes, BP=High)


– Compute P(HD|E,D,CP,BP)?

● P(HD=Yes | E=No, D=Yes) = 0.55
  P(CP=Yes | HD=Yes) = 0.8
  P(BP=High | HD=Yes) = 0.85
– P(HD=Yes | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.55 × 0.8 × 0.85 = 0.374  →  Classify X as Yes

● P(HD=No | E=No, D=Yes) = 0.45
  P(CP=Yes | HD=No) = 0.01
  P(BP=High | HD=No) = 0.2
– P(HD=No | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.45 × 0.01 × 0.2 = 0.0009
Logistic Regression
Logistic Model

Logistic model (or logit model) is a statistical


model that models the probability of an event taking
place by having the log-odds for the event be a linear
combination of one or more independent variables.
An example curve

Example graph of a logistic regression curve fitted to data. The curve shows the probability of
passing an exam (binary dependent variable) versus hours studying (scalar independent variable).
Logistic Regression

● Logistic regression is a probabilistic discriminative model that directly estimates the odds of a data instance x using its attribute values.

● The basic idea is to use a linear predictor, z = wᵀx + b, to represent the odds of x as follows:

P(y = 1 | x) / P(y = 0 | x) = e^(wᵀx + b)

where w and b are the parameters of the model and aᵀ denotes the transpose of
a vector a. Note that if wᵀx + b > 0, then x belongs to class 1 since its odds are
greater than 1. Otherwise, x belongs to class 0.
Cont…

Since P(y = 0|x) + P(y = 1|x) = 1, we can re-write the odds as P(y = 1|x) / (1 − P(y = 1|x)) = e^z.

This can be further simplified to express P(y = 1|x) as a function of z:

P(y = 1|x) = 1 / (1 + e^(−z)) = σ(z),

where the function σ(·) is known as the logistic or sigmoid function.
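A minimal sketch of the sigmoid and the resulting class rule in Python (the weights shown are illustrative assumptions, not fitted values):

```python
import math

def sigmoid(z):
    # Logistic (sigmoid) function: maps the linear predictor to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    # Linear predictor z = w^T x + b, squashed by the sigmoid
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# If z > 0, the odds e^z exceed 1, so P(y=1|x) > 0.5 and x is assigned class 1
print(predict_proba(w=[2.0], b=-64.0, x=[33]))  # ≈ 0.88 with these example parameters
```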


Logistic Regression as a Generalized Linear Model

● Logistic regression belongs to a broader family of statistical


regression models, known as generalized linear models
(GLM).

● Even though logistic regression has relationships with


regression models, it is a classification model since the
computed posterior probabilities are eventually used to
determine the class label of a data instance.
Learning Model Parameters

● The parameters of logistic regression, (w, b), are estimated during training using a
statistical approach known as the maximum likelihood estimation (MLE) method.
How does Logistic regression work
Cont…
● Types of Logistic Regression

● On the basis of the categories, Logistic Regression can be classified into three
types:

1. Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.

2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible


unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”

3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered


types of dependent variables, such as “low”, “Medium”, or “High”.
Cont…
● Assumptions of Logistic Regression
● The assumptions include:
1. Independent observations: each observation is independent of the others, meaning there is no
correlation between any input variables.

2. Binary dependent variable: the dependent variable is assumed to be binary or dichotomous,
meaning it can take only two values. (For more than two categories, the softmax function is used.)

3. Linear relationship between the independent variables and the log odds: the relationship between the
independent variables and the log odds of the dependent variable should be linear.

4. No outliers: there should be no outliers in the dataset.

5. Large sample size: the sample size is sufficiently large.

Characteristics of LR
● Discriminative model for classification.
● The learned parameters of logistic regression can be analyzed to understand the
relationships between attributes and class labels.
● Can work more robustly even in high-dimensional settings.
● Can handle irrelevant attributes.
● Cannot handle data instances with missing values.
Ensemble Methods

● In the realm of machine learning we can categorize models into two fundamental
groups: Base Models and Ensemble Models.

● Base Models: These are individual machine learning models that are used
independently to make predictions.

○ For example, decision trees, logistic regression, K-nearest neighbors (KNN), Support Vector Machines
(SVM), linear regression, and so on.

○ Base models are often the building blocks of ensemble models.

● Ensemble models: the idea behind ensemble models is to combine many weak
learners (simple “building block” models) in order to obtain a strong learner (a single
and potentially very powerful model).
Ensemble Methods

● Ensemble techniques in machine learning function much like:


○ seeking advice from multiple sources before making a significant decision, such as purchasing a
car.

● Technically : Construct a set of base classifiers learned from the training data

● Predict class label of test records by combining the predictions made by multiple
classifiers (e.g., by taking majority vote)

Ensemble Methods

Suppose there are 25 base classifiers, each with error rate ε = 0.35.

Assume the errors made by the classifiers are uncorrelated.

Probability that the ensemble classifier makes a wrong prediction = probability that

more than half of the base classifiers are wrong:

e_ensemble = Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06

Necessary Conditions for Ensemble Methods:


Ensemble Methods work better than a single base classifier if:
1. All base classifiers are independent of each other
2. All base classifiers perform better than random guessing (error rate < 0.5
for binary classification)
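A quick numerical check of the ensemble error quoted above (a sketch assuming uncorrelated errors, as stated):

```python
from math import comb

eps, n = 0.35, 25
# The ensemble errs when at least 13 of the 25 base classifiers (more than half) are wrong
ensemble_error = sum(comb(n, i) * eps**i * (1 - eps) ** (n - i) for i in range(13, n + 1))
print(round(ensemble_error, 3))  # ≈ 0.06, far below the individual error rate of 0.35
```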
Rationale for Ensemble Learning

• Ensemble Methods work best with unstable base classifiers.


• Unstable: prone to overfitting – i.e., having high variance.

• Classifiers that are sensitive to minor perturbations (noise) in training set, due to high model
complexity.

• Examples: Unpruned decision trees, ANNs, KNN with small k…

Bias-Variance Decomposition

● Analogous to the problem of reaching a target y by firing projectiles from x
(a regression analogy).

● For classification, the generalization error of a model can be decomposed into a bias term, a variance term, and an irreducible noise term.

Bias-Variance Trade-off and Overfitting

(Figure: bias–variance trade-off curve, showing the underfitting and overfitting regimes)

● Ensemble methods try to reduce the variance of complex models (with low bias)
by aggregating responses of multiple base classifiers
General Approach of Ensemble Learning

Using majority vote or


weighted majority vote
(weighted according to their
accuracy or relevance)

Constructing Ensemble Classifiers

● By manipulating training set


○ Example: bagging, boosting, random forests

● By manipulating input features


○ Example: random forests

● By manipulating class labels


○ Example: error-correcting output coding

● By manipulating learning algorithm


○ Example: injecting randomness in the initial weights of ANN
Bagging (Bootstrap AGGregatING)

● The bootstrap method is used to create multiple subsets of the training data set; multiple
base models are then trained separately on these bootstrapped training sets to get
different predictions from the base models.

● The final prediction is obtained by averaging or voting on the predictions from


these base models.
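A minimal bagging sketch using scikit-learn (assumed to be available; the synthetic dataset and parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 25 trees, each trained on a bootstrap sample; predictions are combined by voting
# (the parameter is named base_estimator in older scikit-learn versions)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=25, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))
```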
Bagging (Bootstrap AGGregatING)

● Bootstrap sampling: sampling with replacement

Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7

● Build classifier on each bootstrap sample


● Probability of a training instance being selected in a bootstrap sample is:
  – 1 − (1 − 1/n)^n (n: number of training instances)
  – ≈ 0.632 when n is large

Bagging Algorithm

Bagging Example

● Consider 1-dimensional data set:


Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

● Classifier is a decision stump (decision tree of size 1)


○ Decision rule: x ≤ k versus x > k
○ Split point k is chosen based on entropy

(Figure: decision stump – if x ≤ k is true the instance goes to the left leaf y_left, otherwise to the right leaf y_right)
Bagging Example

Bagging Round 1:
x: 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y:  1   1   1   1  -1  -1  -1  -1   1   1
Rule: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 2:
x: 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y:  1   1   1  -1  -1  -1   1  1 1 1
Rule: x ≤ 0.7 → y = 1; x > 0.7 → y = 1

Bagging Round 3:
x: 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y:  1   1   1  -1  -1  -1  -1  -1   1   1
Rule: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 4:
x: 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y:  1   1   1  -1  -1  -1  -1  -1   1   1
Rule: x ≤ 0.3 → y = 1; x > 0.3 → y = -1

Bagging Round 5:
x: 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y:  1   1   1  -1  -1  -1  -1  1 1 1
Rule: x ≤ 0.35 → y = 1; x > 0.35 → y = -1
Bagging Example

Bagging Round 6:
x: 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1
y:  1  -1  -1  -1  -1  -1  -1   1   1  1
Rule: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 7:
x: 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1
y:  1  -1  -1  -1  -1   1   1   1   1  1
Rule: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 8:
x: 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1
y:  1   1  -1  -1  -1  -1  -1   1   1  1
Rule: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 9:
x: 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1
y:  1   1  -1  -1  -1  -1  -1   1  1 1
Rule: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 10:
x: 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y:  1   1   1   1   1   1   1   1   1   1
Rule: x ≤ 0.05 → y = 1; x > 0.05 → y = 1
Bagging Example

● Summary of Trained Decision Stumps:

Round Split Point Left Class Right Class


1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1

Bagging Example

● Use majority vote (sign of sum of predictions) to determine class of ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Predicted Sum 2 2 2 -6 -6 -6 -6 2 2 2
Class Sign 1 1 1 -1 -1 -1 -1 1 1 1

● Bagging can also increase the complexity (representation capacity) of simple


classifiers such as decision stumps

Boosting
● Boosting uses the entire training data set in a sequential manner to improve the performance of weak
learners.

● First, the whole training data set is used to fit a weak learner.

● Then, based on the prediction results, the weight of each observation in the training data set is adjusted:

● lower weights are given to observations that were classified correctly, while higher weights are given to those
that were classified incorrectly.

● A new weak learner is then fitted on this re-weighted data set.

● This sequential process of adjusting weights and making predictions continues until a stopping criterion is
reached.

● The final prediction is a weighted combination of all the weak learners (see the sketch below).
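The sketch below shows this sequential re-weighting scheme via scikit-learn's AdaBoost implementation (assumed available; the synthetic data and parameters are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Decision stumps (max_depth=1) are fitted one after another; misclassified instances
# receive higher weights, and the final prediction is a weighted vote of all stumps
# (the parameter is named base_estimator in older scikit-learn versions)
boost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=0)
boost.fit(X, y)
print(boost.score(X, y))
```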


Boosting
Examples of well-established
boosting models in machine learning
include AdaBoost, Gradient Boosting,
and XGBoost.
Boosting

● An iterative procedure to adaptively change distribution of training data by focusing more on previously
misclassified records
○ Initially, all N records are assigned equal weights (for being selected for training)
○ Unlike bagging, weights may change at the end of each boosting round

● Records that are wrongly classified will have their weights increased in the next round.

● Records that are classified correctly will have their weights decreased in the next round
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
AdaBoost

● Base classifiers: C1, C2, …, CT

● Error rate of a base classifier: the weighted fraction of training instances misclassified by Ci,
  εi = Σj wj I(Ci(xj) ≠ yj) / Σj wj

● Importance of a classifier:
  αi = ½ ln( (1 − εi) / εi )

● For example, if there is one incorrect output in a training set of five instances, the total error is 1/5, and the alpha (performance of the stump) is computed in the sketch below.
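A minimal sketch of the alpha calculation for the example mentioned above (one error out of five instances):

```python
import math

error = 1 / 5  # one misclassified instance out of five
alpha = 0.5 * math.log((1 - error) / error)
print(round(alpha, 3))  # ≈ 0.693
```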

AdaBoost Algorithm

● Weight update (Zi is a normalization factor so that the weights sum to 1):
  wj ← (wj / Zi) × exp(−αi) if instance j is classified correctly, and wj ← (wj / Zi) × exp(+αi) if it is misclassified.

• When the sample is correctly classified, the exponent (−α) is negative, so its weight decreases.

• When the sample is misclassified, the exponent (+α) is positive, so its weight increases.

● If any intermediate rounds produce error rate higher than 50%, the weights
are reverted back to 1/n and the resampling procedure is repeated

● Classification: C*(x) = argmax_y Σi αi I(Ci(x) = y)

AdaBoost Algorithm
AdaBoost Example

● Consider 1-dimensional data set:


Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

● Classifier is a decision stump


○ Decision rule: x ≤ k versus x > k
○ Split point k is chosen based on entropy

(Figure: decision stump – if x ≤ k is true the instance goes to the left leaf y_left, otherwise to the right leaf y_right)
AdaBoost Example

● Training sets for the first 3 boosting rounds:


Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1

● Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
AdaBoost Example

● Weights Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009

● Classification

Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted Sign 1 1 1 -1 -1 -1 -1 1 1 1
Class

Random Forest Algorithm

● Construct an ensemble of decision trees by manipulating the training set as well as the input features – i.e., combining multiple classifiers to solve a complex problem and to improve the performance of the model (a brief usage sketch follows this list).
○ Use a bootstrap sample to train every decision tree (similar to bagging)
○ Use the following tree-induction algorithm:
■ At every internal node of the decision tree, randomly sample p attributes for selecting the split
criterion
■ Repeat this procedure until all leaves are pure (unpruned tree)
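A minimal random forest sketch using scikit-learn (assumed available; synthetic data and parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each tree is grown on a bootstrap sample; at every split only a random subset of
# attributes (here sqrt(n_features), via max_features="sqrt") is considered
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
```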

Gradient Boosting

● Constructs a series of models
○ The models can be any predictive model that has a differentiable loss function (e.g., MSE)
○ Commonly, trees are the chosen model
■ XGBoost (extreme gradient boosting) is a popular package because of its impressive
performance
■ For the steepest descent method, the search direction is given by d_k = −∇f(x_k)

● Boosting can be viewed as optimizing the loss function by iterative functional

gradient descent.

● Implementations of various boosted algorithms are available in Python, R,


Matlab, and more.

Gradient Boosting Algorithm
● Step 1: Let’s assume X and Y are the input and target, having N samples. Our goal is to learn a function f(x) that maps the input features X to the target variable y; in gradient boosting, f is a sum of boosted trees.

● The loss function L(f) measures the difference between the actual and the predicted values.

● Step 2: We want to minimize the loss function L(f) with respect to f.


Gradient Boosting Algorithm
● Step 3: Steepest descent: the search direction is the negative gradient of the loss, d_k = −∇L(f_k).
● Step 4: Solution: each new tree is fitted to this negative gradient (for squared-error loss, the residuals y − f(x)), and the model is updated additively; the same step is repeated for M trees (see the sketch below).
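A minimal gradient-boosting sketch for regression with squared-error loss, where each tree is fitted to the residuals (scikit-learn and NumPy are assumed available; all names and parameter values are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    # Step 1: start from a constant prediction (the mean of y)
    f0 = float(np.mean(y))
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        # Steps 3-4: for squared error, the negative gradient is the residual y - f(x)
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        # Additive update: f_m(x) = f_{m-1}(x) + learning_rate * h_m(x)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Toy example
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(4 * X).ravel()
f0, trees = gradient_boost_fit(X, y)
print(gradient_boost_predict(X[:3], f0, trees))
```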
Numerical Problem on Logistic

Hours Studied | Pass (1) / Fail (0)
29 | 0
15 | 0
33 | 1
28 | 1
39 | 1

• Given the model below, how can we calculate the probability of passing for the student who studied 33 hours?

• At least how many hours should a student study to pass the course with probability > 95%?

• Assume the model suggested by the optimizer for the odds of passing the course is: log(odds) = −64 + 2 × hours
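A worked sketch of both questions under the stated model log(odds) = −64 + 2 × hours:

```python
import math

def p_pass(hours):
    # Convert log-odds to a probability with the sigmoid
    z = -64 + 2 * hours
    return 1 / (1 + math.exp(-z))

print(round(p_pass(33), 3))  # ≈ 0.881 for 33 hours of study

# For P(pass) > 0.95 we need log(odds) > ln(0.95 / 0.05) ≈ 2.944,
# i.e. -64 + 2 * hours > 2.944, so hours > (64 + 2.944) / 2 ≈ 33.47 (about 34 whole hours)
print(round((64 + math.log(0.95 / 0.05)) / 2, 2))
```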
Numerical Problem on Adaboost
CGPA | Interactiveness | Practical Knowledge | Comm Skills | Job Profile
>=9  | Yes | Good    | Good     | Yes
<9   | No  | Good    | Moderate | Yes
>=9  | No  | Average | Moderate | No
<9   | No  | Average | Good     | No
>=9  | Yes | Good    | Moderate | Yes
>=9  | Yes | Good    | Moderate | Yes

• Consider a training dataset of 6 data instances. Use 4 decision stumps, one for each of the 4 attributes. Apply the AdaBoost algorithm and classify the dataset with job offer as the target attribute.
Numerical Problem on Adaboost
● Step 1: Initial Weight assigned to each item is 1/6

● Step 2: Iterate for each weak classifier (due to 4 stumps – weak learners)

○ A. Train the decision stump HCGPA with a random bootstrap sample from
training dataset.
○ B. Compute the weighted error
○ C. Compute the weight of each weak classifier
○ D. calculate the normalization factor ZCGPA
○ E. Update the weights of all data instances

● Step 3: Compute the final predicted value for each data instance
References:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, 2nd edition, Pearson, 2019. ISBN-10: 9332571406, ISBN-13: 978-9332571402.
2. Tom M. Mitchell, Machine Learning, Indian Edition, McGraw Hill Education, 2013. ISBN-10: 1259096955.
3. Jiawei Han and Micheline Kamber, Data Mining – Concepts and Techniques, 2nd Edition, Morgan Kaufmann, 2006. ISBN: 1-55860-901-6.

Thank you all
