
UNIVERSITY OF ENGINEERING & MANAGEMENT, KOLKATA

Course Name : AI & ML

Basic Concept

The Ensemble methods


• Ensemble methods, also known as committee-based learning or learning with multiple classifier systems, train multiple hypotheses to solve the same problem. One of the most common examples of ensemble modeling is the random forest, where a number of decision trees are used to predict outcomes.
• An ensemble contains a number of hypotheses, or learners, which are usually generated from training data with the help of a base learning algorithm. Most ensemble methods use a single base learning algorithm to produce homogeneous base learners (homogeneous ensembles); some other methods use multiple learning algorithms and thus produce heterogeneous ensembles. Ensemble methods are well known for their ability to boost weak learners.


Why Use Ensemble Methods?


• Learning algorithms that output only a single hypothesis tend to suffer from three issues: the statistical problem, the computational problem and the representational problem, which can be partly overcome by applying ensemble methods.
• A learning algorithm that suffers from the statistical problem is said to have high variance. An algorithm that exhibits the computational problem is sometimes described as having computational variance, and a learning algorithm that suffers from the representational problem is said to have high bias. These three fundamental issues are the three important ways in which existing learning algorithms fail. Ensemble methods promise to reduce both the bias and the variance arising from these shortcomings of standard learning algorithms.

• Different Techniques
• Some of the commonly used ensemble techniques are discussed below.
Bagging
• Bagging, or Bootstrap Aggregation, is a powerful, effective and simple ensemble method. The method builds multiple versions of a training set using the bootstrap, i.e. sampling with replacement, and it can be used with any type of model for classification or regression. Bagging is only effective when using unstable, non-linear models (unstable meaning that a small change in the training set can cause a significant change in the model); a minimal sketch follows.
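As an illustration only, the snippet below is a minimal bagging sketch using scikit-learn's BaggingClassifier; the dataset and parameter values are placeholders chosen for the example, and in scikit-learn versions before 1.2 the estimator argument is named base_estimator.

```python
# Minimal bagging sketch (illustrative dataset and hyperparameters).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unstable, non-linear base model (a fully grown decision tree) + bootstrap sampling.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner ("base_estimator" in older versions)
    n_estimators=50,                     # number of bootstrap replicates
    bootstrap=True,                      # sample the training set with replacement
    random_state=0,
)
bag.fit(X_train, y_train)
print("Bagging test accuracy:", bag.score(X_test, y_test))
```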

Boosting
• Boosting is a meta-algorithm which can be viewed as a model averaging
method. It is the most widely used ensemble method and one of the
most powerful learning ideas. This method was originally designed for
classification but it can also be profitably extended to regression. The
original boosting algorithm combined three weak learners to generate a
strong learner.
Stacking
• Stacking is concerned with combining multiple classifiers generated by using different learning algorithms on a single dataset, which consists of pairs of feature vectors and their classifications. This technique consists of two phases: in the first phase, a set of base-level classifiers is generated, and in the second phase, a meta-level classifier is learned which combines the outputs of the base-level classifiers (a sketch follows).
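A minimal stacking sketch, assuming scikit-learn's StackingClassifier is available; the choice of base-level classifiers, the logistic-regression meta-learner and the dataset are illustrative only.

```python
# Minimal stacking sketch: heterogeneous base-level classifiers + a meta-level classifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Phase 1: base-level classifiers trained on the same dataset.
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Phase 2: a meta-level classifier learns to combine the base-level outputs.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X, y)
print("Training accuracy:", stack.score(X, y))
```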

Applications Of Ensemble Methods


• Ensemble methods can be used as overall diagnostic procedures for more conventional model building. The larger the difference in fit quality between one of the stronger ensemble methods and a conventional statistical model, the more information the conventional model is probably missing.
• Ensemble methods can be used to evaluate the relationships between explanatory variables and the response in conventional statistical models. Predictors or basis functions overlooked in a conventional model may surface with an ensemble approach.
• With the help of ensemble methods, the selection process can be better captured and the probability of membership in each treatment group estimated with less bias.
• One could use ensemble methods to implement the covariance adjustments inherent in multiple regression and related procedures: one would "residualize" the response and the predictors of interest with ensemble methods.

• How to generate an ensemble of classifiers?


– Bagging

– Boosting

Bagging


Bootstrapping is the method of randomly creating samples of data out of a population, with replacement, to estimate a population parameter.
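For instance, one bootstrap sample can be drawn with NumPy as sketched below; the data values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.array([2.3, 4.1, 3.8, 5.0, 2.9, 4.4])   # illustrative observations
# A bootstrap sample: same size as the data, drawn with replacement.
sample = rng.choice(population, size=population.size, replace=True)
print(sample, sample.mean())   # one bootstrap estimate of the population mean
```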


Steps to Perform Bagging


• Consider there are n observations and m features in the training set. Select a random sample from the training dataset with replacement (a bootstrap sample).
• A subset of the m features is chosen randomly to create a model using the sampled observations.
• The feature offering the best split out of the lot is used to split the nodes.
• The tree is grown so that you have the best root nodes.
• The above steps are repeated n times, and the outputs of the individual decision trees are aggregated to give the best prediction (see the sketch after this list).
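A rough sketch of these steps, under the assumption that the inputs are NumPy arrays with non-negative integer class labels, that the base models are scikit-learn decision trees, and that the per-split random feature subset is handled by the tree's max_features option; this is illustrative, not a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_rounds=10, seed=0):
    """Bootstrap-aggregated decision trees with majority voting (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_rounds):
        # Step 1: bootstrap sample of the observations (with replacement).
        idx = rng.integers(0, n, size=n)
        # Steps 2-4: grow a tree; a random subset of features is considered
        # at each split via max_features (assumption made for this sketch).
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    # Step 5: aggregate the individual predictions by majority vote.
    votes = np.array(votes)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```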


Advantages of Bagging in Machine Learning


• Bagging minimizes the overfitting of data
• It improves the model’s accuracy
• It deals with higher dimensional data
efficiently


• Bagging
• Sampling with replacement from the training data:

Data ID:            1   2   3   4   5   6   7   8   9   10
Bagging (Round 1):  7   8   10  8   2   5   10  10  5   9
Bagging (Round 2):  1   4   9   1   2   3   2   7   3   2
Bagging (Round 3):  1   8   5   10  5   5   9   6   3   7

• Build a classifier on each bootstrap sample
• Each record has probability (1 − 1/n)^n of not being selected in a given bootstrap sample, i.e. of ending up as test data
• On average, the training data therefore contain about 1 − (1 − 1/n)^n of the original records


• This method is also called the 0.632 bootstrap

– In a single draw, a particular training record has a probability of 1 − 1/n of not being picked
– Thus its probability of ending up in the test data (never selected in n draws) is:

  \left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368

– This means the training data will contain approximately 63.2% of the (distinct) instances
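A quick numerical check of this limit (the value of n is arbitrary):

```python
n = 1000
p_out = (1 - 1/n) ** n          # probability a given record is never drawn
print(round(p_out, 4))          # ~0.3677, close to e**-1 ~ 0.368
print(round(1 - p_out, 4))      # ~0.6323: fraction of distinct records in the sample
```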


Bagging Example: Assume that the training data are points x on [0, 1] with labels (from the original number-line figure):

  +1 for x ≤ 0.3,   −1 for 0.4 ≤ x ≤ 0.7,   +1 for x ≥ 0.8

Goal: find a collection of 10 simple thresholding classifiers that collectively can classify the data correctly.
– Each simple (or weak) classifier is of the form:
  x ≤ K ⇒ class = +1 or −1, depending on which value yields the lowest error, where K is determined by entropy minimization.
Bagging (applied to the training data): accuracy of the ensemble classifier = 100%


Bagging Summary
• Works well if the base classifiers are unstable
(complement each other)
• Increased accuracy because it reduces the
variance of the individual classifier
• Does not focus on any particular instance of the
training data
– Therefore, less susceptible to model over-fitting
when applied to noisy data
• What if we want to focus on particular instances of the training data?
Definition

• WHAT IS RANDOM FOREST?


• Random forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models improves the overall result.
• Put simply: random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.


• Random forest adds additional randomness to the model while growing the trees. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features. This creates wide diversity, which generally results in a better model.
• Therefore, in random forest, only a random subset of the features is taken into consideration by the algorithm for splitting a node. You can even make trees more random by additionally using random thresholds for each feature rather than searching for the best possible thresholds (as a normal decision tree does).

DIFFERENCE BETWEEN DECISION TREES AND RANDOM FORESTS
• While random forest is a collection of decision trees, there are some differences.
• If you input a training dataset with features and labels into a decision tree, it will
formulate some set of rules, which will be used to make the predictions.
• For example, to predict whether a person will click on an online advertisement, you might
collect the ads the person clicked on in the past and some features that describe his/her
decision. If you put the features and labels into a decision tree, it will generate some rules
that help predict whether the advertisement will be clicked or not. In comparison, the
random forest algorithm randomly selects observations and features to build several
decision trees and then averages the results.
• Another difference is "deep" decision trees might suffer from overfitting. Most of the time,
random forest prevents this by creating random subsets of the features and building
smaller trees using those subsets. Afterwards, it combines the subtrees. It's important to
note this doesn’t work every time and it also makes the computation slower, depending on
how many trees the random forest builds.

• ADVANTAGES AND DISADVANTAGES OF THE RANDOM FOREST ALGORITHM
• One of the biggest advantages of random forest is its versatility. It can be used for both regression and
classification tasks, and it’s also easy to view the relative importance it assigns to the input features.
• Random forest is also a very handy algorithm because the default hyperparameters it uses often produce a good prediction result. Understanding the hyperparameters is pretty straightforward, and there are also not that many of them.
• One of the biggest problems in machine learning is overfitting, but most of the time this won’t happen
thanks to the random forest classifier. If there are enough trees in the forest, the classifier won’t overfit
the model.
• The main limitation of random forest is that a large number of trees can make the algorithm too slow
and ineffective for real-time predictions. In general, these algorithms are fast to train, but quite slow to
create predictions once they are trained. A more accurate prediction requires more trees, which
results in a slower model. In most real-world applications, the random forest algorithm is fast enough
but there can certainly be situations where run-time performance is important and other approaches
would be preferred.
• And, of course, random forest is a predictive modeling tool and not a descriptive tool, meaning if
you're looking for a description of the relationships in your data, other approaches would be better.


• How does the Random Forest algorithm work?

• Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions using each tree created in the first phase.
The working process can be explained in the steps below (a minimal sketch follows the list):
• Step-1: Select random K data points from the training set.
• Step-2: Build the decision trees associated with the selected data points (subsets).
• Step-3: Choose the number N of decision trees that you want to build.
• Step-4: Repeat Steps 1 & 2.
• Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
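A minimal sketch of these steps with scikit-learn's RandomForestClassifier; the dataset and hyperparameter values are illustrative (n_estimators plays the role of N, and max_features controls the random feature subset mentioned earlier).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # Step 3: the number N of decision trees
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # Step 1: random samples of the data points
    random_state=0,
)
forest.fit(X_train, y_train)                  # Steps 1-4: build the trees
print(forest.predict(X_test[:5]))             # Step 5: majority vote across trees
print("Test accuracy:", forest.score(X_test, y_test))
```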

• Applications of Random Forest


• There are mainly four sectors where random forest is mostly used:
• Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
• Medicine: With the help of this algorithm, disease trends and risks of
the disease can be identified.
• Land Use: We can identify the areas of similar land use by this
algorithm.
• Marketing: Marketing trends can be identified using this algorithm.


• Lazy learners
• Lazy learners simply store the training data and wait until test data appear. When they do, classification is conducted based on the most related data in the stored training data. Compared to eager learners, lazy learners spend less time training but more time predicting.
• Ex. k-nearest neighbor, case-based reasoning
• Eager learners
• Eager learners construct a classification model based on the given training data before receiving data for classification. They must be able to commit to a single hypothesis that covers the entire instance space. Due to the model construction, eager learners take a long time to train but less time to predict.
• Ex. Decision Tree, Naive Bayes, Artificial Neural Networks

• k-Nearest-Neighbor Method
– first described in the early 1950s
– It has since been widely used in the area of pattern recognition.
– The training instances are described by n attributes.
– Each instance represents a point in an n-dimensional space.
– A k-nearest-neighbor classifier searches the pattern space for the k training instances that are closest to the unknown instance.


• The nearest neighbor can be defined in terms of Euclidean distance, dist(X1, X2).
• The Euclidean distance between two points or instances, say X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is:

  dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}

– Nominal attributes: distance is either 0 or 1
– Refer to cluster analysis for more distance metrics
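A small sketch of this distance computation, assuming the attribute values are numeric and stored as NumPy arrays:

```python
import numpy as np

def euclidean_distance(x1, x2):
    """dist(X1, X2) = sqrt(sum_i (x1i - x2i)^2) for numeric attribute vectors."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.sqrt(np.sum((x1 - x2) ** 2))

print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))   # 5.0
```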
KNN Example (worked-example figure omitted)

• For k-nearest-neighbor classification, the unknown


instance is assigned the most common class among its
k nearest neighbors.
• When k = 1, the unknown instance is assigned the
class of the training instance that is closest to it in
pattern space.
• Nearest-neighbor classifiers can also be used for
prediction, that is, to return a real-valued prediction
for a given unknown instance.
– In this case, the classifier returns the average value of the
real-valued labels associated with the k nearest neighbors of
the unknown instance.

• Distances for categorical attributes:


– A simple method is to compare the corresponding
value of the attribute in instance X1 with that in
instance X2.
– If the two are identical (e.g., instances X1 and X2 both
have the color blue), then the difference between the
two is taken as 0, otherwise 1.
– Other methods may incorporate more sophisticated schemes for differential grading (e.g., where a larger difference score is assigned, say, for blue and white than for blue and black).

• Handling missing values:


– In general, if the value of a given attribute A is missing
in instance X1 and/or in instance X2, we assume the
maximum possible difference.
– For categorical attributes, we take the difference value
to be 1 if either one or both of the corresponding values
of A are missing.
– If A is numeric and missing from both instances X1
and X2, then the difference is also taken to be 1.
◆ If only one value is missing and the other (which we’ll call v’) is
present and normalized, then we can take the difference to be
either |1 - v’| or |0 – v’| , whichever is greater.

• Typically, we normalize the values of each attribute in advance.
• This helps prevent attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes).
• Min-max normalization:

  v' = \frac{v - \min_A}{\max_A - \min_A}

– all attribute values then lie between 0 and 1
– For more information on normalization methods, refer to the data preprocessing section
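A minimal min-max normalization sketch; the income values are made up for illustration (note that a constant attribute, where max equals min, would need special handling).

```python
import numpy as np

def min_max_normalize(values):
    """Rescale a numeric attribute so that all values lie in [0, 1]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

income = np.array([30_000, 45_000, 60_000, 120_000])   # illustrative attribute
print(min_max_normalize(income))   # [0.  0.167  0.333  1.] (approximately)
```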

• Determining a good value for k:


– k can be determined experimentally.
– Starting with k = 1, we use a test set to estimate the
error rate of the classifier.
– This process can be repeated each time by
incrementing k to allow for one more neighbor.
– The k value that gives the minimum error rate may be selected.
– In general, the larger the number of training instances, the larger the value of k will be (a selection sketch follows).
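One way to sketch this search, assuming scikit-learn's KNeighborsClassifier and a held-out test set for estimating the error rate (the dataset and the range of k are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

best_k, best_err = None, 1.0
for k in range(1, 16):                          # try k = 1, 2, ..., 15
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    err = 1.0 - knn.score(X_test, y_test)       # estimated error rate
    if err < best_err:
        best_k, best_err = k, err
print("Selected k:", best_k, "with error rate", round(best_err, 3))
```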

• Boosting
• An iterative procedure to adaptively change
distribution of training data by focusing
more on previously misclassified records
– Initially, all N records are assigned equal
weights
– Unlike bagging, weights may change at the end
of a boosting round


Steps for Boosting


• Initialise the dataset and assign equal weight to each data point.
• Provide this as input to the model and identify the wrongly classified data points.
• Increase the weights of the wrongly classified data points.
• if (got the required results)
      go to step 5
  else
      go to step 2
• End (a minimal AdaBoost sketch follows these steps)
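These steps are the idea behind AdaBoost; the snippet below is a minimal scikit-learn sketch with an illustrative dataset and hyperparameters (in scikit-learn versions before 1.2 the estimator argument is named base_estimator).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learner: a decision stump; each round re-weights the misclassified points.
boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=0,
)
boost.fit(X_train, y_train)
print("Boosting test accuracy:", boost.score(X_test, y_test))
```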


• Explanation:
The AdaBoost algorithm is commonly illustrated with four panels, B1–B4 (figure omitted here). Let's try to understand it in a stepwise process:
• B1 consists of 10 data points which consist of two types namely plus(+) and minus(-) and 5 of
which are plus(+) and the other 5 are minus(-) and each one has been assigned equal weight
initially. The first model tries to classify the data points and generates a vertical separator line
but it wrongly classifies 3 plus(+) as minus(-).
• B2 consists of the 10 data points from the previous model in which the 3 wrongly classified
plus(+) are weighted more so that the current model tries more to classify these pluses(+)
correctly. This model generates a vertical separator line that correctly classifies the previously
wrongly classified pluses(+) but in this attempt, it wrongly classifies three minuses(-).
• B3 consists of the 10 data points from the previous model in which the 3 wrongly classified
minus(-) are weighted more so that the current model tries more to classify these minuses(-)
correctly. This model generates a horizontal separator line that correctly classifies the
previously wrongly classified minuses(-).
• B4 combines together B1, B2, and B3 in order to build a strong prediction model which is much
better than any individual model used.


• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

Data ID:             1   2   3   4   5   6   7   8   9   10
Boosting (Round 1):  7   3   2   8   7   9   4   10  6   3
Boosting (Round 2):  5   4   9   4   2   5   1   7   4   2
Boosting (Round 3):  4   4   8   10  4   5   4   6   3   4

• Example 4 is hard to classify


• Its weight is increased, therefore it is more likely to
be chosen again in subsequent rounds

Boosting
• Equal weights (1/N) are assigned to each training instance in the first round
• After a classifier Ci is learned, the weights are
adjusted to allow the subsequent classifier
Ci+1 to “pay more attention” to data that were
misclassified by Ci.
• Final boosted classifier C* combines the votes of
each individual classifier
– Weight of each classifier’s vote is a function of its
accuracy
• Adaboost – popular boosting algorithm


Adaboost (Adaptive Boost)


• Input:
– Training set D containing N instances
– T rounds
– A classification learning scheme
• Output:
– A composite model


• Adaboost: Training Phase


• Training data D contain N labeled data (X1,y1), (X2,y2 ),
(X3,y3),….(XN,yN)
• Initially, assign an equal weight of 1/N to each data point
• To generate T base classifiers, we need T rounds or iterations
• In round i, data from D are sampled with replacement to form Di (of size N)
• Each data point's chance of being selected in the next rounds depends on its weight
– Each time, the new sample is generated directly from the training data D with different sampling probabilities according to the weights; these weights are not zero


• Base classifier Ci is derived from the training data Di
• Error of Ci is tested using Di
• Weights of training data are adjusted depending
on how they were classified
– Correctly classified: Decrease weight
– Incorrectly classified: Increase weight
• Weight of a data indicates how hard it is to
classify it (directly proportional)


Adaboost: Testing Phase


• The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be
• Weight of classifier Ci's vote:

  \alpha_i = \frac{1}{2}\ln\frac{1 - \epsilon_i}{\epsilon_i}

  where \epsilon_i is the error rate of Ci

• Testing:
– For each class c, sum the weights of each classifier that assigned
class c to X (unseen data)
– The class with the highest sum is the WINNER!

Example: Error and Classifier Weight in AdaBoost

• Base classifiers: C1, C2, …, CT

• Error rate (i = index of the classifier, j = index of the instance):

  \epsilon_i = \frac{1}{N}\sum_{j=1}^{N} w_j\,\delta\big(C_i(x_j) \neq y_j\big)

• Importance of a classifier:

  \alpha_i = \frac{1}{2}\ln\frac{1 - \epsilon_i}{\epsilon_i}
Example: Data Instance Weight in AdaBoost

• Assume: N training data in D, T rounds, (xj, yj) are the training data, and Ci, αi are the classifier and weight of the i-th round, respectively.
• Weight update on all training data in D:

  w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} e^{-\alpha_i} & \text{if } C_i(x_j) = y_j \\ e^{\alpha_i} & \text{if } C_i(x_j) \neq y_j \end{cases}

  where Z_i is a normalization factor chosen so that the updated weights sum to 1.
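A small numerical sketch of the error, importance and weight-update formulas above, assuming labels in {−1, +1}; all values are made up for illustration.

```python
import numpy as np

y_true = np.array([+1, +1, -1, -1, +1])      # true labels (illustrative)
y_pred = np.array([+1, -1, -1, -1, +1])      # predictions of the round-i classifier
w = np.full(len(y_true), 1 / len(y_true))    # current instance weights (sum to 1)

miss = y_pred != y_true
eps = np.sum(w[miss])                        # weighted error rate of C_i
alpha = 0.5 * np.log((1 - eps) / eps)        # importance of the classifier

# Increase weights of misclassified data, decrease weights of correct data.
w = w * np.exp(np.where(miss, alpha, -alpha))
w = w / w.sum()                              # normalize (the Z_i factor)
print(round(eps, 3), round(alpha, 3), np.round(w, 3))
```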
Illustrating AdaBoost: initial weights for each data point (data points for training); figure omitted.

Thank You
