
Vidyavardhini’s College of Engineering and Technology

Department of Information Technology

Module 5
ADVANCED ML CLASSIFICATION
TECHNIQUES
- Mrs. Anagha J. Patil, VCET, Vasai
Syllabus:
• Introduction to Ensemble Methods
• Bagging
• Boosting
• Random forests
• Improving classification accuracy of Class Imbalanced Data.
• Metrics for Evaluating Classifier Performance
• Holdout Method and Random Subsampling
• Cross-Validation
• Bootstrap
• Model Selection Using Statistical Tests of Significance
• Comparing Classifiers Based on Cost–Benefit and ROC Curves.



Classification
• Classification is a form of data analysis that extracts models
describing data classes. It is a Supervised learning technique that is
used to identify the category of new observations on the basis of
training data.
• A classifier, or classification model, predicts categorical labels (classes), whereas a numeric prediction model predicts continuous-valued functions. Classification and numeric prediction are the two major types of prediction problems.
• Various classification techniques:
• Decision tree induction
• Naive Bayesian classification
• Rule-based classification
Model Evaluation and Selection
Metrics for Evaluating Classifier Performance:
• Accuracy
• Error rate
• Precision
• Sensitivity (Recall)
• Specificity
• F1 score
• AUC-ROC
• Log Loss
Confusion Matrix

True Positives (TP) − These are the positive tuples that were correctly labeled by the classifier, i.e., both the actual class and the predicted class of the data point are 1.
True Negatives (TN) − These are the negative tuples that were correctly labeled by the classifier, i.e., both the actual class and the predicted class of the data point are 0.
False Positives (FP) − These are the negative tuples that were incorrectly labeled as positive, i.e., the actual class of the data point is 0 but the predicted class is 1.
False Negatives (FN) − These are the positive tuples that were mislabeled as negative, i.e., the actual class of the data point is 1 but the predicted class is 0.
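As an illustrative sketch (not part of the original slides), the four confusion-matrix counts can be read off with scikit-learn; the labels below are hypothetical:

```python
# Minimal sketch with hypothetical labels: reading TN, FP, FN, TP
# from a confusion matrix using scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classes predicted by some model

# With labels=[0, 1] the matrix is laid out as:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```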
Accuracy
• The accuracy of a classifier on a given test set is the percentage of test
set tuples that are correctly classified by the classifier. In other words,
Accuracy is the ratio of the number of correct predictions and the
total number of predictions. The formula for Accuracy is:
• Accuracy=(TP+TN)/(P+N)
• Accuracy is useful when the target classes are well balanced, but it is not a good choice for unbalanced classes. Other measures, such as sensitivity (recall), specificity, precision, F, and Fβ, are better suited to the class imbalance problem, where the main class of interest is rare.
Error/Misclassification rate
• Error rate for a classifier, M, is simply 1 − accuracy(M), where
accuracy(M) is the accuracy of the model M. We can also write this
as:
• Error rate=(FP+FN)/(P+N)
Recall/Sensitivity
• Recall is a measure of completeness, i.e., what percentage of positive tuples are labeled as such.
• Sensitivity refers to the true positive (recognition) rate, i.e., the proportion of positive tuples that are correctly identified. Recall and sensitivity are the same measure.
• Recall or Sensitivity = TP/(TP+FN) = TP/P
Specificity
• Specificity is the true negative rate which is the proportion of
negative tuples that are correctly identified.
• Specificity=TN/N
Precision
• Precision is a measure of exactness i.e., what percentage of tuples
labeled as positive are actually such.
• Precision=TP/(TP+FP)
F-Score
• Precision and recall are combined into a single measure, the F measure (also known as the F1 score or F-score). The F1 score is the harmonic mean of precision and recall; it is maximized when precision equals recall.
• F = (2 × Precision × Recall) / (Precision + Recall)
• Fβ = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)
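A minimal sketch (reusing the hypothetical labels from above) of how these metrics are computed with scikit-learn; the F2 example simply weights recall more heavily:

```python
# Minimal sketch: accuracy, precision, recall, F1 and F-beta
# computed with scikit-learn on hypothetical labels.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, fbeta_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("F2       :", fbeta_score(y_true, y_pred, beta=2))  # beta > 1 favours recall
```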
AUC-ROC
• The Area Under the Curve (AUC) is the measure of the ability of a
classifier to distinguish between classes. And, The Receiver Operator
Characteristic (ROC) is a probability curve that plots the TPR(True
Positive Rate) against the FPR(False Positive Rate) at various threshold
values and separates the ‘signal’ from the ‘noise’.
• In a ROC curve, the X-axis shows the False Positive Rate (1 − specificity) and the Y-axis shows the True Positive Rate (sensitivity/recall). A higher X value means more false positives relative to true negatives, while a higher Y value means more true positives relative to false negatives. The choice of threshold therefore depends on how we want to balance FP against FN.
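A minimal sketch (hypothetical scores assumed) of computing the ROC curve points and the AUC with scikit-learn:

```python
# Minimal sketch: ROC curve (FPR/TPR at each threshold) and AUC
# from hypothetical predicted probabilities.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]  # P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("Thresholds:", thresholds)
print("FPR       :", fpr)
print("TPR       :", tpr)
print("AUC       :", roc_auc_score(y_true, y_score))
```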
Log Loss
• It is also called logistic regression loss or cross-entropy loss. It is defined on probability estimates and measures the performance of a classification model whose output is a probability value between 0 and 1. It can be understood more clearly by contrasting it with accuracy.
• Accuracy simply counts the predictions for which the predicted value equals the actual value, whereas log loss measures the uncertainty of our predictions based on how much they deviate from the actual labels. The log loss value therefore gives a more fine-grained view of the performance of our model.
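For binary labels y_i and predicted probabilities p_i, log loss is −(1/N) Σ [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]. A minimal sketch with hypothetical probabilities:

```python
# Minimal sketch: log loss (cross-entropy) from hypothetical
# predicted probabilities using scikit-learn.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.8, 0.3]   # P(class = 1) from some model

print("Log loss:", log_loss(y_true, y_prob))
```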
Methods for evaluating classifiers
Holdout Method
• This is the simplest method to evaluate the
classifier. In the holdout method, the given
data set is randomly partitioned into two
independent sets, a training set and a test
set.
• Typically, we allocate two-thirds of the data
to the training set, and the remaining one-
third is allocated to the test set.
• The training set is used to train the model.
• The model is then validated on the test set to obtain its accuracy, error rate, and an estimate of the model's error.
• Not well suited for sparse datasets.
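A minimal sketch of the holdout method, assuming scikit-learn and using its built-in iris data as a stand-in dataset:

```python
# Minimal sketch: holdout evaluation with a 2/3 : 1/3 split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)      # 1/3 held out for testing

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```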
Random subsampling
• A variation of the holdout method in which the holdout method is repeated k times.
• In this method, we form k replicas of the given data. For each iteration (each replica), a fixed number of observations is chosen and kept aside as the test set.
• The model is fitted to the training set in each iteration, and an estimate of the prediction error is obtained from each test set.
• The overall accuracy estimate is taken as the average of the accuracies obtained across the iterations.
• It is a better approach than the holdout method for sparse datasets, but the same record may be selected for the test set in more than one iteration.
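A minimal sketch of random subsampling (repeated holdout), assuming scikit-learn's ShuffleSplit as one way to implement it:

```python
# Minimal sketch: the holdout split repeated k = 5 times and the
# resulting accuracies averaged.
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
splitter = ShuffleSplit(n_splits=5, test_size=1/3, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=splitter)
print("Per-iteration accuracy:", scores)
print("Overall estimate      :", scores.mean())
```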
Cross-Validation
• In k-fold cross-validation, the basic idea is to split the data into k chunks (folds). Train with k − 1 chunks of data objects and test on the remaining chunk. Next time, include the chunk that was excluded in the last iteration in the training data and test on a different chunk. Repeat. This takes k iterations, i.e., training and testing are performed k times.
• Unlike the holdout and random subsampling methods, here each sample is used the same number of times for training and exactly once for testing.
• For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
• This method is well suited to large datasets.
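A minimal sketch of 10-fold cross-validation with scikit-learn (iris used as a stand-in dataset):

```python
# Minimal sketch: k-fold cross-validation with k = 10.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print("Per-fold accuracy:", scores)
print("Overall estimate :", scores.mean())
```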
Cross-Validation
• Random subsampling and k-fold cross-validation can both fall victim to skewed target values, which can be fixed by stratification.
• Stratification makes sure that each of your folds (chunks) has a target distribution similar to that of the full data.
• Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples; that is, only one sample is "left out" at a time for the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data.
• These variants are useful for small or unbalanced datasets and skewed target values. In general, stratified 10-fold cross-validation is recommended for estimating accuracy (even if computation power allows using more folds) due to its relatively low bias and variance.
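A minimal sketch of the two variants described above, assuming scikit-learn's StratifiedKFold and LeaveOneOut:

```python
# Minimal sketch: stratified 10-fold cross-validation and leave-one-out.
from sklearn.datasets import load_iris
from sklearn.model_selection import (StratifiedKFold, LeaveOneOut,
                                     cross_val_score)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("Stratified 10-fold:", cross_val_score(clf, X, y, cv=skf).mean())

loo = LeaveOneOut()   # k = number of tuples; one sample left out per round
print("Leave-one-out     :", cross_val_score(clf, X, y, cv=loo).mean())
```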
Bootstrap
• The bootstrap is a resampling technique that estimates statistics of a population by repeatedly sampling a dataset with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
• For instance, imagine a machine that randomly selects tuples for the training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.
• Steps involved in the bootstrapping method:
1. Randomly choose a sample size.
2. Pick an observation from the training dataset at random.
3. Add this observation to the sample chosen so far, and repeat until the sample size is reached.
• The samples that are chosen form the 'bootstrap sample' and serve as the training set, while the samples that are never chosen are referred to as the 'out-of-bag' samples and serve as the test set.
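A minimal NumPy sketch of sampling with replacement; the dataset size d below is hypothetical:

```python
# Minimal sketch: draw d tuples with replacement to form the bootstrap
# (training) sample; tuples never drawn form the out-of-bag test set.
import numpy as np

rng = np.random.default_rng(0)
d = 20                                     # hypothetical dataset size
indices = np.arange(d)                     # stand-in for d tuples

boot_idx = rng.integers(0, d, size=d)      # sample d indices with replacement
oob_idx = np.setdiff1d(indices, boot_idx)  # indices never selected

print("Bootstrap sample :", np.sort(boot_idx))
print("Out-of-bag sample:", oob_idx)
```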
Bootstrap
• The bootstrapping method can be either parametric or non-parametric.
• In the parametric bootstrap method, the form of the distribution must be known, i.e., an assumption about the kind of distribution the sample follows must be provided beforehand.
• Unlike the parametric bootstrap, the non-parametric bootstrap does not require the distribution to be known beforehand.
• Therefore, this type of bootstrap works without assuming the nature of the sample distribution.
Model Selection Using Statistical Tests of
Significance
• Statistical significance tests quantify the likelihood of the samples of
skill scores being observed given the assumption that they were
drawn from the same distribution. If this assumption, or null
hypothesis, is rejected, it suggests that the difference in skill scores is
statistically significant.
• Generally, a statistical hypothesis test for comparing samples
calculates how likely it is to observe two data samples given the
assumption that the samples have the same distribution.
• The assumption of a statistical test is called the null hypothesis, and we can calculate statistical measures and interpret them in order to decide whether to accept or reject it.
Model Selection Using Statistical Tests of
Significance
• Suppose that for each model, we did 10-fold cross-validation, say, 10 times, each time using a different 10-fold data partitioning. We can average the 10 error rates obtained for M1 and for M2, respectively, to obtain the mean error rate for each model. For a given model, the individual error rates calculated in the cross-validations may be considered as different, independent samples from a probability distribution. In general, they follow a t-distribution with k − 1 degrees of freedom where, here, k = 10.
• This allows us to do hypothesis testing where the significance test used is the t-
test, or Student’s t-test. Our hypothesis is that the two models are the same, or in
other words, that the difference in mean error rate between the two is zero. If we
can reject this null hypothesis, then we can conclude that the difference between
the two models is statistically significant, and we can select the model with the
lower error rate.
• Comparing machine learning models via statistical significance tests imposes
some expectations that in turn will impact the types of statistical tests that can be
used.
Model Selection Using Statistical Tests of
Significance
• To determine whether M1 and M2 are significantly different, we compute t
and select a significance level, sig(α). In practice, a significance level (α) of
5% or 1% is typically used. We then consult a table for the t-distribution.
This distribution is available in standard textbooks on statistics.
• However, because the t-distribution is symmetric, typically only the upper
percentage points of the distribution are shown. Therefore, we look up the
table value for z = α/2, which in this case is 0.025, where z is also referred
to as a confidence limit.
• If t > z or t < −z, then our value of t lies in the rejection region, within the
distribution’s tails. This means that we can reject the null hypothesis that
the means of M1 and M2 are the same and conclude that there is a
statistically significant difference between the two models. Otherwise, if
we cannot reject the null hypothesis, we conclude that any difference
between M1 and M2 can be attributed to chance.
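A minimal sketch of this comparison, assuming per-fold error rates for M1 and M2 are already available (the numbers below are hypothetical) and using a paired t-test from SciPy:

```python
# Minimal sketch: paired t-test on per-fold error rates of two models
# obtained from the same 10-fold partitioning (hypothetical values).
from scipy.stats import ttest_rel

err_m1 = [0.12, 0.10, 0.15, 0.11, 0.13, 0.12, 0.14, 0.10, 0.12, 0.11]
err_m2 = [0.16, 0.14, 0.18, 0.15, 0.17, 0.15, 0.16, 0.14, 0.17, 0.15]

t_stat, p_value = ttest_rel(err_m1, err_m2)
print("t =", t_stat, " p =", p_value)
if p_value < 0.05:                # significance level alpha = 5%
    print("Reject H0: the difference in error rates is statistically significant.")
else:
    print("Cannot reject H0: the difference may be attributed to chance.")
```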
Comparing Classifiers Based on Cost–Benefit
and ROC Curves
• A ROC curve can be used to select a threshold for a classifier that maximizes the true positives while minimizing the false positives. However, different types of problems have different optimal classifier thresholds. For a cancer screening test, for example, we may be prepared to put up with a relatively high false positive rate in order to get a high true positive rate, because it is most important to identify possible cancer sufferers. For a follow-up test after treatment, however, a different threshold might be more desirable: we want to minimize false negatives, because we don't want to tell a patient they're clear if that is not actually the case.
• The AUC can be used to compare the performance of two or more classifiers. A single threshold can be selected and the classifiers' performance at that point compared, or the overall performance can be compared by considering the AUC. Most published reports compare AUCs in absolute terms: "Classifier 1 has an AUC of 0.85 and classifier 2 has an AUC of 0.79, so classifier 1 is clearly better." It is, however, possible to test whether differences in AUC are statistically significant.
Techniques to Improve Classification Accuracy
• In ensemble learning theory, weak learners (or base models) are models that can be used as building blocks for designing more complex models by combining several of them. The idea of ensemble methods is to reduce the bias and/or variance of such weak learners by combining several of them in order to create a strong learner (or ensemble model) that achieves better performance.
• To outline the definition and practicality of ensemble methods, the decision tree classifier is used as the example here. However, it is important to note that ensemble methods are not restricted to decision trees.
A decision tree to determine whether to play
outside or not
Problems with a single classifier
• When building decision trees, there are several factors we must take into consideration: On what features do we base our decisions? What is the threshold for classifying each question into a yes or no answer? In the first decision tree, what if we wanted to ask ourselves whether we had friends to play with or not? If we have friends, we will play every time; if not, we might continue to ask ourselves questions about the weather. By adding an additional question, we hope to better separate the Yes and No classes.
• This is where ensemble methods come into the picture! Rather than relying on one decision tree and hoping we made the right decision at each split, ensemble methods allow us to take a sample of decision trees into account, calculate which features to use or questions to ask at each split, and make a final prediction based on the aggregated results of the sampled decision trees.
Ensemble Methods
• Ensemble learning is a machine learning technique that combines several base models in order to produce one optimal predictive model, which helps to improve machine learning results.
• This approach yields better predictive performance than a single model.
• The basic idea is to learn a set of classifiers and allow them to vote.
• Ensembles tend to be more accurate than their component classifiers.
• Different types of ensemble classifiers are:
1. Bagging
2. Boosting and AdaBoost
3. Random Forests
Bagging (Bootstrap Aggregating)
• This approach combines Bootstrapping and Aggregation to form one
ensemble model, that’s why the name is Bagging.
• Consider yourself as a patient and you would like to have a diagnosis
made based on the symptoms. Instead of asking one doctor, you may
choose to ask several.
• If a certain diagnosis occurs more than any other, you may choose
this as the final or best diagnosis. That is, the final diagnosis is made
based on a majority vote, where each doctor gets an equal vote.
• Replace each doctor with a classifier, and you have the basic idea behind bagging. Naturally, a majority vote made by a large group of doctors may be more reliable than a majority vote made by a small group.
Bagging (Bootstrap Aggregating)
• Given a sample of data, multiple bootstrapped subsamples are pulled. A
Decision Tree is formed on each of the bootstrapped subsamples. Each
training set is a bootstrap sample.
• After each subsample Decision Tree has been formed, an algorithm is used
to aggregate over the Decision Trees to form the most efficient predictor.
To classify an unknown tuple, X, each classifier, Mi, returns its class
prediction, which counts as one vote.
• The bagged classifier, M∗, counts the votes and assigns the class with the
most votes to X.
• Bagging often considers homogeneous weak learners, learns them independently from each other in parallel, and combines them following some kind of deterministic averaging process. Bagging can also be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple.
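A minimal sketch of bagging with scikit-learn, using decision trees as the base classifiers (iris as a stand-in dataset):

```python
# Minimal sketch: bagged decision trees combined by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bag = BaggingClassifier(DecisionTreeClassifier(),  # base classifier Mi
                        n_estimators=50,           # 50 bootstrap samples/trees
                        random_state=0)
print("Bagged trees accuracy:", cross_val_score(bag, X, y, cv=10).mean())
```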
Boosting
• Consider the same example as in the previous section: as a patient, you have certain symptoms, and instead of consulting one doctor you choose to consult several. Suppose you assign weights to the value or worth of each doctor's diagnosis, based on the accuracy of the previous diagnoses they have made. The final diagnosis is then a combination of the weighted diagnoses. This is the basic idea behind boosting.
• Boosting often considers homogeneous weak learners, learns them sequentially in a very adaptive way (a base model depends on the previous ones), and combines them following a deterministic strategy.
• In boosting, weights are also assigned to each training tuple. A series of k
classifiers is iteratively learned. After a classifier, Mi, is learned, the weights
are updated to allow the subsequent classifier, Mi+1, to “pay more
attention” to the training tuples that were misclassified by Mi. The final
boosted classifier, M∗, combines the votes of each individual classifier,
where the weight of each classifier’s vote is a function of its accuracy.
AdaBoost
• In adaptive boosting (often called "AdaBoost"), we try to define our ensemble model as a weighted sum of L weak learners.
• It’s a popular boosting algorithm.
• The basic idea is that when we build a classifier, we want it to focus
more on the misclassified tuples of the previous round.
• Some classifiers may be better at classifying some “difficult” tuples
than others.
• In this way, we build a series of classifiers that complement each
other.
AdaBoost
• We are given D, a data set of d class-labeled tuples, (X1, y1),(X2, y2),...,(Xd, yd), where yi
is the class label of tuple Xi. Initially, AdaBoost assigns each training tuple an equal
weight of 1/d.
• Generating k classifiers for the ensemble requires k rounds through the rest of the
algorithm. In round i, the tuples from D are sampled to form a training set, Di, of size d.
• Sampling with replacement is used. This indicates the same tuple may be selected more
than once. Each tuple’s chance of being selected is based on its weight.
• A classifier model, Mi, is derived from the training tuples of Di. Its error is then calculated
using Di as a test set. The weights of the training tuples are then adjusted according to
how they were classified.
• If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly
classified, its weight is decreased.
• A tuple's weight reflects how difficult it is to classify: the higher the weight, the more often it has been misclassified. These weights are used to generate the training samples for the classifier of the next round. In this way, a series of classifiers that complement each other is built.
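A minimal sketch of AdaBoost with scikit-learn, using shallow decision trees (stumps) as the weak learners:

```python
# Minimal sketch: AdaBoost over 50 decision stumps; each round reweights
# the training tuples that the previous classifier misclassified.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=0)
print("AdaBoost accuracy:", cross_val_score(ada, X, y, cv=10).mean())
```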
Random Forests
• Random forest models can be thought of as an extension of bagging, as it is bagging with a slight twist. Each classifier in the ensemble is a decision tree classifier, so that the collection of classifiers is a "forest". Each classifier is generated using a random selection of attributes at each node to determine the split. During classification, each tree votes and the most popular class is returned.
• When deciding where to split and how to make decisions, bagged decision trees have the full set of features at their disposal. Therefore, although the bootstrapped samples may be slightly different, the data will largely be split on the same features throughout each model. In contrast, random forest models decide where to split based on a random selection of features. Rather than splitting on similar features at each node throughout, random forest models introduce a level of differentiation because each tree splits based on different features. This differentiation provides a greater ensemble to aggregate over, producing a more accurate predictor.
Steps for implementing Random Forest
Classifier
1. Multiple subsets are created from the original dataset, selecting observations with replacement.
2. A subset of features is selected randomly, and whichever feature gives the best split is used to split the node; this is repeated at each node.
3. Each tree is grown to its largest extent.
4. The above steps are repeated, and the prediction is given based on the aggregation of the predictions from the n trees (see the sketch below).
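A minimal sketch of a random forest with scikit-learn; max_features="sqrt" requests a random subset of features at each split:

```python
# Minimal sketch: 100 trees, each split chosen from a random subset of
# features, predictions aggregated by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
print("Random forest accuracy:", cross_val_score(rf, X, y, cv=10).mean())
```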
