IntroClassificationDA 2024
Class #16
[Figure: traditional programming feeds Data and a Program into the Computer to produce Output; Machine Learning feeds Data and the desired Output into the Computer to produce the Program (model).]
The machine learning framework
[Figure: a learned function f maps an input image to a label, e.g., f(image of an apple) = “apple”, f(image of a tomato) = “tomato”, f(image of a cow) = “cow”.]
Slide credit: L. Lazebnik
The machine learning/classification framework
y = f(x), where y is the output prediction, f is the learned function, and x is the feature or representation of the input.
[Figure: Training — features extracted from labeled training data are used to learn a model; Testing — features of a test sample are fed to the learned model to obtain a prediction.]
What is deep learning?
[Table: a Training Set of tuples (Tid, Attrib1, Attrib2, Attrib3, Class) with known class labels (e.g., Tid 3: No, Small, 70K, No; Tid 6: No, Medium, 60K, No) is used to learn a model, which is then applied to a Test Set whose class labels are unknown (e.g., Tid 11: No, Small, 55K, ?; Tid 15: No, Large, 67K, ?).]
Supervised vs. Unsupervised Learning
[Figure: in supervised learning, classification algorithms use labeled Training Data to build a Classifier, which is then applied to Testing Data and to unseen data, e.g., the tuple (Jeff, Professor, 4) is assigned a class label.]
Use of decision tree
• For a tuple X, the attribute values are tested against the decision tree.
• A path is traced from the root to a leaf, which holds the class prediction for X.
• Reason for popularity
• Easy to construct, simple and fast
• No domain knowledge is required
• Easy to understand; interpretable
• They can handle multidimensional data
• Good accuracy
• Several applications
History
• Introduced by J. Ross Quinlan, who developed the ID3 algorithm (Iterative Dichotomiser)
• Quinlan later presented C4.5, which became a benchmark
• The CART (Classification and Regression Trees) algorithm was published by a group of four statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) working independently of Quinlan
• IBM Intelligent Miner
• The three algorithms ID3, C4.5, and CART all adopt a greedy (non-backtracking) approach
• Top-down, divide-and-conquer construction
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
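The bullets above can be made concrete with a minimal sketch of the greedy, top-down induction loop, assuming categorical attributes and examples stored as (attribute-dict, label) pairs; the helper names (build_tree, info_gain, majority_class) are illustrative, not from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes present in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Gain(attr) = Info(D) minus the weighted Info of the partitions induced by attr."""
    labels = [y for _, y in examples]
    partitions = {}
    for x, y in examples:
        partitions.setdefault(x[attr], []).append(y)
    remainder = sum(len(p) / len(examples) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def majority_class(examples):
    return Counter(y for _, y in examples).most_common(1)[0][0]

def build_tree(examples, attributes):
    """Greedy top-down recursive divide-and-conquer; returns nested dicts or a leaf label."""
    labels = {y for _, y in examples}
    if len(labels) == 1:                       # all examples in one class -> leaf
        return labels.pop()
    if not attributes:                         # no attributes left -> majority vote
        return majority_class(examples)
    best = max(attributes, key=lambda a: info_gain(examples, a))
    node = {"attribute": best, "branches": {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        node["branches"][value] = build_tree(subset, [a for a in attributes if a != best])
    return node
```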
About Shannon’s entropy
• Consider transmitting the symbols a1, a2, a3, a4
• Option 1 (fixed length): a1 = 00, a2 = 01, a3 = 10, a4 = 11
• Option 2 (variable length): a1 = 0, a2 = 10, a3 = 110, a4 = 111
• (i) If p(a1) = p(a2) = p(a3) = p(a4) = 1/4:
  • Option 1: expected code length = 2 bits
  • Option 2: expected code length = (1 + 2 + 3 + 3)/4 = 2.25 bits
• (ii) If p(a1) = 1/2, p(a2) = 1/4, p(a3) = 1/8, p(a4) = 1/8:
  • Option 1: still 2 bits (the code ignores the probabilities)
  • Option 2: expected code length = 1·1/2 + 2·1/4 + 3·1/8 + 3·1/8 = 1.75 bits
  • This equals Shannon’s entropy: −1/2·log2(1/2) − 1/4·log2(1/4) − 1/8·log2(1/8) − 1/8·log2(1/8) = 1/2 + 2/4 + 3/8 + 3/8 = 1.75
• Here pi is the probability of event x, and −log2(pi) = log2(1/pi) is the amount of information (information content) of event x
• The idea is that frequent symbols should be coded with shorter lengths.
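As a quick check of the arithmetic above, here is a minimal sketch; the function names are illustrative.

```python
from math import log2

def expected_length(probs, lengths):
    """Expected code length = sum(p_i * l_i)."""
    return sum(p * l for p, l in zip(probs, lengths))

def entropy(probs):
    """Shannon entropy H = -sum(p_i * log2(p_i)), the lower bound on expected length."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Option 1 uses 2-bit codes for every symbol; option 2 uses lengths 1, 2, 3, 3.
skewed = [1/2, 1/4, 1/8, 1/8]
print(expected_length(skewed, [2, 2, 2, 2]))  # 2.0  bits (option 1)
print(expected_length(skewed, [1, 2, 3, 3]))  # 1.75 bits (option 2)
print(entropy(skewed))                        # 1.75 bits = Shannon's entropy
```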
Attribute Selection Measure:
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
• Expected information (entropy) needed to classify a tuple in D:
  $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
• Information needed (after using A to split D into v partitions) to classify D:
  $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
• Information gained by branching on attribute A:
  $Gain(A) = Info(D) - Info_A(D)$
[Table: training data with attributes age (<=30, 31…40, >40), income (low, medium, high), student (yes/no), credit_rating (fair/excellent) and class buys_computer (yes/no). Computing the information gain of each attribute on this data gives:]
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
The attribute with the highest gain is then chosen as the splitting attribute.
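A small sketch of the gain computation above; the per-partition class counts for income (high: 2 yes/2 no, medium: 4 yes/2 no, low: 3 yes/1 no) are assumed from the standard textbook version of this example, so treat them as illustrative.

```python
from math import log2

def info(counts):
    """Info = -sum(p_i * log2(p_i)) for a list of class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gain(class_counts, partitions):
    """Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj))."""
    n = sum(class_counts)
    info_a = sum(sum(p) / n * info(p) for p in partitions)
    return info(class_counts) - info_a

# D has 9 "yes" and 5 "no" tuples; income splits D into high (2 yes / 2 no),
# medium (4 yes / 2 no) and low (3 yes / 1 no) in the assumed example.
print(round(gain([9, 5], [[2, 2], [4, 2], [3, 1]]), 3))  # 0.029
```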
Computing Information-Gain for Continuous-
Valued Attributes
• Let A be a continuous-valued attribute
• We must determine the best split point for A
• Sort the values of A in increasing order
• Typically, the midpoint between each pair of adjacent values is considered as a possible split point
  • (ai + ai+1)/2 is the midpoint between the values ai and ai+1
• The point with the minimum expected information requirement for A is selected as the split point for A (see the sketch below)
• Split:
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
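A minimal sketch of the midpoint-based split-point search described above, on made-up values; the function names are illustrative.

```python
from math import log2

def info(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_split_point(values, labels):
    """Try the midpoint of each pair of adjacent sorted values; return the one
    with the minimum expected information requirement Info_A(D)."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v <= split]
        right = [y for v, y in pairs if v > split]
        expected = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best  # (split-point, Info_A(D))

# hypothetical incomes (in $K) with class labels
print(best_split_point([60, 70, 75, 85, 95], ["no", "no", "yes", "yes", "yes"]))
```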
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a
large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
  $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
• GainRatio(A) = Gain(A)/SplitInfo_A(D)
• The attribute with the maximum gain ratio is selected as the splitting attribute
Gini Index (CART)
• $gini(D) = 1 - \sum_{i=1}^{m} p_i^2$
• If a binary split on attribute A partitions D into D1 and D2:
  $gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$
• Reduction in impurity:
  $\Delta gini(A) = gini(D) - gini_A(D)$
• The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Computation of Gini Index
• Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
  $gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}. Then
  $gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2)$
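A small sketch reproducing the Gini computation above; the class counts per partition (D1: 7 yes/3 no, D2: 2 yes/2 no) are assumed from the standard textbook version of this example.

```python
def gini(counts):
    """gini(D) = 1 - sum(p_i^2) for a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """gini_A(D) = sum(|Dj|/|D| * gini(Dj)) over the partitions of a split."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))                    # 0.459, as on the slide
print(round(gini_split([[7, 3], [2, 2]]), 3))    # split on income {low, medium} vs. {high}
```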
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: once the tree becomes too large, its test error rate begins to increase even though its training error rate continues to decrease.
Underfitting
• Underfitting is a situation where training and test
error rates of the model are large when the size of the
tree is very small.
• Underfitting occurs because the model has yet to learn
the true structure of the data
• It performs poorly on both training and test sets
• As the number of nodes in the decision tree increases, the tree will have fewer training and test errors.
• However, once the tree becomes too large, its test error rate begins to increase even though its training error rate continues to decrease. This is called overfitting.
Underfitting
• Underfitting refers to a model that is too general and fails to find
interesting patterns in the data.
• This can result from not including important variables as inputs during
the model building process.
• In terms of a loan application scenario, the analyst or model-builder may
include annual salary as an important factor, but may exclude
information about the applicant's job, which may turn out to be
important.
• For example, jobs that are seasonal (swimming pool maintenance,
landscaping, etc.) may affect the person's ability to submit monthly
payments during the times when work is slower -- information that is
not reflected by their annual salary.
• This might suggest that as much information as possible should be
included as inputs during the model building process to avoid
underfitting.
More on Underfitting
• Training error can be reduced by increasing the
model complexity.
• Leaf nodes can be expanded until the tree perfectly fits the training data.
• The test error, however, may then be large because the tree may fit some of the noise points in the data.
• Such nodes degrade the performance as they do not
generalize well to the test examples.
Overfitting
• Overfitting: once the tree becomes too large, its test error rate begins to increase even though its training error rate continues to decrease.
• Including as much information as possible to develop a model can
lead to the problem of overfitting.
• Overfitting refers to models that are too specific, or too sensitive to
the particulars of the data (the training set) used to build the model.
• This can be due to having too many variables used as inputs and/or a
non-representative training set.
• In the loan application scenario, if in the training set, many people that
defaulted on their loans happened to be named "Smith" (a popular
name), then the model (e.g., a decision tree) may learn to deny the loan whenever the
applicant's last name is "Smith".
• In refining the model, perhaps last name should not serve as an input.
More importantly, the characteristics of the data used to build the model
have to be representative of the data at large -- it's unlikely that people
named Smith have a disproportionately high rate of defaulting on loans.
Overfitting
• Overfitting can be corrected using the evaluation data
set. If accurate performance on the training set is due
to particular characteristics in that data (e.g. person's
last name), then performance will be poor on the
evaluation set as long as the evaluation set does not
share these idiosyncrasies.
• Refining the model (e.g., by pruning the decision tree) aims to make performance roughly equivalent on both the training and evaluation data sets.
• This will generally give the analyst the idea of how
well the model may perform on "real" data.
Overfitting Examples
• Overfitting can still occur despite the use of a training and evaluation data set.
• This can result from poorly constructed training and evaluation sets, so that neither is representative of the subsequent data to which the model will be applied.
Classification in Large Databases
• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why is decision tree induction popular?
• relatively faster learning speed (than other classification
methods)
• convertible to simple and easy to understand classification
rules
• can use SQL queries for accessing databases
• comparable classification accuracy with other methods
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
• Builds an AVC-list (attribute, value, class label)
Scalability Framework for RainForest
• Separates the scalability aspects from the criteria that
determine the quality of the tree
• Builds an AVC-list: AVC (Attribute, Value, Class_label)
• AVC-set (of an attribute X )
• Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
• AVC-group (of a node n )
• Set of AVC-sets of all predictor attributes at the node n
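A minimal sketch of how an AVC-set and AVC-group could be aggregated, using hypothetical records; this illustrates the data structure, not the actual RainForest implementation.

```python
from collections import Counter, defaultdict

def avc_set(records, attribute, class_label="buys_computer"):
    """Project the training records onto (attribute value, class label) and
    aggregate the class counts: RainForest's AVC-set for one attribute."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[attribute]][rec[class_label]] += 1
    return {value: dict(c) for value, c in counts.items()}

def avc_group(records, predictor_attributes):
    """AVC-group of a node: the AVC-sets of all predictor attributes."""
    return {a: avc_set(records, a) for a in predictor_attributes}

# hypothetical records
records = [
    {"age": "<=30", "income": "high", "buys_computer": "no"},
    {"age": "<=30", "income": "high", "buys_computer": "no"},
    {"age": "31...40", "income": "high", "buys_computer": "yes"},
    {"age": ">40", "income": "medium", "buys_computer": "yes"},
]
print(avc_group(records, ["age", "income"]))
```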
Rainforest: Training Set and Its AVC Sets
[Table: the training examples (age, income, student, credit_rating, buys_computer) and the corresponding AVC-sets on age and on income, i.e., the counts of buys_computer = yes/no for each value of age and for each value of income.]
Presentation Outline (Class #18)
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts
class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian classifier, has
comparable performance with decision tree and selected neural
network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct — prior
knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision-making
against which other methods can be measured
Derivation of Bayes’ theorem
• Definition: the conditional probability that A is true given B is written P(A|B) (read “the probability of A given B”). Here, P(A ∩ B) is the probability of the joint occurrence (intersection) of events A and B.
• P(A|B) = P(A ∩ B) / P(B)
• Equivalently, P(A ∩ B) = P(A|B) · P(B)
• Bayes’ theorem:
  • P(A ∩ B) = P(B ∩ A)
  • P(B ∩ A) = P(B|A) · P(A)
  • So we have P(A|B) = P(B|A) · P(A) / P(B)
• Let H be some hypothesis, such as “the data sample X belongs to a specified class C.” We want to determine P(H|X), the probability that the hypothesis H holds given the observed sample X.
• P(H|X) = P(X|H) · P(H) / P(X)
Examples of conditional probability
• Example 1: Suppose you are drawing three marbles (red, blue, and green) from a bag. Each marble has an equal chance of being drawn. What is the conditional probability of drawing the red marble after already drawing the blue one?
  • First, the probability of drawing a blue marble is about 33% because it is one possible outcome out of three. Assuming this first event occurs, two marbles remain, each with a 50% chance of being drawn, so the conditional probability of drawing the red marble given that the blue one has been drawn is 50%. The joint probability of drawing the blue marble and then the red marble is about 16.7% (33% x 50%).
• Example 2: Consider that a fair die has been rolled and you are asked to give the
probability that it was a five. There are six equally likely outcomes, so your answer
is 1/6.
• But imagine if before you answer, you get extra information that the number rolled was odd. Since
there are only three odd numbers that are possible, one of which is five, you would certainly revise
your estimate for the likelihood that a five was rolled from 1/6 to 1/3.
• Example 3: Suppose a student is applying for admission to a university and hopes
to receive an academic scholarship. The school to which they are applying accepts
100 of every 1,000 applicants (10%) and awards academic scholarships to 10 of
every 500 students who are accepted (2%). Of the scholarship recipients, 50% of
them also receive university stipends for books, meals, and housing.
• For the students, the chance of them being accepted and then receiving a scholarship is .2% (.1 x
.02). The chance of them being accepted, receiving the scholarship, then also receiving a stipend
for books, etc. is .1% (.1 x .02 x .5).
Bayesian Classification: Simple Overview
"The essence of the Bayesian approach is to provide a mathematical
rule explaining how you should change your existing beliefs in the light of new evidence.
In other words, it allows scientists to combine new data with their existing knowledge or
expertise.
The canonical example is to imagine that a precocious newborn observes his first sunset,
and wonders whether the sun will rise again or not. He assigns equal prior probabilities to
both possible outcomes, and represents this by placing one white and one black marble
into a bag. The following day, when the sun rises, the child places another white marble in
the bag. The probability that a marble plucked randomly from the bag will be white (i.e.,
the child's degree of belief in future sunrises) has thus gone from a half to two-thirds.
After sunrise the next day, the child adds another white marble, and the probability (and
thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the
initial belief that the sun is just as likely as not to rise each morning is modified to become
a near-certainty that the sun will always rise."
Bayesian Classification: Simple overview
• Suppose your data consist of fruits, described by their color and shape. Bayesian
classifiers operate by saying "If you see a fruit that is red and round, which type of fruit is
it most likely to be, based on the observed data sample? In future, classify red and round
fruit as that type of fruit."
• A difficulty arises when you have more than a few variables and classes -- you would
require an enormous number of observations (records) to estimate these probabilities.
• Naive Bayes classification gets around this problem by not requiring that you have lots of
observations for each possible combination of the variables. Rather, the variables are
assumed to be independent of one another and, therefore the probability that a fruit that is
red, round, firm, 3" in diameter, etc. will be an apple can be calculated from the
independent probabilities that a fruit is red, that it is round, that it is firm, that is 3" in
diameter, etc.
• In other words, Naive Bayes classifiers assume that the effect of a variable value on a given class is independent of the values of the other variables. This assumption is called class-conditional independence. It is made to simplify the computation and in this sense is considered to be naïve.
Bayesian Classification: Simple overview…
• This assumption is a fairly strong assumption and is often not applicable.
However, bias in estimating probabilities often may not make a difference in practice -- it is the order of the probabilities, not their exact values, that determines the classifications.
• Studies comparing classification algorithms have found the Naïve Bayesian
classifier to be comparable in performance with classification trees and with
neural network classifiers. They have also exhibited high accuracy and speed
when applied to large databases.
Bayes’ Theorem
Let X be the data record (case) whose class label is unknown. Let H be some hypothesis, such as "data record X belongs to a specified class C."
For classification, we want to determine P(H|X) – the probability that the hypothesis H holds, given the observed data record X.
In contrast, P(H) is the prior probability, or a priori probability, of H. In this example P(H) is the probability that any given data record is an apple, regardless of how the data record looks.
P(X) is the prior probability of X, i.e., it is the probability that a data record from our set of fruits is red and round.
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and P(X|H). Bayes’ theorem is
$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
Numeric example from the slide: for events M and S with P(S|M) = 0.5, P(M) = 1/50000, and P(S) = 1/20,
$P(M|S) = \frac{P(S|M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$
Bayesian Classifiers
• Consider each attribute and class label as random variables
• Given a record with attributes (A1, A2, …, An), predict the class C that maximizes the posterior probability
  $P(C \mid A_1, A_2, \ldots, A_n) = \frac{P(A_1, A_2, \ldots, A_n \mid C)\, P(C)}{P(A_1, A_2, \ldots, A_n)}$
Example: for the attribute windy, the training data give P(true|p) = 3/9 and P(true|n) = 3/5 (so P(false|p) = 6/9 and P(false|n) = 2/5); the class priors are P(p) = 9/14 and P(n) = 5/14. For the unseen sample X = (rain, hot, high, false):
• P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
• P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
• Since P(X|n)·P(n) > P(X|p)·P(p), X is classified as class n.
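A minimal sketch that plugs the probabilities from the example above into the naïve Bayes decision rule; the attribute names are spelled out only for readability.

```python
from math import prod

# Class priors and conditional probabilities taken from the slide's example.
prior = {"p": 9/14, "n": 5/14}
cond = {
    "p": {"outlook=rain": 3/9, "temp=hot": 2/9, "humidity=high": 3/9, "windy=false": 6/9},
    "n": {"outlook=rain": 2/5, "temp=hot": 2/5, "humidity=high": 4/5, "windy=false": 2/5},
}

def naive_bayes_score(features, cls):
    """P(X|C) * P(C) under the class-conditional independence assumption."""
    return prod(cond[cls][f] for f in features) * prior[cls]

X = ["outlook=rain", "temp=hot", "humidity=high", "windy=false"]
scores = {c: naive_bayes_score(X, c) for c in prior}
print(scores)                       # {'p': 0.01058..., 'n': 0.01828...}
print(max(scores, key=scores.get))  # 'n' -> X is classified as class n
```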
Lazy Learner: Instance-Based Methods
• Instance-based learning:
• Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
• Case-based reasoning
• Uses symbolic representations and knowledge-based inference
k-Nearest Neighbor Classifiers
• Basic idea:
• If it walks like a duck, quacks like a duck, then it’s
probably a duck
[Figure: compute the distance between the test record and the training records, and let the k "nearest" records vote on the class label.]
Euclidean distance between two points p and q:
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
Nearest Neighbor Classification…
• Scaling issues
• Attributes may have to be scaled to prevent distance
measures from being dominated by one of the
attributes
• Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
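A minimal k-nearest-neighbor sketch that first min-max scales the attributes so that a wide-range attribute such as income does not dominate the Euclidean distance; the data and helper names are made up for illustration.

```python
from collections import Counter
from math import dist

def min_max_scale(points):
    """Rescale each attribute to [0, 1] so that no single attribute
    (e.g. income in dollars vs. height in metres) dominates the distance."""
    dims = range(len(points[0]))
    lo = [min(p[i] for p in points) for i in dims]
    hi = [max(p[i] for p in points) for i in dims]
    return [tuple((p[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
                  for i in dims) for p in points]

def knn_classify(train_points, train_labels, query, k=3):
    """Majority vote among the k scaled training points closest to the query."""
    scaled = min_max_scale(train_points + [query])
    q, train_scaled = scaled[-1], scaled[:-1]
    nearest = sorted(zip(train_scaled, train_labels), key=lambda t: dist(t[0], q))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# hypothetical (height_m, weight_lb, income_k) records with made-up class labels
train = [(1.6, 120, 30), (1.8, 200, 250), (1.7, 150, 45), (1.75, 180, 900)]
labels = ["A", "B", "A", "B"]
print(knn_classify(train, labels, (1.65, 140, 40), k=3))
```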
Nearest Neighbor Classification…
[Figure: two candidate neighbours at the same Euclidean distance d = 1.4142 from the test point.]
http://scott.fortmann-roe.com/docs/BiasVariance.html
Case-Based Reasoning (CBR)
• CBR: Uses a database of problem solutions to solve new problems
• Store symbolic description (tuples or cases)—not points in a Euclidean space
• Applications: Customer-service (product-related diagnosis), legal ruling
• Methodology
• Instances represented by rich symbolic descriptions (e.g., function graphs)
• Search for similar cases, multiple retrieved cases may be combined
• Tight coupling between case retrieval, knowledge-based reasoning, and
problem solving
• Challenges
  • Finding a good similarity metric
  • Indexing based on syntactic similarity measures and, when retrieval fails, backtracking and adapting to additional cases
Presentation Outline (Class #19)
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Lazy Learners (k-nearest neighbours)
• Linear Classifiers
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
About Linear classifiers
• Classifiers can generate complex decision boundaries:
  • A decision tree may generate a hyper-rectangular boundary
  • A 1-nearest-neighbour classifier may generate a hyper-polygonal boundary
• A linear classifier generates a linear boundary
  • Better generalization performance, better interpretability
• We discuss three classifiers: Linear regression,
perceptron, logistic regression (widely used)
Linear Regression Problem: Example
[Figure: example training data plotted against living area.]
Linear Regression Problem: Model
• Linear regression
• Data: n independent objects
  • Observed values: $y_i$, $i = 1, 2, \ldots, n$
  • p-dimensional attributes: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$, $i = 1, 2, \ldots, n$
• Model:
  • Weight vector: $w = (w_1, w_2, \ldots, w_p)$
  • $y_i = w^T x_i + b$
  • The weight vector w and the bias b are the model parameters learned from the data
Linear Regression Model: Solution
• Closed-form solution (for the one-dimensional case, p = 1):
  $w = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}, \qquad b = \frac{1}{n}\sum_{i=1}^{n} (y_i - w x_i)$
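A small sketch of the closed-form solution above using NumPy (assumed available), on made-up living-area/price pairs.

```python
import numpy as np

def fit_simple_regression(x, y):
    """Closed-form solution for one-dimensional linear regression y = w*x + b,
    following the formula on the slide."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    w = np.sum(x * (y - y.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    b = np.mean(y - w * x)
    return w, b

# hypothetical living-area (x) vs. price (y) data
x = [50, 80, 100, 120, 150]
y = [150, 230, 280, 340, 420]
w, b = fit_simple_regression(x, y)
print(w, b)  # slope and intercept of the fitted line
```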
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive; Precision = TP/(TP+FP)
• Recall (completeness): what % of the positive tuples the classifier labeled as positive; Recall = TP/(TP+FN)
• F-measure (F1): the harmonic mean of precision and recall, F1 = 2 × Precision × Recall / (Precision + Recall)
More about Precision and Recall
• Precision (exactness): Suppose you have done duty for 10 hours, e.g., digging a well. How many hours of the total duty have you spent doing useful work? Ideally, you are supposed to spend all 10 hours on useful work. Total duty time (TP+FP) = time spent on useful work (TP) + time spent on other work/mistakes (FP). Precision = TP/(TP+FP).
• Recall (completeness): Suppose the objective is to dig the well to a depth of 10 meters. You have spent some total amount of time. Recall is about how much of the well has actually been dug. Total work required = digging actually done (TP) + digging missed because of other work/mistakes (FN). Recall = TP/(TP+FN).
• For a classifier, it is possible to have 100% precision and low recall. For example, you have spent all 10 hours digging the well, but only a small portion of the well has been dug (recall is low).
• For a classifier, it is possible to have 100% recall and low precision. For example, you have dug the 10-meter well as required, but have also spent several days making mistakes (many wrong results).
• The objective is to get good precision and recall together, which gives a good F1 score.
Classifier Evaluation Metrics: Example
Actual class \ Predicted class |  cancer = yes |  cancer = no |       Total
cancer = yes                   |       90 (TP) |     210 (FN) |     300 (P)
cancer = no                    |      140 (FP) |    9560 (TN) |    9700 (N)
Total                          |      230 (P') |   9770 (N')  | 10000 (All)
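A small sketch computing accuracy, precision, recall, and F1 from the confusion matrix above.

```python
def evaluate(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# cancer example from the slide: TP=90, FN=210, FP=140, TN=9560
print(evaluate(90, 210, 140, 9560))
# accuracy ~0.965, precision ~0.391, recall 0.300, F1 ~0.340
```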
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
• These mean error rates are just estimates of error on the true
population of future data cases
Estimating Confidence Intervals:
Null Hypothesis
• Perform 10-fold cross-validation
• Assume samples follow a t distribution with k–1 degrees of
freedom
• Use t-test (or Student’s t-test)
• Null Hypothesis: M1 & M2 are the same
• If we can reject null hypothesis, then
• we conclude that the difference between M1 & M2 is
statistically significant
• Choose the model with the lower error rate
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics)
curves: for visual comparison of classification
models
• Shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR)
• TPR is the proportion of positive (or “yes”) tuples that are correctly labeled by the model; FPR is the proportion of negative (or “no”) tuples that are mislabeled as positive. Recall that TP, FP, P, and N are the numbers of true positive, false positive, positive, and negative tuples, respectively.
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer the curve is to the diagonal line (i.e., the closer the area under it is to 0.5), the less accurate is the model
• In the plot, the vertical axis represents TPR and the horizontal axis represents FPR; the plot also shows a diagonal line. A model with perfect accuracy will have an area of 1.0.
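A minimal sketch of how the ROC points could be produced by ranking tuples by classifier score, as described above; the scores and labels are made up.

```python
def roc_points(scores, labels):
    """Rank tuples by decreasing score; after each tuple compute
    TPR = TP/P and FPR = FP/N."""
    ranked = sorted(zip(scores, labels), reverse=True)
    p = sum(1 for _, y in ranked if y == 1)
    n = len(ranked) - p
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points  # list of (FPR, TPR) pairs; the area under this curve is the AUC

# hypothetical classifier scores and true labels (1 = positive)
print(roc_points([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0]))
```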
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Presentation Outline
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
Ensemble Methods: Increasing the
Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
Ensemble Methods: Increasing the Accuracy
• Popular ensemble methods
• Bagging: trains each model on a bootstrap sample of the training set; the models are learned in parallel
• Boosting: trains each new model instance to emphasize the training instances that previous models misclassified; the models are learned sequentially
[Figure: schematic diagrams of bagging and boosting.]
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, and more robust
• Proven to give improved accuracy in prediction
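A minimal bagging sketch with a deliberately weak base learner (a single-threshold "stump"); the data and learner are made up for illustration and do not come from any specific library.

```python
import random
from collections import Counter

def bagging_train(D, learn, k=10):
    """Train k models, each on a bootstrap sample of D
    (|D| tuples sampled with replacement)."""
    return [learn([random.choice(D) for _ in range(len(D))]) for _ in range(k)]

def bagging_classify(models, x):
    """Each model votes; the class with the most votes wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# toy 1-D data and a weak learner: a single threshold at the sample mean
D = [(0.1, "no"), (0.3, "no"), (0.4, "yes"), (0.7, "yes"), (0.9, "yes")]

def learn_stump(sample):
    threshold = sum(v for v, _ in sample) / len(sample)
    above = Counter(y for v, y in sample if v > threshold).most_common(1)
    hi = above[0][0] if above else "yes"
    lo = "no" if hi == "yes" else "yes"
    return lambda x: hi if x > threshold else lo

models = bagging_train(D, learn_stump, k=25)
print(bagging_classify(models, 0.8))  # expected: "yes"
```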
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Compared with bagging: boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
[Figure: AdaBoost training loop.
1. Assign initial weights {wn(1)} to each training tuple.
2. Train base classifier M1 on the weighted dataset.
3. Update the weights based on the current model, giving {wn(2)}, …, {wn(k)}, and train M2, …, Mk in turn.
4. After the base classifiers are trained, they are combined to give the final classifier
   $M^*(x) = \mathrm{sign}\left(\sum_{i=1}^{k} \alpha_i M_i(x)\right)$]
Two 'weighting' strategies:
1. Assign weights to each training example
2. Sample the dataset based on the weight distribution
Adaboost (Freund and Schapire, 1997)
Adaptive boosting
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di
of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased, o.w. it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's error rate is the weighted sum over the d tuples of the misclassified tuples:
  $error(M_i) = \sum_{j} w_j \times err(X_j)$
• The weight of classifier Mi's vote is
  $\log\frac{1 - error(M_i)}{error(M_i)}$
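A small sketch of the weight bookkeeping described above (error rate, classifier vote weight, and tuple re-weighting); the misclassification pattern is made up, and the natural logarithm is assumed since the slide does not fix a base.

```python
from math import log

def classifier_error(weights, misclassified):
    """error(Mi) = sum of the weights of the misclassified tuples."""
    return sum(w for w, bad in zip(weights, misclassified) if bad)

def classifier_vote_weight(error):
    """Vote weight of Mi: log((1 - error(Mi)) / error(Mi))."""
    return log((1 - error) / error)

def update_weights(weights, misclassified, error):
    """Multiply the weights of correctly classified tuples by error/(1-error),
    then normalize, so misclassified tuples get relatively more weight next round."""
    adjusted = [w * (error / (1 - error)) if not bad else w
                for w, bad in zip(weights, misclassified)]
    total = sum(adjusted)
    return [w / total for w in adjusted]

# five tuples, equal initial weights; suppose round 1 misclassifies tuples 1 and 3
weights = [1/5] * 5
misclassified = [True, False, True, False, False]
err = classifier_error(weights, misclassified)   # 0.4
alpha = classifier_vote_weight(err)              # ~0.405
weights = update_weights(weights, misclassified, err)
print(err, alpha, weights)
```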
Gradient Boosting and XGBoost
• Gradient boosting is another powerful boosting technique, which can be used
for classification, regression, and ranking.
• If we use a tree (e.g., decision tree for classification, regression tree for
regression) as the base model (i.e., the weak learner), it is called gradient
tree boosting, or gradient boosted tree
Ensemble Methods Recap
Summary (II)
• Significance tests and ROC curves are useful for model selection.
• There have been numerous comparisons of the different
classification methods; the matter remains a research topic
• No single method has been found to be superior over all others
for all data sets
• Issues such as accuracy, training time, robustness, scalability, and
interpretability must be considered and can involve trade-offs,
further complicating the quest for an overall superior method