Classification: Basic Concepts

Class #16

P. Krishna Reddy, IIIT Hyderabad


Topics
• Introduction (1.5 hours): Definition, KDD framework, issues in data mining.
• Data summarization (7.5 hours): Data types, preprocessing, characterization,
discrimination, data warehousing techniques (multidimensional data model,
data warehousing architecture, data cube computation and OLAP technology).
• Concepts and algorithms for mining patterns and associations (9 hours):
Frequent itemset generation, Apriori and FP-growth algorithms, evaluation of
association patterns, and preprocessing.
• Concepts and algorithms related to classification and regression (9 hours):
Overview, decision tree induction, overfitting and underfitting, scalable
decision tree algorithms, Bayesian classification, regression-based prediction
methods.
• Concepts and algorithms for clustering the data (9 hours): Overview, types of
data, k-means, agglomerative clustering, clustering algorithms (DBSCAN, BIRCH,
CURE, ROCK, CHAMELEON).
• Outlier analysis and future trends (graph mining, spatio-temporal mining) (3
hours).
Presentation Outline
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Lazy Learners (k-nearest neighbours)
• Linear Classifiers
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
A simple question
• 1, 3, 5, 7, 9, … What is the next number?
• Ans: 11; odd numbers, or f(n) = 2n + 1
• 1, 3, 9, 19, 33, … What is the next number?
• Ans: 51; f(n) = 2n² + 1
• How do we solve such problems?
• Find a pattern from the examples
• (e.g., the function f(n) = 2n + 1, i.e., model the data)
• Use it to predict the next number (or solve the problem)
• How do we design a computational procedure?
A simple question (cont.)
• We know: 1, 3, 9, 19, 33, … What is the next
number?
• Ans: 51; 2n² + 1
• 0.99, 3.02, 9.00, 18.98, 33.01, … What next?
• Consider a series of 2D points
• (1,3), (2,6), (3,9), (4,12), …
• What is the next point?
• When does the problem become difficult?
• When numbers are “uncertain”: noise in
measurements
• When numbers are not just “simple numbers”
Traditional Programming

Data + Program → Computer → Output

Machine Learning

Data + Output → Computer → Program
The machine learning framework

• Apply a prediction function to a feature representation of
the “sample” to get the desired output:

f(image of an apple) = “apple”
f(image of a tomato) = “tomato”
f(image of a cow) = “cow”
Slide credit: L. Lazebnik
The machine learning/Classification
framework
y = f(x)
where y is the output, f is the prediction function, and x is the feature representation

• Training: given a training set of labeled examples {(x1,y1), …,


(xN,yN)}, estimate the prediction function f by minimizing the
prediction error.

• Testing: apply f to a never before seen test example x and output


the predicted value y = f(x)

Slide credit: L. Lazebnik


Steps

Training: Training Data → Features → Training (with Training Labels) → Learned model

Testing: Test sample → Features → Learned model → Prediction
What is deep learning?

Y. Bengio et al., “Deep Learning”, MIT Press, 2015
Hand-crafted features, Representative Learning and Deep
Learning
• "Hand Crafted" features refer to properties derived using
various algorithms using the information present in the data
itself.
• Representation learning, also known as feature learning, is
a process that allows a machine to identify the most useful
features or representations from raw data automatically. This
process is crucial in machine learning because it can
significantly improve the performance of learning
algorithms.
• Deep learning is a method in artificial intelligence (AI) that
teaches computers to process data in a way that is inspired
by the human brain. Deep learning models can recognize
complex patterns in pictures, text, sounds, and other data to
produce accurate insights and predictions.
Presentation Outline
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Lazy Learners (k-nearest neighbours)
• Linear Classifiers
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
Classification: Definition
• Given a collection of records (training set )
• Each record contains a set of attributes, one of the
attributes is the class.
• Find a model for class attribute as a function of
the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it.
Illustrating Classification Task

Training Set → Learning algorithm (Induction) → Learn Model → Model; the Model is then applied to the Test Set (Deduction).

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Supervised vs. Unsupervised Learning

• Supervised learning (classification)


• Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
• New data is classified based on the training set

• Unsupervised learning (clustering)


• The class labels of training data are unknown
• Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
Classification vs. Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data
• Prediction
• models continuous-valued functions, i.e., predicts unknown or
missing values
• Typical applications
• Credit/loan approval:
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category it is
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined
by the class label attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set, otherwise over-fitting will occur
• If the accuracy is acceptable, use the model to classify data tuples whose class
labels are not known
Process (1): Model Construction

Training Data → Classification Algorithms → Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting model (classification rule):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

Testing Data → Classifier; Unseen Data → Classifier → predicted class

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
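As a rough illustration of this two-step process on the tenured toy data above, the following Python sketch builds a model from the training tuples and applies it to the unseen tuple (Jeff, Professor, 4); the rank encoding and the use of a scikit-learn decision tree are my own choices, not part of the slides.

from sklearn.tree import DecisionTreeClassifier

# training data from the slide: (rank, years) -> tenured
ranks = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}
train = [("Assistant Prof", 3, "no"), ("Assistant Prof", 7, "yes"),
         ("Professor", 2, "yes"), ("Associate Prof", 7, "yes"),
         ("Assistant Prof", 6, "no"), ("Associate Prof", 3, "no")]
X = [[ranks[r], yrs] for r, yrs, _ in train]
y = [label for _, _, label in train]

model = DecisionTreeClassifier().fit(X, y)        # step 1: model construction
print(model.predict([[ranks["Professor"], 4]]))   # step 2: classify (Jeff, Professor, 4)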
Classification Issues: Data Preparation
• Data cleaning
• Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
• Remove the irrelevant or redundant attributes
• Data transformation
• Generalize and/or normalize data
Issues: Evaluating Classification Methods
• Estimate accuracy of the model
• The known label of test sample is compared with
the classified result from the model
• Accuracy rate is the percentage of test set
samples that are correctly classified by the model
• Test set is independent of training set (otherwise
overfitting)
• If the accuracy is acceptable, use the model to
classify new data
• Note: If the test set is used to select models, it is called
validation (test) set
Presentation Outline
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
Decision tree induction
(Class #17)
• Decision tree induction is the learning of
decision trees from class-labeled training
tuples.
• A decision tree is a flowchart-like tree
structure
• Each internal node (nonleaf node) denotes a test on an
attribute,
• Each branch represents an outcome of the test
• Each leaf node (or terminal node) holds a class label.
• Root node is the topmost node.
Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31..40 high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31..40 low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31..40 medium  no       excellent      yes
31..40 high    yes      fair           yes
>40    medium  no       excellent      no

 Resulting tree:
age?
  <=30: student?
    no  -> no
    yes -> yes
  31..40: yes
  >40: credit_rating?
    excellent -> no
    fair      -> yes
Use of decision tree
• For a tuple X, the attribute values are tested
against the decision tree.
• A path is traced from the root to leaf.
• Reason for popularity
• Easy to construct, simple and fast
• No domain knowledge is required
• Easy to understand; interpretable
• They can handle multidimensional data
• Good accuracy
• Several applications
History
• Introduced by J. Ross Quinlan
• Developed the ID3 algorithm (Iterative Dichotomiser)
• Quinlan also presented C4.5, which became a
benchmark
• The CART algorithm was published independently by four
researchers
• IBM Intelligent Miner
• The three algorithms ID3, C4.5, and CART adopt a
greedy (non-backtracking) approach
• Top-down, divide and conquer
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)

• Conditions for stopping partitioning


• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
• Inputs: D, attribute list, and attribute selection
method
• D is the complete set of training tuples and their
associated class labels
• Attribute list: the list of attributes describing the tuples
• Attribute selection method specifies the method to
select the attributes.
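A compact sketch of this greedy, top-down induction loop (ID3-style, categorical attributes only, information gain as the selection measure) might look as follows; the helper names are mine, and the "no samples left" case is not handled, so treat it as an outline rather than a full implementation.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # expected information after splitting on attr, subtracted from Info(D)
    n = len(labels)
    parts = {}
    for row, lab in zip(rows, labels):
        parts.setdefault(row[attr], []).append(lab)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:          # all samples belong to the same class
        return labels[0]
    if not attrs:                      # no remaining attributes: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "branches": {}}
    for value in set(r[best] for r in rows):   # recurse on each non-empty partition
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["branches"][value] = id3([rows[i] for i in idx],
                                      [labels[i] for i in idx],
                                      [a for a in attrs if a != best])
    return node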
Splitting possibilities
Brief Review of Entropy

(Figure: entropy of a two-class distribution, m = 2)
About Shannon’s entropy
• Consider transmission of the symbols a1, a2, a3, a4
• Option 1: a1 = 00, a2 = 01, a3 = 10, a4 = 11
• Option 2: a1 = 0, a2 = 10, a3 = 110, a4 = 111
(i) If p(a1) = p(a2) = p(a3) = p(a4) = 1/4:
Option 1: expected code length = 2 bits.
Option 2: expected code length = 2.25 bits.
(ii) If p(a1) = 1/2, p(a2) = 1/4, p(a3) = 1/8, p(a4) = 1/8:
Option 1: 2 bits (if we do not consider probabilities)
Option 2: 1.75 bits = 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1/2 + 2/4 + 3/8 + 3/8
(equal to Shannon’s entropy: −1/2·log(1/2) − 1/4·log(1/4) − 1/8·log(1/8) − 1/8·log(1/8) = 1.75)
Here pi is the probability of event xi; the information content of an event x with
probability p(x) is log(1/p(x)) = −log p(x).
The idea is that frequent letters should be coded with
smaller lengths.
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci,D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
 Information needed (after using A to split D into v partitions) to
classify D:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
 Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain
 Class P: buys_computer = “yes”
 Class N: buys_computer = “no”

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31..40  4   0   0
>40     3   2   0.971

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Here \frac{5}{14} I(2,3) means “age <= 30” has 5 out of 14
samples, with 2 yes’es and 3 no’s. Hence

Gain(age) = Info(D) - Info_{age}(D) = 0.246

Similarly (using the training data shown earlier),
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
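A quick numeric check of the worked example, using the class labels and age values read off the buys_computer table (a sketch with my own helper names):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
ages   = ["<=30","<=30","31..40",">40",">40",">40","31..40","<=30","<=30",">40","<=30","31..40","31..40",">40"]

info_D = entropy(labels)
info_age = 0.0
for v in set(ages):
    group = [l for a, l in zip(ages, labels) if a == v]
    info_age += len(group) / len(labels) * entropy(group)

print(round(info_D, 3), round(info_age, 3))   # 0.94 0.694; Gain(age) = 0.940 - 0.694 = 0.246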
Computing Information-Gain for Continuous-
Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
• Sort the value A in increasing order
• Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
• The point with the minimum expected information
requirement for A is selected as the split-point for A
• Split:
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a
large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)

• GainRatio(A) = Gain(A)/SplitInfo_A(D)
• Ex.
• gain_ratio(income) = 0.029/1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the
splitting attribute
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index
gini(D) is defined as

gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini
index of the split, gini_A(D), is defined as

gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

• Reduction in impurity:

\Delta gini(A) = gini(D) - gini_A(D)

• The attribute that provides the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
Computation of Gini Index
• Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”

gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low,
medium} and 4 tuples in D2: {high}

gini_{income \in \{low,medium\}}(D) = \frac{10}{14}\,gini(D_1) + \frac{4}{14}\,gini(D_2) = 0.443

 Gini_{low,high} is 0.458; Gini_{medium,high} is 0.450. Thus, split on
{low, medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split
values
• Can be modified for categorical attributes
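A small sketch reproducing the Gini numbers above; the class counts in D1 and D2 (7 yes/3 no and 2 yes/2 no) are read off the buys_computer table and are my own tallies, not stated on the slide.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(labels_d1, labels_d2):
    n = len(labels_d1) + len(labels_d2)
    return len(labels_d1) / n * gini(labels_d1) + len(labels_d2) / n * gini(labels_d2)

D  = ["yes"] * 9 + ["no"] * 5
D1 = ["yes"] * 7 + ["no"] * 3       # income in {low, medium}
D2 = ["yes"] * 2 + ["no"] * 2       # income = high
print(round(gini(D), 3))             # 0.459
print(round(gini_split(D1, D2), 3))  # 0.443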
Comparing Attribute Selection Measures

• The three measures, in general, return good results but


• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and purity in
both partitions

Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence

• G-statistic: has a close approximation to χ2 distribution

• MDL (Minimal Description Length) principle (i.e., the simplest solution is


preferred):
• The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree

• Multivariate splits (partition based on multiple variable combinations)


• CART: finds multivariate splits based on a linear comb. of attrs.

• Which attribute selection measure is the best?


• Most give good results; none is significantly superior to the others
Enhancements to Basic Decision Tree Induction

• Allow for continuous-valued attributes


• Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
• Handle missing attribute values
• Assign the most common value of the attribute
• Assign probability to each of the possible values
• Attribute construction
• Create new attributes based on existing ones that are
sparsely represented
• This reduces fragmentation, repetition, and replication
Underfitting and Overfitting
• Underfitting and overfitting are two problems that will
adversely affect the accuracy and usefulness of the model.

• Classification requires three data sets


• Training data set
• This is the data that are analyzed to first produce a model.
• Test data set
• Once a model is developed it is then tested on the evaluation data
set. The second data set is an evaluation data set important for
refining the model if necessary, and to get a sense of how well the
model will perform on subsequent data that it will be applied on.
• Data on which classification is applied
Underfitting and Overfitting

(Figure: training and test error rates vs. model complexity, showing the underfitting and overfitting regions)

Underfitting: when the model is too simple, both training and test errors are large
Overfitting: once the tree becomes too large, its test error rate begins to
increase even though its training error rate continues to decrease.
Underfitting
• Underfitting is a situation where training and test
error rates of the model are large when the size of the
tree is very small.
• Underfitting occurs because the model has yet to learn
the true structure of the data
• It performs poorly on both training and test sets
• As the number of nodes in the decision tree increases, the
tree will have fewer training and test errors.
• However, once the tree becomes too large, its test error
rate begins to increase even though its training error
rate continues to decrease. This is called overfitting.
Underfitting
• Underfitting refers to a model that is too general and fails to find
interesting patterns in the data.
• This can result from not including important variables as inputs during
the model building process.
• In terms of a loan application scenario, the analyst or model-builder may
include annual salary as an important factor, but may exclude
information about the applicant's job, which may turn out to be
important.
• For example, jobs that are seasonal (swimming pool maintenance,
landscaping, etc.) may affect the person's ability to submit monthly
payments during the times when work is slower -- information that is
not reflected by their annual salary.
• This might suggest that as much information as possible should be
included as inputs during the model building process to avoid
underfitting.
More on Underfitting
• Training error can be reduced by increasing the
model complexity.
• Leaf-nodes can be expanded until it perfectly fits the
training data.
• As a result the test error may be large because the tree
may fit some of the noise points in the data.
• Such nodes degrade the performance as they do not
generalize well to the test examples.
Overfitting
• Overfitting: once the tree becomes too large, its test error rate begins
to increase even though its training error rate continues to decrease.
• Including as much information as possible to develop a model can
lead to the problem of overfitting.
• Overfitting refers to models that are too specific, or too sensitive to
the particulars of the data (the training set) used to build the model.
• This can be due to having too many variables used as inputs and/or a
non-representative training set.
• In the loan application scenario, if in the training set, many people that
defaulted on their loans happened to be named "Smith" (a popular
name), then the model (e.g. a decision tree) may decide that if the
applicant's last name is "Smith", then deny the loan.
• In refining the model, perhaps last name should not serve as an input.
More importantly, the characteristics of the data used to build the model
have to be representative of the data at large -- it's unlikely that people
named Smith have a disproportionately high rate of defaulting on loans.
Overfitting
• Overfitting can be corrected using the evaluation data
set. If accurate performance on the training set is due
to particular characteristics in that data (e.g. person's
last name), then performance will be poor on the
evaluation set as long as the evaluation set does not
share these idiosyncrasies.
• Refining the model (e.g. by pruning the decision tree)
involves setting performance to be generally
equivalent on both the training and evaluation data
sets.
• This will generally give the analyst the idea of how
well the model may perform on "real" data.
Overfitting Examples
• Overfitting can still occur despite the use of a training and evaluation data set.

• This can be the result of poorly creating the training and evaluation sets -- so that
neither is representative of subsequent data that the model will be applied to.

• For example, consider predicting stock market performance.


• One year's worth of data were partitioned into equal sized training and evaluation
sets. A quantitative model was developed based on the training set -- and its
predictive power on this dataset was impressive. It was equally impressive on the
evaluation dataset, apparently requiring no refinement. But when applied to the
next year's data it performed miserably, despite the fact that there were no
significant events related to the stock market. The reason for this had to do with
how the training and evaluation data sets were created. The training and evaluation
sets were based on the daily closing values of alternate days. Day 1's close was
assigned to the training set, day 2 to the evaluation set, day 3 to the training set, day
4 to the evaluation set, and so forth. As a result the overfitting that occurred when
the model picked up factors tied to the temporal fluctuations in the stock market's
closing values were carried over into the evaluation set as well.
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
• Too many branches, some may reflect anomalies due to
noise or outliers
• Poor accuracy for unseen samples
• Two approaches to avoid overfitting
• Prepruning: Halt tree construction early; do not split a node if
this would result in the goodness measure falling below a
threshold
• Difficult to choose an appropriate threshold
• Postpruning: Remove branches from a “fully grown” tree—
get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is
the “best pruned tree”

Classification in Large Databases
• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why is decision tree induction popular?
• relatively faster learning speed (than other classification
methods)
• convertible to simple and easy to understand classification
rules
• can use SQL queries for accessing databases
• comparable classification accuracy with other methods
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
• Builds an AVC-list (attribute, value, class label)
Scalability Framework for RainForest
• Separates the scalability aspects from the criteria that
determine the quality of the tree
• Builds an AVC-list: AVC (Attribute, Value, Class_label)
• AVC-set (of an attribute X )
• Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
• AVC-group (of a node n )
• Set of AVC-sets of all predictor attributes at the node n

Rainforest: Training Set and Its AVC Sets

Training examples: the buys_computer training data shown earlier.

AVC-set on Age
Age     Buy_Computer: yes  no
<=30    2   3
31..40  4   0
>40     3   2

AVC-set on income
income  Buy_Computer: yes  no
high    2   2
medium  4   2
low     3   1

AVC-set on Student
student  Buy_Computer: yes  no
yes      6   1
no       3   4

AVC-set on credit_rating
Credit rating  Buy_Computer: yes  no
fair           6   2
excellent      3   3
BOAT (Bootstrapped Optimistic Algorithm
for Tree Construction)
• Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
• Each subset is used to create a tree, resulting in several
trees
• These trees are examined and used to construct a new
tree T’
• It turns out that T’ is very close to the tree that would
be generated using the whole data set together
• Adv: requires only two scans of DB, an incremental alg.

Presentation Outline (Class #18)
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts
class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian classifier, has
comparable performance with decision tree and selected neural
network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct — prior
knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision-making
against which other methods can be measured
Derivation of Bayes’ theorem
• Definition: The conditional probability that A is true, given B, is written P(A|B)
(read “probability of A given B”). Here, P(A ∩ B) is the probability of the joint occurrence (intersection) of
events A and B
• P(A|B) = P(A ∩ B)/P(B)
• Or P(A ∩ B) = P(A|B) · P(B)
• Bayes’ theorem
• P(A ∩ B) = P(B ∩ A)
• P(B ∩ A) = P(B|A) · P(A)
• So, we have P(A|B) = P(B|A) · P(A) / P(B)

• Let H be some hypothesis, such as “the data sample X belongs to a specified class C”. For
this we want to determine P(H|X), the probability that the hypothesis H holds given the
observed sample X.
• P(H|X) = P(X|H) · P(H) / P(X)
Examples of conditional probability
• Example 1: Suppose you are drawing three marbles—red, blue, and green—from a
bag. Each marble has an equal chance of being drawn. What is the conditional
probability of drawing the red marble after already drawing the blue one?
• First, the probability of drawing a blue marble is about 33% because it is one possible outcome
out of three. Assuming this first event occurs, there will be two marbles remaining, each
having a 50% chance of being drawn. So the conditional probability of drawing the red marble
given that the blue one was drawn first is 50%, while the chance of drawing the blue marble and
then the red one is about 16.5% (33% x 50%).
• Example 2: Consider that a fair die has been rolled and you are asked to give the
probability that it was a five. There are six equally likely outcomes, so your answer
is 1/6.
• But imagine if before you answer, you get extra information that the number rolled was odd. Since
there are only three odd numbers that are possible, one of which is five, you would certainly revise
your estimate for the likelihood that a five was rolled from 1/6 to 1/3.
• Example 3: Suppose a student is applying for admission to a university and hopes
to receive an academic scholarship. The school to which they are applying accepts
100 of every 1,000 applicants (10%) and awards academic scholarships to 10 of
every 500 students who are accepted (2%). Of the scholarship recipients, 50% of
them also receive university stipends for books, meals, and housing.
• For the students, the chance of them being accepted and then receiving a scholarship is .2% (.1 x
.02). The chance of them being accepted, receiving the scholarship, then also receiving a stipend
for books, etc. is .1% (.1 x .02 x .5).
Bayesian Classification: Simple Overview

"The essence of the Bayesian approach is to provide a mathematical
rule explaining how you should change your existing beliefs in the light of new evidence.
 In other words, it allows scientists to combine new data with their existing knowledge or
expertise.

 The canonical example is to imagine that a precocious newborn observes his first sunset,
and wonders whether the sun will rise again or not. He assigns equal prior probabilities to
both possible outcomes, and represents this by placing one white and one black marble
into a bag. The following day, when the sun rises, the child places another white marble in
the bag. The probability that a marble plucked randomly from the bag will be white (ie,
the child's degree of belief in future sunrises) has thus gone from a half to two-thirds.
After sunrise the next day, the child adds another white marble, and the probability (and
thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the
initial belief that the sun is just as likely as not to rise each morning is modified to become
a near-certainty that the sun will always rise."
Bayesian Classification: Simple overview
• Suppose your data consist of fruits, described by their color and shape. Bayesian
classifiers operate by saying "If you see a fruit that is red and round, which type of fruit is
it most likely to be, based on the observed data sample? In future, classify red and round
fruit as that type of fruit.“
• A difficulty arises when you have more than a few variables and classes -- you would
require an enormous number of observations (records) to estimate these probabilities.
• Naive Bayes classification gets around this problem by not requiring that you have lots of
observations for each possible combination of the variables. Rather, the variables are
assumed to be independent of one another and, therefore the probability that a fruit that is
red, round, firm, 3" in diameter, etc. will be an apple can be calculated from the
independent probabilities that a fruit is red, that it is round, that it is firm, that is 3" in
diameter, etc.
• In other words, Naive Bayes classifiers assume that the effect of a variable value on a
given class is independent of the values of the other variables. This assumption is called class
conditional independence. It is made to simplify the computation and in this sense
considered to be naïve.
Bayesian Classification: Simple overview…
• This assumption is a fairly strong assumption and is often not applicable.
However, bias in estimating probabilities often may not make a difference in
practice -- it is the order of the probabilities, not their exact values, that
determine the classifications.
• Studies comparing classification algorithms have found the Naïve Bayesian
classifier to be comparable in performance with classification trees and with
neural network classifiers. They have also exhibited high accuracy and speed
when applied to large databases.
Bayes’ Theorem
Let X be the data record (case) whose class label is unknown. Let H be some hypothesis,
such as "data record X belongs to a specified class C.“

For classification, we want to determine P (H|X) – the probability that the hypothesis H
holds, given the observed data record X.

P (H|X) is the posterior probability of H conditioned on X. For example, the probability


that a fruit is an apple, given the condition that it is red and round.

In contrast, P(H) is the prior probability, or a priori probability, of H. In this example P(H)
is the probability that any given data record is an apple, regardless of how the data record
looks.

The posterior probability, P (H|X), is based on more information (such as background


knowledge) than the prior probability, P(H), which is independent of X.
Bayesian Classification: Simple introduction…

Similarly, P (X|H) is posterior probability of X conditioned on H. That is, it is the


probability that X is red and round given that we know that it is true that X is an apple.

P(X) is the prior probability of X, i.e., it is the probability that a data record from our set
of fruits is red and round.

Bayes theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X), and P(X|H). Bayes theorem is

P (H|X) = P(X|H) P(H) / P(X)


Bayes Classifier
• A probabilistic framework for solving classification
problems
• Conditional probability:
P(C|A) = \frac{P(A, C)}{P(A)}
P(A|C) = \frac{P(A, C)}{P(C)}
• Bayes theorem:
P(C|A) = \frac{P(A|C)\,P(C)}{P(A)}
Example of Bayes Theorem
• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20

• If a patient has a stiff neck, what’s the probability
he/she has meningitis?

P(M|S) = \frac{P(S|M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002
Bayesian Classifiers
• Consider each attribute and class label as random
variables

• Given a record with attributes (A1, A2,…,An)


• Goal is to predict class C
• Specifically, we want to find the value of C that maximizes P(C|
A1, A2,…,An )

• Can we estimate P(C| A1, A2,…,An ) directly from data?


Bayesian Classifiers
• Approach:
• compute the posterior probability P(C | A1, A2, …, An) for all values of C
using the Bayes theorem

P( A A  A | C ) P(C )
P(C | A A  A )  1 2 n

P( A A  A )
1 2 n

1 2 n

• Choose value of C that maximizes


P(C | A1, A2, …, An)

• Equivalent to choosing value of C that maximizes


P(A1, A2, …, An|C) P(C)

• How to estimate P(A1, A2, …, An | C ) ?


Naïve Bayes Classifier
• Assume independence among attributes Ai when
the class is given:
• P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

• Can estimate P(Ai | Cj) for all Ai and Cj.

• A new point is classified to Cj if P(Cj) \prod_i P(Ai | Cj) is
maximal.
How to Estimate Probabilities from Data?

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Class: P(C) = Nc/N
• e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes:
P(Ai | Ck) = |Aik| / Nc
• where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
• Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes) = 0
How to Estimate Probabilities from Data?

• For continuous attributes:


• Discretize the range into bins
• one ordinal attribute per bin
• violates independence assumption
• Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
• Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, can use it to
estimate the conditional probability P(Ai|c)
How to Estimate Probabilities from Data?

(Training data: the Refund / Marital Status / Taxable Income / Evade table shown above)

• Normal distribution:

P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \, e^{-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}}

• One for each (Ai, cj) pair
• For (Income, Class=No):
• If Class=No
• sample mean = 110
• sample variance = 2975

P(Income = 120 \mid No) = \frac{1}{\sqrt{2\pi}\,(54.54)} \, e^{-\frac{(120-110)^2}{2(2975)}} = 0.0072
Example of Naïve Bayes Classifier
Given a test record:
X = (Refund = No, Married, Income = 120K)

Naïve Bayes classifier (probabilities estimated from the training data):
P(Refund=Yes|No) = 3/7          P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0           P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No) = 1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/7
P(Marital Status=Divorced|Yes) = 1/7
P(Marital Status=Married|Yes) = 0
For taxable income:
If class=No:  sample mean = 110, sample variance = 2975
If class=Yes: sample mean = 90,  sample variance = 25

P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
             = 4/7 × 4/7 × 0.0072 = 0.0024
P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
             = 1 × 0 × 1.2 × 10^-9 = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes), therefore P(No|X) > P(Yes|X)
=> Class = No
Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then
the entire expression becomes zero
• Probability estimation:

Original:    P(A_i \mid C) = \frac{N_{ic}}{N_c}
Laplace:     P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}
m-estimate:  P(A_i \mid C) = \frac{N_{ic} + mp}{N_c + m}

where c is the number of classes, p is a prior probability, and m is a parameter
Naïve Bayes Classifier: Example 1

A: attributes, M: mammals, N: non-mammals

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

P(A|M) = \frac{6}{7} \times \frac{6}{7} \times \frac{2}{7} \times \frac{2}{7} = 0.06
P(A|N) = \frac{1}{13} \times \frac{10}{13} \times \frac{3}{13} \times \frac{4}{13} = 0.0042
P(A|M)P(M) = 0.06 \times \frac{7}{20} = 0.021
P(A|N)P(N) = 0.0042 \times \frac{13}{20} = 0.0027

P(A|M)P(M) > P(A|N)P(N) => Mammals
Naïve Bayes Classifier: Example 2: estimating P(xi|C)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9     P(sunny|n) = 3/5
              P(overcast|p) = 4/9  P(overcast|n) = 0
              P(rain|p) = 3/9      P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9       P(hot|n) = 2/5
              P(mild|p) = 4/9      P(mild|n) = 2/5
              P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:     P(high|p) = 3/9      P(high|n) = 4/5
              P(normal|p) = 6/9    P(normal|n) = 2/5
windy:        P(true|p) = 3/9      P(true|n) = 3/5
              P(false|p) = 6/9     P(false|n) = 2/5
Example 2….
• Classify an unseen sample X = <rain, hot, high, false>

• P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
• P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286

• Sample X is classified in class n (don’t play)


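A quick check of the two products above for X = <rain, hot, high, false>, using the probabilities estimated on the weather table:

p_X_given_p = (3/9) * (2/9) * (3/9) * (6/9)
p_X_given_n = (2/5) * (2/5) * (4/5) * (2/5)
print(round(p_X_given_p * 9/14, 6))   # 0.010582
print(round(p_X_given_n * 5/14, 6))   # 0.018286 -> class n (don't play)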
Naïve Bayes (Summary)
• Robust to isolated noise points
• Handle missing values by ignoring the instance during probability
estimate calculations
• Robust to irrelevant attributes
• Independence assumption may not hold for some attributes
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables) are often
correlated.
• Attempts to overcome this limitation:
• Bayesian networks, that combine Bayesian reasoning with causal
relationships between attributes
• Decision trees, that reason on one attribute at the time, considering most
important attributes first
Presentation Outline
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Lazy Learners (k-nearest neighbours)
• Linear Classifiers
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
Lazy vs. Eager Learning
• Lazy vs. eager learning
• Lazy learning (e.g., instance-based learning): Simply stores
training data (or only minor processing) and waits until it is
given a test tuple
• Eager learning (the above discussed methods): Given a set of
training tuples, constructs a classification model before
receiving new (e.g., test) data to classify
• Lazy: less time in training but more time in predicting
• Accuracy
• Lazy method effectively uses a richer hypothesis space since it
uses many local linear functions to form an implicit global
approximation to the target function
• Eager: must commit to a single hypothesis that covers the
entire instance space

Lazy Learner: Instance-Based Methods

• Instance-based learning:
• Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean space.

• Case-based reasoning
• Uses symbolic representations and knowledge-based inference

k-Nearest Neighbor Classifiers
• Basic idea:
• If it walks like a duck, quacks like a duck, then it’s
probably a duck

(Diagram: compute the distance from the test record to the training records and choose the k “nearest” records)
k-Nearest-Neighbor Classifiers
 Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

 To classify an unknown record:


– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
Definition of Nearest Neighbor

(Figure: (a) the 1-nearest neighbor, (b) 2-nearest neighbors, (c) 3-nearest neighbors of a record x)

The k-nearest neighbors of a record x are the data points
that have the k smallest distances to x
Nearest Neighbor Classification

• Compute the distance between two points:
• Euclidean distance

d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}

• Determine the class from the nearest neighbor list
• take the majority vote of class labels among the k
nearest neighbors
• Weigh the vote according to distance
• weight factor, w = 1/d²
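A minimal k-nearest-neighbor sketch (Euclidean distance, unweighted majority vote); the helper names and the tiny 2D example are mine, for illustration only.

import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_X, train_y, x, k=3):
    # pick the k training records closest to x and take a majority vote
    neighbors = sorted(zip(train_X, train_y), key=lambda t: euclidean(t[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.5, 3.9)]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, (4.1, 4.0), k=3))   # "B"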
Nearest Neighbor Classification…

• Choosing the value of k:


• If k is too small, sensitive to noise points
• If k is too large, neighborhood may include points from other
classes

Nearest Neighbor Classification…

• Scaling issues
• Attributes may have to be scaled to prevent distance
measures from being dominated by one of the
attributes
• Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
Nearest Neighbor Classification…

• Problem with Euclidean measure:


• High dimensional data
• curse of dimensionality
• Can produce counter-intuitive results
111111111110 vs 011111111111: d = 1.4142
100000000000 vs 000000000001: d = 1.4142

 Solution: Normalize the vectors to unit length


Nearest neighbor Classification…

• k-NN classifiers are lazy learners


• It does not build models explicitly
• Unlike eager learners such as decision tree induction
and rule-based systems
• Classifying unknown records is relatively expensive
Selection of k for kNN
• The number of neighbors k
• Small k: overfitting (high var., low bias)
• Big k: bringing too many irrelevant points (high bias, low
var.)

http://scott.fortmann-roe.com/docs/BiasVariance.html
Case-Based Reasoning (CBR)
• CBR: Uses a database of problem solutions to solve new problems
• Store symbolic description (tuples or cases)—not points in a Euclidean space
• Applications: Customer-service (product-related diagnosis), legal ruling
• Methodology
• Instances represented by rich symbolic descriptions (e.g., function graphs)
• Search for similar cases, multiple retrieved cases may be combined
• Tight coupling between case retrieval, knowledge-based reasoning, and
problem solving
• Challenges
• Find a good similarity metric
• Indexing based on syntactic similarity measure, and when failure,
backtracking, and adapting to additional cases
Presentation Outline (Class #19)
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Lazy Learners (k-nearest neighbours)
• Linear Classifiers
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
About Linear classifiers
• Classifiers generate complex decision boundaries.
• A decision tree may generate a hyper-rectangular
boundary
• A 1-nearest-neighbour classifier may generate a
hyper-polygonal boundary
• A linear classifier generates a linear boundary
• Better generalization performance, better
interpretability
• We discuss three classifiers: linear regression,
perceptron, and logistic regression (widely used)
Linear Regression Problem: Example

• Mapping from independent attributes to a continuous
value: x => y
• {living area} => Price of the house
• {college; major; GPA} => Future Income

(Figure: scatter plot of price of houses vs. living area)
Linear Regression Problem: Model

• Linear regression
• Data: n independent objects
• Observed value: y_i, i = 1, 2, \ldots, n
• p-dimensional attributes: x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T, i = 1, 2, \ldots, n

• Model:
• Weight vector: w = (w_1, w_2, \ldots, w_p)
• y_i = w^T x_i + b
• The weight vector w and bias b are the model parameters
learnt from the data
Linear Regression Model: Solution

• Least Square Method
• Cost / Loss Function: L(w, b) = \sum_{i=1}^{n} (y_i - w x_i - b)^2
• Optimization Goal: \arg\min_{w,b} L(w, b)

• Closed-form solution (single attribute):
w = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2},   b = \frac{1}{n} \sum_{i=1}^{n} (y_i - w x_i)

• For multiple attributes, multiple linear regression methods are applied.
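A small sketch of the single-attribute closed-form solution above; the living-area/price numbers are made up for illustration.

def fit_simple_linear(xs, ys):
    n = len(xs)
    y_bar = sum(ys) / n
    w = sum(x * (y - y_bar) for x, y in zip(xs, ys)) / \
        (sum(x * x for x in xs) - sum(xs) ** 2 / n)
    b = sum(y - w * x for x, y in zip(xs, ys)) / n
    return w, b

# living area -> price (toy numbers)
area  = [50.0, 80.0, 110.0, 140.0]
price = [100.0, 160.0, 215.0, 275.0]
w, b = fit_simple_linear(area, price)
print(round(w, 3), round(b, 3))   # fitted slope and intercept
print(round(w * 100 + b, 1))      # predicted price for a living area of 100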
Perceptron
• Consider a binary classification task
• The output value yi for a given tuple is a binary variable: yi = +1 indicates the
ith tuple is a positive tuple (e.g., buy computer) and yi = 0 indicates the ith
tuple is a negative one (e.g., not buy computer).
• The predicted class label is the sign of the regression output
• The sign is +1 if the tuple is predicted as positive, and 0 otherwise
• If we know the weight vector W, we can predict the class label.
• W is iteratively learned from the training set.
• If the training tuples are linearly separable, the perceptron algorithm is guaranteed to
find a weight vector (i.e., a hyperplane decision boundary)
Logistic Regression
• Perceptron predicts the binary class label of a given tuple. However, can we
also tell how confident such a prediction is?
• Logistic regression estimates the probability of an event occurring, such as
voted or didn't vote, based on a given dataset of independent variables.
• To convert the output to the range 0 to 1, a sigmoid function is introduced.
• Sigmoid function (a differentiable function):

\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}

• Projects (−∞, +∞) to [0, 1]
• Not only logistic regression uses this function, but also neural networks and deep learning
(Figure: the sigmoid function)
• To determine the optimal weight vector W, the maximum
likelihood estimation method is employed.
• It aims to solve the following optimization
problem: choosing the best weight vector w
that maximizes the likelihood of the training set.
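A minimal logistic-regression sketch that maximizes the training-set likelihood by stochastic gradient ascent; the toy data, learning rate, and epoch count are my own choices, not from the slides.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = yi - p                      # gradient of the log-likelihood term
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return w, b

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
print(round(sigmoid(w[0] * 2.9 + b), 2))   # estimated P(y = 1 | x = 2.9)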
Presentation Outline
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Lazy Learners
• K-nearest neighbors
• Case based reasoning
• Linear Classifiers
• Linear regression
• Perceptron: turning linear regression to classification
• Logistic regression
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
• Use validation test set of class-labeled tuples instead of
training set when assessing accuracy
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves
Confusion Matrix
• True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier.
• True negatives (TN): These are the negative tuples that were
correctly labeled by the classifier.
• False positives (FP): These are the negative tuples that were
incorrectly labeled as positive (e.g., tuples of class buys_computer
= no for which the classifier predicted buys_computer = yes).
• False negatives (FN): These are the positive tuples that were
mislabeled as negative (e.g., tuples of class buys_computer = yes
for which the classifier predicted buys_computer = no).
Classifier Evaluation Metrics: Confusion Matrix

Confusion matrix:
Actual class \ Predicted class   C1                    ¬C1
C1                               True Positives (TP)   False Negatives (FN)
¬C1                              False Positives (FP)  True Negatives (TN)

Example of confusion matrix:
Actual class \ Predicted class  buy_computer = yes  buy_computer = no  Total
buy_computer = yes              6954                46                 7000
buy_computer = no               412                 2588               3000
Total                           7366                2634               10000

• Given m classes, an entry CMi,j in a confusion matrix indicates the number
of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity

A\P   C    ¬C
C     TP   FN   P
¬C    FP   TN   N
      P’   N’   All

• Classifier accuracy, or recognition rate: percentage of test set tuples
that are correctly classified
Accuracy = (TP + TN)/(P + N)
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/(P + N)

 Class imbalance problem:
 One class may be rare, e.g. fraud, or HIV-positive
 Significant majority of the negative class and minority of the positive class
 Sensitivity: true positive recognition rate
Sensitivity = TP/P
 Specificity: true negative recognition rate
Specificity = TN/N
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled
as positive are actually positive
Precision = TP/(TP + FP)
• Recall: completeness – what % of positive tuples did the
classifier label as positive?
Recall = TP/(TP + FN)
• Perfect score is 1.0
• F measure (F1 or F-score): harmonic mean of precision and
recall
F1 = 2 × Precision × Recall / (Precision + Recall)
• Fß: weighted measure of precision and recall
• assigns ß times as much weight to recall as to precision
Fß = (1 + ß²) × Precision × Recall / (ß² × Precision + Recall)
More about Precision and Recall
• Precision (exactness): Suppose you have done duty for 10 hours, e.g., digging
a well. How many hours out of the total duty have you spent doing useful work?
Ideally, you are supposed to spend 10 hours doing useful work. Total duty time
(TP+FP) = time spent on useful work (TP) + time spent on other work/mistakes
(FP). Precision = TP/(TP+FP)
• Recall (completeness): Suppose the objective is to dig the well to 10 meters
depth. You have spent some total time. Recall is about how much of the well has
been dug: the total target (TP+FN) = the part actually dug (TP) plus the part
missed (FN). Recall = TP/(TP+FN).
• For a classifier, it is possible to have 100% precision and low recall. For example,
you have spent 10 hours (the entire time) digging well. But, only a small portion
of the well is being dug (recall is low).
• For a classifier, it is possible to have 100% recall and low precision. For example,
you have dug the 10 meters well as required but have spent several days making
mistakes (several wrong results).
• The objective is to get good precision and recall, which is a good F1 metric.
Classifier Evaluation Metrics: Example

Actual class \ Predicted class  cancer = yes  cancer = no  Total
cancer = yes                    90            210          300
cancer = no                     140           9560         9700
Total                           230           9770         10000

• Calculate the measures just introduced:
• Sensitivity = TP/P = 90/300 = 30%
• Specificity = TN/N = 9560/9700 = 98.56%
• Accuracy = (TP + TN)/All = (90 + 9560)/10000 = 96.50%
• Error rate = (FP + FN)/All = (140 + 210)/10000 = 3.50%
• Precision = TP/(TP + FP) = 90/(90 + 140) = 90/230 = 39.13%
• Recall = TP/(TP + FN) = 90/(90 + 210) = 90/300 = 30.00%
• F1 = 2 P × R /(P + R) = 2 × 39.13% × 30.00%/(39.13% + 30%) = 33.96%
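A quick check of the cancer example from its confusion-matrix counts:

TP, FN, FP, TN = 90, 210, 140, 9560
P, N = TP + FN, FP + TN

sensitivity = TP / P                      # recall
specificity = TN / N
accuracy    = (TP + TN) / (P + N)
precision   = TP / (TP + FP)
f1          = 2 * precision * sensitivity / (precision + sensitivity)
print(f"{sensitivity:.2%} {specificity:.2%} {accuracy:.2%} {precision:.2%} {f1:.2%}")
# 30.00% 98.56% 96.50% 39.13% 33.96%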
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
• At i-th iteration, use Di as test set and others as training set
• Leave-one-out: k folds where k = # of tuples, for small sized
data
• *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial
data (popular method)
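A small sketch of k-fold cross-validation as described above (random partition into k roughly equal folds); model_factory stands for any classifier constructor with fit/predict methods and is an assumption of this sketch.

import random

def cross_validate(model_factory, X, y, k=10, seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)               # random partition of the data
    folds = [idx[i::k] for i in range(k)]          # k roughly equal, mutually exclusive folds
    accs = []
    for fold in folds:                             # at the i-th iteration, fold i is the test set
        train = [i for i in idx if i not in set(fold)]
        model = model_factory()
        model.fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in fold])
        accs.append(sum(p == y[i] for p, i in zip(preds, fold)) / len(fold))
    return sum(accs) / k                           # average accuracy over the k folds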
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set
• Several bootstrap methods, and a common one is .632 boostrap
• A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368)
• Repeat the sampling procedure k times; the overall accuracy of the model is
Acc(M) = \frac{1}{k}\sum_{i=1}^{k}\left(0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set}\right)
Estimating Confidence Intervals:
Classifier Models M1 vs. M2

• Suppose we have 2 classifiers, M1 and M2; which one is better?

• Use 10-fold cross-validation to obtain their mean error rates, err(M1) and err(M2)

• These mean error rates are just estimates of error on the true
population of future data cases

• What if the difference between the error rates is just attributed
to chance?
• Use a test of statistical significance
• Obtain confidence limits for our error estimates
Estimating Confidence Intervals:
Null Hypothesis
• Perform 10-fold cross-validation
• Assume samples follow a t distribution with k–1 degrees of
freedom
• Use t-test (or Student’s t-test)
• Null Hypothesis: M1 & M2 are the same
• If we can reject null hypothesis, then
• we conclude that the difference between M1 & M2 is
statistically significant
• Choose the model with the lower error rate
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic)
curves: for visual comparison of classification
models
• Shows the trade-off between the true positive
rate (TPR) and the false positive rate (FPR)
• TPR is the proportion of positive (or “yes”)
tuples that are correctly labeled by the model;
• FPR is the proportion of negative (or “no”)
tuples that are mislabeled as positive. Recall
that TP, FP, P, and N are the numbers of true
positive, false positive, positive, and negative
tuples, respectively.
• Rank the test tuples in decreasing order: the
one that is most likely to belong to the
positive class appears at the top of the list
• The closer the curve is to the diagonal line (i.e., the closer
the area under the curve is to 0.5), the less accurate the model
 The vertical axis represents TPR
 The horizontal axis represents FPR
 The plot also shows a diagonal line
 A model with perfect accuracy will have an area of 1.0
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
112
Presentation Outline
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
Ensemble Methods: Increasing the
Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
114
Ensemble Methods: Increasing the Accuracy
• Popular ensemble methods
• Bagging: trains each model on a bootstrap sample of the training set; the
models can be learned in parallel
• Boosting: trains each new model to emphasize the training instances that
previous models misclassified; the models are learned sequentially (in order)
(Figure: bagging — models trained in parallel — vs. boosting — models trained sequentially)
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, and more robust
• Proven to give improved accuracy in prediction (a bagging sketch follows after this slide)
117
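A sketch of bagging with decision trees as the base classifiers; note that the estimator keyword is called base_estimator in scikit-learn versions before 1.2:

  # Each base tree is trained on a bootstrap sample of the same size as D;
  # the bagged classifier predicts by majority vote over the base trees.
  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import BaggingClassifier
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.model_selection import cross_val_score

  X, y = load_breast_cancer(return_X_y=True)

  bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                          n_estimators=25, bootstrap=True, random_state=0)
  print("bagged trees: %.3f" % cross_val_score(bag, X, y, cv=10).mean())
  print("single tree:  %.3f" %
        cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean())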
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
119
Adaboost (Freund and Schapire, 1997)
• 1. Assign initial weights {w_n^(1)} to each training tuple
• 2. Train the base classifier M1 on the weighted dataset
• 3. Update the weights based on the current model, giving {w_n^(2)}, …, {w_n^(k)},
and train M2, …, Mk in turn
• 4. After the k base classifiers are trained, they are combined to give the final classifier:
  M*(x) = sign( Σ_{i=1}^{k} α_i M_i(x) )
• Two ‘weighting’ strategies:
  1. Assign weights to each training example
  2. Sample the dataset based on the weight distribution
Adaboost (Freund and Schapire, 1997)
Adaptive boosting
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di
of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified,
0 otherwise). Classifier Mi's error rate is the sum of the weights of the
misclassified tuples:
  error(Mi) = Σ_j wj × err(Xj)
• The weight of classifier Mi's vote is
  log( (1 − error(Mi)) / error(Mi) )
(an AdaBoost sketch follows after this slide)
121
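The same weighting scheme is available off the shelf; a sketch with scikit-learn's AdaBoostClassifier and decision stumps as weak learners (the estimator keyword is base_estimator in scikit-learn versions before 1.2):

  # k = 50 boosting rounds; after each round the misclassified tuples receive
  # larger weights, and each classifier's vote is weighted by its accuracy.
  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import AdaBoostClassifier
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.model_selection import cross_val_score

  X, y = load_breast_cancer(return_X_y=True)

  ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=0)
  print("AdaBoost accuracy: %.3f" % cross_val_score(ada, X, y, cv=10).mean())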
Gradient Boosting and XGBoost
• Gradient boosting is another powerful boosting technique, which can be used
for classification, regression, and ranking.
• If we use a tree (e.g., decision tree for classification, regression tree for
regression) as the base model (i.e., the weak learner), it is called gradient
tree boosting, or gradient boosted tree
• A highly scalable end-to-end gradient tree boosting system is XGBoost, which is
capable of handling billion-scale training sets.
• XGBoost introduced a number of innovations for training gradient tree boosting,
including a new tree construction algorithm designed for sparse data, feature
subsampling (as opposed to the training-tuple subsampling used in stochastic
gradient boosting), and a highly efficient cache-aware block structure.
• XGBoost has been successfully used by data scientists in many data mining
challenges, often leading to top competitive results (a usage sketch follows
after this slide).
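A usage sketch with the xgboost Python package (assumed to be installed separately, e.g. via pip install xgboost); the hyperparameter values are illustrative, not tuned:

  # Gradient tree boosting with XGBoost: n_estimators boosted trees, learning_rate
  # shrinks each tree's contribution, colsample_bytree does feature subsampling.
  from xgboost import XGBClassifier
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import accuracy_score

  X, y = load_breast_cancer(return_X_y=True)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

  model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                        colsample_bytree=0.8)
  model.fit(X_tr, y_tr)
  print("test accuracy: %.3f" % accuracy_score(y_te, model.predict(X_te)))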
Random Forest (Breiman 2001)
• Random Forest:
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to determine
the split
• During classification, each tree votes and the most popular class is
returned
• Two Methods to construct Random Forest:
• Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
• Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split,
and faster than bagging or boosting
124
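A sketch of a Forest-RI-style random forest with scikit-learn; max_features="sqrt" plays the role of F, the number of attributes considered at each split:

  # Each tree is grown on a bootstrap sample; at every node only a random subset
  # of sqrt(#attributes) features is considered for the split. Prediction is by
  # majority vote over the trees.
  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  X, y = load_breast_cancer(return_X_y=True)

  rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
  print("random forest accuracy: %.3f" % cross_val_score(rf, X, y, cv=10).mean())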
Ensemble Methods Recap
• Random forest and XGBoost are the most commonly used algorithms for tabular data
• Pros
• Good performance for tabular data, requires no data scaling
• Can scale to large datasets
• Can handle missing data to some extent
• Cons
• Can overfit to training data if not tuned properly
• Lack of interpretability (compared to decision trees)
Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: Rare positive example but numerous negative
ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and equal
error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification (a small oversampling sketch follows after this slide):
• Oversampling: re-sampling of data from positive class so that the
training set contains equal number of positive and negative samples.
• Under-sampling: randomly eliminate tuples from negative class so
that the training set contains equal number of positive and negative
samples.
• Threshold-moving: moves the decision threshold, t, so that the rare
class tuples are easier to classify, and hence, less chance of costly
false negative errors
• Ensemble techniques: Ensemble multiple classifiers introduced above
• Still difficult for class imbalance problem on multiclass tasks
126
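A sketch of simple random oversampling of the rare class, assuming NumPy arrays X, y with the rare class labeled 1; many scikit-learn classifiers also accept class_weight="balanced" as a lighter-weight alternative:

  # Duplicate positive tuples (sampling with replacement) until the training set
  # contains equal numbers of positive and negative samples.
  import numpy as np
  from sklearn.utils import resample

  def oversample_positive(X, y, pos_label=1, seed=0):
      pos, neg = X[y == pos_label], X[y != pos_label]
      pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=seed)
      X_bal = np.vstack([pos_up, neg])
      y_bal = np.concatenate([np.full(len(neg), pos_label), y[y != pos_label]])
      return X_bal, y_bal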
Presentation Outline
• Background
• Basic concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Accuracy
• Summary
Summary (I)
• Classification is a form of data analysis that extracts models describing
important data classes.
• Effective and scalable methods have been developed for decision tree
induction, Naive Bayesian classification, rule-based classification, and
many other classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision,
recall, F measure, and Fβ measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation.
Bagging and boosting can be used to increase overall accuracy by learning
and combining a series of individual models.
128
Summary (II)
• Significance tests and ROC curves are useful for model selection.
• There have been numerous comparisons of the different
classification methods; the matter remains a research topic
• No single method has been found to be superior over all others
for all data sets
• Issues such as accuracy, training time, robustness, scalability, and
interpretability must be considered and can involve trade-offs,
further complicating the quest for an overall superior method
129