
Classification

Naïve Bayesian classification, decision trees, decision rules, and instance-based methods
Supervised vs. Unsupervised Learning
Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
• New data is classified based on the training set.
Unsupervised learning (clustering)
• The class labels of the training data are unknown.
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Prediction Problems: Classification vs. Numeric Prediction
Classification
• Predicts categorical class labels (discrete or nominal)
• Classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute and uses it in classifying new data
Numeric Prediction
• Models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
• Credit/loan approval
• Medical diagnosis: whether a tumor is cancerous or benign
• Fraud detection: whether a transaction is fraudulent
• Web page categorization: which category a page belongs to
Classification
Data classification is a two-step process: a learning step and a classification step.
i. Learning step (training phase, or model construction): a classification model is constructed that
describes a set of predetermined classes.
• The model is "learned from" a training set made up of database tuples and their
associated class labels.
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute.
• The set of tuples used for model construction is the training set.
• The model is represented as classification rules, decision trees, or
mathematical formulae.
• The class label attribute is categorical (or nominal) in that each value
serves as a category or class.
Classification
ii. Classification step (Model usage ): where the model is used to predict
class labels for given data.
• Classifying future or unknown objects
• If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test)
set
Classification
• Estimate the accuracy of the model:
• The known label of each test sample is compared with the classification result
from the model.
• The accuracy rate is the percentage of test set samples that are correctly
classified by the model.
• The test set is independent of the training set (otherwise overfitting occurs).
Decision Tree
• A decision tree is a flowchart-like tree structure:
• Each internal node (nonleaf node) denotes a test on an attribute
• Each branch represents an outcome of the test.
• Each leaf node (or terminal node) holds a class label.
• Internal nodes are denoted
by rectangles.
• Leaf nodes are denoted by
ovals.
• Some decision tree
algorithms produce only
binary trees.
• Others can produce
nonbinary trees.
Decision Tree
How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is unknown, the
attribute values of the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class
prediction for that tuple.
• Decision trees can easily be
converted to classification
rules.
• Decision tree classifiers have
good accuracy. However,
successful use may depend
on the data at hand.
Decision Tree
• Popular traditional Decision Tree Algorithms:
✓ID3 (Iterative Dichotomiser)
✓C4.5 (a successor of ID3)
✓CART (Classification And Regression Trees)
• ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in
which decision trees are constructed in a top-down recursive divide-
and-conquer manner.
Decision Tree
The decision tree algorithm takes three parameters:
i. D: a data partition; initially, it is the complete set of training tuples and their
associated class labels.
ii. attribute_list: the set of candidate attributes.
iii. Attribute selection method: a heuristic procedure for selecting
the attribute that "best" discriminates the given tuples according to class.
➢ The splitting criterion consists of a splitting attribute and, possibly, either a split-
point or a splitting subset.
➢ Attribute selection measures include information gain and the Gini index.
➢ Gini index: the resulting tree is strictly binary.
➢ Information gain: allows multiway splits (i.e., two or more branches may be grown
from a node).
Decision Tree
• If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class (Stopping criteria).
• Otherwise, the algorithm calls Attribute selection method to determine
the splitting criterion.
Stopping criteria: Recursive partitioning stops only when any one of the
following terminating conditions is true (a minimal sketch of the recursion follows this list):
✓All the tuples in partition D (represented at node N) belong to the
same class.
✓There are no remaining attributes on which the tuples may be further
partitioned.
✓There are no tuples for a given branch, that is, a partition Dj is empty.
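• To make the control flow concrete, here is a minimal Python sketch of this recursive, divide-and-conquer scheme. It is illustrative only: the names (build_tree, majority_class, select_attribute, known_values) are my own, the tree is represented as nested dictionaries, and the attribute selection measure is passed in as a function.

    from collections import Counter

    def majority_class(D):
        # D is a list of (attribute_dict, class_label) pairs
        return Counter(label for _, label in D).most_common(1)[0][0]

    def build_tree(D, attribute_list, known_values, select_attribute):
        labels = [label for _, label in D]
        # Stopping criterion 1: all tuples in D belong to the same class
        if len(set(labels)) == 1:
            return labels[0]
        # Stopping criterion 2: no remaining attributes -> leaf with majority class
        if not attribute_list:
            return majority_class(D)
        # Attribute selection method: e.g., information gain or Gini index
        A = select_attribute(D, attribute_list)
        node = {"attribute": A, "branches": {}}
        for value in known_values[A]:  # one branch per known value of A
            Dj = [(x, y) for x, y in D if x[A] == value]
            if not Dj:
                # Stopping criterion 3: empty partition -> leaf with majority class of D
                node["branches"][value] = majority_class(D)
            else:
                remaining = [a for a in attribute_list if a != A]
                node["branches"][value] = build_tree(Dj, remaining, known_values, select_attribute)
        return node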
Decision Tree
• Let A be the splitting attribute. A has v distinct values, {a1, a2,..., av},
based on the training data.
• A is discrete-valued: In this case, the outcomes of the test at node N
correspond directly to the known values of A.
• A branch is created for each known value, aj, of A and labeled with that
value.
• A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A ≤ split point and A > split
point, respectively.
Decision Tree: Attribute Selection Measures
Three popular attribute selection measures:
a) Information gain (ID3)
b) Gain ratio (C4.5)
c) Gini index (CART)
Decision Tree: Attribute Selection Measures- Information gain
Information Gain:
• ID3 uses information gain as its attribute selection measure.
• The attribute with the highest information gain is chosen as the
splitting attribute for node N.
• This attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness, or "impurity," in these partitions.
Decision Tree: Attribute Selection Measures- Information gain
Entropy (Information Theory)
• A measure of uncertainty associated with a random variable.
• High entropy -> higher uncertainty
• Lower entropy -> lower uncertainty
• The expected information (or entropy) needed to classify a tuple in D
is:
Decision Tree: Attribute Selection Measures- Information gain
• The expected information (or entropy) needed to classify a tuple in D
is:

    Info(D) = − Σ (i=1..m) pi log2(pi)

• where pi is the nonzero probability that an arbitrary tuple in D belongs
to class Ci and is estimated by |Ci,D|/|D|.
• A log function to the base 2 is used, because the information is encoded
in bits.
• Info(D) is just the average amount of information needed to identify the
class label of a tuple in D.
Decision Tree: Attribute Selection Measures- Information gain
• Information needed (after using A to split D into v partitions) to classify
D:

    InfoA(D) = Σ (j=1..v) (|Dj| / |D|) × Info(Dj)

• The term |Dj|/|D| acts as the weight of the jth partition.
• InfoA(D) is the expected information required to classify a tuple from D
based on the partitioning by A.
• The smaller the expected information (still) required, the greater the
purity of the partitions.
Decision Tree: Attribute Selection Measures- Information gain
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes)
and the new requirement (i.e., obtained after partitioning on A).
• Information gained by branching on attribute A:

    Gain(A) = Info(D) − InfoA(D)
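• Below is a small, self-contained Python sketch of these formulas (the helper names info, info_after_split, and gain are mine); the example at the end reproduces the 9-yes / 5-no training data used in the worked example that follows.

    import math
    from collections import Counter

    def info(labels):
        # Info(D) = -sum(p_i * log2(p_i)) over the classes present in D
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_after_split(partitions):
        # Info_A(D): weighted average of Info(D_j) over the partitions induced by A
        n = sum(len(p) for p in partitions)
        return sum((len(p) / n) * info(p) for p in partitions)

    def gain(labels, partitions):
        # Gain(A) = Info(D) - Info_A(D)
        return info(labels) - info_after_split(partitions)

    # Example: 9 "yes" and 5 "no" tuples, split by age as in the slides
    D = ["yes"] * 9 + ["no"] * 5
    by_age = [["yes"] * 2 + ["no"] * 3,   # youth
              ["yes"] * 4,                # middle_aged
              ["yes"] * 3 + ["no"] * 2]   # senior
    print(round(info(D), 3))          # -> 0.94 (i.e., 0.940 bits)
    print(round(gain(D, by_age), 3))  # -> 0.247 (0.246 in the slides, which round intermediate values)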
Decision Tree: Attribute Selection Measures- Information gain
• In this example, class label attribute, buys_computer, has two distinct
values (namely, yes, no).
• There are nine tuples of class yes and five tuples of class no.
• A (root) node N is created for the tuples in D.
• To find the splitting criterion for these tuples, we first compute the
expected information needed to classify a tuple in D:

    Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
Decision Tree: Attribute Selection Measures- Information gain
• Next, we need to compute the expected information requirement for
each attribute.
• Let’s start with the attribute age. We need to look at the distribution of
yes and no tuples for each category of age.
Decision Tree: Attribute Selection Measures- Information gain
• For the age category “youth,” there are two yes tuples and three no
tuples.
• For the category “middle aged,” there are four yes tuples and zero no
tuples.
• For the category “senior,” there
are three yes tuples and two no
tuples.
Decision Tree: Attribute Selection Measures- Information gain

    Info_age(D) = (5/14) × Info(2 yes, 3 no) + (4/14) × Info(4 yes, 0 no) + (5/14) × Info(3 yes, 2 no)
                = 0.694 bits
    Gain(age)   = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits

• Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) =
0.151 bits, and Gain(credit_rating) = 0.048 bits.
• Because age has the highest
information gain among the
attributes, it is selected as the
splitting attribute.
Decision Tree: Attribute Selection Measures- Information gain

• Note that the tuples falling into the partition for age = middle_aged
all belong to the same class.
• Therefore, a leaf is created at the end of this branch and
labeled "yes."
Decision Tree: Attribute Selection Measures
• Computing Information-Gain for Continuous-Valued Attributes:
• Let attribute A be a continuous-valued attribute.
• We must determine the best split point for A (a short sketch in Python follows this list):
➢ Sort the values of A in increasing order.
➢ Typically, the midpoint between each pair of adjacent values is
considered as a possible split point.
➢ (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
➢ The point with the minimum expected information requirement for
A is selected as the split-point for A
• Split:
➢ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set
of tuples in D satisfying A > split-point
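• A possible Python sketch of this midpoint search; it repeats the info() helper from the earlier information-gain sketch so that it runs on its own, and the attribute values and labels in the usage line are made up for illustration.

    import math
    from collections import Counter

    def info(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_split_point(values, labels):
        # Return (split_point, expected_info) minimizing Info_A(D) for a
        # continuous attribute A, trying midpoints of adjacent sorted values.
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = (None, float("inf"))
        for i in range(n - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue  # identical adjacent values give no new midpoint
            mid = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [lab for v, lab in pairs if v <= mid]
            right = [lab for v, lab in pairs if v > mid]
            expected = (len(left) / n) * info(left) + (len(right) / n) * info(right)
            if expected < best[1]:
                best = (mid, expected)
        return best

    # Hypothetical usage: ages with a binary class label
    print(best_split_point([23, 25, 30, 35, 40, 52], ["no", "no", "yes", "yes", "yes", "no"]))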
Decision Tree: Attribute Selection Measures- Gain Ratio
• Information gain measure is biased toward tests with many outcomes.
• Information gain prefers to select attributes having a large number of
values.
• For example, consider an attribute that acts as a unique identifier such
as product ID.
• A split on product ID would result in a large number of partitions (as
many as there are values), each one containing just one tuple.
• Because each partition is pure, the information required to classify data set D
based on this partitioning would be Info_product_ID(D) = 0.
• Therefore, the information gained by partitioning on this attribute is
maximal, even though such a partitioning is useless for classification.
Decision Tree: Attribute Selection Measures- Gain Ratio
• C4.5, a successor of ID3, uses an extension to information gain known
as gain ratio, which attempts to overcome this bias.
• It normalizes the information gain using a "split information" value:

    SplitInfoA(D) = − Σ (j=1..v) (|Dj| / |D|) × log2(|Dj| / |D|)

• For each outcome, it considers the number of tuples having that
outcome with respect to the total number of tuples in D.
• The gain ratio is then defined as:

    GainRatio(A) = Gain(A) / SplitInfoA(D)

• The attribute with the maximum gain ratio is selected as the splitting
attribute.
Decision Tree: Attribute Selection Measures- Gain Ratio
• Computation of the gain ratio for the attribute income:

    SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557

• We have Gain(income) = 0.029.
• Therefore, GainRatio(income) = 0.029 / 1.557 = 0.019.
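• The same numbers can be reproduced with a few lines of Python; split_info and gain_ratio are hypothetical helper names, and the gain value 0.029 is taken from the slide above rather than recomputed.

    import math

    def split_info(partition_sizes):
        # SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))
        n = sum(partition_sizes)
        return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)

    def gain_ratio(gain, partition_sizes):
        return gain / split_info(partition_sizes)

    # The income attribute splits the 14 tuples into partitions of size 4, 6, and 4
    print(round(split_info([4, 6, 4]), 3))         # -> 1.557
    print(round(gain_ratio(0.029, [4, 6, 4]), 3))  # -> 0.019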
Decision Tree: Attribute Selection Measures- Gini Index
• Gini index measures the impurity of a data partition or set of training
tuples:

    Gini(D) = 1 − Σ (i=1..m) pi^2

• where pi is the probability that a tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|. The sum is computed over m classes.
• Gini index considers a binary split for each attribute.
• To determine the best binary split on A, we examine all the possible
subsets that can be formed using known values of A.
• For example, if income has three possible values, namely {low, medium,
high}, then the possible subsets are {low, medium, high}, {low,
medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
Decision Tree: Attribute Selection Measures- Gini Index
• We exclude the full set, {low, medium, high}, and the empty set {}
from consideration since, conceptually, they do not represent a split.
• Therefore, there are 2^v − 2 possible ways to form two partitions of the
data D, based on a binary split on A.
• When considering a binary split, we compute a weighted sum of the
impurity of each resulting partition.
• For example, if a binary split on A partitions D into D1 and D2, the Gini
index of D given that partitioning is:

    GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

• For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset.
Decision Tree: Attribute Selection Measures- Gini Index
• The reduction in impurity that would be incurred by a binary split on a
discrete- or continuous-valued attribute A is:

    ΔGini(A) = Gini(D) − GiniA(D)

• The attribute that maximizes the reduction in impurity (or, equivalently,
has the minimum Gini index) is selected as the splitting attribute.
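• A minimal Python sketch of these Gini computations (the helper names are my own); the final line reproduces Gini(D) for the 9-yes / 5-no training data used in the example that follows.

    from collections import Counter

    def gini(labels):
        # Gini(D) = 1 - sum(p_i^2)
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(D1, D2):
        # Gini_A(D) for a binary split of D into D1 and D2
        n = len(D1) + len(D2)
        return (len(D1) / n) * gini(D1) + (len(D2) / n) * gini(D2)

    def gini_reduction(D, D1, D2):
        # delta Gini(A) = Gini(D) - Gini_A(D)
        return gini(D) - gini_split(D1, D2)

    # Example mirroring the slides: 9 "yes" / 5 "no" tuples overall
    D = ["yes"] * 9 + ["no"] * 5
    print(round(gini(D), 3))  # -> 0.459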
Decision Tree: Attribute Selection Measures- Gini Index
• Using the Gini index to compute the impurity of D:

    Gini(D) = 1 − (9/14)^2 − (5/14)^2 = 0.459

• To find the splitting criterion for the tuples in D, we need to compute
the Gini index for each attribute.
• Consider attribute income and each
of the possible splitting subsets.
• The subset {low, medium} results in 10 tuples
in partition D1 satisfying the condition
"income ∈ {low, medium}."
• The remaining four tuples of D would
be assigned to partition D2.
Decision Tree: Attribute Selection Measures- Gini Index
• The Gini index value computed based on the partitioning {low, medium} is:

    Gini_income ∈ {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443

• The Gini index for the subsets {low, high}
and {medium} is 0.458.
• For {medium, high} and {low}, it is 0.450.
• The best binary split for attribute
income is on {low, medium} (or {high})
because it minimizes the Gini index.
Decision Tree: Attribute Selection Measures- Gini Index
• Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best
split for age with a Gini index of 0.375.
• The attributes student and credit_rating are both binary, with Gini index
values of 0.367 and 0.429, respectively.

•…
Decision Tree
The three measures, in general, return good results but
• Information gain:
➢biased towards multivalued attributes
• Gain ratio:
➢tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
➢biased towards multivalued attributes
➢has difficulty when the number of classes is large
➢tends to favor tests that result in equal-sized partitions and purity in
both partitions
Bayes Classification
• Bayesian classifiers are statistical classifiers.
• Bayesian classifier can predict class membership probabilities such as
the probability that a given tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem.
• We study a simple Bayesian classifier known as the naïve Bayesian classifier.
• Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.

• Bayes' theorem states:

    P(H|X) = P(X|H) P(H) / P(X)

• Bayes' theorem is useful in that it provides a way of calculating the
posterior probability, P(H|X), from P(H), P(X|H), and P(X).
Bayes Classification
• Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.

• For example, X is a 35-year-old customer with an income of $40,000.


Suppose that H is the hypothesis that our customer will buy a computer.
• Then P(H|X) reflects the probability that customer X will buy a
computer given that we know the customer’s age and income.
Naïve Bayesian Classification
• Naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
• Suppose that there are m classes, C1, C2,…, Cm. Given a tuple, X, the
classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X. That is, the naïve Bayesian
classifier predicts that tuple X belongs to the class Ci if and only if

    P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i

• Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is
called the maximum posteriori hypothesis. By Bayes' theorem:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naïve Bayesian Classification
• As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be
maximized. Note that the class prior probabilities may be estimated by
P(Ci)=|Ci,D|/|D|, where |Ci,D| is the number of training tuples of class Ci
in D.
• A simplified assumption: attributes are conditionally independent (i.e.,
no dependence relation between attributes):

    P(X|Ci) = Π (k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

• This greatly reduces the computation cost: only the class distributions
need to be counted.
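• The counting this implies is straightforward to sketch in Python. The snippet below is a bare-bones categorical naïve Bayes with no smoothing (a zero count therefore zeroes the whole product); the function names are illustrative, not from any particular library.

    from collections import Counter, defaultdict

    def train_naive_bayes(tuples, labels):
        # tuples: list of dicts mapping attribute -> value; labels: parallel class labels
        n = len(labels)
        priors = {c: cnt / n for c, cnt in Counter(labels).items()}
        class_totals = Counter(labels)
        # counts[class][attribute][value] = number of training tuples with that value
        counts = defaultdict(lambda: defaultdict(Counter))
        for x, c in zip(tuples, labels):
            for attr, val in x.items():
                counts[c][attr][val] += 1
        return priors, counts, class_totals

    def predict(x, priors, counts, class_totals):
        best_class, best_score = None, -1.0
        for c, prior in priors.items():
            score = prior  # P(C_i)
            for attr, val in x.items():
                # multiply by P(x_k | C_i), estimated from the training counts
                score *= counts[c][attr][val] / class_totals[c]
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    # Tiny hypothetical usage (two training tuples only, just to show the call pattern)
    X = [{"age": "youth", "student": "yes"}, {"age": "senior", "student": "no"}]
    y = ["yes", "no"]
    model = train_naive_bayes(X, y)
    print(predict({"age": "youth", "student": "yes"}, *model))  # -> "yes"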
Naïve Bayesian Classification
• Predicting a class label using naïve Bayesian classification on the buys_computer training data.
• For example, we wish to classify the tuple (from the same training data as before):

    X = (age = youth, income = medium, student = yes, credit_rating = fair)

• We need to maximize P(X|Ci)P(Ci) for i = 1, 2, where C1 is buys_computer = yes and C2 is buys_computer = no.
• P(Ci), the prior probability of each class, can be computed based on the training
tuples:

    P(buys_computer = yes) = 9/14 = 0.643
    P(buys_computer = no)  = 5/14 = 0.357

Naïve Bayesian Classification
• To compute P(X|Ci), we compute the following conditional probabilities:

    P(age = youth | yes) = 2/9 = 0.222            P(age = youth | no) = 3/5 = 0.600
    P(income = medium | yes) = 4/9 = 0.444        P(income = medium | no) = 2/5 = 0.400
    P(student = yes | yes) = 6/9 = 0.667          P(student = yes | no) = 1/5 = 0.200
    P(credit_rating = fair | yes) = 6/9 = 0.667   P(credit_rating = fair | no) = 2/5 = 0.400

Naïve Bayesian Classification
• Multiplying these probabilities gives:

    P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
    P(X | buys_computer = no)  = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

    P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
    P(X | no) P(no)   = 0.019 × 0.357 = 0.007

• Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
Rule-Based Classification
• Rule-based classifiers, where the learned model is represented as a set
of IF-THEN rules:
IF condition THEN conclusion
Example: IF age = youth AND student = yes THEN buys_computer = yes
• Rule-based classifiers can be generated either from a decision tree or
directly from the training data using a sequential covering algorithm.
Rule-Based Classification: Using Decision Tree
• Rules are easier to understand than large trees.
• One rule is created for each path from the root to a leaf.
• Each attribute-value pair along a path forms a conjunction: the leaf
holds the class prediction.
• Rules are mutually exclusive and exhaustive.
• Example: Rule extraction from our buys_computer decision tree:

    IF age = youth AND student = no THEN buys_computer = no
    IF age = youth AND student = yes THEN buys_computer = yes
    IF age = middle_aged THEN buys_computer = yes
    IF age = senior AND credit_rating = excellent THEN buys_computer = no
    IF age = senior AND credit_rating = fair THEN buys_computer = yes
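• Read procedurally, these rules are just a chain of mutually exclusive conditions; a small illustrative Python rendering (attribute and value names follow the slides) is:

    def classify_buys_computer(age, student=None, credit_rating=None):
        # One IF-THEN rule per root-to-leaf path of the decision tree
        if age == "youth" and student == "no":
            return "no"
        if age == "youth" and student == "yes":
            return "yes"
        if age == "middle_aged":
            return "yes"
        if age == "senior" and credit_rating == "excellent":
            return "no"
        if age == "senior" and credit_rating == "fair":
            return "yes"
        return None  # no rule fires (should not happen for exhaustive rules)

    print(classify_buys_computer("youth", student="yes"))  # -> "yes"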
Rule-Based Classification: Using Sequential Covering Algorithm
• Sequential covering algorithm: Extracts rules directly from training data.
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER.
• Rules are learned sequentially; each rule for a given class Ci covers many
tuples of Ci and none (or few) of the tuples of other classes.
• Steps:
▪ Rules are learned one at a time
▪ Each time a rule is learned, the tuples covered by the rules are
removed
▪ Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold
• Compared with decision-tree induction, which learns a set of rules simultaneously, sequential covering learns rules one at a time (a simplified sketch follows).
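• A greatly simplified sketch of the sequential-covering loop is shown below; learn_one_rule is only a placeholder for the rule-growing step (which FOIL, CN2, or RIPPER implement with their own search and pruning strategies), so this is an outline of the control flow under that assumption rather than any specific algorithm.

    def sequential_covering(D, classes, learn_one_rule, min_quality=0.0):
        # D: list of (tuple, label) pairs. learn_one_rule(D, c) must return
        # (rule, quality), where rule is a predicate over a tuple, or (None, 0.0).
        rule_set = []
        for c in classes:                      # learn rules for one class at a time
            remaining = list(D)
            while any(label == c for _, label in remaining):
                rule, quality = learn_one_rule(remaining, c)
                if rule is None or quality < min_quality:
                    break                      # rule quality below the user-specified threshold
                rule_set.append((rule, c))
                covered = [(x, y) for x, y in remaining if rule(x)]
                if not covered:
                    break                      # safeguard: the rule covers nothing new
                # remove the tuples covered by the new rule and repeat
                remaining = [(x, y) for x, y in remaining if not rule(x)]
        return rule_set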
Instance-based Classification
• Instance-based learning (sometimes called memory-based learning)
compares new problem instances with instances seen in training, which
have been stored in memory.
• Because computation is postponed until a new instance is observed,
these algorithms are sometimes referred to as "lazy".
• Store training examples and delay the processing until a new instance
must be classified.
Instance-based Classification
Lazy vs Eager learning
Lazy learning:
• Simply stores the training data (or performs only minor processing) and waits until it
is given a test tuple.
• Only when it sees the test tuple does it perform generalization, classifying
the tuple based on its similarity to the stored training tuples.
Eager learning:
• Given a set of training tuples, constructs a classification model before
receiving new (e.g., test) data to classify.
• Eager learning models are ready (and eager) to classify previously unseen
tuples.
• Lazy: less time in training but more time in predicting.
Instance-based Classification: k-Nearest-Neighbor
• Nearest-neighbor classifiers compare a given test tuple with training
tuples that are similar to it.
• k-nearest-neighbor classifier searches the pattern space for the k
training tuples that are closest to the unknown (or test) tuple.
• These k training tuples are the k nearest neighbors of the unknown
tuple.
• Closeness is defined in terms of a distance metric, such as Euclidean
distance. The Euclidean distance between two points or tuples,
X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is:

    dist(X1, X2) = sqrt( Σ (i=1..n) (x1i − x2i)^2 )
Instance-based Classification: k-Nearest-Neighbor
• For k-nearest-neighbor classification, the unknown tuple is assigned the
most common class among its k-nearest neighbors.
• When k = 1, the unknown tuple is assigned the class of the training
tuple that is closest to it in pattern space.
• Nearest-neighbor classifiers can also be used for numeric prediction,
that is, to return a real-valued prediction for a given unknown tuple.
• In this case, the classifier returns the average value of the real-valued
labels associated with the k-nearest neighbors of the unknown tuple.
Instance-based Classification: k-Nearest-Neighbor
Applicant   Cibil Score   Income    Loan Approved   Euclidean Distance to X
A           700           50000     Y               25000.20
B           800           40000     Y               15001.33
C           750           30000     Y               5002.25
D           400           10000     N               15001.33
E           850           8000      Y               17001.84
F           600           20000     N               5000.00
G           700           35000     Y               10000.50
H           750           100000    Y               75000.15
I           500           150000    N               125000.04
J           650           18000     N               7000.18

X           600           25000     ?               (query tuple, k = 3)
Instance-based Classification: k-Nearest-Neighbor
Applicant   Cibil Score   Income    Loan Approved   Euclidean Distance to X
A           700           50000     Y               25000.20
B           800           40000     Y               15001.33
C           750           30000     Y               5002.25
D           400           10000     N               15001.33
E           850           8000      Y               17001.84
F           600           20000     N               5000.00
G           700           35000     Y               10000.50
H           750           100000    Y               75000.15
I           500           150000    N               125000.04
J           650           18000     N               7000.18

X           600           25000     N               (predicted, k = 3)

• With k = 3, the nearest neighbors of X are F (5000.00), C (5002.25), and J (7000.18),
whose classes are N, Y, and N; the majority class is N, so X is classified as Loan Approved = N.
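• A compact Python version of this vote, run on the loan table above; it uses Euclidean distance on the raw Cibil-score and income values exactly as in the table (in practice the attributes would normally be normalized first so that income does not dominate the distance).

    import math
    from collections import Counter

    train = [  # (cibil_score, income, loan_approved)
        (700, 50000, "Y"), (800, 40000, "Y"), (750, 30000, "Y"),
        (400, 10000, "N"), (850, 8000, "Y"), (600, 20000, "N"),
        (700, 35000, "Y"), (750, 100000, "Y"), (500, 150000, "N"),
        (650, 18000, "N"),
    ]

    def knn_predict(query, train, k=3):
        # Sort training tuples by Euclidean distance to the query
        neighbors = sorted(train, key=lambda t: math.dist(query, t[:2]))[:k]
        # Majority class among the k nearest neighbors
        return Counter(t[2] for t in neighbors).most_common(1)[0][0]

    print(knn_predict((600, 25000), train))  # -> "N" (neighbors F, C, J give N, Y, N)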
Metrics for Evaluating Classifier Performance
• The evaluation metrics assess how good or how accurate your classifier
is at predicting the class label of tuples.
• Use a validation (test) set of class-labeled tuples, rather than the training set,
when assessing accuracy.
Metrics for Evaluating Classifier Performance
• There are four additional terms we need to know that are the “building
blocks” used in computing many evaluation measures.
• These terms are summarized in the confusion matrix.
• True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier. Let TP be the number of true
positives.
• True negatives (TN): These are the negative tuples that were correctly
labeled by the classifier. Let TN be the number of true negatives.
Metrics for Evaluating Classifier Performance
• There are four additional terms we need to know that are the “building
blocks” used in computing many evaluation measures.
• False positives (FP): These are the negative tuples that were incorrectly
labeled as positive.
• e.g., tuples of class buys_computer = no for which the classifier
predicted buys_computer = yes. Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were mislabeled
as negative.
• e.g., tuples of class buys_computer =
yes for which the classifier predicted
buys_computer = no. Let FN be the
number of false negatives.
Metrics for Evaluating Classifier Performance
• The confusion matrix is a useful tool for analyzing how well your classifier can
recognize tuples of different classes.
• TP and TN tell us when the classifier is getting things right.
• while FP and FN tell us when the classifier is getting things wrong (i.e.,
mislabeling).
• Given m classes (where m ≥ 2), a confusion matrix is a table of at least size m
by m. An entry, CMi,j in the first m rows and m columns indicates the number
of tuples of class i that were labeled by the classifier as class j.
• Good accuracy: ideally most of the tuples
would be represented along the diagonal of
the confusion matrix, from entry CM1,1 to
CMm,m.
• Rest of the entries being zero or close to zero.
That is, ideally, FP and FN are around zero.
Metrics for Evaluating Classifier Performance
• The table may have additional rows or columns to provide totals.
• For example, in the confusion matrix has P and N as shown in table.
• In addition, P’ is the number of tuples that were labeled as positive (TP
+ FP).
• N’ is the number of tuples that were labeled as negative (TN + FN).
• The total number of tuples is TP + TN + FP + FN, or P + N, or P’ + N’.
Metrics for Evaluating Classifier Performance
• The accuracy (or recognition rate) of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by the
classifier:

    accuracy(M) = (TP + TN) / (P + N)

• Accuracy reflects how well the classifier recognizes tuples of the various
classes.
Metrics for Evaluating Classifier Performance
• The error rate (or misclassification rate) of a classifier M is simply
1 − accuracy(M), where accuracy(M) is the accuracy of M:

    error rate(M) = (FP + FN) / (P + N)
Metrics for Evaluating Classifier Performance
Class Imbalance Problem:
• where the main class of interest is rare. That is, the data set distribution
reflects a significant majority of the negative class and a minority positive
class.
• For example, in fraud detection applications, the class of interest (or
positive class) is “fraud,” which occurs much less frequently than the
negative “nonfraudulent” class.
• In medical data, there may be a rare class, such as “cancer.” Suppose that
you have trained a classifier to classify medical data tuples, where the class
label attribute is “cancer” and the possible class values are “yes” and “no.”
An accuracy rate of, say, 97% may make the classifier seem quite accurate.
• But what if only, say, 3% of the training tuples are actually cancer? Clearly,
an accuracy rate of 97% may not be acceptable—the classifier could be
correctly labeling only the noncancer tuples, for instance, and
misclassifying all the cancer tuples.
Metrics for Evaluating Classifier Performance
The sensitivity and specificity measures can be used to overcome the
Class Imbalance Problem.
• Sensitivity is also referred to as the true positive (recognition) rate (i.e.,
the proportion of positive tuples that are correctly identified):

    sensitivity = TP / P

• Specificity is the true negative rate (i.e., the proportion of negative
tuples that are correctly identified):

    specificity = TN / N
Metrics for Evaluating Classifier Performance
The sensitivity and specificity measures can be used to overcome the
Class Imbalance Problem.
• Although the classifier has a high accuracy, its ability to correctly label
the positive (rare) class is poor, given its low sensitivity.
• It has high specificity, meaning that it can accurately recognize negative
tuples.
Metrics for Evaluating Classifier Performance
• The precision and recall measures are also widely used in classification.
• Precision can be thought of as a measure of exactness (i.e., what
percentage of tuples labeled as positive are actually such):

    precision = TP / (TP + FP)

• Recall is a measure of completeness (i.e., what percentage of positive
tuples are labeled as such):

    recall = TP / (TP + FN) = TP / P
• If recall seems familiar, that’s because it is the same as sensitivity (or
the true positive rate).
Metrics for Evaluating Classifier Performance
• A perfect precision score of 1.0 for a class C means that every tuple that
the classifier labeled as belonging to class C does indeed belong to class
C.
• However, it does not tell us anything about the number of class C tuples
that the classifier mislabeled.
• A perfect recall score of 1.0 for C means that every item from class C
was labeled as such, but it does not tell us how many other tuples were
incorrectly labeled as belonging to class C.
• There tends to be an inverse relationship between precision and recall,
where it is possible to increase one at the cost of reducing the other.
Metrics for Evaluating Classifier Performance
• An alternative way to use precision and recall is to combine them into a
single measure.
• The F measure (also known as the F1 score or F-score) and the Fβ measure combine them:

    F = (2 × precision × recall) / (precision + recall)

• The F measure is the harmonic mean of precision and recall.
• It gives equal weight to precision and recall.
Metrics for Evaluating Classifier Performance

    Fβ = ((1 + β²) × precision × recall) / (β² × precision + recall)

• where β is a non-negative real number.
• The Fβ measure is a weighted measure of precision and recall. It assigns
β times as much weight to recall as to precision.
• Commonly used Fβ measures are F2 (which weights recall twice as much
as precision) and F0.5 (which weights precision twice as much as recall).
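• To tie these measures together, the sketch below computes all of them from the four confusion-matrix counts; it is plain arithmetic with no library assumptions, and the example counts at the end are hypothetical, chosen to mimic a 3% positive class.

    def evaluation_metrics(TP, FP, TN, FN, beta=1.0):
        P, N = TP + FN, TN + FP           # actual positives and negatives
        accuracy = (TP + TN) / (P + N)
        error_rate = (FP + FN) / (P + N)
        sensitivity = TP / P              # recall / true positive rate
        specificity = TN / N              # true negative rate
        precision = TP / (TP + FP)
        recall = sensitivity
        f1 = 2 * precision * recall / (precision + recall)
        f_beta = ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)
        return {"accuracy": accuracy, "error_rate": error_rate,
                "sensitivity": sensitivity, "specificity": specificity,
                "precision": precision, "recall": recall,
                "F1": f1, "F_beta": f_beta}

    # Hypothetical imbalanced example: 3% positive class (300 of 10,000 tuples)
    print(evaluation_metrics(TP=90, FP=140, TN=9560, FN=210, beta=2.0))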
Metrics for Evaluating Classifier Performance
• In addition to accuracy-based measures, classifiers can also be
compared with respect to the following additional aspects:
• Speed: This refers to the computational costs involved in generating and
using the given classifier.
• Robustness: This is the ability of the classifier to make correct
predictions given noisy data or data with missing values. Robustness is
typically assessed with a series of synthetic data sets representing
increasing degrees of noise and missing values.
• Scalability: This refers to the ability to construct the classifier efficiently
given large amounts of data. Scalability is typically assessed with a
series of data sets of increasing size.
Summary
• Classification is a form of data analysis that extracts models describing
data classes. A classifier, or classification model, predicts categorical
labels (classes). Numeric prediction models continuous-valued
functions. Classification and numeric prediction are the two major types
of prediction problems.
• Decision tree induction is a top-down recursive tree induction
algorithm, which uses an attribute selection measure to select the
attribute tested for each non-leaf node in the tree.
• ID3, C4.5, and CART are examples of such algorithms using different
attribute selection measures. Tree pruning algorithms attempt to
improve accuracy by removing tree branches reflecting noise in the
data. Early decision tree algorithms typically assume that the data are
memory resident.
Summary
• Naïve Bayesian classification is based on Bayes’ theorem of posterior
probability. It assumes class-conditional independence—that the effect
of an attribute value on a given class is independent of the values of the
other attributes.
• A rule-based classifier uses a set of IF-THEN rules for classification.
Rules can be extracted from a decision tree. Rules may also be
generated directly from training data using sequential covering
algorithms.
• A confusion matrix can be used to evaluate a classifier’s quality. For a
two-class problem, it shows the true positives, true negatives, false
positives, and false negatives. Measures that assess a classifier’s
predictive ability include accuracy, sensitivity (also known as recall),
specificity, precision, F, and Fβ. Reliance on the accuracy measure can be
deceiving when the main class of interest is in the minority.
References
• Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and
Techniques, 3rd Edition, Morgan Kaufmann.
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to
Data Mining, Pearson India.
