
Classification

Naïve Bayesian classification, decision trees, decision rules, and instance-based methods
Supervised vs. Unsupervised Learning
Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
• New data is classified based on the training set.
Unsupervised learning (clustering)
• The class labels of the training data are unknown.
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Prediction Problems: Classification vs. Numeric Prediction
Classification
• Predicts categorical class labels (discrete or nominal)
• Classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute and uses it in classifying new data
Numeric Prediction
• Models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
• Credit/loan approval
• Medical diagnosis: whether a tumor is cancerous or benign
• Fraud detection: whether a transaction is fraudulent
• Web page categorization: which category a page belongs to
Classification
Data classification is a two-step process: a learning step and a classification step.
i. Learning step (training phase, or model construction): a classification model is constructed that
describes a set of predetermined classes.
• The model is "learned from" a training set made up of database tuples and their
associated class labels.
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute.
• The set of tuples used for model construction is the training set.
• The model is represented as classification rules, decision trees, or
mathematical formulae.
• The class label attribute is categorical (or nominal) in that each value
serves as a category or class.
Classification
ii. Classification step (Model usage ): where the model is used to predict
class labels for given data.
• Classifying future or unknown objects
• If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test)
set
Classification
• Estimate the accuracy of the model:
• The known label of each test sample is compared with the classification result
from the model.
• The accuracy rate is the percentage of test set samples that are correctly
classified by the model.
• The test set is independent of the training set (otherwise overfitting occurs).
Decision Tree
• A decision tree is a flowchart-like tree structure:
• Each internal node (nonleaf node) denotes a test on an attribute
• Each branch represents an outcome of the test.
• Each leaf node (or terminal node) holds a class label.
• Internal nodes are denoted
by rectangles.
• Leaf nodes are denoted by
ovals.
• Some decision tree
algorithms produce only
binary trees.
• Others can produce
nonbinary trees.
Decision Tree
How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is unknown, the
attribute values of the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class
prediction for that tuple.
• Decision trees can easily be
converted to classification
rules.
• Decision tree classifiers have
good accuracy. However,
successful use may depend
on the data at hand.
Decision Tree
• Popular traditional Decision Tree Algorithms:
✓ID3 (Iterative Dichotomiser)
✓C4.5 (a successor of ID3)
✓CART (Classification And Regression Trees)
• ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in
which decision trees are constructed in a top-down recursive divide-
and-conquer manner.
Decision Tree
The decision tree algorithm takes three parameters:
i. D: a data partition; initially, it is the complete set of training tuples and their
associated class labels.
ii. attribute_list: the set of candidate attributes.
iii. Attribute selection method: a heuristic procedure for selecting
the attribute that "best" discriminates the given tuples according to class.
➢ The splitting criterion consists of a splitting attribute and, possibly, either a split-
point or a splitting subset.
➢ Attribute selection measures include information gain and the Gini index.
➢ Gini index: the resulting tree is strictly binary.
➢ Information gain: allows multiway splits (i.e., two or more branches may be grown
from a node).
Decision Tree
• If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class (Stopping criteria).
• Otherwise, the algorithm calls Attribute selection method to determine
the splitting criterion.
Stopping criteria: Recursive partitioning stops only when any one of the
following terminating conditions is true (a minimal sketch of the recursion follows this list):
✓All the tuples in partition D (represented at node N) belong to the
same class.
✓There are no remaining attributes on which the tuples may be further
partitioned.
✓There are no tuples for a given branch, that is, a partition Dj is empty.
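• To make the control flow concrete, here is a minimal Python sketch of this recursive, divide-and-conquer scheme. It is illustrative only: the names (build_tree, majority_class, select_attribute, known_values) are my own, the tree is represented as nested dictionaries, and the attribute selection measure is passed in as a function.

    from collections import Counter

    def majority_class(D):
        # D is a list of (attribute_dict, class_label) pairs
        return Counter(label for _, label in D).most_common(1)[0][0]

    def build_tree(D, attribute_list, known_values, select_attribute):
        labels = [label for _, label in D]
        # Stopping criterion 1: all tuples in D belong to the same class
        if len(set(labels)) == 1:
            return labels[0]
        # Stopping criterion 2: no remaining attributes -> leaf with majority class
        if not attribute_list:
            return majority_class(D)
        # Attribute selection method: e.g., information gain or Gini index
        A = select_attribute(D, attribute_list)
        node = {"attribute": A, "branches": {}}
        for value in known_values[A]:  # one branch per known value of A
            Dj = [(x, y) for x, y in D if x[A] == value]
            if not Dj:
                # Stopping criterion 3: empty partition -> leaf with majority class of D
                node["branches"][value] = majority_class(D)
            else:
                remaining = [a for a in attribute_list if a != A]
                node["branches"][value] = build_tree(Dj, remaining, known_values, select_attribute)
        return node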
Decision Tree
• Let A be the splitting attribute. A has v distinct values, {a1, a2,..., av},
based on the training data.
• A is discrete-valued: In this case, the outcomes of the test at node N
correspond directly to the known values of A.
• A branch is created for each known value, aj, of A and labeled with that
value.
• A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A ≤ split point and A > split
point, respectively.
Decision Tree: Attribute Selection Measures
Three popular attribute selection measures:
a) Information gain (ID3)
b) Gain ratio (C4.5)
c) Gini index (CART)
Decision Tree: Attribute Selection Measures- Information gain
Information Gain:
• ID3 uses information gain as its attribute selection measure.
• The attribute with the highest information gain is chosen as the
splitting attribute for node N.
• This attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness, or "impurity," in these partitions.
Decision Tree: Attribute Selection Measures- Information gain
Entropy (Information Theory)
• A measure of uncertainty associated with a random variable.
• High entropy -> higher uncertainty
• Lower entropy -> lower uncertainty
• The expected information (or entropy) needed to classify a tuple in D
is:
Decision Tree: Attribute Selection Measures- Information gain
• The expected information (or entropy) needed to classify a tuple in D
is:

    Info(D) = − Σ (i=1..m) pi log2(pi)

• where pi is the nonzero probability that an arbitrary tuple in D belongs
to class Ci and is estimated by |Ci,D|/|D|.
• A log function to the base 2 is used, because the information is encoded
in bits.
• Info(D) is just the average amount of information needed to identify the
class label of a tuple in D.
Decision Tree: Attribute Selection Measures- Information gain
• Information needed (after using A to split D into v partitions) to classify
D:

    InfoA(D) = Σ (j=1..v) (|Dj| / |D|) × Info(Dj)

• The term |Dj|/|D| acts as the weight of the jth partition.
• InfoA(D) is the expected information required to classify a tuple from D
based on the partitioning by A.
• The smaller the expected information (still) required, the greater the
purity of the partitions.
Decision Tree: Attribute Selection Measures- Information gain
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes)
and the new requirement (i.e., obtained after partitioning on A).
• Information gained by branching on attribute A:

    Gain(A) = Info(D) − InfoA(D)
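• Below is a small, self-contained Python sketch of these formulas (the helper names info, info_after_split, and gain are mine); the example at the end reproduces the 9-yes / 5-no training data used in the worked example that follows.

    import math
    from collections import Counter

    def info(labels):
        # Info(D) = -sum(p_i * log2(p_i)) over the classes present in D
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_after_split(partitions):
        # Info_A(D): weighted average of Info(D_j) over the partitions induced by A
        n = sum(len(p) for p in partitions)
        return sum((len(p) / n) * info(p) for p in partitions)

    def gain(labels, partitions):
        # Gain(A) = Info(D) - Info_A(D)
        return info(labels) - info_after_split(partitions)

    # Example: 9 "yes" and 5 "no" tuples, split by age as in the slides
    D = ["yes"] * 9 + ["no"] * 5
    by_age = [["yes"] * 2 + ["no"] * 3,   # youth
              ["yes"] * 4,                # middle_aged
              ["yes"] * 3 + ["no"] * 2]   # senior
    print(round(info(D), 3))          # -> 0.94 (i.e., 0.940 bits)
    print(round(gain(D, by_age), 3))  # -> 0.247 (0.246 in the slides, which round intermediate values)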
Decision Tree: Attribute Selection Measures- Information gain
• In this example, class label attribute, buys_computer, has two distinct
values (namely, yes, no).
• There are nine tuples of class yes and five tuples of class no.
• A (root) node N is created for the tuples in D.
• To find the splitting criterion for these tuples, we first compute the
expected information needed to classify a tuple in D:

    Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
Decision Tree: Attribute Selection Measures- Information gain
• Next, we need to compute the expected information requirement for
each attribute.
• Let’s start with the attribute age. We need to look at the distribution of
yes and no tuples for each category of age.
Decision Tree: Attribute Selection Measures- Information gain
• For the age category “youth,” there are two yes tuples and three no
tuples.
• For the category “middle aged,” there are four yes tuples and zero no
tuples.
• For the category “senior,” there
are three yes tuples and two no
tuples.
Decision Tree: Attribute Selection Measures- Information gain

    Info_age(D) = (5/14) × Info(2 yes, 3 no) + (4/14) × Info(4 yes, 0 no) + (5/14) × Info(3 yes, 2 no)
                = 0.694 bits
    Gain(age)   = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits

• Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) =
0.151 bits, and Gain(credit_rating) = 0.048 bits.
• Because age has the highest
information gain among the
attributes, it is selected as the
splitting attribute.
Decision Tree: Attribute Selection Measures- Information gain

• Note that the tuples falling into the partition for age = middle_aged
all belong to the same class.
• Therefore, a leaf is created at the end of this branch and
labeled "yes."
Decision Tree: Attribute Selection Measures
• Computing Information-Gain for Continuous-Valued Attributes:
• Let attribute A be a continuous-valued attribute.
• We must determine the best split point for A (a short sketch in Python follows this list):
➢ Sort the values of A in increasing order.
➢ Typically, the midpoint between each pair of adjacent values is
considered as a possible split point.
➢ (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
➢ The point with the minimum expected information requirement for
A is selected as the split-point for A
• Split:
➢ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set
of tuples in D satisfying A > split-point
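• A possible Python sketch of this midpoint search; it repeats the info() helper from the earlier information-gain sketch so that it runs on its own, and the attribute values and labels in the usage line are made up for illustration.

    import math
    from collections import Counter

    def info(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_split_point(values, labels):
        # Return (split_point, expected_info) minimizing Info_A(D) for a
        # continuous attribute A, trying midpoints of adjacent sorted values.
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = (None, float("inf"))
        for i in range(n - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue  # identical adjacent values give no new midpoint
            mid = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [lab for v, lab in pairs if v <= mid]
            right = [lab for v, lab in pairs if v > mid]
            expected = (len(left) / n) * info(left) + (len(right) / n) * info(right)
            if expected < best[1]:
                best = (mid, expected)
        return best

    # Hypothetical usage: ages with a binary class label
    print(best_split_point([23, 25, 30, 35, 40, 52], ["no", "no", "yes", "yes", "yes", "no"]))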
Decision Tree: Attribute Selection Measures- Gain Ratio
• Information gain measure is biased toward tests with many outcomes.
• Information gain prefers to select attributes having a large number of
values.
• For example, consider an attribute that acts as a unique identifier such
as product ID.
• A split on product ID would result in a large number of partitions (as
many as there are values), each one containing just one tuple.
• Because each partition is pure, the information required to classify data set D
based on this partitioning would be Info_product_ID(D) = 0.
• Therefore, the information gained by partitioning on this attribute is
maximal, even though such a partitioning is useless for classification.
Decision Tree: Attribute Selection Measures- Gain Ratio
• C4.5, a successor of ID3, uses an extension to information gain known
as gain ratio, which attempts to overcome this bias.
• It normalizes the information gain using a "split information" value:

    SplitInfoA(D) = − Σ (j=1..v) (|Dj| / |D|) × log2(|Dj| / |D|)

• For each outcome, it considers the number of tuples having that
outcome with respect to the total number of tuples in D.
• The gain ratio is then defined as:

    GainRatio(A) = Gain(A) / SplitInfoA(D)

• The attribute with the maximum gain ratio is selected as the splitting
attribute.
Decision Tree: Attribute Selection Measures- Gain Ratio
• Computation of the gain ratio for the attribute income:

    SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557

• We have Gain(income) = 0.029.
• Therefore, GainRatio(income) = 0.029 / 1.557 = 0.019.
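• The same numbers can be reproduced with a few lines of Python; split_info and gain_ratio are hypothetical helper names, and the gain value 0.029 is taken from the slide above rather than recomputed.

    import math

    def split_info(partition_sizes):
        # SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))
        n = sum(partition_sizes)
        return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)

    def gain_ratio(gain, partition_sizes):
        return gain / split_info(partition_sizes)

    # The income attribute splits the 14 tuples into partitions of size 4, 6, and 4
    print(round(split_info([4, 6, 4]), 3))         # -> 1.557
    print(round(gain_ratio(0.029, [4, 6, 4]), 3))  # -> 0.019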
Decision Tree: Attribute Selection Measures- Gini Index
• Gini index measures the impurity of a data partition or set of training
tuples:

    Gini(D) = 1 − Σ (i=1..m) pi^2

• where pi is the probability that a tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|. The sum is computed over m classes.
• Gini index considers a binary split for each attribute.
• To determine the best binary split on A, we examine all the possible
subsets that can be formed using known values of A.
• For example, if income has three possible values, namely {low, medium,
high}, then the possible subsets are {low, medium, high}, {low,
medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
Decision Tree: Attribute Selection Measures- Gini Index
• We exclude the full set, {low, medium, high}, and the empty set {}
from consideration since, conceptually, they do not represent a split.
• Therefore, there are 2^v − 2 possible ways to form two partitions of the
data D, based on a binary split on A.
• When considering a binary split, we compute a weighted sum of the
impurity of each resulting partition.
• For example, if a binary split on A partitions D into D1 and D2, the Gini
index of D given that partitioning is:

    GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

• For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset.
Decision Tree: Attribute Selection Measures- Gini Index
• The reduction in impurity that would be incurred by a binary split on a
discrete- or continuous-valued attribute A is:

    ΔGini(A) = Gini(D) − GiniA(D)

• The attribute that maximizes the reduction in impurity (or, equivalently,
has the minimum Gini index) is selected as the splitting attribute.
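• A minimal Python sketch of these Gini computations (the helper names are my own); the final line reproduces Gini(D) for the 9-yes / 5-no training data used in the example that follows.

    from collections import Counter

    def gini(labels):
        # Gini(D) = 1 - sum(p_i^2)
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(D1, D2):
        # Gini_A(D) for a binary split of D into D1 and D2
        n = len(D1) + len(D2)
        return (len(D1) / n) * gini(D1) + (len(D2) / n) * gini(D2)

    def gini_reduction(D, D1, D2):
        # delta Gini(A) = Gini(D) - Gini_A(D)
        return gini(D) - gini_split(D1, D2)

    # Example mirroring the slides: 9 "yes" / 5 "no" tuples overall
    D = ["yes"] * 9 + ["no"] * 5
    print(round(gini(D), 3))  # -> 0.459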
Decision Tree: Attribute Selection Measures- Gini Index
• Using the Gini index to compute the impurity of D:

    Gini(D) = 1 − (9/14)^2 − (5/14)^2 = 0.459

• To find the splitting criterion for the tuples in D, we need to compute
the Gini index for each attribute.
• Consider attribute income and each
of the possible splitting subsets.
• The subset {low, medium} results in 10 tuples
in partition D1 satisfying the condition
"income ∈ {low, medium}."
• The remaining four tuples of D would
be assigned to partition D2.
Decision Tree: Attribute Selection Measures- Gini Index
• The Gini index value computed based on the partitioning {low, medium} is:

    Gini_income ∈ {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443

• The Gini index for the subsets {low, high}
and {medium} is 0.458.
• For {medium, high} and {low}, it is 0.450.
• The best binary split for attribute
income is on {low, medium} (or {high})
because it minimizes the Gini index.
Decision Tree: Attribute Selection Measures- Gini Index
• Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best
split for age with a Gini index of 0.375.
• The attributes student and credit_rating are both binary, with Gini index
values of 0.367 and 0.429, respectively.

•…
Decision Tree
The three measures, in general, return good results but
• Information gain:
➢biased towards multivalued attributes
• Gain ratio:
➢tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
➢biased towards multivalued attributes
➢has difficulty when the number of classes is large
➢tends to favor tests that result in equal-sized partitions and purity in
both partitions
Bayes Classification
• Bayesian classifiers are statistical classifiers.
• Bayesian classifier can predict class membership probabilities such as
the probability that a given tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem.
• We study a simple Bayesian classifier known as the naïve Bayesian classifier.
• Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.

• Bayes' theorem states:

    P(H|X) = P(X|H) P(H) / P(X)

• Bayes' theorem is useful in that it provides a way of calculating the
posterior probability, P(H|X), from P(H), P(X|H), and P(X).
Bayes Classification
• Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.

• For example, X is a 35-year-old customer with an income of $40,000.


Suppose that H is the hypothesis that our customer will buy a computer.
• Then P(H|X) reflects the probability that customer X will buy a
computer given that we know the customer’s age and income.
Naïve Bayesian Classification
• Naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
• Suppose that there are m classes, C1, C2,…, Cm. Given a tuple, X, the
classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X. That is, the naïve Bayesian
classifier predicts that tuple X belongs to the class Ci if and only if

    P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i

• Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is
called the maximum posteriori hypothesis. By Bayes' theorem:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naïve Bayesian Classification
• As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be
maximized. Note that the class prior probabilities may be estimated by
P(Ci)=|Ci,D|/|D|, where |Ci,D| is the number of training tuples of class Ci
in D.
• A simplified assumption: attributes are conditionally independent (i.e.,
no dependence relation between attributes):

    P(X|Ci) = Π (k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

• This greatly reduces the computation cost: only the class distributions
need to be counted.
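• The counting this implies is straightforward to sketch in Python. The snippet below is a bare-bones categorical naïve Bayes with no smoothing (a zero count therefore zeroes the whole product); the function names are illustrative, not from any particular library.

    from collections import Counter, defaultdict

    def train_naive_bayes(tuples, labels):
        # tuples: list of dicts mapping attribute -> value; labels: parallel class labels
        n = len(labels)
        priors = {c: cnt / n for c, cnt in Counter(labels).items()}
        class_totals = Counter(labels)
        # counts[class][attribute][value] = number of training tuples with that value
        counts = defaultdict(lambda: defaultdict(Counter))
        for x, c in zip(tuples, labels):
            for attr, val in x.items():
                counts[c][attr][val] += 1
        return priors, counts, class_totals

    def predict(x, priors, counts, class_totals):
        best_class, best_score = None, -1.0
        for c, prior in priors.items():
            score = prior  # P(C_i)
            for attr, val in x.items():
                # multiply by P(x_k | C_i), estimated from the training counts
                score *= counts[c][attr][val] / class_totals[c]
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    # Tiny hypothetical usage (two training tuples only, just to show the call pattern)
    X = [{"age": "youth", "student": "yes"}, {"age": "senior", "student": "no"}]
    y = ["yes", "no"]
    model = train_naive_bayes(X, y)
    print(predict({"age": "youth", "student": "yes"}, *model))  # -> "yes"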
Naïve Bayesian Classification
• Predicting a class label using naïve Bayesian classification on the buys_computer training data.
• For example, we wish to classify the tuple (from the same training data as before):

    X = (age = youth, income = medium, student = yes, credit_rating = fair)

• We need to maximize P(X|Ci)P(Ci) for i = 1, 2, where C1 is buys_computer = yes and C2 is buys_computer = no.
• P(Ci), the prior probability of each class, can be computed based on the training
tuples:

    P(buys_computer = yes) = 9/14 = 0.643
    P(buys_computer = no)  = 5/14 = 0.357

Naïve Bayesian Classification
• To compute P(X|Ci), we compute the following conditional probabilities:

    P(age = youth | yes) = 2/9 = 0.222            P(age = youth | no) = 3/5 = 0.600
    P(income = medium | yes) = 4/9 = 0.444        P(income = medium | no) = 2/5 = 0.400
    P(student = yes | yes) = 6/9 = 0.667          P(student = yes | no) = 1/5 = 0.200
    P(credit_rating = fair | yes) = 6/9 = 0.667   P(credit_rating = fair | no) = 2/5 = 0.400

Naïve Bayesian Classification
• Multiplying these probabilities gives:

    P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
    P(X | buys_computer = no)  = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

    P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
    P(X | no) P(no)   = 0.019 × 0.357 = 0.007

• Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
Rule-Based Classification
• Rule-based classifiers, where the learned model is represented as a set
of IF-THEN rules:
IF condition THEN conclusion
Example: IF age = youth AND student = yes THEN buys_computer = yes
• Rule-based classifiers can be generated either from a decision tree or
directly from the training data using a sequential covering algorithm.
Rule-Based Classification: Using Decision Tree
• Rules are easier to understand than large trees.
• One rule is created for each path from the root to a leaf.
• Each attribute-value pair along a path forms a conjunction: the leaf
holds the class prediction.
• Rules are mutually exclusive and exhaustive.
• Example: Rule extraction from our buys_computer decision tree:

    IF age = youth AND student = no THEN buys_computer = no
    IF age = youth AND student = yes THEN buys_computer = yes
    IF age = middle_aged THEN buys_computer = yes
    IF age = senior AND credit_rating = excellent THEN buys_computer = no
    IF age = senior AND credit_rating = fair THEN buys_computer = yes
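• Read procedurally, these rules are just a chain of mutually exclusive conditions; a small illustrative Python rendering (attribute and value names follow the slides) is:

    def classify_buys_computer(age, student=None, credit_rating=None):
        # One IF-THEN rule per root-to-leaf path of the decision tree
        if age == "youth" and student == "no":
            return "no"
        if age == "youth" and student == "yes":
            return "yes"
        if age == "middle_aged":
            return "yes"
        if age == "senior" and credit_rating == "excellent":
            return "no"
        if age == "senior" and credit_rating == "fair":
            return "yes"
        return None  # no rule fires (should not happen for exhaustive rules)

    print(classify_buys_computer("youth", student="yes"))  # -> "yes"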
Rule-Based Classification: Using Sequential Covering Algorithm
• Sequential covering algorithm: Extracts rules directly from training data.
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER.
• Rules are learned sequentially; each rule for a given class Ci covers many
tuples of Ci and none (or few) of the tuples of other classes.
• Steps:
▪ Rules are learned one at a time
▪ Each time a rule is learned, the tuples covered by the rules are
removed
▪ Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold
• Compared with decision-tree induction, which learns a set of rules simultaneously, sequential covering learns rules one at a time (a simplified sketch follows).
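• A greatly simplified sketch of the sequential-covering loop is shown below; learn_one_rule is only a placeholder for the rule-growing step (which FOIL, CN2, or RIPPER implement with their own search and pruning strategies), so this is an outline of the control flow under that assumption rather than any specific algorithm.

    def sequential_covering(D, classes, learn_one_rule, min_quality=0.0):
        # D: list of (tuple, label) pairs. learn_one_rule(D, c) must return
        # (rule, quality), where rule is a predicate over a tuple, or (None, 0.0).
        rule_set = []
        for c in classes:                      # learn rules for one class at a time
            remaining = list(D)
            while any(label == c for _, label in remaining):
                rule, quality = learn_one_rule(remaining, c)
                if rule is None or quality < min_quality:
                    break                      # rule quality below the user-specified threshold
                rule_set.append((rule, c))
                covered = [(x, y) for x, y in remaining if rule(x)]
                if not covered:
                    break                      # safeguard: the rule covers nothing new
                # remove the tuples covered by the new rule and repeat
                remaining = [(x, y) for x, y in remaining if not rule(x)]
        return rule_set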
Instance-based Classification
• Instance-based learning (sometimes called memory-based learning)
compares new problem instances with instances seen in training, which
have been stored in memory.
• Because computation is postponed until a new instance is observed,
these algorithms are sometimes referred to as "lazy".
• Store training examples and delay the processing until a new instance
must be classified.
Instance-based Classification
Lazy vs Eager learning
Lazy learning:
• Simply stores the training data (or performs only minor processing) and waits until it
is given a test tuple.
• Only when it sees the test tuple does it perform generalization, classifying
the tuple based on its similarity to the stored training tuples.
Eager learning:
• Given a set of training tuples, constructs a classification model before
receiving new (e.g., test) data to classify.
• Eager learning models are ready (and eager) to classify previously unseen
tuples.
• Lazy: less time in training but more time in predicting.
Instance-based Classification: k-Nearest-Neighbor
• Nearest-neighbor classifiers compare a given test tuple with training
tuples that are similar to it.
• k-nearest-neighbor classifier searches the pattern space for the k
training tuples that are closest to the unknown (or test) tuple.
• These k training tuples are the k nearest neighbors of the unknown
tuple.
• Closeness is defined in terms of a distance metric, such as Euclidean
distance. The Euclidean distance between two points or tuples,
X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is:

    dist(X1, X2) = sqrt( Σ (i=1..n) (x1i − x2i)^2 )
Instance-based Classification: k-Nearest-Neighbor
• For k-nearest-neighbor classification, the unknown tuple is assigned the
most common class among its k-nearest neighbors.
• When k = 1, the unknown tuple is assigned the class of the training
tuple that is closest to it in pattern space.
• Nearest-neighbor classifiers can also be used for numeric prediction,
that is, to return a real-valued prediction for a given unknown tuple.
• In this case, the classifier returns the average value of the real-valued
labels associated with the k-nearest neighbors of the unknown tuple.
Instance-based Classification: k-Nearest-Neighbor
Applicant   Cibil Score   Income    Loan Approved   Euclidean Distance to X
A           700           50000     Y               25000.20
B           800           40000     Y               15001.33
C           750           30000     Y               5002.25
D           400           10000     N               15001.33
E           850           8000      Y               17001.84
F           600           20000     N               5000.00
G           700           35000     Y               10000.50
H           750           100000    Y               75000.15
I           500           150000    N               125000.04
J           650           18000     N               7000.18

X           600           25000     ?               (query tuple, k = 3)
Instance-based Classification: k-Nearest-Neighbor
Applicant   Cibil Score   Income    Loan Approved   Euclidean Distance to X
A           700           50000     Y               25000.20
B           800           40000     Y               15001.33
C           750           30000     Y               5002.25
D           400           10000     N               15001.33
E           850           8000      Y               17001.84
F           600           20000     N               5000.00
G           700           35000     Y               10000.50
H           750           100000    Y               75000.15
I           500           150000    N               125000.04
J           650           18000     N               7000.18

X           600           25000     N               (predicted, k = 3)

• With k = 3, the nearest neighbors of X are F (5000.00), C (5002.25), and J (7000.18),
whose classes are N, Y, and N; the majority class is N, so X is classified as Loan Approved = N.
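• A compact Python version of this vote, run on the loan table above; it uses Euclidean distance on the raw Cibil-score and income values exactly as in the table (in practice the attributes would normally be normalized first so that income does not dominate the distance).

    import math
    from collections import Counter

    train = [  # (cibil_score, income, loan_approved)
        (700, 50000, "Y"), (800, 40000, "Y"), (750, 30000, "Y"),
        (400, 10000, "N"), (850, 8000, "Y"), (600, 20000, "N"),
        (700, 35000, "Y"), (750, 100000, "Y"), (500, 150000, "N"),
        (650, 18000, "N"),
    ]

    def knn_predict(query, train, k=3):
        # Sort training tuples by Euclidean distance to the query
        neighbors = sorted(train, key=lambda t: math.dist(query, t[:2]))[:k]
        # Majority class among the k nearest neighbors
        return Counter(t[2] for t in neighbors).most_common(1)[0][0]

    print(knn_predict((600, 25000), train))  # -> "N" (neighbors F, C, J give N, Y, N)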
Metrics for Evaluating Classifier Performance
• The evaluation metrics assess how good or how accurate your classifier
is at predicting the class label of tuples.
• Use a validation (test) set of class-labeled tuples, rather than the training set,
when assessing accuracy.
Metrics for Evaluating Classifier Performance
• There are four additional terms we need to know that are the “building
blocks” used in computing many evaluation measures.
• These terms are summarized in the confusion matrix.
• True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier. Let TP be the number of true
positives.
• True negatives (TN): These are the negative tuples that were correctly
labeled by the classifier. Let TN be the number of true negatives.
Metrics for Evaluating Classifier Performance
• There are four additional terms we need to know that are the “building
blocks” used in computing many evaluation measures.
• False positives (FP): These are the negative tuples that were incorrectly
labeled as positive.
• e.g., tuples of class buys_computer = no for which the classifier
predicted buys_computer = yes. Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were mislabeled
as negative.
• e.g., tuples of class buys_computer =
yes for which the classifier predicted
buys_computer = no. Let FN be the
number of false negatives.
Metrics for Evaluating Classifier Performance
• The confusion matrix is a useful tool for analyzing how well your classifier can
recognize tuples of different classes.
• TP and TN tell us when the classifier is getting things right.
• while FP and FN tell us when the classifier is getting things wrong (i.e.,
mislabeling).
• Given m classes (where m ≥ 2), a confusion matrix is a table of at least size m
by m. An entry, CMi,j in the first m rows and m columns indicates the number
of tuples of class i that were labeled by the classifier as class j.
• Good accuracy: ideally most of the tuples
would be represented along the diagonal of
the confusion matrix, from entry CM1,1 to
CMm,m.
• Rest of the entries being zero or close to zero.
That is, ideally, FP and FN are around zero.
Metrics for Evaluating Classifier Performance
• The table may have additional rows or columns to provide totals.
• For example, in the confusion matrix has P and N as shown in table.
• In addition, P’ is the number of tuples that were labeled as positive (TP
+ FP).
• N’ is the number of tuples that were labeled as negative (TN + FN).
• The total number of tuples is TP + TN + FP + FN, or P + N, or P’ + N’.
Metrics for Evaluating Classifier Performance
• The accuracy (or recognition rate) of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by the
classifier:

    accuracy(M) = (TP + TN) / (P + N)

• Accuracy reflects how well the classifier recognizes tuples of the various
classes.
Metrics for Evaluating Classifier Performance
• The error rate (or misclassification rate) of a classifier M is simply
1 − accuracy(M), where accuracy(M) is the accuracy of M:

    error rate(M) = (FP + FN) / (P + N)
Metrics for Evaluating Classifier Performance
Class Imbalance Problem:
• where the main class of interest is rare. That is, the data set distribution
reflects a significant majority of the negative class and a minority positive
class.
• For example, in fraud detection applications, the class of interest (or
positive class) is “fraud,” which occurs much less frequently than the
negative “nonfraudulent” class.
• In medical data, there may be a rare class, such as “cancer.” Suppose that
you have trained a classifier to classify medical data tuples, where the class
label attribute is “cancer” and the possible class values are “yes” and “no.”
An accuracy rate of, say, 97% may make the classifier seem quite accurate.
• But what if only, say, 3% of the training tuples are actually cancer? Clearly,
an accuracy rate of 97% may not be acceptable—the classifier could be
correctly labeling only the noncancer tuples, for instance, and
misclassifying all the cancer tuples.
Metrics for Evaluating Classifier Performance
The sensitivity and specificity measures can be used to overcome the
Class Imbalance Problem.
• Sensitivity is also referred to as the true positive (recognition) rate (i.e.,
the proportion of positive tuples that are correctly identified):

    sensitivity = TP / P

• Specificity is the true negative rate (i.e., the proportion of negative
tuples that are correctly identified):

    specificity = TN / N
Metrics for Evaluating Classifier Performance
The sensitivity and specificity measures can be used to overcome the
Class Imbalance Problem.
• Although the classifier has a high accuracy, its ability to correctly label
the positive (rare) class is poor, given its low sensitivity.
• It has high specificity, meaning that it can accurately recognize negative
tuples.
Metrics for Evaluating Classifier Performance
• The precision and recall measures are also widely used in classification.
• Precision can be thought of as a measure of exactness (i.e., what
percentage of tuples labeled as positive are actually such):

    precision = TP / (TP + FP)

• Recall is a measure of completeness (i.e., what percentage of positive
tuples are labeled as such):

    recall = TP / (TP + FN) = TP / P
• If recall seems familiar, that’s because it is the same as sensitivity (or
the true positive rate).
Metrics for Evaluating Classifier Performance
• A perfect precision score of 1.0 for a class C means that every tuple that
the classifier labeled as belonging to class C does indeed belong to class
C.
• However, it does not tell us anything about the number of class C tuples
that the classifier mislabeled.
• A perfect recall score of 1.0 for C means that every item from class C
was labeled as such, but it does not tell us how many other tuples were
incorrectly labeled as belonging to class C.
• There tends to be an inverse relationship between precision and recall,
where it is possible to increase one at the cost of reducing the other.
Metrics for Evaluating Classifier Performance
• An alternative way to use precision and recall is to combine them into a
single measure.
• The F measure (also known as the F1 score or F-score) and the Fβ measure combine them:

    F = (2 × precision × recall) / (precision + recall)

• The F measure is the harmonic mean of precision and recall.
• It gives equal weight to precision and recall.
Metrics for Evaluating Classifier Performance

    Fβ = ((1 + β²) × precision × recall) / (β² × precision + recall)

• where β is a non-negative real number.
• The Fβ measure is a weighted measure of precision and recall. It assigns
β times as much weight to recall as to precision.
• Commonly used Fβ measures are F2 (which weights recall twice as much
as precision) and F0.5 (which weights precision twice as much as recall).
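• To tie these measures together, the sketch below computes all of them from the four confusion-matrix counts; it is plain arithmetic with no library assumptions, and the example counts at the end are hypothetical, chosen to mimic a 3% positive class.

    def evaluation_metrics(TP, FP, TN, FN, beta=1.0):
        P, N = TP + FN, TN + FP           # actual positives and negatives
        accuracy = (TP + TN) / (P + N)
        error_rate = (FP + FN) / (P + N)
        sensitivity = TP / P              # recall / true positive rate
        specificity = TN / N              # true negative rate
        precision = TP / (TP + FP)
        recall = sensitivity
        f1 = 2 * precision * recall / (precision + recall)
        f_beta = ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)
        return {"accuracy": accuracy, "error_rate": error_rate,
                "sensitivity": sensitivity, "specificity": specificity,
                "precision": precision, "recall": recall,
                "F1": f1, "F_beta": f_beta}

    # Hypothetical imbalanced example: 3% positive class (300 of 10,000 tuples)
    print(evaluation_metrics(TP=90, FP=140, TN=9560, FN=210, beta=2.0))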
Metrics for Evaluating Classifier Performance
• In addition to accuracy-based measures, classifiers can also be
compared with respect to the following additional aspects:
• Speed: This refers to the computational costs involved in generating and
using the given classifier.
• Robustness: This is the ability of the classifier to make correct
predictions given noisy data or data with missing values. Robustness is
typically assessed with a series of synthetic data sets representing
increasing degrees of noise and missing values.
• Scalability: This refers to the ability to construct the classifier efficiently
given large amounts of data. Scalability is typically assessed with a
series of data sets of increasing size.
Summary
• Classification is a form of data analysis that extracts models describing
data classes. A classifier, or classification model, predicts categorical
labels (classes). Numeric prediction models continuous-valued
functions. Classification and numeric prediction are the two major types
of prediction problems.
• Decision tree induction is a top-down recursive tree induction
algorithm, which uses an attribute selection measure to select the
attribute tested for each non-leaf node in the tree.
• ID3, C4.5, and CART are examples of such algorithms using different
attribute selection measures. Tree pruning algorithms attempt to
improve accuracy by removing tree branches reflecting noise in the
data. Early decision tree algorithms typically assume that the data are
memory resident.
Summary
• Naïve Bayesian classification is based on Bayes’ theorem of posterior
probability. It assumes class-conditional independence—that the effect
of an attribute value on a given class is independent of the values of the
other attributes.
• A rule-based classifier uses a set of IF-THEN rules for classification.
Rules can be extracted from a decision tree. Rules may also be
generated directly from training data using sequential covering
algorithms.
• A confusion matrix can be used to evaluate a classifier’s quality. For a
two-class problem, it shows the true positives, true negatives, false
positives, and false negatives. Measures that assess a classifier’s
predictive ability include accuracy, sensitivity (also known as recall),
specificity, precision, F, and Fβ. Reliance on the accuracy measure can be
deceiving when the main class of interest is in the minority.
References
• Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and
Techniques, 3rd Edition, Morgan Kaufmann.
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to
Data Mining, Pearson India.
