TE - DWM Module No 3
CLASSIFICATION
General Approach
The data classification process has two steps:
(a) Learning:
• Training data are analyzed by a classification algorithm.
• Here, the class label attribute is loan decision, and the learned model or classifier is represented in the form of classification rules.
(b) Classification:
• Test data are used to estimate the accuracy of the classification rules.
• If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
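A minimal sketch of this two-step process in Python with scikit-learn; the data set, attribute values, and the loan-decision labels are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1 (Learning): hypothetical training tuples (age, income) with a
# known loan-decision class label (0 = risky, 1 = safe).
X_train = [[25, 30000], [40, 70000], [35, 50000], [50, 90000]]
y_train = [0, 1, 0, 1]
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 (Classification): estimate accuracy on held-out test tuples;
# if it is acceptable, apply the model to new, unlabeled tuples.
X_test, y_test = [[30, 40000], [45, 80000]], [0, 1]
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```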
Decision Tree Induction
• Decision tree induction is the learning of decision trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute.
• Each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
• A typical decision tree is shown in the accompanying figure (not reproduced here).
DECISION TREE INDUCTION
“How are decision trees used for classification?”
• Given a tuple, X, for which the associated class label is unknown,
the attribute values of the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class
prediction for that tuple.
• Decision trees can easily be converted to classification rules.
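As a sketch of this traversal, the snippet below fits a small tree with scikit-learn and prints it with export_text; each root-to-leaf path ends in a class prediction (the attribute names and data are hypothetical):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical tuples: (age, income); class label: buys_computer (0/1).
X = [[22, 30], [25, 45], [47, 60], [52, 28], [33, 80], [60, 90]]
y = [0, 0, 1, 0, 1, 1]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(clf, feature_names=["age", "income"]))

# Classifying a new tuple traces one root-to-leaf path of this tree.
print("prediction for (30, 55):", clf.predict([[30, 55]]))
```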
RULE EXTRACTION FROM A DECISION TREE
• Decision tree classifiers are a popular method of classification.
• To extract rules from a decision tree, one rule is created for each path from the root to a leaf node.
• Each splitting criterion along a given path is logically ANDed to form the rule antecedent (the “IF” part).
• The leaf node holds the class prediction, forming the rule consequent (the “THEN” part).
• For example, a path might yield the rule: IF age = youth AND student = yes THEN buys_computer = yes.
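A minimal sketch of this extraction for a scikit-learn tree; the helper extract_rules below is illustrative, not a library function:

```python
from sklearn.tree import DecisionTreeClassifier

def extract_rules(clf, feature_names, class_names):
    """Create one IF ... THEN ... rule per root-to-leaf path."""
    t = clf.tree_
    rules = []

    def walk(node, conds):
        if t.children_left[node] == -1:  # leaf: emit the finished rule
            label = class_names[t.value[node][0].argmax()]
            rules.append("IF " + " AND ".join(conds) + f" THEN class = {label}")
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        # Each split along the path is logically ANDed into the antecedent.
        walk(t.children_left[node], conds + [f"{name} <= {thr:.1f}"])
        walk(t.children_right[node], conds + [f"{name} > {thr:.1f}"])

    walk(0, [])
    return rules

# Hypothetical data: (age, student) -> buys_computer.
clf = DecisionTreeClassifier(max_depth=2).fit(
    [[22, 0], [25, 1], [47, 1], [52, 0]], ["no", "yes", "yes", "no"])
for rule in extract_rules(clf, ["age", "student"], clf.classes_):
    print(rule)
```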
BAYES CLASSIFICATION METHODS: NAIVE BAYES CLASSIFICATION
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
• A simple Bayesian classifier, known as the naive Bayesian classifier, has been found to be comparable in performance with decision tree classifiers.
• Naive Bayesian classifiers assume that the effect of an attribute
value on a given class is independent of the values of the other
attributes. This assumption is called class conditional independence.
• It is made to simplify the computations involved and, in this sense,
is considered “naive.”
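In symbols, the classifier rests on Bayes’ theorem; these are the standard formulas, with tuple X = (x1, ..., xn) and classes C1, ..., Cm, and the tuple is assigned to the class Ci with the highest posterior:

```latex
% Bayes' theorem: posterior probability that tuple X belongs to class C_i
P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}

% Class conditional independence (the "naive" assumption):
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
```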
PREDICTING A CLASS LABEL USING NAIVE BAYESIAN CLASSIFICATION
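The worked example from the original slides is not reproduced here; as a stand-in, this minimal sketch predicts a class label with scikit-learn’s CategoricalNB on a tiny hypothetical data set:

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training tuples: (age, student) -> buys_computer.
X_raw = [["youth", "no"], ["youth", "yes"], ["middle", "yes"],
         ["senior", "yes"], ["senior", "no"], ["middle", "no"]]
y = ["no", "yes", "yes", "yes", "no", "yes"]

enc = OrdinalEncoder()                   # encode categories as integers
X = enc.fit_transform(X_raw)

nb = CategoricalNB(alpha=1.0).fit(X, y)  # alpha=1.0 adds Laplace smoothing

# Predict the class label for a new, unlabeled tuple.
X_new = enc.transform([["youth", "yes"]])
print(nb.predict(X_new), nb.predict_proba(X_new))
```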
MODEL EVALUATION AND SELECTION
• Now that we know what classification is and how classifiers work, we can build a classification model.
• For example, suppose you used previous sales data to build a classifier to predict customer purchasing behaviour.
• In this example, we would like to analyse how well our model can predict the purchasing behaviour of future customers (data on which the classifier has not been trained).
• We may build different classifiers and compare their accuracy/performance by applying various evaluation metrics.
• Before we discuss the various evaluation metrics, we need to understand some basic terminology.
MODEL EVALUATION AND SELECTION
• MODEL: a model is created by applying an algorithm (or statistical calculations) to data to generate predictions/classifications of new data.
• The given data set is partitioned into subsets:
• Training data set
• Testing data set
• Training data set: used to derive, or train, the model.
• Testing data set: used to estimate the model’s accuracy.
MODEL EVALUATION AND SELECTION
• Positive tuples: tuples of the class of interest (in our last example, the positive tuples are those with buys_computer = yes).
• Negative tuples: tuples of the other class (in our last example, the negative tuples are those with buys_computer = no).
• Suppose we use our classifier on a test set of labeled tuples.
• P is the number of positive tuples and N is the number of negative tuples.
• For each tuple, we compare the classifier’s class attribute prediction with the tuple’s known class attribute value.
MODEL EVALUATION AND SELECTION
There are four additional terms we need to know:
• True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
• True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.
• False positives (FP), Type I error: These are the negative tuples that were incorrectly labeled as positive (e.g., tuples of class buys_computer = no for which the classifier predicted buys_computer = yes). Let FP be the number of false positives.
• False negatives (FN), Type II error: These are the positive tuples that were mislabeled as negative (e.g., tuples of class buys_computer = yes for which the classifier predicted buys_computer = no). Let FN be the number of false negatives.
CONFUSION MATRIX
• The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of different classes.
• TP and TN tell us when the classifier is getting things right, while FP and FN tell us when it is getting things wrong.
CONFUSION MATRIX
• E.g., suppose that in a data set of customers who buy a computer there are 10000 tuples in total, of which 7000 are positive and 3000 are negative, and our model has predicted 6954 of the positives as positive and 2588 of the negatives as negative. Prepare the confusion matrix.
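Assuming, as the numbers suggest, that the 6954 are the correctly predicted positives (TP) and the 2588 are the correctly predicted negatives (TN), the matrix works out as:

                  Predicted: yes   Predicted: no   Total
Actual: yes (P)   TP = 6954        FN = 46         7000
Actual: no  (N)   FP = 412         TN = 2588       3000
Total             7366             2634            10000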
CLASSIFIERS PERFORMANCE EVALUATION MEASURES
• Find all evaluation measures for the following confusion matrix (matrix figure not reproduced here).
CONFUSION MATRIX
• E.g., suppose that in a cancer data set there are 10000 tuples in total, of which 300 are positive and 9700 are negative, and our model has predicted 90 of the positives as positive and 9560 of the negatives as negative. Prepare the confusion matrix and find all evaluation measures for it (a worked solution follows the list of measures below).
Evaluation measures for the confusion matrix
1. Accuracy: proportion of tuples correctly classified; accuracy = (TP + TN) / (P + N)
2. Error rate: proportion of tuples misclassified; error rate = (FP + FN) / (P + N) = 1 - accuracy
3. Sensitivity (recall): ability to correctly label the positives as positive; sensitivity = TP / P
4. Specificity: ability to correctly label the negatives as negative; specificity = TN / N
5. Precision: percentage of tuples labelled positive that are actually positive; precision = TP / (TP + FP)
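Applying these formulas to the cancer example above (assuming the 90 are correctly predicted positives, so TP = 90, FN = 300 - 90 = 210, TN = 9560, FP = 9700 - 9560 = 140):
• Accuracy = (90 + 9560) / 10000 = 96.5%
• Error rate = (140 + 210) / 10000 = 3.5%
• Sensitivity = 90 / 300 = 30%
• Specificity = 9560 / 9700 ≈ 98.6%
• Precision = 90 / (90 + 140) ≈ 39.1%
Despite the high accuracy, the low sensitivity shows the model misses most actual cancer cases, which is why accuracy alone can mislead on imbalanced data.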
MODEL EVALUATION AND SELECTION METHODS
1. Holdout
2. Random sampling
3. Cross validation
4. Bootstrap
5. ROC Curves (Receiver operating characteristic curves)
HOLDOUT
• In this method, the given data are randomly partitioned into two independent sets, a training set and a test set.
• Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set.
• The training set is used to derive the model. The model’s accuracy is then estimated with the test set.
• The estimate is pessimistic (it tends to understate accuracy) because only a portion of the initial data is used to derive the model.
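A minimal holdout sketch with scikit-learn (the data set is hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # hypothetical data

# Randomly partition: two-thirds training, one-third test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```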
RANDOM SUBSAMPLING
• Random subsampling is a variation of the holdout method in which the holdout method is repeated k times.
• The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.
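A sketch of k holdout repetitions with the accuracies averaged (hypothetical data, as in the holdout snippet above):

```python
from statistics import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

k, accs = 10, []
for i in range(k):                              # repeat holdout k times
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, random_state=i)    # fresh random split each time
    accs.append(DecisionTreeClassifier().fit(X_tr, y_tr).score(X_te, y_te))

print("random subsampling accuracy estimate:", mean(accs))
```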
CROSS-VALIDATION
• In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D1, D2, ..., Dk, each of approximately equal size.
• Training and testing are performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model.
• That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set to obtain a first model, which is tested on D1.
• The second iteration is trained on subsets D1, D3, ..., Dk and tested on D2, and so on.
• Each fold is used k - 1 times for training and exactly once for testing.
• The accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
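A sketch of 10-fold cross-validation using scikit-learn’s cross_val_score (hypothetical data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # hypothetical data

# cv=10: the data are split into 10 folds; each fold serves once as test set.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print("per-fold accuracies:", scores)
print("cross-validation accuracy estimate:", scores.mean())
```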
BOOTSTRAP
• Bootstrap randomly selects a tuple from the original data set.
• That tuple is added to the training data set and is then returned to the original data set (i.e., sampling with replacement).
• This process is repeated N times, where N is the total number of tuples in the original data set.
• Because sampling is with replacement, the bootstrap may select the same tuple more than once.
• We use the training data set to train the model; the tuples that were never selected form the test data set, which is used to obtain an accuracy estimate of the model.
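A minimal sketch of one bootstrap round with NumPy, using the never-selected (out-of-bag) tuples as the test set (hypothetical data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # hypothetical data
N = len(X)

rng = np.random.default_rng(0)
idx = rng.integers(0, N, size=N)        # N draws WITH replacement
oob = np.setdiff1d(np.arange(N), idx)   # out-of-bag tuples, never selected

model = DecisionTreeClassifier().fit(X[idx], y[idx])
print("bootstrap (out-of-bag) accuracy:", model.score(X[oob], y[oob]))
```

On average about 63.2% of the original tuples appear in a bootstrap sample, so roughly 36.8% remain out-of-bag for testing.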