Module 04
CLASSIFICATION & PREDICTION
CLASSIFICATION ANALYSIS
❖ A decision tree lets us visualize the decisions being made, which makes the model easy to understand; this is why it is a popular data mining technique.
GENERAL APPROACH TO BUILD CLASSIFICATION TREE
❖ A two-step process is followed to build a classification model (a minimal sketch follows this list).
➢ In the first step, i.e. learning, a classification model is built from training data.
➢ In the second step, i.e. classification, the accuracy of the model is checked and then the model is used to classify new data.
➢ The class labels presented here are in the form of discrete values such as "yes" or "no", "safe" or "risky".
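As a hedged illustration of the two steps, the sketch below learns a classifier from training tuples and then checks its accuracy on held-out tuples before classifying a new tuple. The tuples, attribute meanings (age, income), class labels, and the choice of scikit-learn's DecisionTreeClassifier are assumptions made only for this example.

# Minimal sketch of the two-step process (learning, then classification),
# assuming Python with scikit-learn; the data below is made up.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 40000], [45, 90000], [35, 60000], [50, 120000], [28, 30000], [52, 110000]]
y = ["risky", "safe", "risky", "safe", "risky", "safe"]

# Step 1 (learning): build a classification model from training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 (classification): check accuracy, then classify new data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("new tuple:", model.predict([[40, 70000]]))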
An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes. Ideally, each resulting partition would be pure, i.e. all the tuples falling into a given partition would belong to the same class. Conceptually, the "best" splitting criterion is the one that most closely results in such a scenario.
ATTRIBUTE SELECTION MEASURES
The following are three popular attribute selection measures: INFORMATION GAIN, GAIN RATIO, and GINI INDEX.
The notation used herein is as follows.
• Let D, the data partition, be a training set of class-labeled tuples.
• Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, . . . , m).
• Let Ci,D be the set of tuples of class Ci in D.
• Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
INFORMATION GAIN
❖ This is the main method used to build decision trees. It minimizes the information needed to classify the tuples, and thus the number of tests needed to classify a given tuple. The attribute with the highest information gain is selected as the splitting attribute.
❖ ID3 uses Information Gain as its attribute selection measure.
❖ The expected information needed to classify a tuple in dataset D is given by:
Info(D) = − Σ pᵢ log₂(pᵢ), summed over the m classes (i = 1, . . . , m)
where pᵢ is the non-zero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. A log function to the base 2 is used because the information is encoded in bits. Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a₁, a₂, . . . , av}, as observed from the training data. How much more information would we still need (after the partitioning) to arrive at an exact classification? InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A:
InfoA(D) = Σ (|Dⱼ|/|D|) × Info(Dⱼ), summed over the v partitions (j = 1, . . . , v), where Dⱼ is the subset of tuples in D taking value aⱼ on A.
Information gain is then the difference Gain(A) = Info(D) − InfoA(D), and the attribute with the highest Gain(A) is chosen as the splitting attribute.
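The sketch below computes Info(D), InfoA(D), and Gain(A) exactly as defined above, in plain Python. The tiny outlook/play dataset is hypothetical and only illustrates the arithmetic.

# Minimal sketch of the Info(D), InfoA(D), and Gain(A) computations above.
import math
from collections import Counter

def info(labels):
    # Info(D) = -sum_i p_i * log2(p_i), with p_i = |C_i,D| / |D|
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(values, labels):
    # Gain(A) = Info(D) - InfoA(D), where InfoA(D) weights Info(D_j)
    # by the size of each partition D_j induced by attribute A
    total = len(labels)
    partitions = {}
    for v, c in zip(values, labels):
        partitions.setdefault(v, []).append(c)
    info_a = sum((len(p) / total) * info(p) for p in partitions.values())
    return info(labels) - info_a

outlook = ["sunny", "sunny", "overcast", "rain", "rain"]   # hypothetical attribute values
play    = ["no",    "no",    "yes",      "yes",  "no"]     # hypothetical class labels
print("Gain(outlook) =", round(gain(outlook, play), 3))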
#1) Pre Pruning: The attribute selection measures are used to find out the weightage of a split. Threshold values are prescribed to decide which splits are regarded as useful; if partitioning at a node produces a split that falls below the threshold, tree construction is halted at that node.
#2) Post Pruning: This method removes the outlier branches from a fully grown tree. The
unwanted branches are removed and replaced by a leaf node denoting the most frequent class label.
This technique requires more computation than prepruning; however, it is more reliable.
Pruned trees are more precise and compact than unpruned trees, but they may still suffer from repetition and replication.
Repetition occurs when the same attribute is tested again and again along a branch of the tree.
Replication occurs when duplicate subtrees are present within the tree. These issues can be addressed by using multivariate splits (splits based on a combination of attributes).
(Figure: an unpruned tree and the corresponding pruned tree.)
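As a hedged illustration of the two pruning approaches above, the sketch below maps them onto scikit-learn options. This is only an analogy (the library does not implement the textbook procedures verbatim), and the Iris dataset, max_depth, min_samples_split, and ccp_alpha values are arbitrary choices for demonstration.

# Minimal sketch contrasting prepruning-style and postpruning-style options.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning-style: halt construction early with thresholds on depth / node size.
prepruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X, y)

# Postpruning-style: grow the full tree, then prune subtrees back
# (minimal cost-complexity pruning via ccp_alpha).
postpruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print("prepruned depth:", prepruned.get_depth(), "postpruned depth:", postpruned.get_depth())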
Constructing a Decision Tree
Let us take an example of a weather dataset for the last 10 days, with attributes outlook, temperature, wind, and humidity. The outcome variable is whether cricket is played or not. We will use the ID3 algorithm to build the decision tree (a worked sketch follows the lists below).
https://jcsites.juniata.edu/faculty/rhodes/ida/decisionTrees.html
Advantages Of Decision Tree Classification
1. Decision tree classification does not require any domain knowledge; hence, it is appropriate for the knowledge discovery process.
2. The representation of data in the form of a tree is easily understood by humans and is intuitive.
3. It can handle multidimensional data.
4. It is a quick process with great accuracy.
Disadvantages Of Decision Tree Classification
Given below are the various demerits of Decision Tree Classification:
1. Sometimes decision trees become very complex; these are called overfitted trees.
2. The decision tree algorithm may not produce an optimal solution.
3. The decision trees may return a biased solution if some class label dominates the data.
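Below is the worked sketch referenced above, assuming Python with pandas and scikit-learn. The ten rows are made-up stand-ins for the "last 10 days" weather data, and DecisionTreeClassifier with criterion="entropy" only approximates ID3 (scikit-learn builds CART-style binary trees rather than textbook ID3 multiway trees).

# Sketch: build and print a tree on a hypothetical 10-day weather dataset.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "outlook":     ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast", "sunny", "sunny", "rain"],
    "temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool", "mild", "cool", "mild"],
    "humidity":    ["high", "high", "high", "high", "normal", "normal", "normal", "high", "normal", "normal"],
    "wind":        ["weak", "strong", "weak", "weak", "weak", "strong", "strong", "weak", "weak", "weak"],
    "play":        ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"],
})

features = ["outlook", "temperature", "humidity", "wind"]
X = OrdinalEncoder().fit_transform(data[features])   # encode categorical attributes as numbers
y = data["play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))      # text rendering of the learned tree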
BAYESIAN CLASSIFICATION
❖ The Bayes Theorem was proposed by Thomas Bayes, after whom it is named.
❖ It is a statistical method and a supervised learning method for classification.
❖ It can solve problems involving both categorical and continuous valued attributes.
"What are Bayesian classifiers?" Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence. It is made to simplify the computations involved and, in this sense, is considered "naive." (A minimal sketch of this computation follows below.)
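The sketch below scores each class Ci as P(Ci) multiplied by the product of P(xk | Ci) over the attributes, i.e. it uses the class-conditional independence assumption described above. The tiny dataset is hypothetical, and no smoothing is applied, so attribute values unseen for a class get probability zero.

# Minimal sketch of the naive Bayesian computation (frequency estimates, no smoothing).
from collections import Counter, defaultdict

rows = [
    {"outlook": "sunny",    "wind": "weak"},
    {"outlook": "sunny",    "wind": "strong"},
    {"outlook": "rain",     "wind": "weak"},
    {"outlook": "overcast", "wind": "weak"},
]
labels = ["no", "no", "yes", "yes"]

priors = Counter(labels)                     # counts of each class Ci
cond = defaultdict(Counter)                  # (class, attribute) -> value counts
for row, c in zip(rows, labels):
    for attr, value in row.items():
        cond[(c, attr)][value] += 1

def score(row, c):
    # unnormalized posterior: P(Ci) * product_k P(x_k | Ci)
    p = priors[c] / len(labels)
    for attr, value in row.items():
        p *= cond[(c, attr)][value] / priors[c]
    return p

query = {"outlook": "sunny", "wind": "weak"}
print(max(priors, key=lambda c: score(query, c)))   # predicted class label for the query tuple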
PREDICTION METHODS
LINEAR AND NONLINEAR REGRESSION
❖ Linear regression is the simplest form of regression. It attempts to model the relationship between two variables with a linear model.
❖ The relationship between the dependent variable and the single independent variable is given by a straight line:
Y = a + bX
❖ The value of 'Y' increases or decreases in a linear manner as the value of 'X' changes.
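A minimal sketch of fitting Y = a + bX by ordinary least squares is shown below; the x and y values are made up purely for illustration.

# Estimate the slope b and intercept a of the best-fitting straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

print(f"Y = {a:.2f} + {b:.2f}X")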
The classifier made a total of 165 predictions (e.g., 165 patients were tested for the presence of a disease).
Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
In reality, 105 patients in the sample have the disease, and 60 patients do not.
true positives (TP): We predicted yes, and they do have the disease.
true negatives (TN): We predicted no, and they don't have the disease.
false positives (FP): We predicted yes, but they don't actually have the disease.
(Also known as a "Type I error.")
false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a "Type II error.")
CONFUSION MATRIX
Accuracy: Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate: Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09
equivalent to 1 minus Accuracy; also known as "Error Rate"
True Positive Rate: When it's actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95
also known as "Sensitivity" or "Recall"
False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
True Negative Rate: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83
equivalent to 1 minus False Positive Rate; also known as "Specificity"
Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
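The sketch below simply recomputes all of the metrics above from the counts in this example (TP = 100, TN = 50, FP = 10, FN = 5), to make the formulas concrete.

# Recompute the confusion-matrix metrics from the example counts.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                      # 165 predictions

accuracy    = (TP + TN) / total                # 0.91
error_rate  = (FP + FN) / total                # 0.09
recall      = TP / (TP + FN)                   # 0.95  true positive rate / sensitivity
fpr         = FP / (FP + TN)                   # 0.17  false positive rate
specificity = TN / (TN + FP)                   # 0.83  true negative rate
precision   = TP / (TP + FP)                   # 0.91
prevalence  = (TP + FN) / total                # 0.64

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} specificity={specificity:.2f}")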