Unit 3
Definition:
Classification is the task of learning a target function ‘f’ that maps each attribute set ‘x’ to one of
the predefined class labels ‘y’.
The attribute set ‘x’ can contain any number of attributes, and the attributes can be binary,
categorical or continuous. The class label ‘y’ must be a discrete attribute; i.e., either binary or
categorical (nominal or ordinal).
Classification models:
--A classification model can be used as a predictive model to predict the class label of unknown
records.
Applications:
i) Detecting spam email messages based upon the message header and content.
ii) Classifying galaxies based upon their shapes.
iii) Classifying students based on their grades.
iv) Classifying patients according to their medical records.
v) Classification can be used in credit approval.
--Examples are decision tree classifiers, rule-based classifiers, neural networks, support vector
machines and naïve Bayes classifier.
--Each technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and the class label of the input data.
--A training set, consisting of records whose class labels are known, must be provided. The training
set is used to build a classification model, which is then applied to the test set. The test set consists of
records whose class labels are unknown.
--Evaluation of the performance of a classification model is based on the counts of test records
correctly and incorrectly predicted by the model.
--These counts are tabulated in a table known as a confusion matrix. Each entry fij in the matrix
denotes the number of records from class ‘i’ predicted to be of class ‘j’.
--For example, f01 refers to the number of records from class 0 incorrectly predicted as class 1.
--Based on the entries in the confusion matrix, the total number of correct predictions made by
the model is (f11+f00) and the total number of incorrect predictions is (f01+f10).
--Although a confusion matrix provides the information needed to determine how well a
classification model performs, summarizing this information with a single number would make it
more convenient to compare the performance of different models.
Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
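As a quick illustration, the short Python sketch below computes accuracy from the four confusion-matrix counts defined above; the counts themselves are hypothetical numbers chosen only for the example.

```python
# Minimal sketch: accuracy of a two-class model from its confusion matrix.
# The counts below are hypothetical, chosen only to illustrate the formula.
f11, f00 = 40, 45   # correctly predicted class 1 and class 0 records
f10, f01 = 5, 10    # class 1 predicted as 0, and class 0 predicted as 1

correct = f11 + f00
total = f11 + f10 + f01 + f00
accuracy = correct / total
print(f"Accuracy = {correct}/{total} = {accuracy:.2f}")   # 85/100 = 0.85
```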
Decision Tree Induction: Decision tree induction is a technique for building a classification model
that can assign class labels to unknown records.
In this example, we classify whether a vertebrate is a mammal or a non-mammal. From the
decision tree, we can identify a new vertebrate as a mammal or a non-mammal. If the vertebrate is
cold-blooded, then it is a non-mammal. If the vertebrate is warm-blooded, then check the next
node, which tests whether it gives birth. If it gives birth, then it is a mammal; otherwise, it is a non-mammal.
--There are various algorithms devised for constructing a decision tree. They are:
i) Hunt’s algorithm
ii) ID3 (Iterative Dichotomiser 3)
iii) C4.5
iv) CART (Classification and Regression Trees)
--These algorithms usually employ a greedy strategy that grows a decision tree by making a
series of locally optimum decisions about which attribute to use for partitioning the data. One
such algorithm is Hunt’s algorithm.
Hunt’s algorithm
--In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training
records into subsets.
--Let Dt be the set of training records associated with node t and y = {y1, y2, …, yc} be the set of
class labels.
STEP 1
If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
STEP 2
If Dt contains records that belong to more than one class, an attribute test condition is selected to
partition the records into smaller subsets. A child node is created for each outcome and the
records in Dt are distributed based on the outcomes. The algorithm is then recursively applied for
each node.
Fig: Training set for predicting borrowers who will default on loan payments
--In the above data set, the class labels of the 10 records are not all the same, so step 1 is not
satisfied. We need to construct the decision tree using step 2.
--The majority of the records have the class label “no”, so the root node is initially labeled “no”.
--Select one of the attributes as the root node, say home owner, since the records with home owner = yes
do not require any further splitting. There are 3 records with home owner = yes and 7 records
with home owner = no.
--The records with home owner = yes are classified, and we now need to classify the other 7 records,
i.e., those with home owner = no. The attribute test condition can be applied either on marital status or
on annual income.
--Let us select marital status, to which we apply a binary split. Here the records with marital status =
married do not require further splitting.
--The records with marital status = married are classified, and we now need to classify the other 4
records, i.e., those with home owner = no and marital status = single or divorced.
--The remaining attribute is annual income. Since it is a continuous attribute, we select a split range
(threshold) for it.
--Now the other 4 records are also classified.
i) It is possible for some of the child nodes created in step 2 to be empty; i.e., there are no
records associated with these nodes. In such cases, assign the node the same class label as the
majority class of the training records associated with its parent node; i.e., in our example the
majority class is “no”, so assign ‘no’ to the new record.
ii) If all the records in Dt have identical attribute values but different class labels, the records
cannot be split any further; in such cases, assign the majority class label. (A Python sketch of
the overall procedure is given after this list.)
The following are the methods for expressing attribute test conditions. They are:
i) Binary attributes: The test condition for a binary attribute generates two possible outcomes, as
shown below:
ii) Nominal attributes: Since a nominal attribute can have many values, its test condition
can be expressed in two ways, as shown below:
For a multi way split, the number of outcomes depends on the number of distinct
values for the corresponding attribute.
Some algorithms, such as CART, support only binary splits. In such cases the k attribute
values can be partitioned into 2^(k-1) - 1 distinct binary groupings (see the enumeration
sketch after this list).
For example, marital status has 3 values, so it can be split in 2^(3-1) - 1 = 3 ways.
iii) Ordinal attribute: It can also produce binary or multi way splits. Ordinal attribute
values can be grouped as long as the grouping does not violate the order property of
the attribute values.
In the above example, conditions (a) and (b) preserve the order property, but condition (c)
violates it.
iv) Continuous attributes: The test condition can be expressed as a comparison test (A < v)
or (A >= v) with binary outcomes, or as a range query with outcomes of the form
vi <= A < vi+1, for i = 1, 2, …, k.
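To make the 2^(k-1) - 1 count concrete, the small sketch below (a hypothetical helper written only for this illustration) enumerates the distinct binary groupings of a nominal attribute such as marital status.

```python
from itertools import combinations

def binary_groupings(values):
    """Enumerate the 2^(k-1) - 1 distinct two-way partitions of a value set."""
    values = list(values)
    groupings = []
    for size in range(1, len(values)):
        for left in combinations(values, size):
            # Fixing the first value on the left side avoids counting each
            # partition twice (left/right order does not matter).
            if values[0] in left:
                right = tuple(v for v in values if v not in left)
                groupings.append((left, right))
    return groupings

for left, right in binary_groupings(["single", "married", "divorced"]):
    print(left, "vs", right)
# Prints 3 groupings, matching 2^(3-1) - 1 = 3 for marital status.
```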
Measures for selecting the best split:
There are many measures that can be used to determine the best way to split the records.
Let P(i|t) denote the fraction of records belonging to class i at node t. The measures for selecting
the best split are often based on the degree of impurity of the child nodes. The smaller the
degree of impurity, the more skewed the class distribution. For example, a node with class
distribution (0,1) has zero impurity, whereas a node with a uniform class distribution (0.5,0.5) has
the highest impurity.
Compare the degree of impurity of the parent node with the degree of impurity of the child nodes.
The larger their difference, the better the test condition. The gain, ∆, is a criterion that can be
used to determine the goodness of a split:
∆ = I(parent) - Σ (j = 1 to k) [ N(vj) / N ] × I(vj)
where I(.) is the impurity measure of a given node, N is the total number of records at the parent
node, k is the number of attribute values (outcomes of the split), and N(vj) is the number of records
associated with child node vj. When entropy is used as the impurity measure, the difference in
entropy is known as the information gain, ∆info.
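The sketch below shows how the gain can be computed for a candidate split, using either the Gini index or entropy as the impurity measure I(.). The parent and child class counts at the end are hypothetical numbers chosen only to demonstrate the calculation.

```python
import math

def gini(counts):
    """Gini impurity of a node, given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy of a node, given its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts, child_counts_list, impurity=gini):
    """Gain = I(parent) - sum over children of (N(vj)/N) * I(vj)."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child) for child in child_counts_list)
    return impurity(parent_counts) - weighted

# Hypothetical binary split: parent (6, 6) split into children (1, 4) and (5, 2).
parent, children = [6, 6], [[1, 4], [5, 2]]
print(round(gain(parent, children, gini), 3))     # gain using the Gini index
print(round(gain(parent, children, entropy), 3))  # information gain (entropy)
```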
Suppose there are two ways to split the data into smaller subsets, say by attribute A or by attribute B.
Before splitting, the Gini index is 0.5 since there are equal numbers of records from both classes.
For attribute B, the average weighted Gini index is 0.375. Since the subsets for attribute B have a
smaller Gini index than those for attribute A, attribute B is preferable.
For nominal attributes, the computation of the Gini index is the same as for binary attributes. The
split with the smaller average Gini index is the better split. In our example, the multi-way split has
the lowest Gini index, so it is the best split.
For a continuous attribute, the distinct values are first sorted in ascending order and the candidate
split positions are taken between successive values.
Calculate the Gini index for every split position; the split position with the smallest Gini index is
chosen as the threshold (range) for the continuous attribute.
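A sketch of this search is shown below: the distinct values are sorted, a candidate threshold is placed midway between successive values, and the threshold with the smallest weighted Gini index is kept. The (annual income, class) pairs are invented for the illustration and do not come from the data set above.

```python
def gini(counts):
    """Gini impurity from per-class counts (0.0 for an empty node)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_threshold(records):
    """Choose the split threshold for a continuous attribute by weighted Gini.
    records: list of (value, class_label) pairs; class labels are 'yes'/'no'."""
    n = len(records)
    values = sorted({v for v, _ in records})
    best = (float("inf"), None)
    # Candidate split positions lie midway between successive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [c for v, c in records if v <= t]
        right = [c for v, c in records if v > t]
        weighted = (len(left) / n) * gini([left.count("yes"), left.count("no")]) \
                 + (len(right) / n) * gini([right.count("yes"), right.count("no")])
        best = min(best, (weighted, t))
    return best  # (smallest weighted Gini index, chosen threshold)

# Hypothetical annual income values (in thousands) with default = yes/no labels.
data = [(60, "no"), (70, "no"), (75, "no"), (85, "yes"),
        (95, "yes"), (100, "no"), (120, "no"), (220, "no")]
print(best_threshold(data))
```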
1. Bayes’ Theorem:
Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):
P(c|x) = P(x|c) × P(c) / P(x)
Above,
P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes),
P(c) is the prior probability of the class,
P(x|c) is the likelihood, i.e., the probability of the predictor given the class, and
P(x) is the prior probability of the predictor.
Let’s understand it using an example. Below I have a training data set of weather and the
corresponding target variable ‘Play’ (suggesting the possibility of playing). Now, we need to classify
whether players will play or not based on the weather conditions. Let’s follow the steps below to
perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities; for example, the probability of
Overcast is 0.29 and the probability of playing is 0.64.
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of prediction.
Problem: Players will play if the weather is sunny. Is this statement correct? We can solve it using
the posterior probability P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny).
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36 and P(Yes) = 9/14 = 0.64. Now,
P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher probability, so the prediction is that
players will play when the weather is sunny.
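The same three probabilities can be plugged into Bayes’ theorem in a couple of lines of Python; the numbers are taken directly from the worked example above.

```python
# Bayes' theorem for "will players play if the weather is sunny?"
p_sunny_given_yes = 3 / 9   # P(Sunny | Yes), read from the likelihood table
p_yes = 9 / 14              # P(Yes), prior probability of playing
p_sunny = 5 / 14            # P(Sunny), probability of the evidence

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # ~0.6, so "play" is the more likely class
```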
Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes. This algorithm is mostly used in text classification and in problems having multiple
classes.
Real time Prediction: Naive Bayes is an eager learning classifier and it is fast. Thus, it can be
used for making predictions in real time.
Multi class Prediction: This algorithm is also well known for its multi-class prediction feature. Here we
can predict the probability of multiple classes of the target variable.
Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers are mostly used in text
classification (due to better results in multi-class problems and the independence assumption) and have a
higher success rate compared to other algorithms. As a result, they are widely used in spam filtering
(identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and
negative customer sentiments).
Recommendation System: A Naive Bayes classifier and collaborative filtering together build a
recommendation system that uses machine learning and data mining techniques to filter unseen
information and predict whether a user would like a given resource or not.
Consider a fictional dataset that describes the weather conditions for playing a game of Cricket.
Given the weather conditions, each tuple classifies the conditions as fit(“Yes”) or unfit(“No”) for
playing Cricket.
The dataset is divided into two parts, namely, feature matrix and the response vector.
Feature matrix contains all the vectors (rows) of the dataset, in which each vector
consists of the values of the predictor features. In the above dataset, the features are ‘Outlook’,
‘Temperature’, ‘Humidity’ and ‘Windy’.
Response vector contains the value of class variable (prediction or output) for each
row of feature matrix. In above dataset, the class variable name is ‘Play Cricket’.
Bayes’ Theorem: Bayes’ theorem finds the probability of an event occurring given the probability of
another event that has already occurred:
P(A | B) = P(B | A) × P(A) / P(B)
where A and B are events and P(B) ≠ 0.
Basically, we are trying to find the probability of event A, given that event B is true. Event B is also
termed the evidence.
P(A) is the prior probability of A (i.e., the probability of the event before the evidence is seen). The
evidence is an attribute value of an unknown instance (here, it is event B). P(A | B) is the posterior
probability of A, i.e., the probability of the event after the evidence is seen.
Applying Bayes’ theorem to the class variable y and a feature vector (x1, …, xn), and assuming the
features are independent given the class, we get:
P(y | x1, …, xn) = P(y) × Π (i = 1 to n) P(xi | y) / [ P(x1) × P(x2) × … × P(xn) ]
Now, as the denominator remains constant for a given input, we can remove that term:
P(y | x1, …, xn) ∝ P(y) × Π (i = 1 to n) P(xi | y)
Now, we need to create a classifier model. For this, we find the probability of a given set of
inputs for all possible values of the class variable y and pick the output with the maximum
probability:
y = argmax_y P(y) × Π (i = 1 to n) P(xi | y)
So, finally, we are left with the task of calculating P(y) and P(xi | y).
Please note that
P(y) is also called class probability and
P(xi | y) is called conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the
distribution of P(xi | y).
Let us try to apply the above formula manually on our weather dataset. For this, we need to do
some pre-computations on our dataset.
We need to find P(xi | yj) for each xi in X and yj in y. All these calculations have been
demonstrated in the tables below:
So, in the tables above, we have calculated P(xi | yj) for each xi in X and yj in y manually.
For example, the probability of playing cricket given that the temperature is cool is
P(temp. = cool | play Cricket = Yes) = 3/9.
Also, we need to find the class probabilities P(y), which have been calculated in table 5.
For example, P(play Cricket = Yes) = 9/14.
So now, we are done with our pre-computations and the classifier is ready!
Let us test it on a new set of features (let us call it today):
Today = ( Sunny, Hot, Normal, False )
So, the probability of playing cricket is given by:
P(Yes | today) ∝ P(Sunny | Yes) × P(Hot | Yes) × P(Normal | Yes) × P(False | Yes) × P(Yes)
= (2/9) × (2/9) × (6/9) × (6/9) × (9/14) ≈ 0.0141
and the probability of not playing cricket is given by:
P(No | today) ∝ P(Sunny | No) × P(Hot | No) × P(Normal | No) × P(False | No) × P(No)
= (3/5) × (2/5) × (1/5) × (2/5) × (5/14) ≈ 0.0068
These numbers can be converted into probabilities by making the sum equal to 1 (normalization),
since P(Yes | today) + P(No | today) = 1:
P(Yes | today) = 0.0141 / (0.0141 + 0.0068) ≈ 0.67
P(No | today) = 0.0068 / (0.0141 + 0.0068) ≈ 0.33
Since P(Yes | today) > P(No | today), the prediction is that cricket will be played today.
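The whole prediction for ‘today’ can also be written as a short Naive Bayes routine, sketched below. The prior and conditional probabilities are assumed to match the likelihood tables discussed above (e.g., P(Sunny | No) = 3/5, P(Normal | No) = 1/5); with a different training table the numbers would change.

```python
# Naive Bayes prediction for today = (Sunny, Hot, Normal, False).
# The probabilities below are assumptions matching the worked example above.
priors = {"Yes": 9 / 14, "No": 5 / 14}                      # P(y)
likelihoods = {                                             # P(xi | y)
    "Yes": {"Sunny": 2 / 9, "Hot": 2 / 9, "Normal": 6 / 9, "False": 6 / 9},
    "No":  {"Sunny": 3 / 5, "Hot": 2 / 5, "Normal": 1 / 5, "False": 2 / 5},
}
today = ["Sunny", "Hot", "Normal", "False"]

scores = {}
for label, prior in priors.items():
    score = prior
    for feature_value in today:
        score *= likelihoods[label][feature_value]
    scores[label] = score                    # proportional to P(y | today)

total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}   # normalize to 1
print(posteriors)                            # roughly {'Yes': 0.67, 'No': 0.33}
print("Prediction:", max(posteriors, key=posteriors.get))        # 'Yes'
```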
A Bayesian Belief Network (BBN) is a special type of diagram (called a directed graph) together with
an associated set of probability tables. They are also known as Belief Networks, Bayesian Networks, or
Probabilistic Networks.
The graph consists of nodes and arcs. The nodes represent variables, which can be discrete or
continuous. The arcs represent causal relationships between variables.
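As a tiny illustration of the idea (a hypothetical Rain -> WetGrass network, not an example from the text), the sketch below encodes a two-node network as probability tables and uses them to compute a joint probability and a simple posterior.

```python
# Hypothetical two-node Bayesian belief network: Rain -> WetGrass.
# Each node stores its probability table, conditioned on its parent's value.
p_rain = {True: 0.2, False: 0.8}                      # P(Rain)
p_wet_given_rain = {True:  {True: 0.9, False: 0.1},   # P(WetGrass | Rain=True)
                    False: {True: 0.2, False: 0.8}}   # P(WetGrass | Rain=False)

def joint(rain, wet):
    """P(Rain=rain, WetGrass=wet) = P(Rain) * P(WetGrass | Rain)."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# P(Rain=True | WetGrass=True), obtained by summing out the other case.
evidence = sum(joint(r, True) for r in (True, False))
print(round(joint(True, True) / evidence, 2))   # ~0.53
```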