
UNIT 4: CLASSIFICATION

Classification: Basic concepts:

Definition:

Classification is the task of learning a target function ‘f’ that maps each attribute set ‘x’ to one of
the predefined class label ‘y’.

In this attribute set ‘x’ can be any number of attributes and the attributes can be binary,
categorical and continuous. The class label ‘y’ must be a discrete attribute; i.e., either binary or
categorical (nominal or ordinal).

Classification models:

--Descriptive modeling is a classification model used for summarizing the data.

--Predictive modeling is a classification model used to predict the class label of unknown
records.

Applications:
i) Detecting spam email messages based upon the message header and content.
ii) Classifying galaxies based upon their shapes.
iii) Classifying the Students based on their Grades.
iv) Classifying the Patients according to their Medical records.
v) Classification can be used in credit approval.

General approach to solve a classification problem:

--A classification technique is a systematic approach to build classification models based on a data set.

--Examples are decision tree classifiers, rule-based classifiers, neural networks, support vector
machines and naïve Bayes classifier.

--Each technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and the class label of the input data.

--A training set, consisting of records whose class labels are known, must be provided. The training set is used to build a classification model, which is then applied to the test set. The test set consists of records whose class labels are unknown.

--Evaluation of the performance of a classification model is based on the counts of test records
correctly and incorrectly predicted by the model.

--These counts are tabulated in a table known as confusion matrix.


                          Predicted Class
                          Class=1      Class=0
Actual Class   Class=1    f11          f10
               Class=0    f01          f00

--Each entry fij in the table denotes the number of records from the class ‘i’ predicted to be of
class ‘j’.

--For example, f01 refers to the number of records from class 0 incorrectly predicted as class 1.

--Based on the entries in the confusion matrix, the total number of correct predictions made by
the model is (f11+f00) and the total number of incorrect predictions is (f01+f10).

--Although a confusion matrix provides the information needed to determine how well a
classification model performs, summarizing this information with a single number would make it
more convenient to compare the performance of different models.

--This can be done using a performance metric.

--Accuracy can be expressed as:

Accuracy = Number of correct predictions / Total number of predictions

Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)

--Equivalently, error rate can be expressed as:

Error rate = Number of wrong predictions / Total number of predictions

Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)
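The snippet below is a minimal Python sketch (not from the original notes) that computes accuracy and error rate from the four confusion-matrix counts; the example counts are hypothetical.

def accuracy_and_error(confusion):
    """confusion[i][j] = number of test records of actual class i predicted as class j."""
    (f11, f10), (f01, f00) = confusion          # row 0: actual class 1, row 1: actual class 0
    total = f11 + f10 + f01 + f00
    accuracy = (f11 + f00) / total              # correct predictions / all predictions
    return accuracy, 1 - accuracy               # error rate = 1 - accuracy

# Hypothetical counts: 40 + 45 correct predictions out of 100 test records.
print(accuracy_and_error([[40, 10], [5, 45]]))  # (0.85, 0.15)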

Decision Tree Induction: Decision tree induction is a technique for building classification models that predict the class labels of unknown records. The topics covered are:

--Working of decision tree

--Building a decision tree

--Methods for expressing attribute test conditions

--Measures for selecting the best split

--Algorithm for decision tree induction

Working of a decision tree:

The tree has three types of nodes.


i) A root node has no incoming edges and zero or more outgoing edges.
ii) Internal nodes, each of which has exactly one incoming edge and two or more outgoing
edges.
iii) Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing
edges.

Fig: A decision tree for mammal classification problem

In this example, we are classifying whether a vertebrate is a mammal or a non-mammal. Using this decision tree, we can identify a new vertebrate as a mammal or a non-mammal. If the vertebrate is cold-blooded, then it is a non-mammal. If the vertebrate is warm-blooded, then check the next test condition, Gives Birth. If it gives birth, then it is a mammal; otherwise, it is a non-mammal.

Fig: Classifying an unlabelled vertebrate


Building of a decision tree:

--There are various algorithms devised for constructing a decision tree. They are:

i) Hunt’s algorithm
ii) ID3 (Iterative Dichotomiser 3)
iii) C4.5 (successor of ID3)
iv) CART (Classification and Regression Trees)

--These algorithms usually employ a greedy strategy that grows a decision tree by making a
series of locally optimum decisions about which attribute to use for partitioning the data. One
such algorithm is Hunt’s algorithm.

Hunt’s algorithm

--In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training
records into subsets.

--Let Dt be the set of training records that are associated with node t and y = {y1, y2, ..., yc} be the class labels.

--The recursive procedure for Hunt’s algorithm is as follows:

STEP 1

If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.

STEP 2

If Dt contains records that belong to more than one class, an attribute test condition is selected to
partition the records into smaller subsets. A child node is created for each outcome and the
records in Dt are distributed based on the outcomes. The algorithm is then recursively applied for
each node.

Fig: Training set for predicting borrowers who will default on loan payments
--In the above data set, the class labels of the 10 records are not all the same, so step 1 is not satisfied. We need to construct the decision tree using step 2.

--The majority of the records have the class label "no", so the initial node is labeled as follows:

--Select one of the attributes as the root node, say, Home Owner, since the records with Home Owner = Yes do not require any further splitting. There are 3 records with Home Owner = Yes and 7 records with Home Owner = No.

--The records with Home Owner = Yes are classified, and we now need to classify the other 7 records, i.e., those with Home Owner = No. The attribute test condition can be applied either on Marital Status or on Annual Income.

--Let us select Marital Status, where we apply a binary split. Here the records with Marital Status = Married do not require further splitting.

--The records with Marital Status = Married are classified, and we now need to classify the other 4 records, i.e., those with Home Owner = No and Marital Status = Single or Divorced.

--The remaining attribute is Annual Income. Here we select a split range, since it is a continuous attribute.

--Now the other 4 records are also classified.

Additional conditions are needed to handle some special cases:

i) It is possible for some of the child nodes created in step 2 to be empty, i.e., there are no
   records associated with these nodes. In such cases, assign the node the majority class label of
   the training records associated with its parent node; in our example the majority class is
   "no", so a new record reaching such a node is assigned "no".
ii) If all the records in Dt have identical attribute values but different class labels, assign
   the majority class label (a short sketch of this majority-class rule is given below).
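As a minimal illustration (not from the original notes), the helper below applies the majority-class rule used in both special cases; the representation of records as dictionaries with a "class" key is an assumption.

from collections import Counter

def majority_label(records, parent_records=None):
    """Leaf label: the majority class of the records; if the node is empty,
    fall back to the majority class of the parent node's records (case i)."""
    labels = [r["class"] for r in (records or parent_records)]
    return Counter(labels).most_common(1)[0][0]

# For the loan data set above: 7 "No" records and 3 "Yes" records -> label "No".
print(majority_label([{"class": "No"}] * 7 + [{"class": "Yes"}] * 3))   # No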

Methods for expressing attribute test conditions:

The following are the methods for expressing attribute test conditions. They are:

i) Binary attribute: The test condition for a binary attribute generates two possible outcomes, as
   shown below:

ii) Nominal attributes: Since a nominal attribute can have many values, its test condition
    can be expressed in two ways, as shown below.
    For a multi-way split, the number of outcomes depends on the number of distinct
    values of the corresponding attribute.

    Some algorithms, such as CART, support only binary splits. In that case we can
    partition the k attribute values into 2^(k-1) - 1 distinct binary groupings (see the sketch
    after this list).
    For example, marital status has 3 attribute values, so we can split it in 2^(3-1) - 1 = 3 ways.
iii) Ordinal attribute: It can also produce binary or multi-way splits. Ordinal attribute
     values can be grouped as long as the grouping does not violate the order property of
     the attribute values.

     In the above example, conditions (a) and (b) satisfy the order property, but condition (c)
     violates it.

iv) Continuous attributes: The test condition can be expressed as a comparison test (A < v)
    or (A >= v) with binary outcomes, or as a range query with outcomes of the form
    v_i <= A < v_(i+1), for i = 1, 2, ..., k.
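To make the 2^(k-1) - 1 count for binary splits of a nominal attribute concrete, here is a minimal Python sketch (an illustration, not part of the original notes) that enumerates the distinct binary groupings of the marital status values.

from itertools import combinations

def binary_splits(values):
    """Enumerate the 2**(k-1) - 1 distinct binary groupings of a nominal attribute."""
    values = list(values)
    first, rest = values[0], values[1:]
    splits = []
    # Fix the first value on the left side so each grouping is counted exactly once.
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            left = {first, *combo}
            right = set(values) - left
            if right:                              # skip the grouping with an empty side
                splits.append((left, right))
    return splits

for left, right in binary_splits(["Single", "Married", "Divorced"]):
    print(left, "vs", right)                       # 2**(3-1) - 1 = 3 groupings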
Measures for selecting the best split:

There are many measures that can be used to determine the best way to split the records.

Let P(i|t) denote the fraction of records belonging to class i at a node t. The measures for
selecting the best split are often based on the degree of impurity of the child nodes: the smaller
the degree of impurity, the more skewed the class distribution. For example, a node with class
distribution (0, 1) has zero impurity, whereas a node with a uniform class distribution (0.5, 0.5)
has the highest impurity.

Examples of impurity measures include:


The three measures attain their maximum values when the class distribution is uniform and their
minimum value (zero) when all the records belong to the same class.
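The impurity formulas appear only as figures in the original notes. Assuming the three measures are entropy, the Gini index, and classification error (as is standard), the following sketch computes them for a node's class distribution and checks the statement above.

import math

def entropy(p):
    """Entropy: sum of p_i * log2(1/p_i) over the classes present at the node."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def gini(p):
    """Gini index: 1 minus the sum of squared class fractions."""
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    """Classification error: 1 minus the largest class fraction."""
    return 1 - max(p)

# A pure node (0, 1) gives 0 for all three measures; a uniform node (0.5, 0.5)
# gives the maximum values (1.0, 0.5 and 0.5 respectively).
for dist in [(0.0, 1.0), (0.5, 0.5)]:
    print(dist, entropy(dist), gini(dist), classification_error(dist))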

Compare the degree of impurity of the parent node with the degree of impurity of the child nodes.
The larger their difference, the better the test condition. The gain, ∆, is a criterion that can be
used to determine the goodness of a split:

∆ = I(parent) − Σ [ N(vj)/N ] I(vj), where the sum runs over the k child nodes vj

Here I(.) is the impurity measure of a given node, N is the total number of records at the parent
node, k is the number of attribute values (child nodes), and N(vj) is the number of records
associated with child node vj. When entropy is used as the impurity measure, the difference in
entropy is known as the information gain, ∆info.

Splitting of binary attributes

Suppose there are two attributes, A and B, each of which splits the data into smaller subsets.
Before splitting, the Gini index is 0.5, since there are equal numbers of records from both classes.

For attribute A,

For node N1, the Gini index is 1 - [(4/7)^2 + (3/7)^2] = 0.4898

For node N2, the Gini index is 1 - [(2/5)^2 + (3/5)^2] = 0.48

The weighted average Gini index is (7/12)(0.4898) + (5/12)(0.48) = 0.486

For attribute B, the weighted average Gini index is 0.375. Since the subsets for attribute B have a
smaller Gini index than those for A, attribute B is preferable.
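A quick Python check of the weighted Gini computation for attribute A above (the class counts 4/3 and 2/3 at nodes N1 and N2 are taken from the calculations just shown):

def gini_from_counts(counts):
    """Gini index computed from the class counts at a node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

n1, n2 = (4, 3), (2, 3)                      # class counts at child nodes N1 and N2
g1, g2 = gini_from_counts(n1), gini_from_counts(n2)
total = sum(n1) + sum(n2)                    # 12 records at the parent node
weighted = (sum(n1) / total) * g1 + (sum(n2) / total) * g2
print(round(g1, 4), round(g2, 4), round(weighted, 3))   # 0.4898 0.48 0.486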

Splitting of nominal attributes


A nominal attribute can produce either a binary or a multi-way split.

The computation of the Gini index is the same as for binary attributes. The split with the smallest
weighted average Gini index is the best split. In our example, the multi-way split has the lowest
Gini index, so it is the best split.

Splitting of continuous attributes

In order to split a continuous attribute, we select a split range.

In our example, the sorted values represent the distinct values of the continuous attribute in
ascending order.

The candidate split positions are the midpoints between adjacent sorted values.

Calculate the Gini index for every candidate split position; the position with the smallest Gini
index is chosen as the split point (range) for the continuous attribute.
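The annual income values from the figure are not reproduced in the text, so the sketch below uses a small illustrative list of (value, class) pairs; it shows how candidate split positions are generated at midpoints and how the one with the smallest weighted Gini index is selected.

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Illustrative (annual income, class) pairs; not the exact values from the figure.
records = [(60, "No"), (70, "No"), (75, "No"), (85, "Yes"), (90, "Yes"),
           (95, "Yes"), (100, "No"), (120, "No"), (125, "No"), (220, "No")]
values = sorted({v for v, _ in records})

best = None
for lo, hi in zip(values, values[1:]):
    split = (lo + hi) / 2                        # midpoint between adjacent distinct values
    left = [c for v, c in records if v <= split]
    right = [c for v, c in records if v > split]
    weighted = sum(len(side) / len(records) * gini([side.count("Yes"), side.count("No")])
                   for side in (left, right))
    if best is None or weighted < best[1]:
        best = (split, weighted)

print(best)   # (97.5, 0.3) for this illustrative data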

Algorithm for decision tree induction:


i) The createNode() function extends the decision tree by creating a new node. A node in the
   decision tree has either a test condition, denoted as node.test_cond, or a class label,
   denoted as node.label.
ii) The find_best_split() function determines which attribute should be selected as the test
   condition for splitting the training records.
iii) The classify() function determines the class label to be assigned to a leaf node.
iv) The stopping_cond() function is used to terminate the tree-growing process by testing
   whether all the records have the same class label or the same attribute values.
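The skeleton below is a minimal Python sketch (not from the original notes) of how these four functions fit together in the recursive tree-growing procedure; the representation of records as dictionaries and the Gini-based find_best_split are assumptions made for illustration.

class Node:
    def __init__(self):
        self.test_cond = None    # attribute used as the test condition (node.test_cond)
        self.label = None        # class label for leaf nodes (node.label)
        self.children = {}       # outcome of the test -> child node

def create_node():               # createNode() in the text
    return Node()

def classify(records):
    """Class label for a leaf node: the majority class of its records."""
    labels = [r["class"] for r in records]
    return max(set(labels), key=labels.count)

def stopping_cond(records, attributes):
    """Stop growing when all records share one class or no attributes remain."""
    return len({r["class"] for r in records}) == 1 or not attributes

def find_best_split(records, attributes):
    """Select the attribute whose split gives the smallest weighted Gini index."""
    def gini(subset):
        labels = [r["class"] for r in subset]
        return 1 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))
    def weighted_gini(attr):
        total = 0.0
        for value in {r[attr] for r in records}:
            part = [r for r in records if r[attr] == value]
            total += len(part) / len(records) * gini(part)
        return total
    return min(attributes, key=weighted_gini)

def tree_growth(records, attributes):
    node = create_node()
    if stopping_cond(records, attributes):
        node.label = classify(records)
        return node
    node.test_cond = find_best_split(records, attributes)
    for value in {r[node.test_cond] for r in records}:
        subset = [r for r in records if r[node.test_cond] == value]
        remaining = [a for a in attributes if a != node.test_cond]
        node.children[value] = tree_growth(subset, remaining)
    return node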

1. Bayes’ Theorem:

It is a classification technique based on Bayes’ Theorem with an assumption of independence


among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature. For example, a fruit may be
considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features
depend on each other or upon the existence of the other features, all of these properties independently
contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

P(c|x) = P(x|c) P(c) / P(x)

In this equation,

 P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).

 P(c) is the prior probability of class.

 P(x|c) is the likelihood which is the probability of predictor given class.

 P(x) is the prior probability of predictor.

How does the Naive Bayes algorithm work?

Let’s understand it using an example. Below I have a training data set of weather and
corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify
whether players will play or not based on weather condition. Let’s follow the below steps to
perform it.

Step 1: Convert the data set into a frequency table

Step 2: Create Likelihood table by finding the probabilities like Overcast probability =
0.29 and probability of playing is 0.64.

Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of prediction.
Problem: Players will play if the weather is sunny. Is this statement correct? We can solve it
using the method of posterior probability discussed above.

P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
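As a quick check of this arithmetic (an illustration, not part of the original notes), exact fractions give the same answer:

from fractions import Fraction

p_sunny_given_yes = Fraction(3, 9)    # P(Sunny | Yes)
p_yes = Fraction(9, 14)               # P(Yes)
p_sunny = Fraction(5, 14)             # P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(p_yes_given_sunny, float(p_yes_given_sunny))   # 3/5 0.6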
Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes. This algorithm is mostly used in text classification and in problems having multiple
classes.

Applications of Naive Bayes Algorithms

Real time Prediction: Naive Bayes is an eager learning classifier and it is very fast. Thus, it can
be used for making predictions in real time.

Multi class Prediction: This algorithm is also well known for its multi-class prediction
capability. Here we can predict the probability of multiple classes of the target variable.

Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers are widely used in
text classification (due to better results in multi-class problems and the independence assumption)
and have a higher success rate compared to many other algorithms. As a result, Naive Bayes is widely
used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis,
to identify positive and negative customer sentiments).

Recommendation System: A Naive Bayes classifier and collaborative filtering together build a
recommendation system that uses machine learning and data mining techniques to filter unseen
information and predict whether a user would like a given resource or not.

2. Naïve Bayesian Classification:


Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
Theorem. It is not a single algorithm but a family of algorithms where all of them share a common
principle, i.e. every pair of features being classified is independent of each other.

To start with, let us consider a dataset.

Consider a fictional dataset that describes the weather conditions for playing a game of Cricket.
Given the weather conditions, each tuple classifies the conditions as fit(“Yes”) or unfit(“No”) for
playing Cricket.

Here is a tabular representation of our dataset.

     OUTLOOK    TEMPERATURE   HUMIDITY   WINDY   PLAY CRICKET
0    Rainy      Hot           High       False   No
1    Rainy      Hot           High       True    No
2    Overcast   Hot           High       False   Yes
3    Sunny      Mild          High       False   Yes
4    Sunny      Cool          Normal     False   Yes
5    Sunny      Cool          Normal     True    No
6    Overcast   Cool          Normal     True    Yes
7    Rainy      Mild          High       False   No
8    Rainy      Cool          Normal     False   Yes
9    Sunny      Mild          Normal     False   Yes
10   Rainy      Mild          Normal     True    Yes
11   Overcast   Mild          High       True    Yes
12   Overcast   Hot           Normal     False   Yes
13   Sunny      Mild          High       True    No

The dataset is divided into two parts, namely, feature matrix and the response vector.

 Feature matrix contains all the vectors (rows) of the dataset, in which each vector
consists of the values of the features. In the above dataset, the features are ‘Outlook’,
‘Temperature’, ‘Humidity’ and ‘Windy’.

 Response vector contains the value of class variable (prediction or output) for each
row of feature matrix. In above dataset, the class variable name is ‘Play Cricket’.

Bayes’ Theorem: Bayes’ Theorem finds the probability of an event occurring given the probability of
another event that has already occurred. It is stated as

P(A | B) = P(B | A) P(A) / P(B)

where A and B are events and P(B) ≠ 0.

 Basically, we are trying to find probability of event A, given the event B is true. Event B is also
termed as evidence.

 P(A) is the prior probability of A, i.e. the probability of the event before the evidence is
seen. The evidence is an attribute value of an unknown instance (here, event B).

 P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.


Now, with regard to our dataset, we can apply Bayes’ theorem in the following way:

P(y | X) = P(X | y) P(y) / P(X)

where y is the class variable and X is a feature vector (of size n):

X = (x1, x2, x3, ..., xn)

Just to be clear, an example of a feature vector and corresponding class variable is (refer to the
1st row of the dataset):

X = (Rainy, Hot, High, False)
y = No

So basically, P(y|X) here means the probability of “not playing Cricket” given that the weather
conditions are “Rainy outlook”, “Hot temperature”, “High humidity” and “no wind”.
Naive Assumption: Now, it is time to add the naive assumption to Bayes’ theorem, which is
independence among the features. So we split the evidence into its independent parts.

If any two events A and B are independent, then

P(A, B) = P(A) P(B)

Hence, we reach the result:

P(y | x1, ..., xn) = [ P(x1 | y) P(x2 | y) ... P(xn | y) P(y) ] / [ P(x1) P(x2) ... P(xn) ]

which can be expressed as:

P(y | x1, ..., xn) = P(y) ∏(i=1..n) P(xi | y) / [ P(x1) P(x2) ... P(xn) ]

Now, as the denominator remains constant for a given input, we can remove that term:

P(y | x1, ..., xn) ∝ P(y) ∏(i=1..n) P(xi | y)

Now, we need to create a classifier model. For this, we find the probability of the given set of
inputs for all possible values of the class variable y and pick the output with the maximum
probability:

y = argmax_y P(y) ∏(i=1..n) P(xi | y)
So, finally, we are left with the task of calculating P(y) and P(xi | y).
Please note that
P(y) is also called class probability and
P(xi | y) is called conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the
distribution of P(xi | y).

Let us try to apply the above formula manually on our weather dataset. For this, we need to do
some pre-computations on our dataset.
We need to find P(xi | yj) for each xi in X and yj in y. All these calculations have been
demonstrated in the tables below:

So, in the tables above, we have calculated P(xi | yj) for each xi in X and yj in y manually.

For example, the probability of playing Cricket given that the temperature is cool is
P(temp. = Cool | play Cricket = Yes) = 3/9.

Also, we need to find the class probabilities P(y), which have been calculated in table 5.
For example, P(play Cricket = Yes) = 9/14.
So now, we are done with our pre-computations and the classifier is ready!
Let us test it on a new set of features (let us call it today):
Today = ( Sunny, Hot, Normal, False )
So, the probability of playing cricket is given by:

P(Yes | today) = [ P(Sunny Outlook | Yes) P(Hot Temperature | Yes) P(Normal Humidity | Yes) P(No Wind | Yes) P(Yes) ] / P(today)

and the probability of not playing cricket is given by:

P(No | today) = [ P(Sunny Outlook | No) P(Hot Temperature | No) P(Normal Humidity | No) P(No Wind | No) P(No) ] / P(today)

Since P(today) is common to both probabilities, we can ignore it and find the proportional
probabilities as:

P(Yes | today) ∝ (2/9) · (2/9) · (6/9) · (6/9) · (9/14) ≈ 0.0141

and

P(No | today) ∝ (3/5) · (2/5) · (1/5) · (2/5) · (5/14) ≈ 0.0068

Now, since P(Yes | today) + P(No | today) must equal 1, these numbers can be converted into
probabilities by making their sum equal to 1 (normalization):

P(Yes | today) = 0.0141 / (0.0141 + 0.0068) = 0.67

P(No | today) = 0.0068 / (0.0141 + 0.0068) = 0.33

Since

P(Yes | today) > P(No | today)

the prediction is that Cricket would be played, i.e. ‘Yes’.
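A minimal sketch (not from the original notes) that reproduces the arithmetic of this worked example using the likelihoods and priors quoted above:

from math import prod

# Likelihoods P(xi | class) for today = (Sunny, Hot, Normal, False), as quoted above.
likelihood_yes = [2/9, 2/9, 6/9, 6/9]   # Outlook, Temperature, Humidity, Windy given Yes
likelihood_no = [3/5, 2/5, 1/5, 2/5]    # the same attributes given No
prior_yes, prior_no = 9/14, 5/14        # class probabilities P(Yes) and P(No)

score_yes = prod(likelihood_yes) * prior_yes   # ≈ 0.0141 (unnormalized)
score_no = prod(likelihood_no) * prior_no      # ≈ 0.0069 (unnormalized)

total = score_yes + score_no                   # normalization so the two sum to 1
print(round(score_yes / total, 2), round(score_no / total, 2))   # 0.67 0.33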


3. Bayesian Belief Networks

A Bayesian Belief Network (BBN) is a special type of diagram (called a directed graph) together with
an associated set of probability tables. They are also known as Belief Networks, Bayesian Networks, or
Probabilistic Networks.

There are two components that define a Bayesian Belief Network −


 Directed acyclic graph
 A set of conditional probability tables

The graph consists of nodes and arcs. The nodes represent variables, which can be discrete or
continuous. The arcs represent causal relationships between variables.

Directed Acyclic Graph


 Each node in a directed acyclic graph represents a random variable.
 These variables may be discrete or continuous valued.
 These variables may correspond to actual attributes given in the data.
The following diagram shows a directed acyclic graph for six Boolean variables.
The arcs in the diagram allow the representation of causal knowledge. For example,
lung cancer is influenced by a person's family history of lung cancer, as well as
whether or not the person is a smoker. It is worth noting that the variable
PositiveXray is independent of whether the patient has a family history of lung
cancer or that the patient is a smoker, given that we know the patient has lung
cancer.

Conditional Probability Table


The conditional probability table for the values of the variable LungCancer (LC)
showing each possible combination of the values of its parent nodes,
FamilyHistory (FH), and Smoker (S) is as follows −
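The table itself appears only as a figure in the original notes. As a hedged illustration of how such a CPT can be represented and queried, the sketch below uses placeholder probability values (not necessarily those from the original table):

# Conditional probability table P(LungCancer = yes | FamilyHistory, Smoker).
# The numbers are illustrative placeholders, not the values from the original figure.
cpt_lung_cancer = {
    (True, True): 0.8,     # FH = yes, S = yes  (placeholder)
    (True, False): 0.5,    # FH = yes, S = no   (placeholder)
    (False, True): 0.7,    # FH = no,  S = yes  (placeholder)
    (False, False): 0.1,   # FH = no,  S = no   (placeholder)
}

def p_lung_cancer(lc, fh, smoker):
    """Return P(LungCancer = lc | FamilyHistory = fh, Smoker = smoker)."""
    p_yes = cpt_lung_cancer[(fh, smoker)]
    return p_yes if lc else 1.0 - p_yes

print(p_lung_cancer(True, fh=True, smoker=False))   # 0.5 with the placeholder table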
