Data Mining Unit 3
The term |Dj|/|D| acts as the weight of the jth partition. InfoA(D) = Σ (|Dj|/|D|) × Info(Dj), summed over the v partitions, is the expected information required to classify a tuple from D based on the partitioning by A.
• Information gain is defined as the difference between the
original information requirement (i.e., based on just the
proportion of classes) and the new requirement (i.e.,
obtained after partitioning on A). That is,
Gain(A) = Info(D) − InfoA(D).
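As a minimal sketch of the definitions above, the following computes Info(D) (entropy) and Gain(A) for a tiny hypothetical dataset; the attribute and label names are illustrative, not from the notes.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted
    entropy of the partitions induced by attribute attr."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# Hypothetical example: a perfectly discriminating attribute
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, "outlook"))  # 1.0: both partitions are pure
```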
Gain ratio
• The information gain measure is biased toward tests with many outcomes.
That is, it prefers to select attributes having a large number of values.
• For example, consider an attribute that acts as a unique identifier, such as
product ID.
• A split on product ID would result in a large number of partitions (as many
as there are values), each one containing just one tuple.
• Because each partition is pure, the information required to classify data
set D based on this partitioning would be Infoproduct ID(D) = 0. Therefore,
the information gained by partitioning on this attribute is maximal.
• Clearly, such a partitioning is useless for classification.
C4.5, a successor of ID3, uses an extension of information gain known as the gain ratio:
GainRatio(A) = Gain(A) / SplitInfoA(D),
where SplitInfoA(D) = −Σ (|Dj|/|D|) × log2(|Dj|/|D|) measures the potential information generated by the split itself. The attribute with the maximum gain ratio is selected as the splitting attribute.
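A short sketch of why the gain ratio fixes the bias described above: a unique-identifier attribute and a sensible binary attribute can have the same (maximal) gain, but SplitInfo penalizes the many-valued split. The data here is a made-up four-tuple example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D), where SplitInfo_A(D)
    penalizes attributes that split D into many small partitions."""
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in parts.values())
    gain = entropy(labels) - info_a
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    return gain / split_info if split_info else 0.0

labels = ["pos", "pos", "neg", "neg"]
unique_id = [1, 2, 3, 4]       # acts like a product ID: gain 1.0, split info 2.0
binary = ["a", "a", "b", "b"]  # same gain 1.0, but split info only 1.0
print(gain_ratio(unique_id, labels))  # 0.5
print(gain_ratio(binary, labels))    # 1.0: the binary split wins
```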
Gini index
• The Gini index considers binary splits only. It measures the
impurity of the training tuples in dataset D as
Gini(D) = 1 − Σ pi²,
where pi is the probability that a tuple in D belongs to class Ci.
• The Gini index of a binary split of dataset D on attribute A into
partitions D1 and D2 is given by:
GiniA(D) = (|D1|/|D|) × Gini(D1) + (|D2|/|D|) × Gini(D2).
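The two formulas above can be sketched directly; the 8-positive / 8-negative label mix mirrors the class distribution of the example dataset that follows.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_i^2) over the class proportions in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Gini_A(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)
    for a binary split of D into partitions left and right."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["positive"] * 8 + ["negative"] * 8
print(gini(labels))                                    # 0.5 for a 50/50 mix
print(gini_split(["positive"] * 8, ["negative"] * 8))  # 0.0: both sides pure
```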
Example:
Let's consider the dataset in the table below and draw a decision tree using the Gini index.
Index  A    B    C    D    E
1      4.8  3.4  1.9  0.2  positive
2      5.0  3.0  1.6  1.2  positive
3      5.0  3.4  1.6  0.2  positive
4      5.2  3.5  1.5  0.2  positive
5      5.2  3.4  1.4  0.2  positive
6      4.7  3.2  1.6  0.2  positive
7      4.8  3.1  1.6  0.2  positive
8      5.4  3.4  1.5  0.4  positive
9      7.0  3.2  4.7  1.4  negative
10     6.4  3.2  4.7  1.5  negative
11     6.9  3.1  4.9  1.5  negative
12     5.5  2.3  4.0  1.3  negative
13     6.5  2.8  4.6  1.5  negative
14     5.7  2.8  4.5  1.3  negative
15     6.3  3.3  4.7  1.6  negative
16     4.9  2.4  3.3  1.0  negative
• In the Gini index approach, we choose candidate split values (thresholds)
to binarize each continuous attribute. The chosen values for this dataset are:
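One common way to pick such a threshold (a sketch, not necessarily how the values above were chosen) is to try the midpoints between adjacent sorted attribute values and keep the one with the lowest weighted Gini index. The snippet below does this for column A of the 16-row table.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Column A and the class labels from the 16-row table above
A = [4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4,
     7.0, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9]
labels = ["positive"] * 8 + ["negative"] * 8

def best_threshold(values, labels):
    """Evaluate midpoints between adjacent sorted values and return the
    threshold minimizing the weighted Gini index of the binary split
    (values <= t) vs. (values > t)."""
    n = len(labels)
    best = (None, float("inf"))
    sorted_vals = sorted(set(values))
    for lo, hi in zip(sorted_vals, sorted_vals[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        g = len(left) / n * gini(left) + len(right) / n * gini(right)
        if g < best[1]:
            best = (t, g)
    return best

t, g = best_threshold(A, labels)
print(round(t, 2), round(g, 3))  # threshold 5.45 splits off 7 pure negatives
```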
Tree Pruning