Machine Learning Unit-3.2

Decision Tree Classification Algorithm:

– Decision Tree is a supervised learning technique that can be used for both classification and regression problems.

• It uses a flowchart-like tree structure.

• It is a tree-structured classifier, where
 Internal nodes represent the features of a dataset,
 Branches represent the decision rules, and
 Leaf nodes represent the outcomes.

• Leaf nodes are the outputs of those decisions.

• A decision tree is used for multi-dimensional analysis with multiple classes.
• Each node (or decision node) of a decision tree corresponds to one of the features in the feature vector.

• The tree terminates at different leaf nodes (or terminal nodes), where each leaf node represents a possible value of the output variable.

• Thus, a decision tree consists of three types of nodes:
 Root Node
 Branch Node
 Leaf Node

• The diagram below explains the general structure of a decision tree:
Decision Tree Terminologies:

• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.

• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split any further after reaching a leaf node.

• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.

• Branch/Sub-Tree: A subtree formed by splitting the tree.

• Parent/Child node: The root node of the tree is called the parent node, and the other nodes are called child nodes.
What is overfitting in decision tree?
• The decision tree algorithm, unless a stopping criterion is applied, may keep growing indefinitely – splitting on every feature and dividing the data into ever smaller partitions until every training example is perfectly classified.

• This, quite evidently, results in overfitting.

How to Avoid Overfitting?

• To prevent a decision tree from getting overfitted, pruning of the decision tree is essential.

• Pruning reduces the size of the tree so that the model is more generalized and can classify unknown and unlabelled data better.

• Pruning: Pruning is the process of removing unwanted branches from the tree.

There are two approaches to pruning:
• Pre-pruning: Stop growing the tree before it reaches perfection.
• Post-pruning: Allow the tree to grow entirely and then prune some of its branches.

• In the case of pre-pruning, the tree is stopped from growing further once it reaches a certain number of decision nodes or decisions. Hence, in this strategy, the algorithm avoids overfitting as well as optimizes computational cost.

• However, it also stands a chance of ignoring important information contributed by a feature that was skipped, thereby missing certain patterns in the data.

• On the other hand, in the case of post-pruning, the tree is allowed to grow to its full extent.

• Then, using a pruning criterion, e.g. the error rates at the nodes, the size of the tree is reduced.

• This is a more effective approach in terms of classification accuracy, as it considers all the minute information available from the training data.

• However, its computational cost is obviously higher than that of pre-pruning.
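As a concrete illustration (not part of the original notes), here is a minimal scikit-learn sketch of both strategies; the dataset and the parameter values are only placeholders: pre-pruning via max_depth / min_samples_leaf, and post-pruning via cost-complexity pruning (ccp_alpha).

```python
# Minimal sketch of pre- and post-pruning with scikit-learn (illustrative values only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with max_depth / min_samples_leaf.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Pick one of the stronger alphas from the path (an illustrative choice, not a rule).
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned accuracy :", pre_pruned.score(X_test, y_test))
print("post-pruned accuracy:", post_pruned.score(X_test, y_test))
```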


Strengths of decision tree
• It produces very simple, understandable rules.

• For smaller trees, not much mathematical or computational knowledge is required to understand this model.

• Works well for most problems.

• It can handle both numerical and categorical variables.

• Can work well with both small and large training data sets.

• Decision trees give a clear indication of which features are more useful for classification.
The independent variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether to play football or not.
As the first step, we have to find the root node (parent node) for our decision tree. To do that, follow the steps below.
Entropy:
• Entropy is a measure of randomness. In other words, it is a measure of unpredictability.

• In the case of a binary event (like a coin toss, where the output can be either of two events, head or tail), it has the mathematical form:

Entropy = -[probability(a) * log2(probability(a))] – [probability(b) * log2(probability(b))]

where
probability(a) is the probability of getting a head and
probability(b) is the probability of getting a tail.
What is “Entropy” and what is its function?
• In machine learning, entropy is a measure of the randomness in the information being processed.

• The higher the entropy, the harder it is to draw any conclusions from that information.

• It is a measure of the amount of uncertainty in a data set.

• Entropy controls how a Decision Tree decides to split the data. It actually affects how a Decision Tree draws its boundaries.
Example: Given the set S = {a, a, a, b, b, b, b, b}
Total instances: 8
Instances of b: 5
Instances of a: 3
Entropy(S) = [-3/8*log2(3/8) - 5/8*log2(5/8)]
= [-0.375 * (-1.415) - 0.625 * (-0.678)]
= (0.530 + 0.424)
= 0.954 bits
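A quick Python check of this arithmetic (added here for illustration; the helper name entropy is ours, not from the notes):

```python
import math

def entropy(*counts):
    """Shannon entropy (in bits) of a class distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# S = {a, a, a, b, b, b, b, b}: 3 instances of a, 5 of b
print(round(entropy(3, 5), 3))  # 0.954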
Information Gain:
• The measure we will use, called information gain, is simply the expected reduction in entropy caused by partitioning the data set according to a given attribute.

• The information gain, Gain(S, A), of an attribute A relative to a collection of data S is defined as

Gain(S, A) = Entropy(S) - Σ (over v in Values(A)) (|Sv| / |S|) * Entropy(Sv)

• Information gain is used for determining the best features/attributes that render maximum information about a class.

• It follows the concept of entropy, aiming to decrease the level of entropy from the root node down to the leaf nodes.

• Information gain computes the difference between the entropy before and after a split and so quantifies the impurity of the class elements.

• Information Gain = Entropy before splitting - Entropy after splitting
The expected information needed to classify a tuple in D is given by

Info(D) = - Σ (i = 1..m) pi * log2(pi)      (8.1)

Information gain is calculated by comparing the entropy of the dataset before and after a transformation.

• where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|.

• A log function to the base 2 is used, because the information is encoded in bits.

• Info(D) is also known as the entropy of D.


• How much more information would we still need (after the partitioning) to arrive at an exact classification?

• This amount is measured by

InfoA(D) = Σ (j = 1..v) (|Dj| / |D|) * Info(Dj)      (8.2)

• The term |Dj| / |D| acts as the weight of the jth partition.

• InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.

• The smaller the expected information (still) required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement and the new requirement (i.e., the one obtained after partitioning on A).

• It is given by

Gain(A) = Info(D) - InfoA(D)      (8.3)

• In other words, Gain(A) tells us how much would be gained by branching on A.

• It is the expected reduction in the information requirement caused by knowing the value of A.

• The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
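A small Python sketch of Eqs. (8.1)–(8.3), added for illustration; the function and variable names are ours, not from the textbook. As a usage example it reproduces Gain(age) for Table 8.1 below (the exact value is 0.247 before rounding; the text rounds intermediate results and reports 0.246).

```python
import math
from collections import Counter

def info(labels):
    """Info(D): expected information (entropy, in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_after_split(attr_values, labels):
    """Info_A(D): weighted entropy after partitioning on an attribute's values."""
    total = len(labels)
    result = 0.0
    for v in set(attr_values):
        subset = [lbl for a, lbl in zip(attr_values, labels) if a == v]
        result += len(subset) / total * info(subset)
    return result

def gain(attr_values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(attr_values, labels)

# Usage: the 'age' column of Table 8.1 and the buys_computer labels (RID 1..14)
age = ["youth", "youth", "middle_aged", "senior", "senior", "senior", "middle_aged",
       "youth", "youth", "senior", "youth", "middle_aged", "middle_aged", "senior"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(round(info(buys), 3))       # Info(D)   ≈ 0.94
print(round(gain(age, buys), 3))  # Gain(age) ≈ 0.247
```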
Table 8.1 Class-Labeled Training Tuples

RID  Age          Income  Student  Credit_rating  Class: buys_computer
1    Youth        High    No       Fair           No
2    Youth        High    No       Excellent      No
3    Middle_aged  High    No       Fair           Yes
4    Senior       Medium  No       Fair           Yes
5    Senior       Low     Yes      Fair           Yes
6    Senior       Low     Yes      Excellent      No
7    Middle_aged  Low     Yes      Excellent      Yes
8    Youth        Medium  No       Fair           No
9    Youth        Low     Yes      Fair           Yes
10   Senior       Medium  Yes      Fair           Yes
11   Youth        Medium  Yes      Excellent      Yes
12   Middle_aged  Medium  No       Excellent      Yes
13   Middle_aged  High    Yes      Fair           Yes
14   Senior       Medium  No       Excellent      No
Example 8.1 Induction of a decision tree using information gain.

Solution: Table 8.1 presents a training set, D.

• In this example, each attribute is discrete valued. (Continuous-valued attributes have been generalized.)

• The class label attribute, buys_computer, has two distinct values (namely, {yes, no});

• therefore, there are two distinct classes (i.e., m = 2).

• Let class C1 correspond to yes and class C2 correspond to no.
• There are 9 tuples of class yes and 5 tuples of class no.

• A (root) node N is created for the tuples in D.

• To find the splitting criterion for these tuples, we must compute the information gain of each attribute.

• We first use Eq. (8.1) to compute the expected information needed to classify a tuple in D:

Info(D) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.940 bits
• Next, we need to compute the expected information requirement for each attribute.

• Let’s start with the attribute age. We need to look at the distribution of yes and no tuples for each category of age.

• For the age category “youth,” there are 2 yes tuples and 3 no tuples.

• For the category “middle_aged,” there are 4 yes tuples and 0 no tuples.

• For the category “senior,” there are 3 yes tuples and 2 no tuples.
• Using Eq. (8.2), the expected information needed to classify a tuple in D if the tuples are partitioned according to age is

Info_age(D) = (5/14)*[-(2/5)log2(2/5) - (3/5)log2(3/5)] + (4/14)*0 + (5/14)*[-(3/5)log2(3/5) - (2/5)log2(2/5)]
= 0.694

• Hence, the gain in information from such a partitioning would be

Gain(age) = Info(D) - Info_age(D)
= 0.940 - 0.694
= 0.246 bits
• Next, we need to compute the expected information requirement for the attribute income.

• We need to look at the distribution of yes and no tuples for each category of income.

• For the income category “High,” there are 2 yes tuples and 2 no tuples.

• For the income category “Medium,” there are 4 yes tuples and 2 no tuples.

• For the category “Low,” there are 3 yes tuples and 1 no tuple.
• Using Eq. (8.2), the expected information needed to classify a tuple in D if the tuples are partitioned according to income is

Info_income(D) = (4/14)*[-(2/4)log2(2/4) - (2/4)log2(2/4)] + (6/14)*[-(4/6)log2(4/6) - (2/6)log2(2/6)] + (4/14)*[-(3/4)log2(3/4) - (1/4)log2(1/4)]

= (4/14)*[0.5 + 0.5] + (6/14)*[(0.666*0.5849) + (0.3333*1.5851)] + (4/14)*[(0.75*0.4150) + (0.25*2.0)]

= 0.2857*[1.0] + 0.4285*[0.3895 + 0.5283] + 0.2857*[0.3113 + 0.5]

= 0.2857 + 0.3934 + 0.2318
= 0.911

• Hence, the gain in information from such a partitioning would be

Gain(income) = Info(D) - Info_income(D)
= 0.940 - 0.911
= 0.029 bits
• Next, we need to compute the expected information requirement for the attribute student.

• We need to look at the distribution of yes and no tuples for each category of student.

• For the student category “Yes,” there are 6 yes tuples and 1 no tuple.

• For the student category “No,” there are 3 yes tuples and 4 no tuples.
• Using Eq. (8.2), the expected information needed to classify a tuple in D if the tuples are partitioned according to student is

Info_student(D) = (7/14)*[-(6/7)log2(6/7) - (1/7)log2(1/7)] + (7/14)*[-(3/7)log2(3/7) - (4/7)log2(4/7)]

= (7/14)*[(0.8571*0.2224) + (0.1428*2.8079)] + (7/14)*[(0.4285*1.2226) + (0.5714*0.8074)]

= 0.5*(0.1906 + 0.4009) + 0.5*(0.5238 + 0.4613)
= 0.2957 + 0.4925
= 0.788

• Hence, the gain in information from such a partitioning would be

Gain(student) = Info(D) - Info_student(D)
= 0.940 - 0.788
= 0.152 bits
• Similarly, we can compute Gain(credit_rating) = 0.048 bits.

• Because age has the highest information gain among the attributes, it is selected as the splitting attribute.

• Node N is labeled with age, and branches are grown for each of the attribute’s values.

• The tuples are then partitioned accordingly, as shown in Figure 8.5.

• Notice that the tuples falling into the partition for age = middle_aged all belong to the same class. Because they all belong to class “yes,” a leaf should therefore be created at the end of this branch and labeled “yes.”

• The final decision tree returned by the algorithm was shown earlier in Figure 8.2.
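For readers who want to reproduce this example in code, here is a hedged sketch using scikit-learn (not part of the textbook example). Note that scikit-learn implements CART, which makes binary splits on one-hot-encoded columns, so the printed tree will not look exactly like Figure 8.2, even though an age indicator still drives the first split on this data.

```python
# Sketch: fitting Table 8.1 with scikit-learn (CART with the entropy criterion).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":           ["youth", "youth", "middle_aged", "senior", "senior", "senior", "middle_aged",
                      "youth", "youth", "senior", "youth", "middle_aged", "middle_aged", "senior"],
    "income":        ["high", "high", "high", "medium", "low", "low", "low",
                      "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student":       ["no", "no", "no", "no", "yes", "yes", "yes",
                      "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
                      "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                      "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

X = pd.get_dummies(data.drop(columns="buys_computer"))  # one-hot encode the categoricals
y = data["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```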
• Step 1: Calculate the Entropy of one attribute — Prediction: Clare will play tennis / not play tennis

• For this illustration, I will use this contingency table to calculate the entropy of our target variable: Played? (Yes/No).

• There are 14 observations (10 “Yes” and 4 “No”).

• The probability (p) of ‘Yes’ is 0.71428 (10/14), and the probability of ‘No’ is 0.28571 (4/14).

• You can then calculate the entropy of our target variable using the equation above.
• Step 2: Calculate the Entropy for each feature using the contingency table
• To illustrate, I use Outlook as an example to explain how to calculate its entropy. There are a total of 14 observations. Summing across the rows, we can see that 5 of them belong to Sunny, 4 belong to Overcast, and 5 belong to Rainy. Therefore, we can find the probabilities of Sunny, Overcast, and Rainy and then calculate their entropies one by one using the above equation. The calculation steps are shown below.
• Definition: Information Gain is the decrease in entropy when a node is split.
• The equation of Information Gain:

IG(Y, X) = E(Y) - E(Y | X)

• Information Gain from X on Y.
• Step 3: Choose attribute with the largest
Information Gain as the Root Node
• The information gain of ‘Humidity’ is the
highest at 0.918. Humidity is the root node.
• Step 4: A branch with an entropy of 0 is a leaf
node, while a branch with entropy more than 0
needs further splitting.
• Step 5: Nodes are grown recursively in the ID3
algorithm until all data is classified.
• We decided to make the first decision on the basis of Outlook. We could have made our first decision based on Humidity or Wind, but we chose Outlook. Why?

• Because splitting on Outlook reduces the randomness in the outcome (whether to play or not) more than splitting on Humidity or Wind would.

• Let’s understand this with the example here.

• Please refer to the play-tennis dataset that is pasted above.
• What is a good quantitative measure of the
worth of an attribute? Information gain
measures how well a given attribute separates
the training examples according to their target
classification. ID3 uses this information gain
measure to select among the candidate
attributes at each step while growing the tree.
• Information Gain is based on Entropy
• We have data for 14 days. We have only two
outcomes :
• Either we played tennis or we didn’t play.
• In the given 14 days, we played tennis on 9 occasions
and we did not play on 5 occasions.

• Probability of playing tennis:


• Number of favourable events : 9
• Number of total events : 14
• Probability = (Number of favourable events) / (Number of total events)

= 9/14
= 0.642
• Now, we will see probability of not playing
tennis.
• Probability of not playing tennis:
Number of favourable events : 5
• Number of total events : 14
• Probability = (Number of favourable events) / (Number of total events)

=5/14
=0.357
• Entropy at source = -(Probability of playing tennis) * log2(Probability of playing tennis) – (Probability of not playing tennis) * log2(Probability of not playing tennis)
• E(S) = -(9/14)log2(9/14) - (5/14)log2(5/14)
= -0.642 * log2(0.642) – 0.357 * log2(0.357)
= 0.940
So, the entropy of the whole system before we ask our first question is 0.940.
Note: Here we typically take log to base 2.
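A quick numerical check of this value in Python (added for illustration, not part of the notes):

```python
import math

p_yes, p_no = 9/14, 5/14
print(round(-p_yes * math.log2(p_yes) - p_no * math.log2(p_no), 3))  # 0.94
```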
• From the above data for Outlook we can arrive at the following table easily (counts of play vs. no-play for each Outlook value):

Outlook    Yes  No  Total
Sunny       2    3    5
Overcast    4    0    4
Rainy       3    2    5

• Now we have to calculate the average weighted entropy, i.e., the sum over branches of the weight of each branch (its share of the examples) multiplied by that branch’s entropy.
• Entropy among the three branches:
• Entropy among three branches = ((number of sunny days)/(total
days) * (entropy when sunny)) + ((number of overcast
days)/(total days) * (entropy when overcast)) + ((number of rainy
days)/(total days) * (entropy when rainy))
• E(S, outlook) = (5/14)*E(3,2) + (4/14)*E(4,0) +
(5/14)*E(2,3)
= ((5/14) * 0.971) +((4/14) * 0.0) +((5/14)*0.971)
= 0.693

Hence, the gain in information from such a partitioning would be,

Gain(S, Outlook) = E(S) - E(S, Outlook)
= 0.940 - 0.693
= 0.247
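The same computation in Python (an illustrative check; the entropy helper is ours):

```python
import math

def entropy(yes, no):
    """Entropy (bits) of a yes/no count pair; pure branches contribute 0."""
    total = yes + no
    return -sum(c / total * math.log2(c / total) for c in (yes, no) if c)

e_source  = entropy(9, 5)                                                   # 0.940
e_outlook = (5/14)*entropy(2, 3) + (4/14)*entropy(4, 0) + (5/14)*entropy(3, 2)
print(round(e_source - e_outlook, 3))  # Gain(S, Outlook) ≈ 0.247
```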
• Now we will calculate the Information Gain for Humidity:

Humidity   Play: Yes  Play: No  Total
High           3          4       7
Normal         6          1       7

• Entropy among the two branches:

• Entropy among the two branches = ((number of High humidity days)/(total days) * (entropy when High)) + ((number of Normal days)/(total days) * (entropy when Normal))

• E(S, Humidity) = (7/14)*E(3,4) + (7/14)*E(6,1)
= (7/14)*(-(3/7)log2(3/7) - (4/7)log2(4/7)) + (7/14)*(-(6/7)log2(6/7) - (1/7)log2(1/7))
= 0.788

Hence, the gain in information from such a partitioning would be,

Gain(S, Humidity) = E(S) - E(S, Humidity)
= 0.940 - 0.788
= 0.152
Information gain for Outlook, Temperature, Humidity, and Wind is as follows:
• Outlook: Information Gain = 0.247
• Humidity: Information Gain = 0.152
• Windy: Information Gain = 0.048
• Temperature: Information Gain = 0.029

• Now select the feature having the largest Information Gain.
• Here it is Outlook, so it forms the first node (root node) of our decision tree.
• Since Overcast contains only examples of class ‘Yes’, we can set it as a ‘Yes’ leaf. That means that if the outlook is Overcast, tennis will be played. Now our decision tree looks as follows.
• The next step is to find the next node in our decision tree. We will first find the one under Sunny: we have to determine which of Temperature, Humidity, or Wind has the higher information gain on the Sunny examples.

Day  Outlook  Temp  Humidity  Wind    Class: Play Tennis?
D1   Sunny    Hot   High      Weak    No
D2   Sunny    Hot   High      Strong  No
D8   Sunny    Mild  High      Weak    No
D9   Sunny    Cool  Normal    Weak    Yes
D11  Sunny    Mild  Normal    Strong  Yes
ID3 Algorithm – the tree after splitting on Outlook
Root: Outlook, trained on [D1, D2, …, D14] (9+, 5-)
• Sunny → Ssunny = [D1, D2, D8, D9, D11] (2+, 3-): test Humidity (branches High / Normal)
• Overcast → [D3, D7, D12, D13] (4+, 0-): leaf “Yes”
• Rain → [D4, D5, D6, D10, D14] (3+, 2-): ? (test for this node still to be chosen)
• Calculate parent entropy E(sunny)
E(sunny) = -(2/5)log(2/5)-(3/5)log(3/5)
= -( 0.4)*(-1.3219)-(0.6)*(-0.7369)
= 0.5287+0.4421
= 0.971
• Now calculate the information gain of Humidity, Wind, and Temperature within the Sunny branch, starting with Humidity.
• IG(Sunny, Humidity):
Sunny – Humidity   Play: Yes  Play: No  Total
High                   0          3       3
Normal                 2          0       2
Total                                     5

• E(Sunny, Humidity) = (3/5)*E(0,3) + (2/5)*E(2,0)
= (3/5)*0 + (2/5)*0
= 0
IG(Sunny, Humidity) = E(Sunny) - E(Sunny, Humidity)
= 0.971 - 0
= 0.971
For Humidity, from the above table we can say that play will occur if humidity is Normal and will not occur if it is High. Similarly, find the nodes under Rainy.
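A short check of the Sunny-branch numbers in Python (illustrative; the entropy helper is ours):

```python
import math

def entropy(yes, no):
    total = yes + no
    return -sum(c / total * math.log2(c / total) for c in (yes, no) if c)

e_sunny = entropy(2, 3)                                        # 0.971
e_sunny_humidity = (3/5)*entropy(0, 3) + (2/5)*entropy(2, 0)   # both branches pure -> 0
print(round(e_sunny - e_sunny_humidity, 3))  # IG(Sunny, Humidity) ≈ 0.971
```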
Sunny – Temp.   Play: Yes  Play: No  Total
Hot                 0          2       2
Mild                1          1       2
Cool                1          0       1
Total                                  5

E(Sunny, Temperature) = (2/5)*E(0,2) + (2/5)*E(1,1) + (1/5)*E(1,0)
= (2/5)*0 + (2/5)*(-(1/2)log2(1/2) - (1/2)log2(1/2)) + (1/5)*0
= 0 + (2/5)*1.0 + 0 = 0.40

IG(Sunny, Temp.) = E(Sunny) - E(Sunny, Temperature)
= 0.971 - 0.40
= 0.571
Sunny – Wind   Play: Yes  Play: No  Total
Strong             1          1       2
Weak               1          2       3
Total                                 5

E(Sunny, Wind) = (2/5)*E(1,1) + (3/5)*E(1,2)
= (2/5)*(-(1/2)log2(1/2) - (1/2)log2(1/2)) + (3/5)*(-(1/3)log2(1/3) - (2/3)log2(2/3))
= 0.4*1.0 + 0.6*(0.528 + 0.390)
= 0.4 + 0.551
= 0.951

IG(Sunny, Wind) = E(Sunny) - E(Sunny, Wind)
= 0.971 - 0.951
= 0.020

Since Humidity has the highest information gain (0.971) among Humidity, Temperature, and Wind, it is selected as the test at the Sunny node.
Day  Outlook  Temp  Humidity  Wind    Class: Play Tennis?
D4   Rain     Mild  High      Weak    Yes
D5   Rain     Cool  Normal    Weak    Yes
D6   Rain     Cool  Normal    Strong  No
D10  Rain     Mild  Normal    Weak    Yes
D14  Rain     Mild  High      Strong  No

E(Rain) = E(3,2)
= -(3/5)log2(3/5) - (2/5)log2(2/5)
= 0.6*0.7369 + 0.4*1.3219
= 0.971

Now calculate the information gain of Humidity, Wind, and Temperature within the Rain branch.

Rain – Humidity   Play: Yes  Play: No  Total
High                  1          1       2
Normal                2          1       3
Total                                    5

E(Rain, Humidity) = (2/5)*E(1,1) + (3/5)*E(2,1)
= (2/5)*(-(1/2)log2(1/2) - (1/2)log2(1/2)) + (3/5)*(-(2/3)log2(2/3) - (1/3)log2(1/3))
= 0.4*1.0 + 0.6*(0.390 + 0.528)
= 0.4 + 0.551
= 0.951
IG(Rain, Humidity) = E(Rain) - E(Rain, Humidity)
= 0.971 - 0.951
= 0.020
Day  Outlook  Wind    Play Tennis?
D4   Rain     Weak    Yes
D5   Rain     Weak    Yes
D6   Rain     Strong  No
D10  Rain     Weak    Yes
D14  Rain     Strong  No

E(Rain, Wind) = (3/5)*E(3,0) + (2/5)*E(0,2)
= (3/5)*0 + (2/5)*0
= 0
IG(Rain, Wind) = E(Rain) - E(Rain, Wind)
= 0.971 - 0
= 0.971
Since Wind splits the Rain examples into pure partitions (all Weak days are “Yes” and all Strong days are “No”), its information gain is the highest, and Wind is selected as the test at the Rain node.
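A Python check of the Rain-branch gains (illustrative; the entropy helper is ours) confirms that Wind, not Humidity, gives the larger gain here:

```python
import math

def entropy(yes, no):
    total = yes + no
    return -sum(c / total * math.log2(c / total) for c in (yes, no) if c)

e_rain = entropy(3, 2)                                          # 0.971
e_rain_humidity = (2/5)*entropy(1, 1) + (3/5)*entropy(2, 1)     # 0.951
e_rain_wind     = (3/5)*entropy(3, 0) + (2/5)*entropy(0, 2)     # 0.0 (pure branches)
print(round(e_rain - e_rain_humidity, 3))  # IG(Rain, Humidity) ≈ 0.020
print(round(e_rain - e_rain_wind, 3))      # IG(Rain, Wind)     ≈ 0.971
```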
Day  Outlook  Temp  Humidity  Wind    Class: Play Tennis?
D4   Rain     Mild  High      Weak    Yes
D5   Rain     Cool  Normal    Weak    Yes
D6   Rain     Cool  Normal    Strong  No
D10  Rain     Mild  Normal    Weak    Yes
D14  Rain     Mild  High      Strong  No
ID3 Algorithm – choosing the test for the Sunny branch
Root: Outlook, trained on [D1, D2, …, D14] (9+, 5-)
• Sunny → Ssunny = [D1, D2, D8, D9, D11] (2+, 3-): ? (test for this node)
• Overcast → [D3, D7, D12, D13] (4+, 0-): leaf “Yes”
• Rain → [D4, D5, D6, D10, D14] (3+, 2-): ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
Day  Outlook   Temp  Humidity  Wind    Class: Play Tennis?
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
• Outlook: Information Gain = 0.247
• Humidity: Information Gain = 0.152
• Windy: Information Gain = 0.048
• Temperature: Information Gain = 0.029
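To tie the walkthrough together, here is a compact recursive ID3 sketch in Python (added for illustration; all names are ours, not from the slides). On the 14-day play-tennis data it reproduces the tree derived above: Outlook at the root, Humidity under Sunny, a “Yes” leaf under Overcast, and Wind under Rain.

```python
import math
from collections import Counter

ROWS = [  # (Outlook, Temp, Humidity, Wind, PlayTennis)
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    """Entropy (bits) of the class labels (last element of each row)."""
    total = len(rows)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, attr):
    """Information gain of splitting rows on the given attribute."""
    idx, total = ATTRS[attr], len(rows)
    remainder = 0.0
    for value in {r[idx] for r in rows}:
        subset = [r for r in rows if r[idx] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attrs):
    """Recursively build a tree as nested dicts; leaves are class labels."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:                     # pure node -> leaf
        return labels.pop()
    if not attrs:                            # no attributes left -> majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    idx = ATTRS[best]
    remaining = [a for a in attrs if a != best]
    return {best: {value: id3([r for r in rows if r[idx] == value], remaining)
                   for value in {r[idx] for r in rows}}}

print(id3(ROWS, list(ATTRS)))
# e.g. {'Outlook': {'Sunny': {'Humidity': {...}}, 'Overcast': 'Yes', 'Rain': {'Wind': {...}}}}
```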
