
DECISION TREES

Suppose we want to play cricket today. We should first consider the weather conditions that may or may not permit us to play cricket. For example, if it is raining today, we may not be able to play. If the temperature is too hot, we may not play either. But if it is an overcast day where clouds can be seen and the weather is cool, we may play. In this manner, our decision to play or not to play cricket depends on the weather conditions. So, what are we doing here? We are taking a final decision depending on certain conditions or rules. This is exactly what a decision tree does.

A decision tree is a Machine Learning model that gives a decision after considering certain conditions. It partitions (or divides) the data based on those conditions to arrive at the correct decision. A decision tree looks like a tree structure, as shown in Figure 30.1:

Figure 30.1: A decision tree

In Figure 30.1, we show a decision tree that represents data which helps to decide whether or not to play cricket under the given weather conditions. In the decision tree, the topmost node is called the 'root node'. For example, Outlook is the root node that represents the outlook of the day. This node may have values called 'attributes'. For example, Outlook may be Sunny, Overcast or Rainy; these values are called attributes. These attributes can be imagined as the branches of a tree.

Below the root node, we can have another node. For example, 'Windy' is another node. This node may also have values called attributes. For example, Windy may be TRUE or FALSE, meaning there may or may not be wind on that day. These TRUE and FALSE values become the attributes of that node. In this manner, there may be several nodes descending from each other, and each node may have attributes.

Finally, there would be a decision in the form of 'Yes' or 'No'. For example, if it is a windy day (TRUE), we may not be able to play cricket, so the final decision is represented as 'No'. If it is not a windy day (FALSE), then we may play cricket. Thus a 'Yes' or 'No' decision is made finally, and these become the last nodes in the tree. These last nodes are called leaf nodes.

A decision tree arrives at a final decision after checking several conditions. For example, if Outlook is Sunny, it then checks Windy: if Windy is TRUE, it gives the output 'No'; if Windy is FALSE, it gives the output 'Yes'. Similarly, if Outlook is Overcast, it will check whether Windy is TRUE or FALSE, and if Outlook is Rainy, it will again check whether Windy is TRUE or FALSE. Depending on all these tests, it provides the output 'Yes' or 'No'. These checking paths look like the branches of a tree.

In Figure 30.1, we did not show the complete decision tree as it becomes a bit
complicated if we take all the nodes and attributes into consideration. However, the total data
related to this decision tree is given in Figure 30.2:

Figure 30.2: The weather data for playing cricket

Observe the decision tree in Figure 30.1. While drawing this decision tree, why did we represent Outlook as the root node? If we observe the data, we can see that there are other columns like Temperature, Humidity and Windy. They also contribute to the final decision. So, why not take one of those other columns as the root node?

The question of which column should be taken as the root node is decided using either 'entropy' or the 'gini index'. Entropy represents the randomness of data: when entropy is higher, the randomness is higher. That means the data points are scattered. It also indicates impurities in the data which keep the data points apart. Please see Figure 30.3.

Gini index is a measurement of the impurities in the data. There may be abnormal values, or there may be values which cause confusion in taking a decision. Both entropy and the gini index represent impurities in the data. Whereas entropy indirectly represents the level of impurities in the data, the gini index measures the impurities directly.

Figure 30.3: The concept of entropy

Entropy
Entropy is the measure of the randomness of data. When the data is more random, the data points are far apart. That means some impurities are present in the data which keep the data points apart. When entropy is high, randomness is high and hence there are more impurities. In this case, the output may not be accurate. When entropy is low, randomness is low and hence the data is close together without many impurities. Such data helps the decision tree make correct decisions.

When the entropy value is low, the data contains fewer impurities. When the impurities are fewer, the data provides more useful information to the decision tree algorithm. This is called information gain. This is the reason entropy is generally applied to each candidate node in the decision tree to calculate its information gain. The node with the highest information gain is taken as the root node. Below the root node, we place the node with the next highest information gain. In this manner, entropy helps to decide the root node and the other nodes placed at the subsequent levels. The formula for calculating entropy is:

E(S) = -P(yes) log2P(yes) -P(no) log2P(no)

Where E(S) represents the entropy of the sample space, P(yes) represents the probability of 'Yes' and P(no) represents the probability of 'No'. The sample space S contains all the data points.

If the number of 'Yes' rows and the number of 'No' rows are equal, then P(yes) and P(no) will be equal. Since the total probability is always 1, P(yes) = P(no) = 0.5. In this case,

E(S) = -0.5 log2 (0.5)-0.5 log2 (0.5)

= -0.5(-1)-0.5 (-1)

= 0.5+0.5 = 1

If the sample space contains only 'Yes' rows, that means there are no 'No' rows. Since the total probability is 1, we take P(yes) as 1. Now,
E(S) = - P(yes) log2P(yes)

E(S) = -1 log2 (1)

= -1 (0)

=0

Similarly, if there are only 'No' rows and no rows with 'Yes', then also E(S) = 0.
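These hand calculations can also be verified with a few lines of Python. The following is a minimal sketch (not part of the original text); the function name entropy is our own choice, and the standard math module's log2 is used for the base-2 logarithm.

import math

def entropy(p_yes, p_no):
    # E(S) = -P(yes) log2 P(yes) - P(no) log2 P(no); a zero probability contributes 0
    result = 0.0
    for p in (p_yes, p_no):
        if p > 0:                      # log2(0) is undefined, so skip zero terms
            result -= p * math.log2(p)
    return result

print(entropy(0.5, 0.5))       # 1.0  -> equal Yes/No split, maximum randomness
print(entropy(1.0, 0.0))       # 0.0  -> all Yes, pure data
print(entropy(9/14, 5/14))     # about 0.94 -> total entropy of the weather dataset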

How to use Scientific Calculator to calculate log2 Values


We can take the help of the scientific calculator available on our computer to calculate logarithmic values. First, open the calculator by right-clicking the Windows 'Start' button and clicking the 'Run' app. Then type 'calc' to open the calculator app. In the calculator, we can see three horizontal lines at the top-left corner. Click on them to view the options and select 'Scientific'. This presents the scientific calculator.

Suppose we want to calculate the value of log2 0.5. This is equal to log 0.5 / log 2. Hence click on 0.5 and then the 'log' button in the calculator. It shows -0.3010 with several more fraction digits. Then click on the 'division' (÷) symbol, type 2 and then press the 'log' button. It shows 0.3010 with several more fraction digits. Then click on the 'equals' (=) symbol to see the result. It will show -1. Therefore, the value of log2 0.5 is -1. See Figure 30.4.

Let us take another example: calculating the value of log2 (9/14). This is equal to log (9/14) / log 2. First click on 9, then the division symbol, then 14, and press the equals button. It shows the value of 9/14. Then click on the 'log' button. This gives the value of log (9/14). Now we are left with the denominator. So, click on the division symbol, then 2, and then the 'log' button. Then click on the equals button to see the final result, i.e. -0.6374.
Figure 30.4: Calculating log base 2 values using scientific calculator in Windows
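If Python is available, the same base-2 logarithm values can be obtained without the calculator. This is a small illustrative snippet of ours using the standard math module.

import math

print(math.log2(0.5))      # -1.0
print(math.log2(9/14))     # approximately -0.6374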

With this knowledge of using the scientific calculator, let us now calculate the entropy and information gain for the dataset presented in Figure 30.2.

Calculating Total Entropy E(S) for the Dataset


There are 14 rows in the dataset. Among them, we have 9 rows with 'Yes' and 5 rows with 'No'. The formula for entropy is:

E(S) = -P(Yes) log2 P(Yes) - P(No) log2 P(No)


E(S) = -(9/14) * log2 9/14 -(5/14) * log2 5/14
E(S )= 0.41 + 0.53 = 0.94
We will now calculate entropy for each column. First, we will take 'Outlook' column.

Calculate Entropy for Outlook

Outlook has 3 different attributes: Sunny, Overcast and Rainy. In the case of Outlook=Sunny, count how many rows contribute to 'Yes' and how many to 'No'.

The total number of rows where Outlook=Sunny is 5. The number of rows with 'Yes' is 2 and with 'No' is 3.

So, Entropy(outlook=Sunny) = -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971
In the case of Outlook=Overcast, count how many rows contribute to 'Yes' and how many to 'No'.

The total number of rows where Outlook=Overcast is 4. The number of rows with 'Yes' is 4 and with 'No' is 0.

Entropy(outlook=overcast) = -1 log2 1 - 0 log2 0 = 0 (taking 0 log2 0 as 0, since there are no 'No' rows)

In the case of Outlook=Rainy, count how many rows contribute to 'Yes' and how many to 'No'.

The total number of rows where Outlook=Rainy is 5. The number of rows with 'Yes' is 3 and with 'No' is 2.

Entropy(outlook=rainy) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971

Information from Outlook


I(outlook) = 5/14 x 0.971 +4/14 x 0 + 5/14 x 0.971 = 0.693

Information Gain from Outlook


IG(outlook) = E(S) - I(outlook) = 0.94 -0.693 = 0.247

In the previous steps, we calculated the Information Gain for the Outlook node. Similarly, if we calculate it for the other nodes, we get the results shown in Table 30.1:

Table 30.1: The information Gain of columns in the dataset

Please observe Table 30.1. The highest information gain (IG) value (0.247) is seen for the 'Outlook' column. Hence this column should be selected as the root node. The next highest value (0.152) is seen for the 'Humidity' column. Hence this column becomes the node at the next level. In this manner, entropy is used by the decision tree algorithm to decide which columns should be used as nodes at the different levels.
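The entropy and information gain values in Table 30.1 can be cross-checked in code. Below is a minimal sketch of ours (it is not part of the book's programs); it assumes the 14-row weather data of Figure 30.2 has been loaded into a pandas DataFrame named df whose target column is 'Play Cricket'.

import math
import pandas as pd

def entropy_of(series):
    # entropy of a categorical series such as the 'Play Cricket' column
    probs = series.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(df, column, target='Play Cricket'):
    # IG(column) = E(S) - weighted average entropy of the subsets formed by the column
    total_entropy = entropy_of(df[target])
    weighted = 0.0
    for value, subset in df.groupby(column):
        weighted += len(subset) / len(df) * entropy_of(subset[target])
    return total_entropy - weighted

# assuming df holds the weather dataset:
# information_gain(df, 'Outlook')   # expected to be about 0.247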

Gini index
Gini index is a direct measurement of the impurities in the data. When the gini index value is high, the impurities are high. When it is low, the impurities are low. Hence, we should take the column having the lowest gini index as the root node.

The formula for calculating the gini index is as follows. If a dataset S contains data points from n classes (or categories), the gini index of that dataset is:

Gini(S) = 1 - Σ (pj)², where the sum runs over the classes j = 1, 2, ..., n

Where pj is the relative probability (relative frequency) of class j in S.

We will apply this formula and calculate gini index for each of the columns in the dataset.

Calculate Gini Index for Outlook


There are 3 classes, Sunny, Overcast and Rainy, in the Outlook column.

First of all, we note down the total number of rows in each class and how many of them contain 'Yes' and how many contain 'No'.

The Sunny class contains 5 rows. Among them, there are 2 Yes and 3 No rows. The formula for calculating the gini index of a class is 1 - (Probability of Yes)² - (Probability of No)².

Gini(outlook = Sunny) = 1 - (2/5)² - (3/5)² = 0.48

Overcast class contains 4 rows and there are 4 Yes and 0 Nos.

Gini(outlook = overcast) = 1 - (4/4)² - (0/4)² = 0

Rainy class contains 5 rows and there are 3 Yes and 2 Nos.

Gini(outlook = rainy) = 1 - (3/5)² - (2/5)² = 0.48

If n is the total number of rows in the dataset, then the Gini index of Outlook is:

Gini(Outlook) = (Sunny rows/n) × Gini(Sunny) + (Overcast rows/n) × Gini(Overcast) + (Rainy rows/n) × Gini(Rainy)

= 5/14 × 0.48 + 4/14 × 0 + 5/14 × 0.48 = 0.3429

In the same manner, let us calculate the Gini index for the other columns also. The results are shown in Table 30.2.

Table 30.2: Gini Index values for the columns of the dataset
Among all the columns, the gini index of the Outlook column is the lowest (0.3429). That means it has the fewest impurities. Hence, we select Outlook as our root node. The next lowest value (0.3674) is seen for Humidity. Hence, this column should be taken as the second-level node.
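The Gini index values of Table 30.2 can be cross-checked in the same way. Again, this is only a sketch of ours, assuming the weather data is in a DataFrame df with target column 'Play Cricket'.

def gini_of(series):
    # Gini(S) = 1 - sum of squared class probabilities
    probs = series.value_counts(normalize=True)
    return 1 - sum(p ** 2 for p in probs)

def gini_of_column(df, column, target='Play Cricket'):
    # weighted Gini index of a column: sum over its values of (rows/n) * Gini of that subset
    return sum(len(subset) / len(df) * gini_of(subset[target])
               for _, subset in df.groupby(column))

# gini_of_column(df, 'Outlook')   # expected to be about 0.3429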

Comparison of Entropy and Gini Index

Both entropy and the gini index are used to decide which node should be taken as the root node of the tree and which nodes should be taken at the subsequent levels. But if we compare the two methods, the Gini index is more efficient than entropy in terms of computing power. Please remember that the term 'computing power' indicates processor time and memory.

The entropy values lie in the range 0 to 1, whereas the gini values lie between 0 and 0.5. Please see Figure 30.5, where the entropy curve rises up to 1 and then starts decreasing, while the gini curve rises only up to 0.5 before it starts decreasing. Moreover, the gini index does not require computing logarithms, which is why it needs less computational power.

Figure 30.5: Entropy and Gini Index values
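The two curves of Figure 30.5 can be reproduced with a short script. The snippet below is illustrative only; it plots the entropy and Gini impurity of a two-class problem as the probability of 'Yes' varies from 0 to 1.

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                      # probability of 'Yes'
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)    # peaks at 1.0 when p = 0.5
gini = 1 - p**2 - (1 - p)**2                            # peaks at 0.5 when p = 0.5

plt.plot(p, entropy, label='Entropy')
plt.plot(p, gini, label='Gini index')
plt.xlabel('P(Yes)')
plt.legend()
plt.show()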

Decision Tree

A decision tree is a machine learning model that contains the logic for splitting data based on some conditions and finally making conclusions. It internally creates a tree structure with the columns of the dataset as nodes at various levels. The final nodes provide 'Yes' or 'No' type decisions or conclusions.

Decision trees internally use the entropy or gini concept to decide the hierarchy of nodes starting from the root node. Let us see how to apply a decision tree to the weather conditions to decide whether or not to play cricket.
Dataset given: cricket1.csv

This dataset has 14 rows and 5 columns. The column names are: Outlook, Temperature, Humidity, Windy and Play Cricket. The last column is the target column that shows either 'Yes' or 'No'. The full dataset is shown in Figure 30.2.

Since the total data is in the form of strings (or text), we have to convert all the columns into numeric form. For this purpose, we can use LabelEncoder. LabelEncoder simply assigns 0, 1, 2, etc. to each category. For example, when we apply LabelEncoder on the Outlook column, the 3 attributes of the Outlook column are converted as:

Overcast -> 0, Rainy -> 1 and Sunny-> 2

Thus, LabelEncoder replaces Overcast with 0, Rainy with 1 and Sunny with 2, converting the text values into numbers. A LabelEncoder can be created by creating an object of the LabelEncoder class, as:

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder ()
Let us apply LabelEncoder on the columns of the dataset and create new columns with
numeric data using fit_transform() method of LabelEncoder class, as:

df['Outlook_n'] = le.fit_transform(df['Outlook'])
df['Temp_n'] = le.fit_transform(df['Temperature'])
df['Humidity_n'] = le.fit_transform(df['Humidity'])
df['Windy_n'] = le.fit_transform(df['Windy'])
df['Play_n'] = le.fit_transform(df['Play Cricket'])
df
Output:

The last 5 columns in the data frame represent the converted columns that contain numeric data. From this data frame, we can understand that the column data was encoded by the LabelEncoder in the following way:

Outlook ->0 Overcast, 1 Rainy, 2 Sunny


Temperature-> 0 Cool, 1 Hot, 2 Mild
Humidity-> 0 High, 1 Normal
Windy-> 0 False, 1 True
Play cricket -> 0 No, 1 Yes
Since all 5 columns have been converted into numeric form, let us delete the original text columns and keep only the last 5 columns that contain numeric data.

df = df.drop(['Outlook', 'Temperature', 'Humidity', 'Windy', 'Play Cricket'], axis='columns')
df
Output:
Let us divide the data into independent variables (x) and the dependent or target variable (y), as:

x = df.drop(['Play_n'], axis='columns')
y = df['Play_n']
Since the data is ready for the model, we can apply the DecisionTreeClassifier model on this data. Let us create a decision tree by creating an object of the DecisionTreeClassifier class, as:

from sklearn.tree import DecisionTreeClassifier


model = DecisionTreeClassifier()
model.fit(x, y)
In the previous code, the default criterion used is 'gini'. That means the decision tree is created using the gini index as the criterion for splitting the data. We can also specify entropy as the criterion when creating the DecisionTreeClassifier object, as:

model = DecisionTreeClassifier(criterion='entropy')

Once the model is trained, suppose we want to predict whether to play cricket under the following conditions:

today = (Outlook=Sunny, Temperature=Hot, Humidity=High, Windy=FALSE)

In fact, this is the first row of our dataset. To pass this data to our model, we should first represent it in numeric form according to the LabelEncoder mapping, as:

today = (Outlook=2, Temperature=1, Humidity=0, Windy=0)

Now, pass this data to the predict() method of the model, as:

model.predict ([[2,1,0,0]]) # pass data as 2D array

Output:

The above output shows an array with a single element, 0. This 0 indicates 'No'. So, we cannot play cricket under the given weather conditions.

The total logic can be seen in Program 1. Please go through this program and observe how to
use DecisionTreeClassifier model.

Program

Program 1: Apply DecisionTreeClassifier Machine Learning Model to take a decision


whether to play cricket or not under given conditions.

# deciding to play cricket or not using a decision tree


import pandas as pd

# load the dataset


df = pd.read_csv('D:\\AI&ML\MRU-ML\datasets\\Machine Learning Datasets Updated\\30.
Decision tree\\cricket1.csv')
df
# let us convert the column data into numeric.
#This is done with LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# apply label encoder on all columns.


# The following conversion takes place
''' Outlook --> 0 Overcast, 1 Rainy, 2 Sunny
Temperature-> 0 Cool, 1 Hot, 2 Mild
Humidity --> 0 High, 1 Normal
Windy --> 0 False, 1 True
Play Cricket --> 0 No, 1 Yes
'''
df['Outlook_n'] = le.fit_transform(df['Outlook'])
df['Temp_n'] = le.fit_transform (df['Temperature'])
df['Humidity_n'] = le.fit_transform(df['Humidity'])
df['Windy_n'] = le.fit_transform (df['Windy'])
df['Play_n'] = le.fit_transform(df['Play Cricket'])
df

# delete cols with labels (or strings)


df = df.drop(['Outlook', 'Temperature', 'Humidity', 'Windy', 'Play Cricket'], axis ='columns')

# divide the data into x and y


x = df.drop(['Play_n'], axis='columns')
y = df['Play_n']

# create the DecisionTreeClassifier model


from sklearn.tree import DecisionTreeClassifier

# default criterion='gini'. We can use criterion='entropy' also.


model = DecisionTreeClassifier()
model.fit(x, y)

# predict whether to play cricket or not for the following data:


#today = (Outlook=sunny, Temperature=Hot, Humidity=High,
#windy=FALSE)
model.predict ([[2,1,0,0]]) # array ([0]) --> No
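If we also want to see the tree that the model learned (similar in spirit to Figure 30.1), scikit-learn's plot_tree() function can draw it. This is an optional addition of ours, assuming the fitted model and the x, y variables from Program 1 are still in scope.

from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
tree.plot_tree(model, feature_names=list(x.columns),
               class_names=['No', 'Yes'], filled=True)
plt.show()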

Dataset given: salaries.csv


We are given another task: to analyse the salaries of employees in various companies. This is a small dataset with 16 rows and 4 columns. The column names are: company, job, degree and salary_more_than_100k. Please observe the rows of this dataset in Figure 30.6. Observe the first row: the employee working in amazon as a project manager with a bachelors degree is earning more than 100,000 dollars per annum. The last column 'salary_more_than_100k' holds 0 or 1. Here, 0 means the salary is not more than 100k dollars and 1 means the salary is more than 100k dollars.

Figure 30.6: Salaries of employees' dataset

The decision tree for this data is shown in Figure 30.7, where the root node is 'company'. Why did we start with 'company' as the root node? The reason is that this column has low entropy and hence high information gain. When the information gain is high, the decision tree model can split the nodes properly. Alternatively, we can say that the gini impurities are lower for the 'company' node. Hence it became the root node.
Figure 30.7: Decision tree for salaries dataset

We should take the first 3 columns (company, job and degree) as inputs and the 4th column (salary_more_than_100k) as the target column.

inputs = df.drop('salary_more_than_100k', axis='columns')

target = df['salary_more_than_100k']

The inputs contain textual data and hence they should be converted into numeric form using the LabelEncoder class. So, create an object of the LabelEncoder class as:

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()

Convert the columns of the 'inputs' object into numeric form using the fit_transform() method, as:

inputs['company_n'] = le.fit_transform(inputs['company'])
inputs['job_n'] = le.fit_transform(inputs['job'])
inputs['degree_n'] = le.fit_transform(inputs['degree'])

Then drop the original text columns and keep only the numeric columns in a new object called inputs_n:

inputs_n = inputs.drop(['company', 'job', 'degree'], axis='columns')

Since the data is ready, we can create the decision tree by creating an object of the DecisionTreeClassifier class, as:

from sklearn.tree import DecisionTreeClassifier

# default criterion='gini'; we can use criterion='entropy' also
model = DecisionTreeClassifier()

To train the decision tree on the data, we can use fit() method as:

model.fit(inputs_n, target)

Once the model is trained, it is ready to be used on new data. We can obtain predictions by calling the predict() method on the model object. For example, take the following data:

data = (company=jp morgan, job=project manager, degree=bachelors)

That means we have to predict whether an employee who is working in J.P. Morgan as a project manager and holding a bachelors degree earns more than 100k. When this data is represented by the corresponding numeric values, it becomes:

data = (1, 1, 0)

This data should be passed to the predict() method in the form of a 2D array, as:

model.predict([[1, 1, 0]])

The two square brackets around the elements indicate that it is a 2D array.

Output:

The predict() method produces output in the form of a 1D array. The 0th element of this array is 0; this is the result. A 0 in the target column represents 'No'. That means this employee does not earn more than 100k; he may earn a salary less than or equal to 100k. The complete code is shown in Program 2. Please go through it.

Program
Program 2: Create a Python program using a decision tree machine learning model to analyse the employee salary data of various companies and then predict whether a new employee earns more than 100k.

# predicting whether an employee earns more than 100k using a decision tree


import pandas as pd

#load the dataset


df= pd.read_csv('D:\\AI&ML\MRU-ML\datasets\\Machine Learning Datasets Updated\\30.
Decision tree\\salaries.csv')
df

#drop the target column and take others as inputs


inputs =df.drop('salary_more_than_100k', axis='columns')
# take only target column separately
target = df['salary_more_than_100k']

# let us convert the column data into numeric.


# this is done with LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

'''Encoding done by LabelEncoder:


company --> 0 amazon, 1 jp morgan, 2 microsoft
job -> 0 programmer, 1 project manager, 2 sales executive
degree--> 0 bachelors, 1 masters
'''
inputs['company_n'] = le.fit_transform(inputs['company'])
inputs['job_n'] = le.fit_transform(inputs ['job'])
inputs['degree_n'] = le.fit_transform(inputs ['degree'])
inputs

# delete cols with 1abels (or strings)


# keep only cols with numeric values
inputs_n = inputs.drop (['company', 'job', 'degree'], axis ='columns')
inputs_n

# create the model


from sklearn.tree import DecisionTreeClassifier
# default criterion='gini'. we can use criterion='entropy’
model = DecisionTreeClassifier()
model.fit(inputs_n, target)

# predict for an employee working in jp morgan as a project manager
# with a bachelors degree
model.predict([[1,1,0]])  # array([0]) --> not more than 100k
# predict for an employee working in microsoft as a programmer with a bachelors degree
model.predict([[2,0,0]])  # array([1]) --> more than 100k
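To confirm that the learned tree really splits on the company column first, we can print a text version of the tree with export_text(). This is an optional check of ours, assuming model and inputs_n from Program 2 are in scope; the first split is expected to be on company_n.

from sklearn.tree import export_text

print(export_text(model, feature_names=list(inputs_n.columns)))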

Please execute this program in the Spyder IDE line by line or block by block and observe how the results are displayed. Also, observe the data in the variables by clicking on the 'Variable Explorer' tab in Spyder.

Points to Remember
 The Decision Tree model gives the final output after checking various conditions and following various paths.
 A Decision Tree has a 'root node' at the top, giving rise to several 'nodes' at the next levels and finally 'leaf nodes'.
 Entropy is the measurement of the randomness of data. It represents the level of impurities in the data.
 Gini index is the direct measure of impurities in the data.
 The root node of the Decision Tree is selected based on either Entropy or the Gini index.
 The node with the highest information gain should be selected as the root node for the Decision Tree.
 The Entropy values lie in the range 0 to 1, whereas the Gini values lie between 0 and 0.5.
