Unit-4-DECISION TREES
Suppose we want to play cricket today. We should first consider weather conditions
that may or may not permit us to play cricket. For example, if it is raining today, we may not
be able to play cricket. If the temperature is too hot, then we may not play cricket. But if it is
an overcast day where clouds can be seen and the weather is cool, we may play cricket. In
this manner, our decision of playing or not playing cricket depends on the weather conditions. So, what are we doing here? We are taking a final decision depending on
certain conditions or rules. This is what a decision tree does.
A decision tree is a Machine Learning model that gives the decision after considering
certain conditions. It will partition (or divide) the data based on the conditions to arrive at
the correct decision. The decision tree looks like a tree structure, as shown in Figure 30.1:
In Figure 30.1, we show a decision tree that represents the data that helps to decide whether to play cricket or not under the given weather conditions. In the decision tree, the topmost node is called the 'root node'. For example, Outlook is the root node that represents the outlook of the day. This node may have values called 'attributes'. For example, Outlook may be Sunny, Overcast or Rainy, and these values are called attributes. These attributes can be imagined like the branches of a tree.
Below the root node, we can have another node. For example, 'Windy' is another
node. This node may also have values called attributes. For example, Windy may be TRUE
or FALSE. That means there may or may not be wind on that day. These TRUE or FALSE values
become attributes of that node. In this manner, there may be several nodes descending from
each other. Each node may have attributes.
Finally, there would be a decision in the form of 'Yes' or 'No'. These are called leaf nodes. For example, if it is a windy day (TRUE), we may not be able to play cricket. So, the final decision is represented as 'No'. If it is not a windy day (FALSE), then we may play cricket. Thus a 'Yes' or 'No' decision can be made finally, and these will become the last nodes in the tree. These last nodes are called leaf nodes.
A decision tree arrives at a final decision after checking several conditions. For example, if Outlook is Sunny and Windy is TRUE, then it gives the output 'No'. If Outlook is Sunny and Windy is FALSE, then it gives the output 'Yes'. Similarly, if Outlook is Overcast, then it will check whether Windy is TRUE or FALSE. If Outlook is Rainy, then it will again check whether Windy is TRUE or FALSE. Depending on all these tests, it will provide the output as 'Yes' or 'No'. These checking paths look like the branches of a tree.
In Figure 30.1, we did not show the complete decision tree as it becomes a bit
complicated if we take all the nodes and attributes into consideration. However, the total data
related to this decision tree is given in Figure 30.2:
Observe the decision tree in Figure 30.1. While drawing this decision tree, why did we represent Outlook as the root node? If we observe the data, we can see that there are other columns like Temperature, Humidity and Windy. They also contribute to the final decision. So, why not take one of those other columns as the root node?
The question of which column should be taken as the root node depends on either 'entropy' or 'gini index'. Entropy represents the randomness of data. When the entropy is more, the randomness is more. That means the data points are distributed here and there. It also indicates impurities in the data, which keep the data points apart. Please see Figure 30.3.
Gini index is a measurement of impurities in the data. There may be abnormal values, or there may be values which cause confusion in taking the decision. Both the entropy and the gini index represent impurities in data. Whereas entropy indirectly represents the level of impurities in the data, the gini index directly measures the impurities in the data.
Entropy
Entropy is the measure of randomness of data. When the data is more random, the data points are far apart. That means some impurities are present in the data which are keeping the data points apart. When entropy is high, randomness is high and hence there are more impurities. In this case, the output may not be accurate. When entropy is low, randomness is low and hence the data is close together without many impurities. Such data is useful for the decision tree to make correct decisions.
When the entropy value is low, the data contains fewer impurities. When the impurities are fewer, the data provides more useful information to the decision tree algorithm. This is called
information gain. This is the reason entropy is generally applied on each node in the decision
tree to calculate information gain. When the information gain is highest for a node, it should
be taken as root node. Below the root node, we will represent that node which has a bit less
information gain. In this manner, entropy helps to decide the root node and other nodes that
are represented in the subsequent levels. The formula for calculating entropy is:
E(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
where E(S) represents the entropy of the sample space, P(yes) represents the probability of 'Yes' and P(no) represents the probability of 'No'. The sample space S indicates all the data points.
If the number of 'Yes' rows and the number of 'No' rows are equal, then P(yes) and P(no) will be equal.
Since the total probability is always 1, P(yes) = P(no) = 0.5. In this case,
E(S) = -0.5 log2 0.5 - 0.5 log2 0.5
= -0.5(-1) - 0.5(-1)
= 0.5 + 0.5 = 1
If the sample space contains all 'Yes' rows, that means there are no 'No' rows. Since the total probability is 1, we have to take P(yes) as 1. Now,
E(S) = -P(yes) log2 P(yes)
= -1 (0)
= 0
Similarly, if there are only 'No' rows and there are no rows with 'Yes', then also E(S) = 0.
Suppose we want to calculate the value of log2 0.5. This is equal to log 0.5 / log 2. Hence, click on 0.5 and then the 'log' button in the calculator. It shows -0.3010 with several fraction digits. Then click on the division (÷) symbol, type 2, and then press the 'log' button. It shows 0.3010 with several fraction digits. Then click on the 'equal' (=) symbol to see the result. It will show -1. Therefore, the value of log2 0.5 is -1. See Figure 30.4.
Let us take another example: calculating the value of log2 (9/14). This is equal to log (9/14) / log 2. First click on 9, then the division symbol, then 14, and then press the 'equal' button. It shows the value of 9/14. Then click on the 'log' button. This gives the value of log (9/14). Now we are left with the denominator. So, click on the division symbol, then click on 2 and then the 'log' button. Then click on the 'equal' button to see the final result, i.e. -0.6374.
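Instead of a scientific calculator, the same log values can be verified with a few lines of Python. This is only a quick check using the standard math module:

import math

# log base 2 of 0.5 using the change-of-base rule: log(0.5) / log(2)
print(math.log(0.5) / math.log(2))    # -1.0

# Python also provides log2() directly
print(math.log2(9 / 14))              # about -0.6374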
Figure 30.4: Calculating log base 2 values using scientific calculator in Windows
With this knowledge of using the scientific calculator, let us now calculate entropy and information gain for the dataset presented in Figure 30.2.
Outlook has 3 different attributes: Sunny, Overcast and Rainy. In case of Outlook=Sunny, count how many rows are contributing to 'Yes' and how many are for 'No'.
Total rows where Outlook=Sunny are 5. The number of rows with 'Yes' = 2 and with 'No' = 3.
So, Entropy (Outlook=Sunny) = -2/5 log2 (2/5) - 3/5 log2 (3/5) = 0.971
In case of Outlook=Overcast, count how many rows are contributing to 'Yes' and how many are for 'No'.
Total rows where Outlook=Overcast are 4. The number of rows with 'Yes' = 4 and with 'No' = 0.
So, Entropy (Outlook=Overcast) = 0, since all the rows are 'Yes'.
In case of Outlook=Rainy, count how many rows are contributing to 'Yes' and how many are for 'No'.
Total rows where Outlook=Rainy are 5. The number of rows with 'Yes' = 3 and with 'No' = 2.
So, Entropy (Outlook=Rainy) = -3/5 log2 (3/5) - 2/5 log2 (2/5) = 0.971
The whole dataset has 9 'Yes' rows and 5 'No' rows out of 14, so its entropy is E(S) = -9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.940. The weighted entropy of the Outlook column is (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693. Therefore, Information Gain (Outlook) = 0.940 - 0.693 = 0.247.
In the previous steps, we calculated the Information Gain for the Outlook node. Similarly, if we calculate it for the other nodes, we will have the results shown in Table 30.1:
Please observe Table 30.1. The highest information gain (IG) value (0.247) is seen for the 'Outlook' column. Hence this column should be selected as the root node. The next highest value (0.152) is seen for the 'Humidity' column. Hence this column becomes the node at the next level. In this manner, entropy is used by the decision tree algorithm to decide which columns should be used as nodes at different levels.
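The entropy and information gain calculations shown above can also be reproduced in Python. This is only a small sketch; the class counts are the ones taken from Figure 30.2 in the discussion above:

import math

def entropy(yes, no):
    # entropy of a group that has the given number of 'Yes' and 'No' rows
    total = yes + no
    result = 0.0
    for count in (yes, no):
        p = count / total
        if p > 0:                      # a zero-probability term contributes nothing
            result -= p * math.log2(p)
    return result

e_s = entropy(9, 5)                    # entropy of the full dataset, about 0.940
e_sunny = entropy(2, 3)                # about 0.971
e_overcast = entropy(4, 0)             # 0.0
e_rainy = entropy(3, 2)                # about 0.971

weighted = (5/14) * e_sunny + (4/14) * e_overcast + (5/14) * e_rainy
print(round(e_s - weighted, 3))        # information gain of Outlook, about 0.247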
Gini index
Gini index is a direct measurement of impurities in the data. When the gini index value is high, the impurities are high. When it is low, the impurities are low. Hence, we should take the column having the lowest gini index as the root node.
The formula for calculating the gini index of a class is:
Gini = 1 - (Probability of Yes)² - (Probability of No)²
We will apply this formula and calculate the gini index for each of the columns in the dataset. First of all, we will note down the total number of rows and how many of them are 'Yes' and how many are 'No' for each attribute.
Sunny class contains 5 rows. Among them, there are 2 Yes and 3 Nos. So, Gini (Sunny) = 1 - (2/5)² - (3/5)² = 0.48.
Overcast class contains 4 rows and there are 4 Yes and 0 Nos. So, Gini (Overcast) = 1 - (4/4)² - (0/4)² = 0.
Rainy class contains 5 rows and there are 3 Yes and 2 Nos. So, Gini (Rainy) = 1 - (3/5)² - (2/5)² = 0.48.
If n is the total number of rows in the dataset, then the gini index of Outlook is:
Gini (Outlook) = (Sunny rows/n) x Gini (Sunny) + (Overcast rows/n) x Gini (Overcast) + (Rainy rows/n) x Gini (Rainy)
= (5/14)(0.48) + (4/14)(0) + (5/14)(0.48) = 0.3429
In the same manner, let us calculate the gini index for the other columns also. The results are shown in Table 30.2:
Table 30.2: Gini Index values for the columns of the dataset
Among all the columns, the gini index of the Outlook column is the lowest (0.3429). That means there are very few impurities. Hence, we select Outlook as our root node. The next lowest value (0.3674) can be seen for Humidity. Hence, this column should be taken as the second-level node.
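The gini index calculation for the Outlook column can be verified in the same way. A small Python sketch using the counts given above:

def gini(yes, no):
    # gini impurity of a group with the given number of 'Yes' and 'No' rows
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

g_sunny = gini(2, 3)                   # 0.48
g_overcast = gini(4, 0)                # 0.0
g_rainy = gini(3, 2)                   # 0.48

gini_outlook = (5/14) * g_sunny + (4/14) * g_overcast + (5/14) * g_rainy
print(round(gini_outlook, 4))          # about 0.3429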
Both entropy and gini are used to decide which node should be taken as the root node of the tree and which nodes should be taken in the subsequent levels. But if we compare both the methods, then gini impurity is more efficient than entropy in terms of computing power. Please remember that the term 'computing power' indicates the processor time and memory.
The entropy values will be in the range of 0 to 1, whereas the gini values lie between 0 and 0.5. Please see Figure 30.5, where the entropy values increase up to 1 and then start decreasing. But in case of gini, it goes up only to 0.5 and then starts decreasing. Hence gini requires less computational power.
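The two curves of Figure 30.5 can be traced with a short Python loop. This sketch simply prints the entropy and gini values as the probability of 'Yes' is varied from 0 to 1:

import math

for i in range(11):
    p = i / 10                          # probability of 'Yes'
    if p in (0.0, 1.0):
        e = 0.0                         # entropy is 0 for a pure group
    else:
        e = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    g = 1 - p ** 2 - (1 - p) ** 2
    # entropy peaks at 1.0 and gini peaks at 0.5, both when p = 0.5
    print(f"p = {p:.1f}  entropy = {e:.3f}  gini = {g:.3f}")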
Decision Tree
A decision tree is a machine learning model that contains the logic regarding how to split the data based on some conditions and finally make conclusions. It internally creates a tree structure with the columns of the dataset as nodes at various levels. The final nodes provide 'Yes' or 'No' type decisions or conclusions.
A decision tree internally uses the entropy or gini concept to decide the hierarchy of nodes starting from the root node. Let us see how to apply a decision tree on weather conditions to decide whether to play cricket or not.
Dataset given: cricket1.csv
This dataset has 14 rows and 5 columns. The column names are: Outlook, Temperature, Humidity, Windy and Play Cricket. The last column is the target column that shows either 'Yes' or 'No'. The total dataset is shown in Figure 30.2.
Since the total data is in the form of strings (or text), we have to convert all the columns into numeric form. For this purpose, we can use LabelEncoder. LabelEncoder simply assigns 0, 1, 2, etc. to each category. For example, when we apply LabelEncoder on the Outlook column, the 3 attributes of the Outlook column are converted as follows: LabelEncoder will replace Overcast with 0, Rainy with 1 and Sunny with 2. Thus, they are converted from text to numeric. A LabelEncoder can be created by creating an object of the LabelEncoder class, as:
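The original code listing is not reproduced in this extract. The following is a minimal sketch of how this step is usually written with scikit-learn; the '_n' suffix for the encoded columns is only an assumed naming convention:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('cricket1.csv')        # dataset shown in Figure 30.2

le = LabelEncoder()                     # object of the LabelEncoder class

# add a numeric copy of every text column to the data frame
for col in df.columns:
    df[col + '_n'] = le.fit_transform(df[col])

print(df.head())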
The last 5 columns in the data frame represent the converted columns that contain numeric data. In this data frame, we can understand that each column was encoded by the LabelEncoder in the same way as Outlook, i.e. the attribute values of every column are replaced by the codes 0, 1, 2, etc.
Now consider the first row in our dataset. To pass this data to our model, we should first represent it in numeric form according to the LabelEncoder codes, and then pass it to the predict() method of the model, as:
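The exact code and row values are not shown in this extract; the sketch below assumes the first row of the dataset is Outlook=Sunny, Temperature=Hot, Humidity=High, Windy=FALSE and that the model was trained on the four encoded weather columns:

# assumed encoded form of the first row: Sunny=2, Hot=1, High=0, FALSE=0
data = [2, 1, 0, 0]
model.predict([data])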
Output:
The above output shows an array with the single element 0. This 0 indicates 'No'. So, we cannot play cricket under the given weather conditions.
The total logic can be seen in Program 1. Please go through this program and observe how to use the DecisionTreeClassifier model.
Program
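Program 1 itself is not reproduced in this extract. A minimal sketch of the logic it describes, assuming the file name cricket1.csv and the '_n' column-naming convention used above:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# load the weather dataset of Figure 30.2
df = pd.read_csv('cricket1.csv')

# convert every text column into numeric codes
le = LabelEncoder()
for col in df.columns:
    df[col + '_n'] = le.fit_transform(df[col])

# encoded weather columns are the inputs; the encoded Play Cricket column is the target
inputs_n = df[['Outlook_n', 'Temperature_n', 'Humidity_n', 'Windy_n']]
target = df['Play Cricket_n']

# create and train the decision tree
model = DecisionTreeClassifier()
model.fit(inputs_n, target)

# predict for one day's weather (the encoded values are illustrative)
print(model.predict([[2, 1, 0, 0]]))   # 0 means 'No', 1 means 'Yes'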
Let us now take another example: a dataset of employee salaries in various companies, with the columns company, job, degree and salary_more_than_100k. The decision tree for this data is shown in Figure 30.7, where the root node is 'company'. Why did we start with 'company' as the root node? The reason is that this column has low entropy and hence high information gain. When the information gain is high, the decision tree model can split the nodes properly. Alternately, we can say that the gini impurities are less for the 'company' node. Hence this became the root node.
Figure 30.7: Decision tree for salaries dataset
We should take the first 3 columns (company, job and degree) as inputs and the 4th column (salary_more_than_100k) as the target column.
The inputs represent textual data and hence they should be converted into numeric form using the LabelEncoder class. So, create an object of the LabelEncoder class and convert the columns in the 'inputs' object into numeric form using the fit_transform() method on them, as:
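The listing is not shown in this extract. A minimal sketch, assuming the data frame is loaded from a file named salaries.csv and the three input columns are kept in an object named inputs:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('salaries.csv')        # assumed file name
inputs = df[['company', 'job', 'degree']]
target = df['salary_more_than_100k']

# one LabelEncoder per column, so that each column gets its own codes
le_company, le_job, le_degree = LabelEncoder(), LabelEncoder(), LabelEncoder()

inputs_n = inputs.copy()
inputs_n['company'] = le_company.fit_transform(inputs['company'])
inputs_n['job'] = le_job.fit_transform(inputs['job'])
inputs_n['degree'] = le_degree.fit_transform(inputs['degree'])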
Since the data is ready, we can create the decision tree by creating an object of the DecisionTreeClassifier class, as:
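A minimal sketch, assuming DecisionTreeClassifier is imported from sklearn.tree:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()        # decision tree model object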
To train the decision tree on the data, we can use the fit() method as:
model.fit(inputs_n, target)
Once the model is trained, it is ready to be used on new data. We can obtain predictions by calling the predict() method on the model object. For example, suppose we have to predict the salary of an employee who is working in J.P. Morgan as a project manager and has a bachelor's degree. When this data is represented by the corresponding numeric values, it will be:
data = (1,1,0)
This data should be passed to the predict() method in the form of a 2D array, as:
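A one-line sketch, assuming the trained model object is named model as above:

model.predict([[1, 1, 0]])              # returns an array such as array([0])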
The two square brackets around the elements represent that it is in the form of a 2D array.
Output:
The predict() method produces output in the form of a 1D array. The 0th element of this array is 0. This is the result. 0 in the target column represents 'No'. That means this employee will not get more than 100k as salary. He may get less than or equal to 100k salary. The total code is shown in Program 2. Please go through it.
Program
Program 2: Create a Python Program using a decision tree machine learning model to
analyse employee salary data of various companies and then predict the salary of a new
employee
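Program 2 itself is not reproduced in this extract. The following is a minimal sketch of the logic described above; the file name salaries.csv is an assumption:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# load the employee salary dataset
df = pd.read_csv('salaries.csv')

inputs = df[['company', 'job', 'degree']]        # input columns
target = df['salary_more_than_100k']             # target column

# convert each text column into numeric codes
le_company, le_job, le_degree = LabelEncoder(), LabelEncoder(), LabelEncoder()
inputs_n = inputs.copy()
inputs_n['company'] = le_company.fit_transform(inputs['company'])
inputs_n['job'] = le_job.fit_transform(inputs['job'])
inputs_n['degree'] = le_degree.fit_transform(inputs['degree'])

# create and train the decision tree
model = DecisionTreeClassifier()
model.fit(inputs_n, target)

# predict for the new employee encoded as (1, 1, 0) as discussed above
print(model.predict([[1, 1, 0]]))                # 0 means 'No', 1 means 'Yes'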
Please execute this program in the Spyder IDE line by line or block by block and observe how the results are displayed. Also, observe the data in the variables by clicking on the 'Variable explorer' tab in Spyder.
Points to Remember
Decision Tree model gives the final output after checking various conditions and following various paths.
Decision Tree will have a 'root node' at the top, giving rise to several 'nodes' in the next levels and finally 'leaf nodes'.
Entropy is the measurement of randomness of data. It represents the level of
impurities in the data.
Gini index is the direct measure of impurities in the data.
Root node of the Decision tree is selected based on either Entropy or Gini index.
The node with the highest information gain should be selected as the root node for the Decision tree.
The Entropy values will be in the range of 0 to 1, whereas the Gini values lie between
0 and 0.5.