Unit-4-DECISION TREES
Suppose we want to play cricket today. We should first consider weather conditions
that may or may not permit us to play cricket. For example, if it is raining today, we may not
be able to play cricket. If the temperature is too hot, then we may not play cricket. But if it is
an overcast day where clouds can be seen and the weather is cool, we may play cricket. In
this manner, our decision of playing or not playing cricket depends on the weather conditions. So, what are we doing here? We are taking a final decision depending on
certain conditions or rules. This is what a decision tree does.
A decision tree is a Machine Learning model that gives the decision after considering
certain conditions. It will partition (or divide) the data based on the conditions to arrive at
the correct decision. The decision tree looks like a tree structure, as shown in Figure 30.1:
In Figure 30.1, we show a decision tree that represents the data that helps to decide whether to play cricket or not under the given weather conditions. In the decision tree, the topmost node is called the 'root node'. For example, Outlook is the root node that represents the outlook of the day. This node may have values called 'attributes'. For example, Outlook may be Sunny, Overcast or Rainy, and these values are called attributes. These attributes can be imagined like the branches of a tree.
Below the root node, we can have another node. For example, 'Windy' is another
node. This node may also have values called attributes. For example, Windy may be TRUE
or FALSE. That means there may or may not be wind on that day. These TRUE or FALSE values
become attributes of that node. In this manner, there may be several nodes descending from
each other. Each node may have attributes.
Finally, there would be a decision in the form of 'Yes' or 'No'. These are called leaf nodes. For example, if it is a windy day (TRUE), we may not be able to play cricket. So, the final decision is represented as 'No'. If it is not a windy day (FALSE), then we may play cricket. Thus a 'Yes' or 'No' decision can be made finally, and these will become the last nodes in the tree. These last nodes are called leaf nodes.
A decision tree arrives at a final decision after checking several conditions. For example, if Outlook is Sunny and Windy is TRUE, then it gives the output 'No'. If Outlook is Sunny and Windy is FALSE, then it gives the output 'Yes'. Similarly, if Outlook is Overcast, then it will check whether Windy is TRUE or FALSE. If Outlook is Rainy, then it will again check whether Windy is TRUE or FALSE. Depending on all these tests, it will provide the output as 'Yes' or 'No'. These checking paths look like the branches of a tree.
In Figure 30.1, we did not show the complete decision tree as it becomes a bit
complicated if we take all the nodes and attributes into consideration. However, the total data
related to this decision tree is given in Figure 30.2:
Observe the decision tree in Figure 30.1. While drawing this decision tree, why did we represent Outlook as the root node? If we observe the data, we can see that there are other columns like Temperature, Humidity and Windy. They also contribute to the final decision. So, why not take one of those other columns as the root node?
The question of which column should be taken as the root node depends on either 'entropy' or 'gini index'. Entropy represents the randomness of data. When the entropy is more, the randomness is more. That means the data points are distributed here and there. It also indicates impurities in the data, which keep the data points apart. Please see Figure 30.3.
Gini index is a measurement of impurities in the data. There may be abnormal values, or there may be values which cause confusion in taking the decision. Both the entropy and the gini index represent impurities in data. Whereas entropy indirectly represents the level of impurities in the data, the gini index directly measures the impurities in the data.
Entropy
Entropy is the measure of randomness of data. When the data is more random, the data points are far apart. That means some impurities are present in the data which are keeping the data points apart. When entropy is high, randomness is high and hence there are more impurities. In this case, the output may not be accurate. When entropy is low, randomness is low and hence the data is close together without many impurities. Such data is useful for the decision tree to make correct decisions.
When the entropy value is low, the data contains fewer impurities. When the impurities are fewer, the data provides more useful information to the decision tree algorithm. This is called
information gain. This is the reason entropy is generally applied on each node in the decision
tree to calculate information gain. When the information gain is highest for a node, it should
be taken as root node. Below the root node, we will represent that node which has a bit less
information gain. In this manner, entropy helps to decide the root node and other nodes that
are represented in the subsequent levels. The formula for calculating entropy is:
E(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
where E(S) represents the entropy of the sample space, P(yes) represents the probability of 'Yes' and P(no) represents the probability of 'No'. The sample space S indicates all the data points.
If the number of 'Yes' rows and the number of 'No' rows are equal, then P(yes) and P(no) will be equal.
Since the total probability is always 1, P(yes) = P(no) = 0.5. In this case,
E(S) = -0.5 log2 0.5 - 0.5 log2 0.5
= -0.5(-1) - 0.5(-1)
= 0.5 + 0.5 = 1
If the sample space contains all 'Yes' rows, that means there are no 'No' rows. Since the total probability is 1, we have to take P(yes) as 1. Now,
E(S) = -P(yes) log2 P(yes)
= -1 (0)
= 0
Similarly, if there are only 'No' rows and there are no rows with 'Yes', then also E(S) = 0.
Suppose we want to calculate the value of log2 0.5. This is equal to log 0.5 / log 2. Hence, click on 0.5 and then the 'log' button in the calculator. It shows -0.3010 with several fraction digits. Then click on the division (÷) symbol, type 2, and then press the 'log' button. It shows 0.3010 with several fraction digits. Then click on the 'equal' (=) symbol to see the result. It will show -1. Therefore, the value of log2 0.5 is -1. See Figure 30.4.
Let us take another example: calculating the value of log2 (9/14). This is equal to log (9/14) / log 2. First click on 9, then the division symbol, then 14, and then press the 'equal' button. It shows the value of 9/14. Then click on the 'log' button. This gives the value of log (9/14). Now we are left with the denominator. So, click on the division symbol, then click on 2 and then the 'log' button. Then click on the 'equal' button to see the final result, i.e. -0.6374.
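Instead of a scientific calculator, the same log values can be verified with a few lines of Python. This is only a quick check using the standard math module:

import math

# log base 2 of 0.5 using the change-of-base rule: log(0.5) / log(2)
print(math.log(0.5) / math.log(2))    # -1.0

# Python also provides log2() directly
print(math.log2(9 / 14))              # about -0.6374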
Figure 30.4: Calculating log base 2 values using scientific calculator in Windows
With this knowledge of using the scientific calculator, let us now calculate entropy and information gain for the dataset presented in Figure 30.2.
Outlook has 3 different attributes: Sunny, Overcast and Rainy. In case of Outlook=Sunny, count how many rows are contributing to 'Yes' and how many are for 'No'.
Total rows where Outlook=Sunny are 5. The number of rows with 'Yes' = 2 and with 'No' = 3.
So, Entropy (Outlook=Sunny) = -2/5 log2 (2/5) - 3/5 log2 (3/5) = 0.971
In case of Outlook=Overcast, count how many rows are contributing to 'Yes' and how many are for 'No'.
Total rows where Outlook=Overcast are 4. The number of rows with 'Yes' = 4 and with 'No' = 0.
So, Entropy (Outlook=Overcast) = 0, since all the rows are 'Yes'.
In case of Outlook=Rainy, count how many rows are contributing to 'Yes' and how many are for 'No'.
Total rows where Outlook=Rainy are 5. The number of rows with 'Yes' = 3 and with 'No' = 2.
So, Entropy (Outlook=Rainy) = -3/5 log2 (3/5) - 2/5 log2 (2/5) = 0.971
The whole dataset has 9 'Yes' rows and 5 'No' rows out of 14, so its entropy is E(S) = -9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.940. The weighted entropy of the Outlook column is (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693. Therefore, Information Gain (Outlook) = 0.940 - 0.693 = 0.247.
In the previous steps, we calculated the Information Gain for the Outlook node. Similarly, if we calculate it for the other nodes, we will have the results shown in Table 30.1:
Please observe Table 30.1. The highest information gain (IG) value (0.247) is seen for the 'Outlook' column. Hence this column should be selected as the root node. The next highest value (0.152) is seen for the 'Humidity' column. Hence this column becomes the node at the next level. In this manner, entropy is used by the decision tree algorithm to decide which columns should be used as nodes at different levels.
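The entropy and information gain calculations shown above can also be reproduced in Python. This is only a small sketch; the class counts are the ones taken from Figure 30.2 in the discussion above:

import math

def entropy(yes, no):
    # entropy of a group that has the given number of 'Yes' and 'No' rows
    total = yes + no
    result = 0.0
    for count in (yes, no):
        p = count / total
        if p > 0:                      # a zero-probability term contributes nothing
            result -= p * math.log2(p)
    return result

e_s = entropy(9, 5)                    # entropy of the full dataset, about 0.940
e_sunny = entropy(2, 3)                # about 0.971
e_overcast = entropy(4, 0)             # 0.0
e_rainy = entropy(3, 2)                # about 0.971

weighted = (5/14) * e_sunny + (4/14) * e_overcast + (5/14) * e_rainy
print(round(e_s - weighted, 3))        # information gain of Outlook, about 0.247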
Gini index
Gini index is a direct measurement of impurities in the data. When the gini index value is high, the impurities are high. When it is low, the impurities are low. Hence, we should take the column having the lowest gini index as the root node.
The formula for calculating the gini index of a class is:
Gini = 1 - (Probability of Yes)² - (Probability of No)²
We will apply this formula and calculate the gini index for each of the columns in the dataset. First of all, we will note down the total number of rows and how many of them are 'Yes' and how many are 'No' for each attribute.
Sunny class contains 5 rows. Among them, there are 2 Yes and 3 Nos. So, Gini (Sunny) = 1 - (2/5)² - (3/5)² = 0.48.
Overcast class contains 4 rows and there are 4 Yes and 0 Nos. So, Gini (Overcast) = 1 - (4/4)² - (0/4)² = 0.
Rainy class contains 5 rows and there are 3 Yes and 2 Nos. So, Gini (Rainy) = 1 - (3/5)² - (2/5)² = 0.48.
If n is the total number of rows in the dataset, then the gini index of Outlook is:
Gini (Outlook) = (Sunny rows/n) x Gini (Sunny) + (Overcast rows/n) x Gini (Overcast) + (Rainy rows/n) x Gini (Rainy)
= (5/14)(0.48) + (4/14)(0) + (5/14)(0.48) = 0.3429
In the same manner, let us calculate the gini index for the other columns also. The results are shown in Table 30.2:
Table 30.2: Gini Index values for the columns of the dataset
Among all the columns, the gini index of the Outlook column is the lowest (0.3429). That means there are very few impurities. Hence, we select Outlook as our root node. The next lowest value (0.3674) can be seen for Humidity. Hence, this column should be taken as the second-level node.
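The gini index calculation for the Outlook column can be verified in the same way. A small Python sketch using the counts given above:

def gini(yes, no):
    # gini impurity of a group with the given number of 'Yes' and 'No' rows
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

g_sunny = gini(2, 3)                   # 0.48
g_overcast = gini(4, 0)                # 0.0
g_rainy = gini(3, 2)                   # 0.48

gini_outlook = (5/14) * g_sunny + (4/14) * g_overcast + (5/14) * g_rainy
print(round(gini_outlook, 4))          # about 0.3429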
Both entropy and gini are used to decide which node should be taken as the root node of the tree and which nodes should be taken in the subsequent levels. But if we compare both the methods, then gini impurity is more efficient than entropy in terms of computing power. Please remember that the term 'computing power' indicates the processor time and memory.
The entropy values will be in the range of 0 to 1, whereas the gini values lie between 0 and 0.5. Please see Figure 30.5, where the entropy values increase up to 1 and then start decreasing. But in case of gini, it goes up only to 0.5 and then starts decreasing. Hence gini requires less computational power.
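The two curves of Figure 30.5 can be traced with a short Python loop. This sketch simply prints the entropy and gini values as the probability of 'Yes' is varied from 0 to 1:

import math

for i in range(11):
    p = i / 10                          # probability of 'Yes'
    if p in (0.0, 1.0):
        e = 0.0                         # entropy is 0 for a pure group
    else:
        e = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    g = 1 - p ** 2 - (1 - p) ** 2
    # entropy peaks at 1.0 and gini peaks at 0.5, both when p = 0.5
    print(f"p = {p:.1f}  entropy = {e:.3f}  gini = {g:.3f}")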
Decision Tree
A decision tree is a machine learning model that contains the logic regarding how to split the data based on some conditions and finally make conclusions. It internally creates a tree structure with the columns of the dataset as nodes at various levels. The final nodes provide 'Yes' or 'No' type decisions or conclusions.
A decision tree internally uses the entropy or gini concept to decide the hierarchy of nodes starting from the root node. Let us see how to apply a decision tree on weather conditions to decide whether to play cricket or not.
Dataset given: cricket1.csv
This dataset has 14 rows and 5 columns. The column names are: Outlook, Temperature, Humidity, Windy and Play Cricket. The last column is the target column that shows either 'Yes' or 'No'. The total dataset is shown in Figure 30.2.
Since the total data is in the form of strings (or text), we have to convert all the columns into numeric form. For this purpose, we can use LabelEncoder. LabelEncoder simply assigns 0, 1, 2, etc. to each category. For example, when we apply LabelEncoder on the Outlook column, the 3 attributes of the Outlook column are converted as follows: LabelEncoder will replace Overcast with 0, Rainy with 1 and Sunny with 2. Thus, they are converted from text to numeric. A LabelEncoder can be created by creating an object of the LabelEncoder class, as:
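The original code listing is not reproduced in this extract. The following is a minimal sketch of how this step is usually written with scikit-learn; the '_n' suffix for the encoded columns is only an assumed naming convention:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('cricket1.csv')        # dataset shown in Figure 30.2

le = LabelEncoder()                     # object of the LabelEncoder class

# add a numeric copy of every text column to the data frame
for col in df.columns:
    df[col + '_n'] = le.fit_transform(df[col])

print(df.head())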
The last 5 columns in the data frame represent the converted columns that contain numeric data. In this data frame, we can understand that each column was encoded by the LabelEncoder in the same way as Outlook, i.e. the attribute values of every column are replaced by the codes 0, 1, 2, etc.
Now consider the first row in our dataset. To pass this data to our model, we should first represent it in numeric form according to the LabelEncoder codes, and then pass it to the predict() method of the model, as:
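The exact code and row values are not shown in this extract; the sketch below assumes the first row of the dataset is Outlook=Sunny, Temperature=Hot, Humidity=High, Windy=FALSE and that the model was trained on the four encoded weather columns:

# assumed encoded form of the first row: Sunny=2, Hot=1, High=0, FALSE=0
data = [2, 1, 0, 0]
model.predict([data])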
Output:
The above output shows an array with the single element 0. This 0 indicates 'No'. So, we cannot play cricket under the given weather conditions.
The total logic can be seen in Program 1. Please go through this program and observe how to use the DecisionTreeClassifier model.
Program
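Program 1 itself is not reproduced in this extract. A minimal sketch of the logic it describes, assuming the file name cricket1.csv and the '_n' column-naming convention used above:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# load the weather dataset of Figure 30.2
df = pd.read_csv('cricket1.csv')

# convert every text column into numeric codes
le = LabelEncoder()
for col in df.columns:
    df[col + '_n'] = le.fit_transform(df[col])

# encoded weather columns are the inputs; the encoded Play Cricket column is the target
inputs_n = df[['Outlook_n', 'Temperature_n', 'Humidity_n', 'Windy_n']]
target = df['Play Cricket_n']

# create and train the decision tree
model = DecisionTreeClassifier()
model.fit(inputs_n, target)

# predict for one day's weather (the encoded values are illustrative)
print(model.predict([[2, 1, 0, 0]]))   # 0 means 'No', 1 means 'Yes'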
Let us now take another example: a dataset of employee salaries in various companies, with the columns company, job, degree and salary_more_than_100k. The decision tree for this data is shown in Figure 30.7, where the root node is 'company'. Why did we start with 'company' as the root node? The reason is that this column has low entropy and hence high information gain. When the information gain is high, the decision tree model can split the nodes properly. Alternately, we can say that the gini impurities are less for the 'company' node. Hence this became the root node.
Figure 30.7: Decision tree for salaries dataset
We should take the first 3 columns (company, job and degree) as inputs and the 4th column (salary_more_than_100k) as the target column.
The inputs represent textual data and hence they should be converted into numeric form using the LabelEncoder class. So, create an object of the LabelEncoder class and convert the columns in the 'inputs' object into numeric form using the fit_transform() method on them, as:
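The listing is not shown in this extract. A minimal sketch, assuming the data frame is loaded from a file named salaries.csv and the three input columns are kept in an object named inputs:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('salaries.csv')        # assumed file name
inputs = df[['company', 'job', 'degree']]
target = df['salary_more_than_100k']

# one LabelEncoder per column, so that each column gets its own codes
le_company, le_job, le_degree = LabelEncoder(), LabelEncoder(), LabelEncoder()

inputs_n = inputs.copy()
inputs_n['company'] = le_company.fit_transform(inputs['company'])
inputs_n['job'] = le_job.fit_transform(inputs['job'])
inputs_n['degree'] = le_degree.fit_transform(inputs['degree'])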
Since the data is ready, we can create the decision tree by creating an object of the DecisionTreeClassifier class, as:
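A minimal sketch, assuming DecisionTreeClassifier is imported from sklearn.tree:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()        # decision tree model object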
To train the decision tree on the data, we can use the fit() method as:
model.fit(inputs_n, target)
Once the model is trained, it is ready to be used on new data. We can obtain predictions by calling the predict() method on the model object. For example, suppose we have to predict the salary of an employee who is working in J.P. Morgan as a project manager and has a bachelor's degree. When this data is represented by the corresponding numeric values, it will be:
data = (1,1,0)
This data should be passed to the predict() method in the form of a 2D array, as:
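A one-line sketch, assuming the trained model object is named model as above:

model.predict([[1, 1, 0]])              # returns an array such as array([0])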
The two square brackets around the elements represent that it is in the form of a 2D array.
Output:
The predict() method produces output in the form of a 1D array. The 0th element of this array is 0. This is the result. 0 in the target column represents 'No'. That means this employee will not get more than 100k as salary. He may get less than or equal to 100k salary. The total code is shown in Program 2. Please go through it.
Program
Program 2: Create a Python Program using a decision tree machine learning model to
analyse employee salary data of various companies and then predict the salary of a new
employee
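Program 2 itself is not reproduced in this extract. The following is a minimal sketch of the logic described above; the file name salaries.csv is an assumption:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# load the employee salary dataset
df = pd.read_csv('salaries.csv')

inputs = df[['company', 'job', 'degree']]        # input columns
target = df['salary_more_than_100k']             # target column

# convert each text column into numeric codes
le_company, le_job, le_degree = LabelEncoder(), LabelEncoder(), LabelEncoder()
inputs_n = inputs.copy()
inputs_n['company'] = le_company.fit_transform(inputs['company'])
inputs_n['job'] = le_job.fit_transform(inputs['job'])
inputs_n['degree'] = le_degree.fit_transform(inputs['degree'])

# create and train the decision tree
model = DecisionTreeClassifier()
model.fit(inputs_n, target)

# predict for the new employee encoded as (1, 1, 0) as discussed above
print(model.predict([[1, 1, 0]]))                # 0 means 'No', 1 means 'Yes'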
Please execute this program in the Spyder IDE line by line or block by block and observe how the results are displayed. Also, observe the data in the variables by clicking on the 'Variable explorer' tab in Spyder.
Points to Remember
Decision Tree model gives the final output after checking various conditions and following various paths.
Decision Tree will have a 'root node' at the top, giving rise to several 'nodes' in the next levels and finally 'leaf nodes'.
Entropy is the measurement of randomness of data. It represents the level of
impurities in the data.
Gini index is the direct measure of impurities in the data.
Root node of the Decision tree is selected based on either Entropy or Gini index.
The node with the highest information gain should be selected as the root node for the Decision tree.
The Entropy values will be in the range of 0 to 1, whereas the Gini values lie between
0 and 0.5.