Decision Tree in R Programming Language

The document provides an overview of using Decision Trees in R for statistical computing and data visualization, highlighting its applications in data mining and machine learning. It explains the processes of partitioning and pruning in decision trees, as well as important factors like entropy and information gain for model selection. Additionally, it discusses an example using the 'readingSkills' dataset to demonstrate model creation, prediction, and accuracy evaluation.


Decision Tree in R Programming Language
 R is a programming language for statistical computing and data
visualization. It has been widely adopted in data mining,
bioinformatics, and data analysis.
 The core R language is augmented by a large number of extension
packages containing reusable code, documentation, and sample
data.
 R is free, open-source software, distributed as part of the GNU
Project under the GNU General Public License. It is written primarily
in C, Fortran, and R itself. Precompiled executables are provided for
various operating systems.
Why use R Programming?

 There are several tools on the market for performing data
analysis, and learning a new language takes time.
 Data scientists have two excellent tools at their disposal: R and Python.
 A data scientist's job is to understand the data, manipulate it, and
expose the best approach.
 For machine learning, the best algorithms can be implemented
in R.
 R communicates with other languages and can call into Python,
Java, and C++. The big-data world is also accessible to R.
 We can connect R to big-data frameworks such as Spark and Hadoop.
EXAMPLE

 Let us consider a scenario in which a medical company wants to
predict whether a person will die if exposed to a virus.
 The most important factor determining this outcome is the strength of
the person's immune system, but the company does not have this information.
 Since immune strength is such an important variable, a decision tree can be
constructed to predict it from factors such as the person's
sleep cycles, cortisol levels, supplement intake, and nutrients derived
from food, all of which are continuous variables.
Working of a Decision Tree in R

 Partitioning:
 It refers to the process of splitting the data set into subsets.
 The choice of strategic splits greatly affects the accuracy of
the tree.
 Many algorithms are used by the tree to split a node into sub-nodes,
which results in an overall increase in the purity of the resulting nodes
with respect to the target variable.
 Various criteria, such as the chi-square statistic and the Gini index,
are used for this purpose, and the one that produces the best split is chosen.
 Pruning:
 This refers to the process in which branch nodes are turned into
leaf nodes, shortening the branches of the tree.
 The idea is that simpler trees avoid overfitting: a highly
complex classification tree may fit the training data
well but do an underwhelming job of classifying new values.
 Selection of the tree:
 The main goal of this process is to select the smallest tree that fits
the data, for the reasons discussed in the pruning section.
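The Gini index mentioned above can be illustrated in a few lines of base R. This is a minimal sketch of the impurity measure itself, not the split-search routine of any particular tree package:

```r
# Gini impurity of a vector of class labels: 1 - sum over classes of p_k^2
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  1 - sum(p^2)
}

gini(c("yes", "yes", "yes", "yes"))  # pure node -> 0
gini(c("yes", "yes", "no", "no"))    # evenly mixed two-class node -> 0.5
```

A split is considered good when it produces child nodes whose (weighted) impurity is much lower than that of the parent.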
Important factors to consider while
selecting the tree in R

 Entropy:
 Mainly used to determine the uniformity of the given sample.
 If the sample is completely homogeneous (all one class), the entropy is 0;
if it is split evenly between two classes, the entropy is 1.
 The higher the entropy, the more difficult it becomes to draw conclusions
from that information.
 Information Gain:
 A statistical property that measures how well the training examples are
separated based on the target classification.
 The main idea behind constructing a decision tree is to find the
attribute whose split leaves the smallest entropy, i.e., yields the highest
information gain.
 It is basically a measure of the decrease in total entropy, and it is
calculated as the difference between the entropy
before the split and the weighted average entropy after the dataset is
split on the given attribute's values.
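The entropy and information-gain calculations described above can be sketched in base R as follows (toy data; the variable names are illustrative):

```r
# Shannon entropy (log base 2) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  -sum(p * log2(p))
}

# Information gain from splitting `labels` on a categorical attribute:
# entropy before the split minus the weighted average entropy after it
info_gain <- function(labels, attr) {
  weighted <- sum(sapply(split(labels, attr), function(s) {
    length(s) / length(labels) * entropy(s)
  }))
  entropy(labels) - weighted
}

y <- c("yes", "yes", "no", "no", "no", "yes")
x <- c("a",   "a",   "b",  "b",  "b",  "a")
entropy(y)       # 3 yes / 3 no -> 1
info_gain(y, x)  # x separates the classes perfectly -> 1
```

An attribute with information gain close to the parent node's entropy (as here) is an ideal candidate for the split.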
R – Decision Tree Example

 Let us now examine this concept with the help of an example, in
this case the widely used "readingSkills" dataset, by
visualizing a decision tree for it and examining its accuracy.
 Installing the required libraries
Import the required libraries, load the dataset
readingSkills, and execute head(readingSkills).
As you can see, there are four columns: nativeSpeaker, age, shoeSize,
and score. We are going to predict whether a person is
a native speaker using the other variables and evaluate the accuracy
of the decision tree model developed in doing so.
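A minimal sketch of these steps; the `readingSkills` dataset ships with the `party` package:

```r
# install.packages("party")  # one-time installation
library(party)               # provides ctree() and the readingSkills data

data("readingSkills")
head(readingSkills)  # columns: nativeSpeaker, age, shoeSize, score
str(readingSkills)   # nativeSpeaker is a factor; the rest are numeric
```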
Splitting the dataset in a 4:1 ratio into train
and test data

Separating data into training and testing sets is an important part of
evaluating data mining models, hence the dataset is separated into training and
testing sets. After a model has been fitted using the training set,
you test the model by making predictions against the test set. Because
the test set already contains known values for the
attribute that you want to predict, it is easy to determine whether the
model's guesses are correct.
Create the decision tree model using
ctree and plot the model
The basic syntax for creating a decision
tree in R is:

ctree(formula, data)

where formula describes the response and predictor variables, and data
is the data set used.

In this case, nativeSpeaker is the response variable and the
predictor variables are age, shoeSize, and score; when we plot the model, we
get the following output.
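Putting the syntax together for this dataset (a sketch, assuming `party` is loaded and `train` is the training set from the previous step):

```r
# Fit a conditional-inference tree predicting nativeSpeaker
# from the remaining variables, then visualize it
model <- ctree(nativeSpeaker ~ age + shoeSize + score, data = train)
plot(model)
```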
OUTPUT

From the tree, it is clear that those who have a score less
than or equal to 31.08 and whose age is less than or
equal to 6 are not native speakers, while those whose
score is greater than 31.086 under the same criteria
are found to be native speakers.
Making a prediction
OUTPUT

The model has correctly predicted 13 people to be non-native speakers but
classified an additional 13 as non-native, and by the same token it has
misclassified none of the people as native speakers when
they actually are not.
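The prediction and confusion matrix described above can be sketched as follows (assuming the fitted `model` and the `test` set from the earlier steps):

```r
# Predict on the held-out test set and tabulate predictions
# against the known labels to form the confusion matrix
pred <- predict(model, newdata = test)
table(Predicted = pred, Actual = test$nativeSpeaker)
```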
Determining the accuracy of the model
developed

Here the accuracy is calculated from the confusion matrix and is
found to be 0.74. Hence this model predicts with an
accuracy of 74%.
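The accuracy can be computed from the confusion matrix as shown below (assuming `pred` and `test` from the prediction step; the exact value depends on the random split):

```r
cm <- table(Predicted = pred, Actual = test$nativeSpeaker)
accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / total predictions
accuracy
```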
Inference

 Decision trees are thus very useful algorithms: they are used not only
to choose among alternatives based on expected values but also
to classify priorities and make predictions.
 It is up to us to determine whether such models are accurate enough
for the application at hand.
Advantages of Decision Trees

 Easy to understand and interpret
 Does not require data normalization
 Does not require scaling of the data
 The pre-processing stage requires less effort than for many other
major algorithms, which in a way simplifies the given problem
Disadvantages of Decision Trees

 Takes longer to train the model
 Has considerably higher complexity and takes more time to process
the data
 If the decrease in the splitting criterion at a node is very small,
tree growth terminates, which can cut the tree short
 Calculations can become very complex at times
