Decision Tree in R Programming Language
R is a programming language for statistical computing and data
visualization. It has been widely adopted in the fields of data
mining, bioinformatics, and data analysis.
The core R language is augmented by a large number of extension
packages containing reusable code, documentation, and sample
data.
R is free, open-source software. It is part of the GNU Project and is
available under the GNU General Public License. It is written
primarily in C, Fortran, and R itself, and precompiled executables
are provided for various operating systems.
Working of a Decision Tree in R
Partitioning:
It refers to the process of splitting the data set into subsets.
The decision of where to make strategic splits greatly affects the
accuracy of the tree.
Many algorithms are used by the tree to split a node into sub-nodes,
which results in an overall increase in the purity of the node with
respect to the target variable.
Various splitting criteria, such as chi-square and the Gini index, are
evaluated for this purpose, and the split with the best score is chosen.
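As a concrete illustration, here is a minimal sketch of how a Gini-based
splitting criterion can be scored in R. The helpers gini and gini_split are
hypothetical functions written for this example, not part of any package.

# Gini impurity of a vector of class labels: 1 - sum(p_i^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini impurity of a candidate split into left/right subsets;
# a lower score means a purer (better) split
gini_split <- function(left, right) {
  n <- length(left) + length(right)
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

gini_split(c("yes", "yes"), c("no", "no"))  # 0: a perfectly pure split
gini_split(c("yes", "no"), c("yes", "no"))  # 0.5: a maximally mixed split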
Pruning:
This refers to the process in which branch nodes are turned into
leaf nodes, shortening the branches of the tree.
The idea behind it is that simpler trees avoid overfitting: a highly
complex classification tree may fit the training data well but do an
underwhelming job of classifying new values.
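As an illustration, the rpart package (a common decision-tree
implementation in R, different from the library used in the example later
in this article) exposes pruning through its complexity parameter; the
sketch below uses the built-in iris dataset purely for demonstration.

library(rpart)

# Fit a deliberately deep tree, then prune it back using the
# complexity-parameter (cp) table that rpart records while fitting
fit <- rpart(Species ~ ., data = iris,
             control = rpart.control(cp = 0.0001))

# Pick the cp value with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# prune() collapses branches, turning internal nodes into leaves
pruned <- prune(fit, cp = best_cp)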
Selection of the tree:
The main goal of this process is to select the smallest tree that fits
the data, for the reasons discussed in the pruning section.
Important factors to consider while selecting the tree in R
Entropy:
Mainly used to determine the uniformity of the given sample.
If the sample is completely homogeneous, the entropy is 0; if it is
equally partitioned between the classes, the entropy is 1.
The higher the entropy, the more difficult it becomes to draw
conclusions from that information.
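A minimal sketch of how this quantity can be computed in R; the entropy
helper below is written for this example rather than taken from a package.

# Shannon entropy of a vector of class labels, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

entropy(c("a", "a", "a", "a"))  # 0: completely homogeneous sample
entropy(c("a", "a", "b", "b"))  # 1: equally partitioned sample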
Information Gain:
A statistical property that measures how well training examples are
separated based on the target classification.
The main idea behind constructing a decision tree is to find an
attribute that returns the smallest entropy and the highest
information gain.
It is basically a measure of the decrease in total entropy: it is
calculated as the difference between the entropy before the split and
the weighted average entropy after the dataset is split on the given
attribute values.
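Using the entropy() helper defined above, information gain can be sketched
as follows; info_gain is again a hypothetical helper for illustration.

# Information gain of splitting `labels` by the values of `attribute`:
# entropy before the split minus the weighted average entropy after it
info_gain <- function(labels, attribute) {
  before <- entropy(labels)
  after <- sum(sapply(split(labels, attribute), function(subset) {
    (length(subset) / length(labels)) * entropy(subset)
  }))
  before - after
}

# An attribute that separates the classes perfectly recovers
# all of the entropy
info_gain(c("yes", "yes", "no", "no"), c("x", "x", "y", "y"))  # 1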
R – Decision Tree Example
Let us now examine this concept with the help of an example, in this
case the widely used "readingSkills" dataset, by visualizing a
decision tree for it and examining its accuracy.
Installing the required libraries
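The snippet below assumes the party package, which provides the ctree()
function and ships the readingSkills dataset used in this example.

install.packages("party")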
Import the required libraries, load the readingSkills dataset,
and execute head(readingSkills)
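A minimal sketch of this step, assuming the party package installed above:

library(party)

# Load the readingSkills dataset shipped with party and
# inspect its first few rows
data("readingSkills")
head(readingSkills)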
As you can clearly see, there are 4 columns: nativeSpeaker, age,
shoeSize, and score. We are going to predict whether a person is a
native speaker or not using the other criteria, and evaluate the
accuracy of the decision tree model developed in doing so.
Splitting the dataset in a 4:1 ratio for train and test data
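A sketch of the split and model fit, assuming an 80/20 random partition
(the seed value of 42 is an arbitrary choice for reproducibility):

set.seed(42)

# Hold out 20% of the rows for testing
train_idx  <- sample(nrow(readingSkills),
                     size = floor(0.8 * nrow(readingSkills)))
train_data <- readingSkills[train_idx, ]
test_data  <- readingSkills[-train_idx, ]

# Fit a conditional inference tree: ctree(formula, data)
model <- ctree(nativeSpeaker ~ age + shoeSize + score,
               data = train_data)

# Visualize the fitted decision tree
plot(model)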
Here, formula describes the response and predictor variables, and data
is the data set used.
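To examine the model's accuracy on the held-out data, a confusion matrix
and the overall accuracy can be computed, for example:

# Predict on the test set and compare against the true labels
predictions <- predict(model, newdata = test_data)

# Confusion matrix and overall accuracy
table(Predicted = predictions, Actual = test_data$nativeSpeaker)
mean(predictions == test_data$nativeSpeaker)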
Thus, decision trees are very useful algorithms, as they are not only
used to choose between alternatives based on expected values but are
also used for the classification of priorities and for making
predictions. It is up to us to determine the accuracy of such models
in the appropriate applications.
Advantages of Decision Trees