Decision Tree
Example:
Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl),
Class (IX/X) and Height (5 to 6 ft).
15 of these 30 play cricket in their leisure time.
Now, I want to create a model to predict who will play cricket during leisure time.
In this problem, we need to segregate the students who play cricket in their leisure time
based on the most significant input variable among all three.
This is where a decision tree helps: it will segregate the students based on all values of
the three variables and identify the variable that creates the best homogeneous sets of
students (which are heterogeneous to each other).
In the snapshot below, you can see that the variable Gender is able to identify the best
homogeneous sets compared to the other two variables.
As mentioned above, a decision tree identifies the most significant variable and the
value of that variable which gives the best homogeneous sets of the population.
Now the question that arises is: how does it identify the variable and the split?
To do this, decision trees use various algorithms, which we shall discuss in the
following section.
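Before getting into those algorithms, the core idea of scoring a split can be sketched with Gini impurity, one common criterion. The sketch below uses the cricket example; the per-group counts are hypothetical, made up to match the scenario in the text (15 of 30 students play cricket).

```python
# A minimal sketch of how a candidate split is scored with Gini impurity.
# The group counts below are hypothetical illustrations of the 30-student
# example: a lower weighted impurity means a more homogeneous split.

def gini(positives, total):
    """Gini impurity of a node: 1 - p^2 - (1 - p)^2."""
    if total == 0:
        return 0.0
    p = positives / total
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def weighted_gini(groups):
    """Weighted impurity after a split; groups = [(positives, total), ...]."""
    n = sum(total for _, total in groups)
    return sum(total / n * gini(pos, total) for pos, total in groups)

# Hypothetical counts: (cricket players, group size) for each candidate split.
split_on_gender = [(2, 10), (13, 20)]   # e.g. Girls vs Boys
split_on_class  = [(6, 14), (9, 16)]    # Class IX vs Class X

print(weighted_gini(split_on_gender))   # the lower value wins the split
print(weighted_gini(split_on_class))
```

Here the Gender split produces the lower weighted impurity, which matches the intuition in the text that Gender creates the most homogeneous sets.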
1. Categorical Variable Decision Tree: A decision tree that has a categorical target
variable is called a categorical variable decision tree.
Example:- In the student problem above, the target variable was “Student will
play cricket or not”, i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree that has a continuous target
variable is called a continuous variable decision tree.
Example:-
Let's say we have a problem of predicting whether a customer will pay his renewal
premium with an insurance company (yes/no).
Here we know that the income of the customer is a significant variable, but the
insurance company does not have income details for all customers.
Since we know this is an important variable, we can build a decision tree to
predict customer income based on occupation, product and various other variables.
In this case, we are predicting the values of a continuous variable.
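The income scenario can be sketched with a regression tree, assuming scikit-learn. All feature encodings and income figures below are made up for illustration; a real model would use the company's actual occupation, product and other variables.

```python
# A hedged sketch of a continuous variable decision tree: predicting
# customer income with a regression tree. All data below is hypothetical.
from sklearn.tree import DecisionTreeRegressor

# Made-up encoded features: [occupation_code, product_code, age]
X = [[0, 1, 25], [1, 0, 40], [2, 1, 35], [0, 0, 52], [1, 1, 29], [2, 0, 47]]
y = [30000, 55000, 48000, 62000, 38000, 58000]  # known incomes (illustrative)

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)

# Predict income for a customer whose income is unknown.
print(reg.predict([[1, 1, 33]]))
```

Because a regression tree predicts the mean of the training targets in each leaf, its predictions always fall within the range of the observed incomes.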
1. Root Node: It represents the entire population or sample, which further gets divided
into two or more homogeneous sets.
2. Splitting: It is the process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision
node.
4. Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called pruning. You
can think of it as the opposite of splitting.
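Pruning can be sketched concretely, assuming scikit-learn, via cost-complexity pruning: the `ccp_alpha` parameter removes sub-trees whose splits contribute little impurity reduction. The iris dataset and the alpha value below are arbitrary choices for illustration.

```python
# A minimal sketch of pruning, assuming scikit-learn: cost-complexity
# pruning (ccp_alpha > 0) collapses sub-trees that add little value.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# The pruned tree has fewer nodes: sub-nodes of some decision nodes
# were removed, which is exactly the pruning operation described above.
print(full.tree_.node_count, pruned.tree_.node_count)
```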
These are the terms commonly used for decision trees. As every algorithm has
advantages and disadvantages, below are the important factors one should know.
Advantages:
1. Easy to Understand: Decision tree output is very easy to understand, even for people
from a non-analytical background. It does not require any statistical knowledge to read
and interpret. Its graphical representation is very intuitive, and users can easily relate
it to their hypotheses.
2. Useful in Data exploration: A decision tree is one of the fastest ways to identify the
most significant variables and the relations between two or more variables. With the help
of decision trees, we can create new variables/features that have better power to predict
the target variable. You can refer to the article (Trick to enhance power of regression
model) for one such trick. It can also be used in the data exploration stage. For example,
if we are working on a problem where information is available in hundreds of variables, a
decision tree will help to identify the most significant ones.
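This exploratory use can be sketched, assuming scikit-learn, with the fitted tree's `feature_importances_` attribute, which ranks variables by how much impurity reduction they contributed; the iris dataset stands in for a real problem with many variables.

```python
# A sketch of data exploration with a tree, assuming scikit-learn:
# feature_importances_ ranks variables by total impurity reduction.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Higher values mark the more significant variables.
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```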
3. Less data cleaning required: It requires less data cleaning compared to some other
modeling techniques, being fairly insensitive to outliers and missing values.
4. Data type is not a constraint: It can handle both numerical and categorical variables.
Disadvantages:
1. Over-fitting: Over-fitting is one of the most practical difficulties for decision tree
models. This problem can be addressed by setting constraints on model parameters and by
pruning.
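Setting such constraints can be sketched, assuming scikit-learn, with parameters like `max_depth` and `min_samples_leaf`; the dataset and specific values below are arbitrary, chosen only to contrast an unconstrained tree with a constrained one.

```python
# A hedged sketch of constraining tree growth to reduce over-fitting,
# assuming scikit-learn. Parameter values here are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
constrained = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_tr, y_tr)

# The unconstrained tree memorizes the training data (score 1.0);
# the constrained tree is simpler and typically generalizes better.
print(unconstrained.score(X_tr, y_tr), unconstrained.score(X_te, y_te))
print(constrained.score(X_tr, y_tr), constrained.score(X_te, y_te))
```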
2. Not fit for continuous variables: While working with continuous numerical variables,
a decision tree loses information when it categorizes the variables into different bins.