Business Analytics: Data Classification
Chapter - 1
Business Analytics
Data Classification
Caselet
The HR department of a company wants to gauge which employees will stay and which will leave the
company in the next year.
They have collected a sample of employee data and found the following:
[Decision tree diagram: the sample is split at the root on 'Promotion in last 5 years' (YES / NO), with further splits on variables such as number of projects completed.]
From the tree, an employee who got promoted and completed more than 5 projects is predicted to stay
with the company, and the remaining leaves can be read off in the same way. The HR department can simply
apply this set of rules to classify the remaining employees (not part of this sample)
and predict whether they will stay with the company or leave in the next year.
This graphical representation, which classifies the data, is called a Decision
Tree.
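To make the rule-application step concrete, here is a minimal sketch in Python (the function name and the "Leave" default for the branches not spelled out above are assumptions for illustration):

```python
def predict_attrition(promoted_last_5_years: bool, projects_completed: int) -> str:
    """Classify one employee using the rules read off the decision tree above."""
    if promoted_last_5_years and projects_completed > 5:
        return "Stay"    # rule stated in the caselet
    return "Leave"       # assumed default for the branches not detailed above

print(predict_attrition(True, 7))    # Stay
print(predict_attrition(False, 3))   # Leave
```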
We could also use Logistic Regression with the same predictor variables to predict which employees will stay or leave.
• It works for both categorical and continuous input and output variables.
• In this technique, we split the population or sample into two or more homogeneous sets (sub-populations) based on the most significant differentiator among the input variables.
Useful in data exploration: a decision tree is one of the fastest ways to identify the most
significant variables and the relationships between two or more variables. With the help of
decision trees, we can create new variables/features that have better power to
predict the target variable.
Less data cleaning required: it requires less data cleaning than some other
modeling techniques, and it is fairly robust to outliers and missing values.
Data type is not a constraint: It can handle both numerical and categorical variables.
The type of decision tree is based on the type of target variable we have. Decision trees can be
of two types:
Categorical Variable Decision Tree: a decision tree with a categorical target
variable is called a categorical variable decision tree. Example: in the HR
problem above, the target variable was "Employee will stay or leave the company".
It is also called a Classification Tree.
Continuous Variable Decision Tree: a decision tree with a continuous target variable
is called a continuous variable decision tree.
It is also called a Regression Tree.
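For illustration only, scikit-learn exposes the two types as separate estimators (the toy data below is made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1, 7], [0, 3], [1, 2], [0, 6]])           # toy [promoted, projects completed] values
y_class = np.array(["Stay", "Leave", "Leave", "Leave"])   # categorical target -> classification tree
y_cont = np.array([36.0, 6.0, 10.0, 12.0])                # continuous target (e.g. months of tenure) -> regression tree

DecisionTreeClassifier().fit(X, y_class)
DecisionTreeRegressor().fit(X, y_cont)
```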
Primary differences and similarities between classification and regression trees:
• Both trees follow a top-down greedy approach known as recursive binary splitting. It is called
'top-down' and 'greedy' because it begins at the top of the tree, when all observations are
in a single region, and successively splits the predictor space into two new branches
down the tree, considering only the current split and not future splits that might lead to a
better tree. This splitting process continues until a user-defined stopping criterion is reached.
A minimal sketch of the procedure is given below.
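The sketch assumes 0/1 labels and uses Gini impurity as the split criterion (an assumption; real implementations add many refinements):

```python
import numpy as np

def gini_impurity(y):
    """Gini impurity of a node holding 0/1 labels."""
    p = y.mean()
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def grow(X, y, depth=0, max_depth=3, min_samples=10):
    """Greedy top-down recursive binary splitting with a user-defined stopping criterion."""
    if depth >= max_depth or len(y) < min_samples or gini_impurity(y) == 0.0:
        return {"leaf": True, "prediction": int(round(y.mean()))}
    best = None
    for j in range(X.shape[1]):                    # try every feature ...
        for t in np.unique(X[:, j])[:-1]:          # ... and every candidate threshold
            left = X[:, j] <= t
            # weighted impurity of the two children: only the current split is
            # considered (greedy), never the splits that might follow it
            score = left.mean() * gini_impurity(y[left]) + (1 - left.mean()) * gini_impurity(y[~left])
            if best is None or score < best[0]:
                best = (score, j, t, left)
    if best is None:                               # no valid split remains
        return {"leaf": True, "prediction": int(round(y.mean()))}
    _, j, t, left = best
    return {"feature": j, "threshold": t,
            "left": grow(X[left], y[left], depth + 1, max_depth, min_samples),
            "right": grow(X[~left], y[~left], depth + 1, max_depth, min_samples)}
```

Calling grow(X, y) returns a nested dictionary representing the fitted tree.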
Example
The HR department of a company has picked a sample of 300 employees with three variables:
• Promotion in last 5 years (Yes / No),
• Number of projects completed (<= 5 / > 5), and
• Satisfaction level (<= 0.5 / > 0.5).
This is where a decision tree helps: it segregates the employees based on all
values of the three variables and identifies the variable that creates the most
homogeneous sets of employees (sets that are heterogeneous to each other).
The decision tree identifies the most significant variable, and the value of that variable, that gives the most
homogeneous sets of the population. The question that arises is: how does it
identify the variable and the split? To do this, decision trees use various algorithms,
which are discussed in the following section.
Split statistics for the three candidate variables (root node: 300 employees, 60 left):
Split on ‘Promotion in last 5 years’: YES: 100 employees, 8 left (0.08); NO: 200 employees, 52 left (0.26)
Split on ‘Projects completed’: <= 5: 140 employees, 24 left (0.17); > 5: 160 employees, 36 left (0.23)
Split on ‘Satisfaction Level’: <= 0.5: 120 employees, 20 left (0.17); > 0.5: 180 employees, 40 left (0.22)
The decision criterion for attribute selection differs between classification and regression trees.
Gini Index
The Gini score of a node is the sum of the squared class proportions, so a higher weighted Gini
score for a split means more homogeneous (purer) sub-nodes. We can see that the Gini score for
‘Promotion in last 5 years’ is higher than for the remaining variables, so the node split will
take place on ‘Promotion in last 5 years’.
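A small sketch of this comparison, using the node counts from the example (the 'Satisfaction Level' counts are the ones used in the entropy calculation later in this section):

```python
def node_gini(leave, total):
    """Gini score of a node: sum of squared class proportions (higher = purer)."""
    p = leave / total
    return p ** 2 + (1 - p) ** 2

def split_gini(children):
    """Weighted Gini score of a split; children = [(left_count, node_size), ...]."""
    n = sum(size for _, size in children)
    return sum(size / n * node_gini(left, size) for left, size in children)

splits = {
    "Promotion in last 5 years": [(8, 100), (52, 200)],
    "Projects completed":        [(24, 140), (36, 160)],
    "Satisfaction Level":        [(20, 120), (40, 180)],
}
for name, children in splits.items():
    print(f"{name}: {split_gini(children):.4f}")
# Promotion in last 5 years: 0.6944  <- highest Gini score, so this split is chosen
# Projects completed:        0.6814
# Satisfaction Level:        0.6815
```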
Chi Square
Chi-square is an algorithm that tests the statistical significance of the differences
between the sub-nodes and the parent node. It is measured as the sum of squares
of the standardized differences between the observed and expected frequencies of the target
variable.
As observed, chi-square also identifies the ‘Promotion in last 5 years’ split as the most
significant, since its chi-square value is higher than that of the other variables.
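A sketch of the same calculation with chi-square, where each child's expected counts assume the parent's leave rate of 60/300 = 0.2:

```python
def chi_square(children, parent_rate=60 / 300):
    """Chi-square of a split; children = [(observed_left, node_size), ...]."""
    chi2 = 0.0
    for observed_left, size in children:
        expected_left = parent_rate * size
        expected_stay = (1 - parent_rate) * size
        chi2 += (observed_left - expected_left) ** 2 / expected_left
        chi2 += ((size - observed_left) - expected_stay) ** 2 / expected_stay
    return chi2

splits = {
    "Promotion in last 5 years": [(8, 100), (52, 200)],
    "Projects completed":        [(24, 140), (36, 160)],
    "Satisfaction Level":        [(20, 120), (40, 180)],
}
for name, children in splits.items():
    print(f"{name}: {chi_square(children):.2f}")
# Promotion in last 5 years: 13.50  <- highest, so this is the most significant split
# Projects completed:        1.34
# Satisfaction Level:        1.39
```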
Information and Entropy
Information Gain
By using information gain as a criterion, we try to estimate the information
contained in each attribute. Consider the figure below.
[Figure: two nodes, A (an equal mixture of blue and red stars) and B (mostly blue stars).]
These are two nodes, A and B. Node B is easy to describe because it requires less
information: it contains mostly blue stars and can be classified as the ‘Blue Star’
category, whereas node A is an equal mixture of blue and red. Node B is
called a pure/homogeneous node.
Constructing a decision tree is all about finding the attribute that returns the highest
information gain (i.e., the most homogeneous branches).
Entropy
Entropy measures the randomness or uncertainty of a random variable X. If the
sample is completely homogeneous, the entropy is zero; if the sample is equally divided
between the two classes, the entropy is one.
Entropy is calculated as: Entropy = -p log2 p - q log2 q
Split on ‘Promotion in last 5 years’:
Entropy for ‘YES’ node = -(8/100) log2(8/100) - (92/100) log2(92/100) = 0.4022
Entropy for ‘NO’ node = -(52/200) log2(52/200) - (148/200) log2(148/200) = 0.8267
Split on ‘Projects completed’:
Entropy for ‘<= 5’ node = -(24/140) log2(24/140) - (116/140) log2(116/140) = 0.6610
Entropy for ‘> 5’ node = -(36/160) log2(36/160) - (124/160) log2(124/160) = 0.7692
Split on ‘Satisfaction Level’:
Entropy for ‘<= 0.5’ node = -(20/120) log2(20/120) - (100/120) log2(100/120) = 0.6500
Entropy for ‘> 0.5’ node = -(40/180) log2(40/180) - (140/180) log2(140/180) = 0.7642
Weighting the child entropies by node size, the split on ‘Promotion in last 5 years’ has the
lowest entropy of the three (and hence the highest information gain), so the tree will split on
‘Promotion in last 5 years’.
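The same comparison in code, weighting each child's entropy by its share of the 300 employees and reporting the information gain relative to the parent node:

```python
import math

def entropy(p):
    """Binary entropy of a node whose 'left the company' proportion is p."""
    q = 1 - p
    return -(p * math.log2(p) + q * math.log2(q))

splits = {
    "Promotion in last 5 years": [(8, 100), (52, 200)],
    "Projects completed":        [(24, 140), (36, 160)],
    "Satisfaction Level":        [(20, 120), (40, 180)],
}
parent = entropy(60 / 300)     # entropy of the root node, ~0.7219
for name, children in splits.items():
    n = sum(size for _, size in children)
    weighted = sum(size / n * entropy(left / size) for left, size in children)
    print(f"{name}: weighted entropy {weighted:.4f}, gain {parent - weighted:.4f}")
# Promotion in last 5 years: weighted entropy 0.6852, gain 0.0367  <- best split
# Projects completed:        weighted entropy 0.7187, gain 0.0032
# Satisfaction Level:        weighted entropy 0.7185, gain 0.0034
```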
Reduction in Variance
This algorithm uses the standard formula of variance to choose the best split: the
split with the lower weighted variance of the sub-nodes is selected.
Using the node proportions above, the ‘Promotion in last 5 years’ split has lower variance
than the other splits, so the split would first take place on the ‘Promotion in last 5 years’ variable.
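A sketch of the variance criterion on the same splits: coding Leave = 1 and Stay = 0, each node's variance is p(1 - p), weighted by node size:

```python
splits = {
    "Promotion in last 5 years": [(8, 100), (52, 200)],
    "Projects completed":        [(24, 140), (36, 160)],
    "Satisfaction Level":        [(20, 120), (40, 180)],
}
for name, children in splits.items():
    n = sum(size for _, size in children)
    weighted_var = sum(size / n * (left / size) * (1 - left / size) for left, size in children)
    print(f"{name}: {weighted_var:.4f}")
# Promotion in last 5 years: 0.1528  <- lowest weighted variance, so this split is chosen
# Projects completed:        0.1593
# Satisfaction Level:        0.1593
```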
Decision Tree - Overfitting
A tree grown without any limits classifies the training data perfectly; such a tree is over-fit.
Over-fitting is a significant practical difficulty for decision tree models and many
other predictive models. If no limit is set on the size of a decision tree, it will give
100% accuracy on the training set because, in the worst case, it ends up making one
leaf for each observation.
Two common ways of constraining tree size:
• A minimum number of samples for a node split: a min_split of 70 would not allow any node with fewer than 70 samples to split further.
• Deciding the maximum number of terminal nodes.
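As an illustration (the data here is synthetic), these two constraints correspond roughly to the min_samples_split and max_leaf_nodes parameters in scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 3))                     # 300 employees, 3 predictor variables
y = (rng.random(300) < 0.2).astype(int)      # roughly 20% leave, as in the example

constrained_tree = DecisionTreeClassifier(
    min_samples_split=70,   # no node with fewer than 70 samples is split further
    max_leaf_nodes=8,       # cap on the number of terminal nodes
).fit(X, y)
print(constrained_tree.get_n_leaves())
```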
Pruning
Post-pruning is the method that first allows the tree to grow and perfectly classify the training
set, and then prunes the tree back.
Here, we can see three counters and three queues, all with the same service time.
In queues I and III there are 2 and 3 people standing respectively, while in queue II there is
only 1 person standing.
But by enquiring, you figure out that counter II will close by the time it is your turn.
Hence, in this case you would optimally choose to join queue I.
This is exactly the difference between a normal decision tree and pruning.
A decision tree with constraints won't look ahead to the closing time; it adopts a
greedy approach and picks counter II. With pruning, on the other hand, we in
effect look a few steps ahead and make the better choice.
So we know pruning is better. But how do we implement it in a decision tree? The idea is
simple.
• We first grow the decision tree to a large depth.
• Then we start at the bottom and remove the splits that give us negative returns when
compared with the gains above them.
• Suppose a split gives us a gain of -20 (a loss of 20) and the next split on it gives us a
gain of 30. A simple decision tree would stop at step 1, but with pruning we see that the
overall gain is +10 and keep both leaves.
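One concrete way to post-prune, sketched with scikit-learn's cost-complexity pruning (the synthetic data and the choice of pruning strength on a held-out set are illustrative assumptions, not the only approach):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = (X[:, 0] + 0.3 * rng.standard_normal(300) > 0.7).astype(int)   # synthetic target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# grow a deep tree first, then consider progressively heavier pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
pruned = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr) for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_te, y_te),    # keep the pruned tree that generalises best
)
print(pruned.get_n_leaves(), pruned.score(X_te, y_te))
```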
Suppose you want to buy an XYZ car. You aren't sure about its performance, though.
So you look for advice on whether it meets all your expectations, and you decide to
approach various experts with diverse domain experience:
1. Salesperson of company XYZ: this person knows the internal functionality of the
car and has insider information about its functionality and performance.
But he lacks a broader perspective on how competitors are innovating, how
the technology is evolving, and what the impact of this evolution will be on the XYZ car. In
the past, he has been right 80% of the time.
2. Online auto review: it has a broader perspective on how the performance of the
car will fare in the competitive environment. However, it lacks a view of how the
company's internal functioning is faring. In the past, it has been right 85% of the time.
3. A friend who owns the same car: this person has observed the car's performance
over the past 8 months. He is also knowledgeable in this domain, owing to his love for
cars, and has developed a strong intuition about how the car might perform over
time. In the past, he has been right 80% of the time.
Given the broad spectrum of access we have, we can probably combine all this
information and make an informed decision.
A single fully grown decision tree has low bias but high variance.
• Thus, if we set the maximum depth to a lower number, we get a simpler model with
less variance, but ultimately it will not be a strong predictive model (its bias increases).
• Ideally, we would like to minimize both the error due to bias and the error due to
variance.
A champion model should maintain a balance between these two types of errors.
This is known as the bias-variance trade-off, and ensemble
learning is one way to manage it.
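As a hedged illustration of the ensemble idea (a random forest is one common ensemble of trees; the data below is synthetic), averaging many trees typically reduces variance compared with a single deep tree:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = (X[:, 0] + 0.3 * rng.standard_normal(300) > 0.7).astype(int)   # synthetic target

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("forest      :", cross_val_score(forest, X, y, cv=5).mean())
```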
Thank You