Decision Tree & Regression
[Figure: a binary tree with nodes numbered 1–13. Node 1 is the root node; nodes with children are non-terminal; nodes without children are terminal/leaf nodes.]
Decision Tree
[Figure: example decision tree. The root node holds (20000, 1000) observations and splits on "have a car?" into child nodes with (13000, 540) and (7000, 460). Further decision nodes split on "PG or higher?" and "Married?", ending in leaf nodes.]
Tree Construction
• Building the tree
– A tree where each node represents a FEATURE.
– Each link represents a branch or decision.
– Each leaf represents an outcome.
• Pruning – done with cross-validation.
• How to do it in R – see the rpart sketch after this list.
• "Binary" is used in two places –
– the target y, or
– the splitting rule – each split is a Boolean test question.
• A Decision Tree can have multiple classes, but the question asked at each split is only a Boolean test.
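A minimal sketch of growing and pruning a classification tree in R with rpart; the data frame train and the target CREDIT are assumptions carried over from the later slides.

library(rpart)

# Grow a classification tree; CREDIT is assumed to be a binary factor target
fit <- rpart(CREDIT ~ ., data = train, method = "class")

# Cross-validated error for each complexity parameter (cp)
printcp(fit)

# Prune back to the cp with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

plot(pruned); text(pruned, use.n = TRUE)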
Build the tree
• Choose the best split at every level – that is the greedy algorithm used to grow the tree.
• How do we know which split is best?
– The best split is the one that reduces the "impurity" of a node the most (a worked sketch follows this list).
• Methods to measure impurity, and therefore to pick the best split –
– Gini Index
– Entropy
– Misclassification error
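A small sketch of "reduction in impurity" for one candidate split, using Gini impurity 1 - Sum(p(i)^2); the class counts here are made up for illustration.

gini <- function(counts) 1 - sum((counts / sum(counts))^2)

parent <- c(yes = 10, no = 10)   # class counts before the split
left   <- c(yes = 8,  no = 2)    # class counts in the left child
right  <- c(yes = 2,  no = 8)    # class counts in the right child

n <- sum(parent)
gain <- gini(parent) -
        (sum(left)  / n) * gini(left) -
        (sum(right) / n) * gini(right)
gain   # the candidate split with the largest gain is the "best" split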
Gini
• Simply put – it is the probability that two randomly selected elements belong to the same class: Sum(p(i)^2) over all classes i. Gini impurity is 1 minus this score.
• Assumption: a higher score means a purer node.
• Example score: p(Age)^2 + p(Salary)^2 + p(Education)^2 – the Gini score.
• The entropy alternative sums -p(i)*log2(p(i)) over the same classes, e.g. -(p(Age)*log2(p(Age)) + p(Salary)*log2(p(Salary)) + p(Education)*log2(p(Education))).
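These formulas, written as small R helper functions over a vector of class proportions p (the function names are my own):

# p is a vector of class proportions that sums to 1, e.g. c(5/11, 6/11)
gini_score    <- function(p) sum(p^2)            # Sum(p(i)^2): prob. that two draws share a class
gini_impurity <- function(p) 1 - sum(p^2)        # 1 - Sum(p(i)^2)
entropy       <- function(p) -sum(p * log2(p))   # -Sum(p(i) * log2(p(i)))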
Gini calculation
[Figure: worked example on the numbered tree from earlier, for a node with class counts C1 = 5 and C2 = 6.]
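Using the class counts from the slide (C1 = 5, C2 = 6), the calculation works out as follows, with the helper functions sketched above:

counts <- c(C1 = 5, C2 = 6)
p      <- counts / sum(counts)   # proportions 5/11 and 6/11
gini_score(p)                    # 0.504
gini_impurity(p)                 # 1 - 0.504 = 0.496
entropy(p)                       # ~ 0.994 bits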
Some interesting ways to look at the important variables with a random forest.
Random Forest
• > getTree(rf, 15) – shows the full structure of tree 15 (a fitted sketch follows this list).
• Even though a random forest is a black box, we can still see which variables are important.
• > importance(rf) – shows the important variables.
• This gives the model some storytelling power.
• A random forest can be used to build up domain knowledge in an easy way.
• It also makes it easy to find the important variables out of a large number of candidates.
• Using this information you can go and build a logistic regression; if the error is not much different, give the explanation from the LR model.
• > varImpPlot(rf) – plots the importance of the variables.
• With a random forest we don't need to worry about "multicollinearity".
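A minimal sketch of that workflow; the formula CREDIT ~ ., the data frame train, and ntree = 500 are assumptions:

library(randomForest)

# Fit a random forest; CREDIT is assumed to be a binary factor target
rf <- randomForest(CREDIT ~ ., data = train, ntree = 500, importance = TRUE)

getTree(rf, 15, labelVar = TRUE)   # full structure of tree 15
importance(rf)                     # importance scores for every variable
varImpPlot(rf)                     # plot of variable importance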
BOOSTING
• It first runs a model on the dataset; that first model still has some error.
• pred(t) = pred(1) + pred(2) + ... + pred(t-1)
• As long as the error keeps reducing, new models are created sequentially and new predictions are made (a toy sketch follows this list).
• This method works only with numeric variables, not with categorical or factor variables,
• because it works with gradient descent, which requires derivatives.
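A toy sketch (not from the slides) of the additive idea: each round fits a small rpart tree to the current error and adds its prediction to the running total. The data are simulated purely for illustration.

library(rpart)

set.seed(1)
dat   <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.2)

pred <- rep(0, nrow(dat))   # pred(0): start from zero
eta  <- 0.3                 # learning rate
for (t in 1:20) {
  dat$res <- dat$y - pred                              # current error (residuals)
  stump   <- rpart(res ~ x, data = dat, maxdepth = 1)  # fit the next small model to the error
  pred    <- pred + eta * predict(stump, dat)          # add its prediction to the running total
}
mean((dat$y - pred)^2)      # training error shrinks as more rounds are added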
Boosting
• Convert all factor variables into dummy variables.
• The shortcut method is to run a logistic regression model and take its design matrix as the dummy-variable matrix –
– > glm(CREDIT ~ ., data = train, family = binomial) -> m
– > model.matrix(m) -> dv
– > View(dv)
– Run the xgboost model (the label must be numeric 0/1; convert CREDIT first if it is a factor) –
– > dim(dv)
– > xgboost(data = dv, label = train$CREDIT, nrounds = 3, max_depth = 2, objective = "binary:logistic") -> xg
– > xgboost(data = dv, label = train$CREDIT, nrounds = 20, max_depth = 2, objective = "binary:logistic") -> xg
– For the test dataset, repeat the same conversion before predicting (see the sketch below).
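A sketch of that test-dataset step; the objects test, m_test, and dv_test are assumptions:

# Build the same dummy-variable matrix for the test set
glm(CREDIT ~ ., data = test, family = binomial) -> m_test
model.matrix(m_test) -> dv_test

# Predicted probability of CREDIT = 1, then a 0/1 class at a 0.5 cutoff
predict(xg, dv_test) -> pred_prob
as.numeric(pred_prob > 0.5) -> pred_class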