Decision Tree & Regression

A decision tree is a predictive model that uses a branching series of Boolean tests to classify or predict outcomes. It works by splitting the data into smaller groups based on attribute values, with each split aimed at increasing the homogeneity of the resulting groups. Decision trees allow for intuitive visualization and interpretation of classification or regression rules. Random forests build multiple decision trees and merge their results to improve predictive accuracy over a single tree, helping to avoid overfitting. They have become very popular due to their accuracy and ability to handle large datasets with many attributes.


What is a decision tree

• An inductive learning task
  – Use particular facts to make more generalized conclusions
• A predictive model based on a branching series of Boolean tests
  – These smaller Boolean tests are less complex than a one-stage classifier
• Let’s look at a sample decision tree…
• Statistical model objective – Predictive (industry), Explanatory (academic)
Sample
[Figure: a sample decision tree with nodes numbered 1–13. Node 1 is the root node, nodes without children are terminal/leaf nodes, and the remaining nodes are non-terminal.]
Decision Tree

A decision tree is a tree where each node represents a FEATURE (attribute), each branch (link) represents a decision (RULE), and each LEAF represents an outcome.
Algorithm to Find Root Node
• Algorithms –
  – CART
    - Gini Index – used in rpart, randomForest, scikit-learn
  – ID3
    - Entropy function
    - Information Gain
    - Misclassification
    - Chi-square
Entropy
[Slides: the entropy formula and a worked entropy calculation were shown as images; the formula and an example appear below.]
Example: Cross-sell respondents
[Figure: a decision tree for cross-sell respondents. The root node holds all records (20000, 1000). The first split is "Have a car?", giving decision nodes (13000, 540) and (7000, 460). These are split further on "PG or higher?" and "Married?", giving the leaf nodes (3000, 500), (10000, 40), (2000, 400) and (5000, 60).]
Tree Construction
• Building the tree
  – A tree where each node represents a FEATURE.
  – Each link represents a branch or decision.
  – Each leaf represents an outcome.
• Pruning – via cross validation
• How to do it in R – see the sketch below
• Binary appears in two places –
  – the response y, or
  – the splitting rule – a Boolean test question
• A decision tree can have multiple classes, but each split is still asked as a Boolean test.
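A minimal sketch of growing a classification tree in R with the rpart package, reusing the d data frame and CREDIT response that appear later in these notes (both assumed to be already loaded):

> library(rpart)
> # grow the tree; xval = 10 gives 10-fold cross validation, cp controls complexity
> rpart(as.factor(CREDIT) ~ ., data = d, method = "class",
+       control = rpart.control(cp = 0.001, xval = 10)) -> fit
> printcp(fit)          # cp table with the cross-validated error (xerror)
> plot(fit); text(fit)  # quick look at the grown tree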
Build the tree
• Choose the best split at every level – that is what the algorithm does at each node
• How do we know what is best?
  – The best split is the one that reduces the “impurity” of a node the most
• Methods to measure impurity (and therefore the best split)
  – Gini Index
  – Entropy
  – Misclassification
Gini
• Simply put – it is the probability that two randomly selected elements belong to the same class: Sum(p(i)^2) over all classes i.
• Assumption: a higher value means a purer node.
• Gini score example: p(age)^2 + p(salary)^2 + p(education)^2
• The corresponding entropy would be -[p(age) log2 p(age) + p(salary) log2 p(salary) + p(education) log2 p(education)]
Gini calculation
Class counts: C1 = 5, C2 = 6

Gini = (5/11)^2 + (6/11)^2 = 0.504

Entropy
• Information Gain measures the reduction in entropy
• Entropy is
  – -Sum[p(i) * log2 p(i)]
    where p(i) is the probability of being in class i
• Information Gain
  = Entropy(parent node) – weighted average Entropy(child nodes)
Entropy calculation
Class counts: C1 = 5, C2 = 6

Entropy = - 5/11 log2 (5/11) – 6/11 log2 (6/11) = 0.99

Misclassification

Proportion wrongly classified = 5/11
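A small R sketch that reproduces the three impurity numbers above for the class counts (C1 = 5, C2 = 6):

> counts <- c(C1 = 5, C2 = 6)
> p <- counts / sum(counts)
> sum(p^2)           # Gini (probability both random picks are the same class) = 0.504
> -sum(p * log2(p))  # Entropy = 0.99
> 1 - max(p)         # Misclassification = 5/11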


Pruning a tree
• Definition: Pruning a branch Tt from a tree T consists of deleting all descendants of node t, i.e. all of Tt except its root node t. T – Tt is the pruned tree.
[Figure: three diagrams – the full tree T (nodes 1–13), the branch T2 rooted at node 2 (nodes 2, 4, 5, 8, 9, 12, 13), and the pruned tree T – T2 (nodes 1, 2, 3, 6, 7, 10, 11), in which node 2 becomes a leaf.]


Why you need to prune
• Overfitting
• The error rate on test data shows a different pattern from the error rate on training data
• The largest tree is not the best
Pruning in practice
• 10 fold cross validation - The data is divided into 10 subsets
of equal size (at random) and then the tree is grown leaving
out one of the subsets and the performance assessed on the
subset left out from growing the tree. This is done for each
of the 10 sets. The average performance is then assessed.
• We can then use this information to decide how complex the tree needs to be (determined by the size of cp). The possible rules are to minimise the cross-validation relative error (xerror), or to use the “1-SE rule”, which takes the largest value of cp whose xerror is within one standard deviation of the minimum. Both rules are sketched below.
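A sketch of the two rules using the cp table of an rpart fit (continuing the hypothetical fit object from the earlier sketch):

> cp_tab <- fit$cptable
> # Rule 1: cp that minimises the cross-validated relative error (xerror)
> best <- which.min(cp_tab[, "xerror"])
> cp_min <- cp_tab[best, "CP"]
> # Rule 2: 1-SE rule – largest cp whose xerror is within one SE of the minimum
> threshold <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
> cp_1se <- max(cp_tab[cp_tab[, "xerror"] <= threshold, "CP"])
> prune(fit, cp = cp_1se) -> pruned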
Optimization
• Optimize the tree branches (decision nodes).
• Optimize the depth of the tree to control complexity and over-fitting.
Regression Tree
• Y is a continuous variable
• Y = f(x1, x2, x3, …, xk)
• At each split we choose the variable for which the reduction in SS (sum of squared errors) is highest.

• Cook’s distance is a measure of how much influence each point has on the fitted line.
• Thumb rule for the cut-off:
  Cook’s distance > 1 / (n(rows) – k(features)), e.g. 1 / (40 – 1) = 1/39
• Cook’s distance is the change in the estimates with and without the influential rows.
• A higher value means the point has a significant impact.
• Fine tuning – Cook’s distance is fundamental for deciding which points to remove (see the sketch below).
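A quick R sketch of checking Cook's distance on a fitted regression (the model m_lm and the continuous response SALES are hypothetical; only the cut-off rule comes from the slide):

> lm(SALES ~ ., data = d) -> m_lm          # hypothetical continuous response
> cooks.distance(m_lm) -> cd
> cutoff <- 1 / (nrow(d) - (ncol(d) - 1))  # thumb rule: 1 / (rows - features)
> which(cd > cutoff)                       # candidate influential rows to review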
Ensemble
• A class of techniques called Ensemble (collection)
• It is extremely popular
• Because of its accuracy
Bagging
• Random Forest is based on randomizing two things –
  – Samples (rows) – it draws samples with replacement from the data.
  – Example – suppose I create a forest of 200 trees:
  – Step 1 – create 200 samples of the data by drawing rows with replacement, which means some rows are repeated and some are left out.
  – The left-out observations are called the OUT OF BAG (OOB) sample. This is the idea of bagging.
  – Step 2 – do not use all columns for every tree.
  – The thumb rule is to pick 1/3 or sqrt(number of columns) of the columns at random.
  – So out of 21 columns, 7 columns are chosen at random for the 1st tree.
  – This sampling technique is called the BOOTSTRAP.
  – So for 200 trees we create 200 bootstrap samples.
  – Basically two things: randomize the rows and randomize the columns via bootstrap sampling (a sketch of one draw follows this slide).
  – Step 3 – build the trees.
• We do not worry about overfitting of the individual trees, so we do not tune cp – the errors of deep trees tend to cancel out when the trees are aggregated.
• Once the forest is grown, a new observation is passed through every tree and each tree votes for a class.
• The class with the highest number of votes across all trees is the final decision.
• For a regression tree, the average of the y-hat values from all trees is the final result.
• The basic idea: some trees will be right and some will be wrong; when we aggregate over a large number of trees, particularly if they are diverse, they will more often be right than wrong.
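A tiny R sketch of one bootstrap draw and its out-of-bag rows, assuming the d data frame used elsewhere in these notes:

> n <- nrow(d)
> sample(n, size = n, replace = TRUE) -> in_bag   # rows for one tree; some repeat
> setdiff(seq_len(n), in_bag) -> oob              # left-out rows = out-of-bag sample
> floor(sqrt(ncol(d) - 1)) -> mtry                # columns tried per split (classification thumb rule)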
Class of Technique - Ensemble
• Example – if I want to predict the Karnataka election outcome, I will not rely on one person; I will ask many groups of people, i.e. rely on many independent opinions.
• That is the whole idea (a collection of predictions).
• So there is a class of techniques called Ensemble techniques.
• They are used not so much for their explanatory power but because of their accuracy.
• Two broad types of these techniques
  – Bagging techniques
  – Boosting techniques – the rocket science in ensembles is this
• An ensemble technique is either a bagging or a boosting technique.
• The most popular bagging technique is Random Forest.
• Boosting – gradient boosting, AdaBoost.
• These are basically collections of predictors used together to come up with an estimate.
• The popular base learner underlying both techniques is a TREE.
• The foundation of both is the decision tree.
• So if I say “talk to 100 people”, I mean “grow 100 trees”. That is what it means.
How does Random forest work?
Random Forest is built by aggregating trees
Can be used for classification and regression
Avoids overfitting
Can deal with a large number of features
Helps with feature selection based on importance
User friendly: only 2 free parameters
- Number of trees – ntree, default 500
- Number of variables randomly sampled as candidates at each split – mtry
  - default is sqrt(p) for classification and p/3 for regression
  - mtry drives the column selection, so that every tree is a randomized tree rather than a similar one.
> library(randomForest)
> randomForest(as.factor(CREDIT) ~ ., data = d, ntree = 200, mtry = 5) -> rf
> rf
> names(rf)
- OOB – this sample is not used for building a given tree, so every tree has an OOB sample left out; each OOB row is passed to its tree for testing.
  Some of them will be misclassified.
  The misclassifications across all OOB samples give the OOB error.
RF Steps
• 1. Draw ntree bootstrap samples.
• The OOB samples are used to estimate how the Random Forest model will perform on a new data set.
• See the error rate: > rf$err.rate
• This shows how the error rate falls as the number of trees increases.
• Plot the OOB error rate:
• > plot(rf) – from this plot we can check the optimum number of trees, i.e. the point after which adding trees no longer reduces the error significantly.
> randomForest(as.factor(CREDIT) ~ ., data = d, ntree = 170, mtry = 5) -> rf
> rf
Predict from this forest:
> predict(rf, test) -> pred
> table(test$CREDIT, pred)

Some interesting ways to look at the important variables found by random forest.
Random Forest
• > getTree(rf, 15) – returns the full structure of tree 15.
• Even though it is a black box, we can see which variables are important.
• > importance(rf) – shows the important variables.
• It gives some storytelling power.
• Random forest can be used to build a knowledge base in an easy way.
• It also makes it easy to find the important variables out of a large number.
• Using this information we can build a logistic regression, and if the error is not much different, give the explanation from the LR model.
• > varImpPlot(rf) – plots the importance of the variables.
• With random forest we do not need to worry about “multicollinearity”.
BOOSTING
• It starts by fitting a model on the data set, which leaves some error; subsequent models are fitted to reduce that error.
• pred(t) = pred(1) + pred(2) + … + pred(t-1)
• As long as the error keeps reducing, new models are created sequentially and added to the prediction.
• This method works only with numeric variables, not with categorical or factor variables,
• because it works with gradient descent, which requires derivatives.
Boosting
• Convert all factor variables into dummy variables.
• A shortcut method is to run a logistic regression model
• and convert the result into a model matrix of dummy variables.
  – > glm(CREDIT ~ ., data = train, family = binomial) -> m
  – > model.matrix(m) -> dv
  – > View(dv)
  – Run the xgboost model
  – > dim(dv)
  – > library(xgboost)
  – > xgboost(data = dv, label = train$CREDIT, nrounds = 3, max_depth = 2,
      objective = "binary:logistic") -> xg
  – > xgboost(data = dv, label = train$CREDIT, nrounds = 20, max_depth = 2,
      objective = "binary:logistic") -> xg
  – For the test dataset, see the sketch below.
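The notes stop at the test step; a hedged sketch of how the test set could be scored with the hypothetical xg model above (the same dummy-variable formula is applied to the test data):

> model.matrix(CREDIT ~ ., data = test) -> dv_test
> predict(xg, dv_test) -> prob            # predicted probabilities from binary:logistic
> table(test$CREDIT, prob > 0.5)          # confusion table at a 0.5 cut-off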
