Lecture 7 - Decision Tree Regression Imran 19032025 103416am
Class Exercise
(Classification and regression trees)
Decision tree intuition
Since the set of splitting rules used to segment the predictor space can be
summarized in a tree, these types of machine learning approaches are
known as decision tree methods.
The basic idea of these methods is to partition the predictor space into
regions and assign each region a representative value (for regression, the
mean of the responses in that region).
• We are given some data
consisting of two
independent variables, x1
and x2
• The plot is a scatter plot
of the data
• We want to predict a
dependent variable y, which
is not visible on this
scatter plot
• We will work with the data
points to build a regression
tree and then consider y
One way to make predictions in a regression problem is to divide the
predictor space (i.e., all the possible values for X1, X2, …, Xp) into
distinct regions, say R1, R2, …, Rk (terminal nodes or leaves).
• The points along the tree where the predictor space is split are referred to as internal nodes.
• The regions created by the splits are called leaves.
• A region that cannot be split any further is called a terminal leaf.
• Split 1: x1 > 20 and x1 < 20
• Split 2: x2 > 170 and x2 < 170
(only for those points where x1 > 20)
• Split 3: x2 > 200 and x2 < 200
(only for those points where x1 < 20)
• Split 4: x1 > 40 and 20 < x1 < 40
(only for those points where x2 < 170)
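The four splits above can be sketched as a hand-coded routine that assigns an observation to one of the resulting regions. The region labels R1–R5 are my own, not from the slides, and observations exactly on a threshold are assigned arbitrarily to one side:

```python
def assign_region(x1, x2):
    """Return the leaf region for (x1, x2) under Splits 1-4."""
    if x1 < 20:                  # Split 1
        if x2 < 200:             # Split 3 (only where x1 < 20)
            return "R1"          # x1 < 20, x2 < 200
        return "R2"              # x1 < 20, x2 > 200
    if x2 > 170:                 # Split 2 (only where x1 > 20)
        return "R3"              # x1 > 20, x2 > 170
    if x1 < 40:                  # Split 4 (only where x2 < 170)
        return "R4"              # 20 < x1 < 40, x2 < 170
    return "R5"                  # x1 > 40, x2 < 170
```

For the observation x1 = 30, x2 = 50 used later in the slides, `assign_region(30, 50)` returns `"R4"`.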
• Now, what are we going to
put in the empty boxes?
• This is where we need to
consider y, i.e., the
dependent variable that
we need to predict or
model
• What we need to check is
how we are going to
predict the value of y for,
let us assume, an
observation that has
x1=30 and x2=50
• The observation x1=30
and x2=50 lies in the
terminal leaf shown in
green
• Now, how does knowing
that the observation lies
in the green terminal leaf
help us predict the value
of y?
• You take the average of all
the y values of the training
points in that terminal
leaf
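The averaging rule can be shown with a tiny sketch; the y values below are made up purely for illustration:

```python
# Hypothetical training targets, already grouped by terminal leaf.
leaf_targets = {
    "green": [60.0, 70.0, 62.0],   # made-up y values for the green leaf
    "blue":  [12.0, 18.0],         # made-up y values for another leaf
}

def leaf_prediction(leaf):
    """Predict y for any observation in `leaf`: the mean of its targets."""
    ys = leaf_targets[leaf]
    return sum(ys) / len(ys)
```

Here `leaf_prediction("green")` returns 64.0, the mean of the three made-up targets.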
• So let us assume the
averages for each
terminal leaf are as
given in the figure.
• Then, for the given
point x1=30 and x2=50,
the regression tree
algorithm will give the
output value or
predicted value of Y as
-64.1
• It is pretty
straightforward
• We need to remember
that the method works on
averages
• The goal of each split is
to add information to
better predict Y
• The last step is to
write the mean
values into the
decision tree
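Putting the two steps together, a hand-coded version of the finished tree routes an observation to its leaf and returns that leaf's stored mean. Only the -64.1 value for the green leaf appears in the slides; the other leaf means are hypothetical placeholders:

```python
def predict(x1, x2):
    """Manual regression-tree predictor; leaf values are training means."""
    if x1 < 20:
        if x2 < 200:
            return -10.0      # hypothetical mean
        return 25.0           # hypothetical mean
    if x2 > 170:
        return 100.0          # hypothetical mean
    if x1 < 40:
        return -64.1          # mean quoted in the slides for the green leaf
    return 5.0                # hypothetical mean
```

As in the slides, `predict(30, 50)` returns -64.1, the mean stored in the green terminal leaf.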
Example of decision tree regression
When there is more than one instance at a leaf node, we calculate the average of their target
values as the final prediction.
Implementation using Python
Introduction
Once you have imported the dataset, check which columns you need to consider.
Also check whether you need to apply any data pre-processing techniques:
• Handling missing values
• Data cleaning
• Encoding categorical data
• Feature scaling
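A minimal pre-processing sketch with pandas; the column names and data below are invented for illustration. Note that, unlike distance-based models, decision trees do not strictly require feature scaling:

```python
import pandas as pd

# Hypothetical raw data with a missing value and a categorical column.
df = pd.DataFrame({
    "city":  ["A", "B", "A", "C"],
    "rooms": [3, None, 2, 4],
    "price": [100, 150, 90, 200],
})

# Handling missing values: fill numeric gaps with the column mean.
df["rooms"] = df["rooms"].fillna(df["rooms"].mean())

# Encoding categorical data: one-hot encode the city column.
df = pd.get_dummies(df, columns=["city"])
```

After these steps the frame has no missing values and the `city` column is replaced by indicator columns such as `city_A`.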
Training the decision tree regression model on the whole data
• Example observation to predict, as a 2-D array: [[0,1,0,10,9194,2000]]
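The training step is typically done with scikit-learn's DecisionTreeRegressor. The training data below is invented; the six-column input row [[0, 1, 0, 10, 9194, 2000]] matches the shape of the prediction shown on the slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: 6 features per row (e.g. one-hot codes + numerics).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10000, size=(100, 6))
y = rng.uniform(50, 500, size=100)

# Train on the whole dataset, as the slides suggest (no train/test split here).
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)

# Predict for a single encoded observation (note the 2-D shape).
y_pred = regressor.predict([[0, 1, 0, 10, 9194, 2000]])
print(y_pred)
```

Because a regression tree predicts a leaf mean of the training targets, the output necessarily lies within the range of `y`.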