Lecture 16: Statistical Inference by Dr. Javed Iqbal
Decision Tree Models for Regression and Classification
Machine Learning is an AI technique that teaches computers to learn from experience. Machine
learning algorithms use computational methods to “learn” information directly from data
without relying on a predetermined equation as a model.

A decision tree is a supervised machine learning technique that segments or stratifies the predictor space into simple regions (each containing mostly one type of object, or observations with similar y values). The splitting rules can be represented by an (upside-down) tree, hence the name decision tree method.

Regression Tree:
Consider modelling baseball players' salary, a quantitative variable; the resulting tree model is therefore a regression tree. There are 322 observations (players) with 20 variables measured on each. Many variables may affect salary (salary is in thousands of dollars, expressed in logs), but suppose we consider only two predictors (X1: years of experience and X2: hits made in the previous season). The resulting decision tree is represented as follows:

[Salary is color coded from low (blue, green) to high (yellow, red)]

Overall, the tree stratifies or segments players into three regions of predictor space:
$R_1 = \{X \mid \text{Years} < 4.5\}$, $R_2 = \{X \mid \text{Years} \ge 4.5,\ \text{Hits} < 117.5\}$,
$R_3 = \{X \mid \text{Years} \ge 4.5,\ \text{Hits} \ge 117.5\}$
Here we have two internal nodes and three leaves (terminal nodes). Within a given region, the predicted salary is the average salary of the players in that region.
Interpretation of the tree outcome: Years is the most important predictor of salary, so players with fewer years of experience (in particular, less than 4.5 years) have lower salaries. Given that a player has low experience, the number of hits plays no role in salary. But among players with more than 4.5 years of experience, the number of hits also becomes important, with higher salaries paid to players with a high number of hits.
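This tree can be reproduced in R with a minimal sketch, assuming the Hitters data from the ISLR package (the standard source of this baseball salary example; the lecture's own data file may differ):

# Regression tree for log(Salary) on Years and Hits
# (a sketch assuming the ISLR package's Hitters data)
library(tree)
library(ISLR)
hitters <- na.omit(Hitters)               # Salary is missing for some players
salary_tree <- tree(log(Salary) ~ Years + Hits, data = hitters)
plot(salary_tree)
text(salary_tree, pretty = 0)             # labels splits such as Years < 4.5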
In general, the goal of the tree algorithm is to find regions (boxes) $R_1, R_2, \ldots, R_J$ that minimize the residual sum of squares (RSS), given by:

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \bar{y}_{R_j} \right)^2$$

where $\bar{y}_{R_j}$ is the mean response for the training observations within the jth box.

Which variable should the tree start with, and where should the cut be made? For example, why Years < 4.5? This is determined by the decision tree algorithm: the cut at this point on this variable brings the largest reduction in the residual sum of squares. Initially, the residual sum of squares equals the sum of squared deviations of all observations y from the grand mean $\bar{y}$. The algorithm considers each predictor and all possible cut points for that predictor, and selects the cut that provides the greatest reduction in the residual sum of squares. After a cut is made, the algorithm keeps segmenting the predictor space to look for further reductions in the residual sum of squares, as in the sketch below.
The algorithm stops when a stopping criterion is reached, e.g., when only a certain number of cases remain in a region of the predictor space.
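To make the greedy search concrete, here is a small illustrative sketch in R (the toy data and variable names are hypothetical, not from the lecture) that scans every candidate cut point on a single predictor and reports the one giving the largest RSS reduction:

# Greedy search for the best cut point on one predictor (illustrative toy data)
set.seed(1)
x <- runif(50, 0, 20)                                 # e.g., years of experience
y <- ifelse(x < 4.5, 5.0, 6.5) + rnorm(50, sd = 0.5)  # log-salary-like response
rss <- function(v) sum((v - mean(v))^2)               # residual sum of squares
cuts <- sort(unique(x))
split_rss <- sapply(cuts, function(cc) {
  left <- y[x < cc]; right <- y[x >= cc]
  if (length(left) == 0) return(rss(y))               # no split at the minimum
  rss(left) + rss(right)                              # RSS after this cut
})
best_cut <- cuts[which.min(split_rss)]
cat("Best cut at x =", round(best_cut, 2),
    "; RSS drops from", round(rss(y), 2),
    "to", round(min(split_rss), 2), "\n")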
Ex1: Prepare a decision tree from the given rectangular partition of the predictor space (right panel), or, vice versa, given the decision tree (left panel), prepare a rectangular partition of the predictor space for the two predictors X1 and X2.
Ex2: Prepare a decision tree from the given rectangular partition of the predictor space for the two predictors X1 and X2.

Ex3: Prepare a rectangular partition of the predictor space corresponding to the decision tree given below for the two predictors X1 and X2.

Classification Tree:
Used for qualitative dependent variables. The tree-building procedure is similar to that of the regression tree.
Here the prediction of Y for a test case is given by the label of the most commonly occurring class in the region into which the test case falls (i.e., it is based on a majority vote).
We keep segmenting the predictor space to minimize node impurity, measured by, e.g., the Gini coefficient G (here K is the number of classes of the response variable, e.g., K = 2 for binary classification):
$$G = \sum_{k=1}^{K} \hat{p}_{mk} \, (1 - \hat{p}_{mk})$$
Here $\hat{p}_{mk}$ is the proportion of training observations in the mth region that are from the kth class. G is smaller when the region contains mostly one type of case/observation, so that each $\hat{p}_{mk}$ is close to either zero or one. When the two types of cases occur in equal numbers in a region, $\hat{p}_{mk}$ equals 0.5 and G is at its largest.
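As a quick numerical illustration, the Gini measure can be computed directly from a node's class proportions (the helper function gini below is written just for this note):

# Gini impurity from a vector of class proportions (illustrative helper)
gini <- function(p) sum(p * (1 - p))
gini(c(0.5, 0.5))   # 0.50: evenly mixed binary node, maximum impurity
gini(c(0.9, 0.1))   # 0.18: mostly one class, much purer
gini(c(1, 0))       # 0.00: a pure node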

Ex4: Consider predicting the loan default status (yes or no) of a sample of a bank's clients from the following classification tree. The predictors are whether the client owns a home, marital status (married or single/divorced), and annual income (in dollars).

a) Predict whether or not a married client who has an income of $100,000 and who does not own a house will default.
b) Predict whether a single client who has an income of $60,000 and who owns a house will default.
c) Predict whether a single client who has an income of $80,000 and who does not own a house will default.
d) What are the two most important predictors of default status?
e) What are the numbers of internal nodes and leaves in this problem?
f) Is this a regression or classification problem?
[Ans: a) not default, b) not default, c) default, d) home ownership and marital status, e) 3 internal nodes and 4 leaves, f) classification]

Ex5: Consider the tree grown on the response variable 'pollution level' and seven predictors, as shown in the tree plot, corresponding to different test locations. The predictors include the 'number of industries' in the location, the 'population' in thousands in that location, the average number of 'wet days' in the year, the average 'temperature' in Fahrenheit, and the 'wind speed' in km/hour. The number in each leaf node is the average pollution level in that region of the predictor space.
a) Is this a regression or classification tree?
b) What are the two most important variables determining the pollution level in an area?
c) Predict the average pollution level in a test location which has 500 industries, a population of 200 thousand, an average of 150 wet days, an average temperature of 50 Fahrenheit, and a wind speed of 8 km/hour.
[Ans: a) regression, b) number of industries and population in the test area, c) 33.88]

Ex6: Consider the tree grown to predict whether a client will buy a computer (yes or no). The predictors are age (with three levels: youth, middle-aged, and senior), whether the client is a student, and the individual's credit rating (fair or excellent). The resulting classification tree is as follows:

a) Is this a regression or classification problem?
b) Predict whether a youth who is not a student and has a fair credit rating will buy a computer.
c) Predict whether a senior who has an excellent credit rating will buy a computer.
[Ans: a) classification, b) not buy, c) buy]
[Further details in Ch. 3, p. 172 and onwards, in Bowerman's book]
Decision Tree Models in R (using the hiring.csv data file):

# Decision tree for classification, using the hiring data
library(tree)                           # load the 'tree' package
hiring <- read.csv(file.choose())       # interactively select hiring.csv
head(hiring)                            # inspect the first few rows
hiring$hire <- as.factor(hiring$hire)   # the dependent variable must be a factor
set.seed(1)
# attach(hiring) is unnecessary here, since data = hiring is passed to tree()
hiring_model <- tree(hire ~ educ + exp + male, data = hiring)  # fit the model
# note: deviance (also known as cross-entropy) is the default impurity measure
plot(hiring_model)
text(hiring_model, pretty = 0)

# Pruning results in a smaller, more manageable and interpretable tree
hiring_pruned <- prune.tree(hiring_model, best = 4)  # a tree with 4 leaves
plot(hiring_pruned)
text(hiring_pruned, pretty = 0)
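To use the pruned tree for prediction, a short sketch follows (the applicant's values for educ, exp, and male are hypothetical; code them the same way as the corresponding columns in hiring.csv):

# Predict the hire status of a hypothetical new applicant
new_applicant <- data.frame(educ = 16, exp = 5, male = 1)
predict(hiring_pruned, newdata = new_applicant, type = "class")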
