FIT2086 Studio 9

Supervised Machine Learning Methods

Daniel F. Schmidt
October 9, 2020

Contents

1 Introduction

2 Decision Trees

3 Random Forests

4 k-Nearest Neighbours

5 Additional Questions

1 Introduction
Studio 9 introduces you to several supervised machine learning techniques; in particular, you will look
at using decision trees for regression and classification, as well as k nearest neighbours methods. To
complete this Studio, you will need to install three packages: rpart, randomForest and kknn in R.
During your Studio session, your demonstrator will go through the answers with you, both on the
board and on the projector as appropriate. Any questions you do not complete during the session
should be completed out of class before the next Studio. Complete solutions will be released on the
Friday after your Studio.

2 Decision Trees
In the first part of this Studio we will look at how to learn a basic decision tree from data, how to
visualise/interpret the tree, and how to make predictions. We will look at both continuous targets
(regression trees) as well as categorical targets (classification trees). Begin by ensuring that the rpart
package is loaded.

1. Load the diabetes.train.csv and diabetes.test.csv data into R. Use summary() to inspect
your training data; you will see that it has 10 predictors, AGE, SEX, BMI, BP (blood pressure)
and six blood serum measurements S1 through to S6; it also has a target variable Y, which is a
measure of diabetes progression over a fixed period of time. The higher this value, the worse the
diabetes progression.
2. Let us fit a decision tree to our training data. To do this, use

tree.diabetes = rpart(Y ~ ., diabetes.train)

which fits a decision tree to the data using some basic heuristics to decide when to stop growing
the tree.
3. We can also explore the relationships between the diabetes progression outcome variable (Y) and
the predictor variables that were used by the decision tree package. We will first do this by
examining the decision tree in the console:

tree.diabetes

This displays the tree in text form. The asterisks “*” denote the terminal (leaf) nodes of the
tree, and the nodes without asterisks are split nodes; the information contains which variables
are split on, what the splits are, and for each leaf node, what the predicted value of Y is. How
many leaf nodes are there in this tree? Which variables has the tree used to predict diabetes
progression?
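
As a hedged illustration of how the leaf count and the split variables can also be read off programmatically, the sketch below inspects the `frame` component of an rpart object. The built-in `mtcars` data (with `mpg` as the target) is only a stand-in, since the diabetes files are not included here, and the name `tree.demo` is made up for this example.

```r
library(rpart)  # rpart ships with standard R distributions

# Stand-in data: mtcars replaces the diabetes data, mpg plays the role of Y.
tree.demo = rpart(mpg ~ ., data = mtcars)

# Each row of tree.demo$frame describes one node; leaves are labelled "<leaf>".
is.leaf = tree.demo$frame$var == "<leaf>"
n.leaves = sum(is.leaf)

# Variables that appear in at least one split node.
vars.used = unique(as.character(tree.demo$frame$var[!is.leaf]))

cat("Leaf nodes:", n.leaves, "\n")
cat("Variables used in splits:", vars.used, "\n")
```

Applying the same two lines to tree.diabetes should agree with what you count by eye from the printed tree.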
4. The output of the above command produces all the information related to the tree we have
learned but can be hard to understand. It is easier to visualise the decision tree by using the
plot() function to get a graphical representation of the relationships:

plot(tree.diabetes)
text(tree.diabetes, digits=3)

This displays the tree, along with the various decision rules at each split node, and the predicted
value of Y at each leaf. You may need to click the “Zoom” button above the plot in R Studio to
get this picture displayed more clearly. Using this information answer the following questions:

(a) What is the estimated average diabetes progression for individuals with BMI = 28.0, blood
pressure (BP) = 96 and S6 = 110?
(b) What is the estimated average diabetes progression for individuals with BMI = 20.1, S5 =
4.7 and S3 = 38?
(c) Find the characteristics of the individuals with the worst (highest) predicted average diabetes
progression.
5. The rpart package provides a measure of importance for each variable. To access this, type:

tree.diabetes$variable.importance

This reports the variables in order of importance, the importance being defined by the amount
that they increase the goodness-of-fit of the tree to the data. Larger scores indicate more important
variables, though the numbers are themselves defined in terms of an arbitrary unit, so it might be better to use

tree.diabetes$variable.importance / max(tree.diabetes$variable.importance)

which normalizes the importance scores so that they are relative to the importance of the most
important predictor. Which three predictors are the most important?
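
The normalisation can be sketched end to end as follows. This is only an illustration on the built-in `mtcars` data (a stand-in for the diabetes data); the names `tree.demo`, `rel.imp` and `top3` are made up for the example.

```r
library(rpart)

# Stand-in data: mtcars replaces the diabetes data, mpg plays the role of Y.
tree.demo = rpart(mpg ~ ., data = mtcars)

# Raw importance scores (arbitrary units), then scaled so the largest is 1.
imp = tree.demo$variable.importance
rel.imp = imp / max(imp)

# Names of the (up to) three highest-scoring predictors.
top3 = head(names(sort(rel.imp, decreasing = TRUE)), 3)

print(round(rel.imp, 3))
print(top3)
```

After scaling, a score of 0.5 means the predictor is half as important as the top-ranked one under this measure.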
6. We can now test to see how well this tree predicts onto future data. We can use the predict()
function to get predictions for new data, and then calculate the root-mean squared error (RMSE):

sqrt(mean((predict(tree.diabetes, diabetes.test) - diabetes.test$Y)^2))

How can we interpret this score?
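
One way to approach this question is to note that RMSE is measured in the same units as Y, and to compare it with the RMSE of a trivial predictor that always guesses the mean. The sketch below does this on made-up numbers (not taken from the diabetes data); the helper name `rmse` is an assumption for illustration.

```r
# Root-mean-squared error between predictions y.hat and observed values y.
rmse = function(y.hat, y) sqrt(mean((y.hat - y)^2))

# Made-up numbers purely for illustration (not from the diabetes data).
y     = c(150, 90, 200, 120)
y.hat = c(140, 100, 180, 130)

rmse.model = rmse(y.hat, y)

# Baseline: predict the mean of y for everyone.
rmse.baseline = rmse(rep(mean(y), length(y)), y)

cat("Model RMSE:   ", rmse.model, "\n")
cat("Baseline RMSE:", rmse.baseline, "\n")
```

A model RMSE well below the baseline suggests the predictors carry real information about the target; in practice the baseline mean would be computed from the training data.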

7. The rpart package provides the ability to use cross-validation to try and “prune” the tree down,
removing extra predictors and simplifying the tree without damaging the predictions too much
(and potentially improving them). The code is a little involved, so I have included a wrapper
function in the file wrappers.R; source this file to load the wrapper functions into memory. Then,
to perform CV, run

cv = learn.tree.cv(Y ~ ., data=diabetes.train, nfolds=10, m=1000)

The nfolds parameter tells the code how many different ways to break up the data, and a value
of 10 is usually fine. The m parameter tells the code how many times to repeat the cross-validation
process (randomly dividing the data up, training on some of the data, testing on the remaining
data) – the higher this number is, the less the trees found by CV will vary from run to run, as
it reduces the random variability in the cross-validation scores – but the longer the training will
take as we are doing more cross-validation tests. The cv object returned by learn.tree.cv()
contains three entries. The cv$cv.stats object contains the statistics of the cross-validation.
We can visualise this using:
plot.tree.cv(cv)
This shows the cross-validation score (y-axis) against the tree size in terms of number of leaf
nodes (x-axis). The cv$best.cp entry is the best value of the complexity parameter for our
dataset, as estimated by CV, and can be passed to prune.rpart() to prune our tree down. The
optimum number of leaf nodes is plotted in red.
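
For intuition about what happens during pruning, here is a minimal sketch that uses only rpart's built-in cross-validation table rather than the wrapper. It is not the implementation of learn.tree.cv(): it performs a single cross-validation run (no repeats), uses the built-in `mtcars` data as a stand-in, and the names `full.tree`, `pruned.tree` and `count.leaves` are made up for the example.

```r
library(rpart)
set.seed(1)  # cross-validation folds are random

# Stand-in data (mtcars). cp = 0 and a small minsplit grow a deliberately large tree.
full.tree = rpart(mpg ~ ., data = mtcars,
                  control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# cptable has one row per candidate subtree; "xerror" is the cross-validated error.
cp.table = full.tree$cptable
best.row = which.min(cp.table[, "xerror"])
best.cp  = cp.table[best.row, "CP"]

# Prune the large tree back using the selected complexity parameter.
pruned.tree = prune(full.tree, cp = best.cp)

count.leaves = function(tree) sum(tree$frame$var == "<leaf>")
cat("Leaves before pruning:", count.leaves(full.tree), "\n")
cat("Leaves after pruning: ", count.leaves(pruned.tree), "\n")
```

Because this uses one random split of the data, rerunning with a different seed may select a different tree size, which is the variability that repeating the cross-validation many times is meant to reduce.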
8. The cv$best.tree object contains the pruned tree, using cv$best.cp as the pruning complexity
parameter. Plot this tree, and compare it to the previous tree tree.diabetes. How do they
compare?

(a) Has cross-validation removed any predictor variables from the original tree tree.diabetes?
(b) What are the characteristics that predict the worst diabetes progression in this new tree?
(c) What is the RMSE for this new tree cv$best.tree on the test data?
9. We can now compare the performance of our decision tree to a standard linear model. Use the
glmnet package to fit a linear model using the lasso, and calculate the RMSE for the fitted
model:

lasso.fit = cv.glmnet.f(Y ~ ., data=diabetes.train)

glmnet.tidy.coef(lasso.fit)
sqrt(mean((predict.glmnet.f(lasso.fit, diabetes.test) - diabetes.test$Y)^2))

How does the linear model compare in terms of which predictors it has chosen to use with the
tree selected by CV?

3 Random Forests
A random forest is a collection of classification or regression trees that are grown by controlled, random
splitting. Once grown, all the trees in a random forest are used to make predictions and to determine
which predictors are associated with the outcome variable. However, the relationship between the
predictors and the target is much more opaque than for a decision tree or linear model. In order to
use random forests in R you must install and load the randomForest package.
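
To make the idea of "many trees combined" concrete, the sketch below builds a small ensemble by hand. Note that this is bagging (bootstrap aggregation of rpart trees), a simplification of a random forest: it omits the random selection of candidate predictors at each split that the randomForest package performs. The built-in `mtcars` data is a stand-in, and the names `trees`, `all.preds` and `bagged.pred` are made up for the example.

```r
library(rpart)
set.seed(1)

n.trees = 25
n = nrow(mtcars)

# Fit each tree to a bootstrap resample (rows drawn with replacement).
trees = lapply(1:n.trees, function(b) {
  rows = sample(n, n, replace = TRUE)
  rpart(mpg ~ ., data = mtcars[rows, ])
})

# One column of predictions per tree, one row per observation.
all.preds = sapply(trees, function(tr) predict(tr, mtcars))

# The ensemble prediction is the average over the individual trees.
bagged.pred = rowMeans(all.preds)

print(head(round(bagged.pred, 1)))
```

Averaging smooths out the step-like predictions of any single tree, which is one reason ensembles tend to predict better but are harder to interpret.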

1. First, use R to learn a random forest from the diabetes.train data set:

rf.diabetes = randomForest(Y ~ ., data=diabetes.train)

This trains a forest of decision trees on our data.

2. Unlike a single decision tree, a random forest is difficult to visualise and interpret as it consists
of many hundreds or thousands of trees. After learning the random forest from the data, we can
inspect the model by typing:
rf.diabetes
This returns some basic information about the model, such as the percentage of variance
explained by the forest (roughly equivalent to 100 × R²).
3. To see how well our random forest predicts onto our testing data we can use the predict()
function and calculate RMSE:

sqrt(mean((predict(rf.diabetes, diabetes.test) - diabetes.test$Y)^2))

We can see that the random forest performs quite a bit better than our single best decision tree,
and is basically the same as the linear model in this case.
4. So far we have run the random forest package using the default settings for all parameters.
Although the package randomForest has many interesting user-settable options (see the help for
more details), the following two options are most useful for common use:

• ntree: Specifies the number of trees to grow. This should not be set to too small a number,
to ensure that every input row gets predicted at least a few times (Default: 500)

• importance: Should importance of predictors be computed? (Default: FALSE)
Let’s explore how we can use these options when analysing our diabetes data set.

rf.diabetes = randomForest(Y ~ ., data=diabetes.train, importance=TRUE, ntree=5000)

The number of trees in this example is set to 5,000. In general, using more trees leads to
improvements in prediction error. However, the computational cost of the algorithm grows
with the number of trees, which means large forests can take a long time to learn and use for
prediction. Calculate RMSE on the test data for this new random forest.
5. The option importance tells the random forest package that we wish to rank our predictor
variables in terms of their strength of association with the outcome. To view the final ranking,
we need to run:
round( importance( rf.diabetes ), 2)
The output of the command contains several columns which are different measures of variable
importance. When used for continuous outcome variables, the command produces %IncMSE, which
corresponds to the estimated increase in mean squared prediction error that occurs when the values
of a particular predictor are randomly permuted (a stand-in for omitting that predictor). Which
variables seem to be the most important using this measure?
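
The permutation idea behind this kind of importance score can be sketched by hand. The example below is only an illustration, not the randomForest computation: it uses a single rpart tree instead of a forest, evaluates on the training data rather than out-of-bag samples, reports a raw (not percentage) increase in MSE, and uses the built-in `mtcars` data as a stand-in. The names `fit`, `mse` and `perm.increase` are made up for the example.

```r
library(rpart)
set.seed(1)

# Stand-in model and data: one tree on mtcars, mpg plays the role of Y.
fit = rpart(mpg ~ ., data = mtcars)
mse = function(y.hat, y) mean((y.hat - y)^2)
base.mse = mse(predict(fit, mtcars), mtcars$mpg)

predictors = setdiff(names(mtcars), "mpg")

# For each predictor: shuffle its column, re-predict, record the rise in MSE.
perm.increase = sapply(predictors, function(v) {
  shuffled = mtcars
  shuffled[[v]] = sample(shuffled[[v]])  # breaks the link between v and mpg
  mse(predict(fit, shuffled), shuffled$mpg) - base.mse
})

print(round(sort(perm.increase, decreasing = TRUE), 2))
```

Predictors that the model never relies on show an increase of zero, because shuffling them leaves the predictions unchanged.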

4 k-Nearest Neighbours
The last supervised machine learning technique we will examine is the k-Nearest Neighbours (kNN)
method. A big advantage of kNN-based methods is their great flexibility, speed and lack of
assumptions. The obvious disadvantage is that, like random forests, they are difficult to interpret; they also
suffer from the fact that selecting important predictors is not naturally handled by most packages. To
use kNN in R, install and load the kknn package.

1. Once again, let’s look at the diabetes data. Using the default settings, a kNN can be used to
make predictions with the code below:

ytest.hat = fitted( kknn(Y ~ ., diabetes.train, diabetes.test) )

The fitted command tells the R package to use the k-NN algorithm to make predictions about
future (test) data. Calculate the RMSE on the test data using:

sqrt(mean((ytest.hat - diabetes.test$Y)^2))

How does this compare to the prediction errors achieved by the linear model, decision tree and
random forests?
The kNN approach does not actually build a model to describe the data. Instead, to make a
prediction about an individual’s outcome, a k-NN finds the k closest individuals (in terms of
predictor variables) in the training data to the individual in question and uses a combination
of their target variables to make predictions on future data. Practically, this means that when
using the kknn package, we need to specify both the training and test dataset whenever we need
to make predictions using fitted(). A downside of this model-free approach is that the larger
our training data, the slower it becomes to produce predictions onto new data.
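
The "find the k closest individuals and combine their targets" step can be written out in a few lines of base R. This is a minimal sketch, not what the kknn package does internally: it uses Euclidean distance on standardised predictors and a plain unweighted average of the neighbours' targets. The function name `knn.predict` and the toy numbers are made up for illustration.

```r
# Predict targets for the rows of x.test from their k nearest rows in x.train.
knn.predict = function(x.train, y.train, x.test, k) {
  # Standardise both sets using the training means and standard deviations.
  mu = colMeans(x.train)
  s  = apply(x.train, 2, sd)
  z.train = scale(x.train, center = mu, scale = s)
  z.test  = scale(x.test,  center = mu, scale = s)

  apply(z.test, 1, function(z) {
    d = sqrt(colSums((t(z.train) - z)^2))  # distance to every training row
    mean(y.train[order(d)[1:k]])           # average the k nearest targets
  })
}

# Toy data purely for illustration: two well-separated groups.
x.train = cbind(a = c(1, 2, 3, 10, 11, 12), b = c(1, 1, 2, 9, 10, 10))
y.train = c(1, 2, 3, 10, 11, 12)
x.test  = rbind(c(2, 1), c(11, 10))

y.hat = knn.predict(x.train, y.train, x.test, k = 3)
print(y.hat)
```

Note that every prediction requires a distance to every training row, which is why prediction slows down as the training data grows.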
2. As with random forests, the k-NN procedure has several options that control the behaviour of
the algorithm. The most important of these are k and kernel. The k option controls the size
of the neighbourhood used when making predictions, i.e., how many individuals from the training
data are used to form a prediction on the test data. The kernel option determines how the
individuals in the neighbourhood are combined when making predictions (see Lecture 9
for details on these types of parameters). The best values for these parameters will depend on the
particular dataset and, as with lasso hyperparameters or the size of a tree, can be chosen
automatically using cross-validation. To do this, use the following (provided wrapper function):

kernels = c("rectangular","triangular","epanechnikov","gaussian","rank","optimal")
knn = train.kknn(Y ~ ., data = diabetes.train, kmax=25, kernel=kernels)
ytest.hat = fitted( kknn(Y ~ ., diabetes.train, diabetes.test,
    kernel = knn$best.parameters$kernel, k = knn$best.parameters$k) )

The above code uses train.kknn() to try a combination of different k values (1 to kmax) and
different kernels, and stores all these inside the knn object. We can then use the k and kernel
nominated as best by cross-validation in conjunction with the fitted() command to get
improved predictions.
Calculate the RMSE on the testing data for the predictions made by the k-NN method with the
k and kernel chosen by cross-validation. How do they compare to the RMSE scores obtained by
the linear model, decision tree and random forest?
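
To see what choosing k by cross-validation involves, the sketch below runs leave-one-out cross-validation by hand (the kknn documentation describes train.kknn() as using leave-one-out cross-validation). It is a simplified illustration, not the package's algorithm: it uses one simulated predictor, an unweighted average of neighbours, and searches over k only, not over kernels. All names and the simulated data are made up for the example.

```r
set.seed(1)

# Simulated data purely for illustration.
n = 60
x = runif(n, 0, 10)
y = sin(x) + rnorm(n, sd = 0.3)

# Predict observation i from its k nearest neighbours among the other points.
loo.predict = function(i, k) {
  d = abs(x[-i] - x[i])
  mean(y[-i][order(d)[1:k]])
}

# Leave-one-out RMSE for each candidate neighbourhood size.
k.values = 1:25
loo.rmse = sapply(k.values, function(k) {
  preds = sapply(1:n, loo.predict, k = k)
  sqrt(mean((preds - y)^2))
})

best.k = k.values[which.min(loo.rmse)]
cat("k chosen by leave-one-out CV:", best.k, "\n")
```

Very small k tends to follow the noise and very large k tends to over-smooth, so the selected value usually lies somewhere in between.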

5 Additional Questions
You have now learned how to use the rpart, randomForest and kknn packages to build machine
learning models for predicting, and in the case of trees, learning which variables are important. You
can use this newfound knowledge to apply these models to some of the datasets we have examined over
the last few weeks.
All three methods can be applied to categorical target variables (i.e., classification). The only
changes you need to make are when computing predictions. When using decision trees, the
predictions are the probabilities of the target being in each of the classes; for our binary classification
problems, you need to take the second column as the probabilities to pass to my.pred.stats(),
i.e.,
my.pred.stats(predict(tree, pima.test)[,2], pima.test$DIABETES)
To produce probabilities of classification for random forests you need to use predict() with the type
argument set appropriately, i.e., if rf.pima is our random forest trained on the Pima Indians data,
then we can predict on future data using
predict(rf.pima,pima.test,type="prob")[,2]
The k-NN package can only produce the best guesses at our target classes, so we cannot compute
AUC scores or log-loss. Instead, if ytest.hat are the predictions, then we can compute classification
accuracy using something like
mean(ytest.hat == pima.test$DIABETES)*100
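
The accuracy calculation, and the step from class probabilities to class guesses, can be checked on a tiny example. The labels "N" and "Y", the probabilities, and the 0.5 threshold below are all made up for illustration and are not taken from the Pima data.

```r
# Made-up true classes and predicted probabilities of the second class ("Y").
y.true = factor(c("N", "Y", "Y", "N", "Y"), levels = c("N", "Y"))
prob.Y = c(0.2, 0.7, 0.4, 0.1, 0.9)  # e.g. the second column of a probability matrix

# Guess the second class whenever its probability exceeds 0.5.
y.hat = factor(ifelse(prob.Y > 0.5, "Y", "N"), levels = levels(y.true))

# Percentage of test cases classified correctly.
accuracy = mean(y.hat == y.true) * 100
cat("Classification accuracy (%):", accuracy, "\n")
```

Here four of the five guesses match the true class; the third individual is misclassified because their probability falls below the threshold.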
Using this, do the following:
1. Explore the genetic data we examined last week using decision trees, random forests and k-NN
methods. What variables do the tree-based methods select?
2. Explore the Pima Indians data using these three methods. How do the predictions compare to
the logistic regression based methods we examined? What variables do the tree-based methods
select?
