2019 - Introduction To Data Analytics Using R
(University of London)
BSc EXAMINATION
RUBRIC
Page 1 of 5
1. (11 marks)
(a) What is (simple) random sampling? Name two statistical learning methods where
random sampling with replacement is used. (4 marks)
(b) Write down the formula for calculating an unbiased estimate, s^2, of the variance of
a large (but finite) population, based on a random sample of n items. Define any
symbols you use. (3 marks)
(c) Create a box-and-whisker plot for the following dataset: 2, 5, 8, 9, 10, 13, 16, 19, 27.
(4 marks)
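A hand-drawn plot for part (c) can be checked against R. The sketch below is only one possible check: it assumes Tukey's hinge convention, which is what fivenum() and boxplot() use, and other quartile conventions taught for hand calculation can give slightly different box edges. The variable names are illustrative.

```r
# Data from part (c)
x <- c(2, 5, 8, 9, 10, 13, 16, 19, 27)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
print(five)

# Fences at 1.5 * IQR beyond the hinges; points outside are drawn as outliers
iqr <- five[4] - five[2]
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# Draw the plot; boxplot() relies on the same hinge-based statistics
boxplot(x, horizontal = TRUE, main = "Box-and-whisker plot")
```

Under this convention the whiskers extend to the most extreme observations that lie inside the fences.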
2. (13 marks)
(a) Finn is given a dataset with a categorical response variable and a number of predictor
variables. He intends to build a logistic regression model and a random forest model
and determine which of the two performs better. Explain how Finn can achieve this using R.
You do not need to write any R code to answer this question, but you may refer to
functions or features of the R language to support your explanation. (7 marks)
(b) If your model is overfitted, what could be done to reduce or avoid overfitting in general?
For decision trees and for regression models, explain how you would reduce
overfitting in each case. (6 marks)
3. (12 marks)
4. (13 marks)
In a study of the effect of three chemicals (dioxin, bioxin and tioxin) on reproduction in
fish, the levels of the three chemicals (in parts per billion) were measured for 18 different
ponds in regions of Vietnam that had been exposed to Agent Orange (a herbicide) during
the Vietnam War. Researchers fertilised a sample of fish eggs in a sample of water from
each of the ponds, then counted how many of the fertilised eggs eventually hatched.
Logistic regression output is provided below.
(a) Using the following logistic regression output, where the categorical response (y) is
whether the fish eggs hatched (1) or not (0), carefully choose the predictors and
write down a reasonable logistic regression model. Justify your answer. (4 marks)
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    5.5543      7.2946   -2.132   0.0043
dioxin         1.5859      0.3569      ???   0.5365
bioxin        -0.5643      0.3317      ???   0.0065
tioxin         1.9639      0.8800      ???   0.00054
(b) Does the probability of fish eggs hatching increase or decrease with bioxin? Justify
your answer. (2 marks)
(c) Suppose dioxin = 5 and bioxin = 25. What value of tioxin gives a probability
of 50% that the fish eggs hatch? (4 marks)
(d) Suppose the dataset is called FishEggHatch. Write down the R command that fits the
logistic regression model you obtained in (a). (3 marks)
5. (15 marks)
(a) What approaches can be used to deal with unknown values in the dataset? (5 marks)
(b) What are the four main objectives of Principal Components Analysis (PCA)? (4 marks)
(c) What are the differences between Maximal Margin Classifiers (MMC) and Support
Vector Classifiers (SVC)? (6 marks)
6. (16 marks)
(a) We are trying to learn regression parameters for a dataset which we know was
generated from a polynomial of a certain degree, but we do not know what this degree
is. Assume the data was actually generated from a polynomial of degree 5 with some
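added noise, that is,
y = w0 + w1 x + w2 x^2 + w3 x^3 + w4 x^4 + w5 x^5 + e,
where w0, ..., w5 are the unknown coefficients and e is the noise term.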
For training we have 100 (x, y)-pairs and for testing we are using an additional set of
100 (x, y)-pairs. Since we do not know the degree of the polynomial, we learn two
models from the data.
Model A learns parameters for a polynomial of degree 4 and
Model B learns parameters for a polynomial of degree 6.
Which of these two models is likely to fit the test data better? Justify your answer.
(4 marks)
(b) Write down the R code to build and test the model you choose for (a). Give reasonable
names to the datasets you use. (6 marks)
(c) Which metric(s) from the R results would you use to compare the two models? What
do(es) the metric(s) tell us? (6 marks)
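The setting of Question 6 can be explored with a small simulation. The sketch below is illustrative only: the generating coefficients, the range of x, the standard normal noise, the seed and the names poly_train, poly_test, model_a and model_b are all assumptions, not part of the question. It fits both candidate degrees on the training set and compares mean squared error on the held-out test set, which is one common choice of comparison metric.

```r
set.seed(1)  # reproducible simulated data (not taken from the exam)

# Hypothetical degree-5 generating polynomial; the coefficients are invented
true_f <- function(x) 1 + 2 * x - x^2 + 0.5 * x^3 - 0.3 * x^4 + 0.1 * x^5

# Draw n (x, y)-pairs with additive standard normal noise (an assumption)
make_data <- function(n) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = true_f(x) + rnorm(n))
}
poly_train <- make_data(100)
poly_test  <- make_data(100)

# Model A: degree 4, Model B: degree 6 (orthogonal polynomial terms)
model_a <- lm(y ~ poly(x, 4), data = poly_train)
model_b <- lm(y ~ poly(x, 6), data = poly_train)

# Mean squared prediction error on data not used for fitting
test_mse <- function(model, newdata) {
  mean((newdata$y - predict(model, newdata = newdata))^2)
}
mse_a <- test_mse(model_a, poly_test)
mse_b <- test_mse(model_b, poly_test)
print(c(degree4 = mse_a, degree6 = mse_b))
```

The result of a single simulated run depends on the invented coefficients and noise level, so it illustrates the workflow and does not settle the question.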
7. (10 marks)
(a) Given the following dissimilarity matrix, identify which two clusters should be merged
next and write the dissimilarity matrix for the next round using single linkage. (5 marks)
      ADE     B    CF     G     H
ADE  0.00
B    2.64  0.00
CF   2.91  4.01  0.00
G    2.59  3.41  3.67  0.00
H    1.11  3.40  2.06  3.29  0.00
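One single-linkage merge step on the matrix in part (a) can be sketched in R as follows. This is a minimal illustration rather than a model answer: the variable names are invented, and the update rule used is the standard single-linkage one, where the dissimilarity from the merged cluster to any other cluster is the smaller of the two original dissimilarities.

```r
labels <- c("ADE", "B", "CF", "G", "H")
d <- matrix(0, 5, 5, dimnames = list(labels, labels))

# Fill the lower triangle column by column, then mirror it
d[lower.tri(d)] <- c(2.64, 2.91, 2.59, 1.11,  # column ADE
                     4.01, 3.41, 3.40,        # column B
                     3.67, 2.06,              # column CF
                     3.29)                    # column G
d <- d + t(d)

# Locate the smallest off-diagonal dissimilarity
off <- d
diag(off) <- Inf
pair <- sort(unname(which(off == min(off), arr.ind = TRUE)[1, ]))

# Single linkage: distance to the merged cluster is the element-wise minimum
merged <- pmin(d[pair[1], ], d[pair[2], ])
keep <- setdiff(seq_along(labels), pair)
new_name <- paste0(labels[pair[1]], labels[pair[2]])

new_d <- rbind(cbind(d[keep, keep], merged[keep]), c(merged[keep], 0))
dimnames(new_d) <- list(c(labels[keep], new_name), c(labels[keep], new_name))
print(new_d)
```

The built-in hclust(as.dist(d), method = "single") performs the full sequence of merges and can be used to cross-check the first step.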
(b) Explain the principles of centroid linkage and highlight its main drawback. (5 marks)
8. (10 marks)
(a) Assume we are trying to learn a decision tree. Our input data consists of n
observations, each with p predictors (n >> p). If all predictors are binary, what is the
maximal number of leaf nodes that we can have in a decision tree for this data? What
is the maximal number of internal nodes (including the root)? Justify your answer.
(6 marks)
(b) What is the leave-one-out cross-validation error estimate for maximum margin
separation in the following figure? (We are asking for a number.) Justify your answer.
(4 marks)