2019 - Introduction To Data Analytics Using R

This document contains an examination for an introductory course in data analytics using R. It consists of 8 questions worth a total of 100 marks. The questions cover topics such as random sampling, variance estimation, logistic regression, k-means clustering, principal component analysis, and decision trees. Students are instructed to attempt all questions and show their work clearly. They are allowed 2 hours to complete the examination.

UCL

(University of London)

BSc EXAMINATION

Department of Computer Science and Information Systems

INTRODUCTION TO DATA ANALYTICS USING R

CREDIT VALUE: 15 credits

Date of examination: MONDAY 03 JUNE 2019

Duration of paper: 09.00 – 11.00

RUBRIC

1. This paper contains 8 questions for a total of 100 marks.
2. Students should attempt to answer all of them.
3. The use of non-programmable electronic calculators is permitted.
4. This paper is not prior-disclosed.
5. Time allowed: 2 hours.

1. (11 marks)

(a) What is (simple) random sampling? Name two statistical learning methods where
random sampling with replacement is used. (4 marks)
(b) Write down the formula for calculating an unbiased estimate, s², of the variance of
a large (but finite) population, based on a random sample of n items. Define any
symbols you use. (3 marks)
(c) Create a box-and-whisker plot for the following dataset: 2, 5, 8, 9, 10, 13, 16, 19, 27.
(4 marks)
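
For part (c), one way to check a hand-drawn plot is to let R compute the five-number summary. This is a minimal sketch rather than a model answer, and the variable names are assumptions. fivenum() uses Tukey's hinges, so the quartiles it reports (8 and 16) can differ from those of other textbook conventions, such as taking the medians of the lower and upper halves with the overall median excluded (6.5 and 17.5).

```r
# Data from question 1(c)
x <- c(2, 5, 8, 9, 10, 13, 16, 19, 27)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
five   # 2 8 10 16 27

# Points further than 1.5 * (hinge spread) from the hinges are drawn
# individually; for this dataset there are none.
out <- boxplot.stats(x)$out

# Draw the box-and-whisker plot itself
boxplot(x, horizontal = TRUE, main = "Question 1(c) data")
```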

2. (13 marks)

(a) Finn is given a dataset with a categorical response variable and a number of predictor
variables. He intends to build a logistic regression model and a random forest model
and determine which of the two performs better. Explain how Finn can achieve this using R.
You do not need to write any R code to answer this question, but you may refer to
functions or features of the R language to support your explanation. (7 marks)
(b) If a model is overfitted, what can be done to reduce or avoid overfitting in general?
For decision trees and for regression models, explain how you would reduce
overfitting in each case. (6 marks)

3. (12 marks)

(a) Explain how the k-means clustering algorithm works. (6 marks)
(b) Can k-means ever give results which contain more or fewer than k clusters? (2 marks)
(c) Why would it be recommended to run the k-means algorithm several times on the
same dataset? What is the best way to report results of all the runs? (4 marks)
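
Part (c) concerns the sensitivity of k-means to its random starting centres. The following sketch, on synthetic data invented purely for illustration, runs kmeans() several times and compares the total within-cluster sum of squares. The names pts, runs, wss, best and multi are assumptions, and this is one possible way to explore the question rather than a model answer.

```r
set.seed(42)

# Synthetic two-dimensional data with three loose groups (made up)
pts <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

# Ten independent runs, each from its own random starting centres
runs <- lapply(1:10, function(i) kmeans(pts, centers = 3))

# Total within-cluster sum of squares of each run (lower is tighter)
wss <- sapply(runs, function(r) r$tot.withinss)
wss

# Keep the run with the smallest total within-cluster sum of squares
best <- runs[[which.min(wss)]]

# kmeans() can do the same internally: nstart random starts, best one kept
multi <- kmeans(pts, centers = 3, nstart = 10)
multi$tot.withinss
```

If the values in wss differ between runs, the algorithm has converged to different local optima, which is the reason for repeating it.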

4. (13 marks)
In a study of the effect of three chemicals (dioxin, bioxin and tioxin) on reproduction in
fish, the levels of the three chemicals (in parts per billion) were measured for 18 different
ponds in regions of Vietnam that had been exposed to Agent Orange (a herbicide) during
the Vietnam War. Researchers fertilised a sample of fish eggs in a sample of water from
each of the ponds, then counted how many of the fertilised eggs eventually hatched.
Logistic regression output is provided below.

(a) Using the following output of logistic regression where the categorical response (y) is
whether fish eggs would be hatched (1) or not (0), carefully choose the predictors and
write down a reasonable logistic regression model. Justify your answer. (4 marks)
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    5.5543      7.2946   -2.132    0.0043
dioxin         1.5859      0.3569      ???    0.5365
bioxin        -0.5643      0.3317      ???    0.0065
tioxin         1.9639      0.8800      ???    0.00054
(b) Does the probability of fish eggs hatching increase or decrease with bioxin? Justify
your answer. (2 marks)
(c) Suppose dioxin = 5 and bioxin = 25. What value of tioxin gives a probability of
50% that the fish eggs hatch? (4 marks)
(d) Suppose the dataset is called FishEggHatch. Write down the R command for the
logistic regression model you got in (a). (3 marks)
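
The general shape of the glm() call asked for in (d), and the log-odds reasoning behind (c), can be sketched as follows. The data frame here is synthetic (200 made-up rows, not the 18 ponds of the study), the coefficients used to simulate it are hypothetical, and the choice of bioxin and tioxin as predictors is only one possible reading of the printed p-values, so this should be treated as an illustration rather than a model answer.

```r
set.seed(1)

# Synthetic stand-in for FishEggHatch; all values are invented
FishEggHatch <- data.frame(
  dioxin = runif(200, 0, 10),
  bioxin = runif(200, 0, 30),
  tioxin = runif(200, 0, 10)
)
# Hypothetical log-odds used only to simulate a 0/1 response y
eta <- 1 - 0.2 * FishEggHatch$bioxin + 0.5 * FishEggHatch$tioxin
FishEggHatch$y <- rbinom(200, size = 1, prob = plogis(eta))

# Logistic regression: binomial family with the default logit link
fit <- glm(y ~ bioxin + tioxin, data = FishEggHatch, family = binomial)
summary(fit)$coefficients

# A probability of 0.5 corresponds to log-odds of 0, so solve
# b0 + b_bioxin * 25 + b_tioxin * t = 0 for t
b <- coef(fit)
tioxin_50 <- unname(-(b["(Intercept)"] + b["bioxin"] * 25) / b["tioxin"])
tioxin_50
```

The same rearrangement applies to the coefficients printed in the question: set the linear predictor to zero and solve for tioxin.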

5. (15 marks)

(a) What approaches can be used to deal with unknown values in the dataset? (5 marks)
(b) What are the four main objectives of Principal Components Analysis (PCA)? (4 marks)
(c) What are the differences between Maximal Margin Classifiers (MMC) and Support
Vector Classifiers (SVC)? (6 marks)

6. (16 marks)

(a) We are trying to learn regression parameters for a dataset which we know was
generated from a polynomial of a certain degree, but we do not know what this degree
is. Assume the data was actually generated from a polynomial of degree 5 with some
added noise, that is

y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5 + ε,   ε ∼ N(0, 1).

For training we have 100 (x, y)-pairs and for testing we are using an additional set of
100 (x, y)-pairs. Since we do not know the degree of the polynomial we learn two
models from the data.
Model A learns parameters for a polynomial of degree 4 and
Model B learns parameters for a polynomial of degree 6.
Which of these two models is likely to fit the test data better? Justify your answer.
(4 marks)
(b) Write down the R code to build and test the model you choose for (a). Give reasonable
names to the datasets you use. (6 marks)
(c) Which metric(s) from the R results would you use to compare the two models? What
do(es) the metric(s) tell us? (6 marks)
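
The setup in this question can be simulated to see how the two models behave. The sketch uses hypothetical coefficients w and a made-up range for x, since the exam gives neither; the names train_data, test_data, model_A and model_B are assumptions. It compares the models by test mean squared error, which is one possible metric among several.

```r
set.seed(2019)

# Hypothetical true coefficients w0..w5 (not given in the question)
w <- c(1, -2, 0.5, 1.5, -0.3, 0.8)

# Generate n (x, y)-pairs from the degree-5 polynomial plus N(0, 1) noise
gen <- function(n) {
  x <- runif(n, -2, 2)
  y <- as.vector(outer(x, 0:5, "^") %*% w) + rnorm(n)
  data.frame(x = x, y = y)
}
train_data <- gen(100)
test_data  <- gen(100)

# Model A: degree 4, Model B: degree 6, both fitted on the training set
model_A <- lm(y ~ poly(x, 4), data = train_data)
model_B <- lm(y ~ poly(x, 6), data = train_data)

# Mean squared prediction error on the held-out test set
test_mse <- function(m) {
  mean((test_data$y - predict(m, newdata = test_data))^2)
}
mse <- c(A = test_mse(model_A), B = test_mse(model_B))
mse
```

Comparing the test error with the training error of each model (for example via summary(model_B)$sigma) gives a rough indication of how much each fit relies on noise in the training set.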

7. (10 marks)

(a) Given the following dissimilarity matrix, identify which two clusters should be merged
next and write the dissimilarity matrix for the next round using single linkage. (5 marks)

       ADE     B      CF     G      H
ADE    0.00
B      2.64    0.00
CF     2.91    4.01   0.00
G      2.59    3.41   3.67   0.00
H      1.11    3.40   2.06   3.29   0.00

(b) Explain the principles of centroid linkage and highlight its main drawback. (5 marks)
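
The single-linkage update in part (a) can be checked mechanically. This sketch enters the matrix from the question, finds the closest pair of clusters, and rebuilds the matrix using the minimum of the two merged rows. The variable names are assumptions, and the code is an illustration of the update rule rather than a model answer.

```r
labs <- c("ADE", "B", "CF", "G", "H")

# Symmetric dissimilarity matrix from the question (lower triangle, by column)
D <- matrix(0, 5, 5, dimnames = list(labs, labs))
D[lower.tri(D)] <- c(2.64, 2.91, 2.59, 1.11,
                     4.01, 3.41, 3.40,
                     3.67, 2.06,
                     3.29)
D <- D + t(D)

# Closest pair of distinct clusters (diagonal masked out)
M <- D
diag(M) <- Inf
idx <- which(M == min(M), arr.ind = TRUE)[1, ]
i <- min(idx)
j <- max(idx)

# Single linkage: distance from the merged cluster to each other cluster
# is the smaller of the two original distances
merged <- pmin(D[i, ], D[j, ])
keep <- setdiff(seq_along(labs), c(i, j))
new_labs <- c(paste0(labs[i], labs[j]), labs[keep])

D2 <- matrix(0, length(new_labs), length(new_labs),
             dimnames = list(new_labs, new_labs))
D2[1, -1] <- merged[keep]
D2[-1, 1] <- merged[keep]
D2[-1, -1] <- D[keep, keep]
D2
```

Replacing pmin() with pmax() would give the complete-linkage update instead.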

8. (10 marks)

(a) Assume we are trying to learn a decision tree. Our input data consists of n
observations, each with p predictors (n >> p). If all predictors are binary, what is the
maximal number of leaf nodes that we can have in a decision tree for this data? What
is the maximal number of internal nodes (including the root)? Justify your answer.
(6 marks)
(b) What is the leave-one-out cross-validation error estimate for maximum margin
separation in the following figure? (We are asking for a number.) Justify your answer.
(4 marks)
