2019 - Introduction To Data Analytics Using R

This document contains an examination for an introductory course in data analytics using R. It consists of 8 questions worth a total of 100 marks. The questions cover topics such as random sampling, variance estimation, logistic regression, k-means clustering, principal component analysis, and decision trees. Students are instructed to attempt all questions and show their work clearly. They are allowed 2 hours to complete the examination.

Uploaded by

Yeickson Mendoza Martinez


UCL

(University of London)

BSc EXAMINATION

Department of Computer Science and Information Systems

INTRODUCTION TO DATA ANALYTICS USING R

CREDIT VALUE: 15 credits

Date of examination: MONDAY 03 JUNE 2019

Duration of paper: 09.00 – 11.00

RUBRIC

1. This paper contains 8 questions for a total of 100 marks.

2. Students should attempt to answer all of them.
3. The use of non-programmable electronic calculators is permitted.
4. This paper is not prior-disclosed.
5. Time allowed: 2 hours.

1. (11 marks)

(a) What is (simple) random sampling? Name two statistical learning methods where
random sampling with replacement is used. (4 marks)
(b) Write down the formula for calculating an unbiased estimate, $s^2$, of the variance of
a large (but finite) population, based on a random sample of n items. Define any
symbols you use. (3 marks)
(c) Create a box-and-whisker plot for the following dataset: 2, 5, 8, 9, 10, 13, 16, 19, 27.
(4 marks)
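For reference, the five-number summary that underlies such a plot can be sketched in R. This is a minimal illustration, not a model answer: the variable names are assumptions, and `fivenum()` uses Tukey's hinges, which may differ slightly from other quartile conventions.

```r
# Dataset from part (c)
x <- c(2, 5, 8, 9, 10, 13, 16, 19, 27)

# Tukey five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five)

# Points further than 1.5 * (hinge spread) beyond a hinge are drawn as outliers
spread <- five[["upper_hinge"]] - five[["lower_hinge"]]
lower_fence <- five[["lower_hinge"]] - 1.5 * spread
upper_fence <- five[["upper_hinge"]] + 1.5 * spread
outliers <- x[x < lower_fence | x > upper_fence]

# Draw the plot itself
boxplot(x, horizontal = TRUE, main = "Box-and-whisker plot")
```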

2. (13 marks)

(a) Finn is given a dataset with a categorical response variable and a number of predictor
variables. He intends to build a logistic regression model and a random forest model
and determine which of the two performs better. Explain how Finn can achieve this using R.
You don’t need to write any R code to answer this question, but you may refer to
functions or features of the R language to support your explanation if needed. (7 marks)
(b) If your model is overfitted, what could be done to reduce/avoid overfitting in general?
For decision trees and regression models specifically, explain how you would reduce
overfitting in each. (6 marks)

3. (12 marks)

(a) Explain how the k-means clustering algorithm works. (6 marks)

(b) Can k-means ever give results which contain more or fewer than k clusters? (2 marks)
(c) Why would it be recommended to run the k-means algorithm several times on the
same dataset? What is the best way to report results of all the runs? (4 marks)

4. (13 marks)
In a study of the effect of three chemicals (dioxin, bioxin and tioxin) on reproduction in
fish, the levels of the three chemicals (in parts per billion) were measured for 18 different ponds
in regions of Vietnam that had been exposed to Agent Orange (a herbicide) during
the Vietnam War. Researchers fertilised a sample of fish eggs in a sample of water from
each of the ponds, then counted how many of the fertilised eggs eventually hatched.
Logistic regression output is provided below.

(a) Using the following output of logistic regression where the categorical response (y) is
whether fish eggs would be hatched (1) or not (0), carefully choose the predictors and
write down a reasonable logistic regression model. Justify your answer. (4 marks)
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    5.5543      7.2946   -2.132    0.0043
dioxin         1.5859      0.3569      ???    0.5365
bioxin        -0.5643      0.3317      ???    0.0065
tioxin         1.9639      0.8800      ???   0.00054
(b) Does the probability of fish eggs hatching increase or decrease with bioxin? Justify
your answer. (2 marks)
(c) Suppose dioxin = 5 and bioxin = 25. What value of tioxin gives a probability of
50% that the fish eggs hatch? (4 marks)
(d) Suppose the dataset is called FishEggHatch. Write down the R command for the
logistic regression model you obtained in (a). (3 marks)

5. (15 marks)

(a) What approaches can be used to deal with unknown values in the dataset? (5 marks)
(b) What are the four main objectives of Principal Components Analysis (PCA)? (4 marks)
(c) What are the differences between Maximal Margin Classifiers (MMC) and Support
Vector Classifiers (SVC)? (6 marks)

6. (16 marks)

(a) We are trying to learn regression parameters for a dataset which we know was generated
from a polynomial of a certain degree, but we do not know what this degree
is. Assume the data was actually generated from a polynomial of degree 5 with some
added noise, that is

$$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5 + \varepsilon, \qquad \varepsilon \sim N(0, 1).$$

For training we have 100 (x, y)-pairs and for testing we are using an additional set of
100 (x, y)-pairs. Since we do not know the degree of the polynomial we learn two
models from the data.
Model A learns parameters for a polynomial of degree 4 and
Model B learns parameters for a polynomial of degree 6.
Which of these two models is likely to fit the test data better? Justify your answer.
(4 marks)
(b) Write down the R code to build and test the model you choose for (a). Give reasonable
names to the datasets you use. (6 marks)
(c) Which metric(s) from the R results would you use to compare the two models? What
do(es) the metric(s) tell us? (6 marks)
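The setup described in this question can be simulated with a short R sketch. Everything specific here is an assumption made for illustration: the coefficient vector `w`, the range of `x`, the random seed, and the names `train_data`, `test_data`, `model_a` and `model_b` are hypothetical and do not come from the paper. The sketch fits both candidate degrees and reports the test mean squared error, one possible metric for comparing them.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical true coefficients w0..w5 of the degree-5 polynomial
w <- c(1, -2, 0.5, 1.5, -0.3, 0.2)

# Draw n (x, y) pairs from the degree-5 polynomial plus N(0, 1) noise
simulate_pairs <- function(n) {
  x <- runif(n, min = -2, max = 2)                  # assumed range of x
  signal <- as.vector(outer(x, 0:5, `^`) %*% w)     # w0 + w1*x + ... + w5*x^5
  data.frame(x = x, y = signal + rnorm(n))
}

train_data <- simulate_pairs(100)
test_data <- simulate_pairs(100)

# Model A: degree 4; Model B: degree 6
model_a <- lm(y ~ poly(x, 4), data = train_data)
model_b <- lm(y ~ poly(x, 6), data = train_data)

# Mean squared prediction error on the held-out test pairs
test_mse <- function(model) {
  mean((test_data$y - predict(model, newdata = test_data))^2)
}

mse <- c(degree_4 = test_mse(model_a), degree_6 = test_mse(model_b))
print(mse)
```

The numbers printed depend entirely on the assumed coefficients and seed, so they illustrate the procedure and do not settle the question in general.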

7. (10 marks)

(a) Given the following dissimilarity matrix, identify which two clusters should be merged
next and write the dissimilarity matrix for the next round using single linkage. (5 marks)

       ADE     B       CF      G       H
ADE    0.00
B      2.64    0.00
CF     2.91    4.01    0.00
G      2.59    3.41    3.67    0.00
H      1.11    3.40    2.06    3.29    0.00

(b) Explain the principles of centroid linkage and highlight its main drawback. (5 marks)
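One round of the single-linkage update on the matrix in part (a) can be sketched in R as follows. This is an illustrative sketch, not a prescribed solution: the object names (`d`, `merged`, `d_next`) are assumptions, and the merged cluster is labelled by simply concatenating the two old labels.

```r
labels <- c("ADE", "B", "CF", "G", "H")

# Rebuild the full symmetric matrix from the lower triangle (filled column by column)
d <- matrix(0, 5, 5, dimnames = list(labels, labels))
d[lower.tri(d)] <- c(2.64, 2.91, 2.59, 1.11,   # B, CF, G, H versus ADE
                     4.01, 3.41, 3.40,         # CF, G, H versus B
                     3.67, 2.06,               # G, H versus CF
                     3.29)                     # H versus G
d <- d + t(d)

# Locate the smallest off-diagonal dissimilarity
masked <- d
diag(masked) <- Inf
idx <- which(masked == min(masked), arr.ind = TRUE)[1, ]
merged <- labels[sort(idx)]
keep <- setdiff(labels, merged)

# Single linkage: distance to the merged cluster is the minimum of the two old distances
new_row <- pmin(d[merged[1], keep], d[merged[2], keep])

# Assemble the dissimilarity matrix for the next round
new_label <- paste(merged, collapse = "")
new_labels <- c(new_label, keep)
d_next <- matrix(0, length(new_labels), length(new_labels),
                 dimnames = list(new_labels, new_labels))
d_next[keep, keep] <- d[keep, keep]
d_next[new_label, keep] <- new_row
d_next[keep, new_label] <- new_row
print(d_next)
```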

8. (10 marks)

(a) Assume we are trying to learn a decision tree. Our input data consists of n observations,
each with p predictors (n >> p). If all predictors are binary, what is the
maximal number of leaf nodes that we can have in a decision tree for this data? What
is the maximal number of internal nodes (including the root)? Justify your answer.
(6 marks)
(b) What is the leave-one-out cross validation error estimate for maximum margin separation
in the following figure? (We are asking for a number.) Justify your answer.
(4 marks)
