
SDSC 3006 Fundamentals of Machine Learning I

Assignment 2

Deadline: Sunday, October 29 @ 10:00 PM

1. Suppose we collect data for a group of students in a statistics class with variables 𝑋1 = hours
studied, 𝑋2 = undergrad GPA, and 𝑌 = receive an A. We fit a logistic regression and obtain the
estimated coefficients 𝛽̂0 = −6, 𝛽̂1 = 0.05, 𝛽̂2 = 1.
(a) Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5
gets an A in the class.
(b) How many hours would the student in part (a) need to study to have a 50% chance of getting
an A in the class?
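
For reference, the fitted logistic regression model relates the estimated probability p̂(X) to the predictors through

p̂(X) = exp(β̂0 + β̂1X1 + β̂2X2) / (1 + exp(β̂0 + β̂1X1 + β̂2X2)),

or equivalently, in log-odds form, log(p̂(X) / (1 − p̂(X))) = β̂0 + β̂1X1 + β̂2X2. Plugging the estimated coefficients and the given predictor values into these expressions addresses parts (a) and (b) respectively.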

2. Answer the following questions about the differences between LDA and QDA.
(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the
training set? On the test set?
(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on
the training set? On the test set?
(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA
relative to LDA to improve, decline, or be unchanged? Why?
(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will
probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible
enough to model a linear decision boundary. Justify your answer.

3. This question should be answered using the Weekly data set (ISLP package). It is similar in
nature to the Smarket data used in the textbook's lab, except that it contains 1,089 weekly returns
over 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be
any patterns?
(b) Use the full data set to perform a logistic regression with Direction as the response and the five
lag variables plus Volume as predictors. Use the summary function to print the results. Do any of
the predictors appear to be statistically significant? If so, which ones?
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the
confusion matrix is telling you about the types of mistakes made by logistic regression.
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2
as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions
for the held out data (that is, the data from 2009 and 2010).
(e) Repeat (d) using LDA.
(f) Repeat (d) using QDA.
(g) Repeat (d) using KNN with K = 1.
(h) Which of these methods appears to provide the best results on this data?
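
A minimal R sketch for parts (b)-(d), assuming the Weekly data is loaded from the ISLR2 R package (the R counterpart of the ISLP Python package named above) and that "Up" is predicted whenever the fitted probability exceeds 0.5; variable names follow the textbook:

library(ISLR2)
# (b) logistic regression on the full data set
fit.full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Weekly, family = binomial)
summary(fit.full)
# (c) confusion matrix and overall fraction of correct predictions
probs <- predict(fit.full, type = "response")
pred <- ifelse(probs > 0.5, "Up", "Down")
table(pred, Weekly$Direction)
mean(pred == Weekly$Direction)
# (d) train on 1990-2008, test on 2009-2010, Lag2 as the only predictor
train <- Weekly$Year < 2009
fit.lag2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial,
                subset = train)
probs.test <- predict(fit.lag2, Weekly[!train, ], type = "response")
pred.test <- ifelse(probs.test > 0.5, "Up", "Down")
table(pred.test, Weekly$Direction[!train])
mean(pred.test == Weekly$Direction[!train])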

4. Suppose that we obtain a bootstrap sample from a set of n observations.
(a) What is the probability that the first bootstrap observation is not the jth observation from the
original sample? Justify your answer.
(b) What is the probability that the second bootstrap observation is not the jth observation from
the original sample?
(c) Argue that the probability that the jth observation is not in the bootstrap sample is (1 − 1/n)^n.
(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?
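
A quick empirical check (not required by the question) of the part (c) result via simulation; n = 5 and 10,000 replications are illustrative choices:

set.seed(1)
n <- 5
# draw many bootstrap samples and record whether observation j = 1 appears
in.sample <- replicate(10000, 1 %in% sample(1:n, n, replace = TRUE))
mean(in.sample)     # empirical P(jth observation is in the bootstrap sample)
1 - (1 - 1/n)^n     # theoretical complement of the part (c) probability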

5. Answer the following questions about k-fold cross-validation.
(a) Explain how k-fold cross-validation is implemented.
(b) What are the advantages and disadvantages of k-fold cross-validation relative to:
i. The validation set approach?
ii. LOOCV?
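
As an illustration of part (a), a minimal sketch of k-fold cross-validation implemented by hand in R; the built-in mtcars data, the lm() model, and k = 5 are illustrative assumptions, not part of the question:

k <- 5
set.seed(1)
# randomly assign each observation to one of k roughly equal-sized folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv.errors <- numeric(k)
for (i in 1:k) {
  fit <- lm(mpg ~ hp, data = mtcars[folds != i, ])  # train on the other k-1 folds
  pred <- predict(fit, mtcars[folds == i, ])        # predict the held-out fold
  cv.errors[i] <- mean((mtcars$mpg[folds == i] - pred)^2)
}
mean(cv.errors)  # k-fold CV estimate of the test MSE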

6. Perform cross-validation on a simulated data set and answer the questions.
(a) Generate a simulated data set as follows:
> set.seed(1)
> y = rnorm(100)
> x = rnorm(100)
> y = x - 2 * x^2 + rnorm(100)
In this data set, what is n and what is p? Write out the model used to generate the data in equation
form.
(b) Create a scatterplot of X against Y. Comment on what you find.
(c) Set a random seed, and then compute the LOOCV errors that result from fitting the following
four models using least squares:
i. Y = β0 + β1X + ε
ii. Y = β0 + β1X + β2X^2 + ε
iii. Y = β0 + β1X + β2X^2 + β3X^3 + ε
iv. Y = β0 + β1X + β2X^2 + β3X^3 + β4X^4 + ε.
Note: you may find it helpful to use the data.frame() function to create a single data set containing
both X and Y.
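
One way to compute the LOOCV errors in part (c) is with cv.glm() from the boot package, whose first delta component is the LOOCV estimate of the test MSE; the poly() loop below is an illustrative way to fit models i-iv, and the sketch assumes x and y from part (a) are in the workspace:

library(boot)
set.seed(2)                      # any seed; LOOCV itself is deterministic
df <- data.frame(x = x, y = y)
loocv.errors <- numeric(4)
for (d in 1:4) {
  fit <- glm(y ~ poly(x, d), data = df)        # least squares fit via glm()
  loocv.errors[d] <- cv.glm(df, fit)$delta[1]  # LOOCV (K defaults to n)
}
loocv.errors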

(d) Repeat (c) using another random seed, and report your results. Are your results the same as
what you got in (c)? Why?
(e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain
your answer.
(f) Comment on the statistical significance of the coefficient estimates that results from fitting each
of the models in (c) using least squares. Do these results agree with the conclusions drawn based
on the cross-validation results?
