Assignment Solution 2
(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of
3.5 gets an A in the class.
$$P(Y=1) = \frac{e^{\beta_0+\beta_1 x_1+\beta_2 x_2}}{1+e^{\beta_0+\beta_1 x_1+\beta_2 x_2}} = \frac{e^{-6+0.05\times 40+3.5}}{1+e^{-6+0.05\times 40+3.5}} = 0.3775 \tag{1}$$
(b) How many hours would the student need to study to have a 50% chance of getting an A in
the class?
Using the fact that when the probability is 0.5, the odds are 1:

$$e^{\beta_0+\beta_1 x_1+\beta_2 x_2} = 1 \tag{2}$$

Taking logarithms, $\beta_0+\beta_1 x_1+\beta_2 x_2 = 0$, so $-6 + 0.05\,x_1 + 3.5 = 0$ and $x_1 = 2.5/0.05 = 50$ hours of study.
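A quick numerical check of (a) and (b) (not part of the original solution), using the stated coefficients $\beta_0=-6$, $\beta_1=0.05$ (hours) and $\beta_2=1$ (GPA):

import numpy as np

beta0, beta1, beta2 = -6, 0.05, 1

# (a) probability of an A after 40 hours of study with a 3.5 GPA
z = beta0 + beta1 * 40 + beta2 * 3.5
print(np.exp(z) / (1 + np.exp(z)))       # ~0.3775

# (b) hours needed for a 50% chance: solve beta0 + beta1*h + beta2*3.5 = 0
print(-(beta0 + beta2 * 3.5) / beta1)    # 50.0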
Q2
(a) If the Bayes decision boundary is linear:
Training set: QDA might fit better due to its flexibility.
Test set: LDA will likely perform better, as it aligns with the true linear boundary.
(b) If the Bayes decision boundary is non-linear:
Training set: QDA will perform better.
Test set: QDA is expected to perform better given sufficient data.
(c) As the sample size n increases:
The test accuracy of QDA relative to LDA will likely improve due to reduced overfitting.
(d) True or False statement:
False. If the true boundary is linear, LDA will likely perform better on test data despite QDA's
flexibility; QDA's extra flexibility tends to cause overfitting, especially when the sample size is small.
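A small illustrative simulation of (a); the data-generating process, class means, and sample sizes below are assumptions, not part of the assignment. With a linear Bayes boundary and a modest training set, QDA tends to fit the training data at least as well as LDA, while LDA generalizes better:

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis as LDA,
                                            QuadraticDiscriminantAnalysis as QDA)

rng = np.random.default_rng(0)

def make_data(n):
    # two Gaussian classes with a shared covariance matrix -> the Bayes boundary is linear
    labels = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 2)) + labels[:, None] * 1.5
    return X, labels

X_tr, y_tr = make_data(50)        # small training set
X_te, y_te = make_data(10_000)    # large test set approximates the true test error

for model in (LDA(), QDA()):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          'train acc:', model.score(X_tr, y_tr),
          'test acc:', round(model.score(X_te, y_te), 3))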
Q3
In [ ]: import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data, confusion_table
from ISLP.models import (ModelSpec as MS, summarize)
In [ ]: weekly = load_data('Weekly')
In [ ]: # (a)
weekly.corr(numeric_only=True)
As one would expect, the correlations between the lagged return variables and today's return are
close to zero. The only substantial correlation is between Year and Volume. We investigate this
trend graphically below.
In [ ]: weekly.plot(y='Volume')
In [ ]: weekly.groupby('Year')['Volume'].mean().plot()
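The accuracy computation below uses objects (X, y, D, pred) from the logistic-regression fit in parts (b) and (c), which are not shown above. A minimal sketch of the assumed preceding steps, following the usual ISLP lab pattern (the exact cell contents are an assumption):

# (b)/(c) assumed setup: logistic regression of Direction on the five lag variables and Volume
allvars = weekly.columns.drop(['Today', 'Direction', 'Year'])
X = MS(allvars).fit_transform(weekly)      # design matrix including an 'intercept' column
D = weekly.Direction                       # 'Up'/'Down' labels
y = (D == 'Up')                            # binary response for the GLM

results = sm.GLM(y, X, family=sm.families.Binomial()).fit()
summarize(results)

# predicted labels on the full data set and the corresponding confusion table
pred = np.array(['Down'] * len(weekly))
pred[results.predict() > 0.5] = 'Up'
confusion_table(pred, D)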
In [ ]: # Accuracy
np.mean(pred==weekly.Direction)
Out[ ]: 0.5610651974288338
In weeks when the market goes up, the logistic regression is right most of the time: 557/(557+48) = 92.1%.
In weeks when the market goes down, it is wrong most of the time, being right only 54/(430+54) = 11.2% of the time.
In [ ]: # (d)
train = (weekly.Year <= 2008)                            # train on 1990-2008, test on 2009-2010
X_lag2 = X[['intercept', 'Lag2']]                        # Lag2 as the only predictor
X_train, X_test = X_lag2.loc[train], X_lag2.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]
L_train, L_test = D.loc[train], D.loc[~train]            # Direction labels for evaluation
In [ ]: res_lag2 = sm.GLM(y_train, X_train, family=sm.families.Binomial()).fit()
pred = np.array(['Down'] * len(L_test))
pred[res_lag2.predict(X_test) > 0.5] = 'Up'   # predict 'Up' when the fitted probability exceeds 0.5
confusion_table(pred, L_test)
In [ ]: np.mean(pred==L_test)
Out[ ]: 0.625
In [ ]: # (e)
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis as LDA,
    QuadraticDiscriminantAnalysis as QDA)
lda = LDA(store_covariance=True)
# drop the explicit intercept column: sklearn estimators fit their own intercept
X_train, X_test = [M.drop(columns=['intercept'])
                   for M in [X_train, X_test]]
res_lda = lda.fit(X_train, L_train)
pred = res_lda.predict(X_test)
confusion_table(pred,L_test)
Out[ ]: Truth      Down  Up
        Predicted
        Down          9   5
        Up           34  56
In [ ]: np.mean(pred==L_test)
Out[ ]: 0.625
In [ ]: # (f)
qda = QDA(store_covariance=True)
res_qda = qda.fit(X_train, L_train)
pred = res_qda.predict(X_test)
confusion_table(pred,L_test)
In [ ]: np.mean(pred==L_test)
Out[ ]: 0.5865384615384616
In [ ]: # (g)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
res_knn = knn.fit(X_train, L_train)
pred = res_knn.predict(X_test)
confusion_table(pred,L_test)
In [ ]: np.mean(pred==L_test)
Out[ ]: 0.49038461538461536
(h) Logistic regression and LDA methods provide the best test accuracy (0.625).
Q4
(a) What is the probability that the first bootstrap observation is not the $j$th observation from the original sample?

Since bootstrap sampling is done with replacement, the probability of picking any single observation is $\frac{1}{n}$. Therefore, the probability that the first bootstrap observation is not the $j$th observation is $1 - \frac{1}{n}$.
(b) What is the probability that the second bootstrap observation is not the $j$th observation from the original sample?

Since bootstrap sampling is done with replacement, the probability remains the same for the second observation. It is $1 - \frac{1}{n}$.
(c) Argue that the probability that the $j$th observation is not in the bootstrap sample is $\left(1 - \frac{1}{n}\right)^n$.

For the $j$th observation not to appear anywhere in a bootstrap sample of size $n$, the event from (a) has to happen $n$ times, independently for each draw. Hence the probability is $\left(1 - \frac{1}{n}\right)^n$.
(d) When $n = 5$, what is the probability that the $j$th observation is in the bootstrap sample?

The probability that the $j$th observation is in the bootstrap sample is the complement of the probability in (c):

$$1 - \left(1 - \frac{1}{5}\right)^5 = 0.6723$$
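As a quick check (not part of the original solution), the formula for n = 5 agrees with a small Monte Carlo simulation:

import numpy as np

n = 5
print(1 - (1 - 1/n)**n)                        # exact: 0.67232

# simulate bootstrap samples of size n and count how often observation j = 0 appears
rng = np.random.default_rng(0)
draws = rng.integers(0, n, size=(100_000, n))
print((draws == 0).any(axis=1).mean())         # close to 0.672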
Q5
(a) k-fold cross-validation:
Split the data into $k$ roughly equal-sized parts.
For each part in turn, hold it out as a validation set, fit the model on the remaining $k-1$ parts, and compute the error on the held-out part.
Repeat $k$ times, once for each part, and average the $k$ error estimates. A minimal code sketch follows this list.
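A minimal sketch of the procedure (the data, model, and k = 5 below are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x - 2 * x**2 + rng.normal(size=100)
X = x.reshape(-1, 1)

k = 5
fold_mse = []
for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    fit = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on the other k-1 parts
    resid = y[val_idx] - fit.predict(X[val_idx])               # evaluate on the held-out part
    fold_mse.append(np.mean(resid**2))
print(np.mean(fold_mse))   # average of the k fold errors = CV estimate of the test error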
Q6
In [ ]: # (a)
np.random.seed(1)
y = np.random.normal(size=100)
x = np.random.normal(size=100)
y = x - 2 * x**2 + np.random.normal(size=100) # n=100, p=2, Y = X - 2 X^2 + \epsilon
In [ ]: #(b)
from matplotlib import pyplot as plt
plt.scatter(x,y)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
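The loop in (d) below reuses H, M, and cv_error, which were defined in part (c) (not shown above). A sketch of the assumed setup, following the ISLP lab pattern; the random seed used in (c) is not shown, and it does not affect LOOCV in any case:

# (c) assumed setup: LOOCV errors for polynomial models of degree 1 to 4
from sklearn.model_selection import cross_validate
from ISLP.models import sklearn_sm

H = np.array(x)           # predictor values
M = sklearn_sm(sm.OLS)    # wrap statsmodels OLS so cross_validate can use it
cv_error = np.zeros(4)
for i, degree in enumerate(range(1, 5)):
    X = np.power.outer(H, np.arange(degree + 1))   # columns: 1, x, ..., x^degree
    M_CV = cross_validate(M, X, y, cv=len(H))      # cv = n -> leave-one-out CV
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error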
In [ ]: # (d)
np.random.seed(2023)
for i, degree in enumerate(range(1, 5)):
    X = np.power.outer(H, np.arange(degree + 1))   # polynomial design matrix: 1, x, ..., x^degree
    M_CV = cross_validate(M, X, y, cv=len(H))      # cv = n performs leave-one-out CV
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error
The results are identical to those in (c). LOOCV involves no random splitting: each observation is
left out exactly once, so the procedure does not depend on the random seed.
(e) The quadratic polynomial (Model ii) had the lowest LOOCV estimate of test error. This was
expected because it matches the true quadratic relationship between X and Y.
In [ ]: # (f)
X = sm.add_constant(np.column_stack([x, x**2, x**3, x**4]))   # degree-4 polynomial design matrix
res = sm.GLM(y, X, family=sm.families.Gaussian()).fit()       # Gaussian GLM = ordinary least squares
res.summary()