Assignment Solution 2

Q1

(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of
3.5 gets an A in the class.
$$P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}} = \frac{e^{-6 + 0.05 \times 40 + 1 \times 3.5}}{1 + e^{-6 + 0.05 \times 40 + 1 \times 3.5}} = 0.3775 \tag{1}$$

(b) How many hours would the student need to study to have a 50% chance of getting an A in
the class?
Using the fact that when the probability is 0.5, the odds equal 1:

$$e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} = 1 \tag{2}$$

so that $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$. Rearranging and solving for $x_1$:

$$x_1 = \frac{-\beta_0 - \beta_2 \times 3.5}{\beta_1} = \frac{6 - 3.5}{0.05} = 50 \tag{3}$$
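As a quick numerical check (a minimal sketch, not part of the original solution, using the coefficients β0 = −6, β1 = 0.05, β2 = 1 given in the question):

import numpy as np

# Coefficients given in the question
b0, b1, b2 = -6, 0.05, 1

# (a) P(Y = 1) for 40 study hours and a GPA of 3.5
eta = b0 + b1 * 40 + b2 * 3.5
p = np.exp(eta) / (1 + np.exp(eta))
print(round(p, 4))                   # 0.3775

# (b) hours needed for a 50% chance: solve b0 + b1*x1 + b2*3.5 = 0
print((-b0 - b2 * 3.5) / b1)         # 50.0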

Q2
(a) If the Bayes decision boundary is linear:
Training set: QDA will likely fit better because of its greater flexibility.
Test set: LDA will likely perform better, since it matches the true linear boundary.
(b) If the Bayes decision boundary is non-linear:
Training set: QDA will perform better.
Test set: QDA is expected to perform better, given sufficient data.
(c) As sample size ( n ) increases:
The test accuracy of QDA relative to LDA will likely improve, because the extra variance from QDA's additional parameters becomes less of a problem with more data.
(d) True or False statement:
False. If the true boundary is linear, LDA will likely perform better on test data despite QDA's flexibility, because the more flexible QDA is prone to overfitting.

Q3
In [ ]: import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize)

In [ ]: weekly = load_data('Weekly')
In [ ]: # (a)
weekly.corr(numeric_only=True)

Out[ ]:
            Year      Lag1      Lag2      Lag3      Lag4      Lag5    Volume     Today
Year    1.000000 -0.032289 -0.033390 -0.030006 -0.031128 -0.030519  0.841942 -0.032460
Lag1   -0.032289  1.000000 -0.074853  0.058636 -0.071274 -0.008183 -0.064951 -0.075032
Lag2   -0.033390 -0.074853  1.000000 -0.075721  0.058382 -0.072499 -0.085513  0.059167
Lag3   -0.030006  0.058636 -0.075721  1.000000 -0.075396  0.060657 -0.069288 -0.071244
Lag4   -0.031128 -0.071274  0.058382 -0.075396  1.000000 -0.075675 -0.061075 -0.007826
Lag5   -0.030519 -0.008183 -0.072499  0.060657 -0.075675  1.000000 -0.058517  0.011013
Volume  0.841942 -0.064951 -0.085513 -0.069288 -0.061075 -0.058517  1.000000 -0.033078
Today  -0.032460 -0.075032  0.059167 -0.071244 -0.007826  0.011013 -0.033078  1.000000

As one would expect, the correlations between the lagged return variables and today's return are close to zero. The only substantial correlation is between Year and Volume. We investigate this trend graphically below.
In [ ]: weekly.plot(y='Volume')

Out[ ]: <Axes: >

In [ ]: weekly.groupby('Year')['Volume'].mean().plot()

Out[ ]: <Axes: xlabel='Year'>


In [ ]: # (b)
X = MS(weekly[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]).fit_transform(weekly)
y = weekly.Direction == 'Up'
glm = sm.GLM(y,X,family=sm.families.Binomial())
res = glm.fit()
res.summary()

Out[ ]: Generalized Linear Model Regression Results


Dep. Variable: Direction No. Observations: 1089
Model: GLM Df Residuals: 1082
Model Family: Binomial Df Model: 6
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -743.18
Date: Wed, 01 Nov 2023 Deviance: 1486.4
Time: 07:20:26 Pearson chi2: 1.09e+03
No. Iterations: 4 Pseudo R-squ. (CS): 0.009000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
intercept 0.2669 0.086 3.106 0.002 0.098 0.435
Lag1 -0.0413 0.026 -1.563 0.118 -0.093 0.010
Lag2 0.0584 0.027 2.175 0.030 0.006 0.111
Lag3 -0.0161 0.027 -0.602 0.547 -0.068 0.036
Lag4 -0.0278 0.026 -1.050 0.294 -0.080 0.024
Lag5 -0.0145 0.026 -0.549 0.583 -0.066 0.037
Volume -0.0227 0.037 -0.616 0.538 -0.095 0.050
Lag2 appears to be statistically significant, with Pr(>|z|) = 0.03.
In [ ]: # (c)
from ISLP import confusion_table
D = weekly.Direction
pred = np.array(['Down'] * len(weekly.Direction))
pred[res.predict(X)>0.5] = 'Up'
confusion_table(pred,D)

Out[ ]: Truth Down Up


Predicted
Down 54 48
Up 430 557

In [ ]: # Accuracy
np.mean(pred==weekly.Direction)

Out[ ]: 0.5610651974288338

In weeks when the market goes up, the logistic regression is right most of the time: 557/(557+48) = 92.1%.
In weeks when the market goes down, it is right only 54/(430+54) = 11.2% of the time.
In [ ]: # (d)
train = (weekly.Year<=2008)
X_lag2 = X[['intercept', 'Lag2']]
X_train, X_test = X_lag2.loc[train], X_lag2.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]
L_train , L_test = D.loc[train], D.loc[~train]

In [ ]: res_lag2 = sm.GLM(y_train,X_train,family=sm.families.Binomial()).fit()
pred = np.array(['Down'] * len(L_test))
pred[res_lag2.predict(X_test)>0.5] = 'Up'
confusion_table(pred,L_test)

Out[ ]: Truth Down Up


Predicted
Down 9 5
Up 34 56

In [ ]: np.mean(pred==L_test)

Out[ ]: 0.625

In [ ]: #(e)
from sklearn.discriminant_analysis import \
(LinearDiscriminantAnalysis as LDA,
QuadraticDiscriminantAnalysis as QDA)
lda = LDA(store_covariance=True)
X_train, X_test = [M.drop(columns=['intercept'])
for M in [X_train, X_test]]
res_lda = lda.fit(X_train, L_train)
pred = res_lda.predict(X_test)
confusion_table(pred,L_test)
Out[ ]: Truth Down Up
Predicted
Down 9 5
Up 34 56

In [ ]: np.mean(pred==L_test)

Out[ ]: 0.625

In [ ]: # (f)
qda = QDA(store_covariance=True)
res_qda = qda.fit(X_train, L_train)
pred = res_qda.predict(X_test)
confusion_table(pred,L_test)

Out[ ]: Truth Down Up


Predicted
Down 0 0
Up 43 61

In [ ]: np.mean(pred==L_test)

Out[ ]: 0.5865384615384616

In [ ]: # (g)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
res_knn = knn.fit(X_train, L_train)
pred = res_knn.predict(X_test)
confusion_table(pred,L_test)

Out[ ]: Truth Down Up


Predicted
Down 22 32
Up 21 29

In [ ]: np.mean(pred==L_test)

Out[ ]: 0.49038461538461536

(h) Logistic regression and LDA methods provide the best test accuracy (0.625).

Q4
(a) What is the probability that the first bootstrap observation is not the jth observation from the original sample?

The bootstrap sample is drawn with replacement from the original set of $n$ observations. The probability of picking any single observation is $1/n$. Therefore, the probability that the first bootstrap observation is not the jth observation is $1 - 1/n$.

(b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?

Since bootstrap sampling is done with replacement, the probability is the same as for the first observation: $1 - 1/n$.

(c) Argue that the probability that the jth observation is not in the bootstrap sample is $(1 - 1/n)^n$.

The probability that the jth observation is not chosen in a single draw is $1 - 1/n$. For the jth observation to be absent from the entire bootstrap sample of size $n$, this has to happen $n$ times in a row. Therefore, the probability is $(1 - 1/n)^n$.

(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?

The probability that the jth observation is in the bootstrap sample is the complement of the probability that it is not in the bootstrap sample:

$$1 - \left(1 - \tfrac{1}{5}\right)^5 = 0.6723$$
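A minimal simulation sketch (not part of the original solution) that verifies the n = 5 result empirically:

import numpy as np

rng = np.random.default_rng(0)
n, B = 5, 100_000
j = 0  # index of the observation we track

# Draw B bootstrap samples of size n with replacement and record
# the fraction of samples that contain observation j
contains_j = [j in rng.integers(0, n, size=n) for _ in range(B)]
print(np.mean(contains_j))        # simulated probability, ~0.672
print(1 - (1 - 1 / n) ** n)       # analytical value, 0.67232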

Q5
(a) k-fold cross-validation:
Split the data into k parts (folds) of roughly equal size.
Train on k − 1 parts and validate on the remaining part.
Repeat k times, holding out a different part each time.
Average the k validation results to obtain the performance estimate.
A minimal code sketch of this procedure is given below.
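As an illustration (a minimal sketch, not part of the original solution; the synthetic data and the LinearRegression estimator are arbitrary choices), 5-fold cross-validation with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = X.ravel() + rng.normal(size=100)

# Split into k = 5 folds, train on 4, validate on 1, repeat, and average
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring='neg_mean_squared_error')
print(-scores.mean())   # average validation MSE across the 5 folds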


(b) Relative to:
i. Validation set approach:
Advantages: Often lower variance.
Disadvantages: k-fold is more computationally intensive.
ii. LOOCV:
Advantages: k-fold is faster; better bias-variance trade-off.
Disadvantages: LOOCV uses more data for training; k-fold may have slightly higher bias.

Q6
In [ ]: # (a)
np.random.seed(1)
y = np.random.normal(size=100)
x = np.random.normal(size=100)
y = x - 2 * x**2 + np.random.normal(size=100) # n=100, p=2, Y = X - 2 X^2 + \epsilon

In [ ]: #(b)
from matplotlib import pyplot as plt
plt.scatter(x,y)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

The scatterplot shows a clear quadratic (inverted-U) relationship, with X ranging from about −2 to 2 and Y from about −8 to 2.


In [ ]: # (c)
from ISLP.models import sklearn_sm
from sklearn.model_selection import \
(cross_validate ,
KFold ,
ShuffleSplit)
cv_error = np.zeros(4)
H=np.array(x)
M = sklearn_sm(sm.OLS)
np.random.seed(1)
for i, degree in enumerate(range(1,5)):
X = np.power.outer(H, np.arange(degree+1))
M_CV = cross_validate(M, X, y, cv=len(H))
cv_error[i] = np.mean(M_CV['test_score'])
cv_error

Out[ ]: array([8.29221162, 1.01709581, 1.04655346, 1.05749267])

In [ ]: # (d)
np.random.seed(2023)
for i, degree in enumerate(range(1,5)):
X = np.power.outer(H, np.arange(degree+1))
M_CV = cross_validate(M, X, y, cv=len(H))
cv_error[i] = np.mean(M_CV['test_score'])
cv_error

Out[ ]: array([8.29221162, 1.01709581, 1.04655346, 1.05749267])

The results are identical to (c). LOOCV leaves out each observation exactly once, so there is no randomness in how the folds are formed and the random seed has no effect.
(e) The quadratic polynomial (Model ii) had the lowest LOOCV test error rate. This was expected
because it matches the true form of Y.
In [ ]: # (f)
X = sm.add_constant(np.column_stack([x,x**2,x**3,x**4]))
res = sm.GLM(y,X,family=sm.families.Gaussian()).fit()
res.summary()

Out[ ]: Generalized Linear Model Regression Results


Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 95
Model Family: Gaussian Df Model: 4
Link Function: Identity Scale: 0.99826
Method: IRLS Log-Likelihood: -139.24
Date: Wed, 01 Nov 2023 Deviance: 94.835
Time: 07:20:27 Pearson chi2: 94.8
No. Iterations: 3 Pseudo R-squ. (CS): 0.9993
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.0866 0.144 0.600 0.549 -0.196 0.370
x1 1.0834 0.189 5.724 0.000 0.712 1.454
x2 -2.2455 0.214 -10.505 0.000 -2.664 -1.827
x3 0.0436 0.058 0.755 0.451 -0.070 0.157
x4 0.0482 0.043 1.132 0.257 -0.035 0.132
The p-values show that only the linear and quadratic terms are statistically significant, which agrees with the CV results.
