Assignment Solution 2

Q1

(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of
3.5 gets an A in the class.
$$P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}} = \frac{e^{-6 + 0.05 \times 40 + 1 \times 3.5}}{1 + e^{-6 + 0.05 \times 40 + 1 \times 3.5}} = 0.3775 \tag{1}$$

(b) How many hours would the student need to study to have a 50% chance of getting an A in
the class?
Using the fact that when the probability is 0.5, the odds equal 1:

$$e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} = 1 \tag{2}$$

so that $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$. Rearranging and solving for $x_1$:

$$x_1 = \frac{-\beta_0 - \beta_2 \times 3.5}{\beta_1} = \frac{6 - 3.5}{0.05} = 50 \tag{3}$$
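As a quick numerical check (a minimal sketch, not part of the original solution, using the coefficients β0 = −6, β1 = 0.05, β2 = 1 given in the question):

import numpy as np

# Coefficients given in the question
b0, b1, b2 = -6, 0.05, 1

# (a) P(Y = 1) for 40 study hours and a GPA of 3.5
eta = b0 + b1 * 40 + b2 * 3.5
p = np.exp(eta) / (1 + np.exp(eta))
print(round(p, 4))                   # 0.3775

# (b) hours needed for a 50% chance: solve b0 + b1*x1 + b2*3.5 = 0
print((-b0 - b2 * 3.5) / b1)         # 50.0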

Q2
(a) If the Bayes decision boundary is linear:
Training set: QDA will likely fit better because of its greater flexibility.
Test set: LDA will likely perform better, since it matches the true linear boundary.
(b) If the Bayes decision boundary is non-linear:
Training set: QDA will perform better.
Test set: QDA is expected to perform better, given sufficient data.
(c) As sample size ( n ) increases:
The test accuracy of QDA relative to LDA will likely improve, because the extra variance from QDA's additional parameters becomes less of a problem with more data.
(d) True or False statement:
False. If the true boundary is linear, LDA will likely perform better on test data despite QDA's flexibility, because the more flexible QDA is prone to overfitting.

Q3
In [ ]: import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize)

In [ ]: weekly = load_data('Weekly')
In [ ]: # (a)
weekly.corr(numeric_only=True)

Out[ ]:
            Year      Lag1      Lag2      Lag3      Lag4      Lag5    Volume     Today
Year    1.000000 -0.032289 -0.033390 -0.030006 -0.031128 -0.030519  0.841942 -0.032460
Lag1   -0.032289  1.000000 -0.074853  0.058636 -0.071274 -0.008183 -0.064951 -0.075032
Lag2   -0.033390 -0.074853  1.000000 -0.075721  0.058382 -0.072499 -0.085513  0.059167
Lag3   -0.030006  0.058636 -0.075721  1.000000 -0.075396  0.060657 -0.069288 -0.071244
Lag4   -0.031128 -0.071274  0.058382 -0.075396  1.000000 -0.075675 -0.061075 -0.007826
Lag5   -0.030519 -0.008183 -0.072499  0.060657 -0.075675  1.000000 -0.058517  0.011013
Volume  0.841942 -0.064951 -0.085513 -0.069288 -0.061075 -0.058517  1.000000 -0.033078
Today  -0.032460 -0.075032  0.059167 -0.071244 -0.007826  0.011013 -0.033078  1.000000

As one would expect, the correlations between the lagged return variables and today's return are close to zero. The only substantial correlation is between Year and Volume. We investigate this trend graphically below.
In [ ]: weekly.plot(y='Volume')

Out[ ]: <Axes: >

In [ ]: weekly.groupby('Year')['Volume'].mean().plot()

Out[ ]: <Axes: xlabel='Year'>


In [ ]: # (b)
X = MS(weekly[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]).fit_transform(weekly)
y = weekly.Direction == 'Up'
glm = sm.GLM(y,X,family=sm.families.Binomial())
res = glm.fit()
res.summary()

Out[ ]: Generalized Linear Model Regression Results


Dep. Variable: Direction No. Observations: 1089
Model: GLM Df Residuals: 1082
Model Family: Binomial Df Model: 6
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -743.18
Date: Wed, 01 Nov 2023 Deviance: 1486.4
Time: 07:20:26 Pearson chi2: 1.09e+03
No. Iterations: 4 Pseudo R-squ. (CS): 0.009000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
intercept 0.2669 0.086 3.106 0.002 0.098 0.435
Lag1 -0.0413 0.026 -1.563 0.118 -0.093 0.010
Lag2 0.0584 0.027 2.175 0.030 0.006 0.111
Lag3 -0.0161 0.027 -0.602 0.547 -0.068 0.036
Lag4 -0.0278 0.026 -1.050 0.294 -0.080 0.024
Lag5 -0.0145 0.026 -0.549 0.583 -0.066 0.037
Volume -0.0227 0.037 -0.616 0.538 -0.095 0.050
Lag2 appears to be statistically significant, with Pr(>|z|) = 0.03.
In [ ]: # (c)
from ISLP import confusion_table
D = weekly.Direction
pred = np.array(['Down'] * len(weekly.Direction))
pred[res.predict(X)>0.5] = 'Up'
confusion_table(pred,D)

Out[ ]: Truth Down Up


Predicted
Down 54 48
Up 430 557

In [ ]: # Accuracy
np.mean(pred==weekly.Direction)

Out[ ]: 0.5610651974288338

In weeks when the market goes up, the logistic regression is right most of the time: 557/(557+48) = 92.1%.
In weeks when the market goes down, it is right only 54/(430+54) = 11.2% of the time.
In [ ]: # (d)
train = (weekly.Year<=2008)
X_lag2 = X[['intercept', 'Lag2']]
X_train, X_test = X_lag2.loc[train], X_lag2.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]
L_train , L_test = D.loc[train], D.loc[~train]

In [ ]: res_lag2 = sm.GLM(y_train,X_train,family=sm.families.Binomial()).fit()
pred = np.array(['Down'] * len(L_test))
pred[res_lag2.predict(X_test)>0.5] = 'Up'
confusion_table(pred,L_test)

Out[ ]: Truth Down Up


Predicted
Down 9 5
Up 34 56

In [ ]: np.mean(pred==L_test)

Out[ ]: 0.625

In [ ]: #(e)
from sklearn.discriminant_analysis import \
(LinearDiscriminantAnalysis as LDA,
QuadraticDiscriminantAnalysis as QDA)
lda = LDA(store_covariance=True)
X_train, X_test = [M.drop(columns=['intercept'])
for M in [X_train, X_test]]
res_lda = lda.fit(X_train, L_train)
pred = res_lda.predict(X_test)
confusion_table(pred,L_test)
Out[ ]: Truth Down Up
Predicted
Down 9 5
Up 34 56

In [ ]: np.mean(pred==L_test)

Out[ ]: 0.625

In [ ]: # (f)
qda = QDA(store_covariance=True)
res_qda = qda.fit(X_train, L_train)
pred = res_qda.predict(X_test)
confusion_table(pred,L_test)

Out[ ]: Truth Down Up


Predicted
Down 0 0
Up 43 61

In [ ]: np.mean(pred==L_test)

Out[ ]: 0.5865384615384616

In [ ]: # (g)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
res_knn = knn.fit(X_train, L_train)
pred = res_knn.predict(X_test)
confusion_table(pred,L_test)

Out[ ]: Truth Down Up


Predicted
Down 22 32
Up 21 29

In [ ]: np.mean(pred==L_test)

Out[ ]: 0.49038461538461536

(h) Logistic regression and LDA methods provide the best test accuracy (0.625).

Q4
(a) What is the probability that the first bootstrap observation is not the jth observation from the original sample?

The bootstrap sample is drawn with replacement from the original set of $n$ observations. The probability of picking any single observation is $1/n$. Therefore, the probability that the first bootstrap observation is not the jth observation is $1 - 1/n$.

(b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?

Since bootstrap sampling is done with replacement, the probability is the same as for the first observation: $1 - 1/n$.

(c) Argue that the probability that the jth observation is not in the bootstrap sample is $(1 - 1/n)^n$.

The probability that the jth observation is not chosen in a single draw is $1 - 1/n$. For the jth observation to be absent from the entire bootstrap sample of size $n$, this has to happen $n$ times in a row. Therefore, the probability is $(1 - 1/n)^n$.

(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?

The probability that the jth observation is in the bootstrap sample is the complement of the probability that it is not in the bootstrap sample:

$$1 - \left(1 - \tfrac{1}{5}\right)^5 = 0.6723$$
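A minimal simulation sketch (not part of the original solution) that verifies the n = 5 result empirically:

import numpy as np

rng = np.random.default_rng(0)
n, B = 5, 100_000
j = 0  # index of the observation we track

# Draw B bootstrap samples of size n with replacement and record
# the fraction of samples that contain observation j
contains_j = [j in rng.integers(0, n, size=n) for _ in range(B)]
print(np.mean(contains_j))        # simulated probability, ~0.672
print(1 - (1 - 1 / n) ** n)       # analytical value, 0.67232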

Q5
(a) k-fold cross-validation:
Split the data into k parts (folds) of roughly equal size.
Train on k − 1 parts and validate on the remaining part.
Repeat k times, holding out a different part each time.
Average the k validation results to obtain the performance estimate.
A minimal code sketch of this procedure is given below.
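As an illustration (a minimal sketch, not part of the original solution; the synthetic data and the LinearRegression estimator are arbitrary choices), 5-fold cross-validation with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = X.ravel() + rng.normal(size=100)

# Split into k = 5 folds, train on 4, validate on 1, repeat, and average
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring='neg_mean_squared_error')
print(-scores.mean())   # average validation MSE across the 5 folds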


(b) Relative to:
i. Validation set approach:
Advantages: Often lower variance.
Disadvantages: k-fold is more computationally intensive.
ii. LOOCV:
Advantages: k-fold is faster; better bias-variance trade-off.
Disadvantages: LOOCV uses more data for training; k-fold may have slightly higher bias.

Q6
In [ ]: # (a)
np.random.seed(1)
y = np.random.normal(size=100)
x = np.random.normal(size=100)
y = x - 2 * x**2 + np.random.normal(size=100) # n=100, p=2, Y = X - 2 X^2 + \epsilon

In [ ]: #(b)
from matplotlib import pyplot as plt
plt.scatter(x,y)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

The scatterplot shows a clear quadratic (inverted-U) relationship, with X ranging from about −2 to 2 and Y from about −8 to 2.


In [ ]: # (c)
from ISLP.models import sklearn_sm
from sklearn.model_selection import \
(cross_validate ,
KFold ,
ShuffleSplit)
cv_error = np.zeros(4)
H=np.array(x)
M = sklearn_sm(sm.OLS)
np.random.seed(1)
for i, degree in enumerate(range(1,5)):
X = np.power.outer(H, np.arange(degree+1))
M_CV = cross_validate(M, X, y, cv=len(H))
cv_error[i] = np.mean(M_CV['test_score'])
cv_error

Out[ ]: array([8.29221162, 1.01709581, 1.04655346, 1.05749267])

In [ ]: # (d)
np.random.seed(2023)
for i, degree in enumerate(range(1,5)):
X = np.power.outer(H, np.arange(degree+1))
M_CV = cross_validate(M, X, y, cv=len(H))
cv_error[i] = np.mean(M_CV['test_score'])
cv_error

Out[ ]: array([8.29221162, 1.01709581, 1.04655346, 1.05749267])

The results are identical to (c). LOOCV leaves out each observation exactly once, so there is no randomness in how the folds are formed and the random seed has no effect.
(e) The quadratic polynomial (Model ii) had the lowest LOOCV test error rate. This was expected
because it matches the true form of Y.
In [ ]: # (f)
X = sm.add_constant(np.column_stack([x,x**2,x**3,x**4]))
res = sm.GLM(y,X,family=sm.families.Gaussian()).fit()
res.summary()

Out[ ]: Generalized Linear Model Regression Results


Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 95
Model Family: Gaussian Df Model: 4
Link Function: Identity Scale: 0.99826
Method: IRLS Log-Likelihood: -139.24
Date: Wed, 01 Nov 2023 Deviance: 94.835
Time: 07:20:27 Pearson chi2: 94.8
No. Iterations: 3 Pseudo R-squ. (CS): 0.9993
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.0866 0.144 0.600 0.549 -0.196 0.370
x1 1.0834 0.189 5.724 0.000 0.712 1.454
x2 -2.2455 0.214 -10.505 0.000 -2.664 -1.827
x3 0.0436 0.058 0.755 0.451 -0.070 0.157
x4 0.0482 0.043 1.132 0.257 -0.035 0.132
The p-values show that only the linear and quadratic terms are statistically significant, which agrees with the CV results.
