
STAT 4540 Homework 1 Solution

1 ISLR 2.4.1
(a) We expect the performance of a flexible statistical learning method to be better. A more flexible
approach can fit the data more closely, and with an extremely large sample size this closer fit can be
achieved without overfitting, so it will outperform an inflexible approach.
(b) We expect the performance of a flexible statistical learning method to be worse. A flexible method
would overfit the small number of observations.
(c) We expect the performance of a flexible statistical learning method to be better. When the true
relationship is highly non-linear, an inflexible model cannot capture it; the extra degrees of freedom
of a flexible model allow a better fit.
(d) We expect the performance of a flexible statistical learning method to be worse. Flexible methods
will fit to the noise in the error terms and thus increase the variance.
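The four cases above can be illustrated with a small simulation (my own sketch, not part of the assignment): a high-df smoothing spline plays the flexible method and simple linear regression the inflexible one, with a non-linear truth as in (c).

```r
# Sketch: flexible (smoothing spline, df = 20) vs. inflexible (linear model)
# test MSE under a non-linear truth, for large and small n.
set.seed(1)
sim_mse <- function(n, df = 20) {
  x  <- runif(n)
  y  <- sin(2 * pi * x) + rnorm(n, sd = 0.3)       # training data
  xt <- runif(500)
  yt <- sin(2 * pi * xt) + rnorm(500, sd = 0.3)    # test data
  flex   <- smooth.spline(x, y, df = df)
  inflex <- lm(y ~ x)
  c(flexible   = mean((yt - predict(flex, xt)$y)^2),
    inflexible = mean((yt - predict(inflex, data.frame(x = xt)))^2))
}
sim_mse(n = 1000)  # large n, non-linear truth: the flexible fit wins (a, c)
sim_mse(n = 30)    # small n: the gap shrinks or reverses as the spline overfits (b)
```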

2 ISLR 2.4.4
(a)

• Response variable: health status (ill/healthy); predictors: age, blood pressure, gender, etc. The
goal is prediction.
• Response variable: outcome of a test (fail/pass); predictors: difficulty of the test, preparation
time, etc. The goal is prediction.
• Response variable: poll result (approve/against); predictors: socioeconomic status, education level,
age, etc. The goal is both inference and prediction.
(b)
• Response variable: stock market price; predictors: previous prices. The goal is prediction.
• Response variable: income; predictors: age, education level, gender, etc. The goal is both prediction
and inference.
• Response variable: working hours of a bulb; predictors: brand, price, type, etc. The goal is
prediction.
(c)

• Marketing survey.
• Movie rating.
• Symptoms of diseases.

3 ISLR 2.4.7
(a)

d(x1, x0) = sqrt(3^2) = 3
d(x2, x0) = sqrt(2^2) = 2
d(x3, x0) = sqrt(1^2 + 3^2) ≈ 3.2
d(x4, x0) = sqrt(1^2 + 2^2) ≈ 2.2
d(x5, x0) = sqrt(1^2 + 1^2) ≈ 1.4
d(x6, x0) = sqrt(1^2 + 1^2 + 1^2) ≈ 1.7.

(b) Our prediction is Green, since the single nearest neighbor is obs. 5, with Y = Green.
(c) Our prediction is Red, since the 3 nearest neighbors are obs. 5, 6, and 2, with corresponding
Y = Green, Red, Red.
(d) Small. A small K gives a flexible fit that can trace a highly non-linear decision boundary,
whereas a large K averages over many points and therefore produces a smoother, more nearly linear
boundary.
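The hand computation above can be checked in a few lines of R, using the six observations from the ISLR 2.4.7 table (test point x0 at the origin):

```r
# The six training observations from the ISLR 2.4.7 table.
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1,  0, 1),
  X3 = c(0, 0, 3, 2,  1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
x0 <- c(0, 0, 0)

# Euclidean distance from each observation to x0
d <- sqrt(colSums((t(as.matrix(train[, 1:3])) - x0)^2))
round(d, 2)
# 3.00 2.00 3.16 2.24 1.41 1.73

# K = 1: the nearest neighbor is obs 5, so predict Green
train$Y[order(d)[1]]

# K = 3: neighbors are obs 5, 6, 2 (Green, Red, Red), so majority vote is Red
names(which.max(table(train$Y[order(d)[1:3]])))
```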

4 ISLR 2.4.8
R code and output:
##(a)
college <- read.csv("College.csv", header = TRUE)

##(b)
rownames(college) = college[,1]
fix(college)

college=college[,-1]
fix(college)

##(c)
#(i)
summary(college)
Private        Apps           Accept          Enroll       Top10perc       Top25perc      F.Undergrad
No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00   Min.   :  9.0   Min.   :  139
Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992
          Median : 1558   Median : 1110   Median : 434   Median :23.00   Median : 54.0   Median : 1707
          Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56   Mean   : 55.8   Mean   : 3700
          3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005
          Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00   Max.   :100.0   Max.   :31643
P.Undergrad Outstate Room.Board Books Personal PhD
Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
Median : 353.0 Median : 9990 Median :4200 Median : 500.0 Median :1200 Median : 75.00
Mean : 855.3 Mean :10441 Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186 Min. : 10.00
1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
Median : 82.0 Median :13.60 Median :21.00 Median : 8377 Median : 65.00
Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660 Mean : 65.46
3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233 Max. :118.00

#(ii)
pairs(college[,1:10])

#(iii)
# Private is read in as a character column under R >= 4.0, so convert it
# to a factor to get side-by-side boxplots.
plot(as.factor(college$Private), college$Outstate)

#(iv)
Elite = rep("No", nrow(college))
Elite[college$Top10perc>50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)

summary(college$Elite)
# No Yes
# 699 78
plot(college$Elite, college$Outstate)

# (v)
par(mfrow=c(2,2))
hist(college$Apps)
hist(college$perc.alumni, col=2)
hist(college$S.F.Ratio, col=3, breaks=10)
hist(college$Expend, breaks=100)

# (vi)
par(mfrow=c(1,2))
plot(college$Outstate, college$Grad.Rate)
# High tuition correlates with high graduation rate.
plot(college$Top10perc, college$Grad.Rate)
# Colleges with the most students from the top 10% of their high school
# class don't necessarily have the highest graduation rate.

5 MovieLens Data
R code:
dat <- read.table("u.data", sep = "\t")
colnames(dat) <- c("usrid", "movid", "rating", "timestamp")

dat$time <- as.POSIXct(dat$timestamp, origin="1970-01-01", tz="UTC")

rating <- dat[ c(1:3, 5)]

# Each line of u.item is one record with 24 pipe-delimited fields;
# read whole lines, then split on "|".
dat <- scan("u.item", what = character(), sep = "\n", encoding = "UTF-8")
movdf <- matrix(NA_character_, length(dat), 24)
for (ii in seq_along(dat)) {
  movdf[ii, ] <- strsplit(dat[ii], split = "\\|")[[1]]
}

colnames(movdf) <- c("movid", "title", "reldate", "vidreldate", "URL",
                     "unknown", "Action", "Adventure", "Animation",
                     "Children", "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                     "FilmNoir", "Horror", "Musical", "Mystery", "Romance",
                     "SciFi", "Thriller", "War", "Western")

movie <- matrix(as.numeric(movdf[ , c(1, 6:24)]), nrow = nrow(movdf), ncol = length(c(1, 6:24)))
colnames(movie) <- colnames(movdf)[c(1, 6:24)]
head(movie)

action <- rowSums(movie[,c("Action", "Adventure", "Fantasy", "Horror", "SciFi", "Thriller")])
children <- rowSums(movie[,c("Animation", "Children")])
comedy <- rowSums(movie[,c("Comedy"), drop=FALSE])
drama <- rowSums(movie[,c("Crime", "Documentary", "Drama", "FilmNoir",
"Musical", "Mystery", "Romance", "War", "Western")])

genre <- cbind(action, children, comedy, drama)

# Expand to one row per rating record
genre <- genre[rating$movid, , drop = FALSE]

logit <- function(p) log(p / (1 - p))

# Proportion of each movie's ratings that are above 3, on the logit scale.
pop1 <- aggregate(rating$rating > 3, by = list(rating$movid), sum)
pop2 <- aggregate(rating$rating > 0, by = list(rating$movid), sum)
pop <- logit(pop1[, 2] / pop2[, 2])
head(pop, n = 5)
# 0.8962438 -0.4502010 -0.4989912 0.3381129 -0.1865860
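As a self-contained sanity check of this popularity transform (toy ratings I made up, not the real u.data file), note that a movie with no rating above 3 maps to -Inf on the logit scale:

```r
logit <- function(p) log(p / (1 - p))

# Toy data: movie 1 has ratings {4, 5, 2}, movie 2 has ratings {1, 3}.
toy <- data.frame(movid  = c(1, 1, 1, 2, 2),
                  rating = c(4, 5, 2, 1, 3))
n_pos <- aggregate(toy$rating > 3, by = list(movid = toy$movid), sum)
n_tot <- aggregate(toy$rating > 0, by = list(movid = toy$movid), sum)
logit(n_pos$x / n_tot$x)
# movie 1: logit(2/3) = log(2) ~ 0.69; movie 2: logit(0) = -Inf
```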

popular <- pop[rating$movid]


x <- cbind(1, genre, popular)
y <- rating$rating

head(x, n = 5)
#        action children comedy drama    popular
# [1,] 1      0        2      1     0  1.1564319
# [2,] 1      3        0      0     0  1.4160205
# [3,] 1      1        0      0     0 -2.4849066
# [4,] 1      1        0      1     1  0.2231436
# [5,] 1      1        0      0     2  0.4519851

head(y, n=5)
# 3 3 1 2 1
