ISL Answers
Conceptual
1.
(a)
• Better: With an extremely large sample size and a small number of predictors, a flexible model can fit the data more closely without much risk of overfitting.
(b)
• Worse: With a very large number of predictors and few observations, a flexible model would likely overfit; flexible methods generally do better when large datasets are available.
(c)
• Better: Flexible methods perform better on non-linear datasets, as they have more degrees of freedom with which to approximate a non-linear relationship.
(d)
• Worse: A flexible model would likely overfit, fitting the noise in the error terms more closely than an inflexible method would. If the variance of the error terms is extremely high, the data points lie far from 𝑓 (the ideal function describing the data), so a flexible method would chase that noise rather than the underlying signal.
2.
(a)
(b)
(c)
• n = 52 (one observation per week of the year); p = 3: the % change in the US market, the % change in the UK market, and the % change in the German market.
3.
(a)
(b)
• Var(ε) is the irreducible error; the expected test MSE can never fall below it, so it appears as a horizontal line. Test MSE decreases to an optimum as increased flexibility gives a better fit, with further increases leading to overfitting. Training MSE keeps decreasing, since a more flexible method can fit the training data ever more closely. Variance increases with flexibility, as the method tends to overfit (fitting the training data too well and failing to generalize to test data). Bias generally falls as flexibility increases, since the method is better able to fit the data. The decomposition below ties these curves together.
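The behaviour of the curves follows from the bias-variance decomposition of the expected test MSE at a point x₀ (Equation 2.7 in ISL):

E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)

Flexibility trades the first two terms against each other, while Var(ε) sets a floor that no method can go below.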
4.
(a)
Classification methods are useful for applications where the outcomes are to be classified
into a category; this can be a binary or a multi-class classification. Some areas
where classification could be useful:
• Breast cancer prediction: Given a set of predictors such as a mammogram scan, age, family history,
lifestyle and other variables, and a response of Yes (has cancer) or No (does not have cancer), we can
train a model to predict whether a patient has breast cancer.
• Classifying species of plants: Given a set of images of a plant, a model can be trained that will
classify that plant into one of the trained species. This is a multi-class classification problem. The
response would be the species name and the predictors would be images of that species.
• Fraud detection: Classify whether a transaction is fraudulent, given data like the transaction amount,
location, purchased item or service, previous customer transactions etc. The response would be “Yes”
or “No”, and our aim is to make a prediction.
• Stock price: Classify whether a stock will go up or down in price the next day given a set of financial
data and news from the preceding week. The aim is to make a prediction.
(b)
Regression methods are useful when we have a quantitative response; that is, where we need
to predict a numerical value for our response. Some areas where regression could be useful
are:
• House price factors: Given a set of predictors such as location, house features, median income for
the area and so on and the house price as the response/target, we can train a model to infer the impact
of those variables on house prices.
• Salary: Predict the salary of an individual given their education, work history, skillsets and other
relevant data (age, sex, etc.). The response is the salary amount.
• Sales: Predict unit sales of a product given marketing data such as TV, Radio or Internet advert
expenditure, and use it to infer the importance of each advertising method. The response is the unit
sales of the product.
• Driving insurance premium: Given a set of variables such as the driver's history, age, type of
vehicle and expected yearly mileage, with the premium as the response, we can train a model to predict
the insurance premium for new customers.
(c)
Cluster analysis is useful in cases where we do not have a target response available, i.e. unsupervised
learning. We can aim to ascertain whether observations fall into distinct groups, or to understand
whether there are any underlying relationships between variables. Some areas where this can be useful
are:
• Tissue classification: Clustering can be used to separate different types of tissue in medical images.
This can be useful in identifying groups of tissue that are abnormal and need further study.
• Market research: Differentiate a group of people within a city into distinct market segments to
increase marketing effectiveness or identify new opportunities. Given data such as incomes, location,
age, sex, opinion polls and so on for a city, we can segment the city into different consumer areas.
• Image segmentation: Separate an image into different regions to make object recognition easier. For
example, segmenting image frames from a video camera in a car into ‘other vehicles’, ‘humans’, ‘road
signs’ and so on can help ADAS (Advanced driver-assistance systems) in vehicles make the correct
decision.
• Gaming market segmentation: Given a set of observations with variables such as age, location,
income, sex, hours spent gaming, gaming devices used and so on, we could use cluster analysis to
see whether these observations fall into distinct groups. If they do, the groupings could merit further
study; for example, one group could represent casual gamers, another hardcore gamers, and another
newer gamers (say, people over the age of 60).
5. & 6.
• Flexible methods work well when the underlying function is non-linear. The predictions in general
have a lower bias but can have a higher variance, as these models are more likely to overfit the data.
• Less flexible methods do not tend to overfit the data but can have a high bias when the underlying
function is non-linear. They also require fewer observations and parameters, particularly when the
underlying function is assumed to be linear. Flexible methods tend to require a larger number of
observations and parameters, and can lead to overfitting (higher variance).
• Flexible methods (non-parametric methods) are preferable when we make no assumptions about the
function to be estimated. Most real-life relationships are non-linear, so a non-parametric approach
is better suited to modelling them. Flexible models are by nature more complex and less interpretable
than their linear counterparts, so even though their predictions might be more accurate, we may not
be able to explain why a model makes the predictions it does (a black-box model).
• Less flexible methods (parametric) are useful if we assume or know that the underlying function is
linear. As a linear relationship is assumed, the model needs to estimate fewer parameters than a non-
parametric method. Additionally, these models are more interpretable, and so will be preferred when
we are interested in making inferences or in the interpretability of the results.
7. (a) The Euclidean distance is the straight-line distance between two points, computed via the
Pythagorean theorem: in three dimensions, d(a, b) = √((a₁ − b₁)² + (a₂ − b₂)² + (a₃ − b₃)²).
The distances from each observation to the test point (0, 0, 0) are:
d(1, test) = 3
d(2, test) = 2
d(3, test) = √10 ≈ 3.16
d(4, test) = √5 ≈ 2.24
d(5, test) = √2 ≈ 1.41
d(6, test) = √3 ≈ 1.73
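These distances can be verified in R. The snippet below types in the exercise's data table (the class labels are reused in the KNN check further down); because the test point is the origin, each distance is simply the square root of the row's sum of squares:
# Data table from the exercise: X1, X2, X3 and the class labels
train.X <- matrix(c( 0, 3, 0,
                     2, 0, 0,
                     0, 1, 3,
                     0, 1, 2,
                    -1, 0, 1,
                     1, 1, 1), ncol = 3, byrow = TRUE)
train.Y <- factor(c("Red", "Red", "Red", "Green", "Green", "Red"))
# Euclidean distance from each observation to the test point (0, 0, 0)
round(sqrt(rowSums(train.X^2)), 2)
## [1] 3.00 2.00 3.16 2.24 1.41 1.73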
(b)
(c)
• Red; the three nearest observations are green, red and red. The estimated probability of the test point
belonging to red is 2/3 and to green is 1/3, so the prediction is red.
(d)
• For a highly non-linear Bayes decision boundary, we would expect the best value of K to be small. Smaller
values of K result in a more flexible KNN classifier, which produces a decision boundary that can be
highly non-linear. A larger K averages over more data points, smoothing the decision boundary towards
a more linear shape.
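As a quick check of parts (b) and (c), the knn() function from the class package can be run on the same data, reusing train.X and train.Y from the snippet above:
library(class)  # provides knn()
test.X <- matrix(c(0, 0, 0), nrow = 1)
knn(train.X, test.X, train.Y, k = 1)  # Green: observation 5 is the single nearest neighbour
knn(train.X, test.X, train.Y, k = 3)  # Red: the three nearest are Green, Red, Red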
Applied
8.
library(ISLR)
#library(ggplot2)
(a) & (b)
The College data is loaded directly from the ISLR package, which already has the college names set as the row names, so the read.csv() and rownames assignment steps described in the exercise are not needed:
college.rownames = rownames(College)
(c) i.
summary(College)
ii.
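The code that produced the scatterplot matrix below was lost in extraction; per the exercise it was presumably along the lines of:
pairs(College[, 1:10])  # scatterplot matrix of the first ten columns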
[Figure: scatterplot matrix of the College variables, including Private, Apps, Accept, Enroll and Top10perc.]
iii.
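The plotting code was also lost; the boxplots below were presumably produced with something like:
# Side-by-side boxplots of Outstate versus Private (Private is a factor, so plot() draws boxplots)
plot(College$Private, College$Outstate, xlab = "Private", ylab = "Outstate")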
[Figure: side-by-side boxplots of Outstate versus Private.]
iv.
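The summary output below references a college.df data frame with an Elite column; the code that created it was lost in extraction, but it was presumably along the lines of the book's own snippet:
college.df <- College
Elite <- rep("No", nrow(college.df))
Elite[college.df$Top10perc > 50] <- "Yes"   # elite if >50% of new students came from the top 10% of their class
college.df$Elite <- as.factor(Elite)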
summary(college.df)
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate Elite
## Min. : 10.00 No :699
## 1st Qu.: 53.00 Yes: 78
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
[Figure: side-by-side boxplots of Outstate versus Elite.]
v.
par(mfrow=c(2,2))
hist(College$Apps, xlim=c(0,25000), xlab = "Applications", main = "Apps using default bin sizes")
hist(College$Apps, xlim=c(0,25000), breaks=25, xlab = "Applications",
main = "Apps using smaller bin sizes")
hist(College$Top10perc, breaks=25, xlab = "Pct. new students from top 10% of H.S. class",
main="Top10Perc")
hist(College$Outstate, xlab="Out-of-state tuition",ylab="Amount",main="Outstate")
[Figure: 2x2 grid of histograms titled "Apps using default bin sizes", "Apps using smaller bin sizes", "Top10Perc" and "Outstate".]
• The histogram of Apps (number of applications received) is highly right-skewed: most universities
received 5,000 or fewer applications, and the skew pulls the mean number of applications well above
the median (confirmed below).
• The histogram of Top10perc (percent of new students from the top 10% of their high-school class) is
also right-skewed: only a few universities draw the majority of their new students from this cohort.
mean(college.df$Apps)
## [1] 3001.638
median(college.df$Apps)
## [1] 1558
vi.
# Opening line of this plot() call reconstructed from the axis labels
plot(College$S.F.Ratio, College$Grad.Rate,
     xlab = "Student to Faculty Ratio", ylab = "Graduation Rate",
     main = "Plot of Grad.Rate vs S/F Ratio")
[Figure: scatterplot of Grad.Rate against S.F.Ratio.]
• The results suggest a negative linear relationship between the graduation rate of students and the
student to faculty ratio at universities.
• As the student to faculty ratio increases, we can expect students to have a lower graduation rate.
9.
(a)
(b)
sapply(Auto[,1:7], range)
(c)
sapply(Auto[,1:7], sd)
(d)
# Remove observations 10 through 85, as the exercise specifies
Auto.reduced = Auto[-(10:85), ]
sapply(Auto.reduced[,1:7], mean)
sapply(Auto.reduced[,1:7], sd)
(e)
pairs(Auto[,1:7])
[Figure: scatterplot matrix of mpg, cylinders, displacement, horsepower, weight, acceleration and year.]
cor(Auto[,1:7])
• From the pair plot and the correlation data, we can see that linear relationships exist between some
of the variables.
• For example, mpg has strong negative linear relationships with displacement, cylinders and weight.
That is, we can expect a car's mpg to decrease as its displacement, number of cylinders and weight increase.
• mpg has a positive correlation with year, suggesting that newer models tend to have higher mpg
than older ones.
(f)
• Both the plots and the correlation data suggest that mpg can be predicted from the other variables.
• An increase in displacement, cylinders or weight is associated with reduced mpg.
• Newer models (higher year) tend to have higher mpg.
10.
(a)
library(MASS)
?Boston
dim(Boston)
## [1] 506 14
• The data has 506 rows, one per Boston suburb, and 14 columns, one per housing-related variable.
(b)
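The plotting code for this scatterplot matrix was lost in extraction; given the variables that appear in the figure, it was presumably something like:
pairs(Boston[, c("crim", "nox", "dis", "tax", "medv")])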
[Figure: scatterplot matrix of crim, nox, dis, tax and medv.]
• crim seems to have a negative linear relationship with medv and dis.
• nox has a negative linear relationship with dis.
• dis has a positive linear relationship with medv.
(c)
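The call that produced the output below was also lost; its shape (a one-column matrix with the remaining variables as row names) is consistent with something like:
# Correlation of each of the other variables with the per-capita crime rate
cor(Boston[, -1], Boston$crim)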
## [,1]
## zn -0.20046922
## indus 0.40658341
## chas -0.05589158
## nox 0.42097171
## rm -0.21924670
## age 0.35273425
## dis -0.37967009
## rad 0.62550515
## tax 0.58276431
## ptratio 0.28994558
## black -0.38506394
## lstat 0.45562148
## medv -0.38830461
• There are some correlations between crim and other variables, but they are not as strong as some of
the relationships we observed in the Auto dataset.
• crim has a negative linear relationship with medv, dis and black.
• crim has a positive linear relationship with indus, nox, rad and tax.
(d)
# Suburbs with a crime rate more than 2 s.d. above the mean (higher than ~95% of suburbs)
High.Crime = Boston[which(Boston$crim > mean(Boston$crim) + 2*sd(Boston$crim)),]
nrow(High.Crime)
## [1] 16
range(Boston$crim) ; mean(Boston$crim) ; sd(Boston$crim)
## [1]  0.00632 88.97620
## [1] 3.613524
## [1] 8.601545
• There are 16 suburbs with a crime rate higher than 95% of the other suburbs.
• Some suburbs have extremely high rates of crime (5–8 s.d. above the mean).
• The range is very wide: it runs from nearly zero to about 89.
# Suburbs with tax rates more than 2 s.d. above the mean
High.Tax = Boston[which(Boston$tax > mean(Boston$tax) + 2*sd(Boston$tax)),]
range(Boston$tax)
## [1] 187 711
• There are no suburbs with a tax rate more than 2 s.d. above the mean. This seems reasonable, as
property tax rates are set administratively and do not vary as drastically as crime rates.
• The range is narrower than that of the crime rate.
• Some suburbs do have tax rates more than 1 s.d. above the mean.
# Suburbs with pupil-teacher ratio more than 2 s.d. above the mean
High.PT = Boston[which(Boston$ptratio > mean(Boston$ptratio) + 2*sd(Boston$ptratio)),]
range(Boston$ptratio)
## [1] 12.6 22.0
• There are no suburbs with a pupil-teacher ratio more than 2 s.d. above the mean, a reasonable outcome
as education regulations limit the number of students per teacher in a class or school.
• The range is quite narrow, and no pupil-teacher ratio is more than 2 s.d. above the mean.
• Some pupil-teacher ratios are more than 1 s.d. above the mean.
(e)
sum(Boston$chas==1)
## [1] 35
• 35 suburbs in this dataset bound the Charles River.
(f)
median(Boston$ptratio)
## [1] 19.05
(g)
which(Boston$medv == min(Boston$medv))
## [1] 399 406
• There are two suburbs (399 & 406) that have the lowest median property values.
Boston[399,]   # inspect one of the two lowest-medv suburbs
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9 30.59
## medv
## 399 5
range(Boston$lstat)
## [1]  1.73 37.97
range(Boston$ptratio)
## [1] 12.6 22.0
• crim is more than 2 s.d. above the mean, indicating very high crime rates in this suburb. Both ptratio
and lstat are close to their maximum values.
(h)
# More than 7 rooms
sum(Boston$rm > 7)
## [1] 64
# More than 8 rooms
sum(Boston$rm > 8)
## [1] 13
# Summary of the suburbs averaging more than 8 rooms per dwelling
summary(Boston[Boston$rm > 8, ])
## rad tax ptratio black
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :354.6
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:384.5
## Median : 7.000 Median :307.0 Median :17.40 Median :386.9
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :385.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:389.7
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :396.9
## lstat medv
## Min. :2.47 Min. :21.9
## 1st Qu.:3.32 1st Qu.:41.7
## Median :4.14 Median :48.3
## Mean :4.31 Mean :44.2
## 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :7.44 Max. :50.0
• Compared with the full dataset, these suburbs have relatively low crim and lstat and much higher medv (comparing the interquartile ranges).