Test Your Knowledge of Linear Regression and PCA in R

Uploaded by

Chong Jun Wei

The Analytics Edge FALL 2020

Test your knowledge of Linear Regression and PCA in R

Exercise: Week 2

1. This question involves the use of simple linear regression on the Auto dataset. This dataset
was taken from the StatLib library which is maintained at Carnegie Mellon University. The
dataset has the following fields:

• mpg: miles per gallon
• cylinders: number of cylinders
• displacement: engine displacement (cu. inches)
• horsepower: engine horsepower
• acceleration: time to accelerate from 0 to 60 mph (sec.)
• year: model year (modulo 100)
• origin: origin of car (1 = American, 2 = European, 3 = Japanese)
• name: vehicle name

(a) Perform a simple linear regression with mpg as the response and horsepower as the
predictor. Comment on why you need to change the horsepower variable before performing
the regression.
(b) Comment on the output by answering the following questions:
• Is there a strong relationship between the predictor and the response?
• Is the relationship between the predictor and the response positive or negative?
(c) What is the predicted mpg associated with a horsepower of 98? What is the associated
99% confidence interval?
Hint: You can check the predict.lm function on how the confidence interval can be
computed for predictions with R.
(d) Compute the correlation between the response and the predictor variable. How does this
compare with the R² value?
(e) Plot the response and the predictor. Also plot the least squares regression line.
(f) First install the package ggfortify which aids plotting linear models with ggplot2. Use
the following two commands in R to produce diagnostic plots of the linear regression fit:
> library(ggfortify)
> autoplot(your model name)
Comment on the Residuals versus Fitted plot and the Normal Q-Q plot and on any
problems you might see with the fit.
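The steps in question 1 can be sketched as follows. This is only an illustration on a small made-up data frame, `cars_demo` (a hypothetical name with simulated values, not the real Auto data). It assumes one common reason a numeric column needs changing: the column was read as text because a missing value is coded as "?".

```r
# Hypothetical toy data: horsepower arrives as text because "?" marks a missing value.
set.seed(42)
hp_true <- round(runif(60, 50, 200))
cars_demo <- data.frame(
  mpg = 40 - 0.15 * hp_true + rnorm(60, sd = 3),
  horsepower = as.character(hp_true),
  stringsAsFactors = FALSE
)
cars_demo$horsepower[c(5, 17)] <- "?"

# Convert to numeric; "?" becomes NA (suppressWarnings hides the coercion warning).
cars_demo$horsepower <- suppressWarnings(as.numeric(cars_demo$horsepower))

# With R's default settings, lm() drops rows that contain NA.
fit <- lm(mpg ~ horsepower, data = cars_demo)
print(summary(fit))

# Predicted mean response at horsepower = 98 with a 99% confidence interval.
pred <- predict(fit, newdata = data.frame(horsepower = 98),
                interval = "confidence", level = 0.99)
print(pred)

# In simple linear regression, the squared correlation equals R-squared.
r <- cor(cars_demo$mpg, cars_demo$horsepower, use = "complete.obs")
print(c(r_squared_from_cor = r^2, r_squared_from_lm = summary(fit)$r.squared))

# Scatterplot with the least squares line.
plot(cars_demo$horsepower, cars_demo$mpg, xlab = "horsepower", ylab = "mpg")
abline(fit, col = "red")
```

Passing `interval = "prediction"` instead would give the wider interval for a single new observation rather than for the mean response.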

2. This question involves the use of multiple linear regression on the Auto dataset building on
question 1.

(a) Produce a scatterplot matrix which includes all the variables in the dataset.
(b) Compute a matrix of correlations between the variables using the function cor(). You
need to exclude the name variable which is qualitative.
(c) Perform a multiple linear regression with mpg as the response and all other variables except
name as the predictors. Comment on the output by answering the following questions:
• Is there a strong relationship between the predictors and the response?
• Which predictors appear to have a statistically significant relationship to the
response?
• What does the coefficient for the year variable suggest?
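A minimal sketch of the same steps, using the built-in `mtcars` data as a stand-in (an assumption made so the code runs without extra files; it is not the Auto dataset, and the chosen columns are illustrative).

```r
# Stand-in data: the built-in mtcars dataset (not the Auto data from the exercise).
vars <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]

# Scatterplot matrix of all selected variables.
pairs(vars)

# Correlation matrix; cor() needs numeric columns only.
print(round(cor(vars), 2))

# Multiple regression of mpg on all remaining columns ("." means "everything else").
fit_all <- lm(mpg ~ ., data = vars)
print(summary(fit_all))

# Coefficient table: estimates, standard errors, t statistics, p-values.
coef_table <- coef(summary(fit_all))

# Rows whose p-value falls below 0.05.
print(coef_table[coef_table[, "Pr(>|t|)"] < 0.05, , drop = FALSE])
```

With the Auto data, a qualitative column such as name would have to be dropped before calling `cor()`, as part (b) notes.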

3. This problem focuses on the multicollinearity problem with simulated data.

(a) Perform the following commands in R:

> set.seed(1)
> x1 <- runif(100)
> x2 <- 0.5*x1 + rnorm(100)/10
> y <- 2 + 2*x1 + 0.3*x2 + rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and
x2. Write out the form of the linear model. What are the regression coefficients?
(b) What is the correlation between x1 and x2? Create a scatterplot displaying the
relationship between the variables.
(c) Using the data, fit a least squares regression to predict y using x1 and x2.
• What are the estimates β̂0, β̂1 and β̂2? How do these relate to the
true β0, β1 and β2?
• Can you reject the null hypothesis H0: β1 = 0?
• How about the null hypothesis H0: β2 = 0?
(d) Now fit a least squares regression to predict y using only x1.
• How does the estimated β̂1 relate to the true β1?
• Can you reject the null hypothesis H0: β1 = 0?
(e) Now fit a least squares regression to predict y using only x2.
• How does the estimated β̂2 relate to the true β2?
• Can you reject the null hypothesis H0: β2 = 0?
(f) Provide an explanation of the results in parts (c)-(e).
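One way to make the effect of correlated predictors visible is to compare the standard error of the x1 coefficient with and without x2 in the model. The sketch below reuses the simulation from part (a). The variance inflation factor it computes is an added diagnostic that the exercise itself does not ask for.

```r
# Same simulation as in part (a).
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

fit_both <- lm(y ~ x1 + x2)
fit_x1 <- lm(y ~ x1)

# Standard error of the x1 coefficient with and without x2 in the model.
se_both <- coef(summary(fit_both))["x1", "Std. Error"]
se_x1 <- coef(summary(fit_x1))["x1", "Std. Error"]

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing x1 on x2.
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)

print(c(cor_x1_x2 = cor(x1, x2), se_both = se_both,
        se_x1_only = se_x1, vif_x1 = vif_x1))
```

With only two predictors, the variance inflation factor equals 1 / (1 - r²), where r is the correlation between x1 and x2.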

4. This problem involves the Boston dataset. This data was part of an important paper in 1978
by Harrison and Rubinfeld titled “Hedonic housing prices and the demand for clean
air” published in the Journal of Environmental Economics and Management 5(1): 81-102.
The dataset has the following fields:

• crim: per capita crime rate by town
• zn: proportion of residential land zoned for lots over 25,000 sq. ft.
• indus: proportion of non-retail business acres per town
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• nox: nitrogen oxides concentration (parts per 10 million)
• rm: average number of rooms per dwelling
• age: proportion of owner-occupied units built prior to 1940
• dis: weighted mean of distances to five Boston employment centres
• rad: index of accessibility to radial highways
• tax: full-value property-tax rate per $10,000
• ptratio: pupil-teacher ratio by town
• black: 1000(Bk − 0.63)² where Bk is the proportion of black residents by town
• lstat: lower status of the population (percent)
• medv: median value of owner-occupied homes in $1000s

We will try to predict the median house value using thirteen predictors.

(a) For each predictor, fit a simple linear regression model using a single variable to predict
the response. In which of these models is there a statistically significant relationship
between the predictor and the response? Plot the figure of relationship between medv and
lstat as an example to validate your finding.
(b) Fit a multiple linear regression model to predict the response using all the predictors.
Compare the adjusted R² from this model with those of the simple regression models. For which
predictors can we reject the null hypothesis H0: βj = 0?
(c) Create a plot displaying the univariate regression coefficients from (a) on the X-axis and
the multiple regression coefficients from (b) on the Y-axis. That is, each predictor is
displayed as a single point in the plot. Comment on this plot.
(d) In this question, we will check whether there is evidence of a non-linear association between the
lstat predictor variable and the response. To answer the question, fit a model of the
form
medv = β0 + β1 lstat + β2 lstat² + ε.

You can make use of the poly() function in R. Does this help improve the fit? Add higher
degree polynomial fits. What is the degree of the polynomial fit beyond which the terms
no longer remain significant?
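The mechanics of parts (a) to (d) can be sketched as follows, with the built-in `mtcars` data standing in for Boston and `mpg` standing in for `medv` (both are assumptions for illustration only, as is the choice of predictors).

```r
# Stand-in data: built-in mtcars, with mpg as the response (not the Boston data).
dat <- mtcars[, c("mpg", "disp", "hp", "wt", "drat")]
predictors <- setdiff(names(dat), "mpg")

# One simple regression per predictor; keep the slope and its p-value.
uni <- t(sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "mpg"), data = dat)
  coef(summary(fit))[p, c("Estimate", "Pr(>|t|)")]
}))
print(uni)

# Multiple regression on all predictors at once.
multi <- coef(lm(mpg ~ ., data = dat))[predictors]

# Each predictor is one point: univariate slope (x) against multiple-regression slope (y).
plot(uni[, "Estimate"], multi, xlab = "Univariate coefficient",
     ylab = "Multiple regression coefficient")
text(uni[, "Estimate"], multi, labels = predictors, pos = 3)

# Polynomial fits of increasing degree; anova() tests each added degree.
fit1 <- lm(mpg ~ hp, data = dat)
fit2 <- lm(mpg ~ poly(hp, 2), data = dat)
fit3 <- lm(mpg ~ poly(hp, 3), data = dat)
print(anova(fit1, fit2, fit3))
```

By default `poly()` builds orthogonal polynomial terms, so the individual coefficients differ from those of a fit on raw powers (`raw = TRUE`), although the fitted values are the same.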

5. Orley Ashenfelter in his paper “Predicting the Quality and Price of Bordeaux Wines”
published in The Economic Journal showed that the variability in the prices of Bordeaux wines
is predicted well by the weather that created the grapes. In this question, you will validate
how these results translate to a dataset for wines produced in Australia. The data is provided
in the file winedata.csv. The dataset contains the following variables:

• vintage: year the wine was made
• price91: 1991 auction prices for the wine in dollars
• price92: 1992 auction prices for the wine in dollars
• temp: average temperature during the growing season in degrees Celsius
• hrain: total harvest rain in mm
• wrain: total winter rain in mm
• tempdiff: sum of the difference between the maximum and minimum temperatures
during the growing season in degrees Celsius

(a) Define two new variables age91 and age92 that capture the age of the wine (in years) at
the time of the auctions. For example, a 1961 wine would have an age of 30 at the auction
in 1991. What is the average price of wines that were 15 years or older at the time of the
1991 auction?
(b) What is the average price of the wines in the 1991 auction that were produced in years
when both the harvest rain was below average and the temperature difference was below
average?
(c) In this question, you will develop a simple linear regression model to fit the log of the
price at which the wine was auctioned in 1991 with the age of the wine. To fit the model,
use a training set with data for the wines up to (and including) the year 1981. What is
the R-squared for this model?
(d) Find the 99% confidence interval for the estimated coefficients from the regression.
(e) Use the model to predict the log of prices for wines made from 1982 onwards and auctioned
in 1991. What is the test R-squared?
(f) Which among the following options describes best the quality of fit of the model for
this dataset in comparison with the Bordeaux wine dataset that was analyzed by Orley
Ashenfelter?
• The result indicates that the variation of the prices of the wines in this dataset is
explained much less by the age of the wine in comparison to Bordeaux wines.
• The result indicates that the variation of the prices of the wines in this dataset is
explained much more by the age of the wine in comparison to Bordeaux wines.
• The age of the wine has no predictive power on the wine prices in both the datasets.
(g) Construct a multiple regression model to fit the log of the price at which the wine was
auctioned in 1991 with all the possible predictors (age91, temp, hrain, wrain, tempdiff)
in the training dataset. To fit your model, use the data for wines made up to (and
including) the year 1981. What is the R-squared for the model?
(h) Is this model preferred to the model with only the age variable as a predictor (use the
adjusted R-squared for the model to decide on this)?
(i) Which among the following best describes the output from the fitted model?
• The result indicates that the lower the temperature, the better the price and quality of
the wine.
• The result indicates that the greater the temperature difference, the better the price
and quality of the wine.
• The result indicates that the lower the harvest rain, the better the price and quality of
the wine.
• The result indicates that winter rain is a very important variable in the fit of the data.
(j) Of the five variables (age91, temp, hrain, wrain, tempdiff), drop the two variables
that are the least significant from the results in (g). Rerun the linear regression and write
down your fitted model.
(k) Is this model preferred to the model with all variables as predictors (use the adjusted
R-squared in the training set to decide on this)?
(l) Using the variables identified in (j), construct a multiple regression model to fit the log
of the price at which the wine was auctioned in 1992 (remember to use age92 instead of
age91). To fit your model, use the data for wines made up to (and including) the year
1981. What is the R-squared for the model?
(m) Suppose that in this application we test statistical significance at the 0.2
level. Would you reject the hypothesis that the coefficient for the variable hrain is zero?
(n) By separately estimating the equations for the wine prices for each auction, we can better
establish the credibility of the explanatory variables because:
• We have more data to fit our models with.
• The effect of the weather variables and age of the wine (sign of the estimated
coefficients) can be checked for consistency across years.
• 1991 and 1992 are the years in which Australian wines were traded heavily.
Select the best option.
(o) The current fit of the linear regression using the weather variables drops all observations
where any of the entries are missing. Provide a short explanation on when this might not
be a reasonable approach to use.
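A minimal sketch of the train/test workflow in parts (c) to (e). The file winedata.csv is not reproduced here, so the code assumes a simulated data frame, `wine`, with hypothetical values. It also assumes the test R-squared is defined as 1 - SSE/SST with SST measured around the training-set mean, which is one common convention and is not stated in the exercise.

```r
# Hypothetical stand-in for winedata.csv: simulated vintages and prices.
set.seed(7)
wine <- data.frame(vintage = 1952:1990)
wine$age91 <- 1991 - wine$vintage
wine$price91 <- exp(2 + 0.04 * wine$age91 + rnorm(nrow(wine), sd = 0.3))

# Training set: vintages up to and including 1981; test set: 1982 onwards.
train <- subset(wine, vintage <= 1981)
test <- subset(wine, vintage >= 1982)

fit <- lm(log(price91) ~ age91, data = train)
print(summary(fit)$r.squared)       # training R-squared
print(confint(fit, level = 0.99))   # 99% confidence intervals for the coefficients

# Test R-squared: 1 - SSE / SST, with SST measured around the training-set mean
# (an assumed convention, see the note above).
pred <- predict(fit, newdata = test)
sse <- sum((log(test$price91) - pred)^2)
sst <- sum((log(test$price91) - mean(log(train$price91)))^2)
test_r2 <- 1 - sse / sst
print(test_r2)
```

Unlike the training R-squared, a test R-squared computed this way can be negative when the model predicts worse than the baseline mean.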

6. This question involves the use of principal component analysis on the well-known iris dataset.
The dataset is available in R.

(a) How many observations are there in the dataset? What are the different fields/attributes
in the data set?
(b) Create a new dataset iris.data by removing the Species column and store its content
as iris.sp.
(c) Compare the various pairs of features using a pairwise scatterplot and find the correlation
coefficients between the features. Which features seem to be highly correlated?
(d) Conduct a principal component analysis on iris.data without standardizing the data.
You may use prcomp(..., scale=F).
(i) How many principal components are required to explain at least 90% of the
variability in the data? Plot the cumulative percentage of variance explained by the
principal components to answer this question.
(ii) Plot the data along the first two principal components and color the different
species separately. Does the first principal component create enough separation
among the different species? To plot, you may use the function fviz_pca_ind or
fviz_pca_biplot in library(factoextra). Alternatively, you may use biplot or
construct a plot using ggplot2 as well.
(e) Do the same exercise as in (d) above, now after standardizing the dataset. Comment on
any differences you observe.
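The steps in question 6 can be sketched in base R alone, without factoextra. The helper `cum_var` is a name introduced here for illustration.

```r
# Split iris into the numeric measurements and the species labels.
iris.data <- iris[, names(iris) != "Species"]
iris.sp <- iris$Species

# PCA without and with standardization.
pca_raw <- prcomp(iris.data, scale. = FALSE)
pca_std <- prcomp(iris.data, scale. = TRUE)

# Cumulative proportion of variance explained by the components.
cum_var <- function(pca) cumsum(pca$sdev^2) / sum(pca$sdev^2)
cum_raw <- cum_var(pca_raw)
cum_std <- cum_var(pca_std)
print(rbind(raw = cum_raw, standardized = cum_std))

# Cumulative variance plot with a reference line at 90%.
plot(cum_raw, type = "b", ylim = c(0, 1), xlab = "Number of components",
     ylab = "Cumulative proportion of variance")
lines(cum_std, type = "b", lty = 2)
abline(h = 0.9, col = "grey")

# Scores on the first two components, coloured by species.
plot(pca_raw$x[, 1:2], col = as.integer(iris.sp), pch = 19)
legend("topright", legend = levels(iris.sp), col = 1:3, pch = 19)
```

The argument is spelled `scale.` in `prcomp()`. The shorter `scale=F` from the exercise also works because R matches the argument name partially.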

7. This problem involves the dataset wine italy.csv which was obtained from the University
of California, Irvine (UCI) Machine Learning Repository. These data are the results of a chemical analysis of
wines grown in the same region in Italy but derived from three different cultivars. The analysis
determined the quantities of 13 constituents found in each of the three types of wines. The
first column identifies the cultivars and the next thirteen are the attributes given by:

• alcohol: Alcohol
• malic: Malic acid
• ash: Ash
• alkalin: Alkalinity of ash
• mag: Magnesium
• phenols: Total phenols
• flavanoids: Flavanoids
• nonflavanoids: Nonflavanoid phenols
• proanth: Proanthocyanins
• color: Color Intensity
• hue: Hue
• od280: OD280/OD315 of diluted wines
• proline: Proline

(a) Check the relationship between the variables by creating a pair-wise scatterplot of the
thirteen attributes.
(b) Conduct a principal component analysis on the standardized data. What proportion of
the total variance is explained by the first two components?
(c) Plot the data along the first two principal components and color the different cultivars
separately. Also plot the loadings of the different components to show the importance of
the different attributes on the first two principal components.
(i) Which two key attributes differentiate Cultivar 2 from the other two cultivars?
(ii) Which two key attributes differentiate Cultivar 3 from the other two cultivars?
(d) Use an appropriate plot to find the number of attributes required to explain at least 80%
of the total variation in the data.
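A minimal sketch of reading loadings from a standardized PCA. The wine file is not reproduced here, so the built-in iris measurements stand in for the thirteen attributes and `Species` stands in for the cultivar label (both are assumptions for illustration only).

```r
# Stand-in data: iris measurements play the role of the wine attributes,
# and Species plays the role of the cultivar label.
X <- iris[, 1:4]
group <- iris$Species

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and by the first two together.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(sum(prop_var[1:2]))

# Loadings: how strongly each attribute contributes to PC1 and PC2,
# sorted by the size of the PC1 loading.
loadings <- pca$rotation[, 1:2]
print(loadings[order(abs(loadings[, "PC1"]), decreasing = TRUE), ])

# Scores (labelled by group number) and loadings (arrows) in a single display.
biplot(pca, xlabs = as.integer(group), cex = 0.7)

# Smallest number of components explaining at least 80% of the variance.
print(which(cumsum(prop_var) >= 0.8)[1])
```

In the biplot, attributes whose arrows point toward a cluster of observations tend to take higher values in that group, which is one way to approach parts (c)(i) and (c)(ii).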
