Assignment_02
650 Assignment #2
Data Analytics
Dr. Ruxian Wang Johns Hopkins Carey Business School
Assignment #2
Attention: Please prepare two files for each homework assignment: a .docx or .pdf file containing
your answers to each question, including figures, and a .R file containing your R script. File names
should be “LastName FirstName number.docx” and “LastName FirstName number.R”. All assignments
should be submitted via our course website.
1. The grade point averages (GPA) of 12 graduating MBA students and their GMAT scores, taken
before entering the MBA program, are given below. Use the GMAT scores as a predictor of
GPA, and conduct a regression of GPA on GMAT scores.
x = GMAT    y = GPA
560         3.20
540         3.44
520         3.70
580         3.10
520         3.00
620         4.00
660         3.38
630         3.83
550         2.67
550         2.75
600         2.33
537         3.75
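As a starting point for the R script, the regression in Question 1 could be set up roughly as follows. This is a minimal sketch rather than a model answer: the variable names gmat, gpa, and fit are illustrative choices, not names required by the assignment.

```r
# Data from the table in Question 1 (12 students)
gmat <- c(560, 540, 520, 580, 520, 620, 660, 630, 550, 550, 600, 537)
gpa  <- c(3.20, 3.44, 3.70, 3.10, 3.00, 4.00, 3.38, 3.83, 2.67, 2.75, 2.33, 3.75)

# Simple linear regression of GPA (response) on GMAT (predictor)
fit <- lm(gpa ~ gmat)

# Coefficient estimates, standard errors, t-tests, and R-squared
print(summary(fit))
```

The formula gpa ~ gmat places the response on the left and the predictor on the right, which matches "a regression of GPA on GMAT scores".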
2. Suppose we have a data set with five predictors: X1 = GPA, X2 = IQ, X3 = Gender (1
for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction
between GPA and Gender. The response is starting salary after graduation (in thousands of
dollars). Suppose we use least squares to fit the model, and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07,
β̂3 = 35, β̂4 = 0.01, β̂5 = −10.
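To make the fitted model in Question 2 concrete, the coefficient estimates can be written as a small R function. This is only a sketch: the function name predict_salary and the example inputs (GPA 4.0, IQ 110) are hypothetical and were chosen purely for illustration.

```r
# Fitted model from Question 2; the result is in thousands of dollars.
# female is 1 for Female and 0 for Male, as coded in the question.
predict_salary <- function(gpa, iq, female) {
  50 + 20 * gpa + 0.07 * iq + 35 * female +
    0.01 * gpa * iq -      # interaction between GPA and IQ
    10 * gpa * female      # interaction between GPA and Gender
}

# Hypothetical inputs, for illustration only
salary_female <- predict_salary(gpa = 4.0, iq = 110, female = 1)
salary_male   <- predict_salary(gpa = 4.0, iq = 110, female = 0)
print(c(female = salary_female, male = salary_male))
```

Because of the GPA and Gender interaction, the female-minus-male difference at a fixed IQ is 35 − 10 × GPA, so the sign of the gap depends on GPA.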
3. In this exercise you will create some simulated data and fit simple linear regression
models to it. Make sure to call set.seed(1) before starting part (a) to ensure
consistent results. (Hint: rnorm(n, mean = a, sd = b) generates n random values with
mean a and standard deviation b, e.g., rnorm(100, mean = 10, sd = 5) returns a vector of
100 values, each drawn from a normal distribution with mean 10 and standard deviation
5.)
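The hint can be tried out directly. The short sketch below (the variable names draws and draws_again are illustrative) shows that resetting the seed reproduces the same values. Note that rnorm() takes a standard deviation rather than a variance, so a variance of 0.25 corresponds to sd = sqrt(0.25) = 0.5.

```r
# Fix the random seed so the draws are reproducible
set.seed(1)
draws <- rnorm(100, mean = 10, sd = 5)

print(length(draws))  # number of values generated
print(mean(draws))    # sample mean, close to but not exactly 10
print(sd(draws))      # sample standard deviation, close to but not exactly 5

# Resetting the seed reproduces exactly the same draws
set.seed(1)
draws_again <- rnorm(100, mean = 10, sd = 5)
print(identical(draws, draws_again))  # TRUE
```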
(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from
a N(0, 1) distribution. This represents a feature, X.
(b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from
a N(0, 0.25) distribution, i.e., a normal distribution with mean zero and variance 0.25.
(c) Using x and eps, generate a vector y according to the model
Y = −1 + 0.5X + ε. (1)
What is the length of the vector y? What are the values of β0 and β1 in this linear
model?
(d) Create a scatterplot displaying the relationship between x and y. Comment on what
you observe.
(e) Fit a least squares linear model to predict y using x. Comment on the model obtained.
How do β̂0 and β̂1 compare to β0 and β1?
(f) Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence
that the quadratic term improves the model fit? Explain your answer.
(g) Repeat (a)-(f) after modifying the data generation process in such a way that there is
less noise in the data. The model (1) should remain the same. You can do this by
decreasing the variance of the normal distribution used to generate the error term in
(b). Describe your results.
(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is
more noise in the data. The model (1) should remain the same. You can do this by
increasing the variance of the normal distribution used to generate the error term in
(b). Describe your results.
(i) What are the confidence intervals for β0 and β1 based on the original data set, the noisier
data set, and the less noisy data set? Comment on your results.
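One possible way to organize parts (a) through (i) in an R script is to wrap the simulation and model fitting in a single function and call it once per noise level. The sketch below is an outline under stated assumptions, not a model answer: the helper name simulate_and_fit and the variances 0.05 and 1 for the less noisy and noisier data sets are arbitrary choices, not values given in the assignment.

```r
# Hypothetical helper: simulate from Y = -1 + 0.5 X + eps, then fit both models.
simulate_and_fit <- function(noise_var, n = 100, seed = 1) {
  set.seed(seed)                                    # same seed for every run
  x   <- rnorm(n, mean = 0, sd = 1)                 # part (a)
  eps <- rnorm(n, mean = 0, sd = sqrt(noise_var))   # part (b): sd, not variance
  y   <- -1 + 0.5 * x + eps                         # part (c), model (1)

  linear    <- lm(y ~ x)                            # part (e)
  quadratic <- lm(y ~ x + I(x^2))                   # part (f)

  list(x = x, y = y,
       linear    = linear,
       quadratic = quadratic,
       f_test    = anova(linear, quadratic),        # does x^2 improve the fit?
       ci        = confint(linear, level = 0.95))   # part (i)
}

original <- simulate_and_fit(noise_var = 0.25)  # variance given in part (b)
less     <- simulate_and_fit(noise_var = 0.05)  # assumed value for part (g)
more     <- simulate_and_fit(noise_var = 1.00)  # assumed value for part (h)

# Part (d): scatterplot with the fitted least squares line
plot(original$x, original$y, xlab = "x", ylab = "y")
abline(original$linear, col = "red")

# Part (i): 95% confidence intervals for beta0 and beta1 at each noise level
print(original$ci)
print(less$ci)
print(more$ci)
```

Reusing the same seed means the three data sets share the same x values and differ only in the scale of the error term, which makes the comparison of interval widths easier to interpret. Using a different seed for each run would also be a defensible choice.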