Lab 2

lab lecture notes of R language.

Uploaded by

neilzhaony
Linear regression in R

MACC7006 Accounting Data and Analytics

Keri Hu

Faculty of Business and Economics

Today: Linear regression in R

By the end of today’s lecture, you should be able to:

• Perform regression analysis to determine linear relationships
  between variables
• Understand hypothesis testing and statistical inference
• Interpret coefficient estimates and add best fit lines to scatter
  plots of the data

We will work with the datasets: Wine.csv and WineTest.csv.

Review of regression basics

• Linear regression: Explain movements in the dependent variable by
  movements in the independent variables
  ⇒ find the line that fits the data in the sample

• Univariate model: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

• Multivariate model: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_K X_{iK} + \epsilon_i$

• Produce estimated coefficients: $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_K$

• Interpretation:
  • $\hat\beta_1$: A one-unit increase in $X_1$ is associated with an increase of
    $\hat\beta_1$ units in $Y$ on average, holding $X_2, \ldots, X_K$ constant.

Correlation is not causation

Regression results cannot prove causality.

• If two things A and B are related statistically, it is possible that
  • A causes B
  • B causes A
  • Some third factor causes both A and B.
• Correlation
  • How strongly the variables are linearly related and change together
  • Says nothing about why or how the relationship arises, only that it exists

Example: Covid infection

03/26/2020 excerpt of KPBS San Diego:

• “Of the 297 people in San Diego County with positive diagnoses,
cases in patients between 20 and 59 formed the bulk of the total,
236 overall or 79% of cases.”

Does the age range of 20–59 lead to a higher risk of contracting Covid?

1 KPBS San Diego, 2020


Testing rate matters

“Dr. Eric McDonald said that statistic probably represented a testing bias,
as members of the military, first-responders and healthcare workers fall
most frequently into that age group and these people are tested at rates
much higher than the general population.”

• Members of the military, first-responders and healthcare workers are
  mostly aged 20–59.
• Essential workers are tested more often, so more positive cases are
  found among them.
• Age 20–59 ⇏ More vulnerable to COVID

Variables in the dataset

Build a linear regression model to predict Price, using Age, AGST,
HarvestRain, WinterRain, and FrancePop as independent variables

Plot Price versus Age, AGST, HarvestRain, WinterRain
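
A minimal sketch of how these four scatter plots could be drawn in base R. Wine.csv is not reproduced here, so the block builds a small synthetic data frame with the same column names; its values and the formula that generates Price are invented purely for illustration.

```r
# Synthetic stand-in for Wine.csv: same column names, made-up values.
# With the real file you would use: Wine <- read.csv("Wine.csv")
set.seed(1)
n <- 25
Wine <- data.frame(
  Age         = seq_len(n) + 2,
  AGST        = runif(n, 15, 18),
  HarvestRain = runif(n, 40, 300),
  WinterRain  = runif(n, 400, 800)
)
Wine$Price <- 0.6 * Wine$AGST - 0.004 * Wine$HarvestRain +
  0.001 * Wine$WinterRain + 0.02 * Wine$Age + rnorm(n, sd = 0.3)

predictors <- c("Age", "AGST", "HarvestRain", "WinterRain")

old_par <- par(mfrow = c(2, 2))  # arrange the four panels in a 2 x 2 grid
for (v in predictors) {
  # One scatter plot per independent variable, Price on the vertical axis
  plot(Wine[[v]], Wine$Price, xlab = v, ylab = "Price",
       main = paste("Price vs", v))
}
par(old_par)                     # restore the previous plot layout
```

Looping over the column names keeps the four panels consistent; each panel could equally be written as a separate plot() call.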

Estimate a linear model: lm()

Fit a regression line (we save the model to WineReg)

• We do not need to use $ to specify variables here, because we have
  the data argument telling R which dataset to use.

WineReg <- lm(Price ~ WinterRain + AGST + HarvestRain +
              Age + FrancePop, data = Wine)

• Check the output of the model: summary(WineReg)
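
The fit-and-summarise step can be sketched end to end as follows. The data frame is again a synthetic stand-in for Wine.csv (made-up values, same column names), so the printed numbers will not match the lecture's results; the point is only to show where the coefficient table and the R-squared values live in the summary object.

```r
# Synthetic stand-in for Wine.csv (values invented for illustration).
set.seed(1)
n <- 25
Wine <- data.frame(
  WinterRain  = runif(n, 400, 800),
  AGST        = runif(n, 15, 18),
  HarvestRain = runif(n, 40, 300),
  Age         = seq_len(n) + 2
)
# FrancePop is built to move almost in lockstep with Age (an assumption
# made here to mimic the strong correlation discussed later in the lecture).
Wine$FrancePop <- 55000 - 350 * Wine$Age + rnorm(n, sd = 300)
Wine$Price <- 0.001 * Wine$WinterRain + 0.6 * Wine$AGST -
  0.004 * Wine$HarvestRain + 0.02 * Wine$Age + rnorm(n, sd = 0.3)

# Same model formula as in the slide
WineReg <- lm(Price ~ WinterRain + AGST + HarvestRain + Age + FrancePop,
              data = Wine)

WineSummary <- summary(WineReg)
coef_table  <- coef(WineSummary)  # matrix with one row per coefficient
print(coef_table)                 # Estimate, Std. Error, t value, Pr(>|t|)
print(WineSummary$r.squared)      # R-squared
print(WineSummary$adj.r.squared)  # adjusted R-squared
```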

Regression result

Description of the table

• Residual: $e_i = Y_i - \hat{Y}_i$
• Estimate: $\hat\beta_0$ (Intercept), $\hat\beta_1$ (WinterRain), $\hat\beta_2$ (AGST), $\hat\beta_3$
  (HarvestRain), $\hat\beta_4$ (Age), $\hat\beta_5$ (FrancePop)
• The other three columns (Std. Error, t value, and Pr(>|t|))
  help us determine if a variable should be included in the model,
  specifically if its coefficient is significantly different from zero.
• “***, **, *, ., ” (most significant → least significant): which
  variables are significant
• Adjusted $R^2$: $R^2$ adjusted for the number of independent variables
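
As a hedged illustration of the last bullet, the usual adjustment is 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The sketch below checks that formula against the value reported by summary() on a small synthetic data set (made-up values; only the column names follow the lecture).

```r
# Synthetic data, invented for illustration only.
set.seed(2)
n <- 25
Wine <- data.frame(
  AGST        = runif(n, 15, 18),
  HarvestRain = runif(n, 40, 300)
)
Wine$Price <- 0.6 * Wine$AGST - 0.004 * Wine$HarvestRain + rnorm(n, sd = 0.3)

fit <- lm(Price ~ AGST + HarvestRain, data = Wine)
s   <- summary(fit)

k  <- length(coef(fit)) - 1   # number of independent variables (no intercept)
r2 <- s$r.squared

# Penalise R-squared for the number of independent variables
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(c(manual = adj_r2_manual, reported = s$adj.r.squared))
```

Because the penalty grows with k, adding a variable that explains little can lower adjusted R-squared even though plain R-squared never decreases.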

So what do the three columns mean?

Hypothesis testing

Did our sample “conform to” a particular hypothesis?


A hypothesis test evaluates two mutually exclusive statements about a
population and determines which is supported by the sample data.
Null hypothesis $H_0$: The age does not affect wine quality.
versus
Alternative hypothesis $H_A$: The age does affect wine quality.

1. State the hypotheses to be tested: $H_0$ (something we expect to
   reject) and $H_A$ (something to be supported)
2. Determine which test to use (e.g. t test)
3. Estimate equation and calculate value of test statistic (e.g. t value)
4. Draw a conclusion

Hypothesis testing in regression

We want to determine, for each independent variable in turn, whether it is
correlated with the dependent variable.

1. Hypothesize that each regression coefficient is zero: $\beta_1 = 0, \ldots, \beta_K = 0$.
   $H_0: \beta_k = 0$ and $H_A: \beta_k \neq 0$
2. Obtain the t value and Pr(>|t|) for each variable $X_k$
3. Determine whether to reject the null hypothesis $\beta_k = 0$.

Null hypothesis $H_0: \beta_k = 0$

If $H_0$ (i.e. $\beta_k = 0$) is true, the estimated regression coefficient
$\hat\beta_k$ divided by its standard error (the t value) should follow a
t distribution centered at 0 (figure omitted; see footnote 2).

If our sample statistic (e.g. t value) is far from the hypothetical value 0,
we can say this is unusual enough and reject the null hypothesis $\beta_k = 0$.

2 https://analystprep.com/cfa-level-1-exam/quantitative-methods/one-tailed-vs-two-tailed-hypothesis-testing/
t value, Std. Error, and Pr(>|t|)

A higher |t value| or a lower Pr(>|t|) implies greater statistical
significance, i.e., stronger evidence of correlation with the dependent variable.

• Normalization: t value of $X_k$ = $\hat\beta_k$ / Std. Error of $\hat\beta_k$
  • Standard error: estimated standard deviation of $\hat\beta_k$
  • The larger the sample, the more precise the coefficient estimates and,
    other things equal, the higher the |t value|.

• Pr(>|t|): Probability of observing a t value at least as extreme as
  the one in this sample if $H_0$ is true
  • If Pr(>|t|) is small, it means that the t value in this sample is
    extreme and would be unlikely if $H_0$ were true.
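
The relationship between these three columns can be checked directly. This sketch recomputes the t value and the two-sided Pr(>|t|) from the Estimate and Std. Error columns and compares them with what summary() reports. The data are synthetic (made-up values), so only the mechanics carry over to the Wine data.

```r
# Synthetic data, invented for illustration only.
set.seed(3)
n <- 25
Wine <- data.frame(
  AGST        = runif(n, 15, 18),
  HarvestRain = runif(n, 40, 300)
)
Wine$Price <- 0.6 * Wine$AGST - 0.004 * Wine$HarvestRain + rnorm(n, sd = 0.3)

fit <- lm(Price ~ AGST + HarvestRain, data = Wine)
tab <- coef(summary(fit))

# t value = estimate divided by its standard error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value: probability of a t value at least this extreme under H0,
# using a t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(cbind(t_manual, t_reported = tab[, "t value"]))
print(cbind(p_manual, p_reported = tab[, "Pr(>|t|)"]))
```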

Level of significance α

Definition: probability of a type I error (rejecting $H_0$ when it is true)

• An independent variable is statistically significant if $H_0: \beta_k = 0$
  can be rejected at a small level of significance.

• The probability of false rejection is small when |t value| is big
  enough or Pr(>|t|) is small enough.

• Use the 0.1% (***), 1% (**), 5% (*), or 10% (.) level of significance

• If Pr(>|t|) is between 1% and 5%, we say the estimated coefficient
  $\hat\beta_k$ is statistically significant at the 5% level, or $\beta_k = 0$ is rejected at
  the 5% level.
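
To make the mapping from Pr(>|t|) to significance codes concrete, here is a small sketch that bins p-values at the cut-offs listed above. The p-values are hypothetical examples, not results from the Wine regression.

```r
# Hypothetical p-values chosen to fall into each significance band
p_values <- c(0.0004, 0.005, 0.03, 0.08, 0.4)

# Bin each p-value at the 0.1%, 1%, 5% and 10% cut-offs
stars <- as.character(cut(
  p_values,
  breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
  labels = c("***", "**", "*", ".", " "),
  include.lowest = TRUE
))

print(data.frame(p_values, stars))
```

For example, 0.03 lies between 1% and 5%, so it receives a single star and the corresponding coefficient would be called significant at the 5% level.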

Refine the model

Remove insignificant independent variables

• Due to multicollinearity, we should remove independent variables one
  at a time: dropping one variable can change the significance of the
  variables that are correlated with it.

• Two variables are not significant: Age and FrancePop
  • Try removing FrancePop first, since it makes the least intuitive sense.

Re-run the model by leaving out FrancePop

WineReg <- lm(Price ~ WinterRain + AGST + HarvestRain +
              Age, data = Wine)

What has changed?

• All of our independent variables are now significant!

• By removing an independent variable, all of our coefficient estimates
  adjusted slightly.

• $R^2$ decreases slightly from 0.8294 to 0.8286, while adjusted $R^2$
  increases from 0.7845 to 0.7943.
  • If we removed Age and FrancePop at the same time (they were both
    insignificant in the original model), $R^2$ would decrease to 0.7537.

Multicollinearity

What is the correlation between Age and FrancePop?


cor(Wine$Age, Wine$FrancePop)

[1] -0.9945
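
A sketch of how one might screen several variables for multicollinearity at once, by passing a set of columns to cor() instead of a single pair. The data below are synthetic: FrancePop is constructed to fall almost linearly with Age, which is an assumption made here to imitate the strong negative correlation shown above.

```r
# Synthetic data, invented for illustration only.
set.seed(1)
n <- 25
Wine <- data.frame(
  Age  = seq_len(n) + 2,
  AGST = runif(n, 15, 18)
)
Wine$FrancePop <- 55000 - 350 * Wine$Age + rnorm(n, sd = 300)

# Pairwise correlation, as in the slide
r_age_pop <- cor(Wine$Age, Wine$FrancePop)
print(r_age_pop)

# Full correlation matrix for a set of candidate independent variables
cor_matrix <- cor(Wine[, c("Age", "FrancePop", "AGST")])
print(round(cor_matrix, 2))
```

Off-diagonal entries close to 1 or -1 flag pairs of variables that carry nearly the same information.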

Add best fit line to plot

We regress Price on AGST:

WineLess <- lm(Price ~ AGST, data = Wine)
# Draw the scatter plot first, then add the fitted line on top of it
plot(Wine$AGST, Wine$Price, ylab = ...)
abline(WineLess)

Make predictions

We can make predictions on new observations by using predict.


WineTest <- read.csv("WineTest.csv")
WinePredictions <- predict(WineReg, newdata = WineTest)
str(WinePredictions)

Compare to the actual values

Out-of-sample $R^2$

Use the mean of Price in the training set to calculate SST .
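
A minimal sketch of the out-of-sample calculation, assuming the test set contains the actual Price values. Both data frames are synthetic stand-ins for Wine.csv and WineTest.csv (made-up values), and make_wine is a hypothetical helper used only to generate them.

```r
# Hypothetical helper that generates synthetic wine-like data
make_wine <- function(n) {
  d <- data.frame(
    AGST        = runif(n, 15, 18),
    HarvestRain = runif(n, 40, 300)
  )
  d$Price <- 0.6 * d$AGST - 0.004 * d$HarvestRain + rnorm(n, sd = 0.3)
  d
}

set.seed(4)
Wine     <- make_wine(25)  # training set (stand-in for Wine.csv)
WineTest <- make_wine(5)   # test set (stand-in for WineTest.csv)

WineReg <- lm(Price ~ AGST + HarvestRain, data = Wine)
WinePredictions <- predict(WineReg, newdata = WineTest)

# Sum of squared prediction errors on the test set
SSE <- sum((WineTest$Price - WinePredictions)^2)
# Baseline: always predict the mean Price of the training set
SST <- sum((WineTest$Price - mean(Wine$Price))^2)

R2_out <- 1 - SSE / SST
print(R2_out)
```

Unlike in-sample R-squared, this value can be negative when the model predicts the test set worse than the training-set mean does.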
