Lab 2

lab lecture notes of R language.

Uploaded by

neilzhaony


Linear regression in R

MACC7006 Accounting Data and Analytics

Keri Hu

Faculty of Business and Economics

Today: Linear regression in R

By the end of today’s lecture, you should be able to:

• Perform regression analysis to determine linear relationships between variables
• Understand hypothesis testing and statistical inference
• Interpret coefficient estimates and add best fit lines to scatter plots of the data

We will work with the datasets: Wine.csv and WineTest.csv.

Review of regression basics

• Linear regression: Explain movements in the dependent variable by movements in the independent variables
  ⇒ find the line that fits the data in the sample

• Univariate model: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

• Multivariate model: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_K X_{iK} + \epsilon_i$

• Produce estimated coefficients: $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_K$

• Interpretation:
  • $\hat{\beta}_1$: A one-unit increase in $X_1$ is associated with an increase of $\hat{\beta}_1$ units in $Y$ on average, holding $X_2, \ldots, X_K$ constant.

Correlation is not causation

Regression results cannot prove causality.

• If two things A and B are related statistically, it is possible that
  • A causes B
  • B causes A
  • Some third factor causes both A and B.
• Correlation tells us
  • How strongly the variables are linearly related and change together
  • Not the why or how behind the relationship, only that the relationship exists

Example: Covid infection

03/26/2020 excerpt of KPBS San Diego:

• “Of the 297 people in San Diego County with positive diagnoses,
cases in patients between 20 and 59 formed the bulk of the total,
236 overall or 79% of cases.”

Does being in the age range of 20–59 lead to a higher risk of contracting Covid?

1 KPBS San Diego, 2020

Testing rate matters

“Dr. Eric McDonald said that statistic probably represented a testing bias,
as members of the military, first-responders and healthcare workers fall
most frequently into that age group and these people are tested at rates
much higher than the general population.”

• Members of the military, first responders and healthcare workers are mostly aged 20–59.
• Essential workers are tested more often, so more positive cases are found among them.
• Age 20–59 ⇏ more vulnerable to COVID

Variables in the dataset

Build a linear regression model to predict Price, using Age, AGST, HarvestRain, WinterRain, and FrancePop as independent variables

Plot Price versus Age, AGST, HarvestRain, WinterRain

Estimate a linear model: lm()

Fit a regression line (we save the model to WineReg)

• We do not need to use $ to specify variables here, because the data argument tells R which dataset to use.

WineReg <- lm(Price ~ WinterRain + AGST + HarvestRain +
              Age + FrancePop, data = Wine)

• Check the output of the model: summary(WineReg)

Regression result

Description of the table

• Residual: $e_i = Y_i - \hat{Y}_i$
• Estimate: $\hat{\beta}_0$ (Intercept), $\hat{\beta}_1$ (WinterRain), $\hat{\beta}_2$ (AGST), $\hat{\beta}_3$ (HarvestRain), $\hat{\beta}_4$ (Age), $\hat{\beta}_5$ (FrancePop)
• The other three columns (Std. Error, t value, and Pr(>|t|)) help us determine if a variable should be included in the model, specifically if its coefficient is significantly different from zero.
• "***, **, *, ., " (most significant → least significant): which variables are significant
• Adjusted $R^2$: $R^2$ adjusted for the number of independent variables
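The pieces of the table described above can also be pulled out of a fitted model programmatically. The following is a minimal sketch on simulated stand-in data: the data frame Sim, its columns x1, x2, y, and the model SimReg are hypothetical and are not the Wine dataset. The same calls should apply to WineReg.

```r
# Simulated stand-in data (hypothetical; NOT the Wine dataset)
set.seed(1)
n <- 25
Sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Sim$y <- 1 + 2 * Sim$x1 - 0.5 * Sim$x2 + rnorm(n)

SimReg <- lm(y ~ x1 + x2, data = Sim)
RegSummary <- summary(SimReg)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
CoefTable <- coef(RegSummary)
print(CoefTable)

# Residuals: e_i = Y_i - fitted value of Y_i
e <- Sim$y - fitted(SimReg)
print(all.equal(unname(e), unname(residuals(SimReg))))

# R-squared and adjusted R-squared as stored in the summary object
print(RegSummary$r.squared)
print(RegSummary$adj.r.squared)
```

Replacing SimReg with WineReg would give the corresponding quantities for the wine model.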

So what do the three columns mean?

Hypothesis testing

Did our sample “conform to” a particular hypothesis?

A hypothesis test evaluates two mutually exclusive statements about a
population and determines which is supported by the sample data.
Null hypothesis $H_0$: The age does not affect wine quality.
versus
Alternative hypothesis $H_A$: The age does affect wine quality.

1. State the hypotheses to be tested: $H_0$ (something we expect to reject) and $H_A$ (something to be supported)
2. Determine which test to use (e.g. t test)
3. Estimate equation and calculate value of test statistic (e.g. t value)
4. Draw a conclusion

Hypothesis testing in regression

We want to determine, for each independent variable in turn, whether it is correlated with the dependent variable.

1. Hypothesize that each regression coefficient is zero: $\beta_1 = 0, \ldots, \beta_K = 0$.
   $H_0: \beta_k = 0$ and $H_A: \beta_k \neq 0$
2. Obtain the t value and Pr(>|t|) for each variable $X_k$
3. Determine whether to reject the null hypothesis $\beta_k = 0$.

Null hypothesis $H_0: \beta_k = 0$

If $H_0$ (i.e. $\beta_k = 0$) is true, the estimated regression coefficient $\hat{\beta}_k$ divided by its standard error (the t value) should follow a t distribution centered at 0 (figure from the source in footnote 2).

If our sample statistic (e.g. t value) is far from the hypothesized value 0, we can say this is unusual enough and reject the null hypothesis $\beta_k = 0$.

2 https://analystprep.com/cfa-level-1-exam/quantitative-methods/one-tailed-vs-two-tailed-hypothesis-testing/
t value, Std. Error, and Pr(>|t|)

A higher |t value| or a lower Pr(>|t|) implies greater statistical significance, i.e., stronger evidence of correlation with the dependent variable.

• Normalization: t value of $X_k$ = $\hat{\beta}_k$ / (Std. Error of $\hat{\beta}_k$)
  • Standard error: estimated standard deviation of $\hat{\beta}_k$
  • The larger the sample, the more precise the coefficient estimates and, other things equal, the higher the |t value|.

• Pr(>|t|): Probability of observing a t value at least as extreme as the one in this sample if $H_0$ is true
  • If Pr(>|t|) is small, it means that the t value in this sample is extreme and unlikely if we assume $H_0$.
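The relationship between the three columns can be reproduced by hand. The sketch below uses simulated data (the data frame Sim and the model SimReg are hypothetical, not the Wine data). It computes the t value as estimate divided by standard error, and the two-sided Pr(>|t|) from the t distribution with the model's residual degrees of freedom.

```r
# Simulated stand-in data (hypothetical; NOT the Wine dataset)
set.seed(2)
n <- 30
Sim <- data.frame(x = rnorm(n))
Sim$y <- 0.5 + 1.5 * Sim$x + rnorm(n)

SimReg <- lm(y ~ x, data = Sim)
CoefTable <- coef(summary(SimReg))

# t value = estimate / standard error
Estimate <- CoefTable[, "Estimate"]
StdError <- CoefTable[, "Std. Error"]
TValue <- Estimate / StdError

# Two-sided p-value: probability of a t value at least this extreme under H0
PValue <- 2 * pt(abs(TValue), df = df.residual(SimReg), lower.tail = FALSE)

# Compare the hand-computed columns with those reported by summary()
print(cbind(TValue, Reported_t = CoefTable[, "t value"]))
print(cbind(PValue, Reported_p = CoefTable[, "Pr(>|t|)"]))
```

The hand-computed columns should match the ones reported by summary().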

Level of significance α

Definition: probability of type I error (rejecting $H_0$ when it is true)

• An independent variable is statistically significant if $H_0$ can be rejected at a small level of significance.

• The probability of false rejection is small when |t value| is big enough or Pr(>|t|) is small enough.

• Use 0.1% (***), 1% (**), 5% (*), or 10% (.) level of significance

• If Pr(>|t|) is between 1% and 5%, we say the estimated coefficient $\hat{\beta}_k$ is statistically significant at the 5% level, or $\beta_k = 0$ is rejected at the 5% level.
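As an illustration of how the significance codes map onto Pr(>|t|), here is a small sketch. The p-values in PValues are made up for illustration only, and cut() is used here as one simple way to do the mapping; it is not necessarily how summary() produces the stars internally.

```r
# Hypothetical p-values, chosen only to hit each significance band
PValues <- c(0.0004, 0.008, 0.03, 0.07, 0.4)

# Map each p-value to its code: 0.1% (***), 1% (**), 5% (*), 10% (.)
Stars <- cut(PValues,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)

print(data.frame(PValues, Stars))
```

For example, 0.03 lies between 1% and 5%, so it receives a single star.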

Refine the model

Remove insignificant independent variables

• Due to multicollinearity, we should remove independent variables one at a time.

• Two variables that are not significant: Age and FrancePop

• Try removing FrancePop first, since it makes the least intuitive sense.

Re-run the model by leaving out FrancePop

WineReg <- lm(Price ~ WinterRain + AGST + HarvestRain +
              Age, data = Wine)

What has changed?

• All of our independent variables are significant!

• By removing an independent variable, all of our coefficient estimates adjusted slightly.

• $R^2$ decreases slightly from 0.8294 to 0.8286, while adjusted $R^2$ increases from 0.7845 to 0.7943.
• If we removed Age and FrancePop at the same time (they were both insignificant in the original model), $R^2$ would decrease to 0.7537.
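The reported figures can be checked against the standard adjustment formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - K - 1), where K is the number of independent variables. The sketch below assumes n = 25 observations. The slides do not state the sample size; 25 is simply the value consistent with the reported numbers. The helper AdjustedR2 is a hypothetical name.

```r
# Standard adjusted R-squared formula; K = number of independent variables
AdjustedR2 <- function(R2, n, K) {
  1 - (1 - R2) * (n - 1) / (n - K - 1)
}

n <- 25  # ASSUMPTION: sample size is not stated in the slides

# Full model: 5 independent variables, R-squared = 0.8294
AdjFull <- round(AdjustedR2(0.8294, n, K = 5), 4)

# Model without FrancePop: 4 independent variables, R-squared = 0.8286
AdjReduced <- round(AdjustedR2(0.8286, n, K = 4), 4)

print(c(AdjFull = AdjFull, AdjReduced = AdjReduced))
```

Under that assumption the formula gives 0.7845 and 0.7943, matching the values above. Dropping a variable lowers R-squared a little but reduces the penalty for the number of variables by more.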

Multicollinearity

What is the correlation between Age and FrancePop?

cor(Wine$Age, Wine$FrancePop)

[1] -0.9945
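To look for such pairs among several variables at once, one option is to inspect the whole correlation matrix. This is a minimal sketch on simulated data: the data frame Sim and its columns Age, Pop and Rain are hypothetical stand-ins, and the 0.7 cutoff is an assumed rule of thumb rather than something stated in the lecture.

```r
# Simulated stand-in data (hypothetical; NOT the Wine dataset):
# Age falls by one each year while Pop trends upward, so they move together
set.seed(3)
Year <- 1:25
Sim <- data.frame(
  Age  = 26 - Year,
  Pop  = 45000 + 400 * Year + rnorm(25, sd = 200),
  Rain = rnorm(25, mean = 600, sd = 100)
)

# Pairwise correlations between all variables
CorMatrix <- cor(Sim)
print(round(CorMatrix, 2))

# Flag each pair once (upper triangle) whose |correlation| exceeds the cutoff
Cutoff <- 0.7  # ASSUMPTION: rule-of-thumb threshold
HighPairs <- which(abs(CorMatrix) > Cutoff & upper.tri(CorMatrix), arr.ind = TRUE)
print(data.frame(
  Var1 = rownames(CorMatrix)[HighPairs[, "row"]],
  Var2 = colnames(CorMatrix)[HighPairs[, "col"]]
))
```

Highly correlated pairs flagged this way are candidates for the one-at-a-time removal described earlier.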

Add best fit line to plot

We regress Price on AGST:

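WineLess <- lm(Price ~ AGST, data = Wine)
plot(Wine$AGST, Wine$Price, ylab = ...)  # scatter plot; "..." stands for the labels elided on the slide
abline(WineLess)  # add the fitted regression line to the existing plot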

Make predictions

We can make predictions on new observations by using predict.

WineTest <- read.csv("WineTest.csv")
WinePredictions <- predict(WineReg, newdata = WineTest)
str(WinePredictions)

Compare to the actual values

Out-of-sample $R^2$

Use the mean of Price in the training set to calculate SST.
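A minimal sketch of this calculation, assuming the usual definition of out-of-sample R-squared as 1 - SSE/SST, with SSE computed from the test-set prediction errors and SST from the test-set deviations around the training mean. The function OutOfSampleR2 and the numbers in the example are hypothetical, not results from the Wine data.

```r
# Out-of-sample R-squared: 1 - SSE / SST, with SST based on the TRAINING mean
OutOfSampleR2 <- function(Actual, Predicted, TrainMean) {
  SSE <- sum((Actual - Predicted)^2)   # squared prediction errors on the test set
  SST <- sum((Actual - TrainMean)^2)   # squared deviations from the training mean
  1 - SSE / SST
}

# Made-up numbers for illustration only
Actual    <- c(3, 5, 7)
Predicted <- c(2.5, 5.5, 6.5)
TrainMean <- 4

TestR2 <- OutOfSampleR2(Actual, Predicted, TrainMean)
print(TestR2)

# With the objects from this lecture, the call would look like:
# OutOfSampleR2(WineTest$Price, WinePredictions, mean(Wine$Price))
```

Unlike in-sample R-squared, this value can be negative when the model predicts the test set worse than the training mean does.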
