EVSC 445 Week 11


March 26, 2024



Future Classes

1. Multiple Linear Regression


2. Multivariate Analysis

Lecture: Mar 26 (we are here; Lab on Thurs, Mar 28)
Lecture: Apr 2 (Mini-Project due Apr 4)
Review: Apr 9 (Midterm 03 on Apr 11)




Last Week: Simple Linear Regression

[Figure: scatter plot of junk food intake (x) vs immune marker (y) with the fitted regression line]

I If β̂0 = β̂intercept = 1.0, β̂1 = β̂slope = 1.2
  – We used this model to calculate the fitted (expected) immune marker
    value ŷ for someone who eats 6 units of junk food.
  – ŷ = β̂0 + β̂1 x
  – ŷ = 1.0 + 1.2 (6) = 8.2
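A minimal R sketch of this calculation; the data here are simulated stand-ins for the junk-food / immune-marker example, not the course dataset:

# Hypothetical data standing in for the lecture example
set.seed(1)
x <- 0:20
y <- 1.0 + 1.2 * x + rnorm(length(x), sd = 4)

fit <- lm(y ~ x)      # simple linear regression
coef(fit)             # estimated intercept (beta0-hat) and slope (beta1-hat)

# Fitted value for someone who eats 6 units of junk food
predict(fit, newdata = data.frame(x = 6))

# Equivalent by hand with the lecture's estimates: 1.0 + 1.2 * 6 = 8.2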

Regression ANOVA Sums of Squares Table
> anova(lm(y~x))
Analysis of Variance Table

Df Sum Sq Mean Sq F value Pr(>F)


x 1 1108.80 1108.80 75.687 4.719e-08 ***
Residuals 19 278.35 14.65
---

ANOVA table
Source of variation   df   Sum of Squares   Mean Square   F-stat
Covariate x            1          1108.80       1108.80   75.686
Error                 19           278.35         14.65        –
Total                 20          1387.15             –        –

The ANOVA table doesn’t tell us whether the relation is positive or
negative; i.e., it gives no estimates of β̂0 or β̂1 . For these we fit a
regression model with the function lm(y ~ x) and look at its summary.
Simple Linear Regression

> summary(lm(y~x))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0000 1.6125 0.62 0.543
x 1.2000 0.1379 8.70 4.72e-08 ***
---

I The regression analysis does tell us whether the relation is positive or
  negative.
I i.e., there are estimates of β̂0 and β̂1 under the ’Estimate’ column.

What are the Degrees of Freedom for β̂0 and β̂1 in the t-tests?
I dfβ0 = number of observations − number of parameters = 21 − 2 = 19
I dfβ1 = number of observations − number of parameters = 21 − 2 = 19



Linear Regression Table: Standard Error for the mean, x̄
> summary(lm(y~x))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0000 1.6125 0.62 0.543
x 1.2000 0.1379 8.70 4.72e-08 ***
---
I Recall how we calculate the t-statistic for a one-sample t-test where
  the statistical hypotheses are H0 : µ = 0 vs HA : µ ≠ 0.

      t-stat = (x̄ − 0) / sqrt( var(x) / n ) = (x̄ − 0) / Std.Error(x̄)

  Now the null hypotheses are β0 = 0 and β1 = 0, and

      t-stat = (β̂0 − 0) / Std.Error(β̂0) = 1.0 / 1.6125 = 0.62
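A small R check of this arithmetic, reusing only the numbers printed in the summary table above:

# Reproduce the intercept's t value and p-value from the printed
# estimate and standard error (values taken from the summary output above)
est <- 1.0000        # beta0-hat
se  <- 1.6125        # Std. Error of beta0-hat
df  <- 19            # residual degrees of freedom (n - 2 = 21 - 2)

t_stat <- (est - 0) / se                 # 0.62
p_val  <- 2 * pt(-abs(t_stat), df = df)  # two-sided p-value, ~0.54

c(t = t_stat, p = p_val)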
Understanding the Output of Linear Regression Table
Analysis of Variance Table

Response: y.
Df Sum Sq Mean Sq F value Pr(>F)
x 1 1108.80 1108.80 75.687 4.719e-08 ***
Residuals 19 278.35 14.65
...
Residual standard error: 3.827 on 19 degrees of freedom
Multiple R-squared: 0.7993,Adjusted R-squared: 0.7888
F-statistic: 75.69 on 1 and 19 DF, p-value: 4.719e-08

I r² can be interpreted as the percentage of variance in the dependent
  variable that can be explained by the predictors.
I r²adj is r² adjusted by a penalty for the number of parameters.



How to adjust the r² when there’s more than 1 parameter
E.g., β̂0 and β̂1
I There’s a formula for this.
Analysis of Variance Table

Response: y.
Df Sum Sq Mean Sq F value Pr(>F)
x 1 1108.80 1108.80 75.687 4.719e-08 ***
Residuals 19 278.35 14.65
...
Residual standard error: 3.827 on 19 degrees of freedom
Multiple R-squared: 0.7993,Adjusted R-squared: 0.7888
F-statistic: 75.69 on 1 and 19 DF, p-value: 4.719e-08

      r² = SSR / SST = (SST − SSE) / SST = ?

      r²adj = (SST/dfT − SSE/dfE) / (SST/dfT) = ?
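As a check, both quantities can be computed directly from the sums of squares printed in the ANOVA table above; this is a small R sketch, not part of the original slides:

# Sums of squares taken from the ANOVA table above
SSE <- 278.35;  df_E <- 19      # residual (error) sum of squares
SST <- 1387.15; df_T <- 20      # total sum of squares
SSR <- SST - SSE                # regression sum of squares

r2     <- SSR / SST                              # 0.7993
r2_adj <- (SST/df_T - SSE/df_E) / (SST/df_T)     # 0.7888

c(r2 = r2, r2_adj = r2_adj)   # matches Multiple and Adjusted R-squared above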
Multiple Regression

In simple linear regression, the regression model takes the form:

yi = β̂0 + β̂1 x1 + εi
ŷi = β̂0 + β̂1 x1

More than one independent variable (e.g., x1 , x2 , . . .) can be used to
explain variance in the dependent variable (y), as long as the predictors are
not strongly linearly related to one another. This is called multiple regression.

I For example, a multiple regression model takes the form:


yi = β̂0 + β̂1 x1 + β̂2 x2 + . . . + β̂p xp + εi

where p is the number of variables (covariates).



Multiple Regression

More than one independent variable (e.g., x1 , x2 , . . .)


can be used to explain variance in
(still just one) dependent variable (y ).

I In simple linear regression the model takes the form:

yi = β̂0 + β̂1 X1 + εi
where X1 is the only independent variable.

I In multiple regression, the model takes the form:

yi = β̂0 + β̂1 X1 + β̂2 X2 + . . . + β̂p Xp + εi


where p is the number of variables.



Multiple Regression

I In multiple regression, the data are now multi-dimensional!
  ...But we are still finding the smallest sum of all the squared residuals.
I Multiple Regression Model

      yi = β̂0 + β̂1 X1 + β̂2 X2 + . . . + β̂p Xp + εi

  β̂0 is the intercept
  β̂1 is how much y changes if X1 increases by one unit (holding the other X’s fixed)
  ...
  β̂p is how much y changes if Xp increases by one unit (holding the other X’s fixed)



Multiple Regression
I What is the objective?
1. you want to know how Y is affected by all the predictor variables (X ’s)

2. you want to know how Y is affected by a subset of predictor variables


(X ’s), but you put in others to control for other sources of variability
(i.e., to get a more accurate estimate of β̂1 )
3. you want to find the best, most parsimonious model for predicting Y
from a subset of predictor variables (X ’s)

I In all cases, the goal is to estimate the regression coefficients (β̂’s)
  from the data using the least squares principle that minimizes

      SSE = Σ (yi − ŷi)²   (summed over i = 1, . . . , n)
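A small R sketch (with simulated placeholder data, not the course dataset) showing that the least-squares fit has a smaller SSE than nearby alternative coefficient choices:

# Simulated stand-in data with two covariates
set.seed(42)
n  <- 50
X1 <- runif(n, 0, 10)
X2 <- runif(n, 0, 5)
y  <- 2 + 1.5 * X1 - 0.8 * X2 + rnorm(n, sd = 2)

fit    <- lm(y ~ X1 + X2)              # least-squares estimates of beta0, beta1, beta2
sse_ls <- sum(residuals(fit)^2)        # SSE at the least-squares solution

# Perturb the fitted coefficients slightly: SSE can only get larger
b <- coef(fit) + c(0.1, -0.05, 0.05)
sse_alt <- sum((y - (b[1] + b[2] * X1 + b[3] * X2))^2)

c(least_squares = sse_ls, perturbed = sse_alt)   # sse_ls is the smaller of the two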



Multiple Regression

I Example of 2nd objective:


You’d like to know how Y is affected by a subset of predictor
variables (X ’s), but you put in others to control for other sources of
variability (i.e., to get a more accurate estimate of β̂1 )
I Dependent Variable: Home Rent Price
I Key Predictor Variable: Square Footage
I Other Factors:
I Median neighbourhood income
I Age of home
I Quality of local schools
I Distance from downtown Vancouver



Assumptions of Simple and Multiple Linear Regression

I As with simple linear regression, we assume


1. the samples are randomly selected from the population
2. the samples are independent of one another (i.e., no auto-correlation)
3. the residuals ( ε ) are normally distributed
4. the residuals have constant variance across range of predictors ( X ’s )

I We check the assumptions as before with plots of the model residuals.
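For reference, a minimal residual-check sketch in R, using simulated stand-in data rather than any course dataset:

# Fit a model to simulated data, then inspect its residuals
set.seed(2)
x <- runif(40, 0, 20)
y <- 1 + 1.2 * x + rnorm(40, sd = 4)
fit <- lm(y ~ x)

res <- residuals(fit)
hist(res, main = "Histogram of residuals")            # roughly normal?
qqnorm(res); qqline(res)                              # normal Q-Q plot
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals")      # look for constant scatter
abline(h = 0, lty = 2)
# plot(fit)   # R's built-in diagnostic plots cover the same checks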



Multiple Regression: Another example
Assume we have radio-collared some (n) adult caribou and
we want to see whether there is a relationship between
I the time ( Y ) it takes a caribou to migrate between
calving grounds and over-wintering habitat, and
1. the number of steps the animal takes ( X1 )
2. the age of the caribou (years) ( X2 )
..
.

The data table takes the form (where each row is one animal):

      Y     X1     X2    ...   Xp
      y1    x11    x12   ...   x1p
      y2    x21    x22   ...   x2p
      ...   ...    ...   ...   ...
      yn    xn1    xn2   ...   xnp
Interpreting Multiple Regression

> model=lm(time~number.of.steps+age, data=data)


> summary(model)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3541424 1.0956079 2.149 0.042922 *
number.of.steps 0.0016184 0.0001708 9.473 3.2e-09 ***
age 0.0143239 0.0036154 3.962 0.000662 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

This table tells us the estimates for β̂0 , β̂1 , β̂2 , providing the direction and
the magnitude of each relationship, as well as p-values for our H0 ’s.

What are the hypotheses related to β̂0 , β̂1 , β̂2 ?


Multiple Regression: Continuous Covariates

I What is a continuous covariate?

How do we write the model?

How do we use the model to find the predicted value ŷ , the time to migrate?
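One way to answer the last question, sketched in R with the rounded coefficients printed in the caribou summary above; the step count and age below are made up for illustration:

# Coefficients taken from the summary(model) output above
b0 <- 2.3541424      # intercept
b1 <- 0.0016184      # per additional step
b2 <- 0.0143239      # per additional year of age

# Hypothetical caribou: 10,000 steps, 5 years old
steps <- 10000
age   <- 5

y_hat <- b0 + b1 * steps + b2 * age
y_hat                                   # predicted migration time, ~18.6

# With the fitted model object this is equivalent to:
# predict(model, newdata = data.frame(number.of.steps = steps, age = age))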



Another multiple regression example

Predicting the aggregated habitat index score, ŷ , from vegetation


characteristics of forest stands used by woodland caribou in
Western Manitoba’s Kississing-Naosap range.

How do we write this model?



Another multiple regression example : Discrete variables

Predicting the aggregated habitat index score, ŷ , from vegetation


characteristics of forest stands used by woodland caribou in
Western Manitoba’s Kississing-Naosap range.

ŷ = Constant + β̂1 X1 + β̂2 X2 + β̂3 X3 + β̂4 X4


Constant = β̂0 or the intercept
ŷ = 4.96 + 1.13 X1 + 0.009 X2 + 0.006 X3 − 0.47 X4

I What is the discrete variable ‘presence of trembling aspen (0,1)’ ?


I What if we had multiple species, e.g., ‘presence of trembling aspen
  (0,1), alder (0,1), maple (0,1)’ ?
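A hedged R sketch of how such 0/1 (dummy) variables enter a model; the data and species variable here are invented for illustration, not the Kississing-Naosap data:

# Invented example data with a 0/1 indicator for trembling aspen presence
set.seed(7)
n      <- 30
canopy <- runif(n, 10, 90)                    # a continuous covariate
aspen  <- rbinom(n, 1, 0.5)                   # 1 = aspen present, 0 = absent
score  <- 5 + 0.05 * canopy - 0.5 * aspen + rnorm(n, sd = 0.5)

fit <- lm(score ~ canopy + aspen)
coef(fit)   # the aspen coefficient is the shift in score when aspen is present

# With several species, each gets its own 0/1 column (or use factor()):
# lm(score ~ canopy + aspen + alder + maple)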



Multiple Regression: A good example.
https://rcompanion.org/rcompanion/e_05.html



Multiple Regression: Additional Concepts
I I. Correlation and Collinearity

Whenever you have a dataset with multiple numeric variables, it is a


good idea to look at the correlations among these variables. One
reason is that if you have a dependent variable, you can easily see
which independent variables correlate with that dependent variable.



Multiple Regression: Additional Concepts

I II. Correlation and Collinearity

  A second reason is that, when building a multiple regression model, adding
  an independent variable that is strongly correlated with one already in the
  model is unlikely to improve the model much, so you may have good reason to
  choose one variable over the other. Strongly correlated predictors can also
  make the regression coefficients unstable (standard error estimates for the
  β̂’s may be artificially large). This is called collinearity.
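A quick way to screen for this in R, sketched here with an invented data frame rather than any specific course dataset:

# Invented data frame standing in for a real dataset
set.seed(3)
dat    <- data.frame(x1 = rnorm(40))
dat$x2 <- dat$x1 + rnorm(40, sd = 0.2)        # x2 is strongly correlated with x1
dat$x3 <- rnorm(40)
dat$y  <- 2 + dat$x1 + 0.5 * dat$x3 + rnorm(40)

round(cor(dat), 2)   # x1 and x2 are highly correlated: a collinearity warning sign
# pairs(dat)         # pairwise scatter plots show the same information visually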



Multiple Regression: Additional Concepts

I III. Auto-correlation

  When a sample’s value is related to the value of the previous sample, the
  data are said to be auto-correlated (bad!); this is a form of
  pseudo-replication.

  Forms to be careful of are:

I spatial autocorrelation,
I temporal autocorrelation, or . . .
I spatio-temporal autocorrelation.
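For regularly spaced, time-ordered data, a common first check in R is the autocorrelation function of the residuals; this sketch uses simulated data with built-in autocorrelation, not a course dataset:

# Checking residuals for temporal autocorrelation (simulated time-ordered data)
set.seed(9)
t_index <- 1:60
y <- 5 + 0.2 * t_index + arima.sim(list(ar = 0.6), n = 60)   # autocorrelated noise
fit <- lm(y ~ t_index)

acf(residuals(fit), main = "ACF of residuals")   # spikes beyond the bands
                                                 # suggest autocorrelation
# A formal test (Durbin-Watson) is available in the lmtest package:
# lmtest::dwtest(fit)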



Example : The Maryland Biological Stream Survey
https://rcompanion.org/rcompanion/e_05.html

I Response variable: Longnose


I Covariates:
I Acerage
I DO2
I Maxdepth
I NO3
I SO4
I Temp

model.null = lm(Longnose ~ 1,
data=Data)
model.full = lm(Longnose ~ Acerage + DO2 + Maxdepth + NO3 + SO4 + Temp,
data=Data)



Example : The Maryland Biological Stream Survey
https://rcompanion.org/rcompanion/e_05.html

step(model.null,
scope = list(upper=model.full),
direction="both",
data=Data)



Model Selection: The Maryland Biological Stream Survey

Start:  AIC=525.45
Longnose ~ 1

           Df Sum of Sq    RSS    AIC
+ Acerage   1   17989.6 131841 518.75
+ NO3       1   14327.5 135503 520.61
+ Maxdepth  1   13936.1 135894 520.81
<none>                  149831 525.45
+ Temp      1    2931.0 146899 526.10
+ DO2       1    2777.7 147053 526.17
+ SO4       1      45.3 149785 527.43

Step:  AIC=518.75
Longnose ~ Acerage

           Df Sum of Sq    RSS    AIC
+ NO3       1   17878.6 113962 510.84
+ Maxdepth  1    7447.6 124393 516.80
<none>                  131841 518.75
+ DO2       1    3105.4 128735 519.13
+ Temp      1    2879.9 128961 519.25
+ SO4       1     176.5 131664 520.66
- Acerage   1   17989.6 149831 525.45

Step:  AIC=510.84
Longnose ~ Acerage + NO3

           Df Sum of Sq    RSS    AIC
+ Maxdepth  1    6058.4 107904 509.13
<none>                  113962 510.84
+ Temp      1    2897.8 111064 511.09
+ DO2       1     401.3 113561 512.60
+ SO4       1       4.9 113957 512.84
- NO3       1   17878.6 131841 518.75
- Acerage   1   21540.7 135503 520.61

Step:  AIC=509.13
Longnose ~ Acerage + NO3 + Maxdepth

           Df Sum of Sq    RSS    AIC
<none>                  107904 509.13
+ Temp      1    2948.0 104956 509.24
+ DO2       1     669.6 107234 510.70
- Maxdepth  1    6058.4 113962 510.84
+ SO4       1       5.9 107898 511.12
- Acerage   1   14652.0 122556 515.78
- NO3       1   16489.3 124393 516.80

I Which model? We select the model with the smallest AIC value: here,
  Longnose ~ Acerage + NO3 + Maxdepth (AIC = 509.13).



Multiple Regression: Additional Concepts
I Model Selection: Stepwise Selection and
  Akaike Information Criteria (AIC or BIC)

I Stepwise: a way of automating model selection using Forward, Backward,
  or Both directions, adding and removing variables one at a time to or
  from the model.
I The Akaike information criterion, AIC, is calculated from the maximum
  log-likelihood of the model and the number of parameters (p) used to
  reach that likelihood: AIC = −2(log-likelihood) + 2p (see the sketch
  after this list).
I AIC or AICc: these are estimators of the relative quality of a set of
  statistical models for a given set of data (smaller number is better).
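A small R check of that formula on a fitted lm object; the data are simulated placeholders, and note that for lm the parameter count includes the residual variance:

# Check AIC = -2*(log-likelihood) + 2*p on a fitted model
set.seed(11)
x <- rnorm(30); y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

ll <- logLik(fit)
p  <- attr(ll, "df")                    # number of estimated parameters

manual_aic <- -2 * as.numeric(ll) + 2 * p
c(manual = manual_aic, built_in = AIC(fit))   # identical

# Note: step() reports extractAIC(), which differs by an additive constant,
# so only differences between models matter when comparing.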



Multiple Regression: Selecting between models

A good model should only be as complex as necessary to describe a


dataset. We’ve talked about
I r²adj
I AIC
I BUT don’t forget: common sense!
I If you are choosing between a very simple model with 1 covariate and
  a very complex model with, say, 10 covariates, the very complex
  model needs to provide a much better fit to the data in order to
  justify its increased complexity. If it can’t, then the simpler
  model should be preferred.



Multiple Regression: Additional Concepts

I Data transformations

I taking logs: it is possible to take the log of the response variable
  and/or any covariates whose distributions do not look normal (add
  some small value first if the variable can equal 0).
I generalized linear models (GLMs): a rich modeling world where the data
  no longer need to be normally distributed (the normal distribution is
  just a special case); see the sketch after this list.
I logistic regression for binary outcomes
I poisson regression for count data
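An illustrative R sketch of these two GLM flavours, with simulated placeholder data (not part of the original slides):

# Simulated placeholder data
set.seed(5)
x <- rnorm(100)

# Logistic regression: binary outcome (0/1)
y_bin     <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit_logit <- glm(y_bin ~ x, family = binomial)

# Poisson regression: count outcome
y_count  <- rpois(100, lambda = exp(0.3 + 0.6 * x))
fit_pois <- glm(y_count ~ x, family = poisson)

summary(fit_logit)$coefficients
summary(fit_pois)$coefficients

# An ordinary linear model is the special case family = gaussian:
# glm(y ~ x, family = gaussian) is equivalent to lm(y ~ x)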



Multiple Regression: Regression Residuals
Homoscedasticity means the residuals have the same scatter (constant variance) across the range of fitted values.
Heteroscedasticity means the scatter changes, i.e., there is some pattern in the residuals.

