
Environmental Data Analysis

Lecture 6

Dr. Zhi NING


Regression Analysis
Regression analysis

• Predicts values of a dependent variable from the values of one or
  more independent variables.
• Applications:
  – For example, when one variable is very difficult or very
    expensive to analyze, or the determinations were of poor
    quality, it can be predicted from a related variable.
• Such a relation could take the simple form of a linear model or,
  more generally, of a non-linear model.
Regression analysis

• Linear regression
  – estimation of the regression line
  – analysis of variance
• Linear multiple regression
• Non-linear regression
Regression analysis

• Many different methods estimate the regression parameters such
  that the error is minimized:
  – Least squares method (LS): minimize the sum of the squared
    errors;
  – Least absolute deviation (LAD): minimize the average of the
    absolute errors;
  – Least trimmed sum of squares method (LTS): more robust to
    outliers.
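As a quick illustration (not from the lecture), LS and LAD can be compared on synthetic data with one outlier; a minimal sketch, assuming numpy and scipy are available:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data with one artificial outlier (hypothetical example)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2, 50)
y[0] += 5.0  # outlier

def fit(loss):
    """Estimate (b0, b1) in y = b0 + b1*x by minimizing loss(residuals)."""
    result = minimize(lambda b: loss(y - (b[0] + b[1] * x)),
                      x0=[0.0, 0.0], method="Nelder-Mead")
    return result.x

b_ls = fit(lambda e: np.sum(e ** 2))       # LS: sum of squared errors
b_lad = fit(lambda e: np.mean(np.abs(e)))  # LAD: average absolute error
print("LS :", b_ls)   # pulled toward the outlier
print("LAD:", b_lad)  # less sensitive to the outlier
```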
Regression analysis

• Requirements of regression analysis
  – Homogeneity of the variance (no heterogeneity)
    • Log-transform the data if necessary
  – Normality
    • Important if confidence intervals based on normal theory are computed
Regression analysis

• Requirements of regression analysis
  – Data outliers and extreme values
    • Points lying at large distances from the apparent trend can
      exert leverage on the regression;
    • Inspect a scatterplot for outliers (see the sketch below);
    • For multiple regression, use a scatterplot matrix.
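A minimal diagnostic sketch, assuming pandas and matplotlib; the data here are synthetic, with a spread that grows with the level of y:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Hypothetical data whose variance grows with the level of y
rng = np.random.default_rng(1)
x1 = rng.uniform(1, 10, 100)
x2 = rng.uniform(0, 5, 100)
y = 3 * x1 + x2 + rng.normal(0, 0.3 * x1)  # heterogeneous variance
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

# Scatterplot matrix: inspect for outliers and levered points
scatter_matrix(df, figsize=(6, 6))
plt.show()

# If the spread grows with the level of y, a log-transformation
# can stabilize the variance before fitting the regression
df["log_y"] = np.log(df["y"])
```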
Linear regression

• Describes the linear relationship between two variables, x and y.
• Seeks to summarize the relationship between the two variables,
  shown graphically in their scatterplot, by a single straight line.
• The regression procedure chooses the line producing the least
  error for predictions of y given observations of x.

• http://cameron.econ.ucdavis.edu/excel/ex61multipleregression.html
Linear regression

• Error between actual and predicted y values:
  ei = yi − ŷi,  where ŷi = b0 + b1·xi
• The sum of squared residuals:
  SSE = Σ ei² = Σ (yi − ŷi)²
• Least squares method: choose b0 and b1 to minimize
  Σ (yi − b0 − b1·xi)²
Linear regression

• Setting the derivatives of SSE with respect to b0 and b1 to zero
  gives the "normal equations":
  Σyi = n·b0 + b1·Σxi
  Σxiyi = b0·Σxi + b1·Σxi²
• Finally, the equations yield:
  b1 = [n·Σxiyi − (Σxi)(Σyi)] / [n·Σxi² − (Σxi)²]
  b0 = ȳ − b1·x̄
Example

The relation between the concentration of trace metal V and the
generation of reactive oxygen species. Excel can easily compute the
regression line parameters (slope and intercept).

Obs   X: Concentration of V   Y: Reactive oxygen species
      (ng/m3)                 (ug Zymosan per m3)
 1    0.024                   13.79
 2    0.040                    9.73
 3    0.011                    6.11
 4    0.046                    6.89
 5    0.101                   22.49
 6    0.006                   10.74
 7    0.005                    3.66
 8    0.007                    1.74
 9    0.003                    3.56
10    0.007                    1.97
11    0.004                    2.35
12    0.004                    5.97
13    0.044                   26.72
14    0.100                   25.39
15    0.036                   16.32
16    0.051                   26.08
17    0.012                    4.46
18    0.068                   24.56
19    0.009                    2.98
20    0.016                    1.79
21    0.005                    0.69
22    0.036                    5.22
23    0.044                   19.42
24    0.028                   21.98
25    0.080                   23.89
26    0.070                   18.62
27    0.038                   10.82
28    0.012                    5.91
29    0.040                   17.60
30    0.075                   24.61

[Scatterplot of reactive oxygen species (ug Zymosan per m3) against
concentration of V (ng/m3), with fitted line y = 253.59x + 3.5657,
R² = 0.6704]

Calculate the residuals of the regression.
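The slope, intercept, residuals, and R² on the chart can be reproduced directly from the table; a minimal numpy sketch (the arrays below are the 30 rows of the table):

```python
import numpy as np

# Data from the slide: V concentration (ng/m3) and ROS (ug Zymosan per m3)
x = np.array([0.024, 0.040, 0.011, 0.046, 0.101, 0.006, 0.005, 0.007,
              0.003, 0.007, 0.004, 0.004, 0.044, 0.100, 0.036, 0.051,
              0.012, 0.068, 0.009, 0.016, 0.005, 0.036, 0.044, 0.028,
              0.080, 0.070, 0.038, 0.012, 0.040, 0.075])
y = np.array([13.79, 9.73, 6.11, 6.89, 22.49, 10.74, 3.66, 1.74,
              3.56, 1.97, 2.35, 5.97, 26.72, 25.39, 16.32, 26.08,
              4.46, 24.56, 2.98, 1.79, 0.69, 5.22, 19.42, 21.98,
              23.89, 18.62, 10.82, 5.91, 17.60, 24.61])

# Least-squares slope and intercept from the normal equations
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)            # e_i = y_i - y_hat_i
sse = (residuals ** 2).sum()             # sum of squared errors
n = len(x)
se2 = sse / (n - 2)                      # variance of the residuals, s_e^2
sst = ((y - y.mean()) ** 2).sum()
r2 = 1 - sse / sst                       # should be close to the chart's 0.6704
print(b1, b0)    # should be close to 253.59 and 3.5657
print(se2, r2)
```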
Important parameters of regression

• Distribution of the residuals
  – Statistically, the ei are assumed to be independent random
    variables from a common distribution;
  – The residuals have zero mean (confidence intervals for the
    coefficients are then based on the t distribution);
  – The variance of the distribution is estimated by:
    se² = SSE/(n − 2) = Σ(yi − ŷi)²/(n − 2)
  – Compute the se² value for the data in the example.


Important parameters of regression

• There are THREE y values:
  the observed value yi, the predicted value ŷi, and the sample mean ȳ
• Basic regression identity:
  DATA = FIT + RESIDUAL

  (yi − ȳ) = (ŷi − ȳ) + (yi − ŷi)

• The first term is the total variation in the response y,
• the second term is the variation in the mean response,
• the third term is the residual.
Important parameters of regression

DATA = FIT + RESIDUAL

(yi − ȳ) = (ŷi − ȳ) + (yi − ŷi)

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
    SST    =     SSR     +     SSE
(sum of squares, total = regression sum of squares + sum of squared errors)
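The decomposition can be verified numerically; a minimal numpy sketch on synthetic data:

```python
import numpy as np

# Synthetic (x, y) data for illustration
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 20)
y = 2.0 + 4.0 * x + rng.normal(0, 0.5, 20)

# Least-squares fit
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = ((y - y.mean()) ** 2).sum()      # total sum of squares
ssr = ((y_hat - y.mean()) ** 2).sum()  # regression sum of squares
sse = ((y - y_hat) ** 2).sum()         # sum of squared errors
print(sst, ssr + sse)  # equal up to floating-point error: SST = SSR + SSE
```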
Important parameters of regression

• Analysis of variance: ANOVA
  – SST stands for "sum of squares, total": the sum of squared
    deviations of the y values around their mean.
  – SSR stands for the regression sum of squares: the sum of squared
    differences between the regression predictions and the sample
    mean of y.
  – SSE refers to the sum of squared errors: the sum of squared
    differences between the residuals and their mean (which is
    zero), i.e. the sum of squared residuals.
Important parameters of regression

• Analysis of variance: ANOVA
  – MSR = SSR/1 = Mean Square Regression
  – MSE = SSE/(n − 2) = se², the Mean Square Error (or Mean Square
    Residual): the sample variance of the residuals.
  – Calculate both for the sample question.


Goodness of fit

1. MSE (se², or SSE/(n − 2))
  – The most fundamental measure;
  – It indicates the variability of, or the uncertainty about, the
    observed y values (the quantities being forecast) around the
    fitted regression line.
  – Good: MSE or SSE → 0 (very small error)
  – Bad: SSR → 0 (slope b1 → 0)
Goodness of fit

2. Coefficient of determination (R²)
   R² = SSR/SST = Σ(ŷi − ȳ)² / Σ(yi − ȳ)²
  – The proportion of the variation of the predictand that is
    described, or accounted for, by the regression (SSR);
  – Good: SSR = SST and SSE = 0, so R² = 1;
  – Bad: SSR = 0 and SSE = SST, so R² = 0.
3. Strength of regression (F)
  – A qualitative measure;
  – A strong relationship between x and y produces a large MSR and
    a small MSE (see the sketch below).
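A minimal numpy sketch, on synthetic data, computing MSE, MSR, R², and F as defined above:

```python
import numpy as np

# Synthetic (x, y) data for illustration
rng = np.random.default_rng(3)
n = 30
x = rng.uniform(0, 1, n)
y = 1.0 + 5.0 * x + rng.normal(0, 1.0, n)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = ((y - y.mean()) ** 2).sum()
ssr = ((y_hat - y.mean()) ** 2).sum()
sse = ((y - y_hat) ** 2).sum()

mse = sse / (n - 2)  # mean square error, s_e^2
msr = ssr / 1        # mean square regression (1 df for one predictor)
r2 = ssr / sst       # coefficient of determination
f = msr / mse        # F statistic: large when the fit is strong
print(f"R^2 = {r2:.3f}, F = {f:.1f}")
```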
Multiple Linear Regression

• Model:
  y = b0 + b1·x1 + b2·x2 + … + bK·xK + e

Multiple Linear Regression

• Normality
  – Y values are normally distributed for each X
  – The probability distribution of the error is normal
• Linearity
• Variables are measured without error
• No exact linear relation exists between any subset of
  explanatory variables
Multiple linear regression

• One predictand is determined by multiple variables:
  y = b0 + b1·x1 + … + bK·xK, where K is the number of variables
• The sum of the squares of the distances from the points to the
  fitted plane is a minimum.
• Slope (bP)
  – The estimated Y changes by bP for each 1-unit increase in XP,
    holding all other variables constant
• Y-intercept (b0)
  – The average value of Y when all XP = 0
http://cameron.econ.ucdavis.edu/excel/ex61multipleregression.html
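A minimal sketch of fitting such a plane by least squares, assuming numpy; the data are synthetic:

```python
import numpy as np

# Synthetic data with two predictors (hypothetical example)
rng = np.random.default_rng(4)
n = 50
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(0, 0.3, n)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones(n), x1, x2])

# Least squares: minimize the sum of squared distances to the plane
b, residual_ss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b)  # estimated intercept and slopes
```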
Multiple linear regression

• Interpretation of the statistical summary


SUMMARY OUTPUT

Regression Statistics
Multiple R 0.986172933
R Square 0.972537054
Adjusted R Square 0.96076722
Standard Error 14.50293137
Observations 11

– Multiple R: the square root of R²
– R square: R²
– Adjusted R square: used when there is more than one x variable;
  adjusts R² for the number of explanatory terms in the model
  (see the sketch below)
– Standard error: an estimate of the standard deviation of the error
– Observations: the number of observations used in the regression
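These summary statistics are related by standard formulas; a small sketch that reproduces the values above from R², n, k, and the MSE reported in the ANOVA table shown later:

```python
import numpy as np

# Values from the summary output above
r2, n, k = 0.972537054, 11, 3  # R^2, observations, explanatory variables
mse = 210.335                  # residual MS from the ANOVA table later

multiple_r = np.sqrt(r2)                       # ~0.98617
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # ~0.96077
std_err = np.sqrt(mse)                         # ~14.503
print(multiple_r, adj_r2, std_err)
```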
Important parameters of regression

• Test for overall significance
  • Use the F-statistic to test the ratio of the variance explained
    by the regression to the variance not explained by the
    regression: F = MSR/MSE
  – Select an X% confidence level
    • H0: variation in y is not explained by the linear regression
      but rather by chance or fluctuations;
    • Ha: variation in y is explained by the linear regression
      rather than by chance or fluctuations
  – Reject the null hypothesis at the α significance level if
    F > Fα(1, n − 2)

[Sketch of the F distribution with the rejection region beyond
F(α, df1, df2) at α = 0.05]
Multiple linear regression

• Interpretation of the statistical summary

ANOVA
             df   SS            MS          F          Significance F
Regression    3   52139.71513   17379.91    82.62963   7.90582E-06
Residual      7   1472.345127   210.335
Total        10   53612.06026

– Regression sum of squares
– Error (residual) sum of squares
– F test:
  • H0: b1 = b2 = b3 = 0;
  • Ha: at least one of b1, b2, b3 does not equal 0
  • P value = 7.9E-06 < 0.05, so reject H0 at significance level 0.05
  • There is evidence that at least one independent variable affects Y
    (from the F table, the critical value is F = 4.35 at p = 0.05:
    http://home.comcast.net/~sharov/PopEcol/tables/f001.html)
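As a check, the reported significance and the cited critical value can be reproduced with scipy's F distribution:

```python
from scipy.stats import f

# Values from the ANOVA table above
F, df1, df2 = 82.62963, 3, 7

p_value = f.sf(F, df1, df2)       # P(F(3,7) > 82.63)
critical = f.ppf(0.95, df1, df2)  # critical value at alpha = 0.05
print(p_value)   # ~7.9e-06, matching "Significance F"
print(critical)  # ~4.35, matching the F-table value cited above
```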
Regression analysis with categorical variables

• Qualitative variables are also important parameters;
  – Data sets may be similar but each measured by a different
    chemist, instrument, or laboratory, or taken from different
    locations, etc.
  – Represented by "0" and "1": categorical variables
• How do we determine whether the same model structure and
  parameter values hold for each data set?
• Use categorical variables (dummy or indicator variables).
Regression analysis with categorical variables

• If there is only one single category: fit one regression line.
• If there are two categories: fit a separate line to each data set,
  – then compare the intercepts and slopes using confidence
    intervals or t-tests.
• A better approach is regression with categorical variables, as
  follows ……
Regression analysis with categorical variables

• Define a categorical variable Z:
  Z = 0 for one data set and Z = 1 for the other.
• The augmented model is:
  y = b0 + b1·x + b2·Z + b3·Z·x
  – Three independent variables:
    • x, Z, and Zx
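A minimal sketch of the augmented model on hypothetical two-group data, assuming numpy; b2 shifts the intercept and b3 shifts the slope for the Z = 1 group:

```python
import numpy as np

# Hypothetical data from two groups (Z = 0 or 1)
rng = np.random.default_rng(5)
n = 40
x = rng.uniform(0, 10, n)
z = (np.arange(n) >= n // 2).astype(float)  # half group 0, half group 1
y = 2.0 + 0.5 * x + 1.0 * z + 0.3 * z * x + rng.normal(0, 0.5, n)

# Augmented model: y = b0 + b1*x + b2*Z + b3*Z*x
X = np.column_stack([np.ones(n), x, z, z * x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# b2 is the shift in intercept for group 1; b3 the shift in slope
print("b0, b1, b2, b3 =", b)
```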
Regression analysis with categorical variables

• Case study
• Cosby Creek, in the southern Appalachian Mountains, was monitored
  during three storms to study how pH and other measures of
  acidification were affected by the rainfall in that region.
  Samples were taken every 30 minutes and 19 characteristics of the
  stream water chemistry were measured. Weak acidity (WA) and pH
  will be examined in this case study.

• 17 observations for storm 1,
• 14 for storm 2, and
• 13 for storm 3.
Regression analysis with categorical variables

• For three categories, two dummy variables are needed:
  Z1 = 1 for storm 1 (else 0); Z2 = 1 for storm 2 (else 0);
  storm 3 has Z1 = Z2 = 0.
• The model becomes:
  pH = b0 + b1·WA + b2·Z1 + b3·Z2 + b4·Z1·WA + b5·Z2·WA
• Make a data table …
Regression analysis with categorical variables

• Multiple regression on WA, Z1, Z2, Z1·WA, and Z2·WA (see the
  sketch after the table);
• Remove terms with p-value > 0.05;
• Then re-fit the regression.

Storm   WA   Z1   Z2   Z1·WA   Z2·WA   pH
1      190    1    0    190      0     5.96
1      110    1    0    110      0     6.08
1      150    1    0    150      0     5.93
1      170    1    0    170      0     5.99
1      170    1    0    170      0     6.01
1      170    1    0    170      0     5.97
1      200    1    0    200      0     5.88
1      140    1    0    140      0     6.06
1      140    1    0    140      0     6.06
1      160    1    0    160      0     6.03
1      140    1    0    140      0     6.02
1      110    1    0    110      0     6.17
1      110    1    0    110      0     6.31
1      120    1    0    120      0     6.27
1      110    1    0    110      0     6.42
1      110    1    0    110      0     6.28
1      110    1    0    110      0     6.43
2      140    0    1      0    140     6.33
2      140    0    1      0    140     6.43
2      120    0    1      0    120     6.37
2      190    0    1      0    190     6.09
2      120    0    1      0    120     6.32
2      110    0    1      0    110     6.37
2      110    0    1      0    110     6.73
2      100    0    1      0    100     6.89
2      100    0    1      0    100     6.87
2      120    0    1      0    120     6.30
2      120    0    1      0    120     6.52
2      100    0    1      0    100     6.39
2       80    0    1      0     80     6.87
2      100    0    1      0    100     6.85
3      580    0    0      0      0     5.82
3      640    0    0      0      0     5.94
3      500    0    0      0      0     5.73
3      530    0    0      0      0     5.91
3      670    0    0      0      0     5.87
3      670    0    0      0      0     5.80
3      640    0    0      0      0     5.80
3      640    0    0      0      0     5.78
3      560    0    0      0      0     5.78
3      590    0    0      0      0     5.73
3      640    0    0      0      0     5.63
3      590    0    0      0      0     5.79
3      600    0    0      0      0     6.02
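The full dummy-variable model can be fitted to this table with ordinary least squares; a minimal numpy sketch using the (storm, WA, pH) triples above, with coefficients named as in the model stated earlier:

```python
import numpy as np

# (storm, WA, pH) from the case-study table
data = [
    (1,190,5.96),(1,110,6.08),(1,150,5.93),(1,170,5.99),(1,170,6.01),
    (1,170,5.97),(1,200,5.88),(1,140,6.06),(1,140,6.06),(1,160,6.03),
    (1,140,6.02),(1,110,6.17),(1,110,6.31),(1,120,6.27),(1,110,6.42),
    (1,110,6.28),(1,110,6.43),
    (2,140,6.33),(2,140,6.43),(2,120,6.37),(2,190,6.09),(2,120,6.32),
    (2,110,6.37),(2,110,6.73),(2,100,6.89),(2,100,6.87),(2,120,6.30),
    (2,120,6.52),(2,100,6.39),(2,80,6.87),(2,100,6.85),
    (3,580,5.82),(3,640,5.94),(3,500,5.73),(3,530,5.91),(3,670,5.87),
    (3,670,5.80),(3,640,5.80),(3,640,5.78),(3,560,5.78),(3,590,5.73),
    (3,640,5.63),(3,590,5.79),(3,600,6.02),
]
storm = np.array([d[0] for d in data])
wa = np.array([d[1] for d in data], dtype=float)
ph = np.array([d[2] for d in data])

z1 = (storm == 1).astype(float)  # dummy for storm 1
z2 = (storm == 2).astype(float)  # dummy for storm 2 (storm 3 is baseline)

# Full model: pH = b0 + b1*WA + b2*Z1 + b3*Z2 + b4*Z1*WA + b5*Z2*WA
X = np.column_stack([np.ones(len(ph)), wa, z1, z2, z1 * wa, z2 * wa])
b, *_ = np.linalg.lstsq(X, ph, rcond=None)
print("coefficients:", b)
```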
Regression analysis with categorical variables

• Observe and further simplify the model:
  – The last two terms (Z1·WA and Z2·WA) have similar confidence
    intervals;
  – Combine them into one variable then?
  – Z3·WA = Z1·WA + Z2·WA
Regression analysis with categorical variables

• How do we judge whether the simplified model is better or worse?
  – Look at the statistics:
  – Compare the regression sum of squares of the simplified model
    with that of the more complicated one.
  – This is a measure of how well the model fits the data.
  – Dropping an important term causes the regression sum of squares
    to decrease by a noteworthy amount;
  – Dropping an unimportant term changes the regression sum of
    squares very little.
Regression analysis with categorical variables

• From model A to model C, we have reduced the number of variables
  from 5 to 3.
• Compare the regression sums of squares of models A and C with an
  F test (see the sketch below).
• Compare this with the critical F ratio F(0.05, 2, 38) = 3.25.
• 1.44 < 3.25, so we cannot reject H0: the two models fit the data
  equally well, and the simpler model C can be kept.
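A hedged sketch of this model-comparison F test, assuming scipy; the helper extra_ss_f and its argument names are illustrative, and the slide's own sums of squares would be plugged in:

```python
from scipy.stats import f as f_dist

def extra_ss_f(rss_full, rss_reduced, sse_full, df_error_full, n_dropped):
    """F ratio for testing whether n_dropped terms can be removed:
    the drop in regression sum of squares per dropped term, divided
    by the full model's mean square error."""
    mse_full = sse_full / df_error_full
    return ((rss_full - rss_reduced) / n_dropped) / mse_full

# Slide comparison: dropping 2 terms (model A -> model C),
# error df = 44 observations - 6 parameters = 38
critical = f_dist.ppf(0.95, 2, 38)
print(critical)  # ~3.25; the slide's F = 1.44 is below this, so keep model C
```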
Regression analysis with categorical variables

• Based on model C:
  – By substituting the values of Z1, Z2, and Z3 in turn, we can
    obtain an equation for each of the three storms.