Section 2
Main objective of this session
Aim:
• To introduce linear regression as a data mining technique.
Learning outcomes
1. Identify when linear regression can be used.
2. Identify the limitations and the main assumptions of linear regression models.
1.4 SEMMA Process
SAMPLE → EXPLORE → MODIFY → MODEL → ASSESS
2.1 Relationship between variables
– Usually the target variable is continuous.
– Try to explain how it depends on the other "independent variables".
– What is the best linear relationship between the variables (regression)?
$$y = a + bx + e$$
– How strong is that relationship (correlation)?
– Actual value = underlying pattern + noise: $y = (a + bx) + e$.
– For each data point, the error or residual (the deviation) is
$$e_i = y_{\text{data point}} - y_{\text{equation}} = y_i - \hat{y}_i,$$
and the mean squared error is $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^2$.
[Figure: scatterplot of the y-variable against the x-variable, showing a data point's deviation from the fitted line]
2.3 Line of best fit
• Given a set of data $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$, what is the line of best fit $\hat{y} = \hat{a} + \hat{b}x$?
• For each data point, the error or residual is $e_i = y_i - (\hat{a} + \hat{b}x_i)$.
• Minimise the mean squared error:
$$\frac{1}{n}\sum_{i=1}^{n} e_i^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{a} - \hat{b}x_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^2 - 2\hat{a}y_i - 2\hat{b}y_i x_i + 2\hat{a}\hat{b}x_i + \hat{a}^2 + \hat{b}^2 x_i^2\right)$$
• To solve this one can use differentiation or completing the square, giving
$$\hat{b} = \frac{n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}, \qquad \hat{a} = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - \hat{b}\sum_{i=1}^{n} x_i\right)$$
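These closed-form formulas translate directly into code. A minimal sketch in Python/NumPy (the helper name fit_line is ours, not from the slides):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line of best fit, using the closed-form formulas above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b_hat = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
        n * np.sum(x ** 2) - np.sum(x) ** 2)
    a_hat = (np.sum(y) - b_hat * np.sum(x)) / n
    return a_hat, b_hat
```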
2.3 Example
Advertising Example

Advertising budget (x):   20    40    60    80
Units sold (y):          120   170   210   230
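Applying the formulas from 2.3 (reusing the fit_line sketch above) gives the fitted line $\hat{y} = 90 + 1.85x$:

```python
x = [20, 40, 60, 80]      # advertising budget
y = [120, 170, 210, 230]  # units sold
a_hat, b_hat = fit_line(x, y)
print(a_hat, b_hat)  # 90.0 1.85 -> units sold ~ 90 + 1.85 * budget
```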
2.4 Coefficient of determination
$$R^2 = \frac{\text{Explained SSE}}{\text{Total SSE}}$$
• Can show that: Total SSE = Explained SSE + Unexplained SSE.
• Equivalently,
$$R^2 = \frac{\left(n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i\right)^2}{\left(n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right)\left(n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right)}$$
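For the advertising example this formula gives $R^2 \approx 0.97$, i.e. the fitted line explains about 97% of the variation in units sold. A quick check in Python:

```python
import numpy as np

x = np.array([20, 40, 60, 80], dtype=float)
y = np.array([120, 170, 210, 230], dtype=float)
n = len(x)
num = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) ** 2
den = (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2)
print(num / den)  # ~0.9675
```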
2.5 Is b = 0?
• Useful to decide whether that variable should be in the regression.
$$\text{Total SS} = SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
$$\text{Unexplained SS} = \text{Error SS} = SSE = \sum_{i=1}^{n}\left(y_i - (\hat{a} + \hat{b}x_i)\right)^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
$$\text{Explained SS} = \text{Regression SS} = SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \hat{b}^2\sum_{i=1}^{n}(x_i - \bar{x})^2$$
• $H_0: b = 0$; $H_1: b \neq 0$;
check since $t = \hat{b}/s_b$ has Student's t distribution with $n-2$ degrees of freedom.
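A sketch of this t-test in Python (SciPy assumed available; fit_line is the helper sketched in 2.3):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """t-test of H0: b = 0 for a simple linear regression (sketch)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    a_hat, b_hat = fit_line(x, y)                    # from the sketch in 2.3
    resid = y - (a_hat + b_hat * x)
    s2 = np.sum(resid ** 2) / (n - 2)                # error variance estimate
    s_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of b_hat
    t = b_hat / s_b
    p = 2 * stats.t.sf(abs(t), df=n - 2)             # two-sided p-value
    return t, p
```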
Log Horsepower is Y

The REG Procedure
Model: MODEL1
Dependent Variable: loghorse

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1          35.89161       35.89161    1235.69    <.0001
Error             398          11.56023        0.02905
Corrected Total   399          47.45184

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1               3.54381           0.03099     114.36      <.0001
WEIGHT        1            0.00035187        0.00001001      35.15      <.0001

The scatterplot of residuals vs. predicted values now shows much more homogeneous variance at all the predicted values. We still see a couple of outliers.
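A sketch of how a model like this could be fitted in Python with statsmodels; the data here are simulated to echo the slide's estimates, so the numbers are illustrative only:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
weight = rng.uniform(1500, 4500, size=400)   # illustrative car weights
# slope/intercept/noise chosen to mimic the SAS output above
loghorse = 3.54381 + 0.00035187 * weight + rng.normal(0, 0.17, size=400)

X = sm.add_constant(weight)                  # adds the intercept column
model = sm.OLS(loghorse, X).fit()
print(model.summary())                       # ANOVA table and parameter estimates
```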
2.7 Multiple linear regression
• This is where there are several independent variables $X_1, X_2, \ldots, X_p$ which can affect the dependent variable $Y$. One wants to find the best linear combination that describes this relationship.
• The approach is similar to the calculation in the single-independent-variable case, but the calculations are harder and usually done on the computer. We can use Data Analysis in Excel or Enterprise Guide in SAS to do the calculations, but $R^2$ etc. have the same interpretation in this case (a sketch of the underlying calculation follows the assumptions below).
• Assumptions
  – The Xs are independent of the error term.
  – The expected value of Y is a weighted linear sum of the independent X variables.
  – The error terms have mean 0 and the same variance $\sigma^2$ for all errors (homoscedasticity/uniform variance).
  – The errors are not correlated with one another.
  – There is no linear relationship among the Xs.
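A minimal sketch of that calculation in Python/NumPy, solving the least-squares problem for several Xs at once (fit_multiple is our name, not a standard function):

```python
import numpy as np

def fit_multiple(X, y):
    """Least-squares coefficients for multiple linear regression (sketch).

    X: (n, p) array of independent variables; y: length-n target.
    Returns (b0, b1, ..., bp), with b0 the intercept.
    """
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef
```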
2.8 Adjusted R²
$$\text{Total SS} = SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
$$\text{Unexplained SS} = \text{Error SS} = SSE = \sum_{i=1}^{n}\left(y_i - (a + b_1 x_{1,i} + b_2 x_{2,i} + \cdots + b_m x_{m,i})\right)^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
$$\text{Explained SS} = \text{Regression SS} = SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$
• $R^2 = SSR/SST$, and the adjusted version $1 - (1 - R^2)\frac{n-1}{n-m-1}$ (with $m$ predictors) penalises the addition of extraneous variables; see note k in section 2.11.
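A short sketch computing both quantities from a fitted model's predictions:

```python
import numpy as np

def r_squared(y, y_hat, m):
    """R^2 and adjusted R^2 for a model with m predictors (sketch)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - m - 1)
    return r2, adj_r2
```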
2.9 F test
• Whether the coefficients of individual variables are 0 can still be checked using Student's t-test.
• Can also test whether the coefficients of a subset of the regression variables are 0. With the model
$$y_i = b_0 + b_1 x_{1,i} + \cdots + b_K x_{K,i} + b_{K+1} x_{K+1,i} + \cdots + b_{K+r} x_{K+r,i} + e_i$$
test $H_0: b_{K+1} = \cdots = b_{K+r} = 0$ (with $b_i \neq 0$ for $i = 1, \ldots, K$).
• Calculate SSE from the whole model and also SSE(K) from the model which only has the first K Xs.
• $(SSE(K) - SSE)/r$ is an estimate of the "variance of errors from the last r variables"; $SSE/(n-K-r-1)$ is an estimate of the variance of errors from the full model.
$$F = \frac{(SSE(K) - SSE)/r}{SSE/(n-K-r-1)}$$
is an F statistic with $(r, n-K-r-1)$ degrees of freedom. The hypothesis is rejected if the chance of the result occurring when the hypothesis is true is smaller than the chosen significance level $\alpha$:
$$\text{Reject } H_0 \text{ if } \frac{(SSE(K)-SSE)/r}{SSE/(n-K-r-1)} > F_{r,\, n-K-r-1,\, \alpha}$$
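A sketch of this partial F-test in Python (SciPy assumed; the function name is ours):

```python
import numpy as np
from scipy import stats

def partial_f_test(X_full, X_reduced, y):
    """F-test of H0: the extra columns in X_full have zero coefficients (sketch)."""
    def sse(X):
        X = np.column_stack([np.ones(len(X)), X])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ coef) ** 2)
    n = len(y)
    r = X_full.shape[1] - X_reduced.shape[1]   # number of extra variables
    df_full = n - X_full.shape[1] - 1          # n - K - r - 1
    f = ((sse(X_reduced) - sse(X_full)) / r) / (sse(X_full) / df_full)
    p = stats.f.sf(f, r, df_full)              # reject H0 if p < alpha
    return f, p
```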
2.10: Ways of introducing variables into multiple regression
• Constant
  – Put all the variables into the equation initially and calculate coefficients.
• Forward
  – Start with only the constant term. Use the F test to see which variable has the lowest probability of its coefficient being zero. Introduce it to the equation and repeat the procedure. Stop when the chance of a zero coefficient is above some predefined level.
• Backward
  – Start with all variables in the equation. Apply the F test to each variable in turn and remove the variable with the highest chance of having a zero coefficient. Recalculate the F test values and repeat. Stop when no remaining variable has a chance of a zero coefficient below some pre-agreed level.
• Stepwise
  – Start with only the constant term. At each stage introduce a variable as in the forward approach, but also then check whether one should remove any of the variables already in the equation, as in the backward approach (see the sketch after this list).
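A sketch of the forward approach, built on the partial_f_test sketch from 2.9 (illustrative, not a library routine):

```python
def forward_select(X, y, names, alpha=0.05):
    """Forward selection: add the variable with the lowest F-test p-value (sketch)."""
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        # p-value for adding each candidate variable to the current model
        pvals = {j: partial_f_test(X[:, chosen + [j]], X[:, chosen], y)[1]
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:     # chance of zero coefficient too high: stop
            break
        chosen.append(best)
        remaining.remove(best)
    return [names[j] for j in chosen]
```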
2.11 Output of multiple linear regression
Anova Table

Analysis of Variance

Source (a)          DF (b)    Sum of Squares (c)    Mean Square (d)    F Value (e)    Pr > F (f)
Model                    4            9543.72074         2385.93019          46.69        <.0001
Error                  195            9963.77926           51.09630
Corrected Total        199                 19507

a. Source - Looking at the breakdown of variance in the outcome variable, these are the categories we will examine: Model, Error, and Corrected Total. The Total variance is partitioned into the variance which can be explained by the independent variables (Model) and the variance which is not explained by the independent variables (Error).
b. DF - These are the degrees of freedom associated with the sources of variance. The total variance has N-1 degrees of freedom. The model degrees of freedom correspond to the number of coefficients estimated minus 1. Including the intercept, there are 5 coefficients, so the model has 5-1=4 degrees of freedom. The Error degrees of freedom is the DF total minus the DF model, 199 - 4 = 195.
c. Sum of Squares - These are the Sums of Squares associated with the three sources of variance: Total, Model and Error.
d. Mean Square - These are the Mean Squares, the Sums of Squares divided by their respective DF.
e. F Value - This is the F-statistic: the Mean Square Model (2385.93019) divided by the Mean Square Error (51.09630), yielding F = 46.69.
f. Pr > F - This is the p-value associated with the above F-statistic. It is used in testing the null hypothesis that all of the model coefficients are 0.
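A quick check of the arithmetic in (d)-(f), using the table's own numbers:

```python
from scipy import stats

ms_model = 9543.72074 / 4               # Mean Square = Sum of Squares / DF
ms_error = 9963.77926 / 195
f_value = ms_model / ms_error           # ~46.69
p_value = stats.f.sf(f_value, 4, 195)   # Pr > F, well below .0001
print(ms_model, ms_error, f_value, p_value)
```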
2.11 Output of multiple linear regression
Overall Model Fit

Root MSE (g)           7.14817      R-Square (j)    0.4892
Dependent Mean (h)    51.85000      Adj R-Sq (k)    0.4788
Coeff Var (i)         13.78624

g. Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Error.
h. Dependent Mean - This is the mean of the dependent variable.
i. Coeff Var - This is the coefficient of variation, which is a unit-less measure of variation in the data. It is the root MSE divided by the mean of the dependent variable, multiplied by 100: (100*(7.15/51.85) = 13.79).
j. R-Square - R-Squared is the proportion of variance in the dependent variable (science) which can be explained by the independent variables (math, female, socst and read). This is an overall measure of the strength of association and does not reflect the extent to which any particular independent variable is associated with the dependent variable.
k. Adj R-Sq - This is an adjustment of the R-squared that penalizes the addition of extraneous predictors to the model. Adjusted R-squared is computed with the formula 1 - ((1 - Rsq)(N - 1)/(N - k - 1)), where k is the number of predictors.
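These fit statistics are easy to verify from the tables above (N = 200, since the corrected total has 199 DF):

```python
r2, n, k = 0.4892, 200, 4
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # ~0.4788
coeff_var = 100 * 7.14817 / 51.85               # ~13.79
print(adj_r2, coeff_var)
```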
Parameter Estimates

Variable (l)   Label (m)               DF (n)   Parameter Estimate (o)   Standard Error (p)   t Value (q)   Pr > |t| (r)
Intercept      Intercept                   1                 12.32529              3.19356          3.86         0.0002
math           math score                  1                  0.38931              0.07412          5.25         <.0001
female                                     1                 -2.00976              1.02272         -1.97         0.0508
socst          social studies score        1                  0.04984              0.06223          0.80         0.4241
read           reading score               1                  0.33530              0.07278          4.61         <.0001

Variable (l)   Label (m)               DF (n)   95% Confidence Limits (s)
Intercept      Intercept                   1        6.02694      18.62364
math           math score                  1        0.24312       0.53550
female                                     1       -4.02677       0.00724
socst          social studies score        1       -0.07289       0.17258
read           reading score               1        0.19177       0.47883
l. Variable - This column shows the predictor variables (constant, math, female, socst, read). The first refers to the model intercept, the height of the regression line where it crosses the Y axis. In other words, this is the predicted value of science when all other variables are 0.
m. Label - This column gives the label for the variable. Usually, variable labels are added when the data set is created so that it is clear what the variable is (as the name of the variable can sometimes be ambiguous). SAS has labeled the variable Intercept for us by default. Note that this variable is not added to the data set.
n. DF - This column gives the degrees of freedom associated with each independent variable. All continuous variables have one degree of freedom, as do binary variables (such as female).
o. Parameter Estimates - These are the values for the regression equation for predicting the dependent variable from the independent variables. The regression equation can be presented in many different ways, for example:
predicted science = 12.32529 + 0.38931*math - 2.00976*female + 0.04984*socst + 0.33530*read
The column of estimates provides the values for b0, b1, b2, b3 and b4 for this equation.
math - The coefficient is 0.38931. So for every unit increase in math, a 0.38931 unit increase in science is predicted, holding all other variables constant.
female - For every unit increase in female, we expect a 2.00976 unit decrease in the science score, holding all other variables constant. Since female is coded 0/1 (0=male, 1=female) the interpretation is more simply: for females, the predicted science score would be 2 points lower than for males.
socst - The coefficient for socst is 0.04984. So for every unit increase in socst, we expect an approximately .05 point increase in the science score, holding all other variables constant.
read - The coefficient for read is 0.33530. So for every unit increase in read, we expect a .34 point increase in the science score.
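The fitted equation in (o) as code, using the estimates from the table (the example inputs are illustrative):

```python
def predict_science(math, female, socst, read):
    """Predicted science score from the parameter estimates above."""
    return (12.32529 + 0.38931 * math - 2.00976 * female
            + 0.04984 * socst + 0.33530 * read)

print(predict_science(math=60, female=1, socst=55, read=60))  # ~56.5
```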
p. Standard Error - These are the standard errors associated with the coefficients.
q. t Value - These are the t-statistics used in testing whether a given coefficient is significantly different from zero.
r. Pr > |t| - This column shows the 2-tailed p-values used in testing the null hypothesis that the coefficient (parameter) is 0. Using an alpha of 0.05:
The coefficient for math is significantly different from 0 because its p-value (<.0001) is smaller than 0.05.
The coefficient for socst (0.04984) is not statistically significantly different from 0 because its p-value (0.4241) is larger than 0.05.
The coefficient for read (0.33530) is statistically significant because its p-value (<.0001) is less than .05.
The intercept is significantly different from 0 at the 0.05 alpha level.
s. 95% Confidence Limits - These are the 95% confidence intervals for the coefficients. The confidence intervals are related to the p-values such that the coefficient will not
be statistically significant if the confidence interval includes 0. These confidence intervals can help you to put the estimate from the coefficient into perspective by seeing
how much the value could vary.
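The confidence limits in (s) can be reproduced from the estimate and standard error, using the t critical value with the error degrees of freedom (195) from the ANOVA table:

```python
from scipy import stats

est, se, df = 0.38931, 0.07412, 195          # math coefficient
t_crit = stats.t.ppf(0.975, df)              # two-sided 95% critical value
print(est - t_crit * se, est + t_crit * se)  # ~(0.2431, 0.5355), matching the table
```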
2.12 Testing the model
• Verifying if the assumptions of the model are satisfied.
• Assumption: the X's are independent of the error term.
  – Plot each independent variable vs. the residuals (errors).
  – The graph should have a cloud shape, with no trends.
  – This can be complemented by calculating the correlation between the residuals and the variables.
  – If there is a trend, it could mean the model is not capturing all the information from this variable; in that case the variable should be modified (a plotting sketch follows below).
[Figures: residual plot of Residuals vs. X Variable 1 showing a patternless cloud; a second plot against Xi showing a clear trend]
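A sketch of this diagnostic in Python/matplotlib (residuals and x would come from an earlier fit, e.g. the fit_multiple sketch in 2.7):

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(x, residuals, name="X Variable 1"):
    """Scatter residuals against one X and report their correlation (sketch)."""
    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel(name)
    plt.ylabel("Residuals")
    plt.title(f"{name} vs. residuals: should be a patternless cloud")
    plt.show()
    print("corr(residuals,", name, ") =", np.corrcoef(x, residuals)[0, 1])
```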
2.12 Testing the model
• Assumption: the error terms have mean 0 and constant variance $\sigma^2$ for all errors (homoscedasticity/uniform variance).
  – Plot the Y's versus the residuals.
  – The graph should have a cloud shape, with no trends.
  – This can also be complemented by calculating the correlation between the residuals and the variables.
  – If there is a trend, it means that $E(e)$ is not 0 (a "bias").
[Figure: residual plot of Y vs. residuals showing a patternless cloud]
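The matching sketch for this check, continuing the previous example:

```python
import numpy as np
import matplotlib.pyplot as plt

def check_errors(y_hat, residuals):
    """Plot residuals against fitted Y and check the mean-zero assumption (sketch)."""
    plt.scatter(y_hat, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Y (fitted)")
    plt.ylabel("Residuals")
    plt.title("Y vs. residuals: look for trends or changing spread")
    plt.show()
    print("mean residual (should be ~0):", np.mean(residuals))
```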