Section 2
Main objective of this session
Aim:
• To introduce linear regression as a data mining technique.
Learning outcomes
1. Identify when linear regression can be used.
2. Identify the limitations and the main assumptions of linear regression models.
1.4 SEMMA Process
SAMPLE → EXPLORE → MODIFY → MODEL → ASSESS
2.1 Relationship between variables
– Usually the target variable is continuous.
– Try to explain how it depends on the other "independent variables".
– What is the best linear relationship between the variables (regression)?
$$y = a + bx + e$$
– How strong is that relationship (correlation)?
– Actual value = underlying pattern + noise: $y = (a + bx) + e$.
– For each data point, the error or residual (the deviation) is
$$e_i = y_{\text{data point}} - y_{\text{equation}} = y_i - \hat{y}_i,$$
and the mean squared error is $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^2$.
[Figure: scatterplot of the y-variable against the x-variable, showing a data point's deviation from the fitted line]
2.3 Line of best fit
• Given a set of data $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$, what is the line of best fit $\hat{y} = \hat{a} + \hat{b}x$?
• For each data point, the error or residual is $e_i = y_i - (\hat{a} + \hat{b}x_i)$.
• Minimise the mean squared error:
$$\frac{1}{n}\sum_{i=1}^{n} e_i^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{a} - \hat{b}x_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^2 - 2\hat{a}y_i - 2\hat{b}y_i x_i + 2\hat{a}\hat{b}x_i + \hat{a}^2 + \hat{b}^2 x_i^2\right)$$
• To solve this one can use differentiation or completing the square, giving
$$\hat{b} = \frac{n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}, \qquad \hat{a} = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - \hat{b}\sum_{i=1}^{n} x_i\right)$$
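These closed-form formulas translate directly into code. A minimal sketch in Python/NumPy (the helper name fit_line is ours, not from the slides):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line of best fit, using the closed-form formulas above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b_hat = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
        n * np.sum(x ** 2) - np.sum(x) ** 2)
    a_hat = (np.sum(y) - b_hat * np.sum(x)) / n
    return a_hat, b_hat
```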
2.3 Example
Advertising Example

Advertising budget (x):   20    40    60    80
Units sold (y):          120   170   210   230
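Applying the formulas from 2.3 (reusing the fit_line sketch above) gives the fitted line $\hat{y} = 90 + 1.85x$:

```python
x = [20, 40, 60, 80]      # advertising budget
y = [120, 170, 210, 230]  # units sold
a_hat, b_hat = fit_line(x, y)
print(a_hat, b_hat)  # 90.0 1.85 -> units sold ~ 90 + 1.85 * budget
```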
2.4 Coefficient of determination
$$R^2 = \frac{\text{Explained SSE}}{\text{Total SSE}}$$
• Can show that: Total SSE = Explained SSE + Unexplained SSE.
• Equivalently,
$$R^2 = \frac{\left(n\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i\right)^2}{\left(n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right)\left(n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right)}$$
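For the advertising example this formula gives $R^2 \approx 0.97$, i.e. the fitted line explains about 97% of the variation in units sold. A quick check in Python:

```python
import numpy as np

x = np.array([20, 40, 60, 80], dtype=float)
y = np.array([120, 170, 210, 230], dtype=float)
n = len(x)
num = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) ** 2
den = (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2)
print(num / den)  # ~0.9675
```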
2.5 Is b = 0?
• Useful to decide whether that variable should be in the regression.
$$\text{Total SS} = SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
$$\text{Unexplained SS} = \text{Error SS} = SSE = \sum_{i=1}^{n}\left(y_i - (\hat{a} + \hat{b}x_i)\right)^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
$$\text{Explained SS} = \text{Regression SS} = SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \hat{b}^2\sum_{i=1}^{n}(x_i - \bar{x})^2$$
• $H_0: b = 0$; $H_1: b \neq 0$;
check since $t = \hat{b}/s_b$ has Student's t distribution with $n-2$ degrees of freedom.
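A sketch of this t-test in Python (SciPy assumed available; fit_line is the helper sketched in 2.3):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """t-test of H0: b = 0 for a simple linear regression (sketch)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    a_hat, b_hat = fit_line(x, y)                    # from the sketch in 2.3
    resid = y - (a_hat + b_hat * x)
    s2 = np.sum(resid ** 2) / (n - 2)                # error variance estimate
    s_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of b_hat
    t = b_hat / s_b
    p = 2 * stats.t.sf(abs(t), df=n - 2)             # two-sided p-value
    return t, p
```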
Log Horsepower is Y

The REG Procedure
Model: MODEL1
Dependent Variable: loghorse

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1          35.89161       35.89161    1235.69    <.0001
Error             398          11.56023        0.02905
Corrected Total   399          47.45184

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1               3.54381           0.03099     114.36      <.0001
WEIGHT        1            0.00035187        0.00001001      35.15      <.0001

The scatterplot of residuals vs. predicted values now shows much more homogeneous variance at all the predicted values. We still see a couple of outliers.
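A sketch of how a model like this could be fitted in Python with statsmodels; the data here are simulated to echo the slide's estimates, so the numbers are illustrative only:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
weight = rng.uniform(1500, 4500, size=400)   # illustrative car weights
# slope/intercept/noise chosen to mimic the SAS output above
loghorse = 3.54381 + 0.00035187 * weight + rng.normal(0, 0.17, size=400)

X = sm.add_constant(weight)                  # adds the intercept column
model = sm.OLS(loghorse, X).fit()
print(model.summary())                       # ANOVA table and parameter estimates
```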
2.7 Multiple linear regression
• This is where there are several independent variables $X_1, X_2, \ldots, X_p$ which can affect the dependent variable $Y$. One wants to find the best linear combination that describes this relationship.
• The approach is similar to the calculation in the single-independent-variable case, but the calculations are harder and usually done on the computer. We can use Data Analysis in Excel or Enterprise Guide in SAS to do the calculations, but $R^2$ etc. have the same interpretation in this case (a sketch of the underlying calculation follows the assumptions below).
• Assumptions
  – The Xs are independent of the error term.
  – The expected value of Y is a weighted linear sum of the independent X variables.
  – The error terms have mean 0 and the same variance $\sigma^2$ for all errors (homoscedasticity/uniform variance).
  – The errors are not correlated with one another.
  – There is no linear relationship among the Xs.
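A minimal sketch of that calculation in Python/NumPy, solving the least-squares problem for several Xs at once (fit_multiple is our name, not a standard function):

```python
import numpy as np

def fit_multiple(X, y):
    """Least-squares coefficients for multiple linear regression (sketch).

    X: (n, p) array of independent variables; y: length-n target.
    Returns (b0, b1, ..., bp), with b0 the intercept.
    """
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef
```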
2.8 Adjusted R²
$$\text{Total SS} = SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
$$\text{Unexplained SS} = \text{Error SS} = SSE = \sum_{i=1}^{n}\left(y_i - (a + b_1 x_{1,i} + b_2 x_{2,i} + \cdots + b_m x_{m,i})\right)^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
$$\text{Explained SS} = \text{Regression SS} = SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$
• $R^2 = SSR/SST$, and the adjusted version $1 - (1 - R^2)\frac{n-1}{n-m-1}$ (with $m$ predictors) penalises the addition of extraneous variables; see note k in section 2.11.
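A short sketch computing both quantities from a fitted model's predictions:

```python
import numpy as np

def r_squared(y, y_hat, m):
    """R^2 and adjusted R^2 for a model with m predictors (sketch)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - m - 1)
    return r2, adj_r2
```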
2.9 F test
• Whether the coefficients of individual variables are 0 can still be checked using Student's t-test.
• Can also test whether the coefficients of a subset of the regression variables are 0. With the model
$$y_i = b_0 + b_1 x_{1,i} + \cdots + b_K x_{K,i} + b_{K+1} x_{K+1,i} + \cdots + b_{K+r} x_{K+r,i} + e_i$$
test $H_0: b_{K+1} = \cdots = b_{K+r} = 0$ (with $b_i \neq 0$ for $i = 1, \ldots, K$).
• Calculate SSE from the whole model and also SSE(K) from the model which only has the first K Xs.
• $(SSE(K) - SSE)/r$ is an estimate of the "variance of errors from the last r variables"; $SSE/(n-K-r-1)$ is an estimate of the variance of errors from the full model.
$$F = \frac{(SSE(K) - SSE)/r}{SSE/(n-K-r-1)}$$
is an F statistic with $(r, n-K-r-1)$ degrees of freedom. The hypothesis is rejected if the chance of the result occurring when the hypothesis is true is smaller than the chosen significance level $\alpha$:
$$\text{Reject } H_0 \text{ if } \frac{(SSE(K)-SSE)/r}{SSE/(n-K-r-1)} > F_{r,\, n-K-r-1,\, \alpha}$$
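A sketch of this partial F-test in Python (SciPy assumed; the function name is ours):

```python
import numpy as np
from scipy import stats

def partial_f_test(X_full, X_reduced, y):
    """F-test of H0: the extra columns in X_full have zero coefficients (sketch)."""
    def sse(X):
        X = np.column_stack([np.ones(len(X)), X])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ coef) ** 2)
    n = len(y)
    r = X_full.shape[1] - X_reduced.shape[1]   # number of extra variables
    df_full = n - X_full.shape[1] - 1          # n - K - r - 1
    f = ((sse(X_reduced) - sse(X_full)) / r) / (sse(X_full) / df_full)
    p = stats.f.sf(f, r, df_full)              # reject H0 if p < alpha
    return f, p
```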
2.10: Ways of introducing variables into multiple regression
• Constant
  – Put all the variables into the equation initially and calculate coefficients.
• Forward
  – Start with only the constant term. Use the F test to see which variable has the lowest probability of its coefficient being zero. Introduce it to the equation and repeat the procedure. Stop when the chance of a zero coefficient is above some predefined level.
• Backward
  – Start with all variables in the equation. Apply the F test to each variable in turn and remove the variable with the highest chance of having a zero coefficient. Recalculate the F test values and repeat. Stop when no remaining variable has a chance of a zero coefficient below some pre-agreed level.
• Stepwise
  – Start with only the constant term. At each stage introduce a variable as in the forward approach, but also then check whether one should remove any of the variables already in the equation, as in the backward approach (see the sketch after this list).
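A sketch of the forward approach, built on the partial_f_test sketch from 2.9 (illustrative, not a library routine):

```python
def forward_select(X, y, names, alpha=0.05):
    """Forward selection: add the variable with the lowest F-test p-value (sketch)."""
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        # p-value for adding each candidate variable to the current model
        pvals = {j: partial_f_test(X[:, chosen + [j]], X[:, chosen], y)[1]
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:     # chance of zero coefficient too high: stop
            break
        chosen.append(best)
        remaining.remove(best)
    return [names[j] for j in chosen]
```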
2.11 Output of multiple linear regression
Anova Table

Analysis of Variance

Source (a)          DF (b)    Sum of Squares (c)    Mean Square (d)    F Value (e)    Pr > F (f)
Model                    4            9543.72074         2385.93019          46.69        <.0001
Error                  195            9963.77926           51.09630
Corrected Total        199                 19507

a. Source - Looking at the breakdown of variance in the outcome variable, these are the categories we will examine: Model, Error, and Corrected Total. The Total variance is partitioned into the variance which can be explained by the independent variables (Model) and the variance which is not explained by the independent variables (Error).
b. DF - These are the degrees of freedom associated with the sources of variance. The total variance has N-1 degrees of freedom. The model degrees of freedom correspond to the number of coefficients estimated minus 1. Including the intercept, there are 5 coefficients, so the model has 5-1=4 degrees of freedom. The Error degrees of freedom is the DF total minus the DF model, 199 - 4 = 195.
c. Sum of Squares - These are the Sums of Squares associated with the three sources of variance: Total, Model and Error.
d. Mean Square - These are the Mean Squares, the Sums of Squares divided by their respective DF.
e. F Value - This is the F-statistic: the Mean Square Model (2385.93019) divided by the Mean Square Error (51.09630), yielding F = 46.69.
f. Pr > F - This is the p-value associated with the above F-statistic. It is used in testing the null hypothesis that all of the model coefficients are 0.
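A quick check of the arithmetic in (d)-(f), using the table's own numbers:

```python
from scipy import stats

ms_model = 9543.72074 / 4               # Mean Square = Sum of Squares / DF
ms_error = 9963.77926 / 195
f_value = ms_model / ms_error           # ~46.69
p_value = stats.f.sf(f_value, 4, 195)   # Pr > F, well below .0001
print(ms_model, ms_error, f_value, p_value)
```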
2.11 Output of multiple linear regression
Overall Model Fit

Root MSE (g)           7.14817      R-Square (j)    0.4892
Dependent Mean (h)    51.85000      Adj R-Sq (k)    0.4788
Coeff Var (i)         13.78624

g. Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Error.
h. Dependent Mean - This is the mean of the dependent variable.
i. Coeff Var - This is the coefficient of variation, which is a unit-less measure of variation in the data. It is the root MSE divided by the mean of the dependent variable, multiplied by 100: (100*(7.15/51.85) = 13.79).
j. R-Square - R-Squared is the proportion of variance in the dependent variable (science) which can be explained by the independent variables (math, female, socst and read). This is an overall measure of the strength of association and does not reflect the extent to which any particular independent variable is associated with the dependent variable.
k. Adj R-Sq - This is an adjustment of the R-squared that penalizes the addition of extraneous predictors to the model. Adjusted R-squared is computed with the formula 1 - ((1 - Rsq)(N - 1)/(N - k - 1)), where k is the number of predictors.
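These fit statistics are easy to verify from the tables above (N = 200, since the corrected total has 199 DF):

```python
r2, n, k = 0.4892, 200, 4
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # ~0.4788
coeff_var = 100 * 7.14817 / 51.85               # ~13.79
print(adj_r2, coeff_var)
```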
Parameter Estimates

Variable (l)   Label (m)               DF (n)   Parameter Estimate (o)   Standard Error (p)   t Value (q)   Pr > |t| (r)
Intercept      Intercept                   1                 12.32529              3.19356          3.86         0.0002
math           math score                  1                  0.38931              0.07412          5.25         <.0001
female                                     1                 -2.00976              1.02272         -1.97         0.0508
socst          social studies score        1                  0.04984              0.06223          0.80         0.4241
read           reading score               1                  0.33530              0.07278          4.61         <.0001

Variable (l)   Label (m)               DF (n)   95% Confidence Limits (s)
Intercept      Intercept                   1        6.02694      18.62364
math           math score                  1        0.24312       0.53550
female                                     1       -4.02677       0.00724
socst          social studies score        1       -0.07289       0.17258
read           reading score               1        0.19177       0.47883
l. Variable - This column shows the predictor variables (constant, math, female, socst, read). The first refers to the model intercept, the height of the regression line where it crosses the Y axis. In other words, this is the predicted value of science when all other variables are 0.
m. Label - This column gives the label for the variable. Usually, variable labels are added when the data set is created so that it is clear what the variable is (as the name of the variable can sometimes be ambiguous). SAS has labeled the variable Intercept for us by default. Note that this variable is not added to the data set.
n. DF - This column gives the degrees of freedom associated with each independent variable. All continuous variables have one degree of freedom, as do binary variables (such as female).
o. Parameter Estimates - These are the values for the regression equation for predicting the dependent variable from the independent variables. The regression equation can be presented in many different ways, for example:
predicted science = 12.32529 + 0.38931*math - 2.00976*female + 0.04984*socst + 0.33530*read
The column of estimates provides the values for b0, b1, b2, b3 and b4 for this equation.
math - The coefficient is 0.38931. So for every unit increase in math, a 0.38931 unit increase in science is predicted, holding all other variables constant.
female - For every unit increase in female, we expect a 2.00976 unit decrease in the science score, holding all other variables constant. Since female is coded 0/1 (0=male, 1=female) the interpretation is more simply: for females, the predicted science score would be 2 points lower than for males.
socst - The coefficient for socst is 0.04984. So for every unit increase in socst, we expect an approximately .05 point increase in the science score, holding all other variables constant.
read - The coefficient for read is 0.33530. So for every unit increase in read, we expect a .34 point increase in the science score.
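The fitted equation in (o) as code, using the estimates from the table (the example inputs are illustrative):

```python
def predict_science(math, female, socst, read):
    """Predicted science score from the parameter estimates above."""
    return (12.32529 + 0.38931 * math - 2.00976 * female
            + 0.04984 * socst + 0.33530 * read)

print(predict_science(math=60, female=1, socst=55, read=60))  # ~56.5
```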
p. Standard Error - These are the standard errors associated with the coefficients.
q. t Value - These are the t-statistics used in testing whether a given coefficient is significantly different from zero.
r. Pr > |t| - This column shows the 2-tailed p-values used in testing the null hypothesis that the coefficient (parameter) is 0. Using an alpha of 0.05:
The coefficient for math is significantly different from 0 because its p-value (<.0001) is smaller than 0.05.
The coefficient for socst (0.04984) is not statistically significantly different from 0 because its p-value (0.4241) is larger than 0.05.
The coefficient for read (0.33530) is statistically significant because its p-value (<.0001) is less than .05.
The intercept is significantly different from 0 at the 0.05 alpha level.
s. 95% Confidence Limits - These are the 95% confidence intervals for the coefficients. The confidence intervals are related to the p-values such that the coefficient will not
be statistically significant if the confidence interval includes 0. These confidence intervals can help you to put the estimate from the coefficient into perspective by seeing
how much the value could vary.
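The confidence limits in (s) can be reproduced from the estimate and standard error, using the t critical value with the error degrees of freedom (195) from the ANOVA table:

```python
from scipy import stats

est, se, df = 0.38931, 0.07412, 195          # math coefficient
t_crit = stats.t.ppf(0.975, df)              # two-sided 95% critical value
print(est - t_crit * se, est + t_crit * se)  # ~(0.2431, 0.5355), matching the table
```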
2.12 Testing the model
• Verifying if the assumptions of the model are satisfied.
• Assumption: the X's are independent of the error term.
  – Plot each independent variable vs. the residuals (errors).
  – The graph should have a cloud shape, with no trends.
  – This can be complemented by calculating the correlation between the residuals and the variables.
  – If there is a trend, it could mean the model is not capturing all the information from this variable; in that case the variable should be modified (a plotting sketch follows below).
[Figures: residual plot of Residuals vs. X Variable 1 showing a patternless cloud; a second plot against Xi showing a clear trend]
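A sketch of this diagnostic in Python/matplotlib (residuals and x would come from an earlier fit, e.g. the fit_multiple sketch in 2.7):

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(x, residuals, name="X Variable 1"):
    """Scatter residuals against one X and report their correlation (sketch)."""
    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel(name)
    plt.ylabel("Residuals")
    plt.title(f"{name} vs. residuals: should be a patternless cloud")
    plt.show()
    print("corr(residuals,", name, ") =", np.corrcoef(x, residuals)[0, 1])
```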
2.12 Testing the model
• Assumption: the error terms have mean 0 and constant variance $\sigma^2$ for all errors (homoscedasticity/uniform variance).
  – Plot the Y's versus the residuals.
  – The graph should have a cloud shape, with no trends.
  – This can also be complemented by calculating the correlation between the residuals and the variables.
  – If there is a trend, it means that $E(e)$ is not 0 (a "bias").
[Figure: residual plot of Y vs. residuals showing a patternless cloud]
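The matching sketch for this check, continuing the previous example:

```python
import numpy as np
import matplotlib.pyplot as plt

def check_errors(y_hat, residuals):
    """Plot residuals against fitted Y and check the mean-zero assumption (sketch)."""
    plt.scatter(y_hat, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Y (fitted)")
    plt.ylabel("Residuals")
    plt.title("Y vs. residuals: look for trends or changing spread")
    plt.show()
    print("mean residual (should be ~0):", np.mean(residuals))
```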