Chapter 5
'Regression' (Latin) means 'retreat', 'going back to', 'stepping back'. In a regression we try to (stepwise) retreat from our data and explain them with one or more explanatory predictor variables. We draw a 'regression line' that serves as the (linear) model of our observed data.
Correlation
Regression
In a correlation, we look at the relationship between two variables without knowing the direction of causality
In a regression, we try to predict the outcome of one variable from one or more predictor variables; a direction from predictor(s) to outcome is therefore built into the model.
1 predictor = simple regression
>1 predictor = multiple regression
Regression
For a regression you do want to find out about the relations between variables, in particular whether one 'causes' the other. Therefore, an unambiguous causal template (what is cause, what is effect) has to be established before the analysis! This template makes the analysis inferential. Regression is THE statistical method underlying ALL inferential statistics (t-test, ANOVA, etc.). All that follows is a variation of regression.
In mathematics, a coefficient is a constant multiplicative factor of a certain object. For example, the coefficient in 9x² is 9. (http://en.wikipedia.org/wiki/Coefficient)
Yi = (b0 + b1Xi) + εi
Yi = outcome we want to predict
b0 = intercept of the regression line
b1 = slope (gradient) of the regression line
b0 and b1 are the regression coefficients
εi = residual (error) term
Yi = (-4 + 1.33Xi) + εi
'goodness-of-fit'
The line of best fit (regression line) is compared with the most basic model. The former should be significantly better than the latter. The most basic model is the mean of the data.
[Figure: the observed data with the mean of Y as the model]
The summed squared differences between the observed values and the regression line (SSR) are smaller; hence the regression line is a much better model of the data than the mean.
[Figure: the observed data with the regression line as the model, compared with the mean of Y]
SSM: sum of squared differences between the mean of Y and the regression line (our model)
R² = SSM / SST
The basic comparison in statistics is always between the amount of variance our model can explain and the total amount of variance there is. If the model is good, it explains a significant proportion of this overall variance.
This is the same measure as the R2 in chapter 4 on correlation. Take the square root of R2 and you have the Pearson correlation coefficient r!
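As an illustration of these sums of squares, here is a minimal Python sketch (with made-up data, not the record-sales file) that computes SST, SSR and SSM for a simple regression and checks that R² = SSM / SST and that the square root of R² equals Pearson's r.

```python
# A minimal sketch (made-up data): compute SST, SSR and SSM for a simple
# regression and verify R^2 = SSM / SST and sqrt(R^2) = |r|.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

b1, b0 = np.polyfit(x, y, deg=1)       # slope and intercept of the line of best fit
y_hat = b0 + b1 * x                    # values predicted by the regression line

sst = np.sum((y - y.mean()) ** 2)      # total variability around the mean (basic model)
ssr = np.sum((y - y_hat) ** 2)         # residual variability around the regression line
ssm = sst - ssr                        # improvement of the regression model over the mean

r_squared = ssm / sst
r_pearson = np.corrcoef(x, y)[0, 1]

print(r_squared, r_pearson ** 2)       # the two values agree
print(np.sqrt(r_squared), r_pearson)   # sqrt(R^2) = |r|
```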
If b1 = 0, this means:
A change of one unit in the predictor X does not change the predicted variable Y.
The gradient of the regression line is 0.
t = (b_observed - b_expected) / SE_b
Since b_expected = 0:
t = b_observed / SE_b
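A minimal sketch of this t-test on the slope, on made-up data: the slope and its standard error come from scipy's linregress and are divided as in the formula above.

```python
# A minimal sketch (made-up data): t = b_observed / SE_b, because the expected
# b under the null hypothesis is 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

res = stats.linregress(x, y)      # slope, intercept, r, p-value, standard error of the slope
t = res.slope / res.stderr        # b_observed / SE_b
print(t, res.pvalue)              # how (im)probable such a b is if the true b were 0
```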
[Figures: two SPSS 'Linear Regression' scatterplots of the data]
Model 1: R² = 33% of the total variance can be explained by the predictor 'advertisement'.

ANOVA (Model 1):
df: Regression = 1, Residual = 198, Total = 199; F = 99,587; Sig. = ,000(a)
F is the ratio of the mean squares, F = MSM / MSR.
a. Predictors: (Constant), ADVERTS Advertising Budget (thousands of pounds)
b. Dependent Variable: SALES Record Sales (thousands)
Regression coefficients b0, b1: b0 = intercept, where the regression line crosses the Y axis. When no money is spent (X = 0), 134,140 records are sold (b0 = 134,14 thousand).
Coefficients(a) table:
b1 = gradient: if the predictor X is increased by 1 unit (£1000), then 96,12 extra records will be sold (b1 = ,09612 thousand records per unit).
t = B / SE_B = 134,14 / 7,537 = 17,799
[Coefficients output, Model 1: B(ADVERTS) = ,09612; t = 17,799 (Constant) and 9,979 (ADVERTS); Beta = ,578]
t = B / SE_B
For the constant: 134,14 / 7,537 = 17,799. For ADVERTS: ,09612 / ,010 should give 9,612; however, SPSS reports t = 9,979.
What's wrong? Nothing, this is a rounding error. If you double-click on the output table 'Coefficients', more exact numbers are shown: 9.612E-02 = 0,09612448597388 and .010 = 0,00963236621523. If you re-compute the equation with these numbers, the result is correct: 0,09612448597388 / 0,00963236621523 = 9.979.
Yi = b0 + b1Xi = 134.14 + (.09612 × Advertising Budgeti)
Example: if £100,000 are spent on ads (X = 100, since the budget is measured in thousands of pounds), the predicted sales are 134.14 + (.09612 × 100) = 143.75, i.e. about 143,750 records.
Is that a good deal?
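A short sketch of the arithmetic behind this worked example, using the coefficients reported above:

```python
# A short sketch of the worked example, using the coefficients reported above.
# The budget is measured in thousands of pounds, so 100,000 pounds = 100 units.
b0, b1 = 134.14, 0.09612

ad_budget = 100                        # 100,000 pounds spent on advertising
sales = b0 + b1 * ad_budget            # predicted record sales in thousands
print(round(sales, 2))                 # 143.75 -> about 143,750 records sold
```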
Multiple regression
In a multiple regression, we predict the outcome of a dependent variable Y by a linear combination of >1 independent predictor variables Xi
Outcomei = (Modeli) + errori. Every predictor variable has its own coefficient b1, b2, ..., bn:
Yi = (b0 + b1X1i + b2X2i + ... + bnXni) + εi   (5.9)
b1X1 = 1st predictor variable with its coefficient
b2X2 = 2nd predictor variable with its coefficient, etc.
εi = residual term
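A minimal sketch of such a multiple regression in Python/statsmodels; the data are simulated, and the names adverts, airplay and attract merely mirror the record-sales example discussed later.

```python
# A minimal sketch of a multiple regression in statsmodels (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
adverts = rng.uniform(0, 2000, n)                  # X1
airplay = rng.integers(0, 60, n).astype(float)     # X2
attract = rng.integers(1, 10, n).astype(float)     # X3
sales = 100 + 0.1 * adverts + 3 * airplay + 10 * attract + rng.normal(0, 50, n)

X = sm.add_constant(np.column_stack([adverts, airplay, attract]))  # b0 + b1X1 + b2X2 + b3X3
model = sm.OLS(sales, X).fit()

print(model.params)      # b0, b1, b2, b3
print(model.rsquared)    # Multiple R^2: proportion of variance in Y explained by X1..X3
```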
3D scatterplot of the relation between record sales (Y), advertising budget (X1), and no. of plays on Radio 1 per week (X2)
3-D-scatterplot
If adjusted appropriately, you can see the regression plane and the confidence planes almost like lines
The regression plane is chosen so as to cover most of the data points in the three-dimensional data cloud
Sum of squares, R, R²
The terms we encountered for simple regression, SST, SSR, SSM, still mean the same, but are more complicated to compute now.
Instead of the simple correlation coefficient R, we use a multiple correlation coefficient, Multiple R.
Multiple R is the correlation between the predicted and the observed values of the outcome. As with simple R, Multiple R should be large. Multiple R² is a measure of the variance in Y explained by the predictor variables X1-Xn.
Methods of regression
The predictors of the model should be selected carefully, e.g. based on past research or theoretically well motivated.
Hierarchical method (ordered entry): known predictors are entered first, then new ones, either blockwise (all together) or stepwise.
Forced entry ('Enter'): all predictors are forced into the model simultaneously.
Stepwise methods:
Forward: predictors are introduced one by one, according to their predictive power.
Stepwise: same as Forward, plus a removal test.
Backward: predictors are judged against a removal criterion and eliminated accordingly.
Based on the theoretical literature, choose predictors in their order of importance; do not choose too many.
Run an initial multiple regression.
Eliminate useless predictors.
Take ca. n = 15 subjects per predictor.
1. The model must fit the data sample.
2. The model should generalize beyond the sample.
This is done by running the regression without that particular case and then using the new model to predict the value of the just-excluded case (its 'adjusted predicted value'). If the case is similar to all other cases, its adjusted predicted value will not differ much from its predicted value under the model that includes it.
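A minimal sketch of the idea, with made-up data and an arbitrarily chosen case: the adjusted predicted value is the prediction for a case from a model fitted without that case.

```python
# A minimal sketch (made-up data): adjusted predicted value = prediction for a
# case from a model fitted WITHOUT it.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])   # the last case is a potential outlier
y = np.array([2.0, 3.1, 3.9, 5.2, 6.1, 2.0])
i = 5                                            # index of the case we exclude

b1, b0 = np.polyfit(x, y, 1)                                   # model including the case
b1_i, b0_i = np.polyfit(np.delete(x, i), np.delete(y, i), 1)   # model without it

predicted = b0 + b1 * x[i]                   # ordinary predicted value
adjusted_predicted = b0_i + b1_i * x[i]      # adjusted predicted value
print(predicted, adjusted_predicted)         # a large difference flags an influential case
```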
Cook's distance measures the influence of a case on the overall model's ability to predict all cases. Leverage estimates the influence of the observed value of the outcome variable over the predicted values. (Field 2005, 736)
Leverage values lie between 0 and 1 and may be used to define cut-off points for excluding influential cases.
Mahalanobis distances measure the distance of cases from the means of the predictor variables.
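These influence statistics can be obtained outside SPSS as well. Below is a minimal statsmodels sketch on simulated data; the line relating SPSS's Mahalanobis distance to centred leverage is an assumption to be checked against your own output.

```python
# A minimal sketch (simulated data): Cook's distance and leverage per case.
# The Mahalanobis line assumes Mahalanobis = (n - 1) * (h - 1/n), i.e. the usual
# relation to centred leverage; verify against your SPSS output.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(10, 2, 50)
y = 3 + 0.5 * x + rng.normal(0, 1, 50)

results = sm.OLS(y, sm.add_constant(x)).fit()
influence = results.get_influence()

cooks_d, _ = influence.cooks_distance            # values > 1 may be a problem
leverage = influence.hat_matrix_diag             # hat values, between 0 and 1
mahalanobis = (len(x) - 1) * (leverage - 1 / len(x))

print(cooks_d.max(), leverage.max(), mahalanobis.max())
```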
Using the file dfbeta.sav: all data (including the outlier, case 30) vs. case 30 removed (with Data --> Select cases --> use filter variable).
All data: b0 = 29, b1 = -.90. Without case 30: b0 = 31, b1 = -1. Both regression coefficients, b0 (constant/intercept) and b1 (gradient/slope), changed!
[SPSS Coefficients tables for both analyses (with and without case 30); columns (Constant), X, t, Sig.; dependent variable: Y]
For case 30, the DFBeta values of the constant (dfb0) and of the predictor X (dfb1) are much higher than those of the other cases.
[Figure: regression lines with and without case 30 (the outlier)]
(Using the file pubs.sav.) The correlation between the no. of pubs in London districts and deaths, with and without the outlier. Note: the residual of the outlier relative to the regression line that includes it is small; its influence statistics, however, are huge.
Why? The outlier is the 'City of London' district, which has a lot of pubs but only few residents. The people drinking in those pubs are visitors; hence the ratio of deaths of residents to the overall alcohol consumption is relatively low.
Case | Leverage | St. DFFIT | St. DFB Intercept | St. DFB Pubs
1 | 0,04 | -0,74 | -0,74 | 0,37
2 | 0,03 | -0,41 | -0,41 | 0,18
3 | 0,02 | -0,18 | -0,17 | 0,07
4 | 0,02 | 0,02 | 0,02 | -0,01
5 | 0,01 | 0,2 | 0,19 | -0,06
6 | 0,01 | 0,4 | 0,38 | -0,1
7 | 0 | 0,68 | 0,63 | -0,12
8 | 0,86 | -4,60E+008 | 92676016 | -4,30E+008
Total N = 8
The residual of the outlier #8 is small because it actually sits very close to the regression line
If these assumptions are not met, we cannot draw valid conclusions from our model!
If our model is generalizable, it should be able to predict the outcome of a different sample.
Adjusted R² indicates the loss of predictive power ('shrinkage') if the model were applied to the population (Stein's formula):
adj R² = 1 - [(n-1)/(n-k-1)] × [(n-2)/(n-k-2)] × [(n+1)/n] × (1 - R²)
(n = sample size, k = number of predictors)
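A small sketch of this shrinkage formula as a Python function, evaluated with values as in the three-predictor record-sales model below (assuming R² ≈ .665, n = 200, k = 3):

```python
# A small sketch of the shrinkage formula above (Stein's formula).
def adjusted_r2_stein(r2, n, k):
    """Expected R^2 if the model were applied to the population (n cases, k predictors)."""
    factor = ((n - 1) / (n - k - 1)) * ((n - 2) / (n - k - 2)) * ((n + 1) / n)
    return 1 - factor * (1 - r2)

print(adjusted_r2_stein(0.665, 200, 3))   # ~0.653: very little shrinkage
```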
Data splitting: the entire sample is split into two halves. Regressions are computed and compared for both halves. A nice method, but one rarely has that many data.
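A minimal sketch of data splitting on simulated data: fit the regression on one half of the sample, then compare R² in both halves.

```python
# A minimal sketch of data splitting (simulated data).
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 1 + 2 * x + rng.normal(size=200)

half = len(x) // 2
b1, b0 = np.polyfit(x[:half], y[:half], 1)     # model estimated from the first half only

def r_squared(x_part, y_part):
    y_hat = b0 + b1 * x_part
    ss_r = np.sum((y_part - y_hat) ** 2)
    ss_t = np.sum((y_part - y_part.mean()) ** 2)
    return 1 - ss_r / ss_t

# similar values in both halves suggest the model generalizes
print(r_squared(x[:half], y[:half]), r_squared(x[half:], y[half:]))
```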
Sample size
The required sample size for a regression depends on the number of predictors k, the size of the effect, and the desired statistical power.
e.g., large effect --> n = 80 (for up to 20 predictors); medium effect --> n = 200; small effect --> n = 600
(Multi-)Collinearity
If 2 predictors are inter-correlated, we speak of collinearity. In the worst case, 2 variables have a correlation of 1. This is bad for a regression, since the regression can no longer be computed reliably: the variables become interchangeable. High collinearity is rare, but some degree of collinearity is always around. Problems with collinearity:
It underestimates the variance explained by a second predictor that is strongly inter-correlated with the first: the second predictor adds little unique variance, although taken by itself it would explain a lot.
We cannot decide which variable is important, i.e. which variable should be included.
The regression coefficients (b-values) become unstable.
(Collinearity is assessed via tolerance and VIF; see the sketch below.)
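A minimal sketch of this collinearity check, computing VIF and tolerance with statsmodels on simulated data in which one predictor is deliberately an almost exact copy of another:

```python
# A minimal sketch (simulated data): tolerance and VIF per predictor;
# x2 is deliberately an almost exact copy of x1.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)   # strongly inter-correlated with x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):               # skip the constant column
    vif = variance_inflation_factor(X, i)
    print(f"predictor {i}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```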
X1: Advertisement budget, X2: times played on radio, X3: attractiveness of the band
Since we know already that money for ads is a predictor, it will be entered into the regression first (1st block), and the 2 new predictors later (2nd block) --> hierarchical method ('Enter').
1st block: Var 1; 2nd block: Var 2 + 3
Regression diagnostics
The regression diagnostics are saved in the data file, each as a separate variable in a new column
Options
R of predictors 1-3 with the outcome; R of predictor 1 with the others; R of predictor 2 with the others; R of predictor 3 with the others; significance levels for all correlations.
Correlations: R's between all variables and significance levels. Predictor 2 (plays on radio) is the best predictor.
Predictors should not correlate with each other higher than R = .9 (collinearity).
Summary of model
Annotations to the Model Summary:
Model 1: only advertisement as predictor; Model 2: 3 predictors.
R = correlation between predictor(s) and outcome.
R Square Change: from 0 to ,335 (Model 1) and a further change of ,330 (Model 2).
df1 = number of predictors added; df2 = N - p - 1 (N = sample size, p = total number of predictors).
Durbin-Watson tests whether the errors are independent; a value close to 2 is OK.

Model Summary(c)
Model | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | Sig. F Change | Durbin-Watson
1 | ,331 | 65,9914 | ,335 | 99,587 | 1 | ,000 |
2 | ,660 | 47,0873 | ,330 | 96,447 | 2 | ,000 | 1,950

a. Predictors: (Constant), ADVERTS Advertising Budget (thousands of pounds)
b. Predictors: (Constant), ADVERTS Advertising Budget (thousands of pounds), ATTRACT Attractiveness of Band, AIRPLAY No. of plays on Radio 1 per week
c. Dependent Variable: SALES Record Sales (thousands)
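The 'Change Statistics' rest on an F-test for the increase in R². A small sketch that reproduces the F Change for Model 2 from the values in the table (R² goes from .335 to .665, 2 predictors added, n = 200, 3 predictors in total):

```python
# A small sketch of the F-test behind 'R Square Change'.
def f_change(r2_old, r2_new, k_added, n, k_new):
    """F for the increase in R^2 when k_added predictors are added to the model."""
    numerator = (r2_new - r2_old) / k_added
    denominator = (1 - r2_new) / (n - k_new - 1)
    return numerator / denominator

print(f_change(0.335, 0.665, 2, 200, 3))   # ~96.5, matching F Change = 96,447 up to rounding
```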
ANOVA for the model against the basic model (the mean)
Degrees of freedom in the ANOVA table: SST: df = # of cases minus 1 (200 - 1 = 199). SSR: df = # of cases minus # of coefficients b0, b1 (200 - 2 = 198). SSM: df = # of predictors.
ANOVA(c)
Model | | Sum of Squares | F | Sig.
1 | Regression (SSM) | 433687,8 | 99,587 | ,000(a)
1 | Residual (SSR) | 862264,2 | |
1 | Total (SST) | 1295952 | |
2 | Regression (SSM) | 861377,4 | 129,498 | ,000(b)
2 | Residual (SSR) | 434574,6 | |
2 | Total (SST) | 1295952 | |

a. Predictors: (Constant), ADVERTS Advertising Budget (thousands of pounds)
b. Predictors: (Constant), ADVERTS Advertising Budget (thousands of pounds), ATTRACT Attractiveness of Band, AIRPLAY No. of plays on Radio 1 per week
c. Dependent Variable: SALES Record Sales (thousands)
Sig. = significance level of the F-test.
Both Model 1 and Model 2 improve the prediction significantly; Model 2 (3 predictors) does even better than Model 1 (1 predictor).
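A small sketch that reproduces the F-ratio for Model 1 from the sums of squares and degrees of freedom above, using F = MSM / MSR:

```python
# A small sketch reproducing the F-ratio for Model 1 from the ANOVA table:
# F = MS_M / MS_R, where each mean square is a sum of squares divided by its df.
ss_m, df_m = 433687.8, 1       # Regression (SSM) row
ss_r, df_r = 862264.2, 198     # Residual (SSR) row

ms_m = ss_m / df_m
ms_r = ss_r / df_r
print(ms_m / ms_r)             # ~99.59, matching F = 99,587
```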
Record sales increase by ,511 standard deviations when the predictor ADVERTS changes by 1 SD; the standardized Betas of ADVERTS (,511) and AIRPLAY (,512) are almost equal, so these two predictors have comparable 'gains'. Model 1 is the same as in the first analysis.
Coefficients
Model | Predictor | B | Std. Error | Beta | t | Sig. | 95% CI Lower | 95% CI Upper
1 | (Constant) (b0) | 134,140 | 7,537 | | 17,799 | ,000 | 119,28 | 149,002
1 | ADVERTS Advertising Budget (thousands of pounds) (b1) | 9,61E-02 | ,010 | ,578 | 9,979 | ,000 | ,077 | ,115
2 | (Constant) (b0) | -26,613 | 17,350 | | -1,534 | ,127 | -60,830 | 7,604
2 | ADVERTS Advertising Budget (thousands of pounds) (b1) | 8,49E-02 | ,007 | ,511 | 12,261 | ,000 | ,071 | ,099
2 | AIRPLAY No. of plays on Radio 1 per week (b2) | 3,367 | ,278 | ,512 | 12,123 | ,000 | 2,820 | 3,915
2 | ATTRACT Attractiveness of Band (b3) | 11,086 | 2,438 | ,192 | 4,548 | ,000 | 6,279 | 15,894
The 'Coefficients' table tells us the individual contribution of each variable to the regression model. The standardized Betas tell us the relative importance of each predictor.
Pearson correlation of predictor × outcome, controlled for each single other predictor; Pearson correlation of predictor × outcome, controlled for all the other predictors. The Betas reflect each predictor's 'unique relationship' with the outcome.
Excluded variables
Excluded Variables(b), Model 1:
Predictor | Beta In | t | Partial Correlation | Tolerance | VIF | Minimum Tolerance
AIRPLAY No. of plays on Radio 1 per week | ,546(a) | 12,51 | ,665 | ,990 | 1,010 | ,990
ATTRACT Attractiveness of Band | ,281(a) | 5,136 | ,344 | ,993 | 1,007 | ,993
a. Predictors in the Model: (Constant), ADVERTS Advertising Budget (thousands of pounds) b. Dependent Variable: SALES Record Sales (thousands)
SPSS gives a summary of the predictors that were not entered into the model (here only for Model 1) and evaluates the contribution of these excluded variables.
Salesi = -26.61 + (0.08 × Adi) + (3.37 × Airplayi) + (11.09 × Attracti)
Interpretation: if Ad increases by 1 unit, sales increase by .08 units; if Airplay increases by 1 unit, sales increase by 3.37 units; if Attract increases by 1 unit, sales increase by 11 units, each independent of the contributions of the other predictors.
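A short sketch of a prediction from this equation; the input values (ad budget 500, 20 plays per week, attractiveness 7) are hypothetical, not from the data set:

```python
# A short sketch of a prediction from the fitted equation above
# (hypothetical input values).
b0, b_ad, b_air, b_attr = -26.61, 0.08, 3.37, 11.09

sales = b0 + b_ad * 500 + b_air * 20 + b_attr * 7
print(round(sales, 1))     # about 158.4 (thousand records)
```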
No Multicollinearity
[SPSS Collinearity Diagnostics: variance proportions of each predictor (Constant, ADVERTS, AIRPLAY, ...) across the dimensions (eigenvalues) of Model 1 (2 dimensions) and Model 2 (4 dimensions)]
Each predictor's variance proportions load highly on a different dimension (Eigenvalue) --> they are not intercorrelated, hence no collinearity
Casewise diagnostics
Casewise Diagnosticsa
Case Number | Std. Residual (z-value) | SALES Record Sales (thousands) | Predicted Value | Residual
1 | 2,125 (>5%) | 330,00 | 229,9203 | 100,0797
2 | -2,314 | 120,00 | 228,9490 | -108,9490
10 | 2,114 | 300,00 | 200,4662 | 99,5338
47 | -2,442 | 40,00 | 154,9698 | -114,9698
52 | 2,069 | 190,00 | 92,5973 | 97,4027
55 | -2,424 | 190,00 | 304,1231 | -114,1231
61 | 2,098 | 300,00 | 201,1897 | 98,8103
68 | -2,345 | 70,00 | 180,4156 | -110,4156
100 | 2,066 | 250,00 | 152,7133 | 97,2867
164 | -2,577 (>1%) | 120,00 | 241,3240 | -121,3240
169 | 3,061 (>1%) | 360,00 | 215,8675 | 144,1325
200 | -2,064 (>5%) | 110,00 | 207,2061 | -97,2061
The casewise diagnostics list cases that lie outside the boundaries of ±2 SD (in the z-distribution, only 5% of cases should lie beyond ±1.96 SD and only 1% beyond ±2.58 SD). Case 169 deviates most and needs to be followed up.
Identify influential cases via the case summaries (see the sketch below):
Of the standardized residuals, no more than 5% should have values exceeding ±2 and no more than 1% exceeding ±3.
Cook's distances > 1 might pose a problem.
Leverage values should not be more than two or three times the average leverage, (# of predictors + 1) / sample size.
Mahalanobis distance: values > 25 in large samples (n = 500) and > 15 in small samples (n = 100) can be problematic.
Absolute values of DFBeta should not exceed 1.
Determine the upper and lower limits of the covariance ratio (CVR): upper limit = 1 + 3 × (average leverage); lower limit = 1 - 3 × (average leverage).
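A minimal sketch of the first check in the list above, with simulated data and statsmodels: what proportion of standardized residuals exceeds ±1.96 and ±2.58?

```python
# A minimal sketch (simulated data): proportion of standardized residuals
# beyond +-1.96 and +-2.58.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2 + 0.5 * x + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(x)).fit()
std_resid = results.resid / np.sqrt(results.mse_resid)   # standardized residuals

n = len(std_resid)
print((np.abs(std_resid) > 1.96).sum() / n)   # should not be much above .05
print((np.abs(std_resid) > 2.58).sum() / n)   # should not be much above .01
```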
Plot of standardized residuals (*ZRESID) against standardized predicted values (*ZPRED): the points are randomly and evenly dispersed --> the assumptions of linearity and homoscedasticity are met.
The distribution of the residuals is normal (left-hand picture); the observed probabilities correspond to the expected ones (right-hand side).
The Kolmogorov-Smirnov test for the standardized residuals is n.s. --> normal distribution.
Scatterplots of the residuals of the outcome variable against each of the predictors separately: no indication of outliers, an evenly spread cloud of dots (only the residual variance of 'attractiveness of band' seems somewhat uneven).