Correlation and Regression
Correlation analysis measures the association between two or more variables. In this section, we consider only the association between two variables. The association is broadly of two types:
1. Two variables are positively associated when large values of one variable tend to occur with large values of the other, and small values tend to occur together as well. As one variable increases in value, the other tends to increase; as one decreases, the other tends to decrease.
2. Two variables are negatively associated when large values of one variable tend to occur with small values of the other, and vice versa. As one variable increases in value, the other tends to decrease, and as one decreases, the other tends to increase.
Strength of association:
◼ If there is a strong linear association, the scatterplot points will tend to fall along a straight line.
◼ If there is a weak linear association, the scatterplot points will be highly variable about the possible
trend line.
◼ There is no linear association if the trend line appears to be horizontal.
Correlation coefficient:
The Pearson correlation coefficient, denoted by r, measures the direction and strength of the linear relationship between two quantitative variables. It is computed as

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}}$$
◼ The value of r is always between –1 and +1.
◼ A value of –1 indicates a perfect negative linear relationship between the variables.
◼ A value of +1 indicates a perfect positive linear relationship.
Measuring the strength:
◼ Values of r near 0 indicate a weak linear association.
◼ The strength of the association increases as you move away from 0 toward either –1 or +1.
◼ Values close to –1 or +1 indicate that the points in the scatterplot lie close to a straight line.
[Figure: four scatterplots of y versus x illustrating r = 1 (perfect positive), r > 0 (positive), r = –1 (perfect negative), and r < 0 (negative) linear relationships.]
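For readers who want to check the formula outside SPSS, here is a minimal Python sketch that computes r directly from the sums above. The height/weight arrays are hypothetical illustrative values, not the tutorial's data set.

import numpy as np

# Hypothetical paired measurements (illustrative only)
x = np.array([160.0, 165.0, 170.0, 172.0, 178.0, 181.0])  # e.g., height
y = np.array([120.0, 130.0, 128.0, 145.0, 150.0, 160.0])  # e.g., weight

n = len(x)
xbar, ybar = x.mean(), y.mean()

# r = (sum(x*y) - n*xbar*ybar) / sqrt((sum(x^2) - n*xbar^2) * (sum(y^2) - n*ybar^2))
num = np.sum(x * y) - n * xbar * ybar
den = np.sqrt((np.sum(x**2) - n * xbar**2) * (np.sum(y**2) - n * ybar**2))
r = num / den
print(f"r = {r:.3f}")  # agrees with np.corrcoef(x, y)[0, 1]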
Now we want to test whether there is any relation between height and weight. To solve this problem, we follow these steps:
Step 1: Analyze→Correlate→Bivariate
Step 2: Bring the height and weight into the Variables box→Click OK
Output:
[SPSS output: Correlations table for height and weight.]
Here we see that the Pearson correlation coefficient is 0.672. This implies that there exists a moderate positive linear relationship between the variables height and weight. The row Sig. (2-tailed) gives the p-value for the test that the population correlation equals zero. A small p-value indicates that there exists a significant correlation between weight and height in the population.
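The same r and two-tailed p-value can be reproduced outside SPSS; a minimal sketch with scipy, again on hypothetical data:

import numpy as np
from scipy import stats

height = np.array([160.0, 165.0, 170.0, 172.0, 178.0, 181.0])  # hypothetical
weight = np.array([120.0, 130.0, 128.0, 145.0, 150.0, 160.0])  # hypothetical

# pearsonr returns r and the two-tailed p-value for H0: population correlation = 0
r, p_two_tailed = stats.pearsonr(height, weight)
print(f"Pearson r = {r:.3f}, Sig. (2-tailed) = {p_two_tailed:.4f}")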
Regression Analysis
The primary difference between correlation and regression analysis is that in the former we do not designate any dependent and independent variables. In regression analysis, however, we consider one dependent variable and one or more independent variables, and we would like to find the relationship between these two types of variables. That is, we would like to find the changing pattern of the dependent variable (denoted y) for changes in the independent variable(s) (denoted x).
Other names for independent variables are predictors and regressors.
We express the relationship between x and y in terms of a model, which we call a regression model. To explain this, let us consider one independent variable and one dependent variable.
The scatterplot of x and y is given below. We would like to draw a line that represents the relationship. This straight line is called the fitted regression line and is expressed as $\hat{y} = a + bx$.
[Figure: scatterplot of y versus x with the fitted regression line $\hat{y} = a + bx$ drawn through the points.]
We obtain the desired line by the method of Least Squares, and thus the intercept and slope obtained are called least squares estimates. The estimates are given by the formulas:
$$b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$$
and
$$a = \bar{y} - b\bar{x}$$
After finding the values of a and b, we can predict the average value of y for a given value of x.
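As a quick check of these formulas, a small Python sketch (hypothetical x/y values; x in cm and y in pounds, mirroring the height/weight example):

import numpy as np

# Hypothetical data: x = height (cm), y = weight (pounds)
x = np.array([160.0, 165.0, 170.0, 172.0, 178.0, 181.0])
y = np.array([120.0, 130.0, 128.0, 145.0, 150.0, 160.0])

n = len(x)
xbar, ybar = x.mean(), y.mean()

# b = (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2); a = ybar - b*xbar
b = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
a = ybar - b * xbar

# Predict the average y for a given x using y_hat = a + b*x
x_new = 175.0
print(f"a = {a:.2f}, b = {b:.3f}, predicted y at x = {x_new}: {a + b * x_new:.1f}")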
For practical implementation, let us consider the same data set as in the correlation analysis. Now we consider height as the independent variable and weight as the dependent variable. We perform regression analysis using the following steps:
Step 1: Analyze → Regression → Linear
Step 2: Bring weight into the Dependent box and height into the Independent(s) box → Click OK
Output:
[SPSS output: Model Summary, ANOVA (Total sum of squares 2327.690 on 39 df), and Coefficients tables for the regression of weight on height.]
Let us interpret the above three tables, namely Model Summary, ANOVA and Coefficients.
R Square, shown in the first table, is called the coefficient of determination. Here R² = 0.215 indicates that 21.5% of the variation in the dependent variable (weight) has been explained by the independent variable (height). Therefore, other factors not included in the model are responsible for the remaining 1 − 0.215 = 0.785, or 78.5%, of the variation in the dependent variable.
The second table indicates that the regression model is significant, as the p-value (Sig.) is smaller than 0.05. In other words, height has a significant effect on weight.
The third table reports the estimated intercept and slope coefficients (column B) as 113.744 and 0.727, respectively. The estimated slope of 0.727 indicates that if height increases by one unit (cm), then on average weight increases by 0.727 units (pounds).
The last column of the third table gives the p-values (Sig.) for testing whether the population intercept and slope are different from zero. As both p-values (here 0.000 and 0.003) are smaller than 0.05, we conclude that the population slope and intercept are significantly different from zero.
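The three tables have direct counterparts in a statsmodels fit; a minimal sketch on hypothetical data (the numbers will not match the SPSS output above):

import numpy as np
import statsmodels.api as sm

height = np.array([160.0, 165.0, 170.0, 172.0, 178.0, 181.0])  # hypothetical
weight = np.array([120.0, 130.0, 128.0, 145.0, 150.0, 160.0])  # hypothetical

X = sm.add_constant(height)      # adds the intercept column
model = sm.OLS(weight, X).fit()

print(model.rsquared)            # R Square (Model Summary table)
print(model.f_pvalue)            # overall F-test p-value (ANOVA table)
print(model.params)              # intercept and slope (column B, Coefficients table)
print(model.pvalues)             # Sig. for intercept and slope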
Next, we extend the model by adding age as a second independent variable (multiple regression), following the same steps but bringing both height and age into the Independent(s) box.
Output:
[SPSS output: Model Summary, ANOVA (Total sum of squares 2327.690 on 39 df), and Coefficients tables for the regression of weight on height and age.]
The Model Summary table indicates that the R Square value is 0.263. Therefore, 26.3% of the variation in the variable weight has been explained by the independent variables height and age.
The second table (ANOVA) indicates that the regression is significant, as the p-value is 0.004. In other words, at least one of the independent variables has a significant effect on the dependent variable (weight).
The third table (Coefficients) shows the regression coefficients with p-values and other relevant statistics. We can see from the last column (Sig.) that height has a significant effect, as its p-value of 0.045 is less than 0.05. However, age does not have a significant effect on weight, as its p-value (0.129) is greater than 0.05. The regression coefficient of height is 0.53, which indicates that a one-unit (cm) increase in height results in a 0.53-unit (pounds) increase in weight when the other variable (age) is held fixed in the model.
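Fitting the two-predictor model outside SPSS only requires stacking a second column; a brief sketch, again with hypothetical data:

import numpy as np
import statsmodels.api as sm

height = np.array([160.0, 165.0, 170.0, 172.0, 178.0, 181.0])  # hypothetical
age = np.array([21.0, 25.0, 23.0, 30.0, 28.0, 35.0])           # hypothetical
weight = np.array([120.0, 130.0, 128.0, 145.0, 150.0, 160.0])  # hypothetical

X = sm.add_constant(np.column_stack([height, age]))
model = sm.OLS(weight, X).fit()

# Coefficients and per-predictor p-values, ordered: intercept, height, age
print(model.params)
print(model.pvalues)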
Selecting a subset of regressors:
Sometimes our data set contains many independent variables (predictors), and we would like to choose the best subset of predictors. This process is called subset selection. There are three common ways of doing so, namely forward selection, backward elimination, and stepwise regression; a forward-selection sketch is given below.
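For illustration, a greedy forward-selection loop might look like the following sketch (forward_select is a hypothetical helper, not an SPSS or statsmodels function):

import numpy as np
import statsmodels.api as sm

def forward_select(y, candidates, alpha=0.05):
    # Greedy forward selection: repeatedly add the candidate predictor with the
    # smallest p-value, stopping when no remaining candidate is significant.
    selected = {}
    while candidates:
        best_name, best_p = None, alpha
        for name, col in candidates.items():
            X = sm.add_constant(np.column_stack(list(selected.values()) + [col]))
            p = sm.OLS(y, X).fit().pvalues[-1]  # p-value of the newest predictor
            if p < best_p:
                best_name, best_p = name, p
        if best_name is None:
            break  # nothing left below alpha
        selected[best_name] = candidates.pop(best_name)
    return list(selected)

Backward elimination works in reverse (start with all predictors and drop the least significant one at a time), and stepwise regression alternates the two moves.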
Since the present data set has very few independent variables, we do not implement the method here. However, we can demonstrate the process: while running the regression analysis as before, simply select Method: Stepwise in the following window:
Residual plots
We need residual analysis to make a final assessment of the fitted model. We follow the same process of regression analysis, but need to produce some plots as shown below:
Step 1: Analyze → Regression → Linear
Step 2: Choose the dependent and independent variables, then click on the Plots tab.
Step 3: Bring ZPRED into the X box and ZRESID into the Y box. Also select Normal Probability Plot under Standardized Residual Plots. Click Continue, then OK.
Output:
[SPSS output: Normal P-P plot of the standardized residuals, and a scatterplot of the standardized residuals against the standardized predicted values.]
If the data come from a Normal distribution, the points in the Normal P-P plot should lie on the straight line. For a small data set, some deviation is acceptable. In the second plot, of standardized residuals versus predicted values, there should be no pattern for a well-fitted model. Here there seem to be two possible outliers.
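Equivalent diagnostic plots can be produced in Python; a sketch using a Q-Q plot in place of SPSS's P-P plot (both compare the residuals to a Normal reference), again on hypothetical data:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

height = np.array([160.0, 165.0, 170.0, 172.0, 178.0, 181.0])  # hypothetical
weight = np.array([120.0, 130.0, 128.0, 145.0, 150.0, 160.0])  # hypothetical

fit = sm.OLS(weight, sm.add_constant(height)).fit()
std_resid = fit.get_influence().resid_studentized_internal  # standardized residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
stats.probplot(std_resid, dist="norm", plot=ax1)  # Normal Q-Q plot of residuals
ax1.set_title("Normal Q-Q plot of residuals")
ax2.scatter(fit.fittedvalues, std_resid)          # residuals vs predicted values
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Predicted values")
ax2.set_ylabel("Standardized residuals")
plt.tight_layout()
plt.show()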