Lecture 4 - Correlation and Regression
Table 1 Notation for the Data Used in Simple Regression and Correlation
We wish to measure both the direction and the strength of the relationship between Y and X .
Two related measures, known as the covariance and the correlation coefficient, are developed
below.
On the scatter plot of Y versus X, let us draw a vertical line at X̄ and a horizontal line at Ȳ, as shown in Figure 14. The quantity

Cov(Y, X) = Σ(Yi − Ȳ)(Xi − X̄) / (n − 1),

which is known as the covariance between Y and X, indicates the direction of the linear relationship between Y and X.
If Cov(Y, X) > 0, then there is a positive relationship between Y and X,
If Cov(Y, X) < 0, then the relationship is negative.
Unfortunately, Cov(Y, X) does not tell us much about the strength of such a relationship because it is affected by changes in the units of measurement. For example, we would get two different values for Cov(Y, X) if we report Y and/or X in thousands of FCFA instead of FCFA. To avoid this disadvantage of the covariance, we standardize the data before computing the covariance.
To standardize the Y data, we first subtract the mean from each observation and then divide by the standard deviation; that is, we compute:

zi = (Yi − Ȳ) / sY

where

sY = √( Σ(Yi − Ȳ)² / (n − 1) )

is the sample standard deviation of Y. The standardized variable has mean zero and standard deviation one. We standardize X in a similar way, by subtracting the mean X̄ from each observation Xi and then dividing by the standard deviation sX.
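The standardization step can be sketched in a few lines of pure Python; the Y values below are hypothetical, chosen only to illustrate the computation.

```python
# Standardizing a sample: subtract the mean, divide by the sample
# standard deviation. Illustrated on a small hypothetical Y sample.
import math

y = [12.0, 15.0, 9.0, 18.0, 11.0]   # hypothetical observations
n = len(y)
y_bar = sum(y) / n                  # sample mean
s_y = math.sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))  # sample sd

z = [(v - y_bar) / s_y for v in y]  # standardized values

# The standardized values have mean 0 and standard deviation 1.
z_bar = sum(z) / n
s_z = math.sqrt(sum((v - z_bar) ** 2 for v in z) / (n - 1))
```

Whatever the units of Y, the z values are unit-free, which is why the correlation coefficient (the covariance of the standardized variables) does not depend on the units of measurement.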
Example:
To study the relationship between the length of a service call and the number of electronic
components in the computer that must be repaired or replaced, a sample of records on service
calls was taken. The data (See table below) consist of the length of service calls in minutes (the
response variable) and the number of components repaired (the predictor variable):
Required: Calculate Cov(Y, X) and Cor(Y, X), where Y denotes the length of service calls and X the number of units repaired. Interpret your results.
Solution
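Since the data table is not reproduced here, the computation can be illustrated on hypothetical figures of the same shape (X = units repaired, Y = call length in minutes); only the method, not the numbers, matches the example.

```python
# Sample covariance and correlation for paired data.
# All figures below are hypothetical.
import math

x = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8]              # units repaired
y = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119]  # minutes per call
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Cov(Y, X) = sum of cross-deviations / (n - 1)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Correlation = covariance of the standardized data
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
cor_xy = cov_xy / (s_x * s_y)   # dimensionless, always in [-1, 1]
```

A positive covariance indicates that longer calls go with more units repaired; the correlation then measures how strong that linear relationship is, on a fixed scale from −1 to 1.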
6.1.2. Correlation Coefficient
The correlation coefficient is a statistic showing the degree of relation between two variables. Three main types are distinguished (Fig 15): the parametric Pearson product-moment correlation, and the non-parametric (ranked) correlations, Spearman's rho and Kendall's tau. The Pearson correlation coefficient is:

r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )
Test Statistic
A key question for correlation analysis is: "Is there any evidence of a linear relationship between the dependent variable y and the independent variable x at the α level of significance?" The hypotheses H0: ρ = 0 versus H1: ρ ≠ 0 are tested with

t = r √(n − 2) / √(1 − r²),

which under H0 follows a t distribution with n − 2 degrees of freedom.
Example
You are given the data below concerning sales made by 7 stores in a certain locality
Critical value: t with n − 2 = 5 degrees of freedom at the 5% level of significance.
Decision rule: reject H0 if |t| exceeds the critical value. Conclusion: there is evidence of a linear relationship at the 5% level of significance; reject H0.
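The test can be sketched as follows; the correlation value r = 0.95 is hypothetical (the store data are not reproduced here), while n = 7 and the critical value 2.571 (the two-sided 5% point of t with 5 df) come from the example.

```python
# Significance test for a correlation coefficient:
#   t = r * sqrt(n - 2) / sqrt(1 - r^2),  df = n - 2
import math

n = 7          # number of stores (from the example)
r = 0.95       # hypothetical sample correlation

t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_crit = 2.571  # two-sided 5% critical value, t with 5 df

# Evidence of a linear relationship when |t| exceeds the critical value
reject_h0 = abs(t_stat) > t_crit
```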
Exercises
1. The correlation coefficient is used to determine:
a. A specific value of the y-variable given a specific value of the x-variable
b. A specific value of the x-variable given a specific value of the y-variable
c. The strength of the relationship between the x and y variables
d. None of these
2. The birth weights of 1,333 fifty-year-old men from a certain locality were traced through birth
records. Adult height and birth weight were significantly correlated (r = 0.22, P<0.001).
a) What is meant by 'correlated' and 'r = 0.22'?
b) What assumptions are required for the calculation of the P value?
c) What can we conclude about the relationship between adult height and birth weight?
3. The likelihood that a statistic would be as extreme or more extreme than what was observed is called
A. statistically significant result
B. test statistic
C. significance level
D. p-value
4. Which of the following makes no sense?
a) p < .10
b) r = .5
c) p = - .05
d) r = - .95
5. The diagram below is an example of a
a) histogram illustrating a lack of correlation between tobacco and alcohol
b) scatterplot illustrating a perfect correlation between tobacco and alcohol
c) scatterplot illustrating a positive correlation between tobacco and alcohol
d) histogram illustrating a positive correlation between tobacco and alcohol
6.2. Regression Analysis
By the end of this lecture you should be able to understand the following:
Types of Regression Models
Determining the Simple Linear Regression Equation
Measures of Variation
Assumptions of Regression and Correlation
Residual Analysis
Measuring Autocorrelation
Inferences about the Slope
Pitfalls in Regression and Ethical Issues
Regression models attempt to minimize the distance, measured vertically, between each observation point and the model line (or curve). The length of this line segment is called the residual, modeling error, or simply error:

ei = yi − ŷi

The negative and positive errors can cancel out, giving zero overall error, and many lines satisfy that criterion. To obtain a good model, we instead choose the line that minimizes the sum of squares of the errors:

SSE = Σ ei² = Σ (yi − ŷi)² = Σ (yi − b0 − b1·xi)²

The best linear model is the one that minimizes SSE.
The sum of squared errors without regression, that is, the total sum of squares (SST), is:

SST = Σ (yi − ȳ)²

It is a measure of y's variability and is called the variation of y. SST decomposes into the part explained by the regression (SSR) and the unexplained part (SSE):

SST = SSR + SSE

The fraction of the variation that is explained determines the goodness of the regression model and is called the coefficient of determination:

R² = SSR / SST = 1 − SSE / SST
By using the least squares method (a procedure that minimizes the vertical deviations of plotted points from a straight line), we are able to construct a best-fitting straight line through the scatter-diagram points and then formulate a regression equation of the form:

ŷ = a + bX
The sample regression line provides an estimate of the population regression line.
Example:
The numbers of disk input/outputs (I/Os) and the processor times of seven programs were measured as:
number of disk I/Os (x):  14  16  27  42  39  50  83
processor times (y):       2   5   7   9  10  13  20
Required:
a) Scatter diagram
b) Regression line relating x and y
c) Interpret your result
d) Coefficient of determination; explain your results
Solution
a) Scatter diagram (see figure)
b) Regression line:

b1 = Sxy / Sxx = 0.2438
b0 = ȳ − b1·x̄ = 9.43 − 0.2438 × 38.71 = −0.0083

Hence, the required model is:

ŷ = −0.0083 + 0.2438x
c) Interpretation:
- The estimated average processor time (y) is −0.0083 when there are no disk I/Os.
- The estimated average processor time increases by 0.2438 for each additional disk I/O.
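The slope and intercept above can be reproduced with a short pure-Python sketch, using the seven data pairs given in the example:

```python
# Least-squares fit for the disk I/O example; reproduces the
# slope 0.2438 and intercept -0.0083 quoted in the solution.
x = [14, 16, 27, 42, 39, 50, 83]   # number of disk I/Os
y = [2, 5, 7, 9, 10, 13, 20]       # processor times
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Corrected sums of squares and cross-products
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2

b1 = s_xy / s_xx           # slope
b0 = y_bar - b1 * x_bar    # intercept

# Coefficient of determination for part d), computed from the same sums
r2 = s_xy ** 2 / (s_xx * s_yy)
```

The R² value computed here follows from the same data; a value close to 1 indicates that most of the variation in processor time is explained by the number of disk I/Os.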
d) Coefficient of Determination
NB: In the single-independent-variable case, the coefficient of determination is:

R² = r²

where
R² = coefficient of determination, and
r = simple correlation coefficient.
The corresponding ANOVA decomposition for simple regression is:

Source      df      SS
Regression  1       SSR
Error       n − 2   SSE
Total       n − 1   SST
The standard deviation of the errors is:

se = √( SSE / (n − 2) )

The standard errors of the regression coefficients b0 and b1 are:

sb0 = se · √( 1/n + x̄² / Σ(xi − x̄)² )
sb1 = se / √( Σ(xi − x̄)² )

For the disk I/O and CPU data of the example above, we have n = 7, se = 1.0834, and

sb0 = 1.0834 × √( 1/7 + 38.71²/3363.43 ) = 0.8311

Let t[1−α/2; n−2] denote the 1 − α/2 quantile of a t variate with n − 2 degrees of freedom. The confidence intervals are:

b0 ∓ t[1−α/2; n−2] · sb0   and   b1 ∓ t[1−α/2; n−2] · sb1

If a confidence interval includes zero, then the regression parameter cannot be considered different from zero at the 100(1 − α)% confidence level. For b1, the interval 0.2438 ∓ t[1−α/2; n−2] · sb1 does not include zero, so the slope is significantly different from zero.
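These standard errors can be checked numerically from the same seven data pairs; the only value taken from outside the data is 2.571, the two-sided 5% critical value of t with 5 degrees of freedom.

```python
# Standard errors and a 95% CI for the slope, disk I/O example;
# reproduces s_b0 = 0.8311 quoted above.
import math

x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = s_yy - b1 * s_xy              # sum of squared errors
s_e = math.sqrt(sse / (n - 2))      # standard deviation of errors
s_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / s_xx)
s_b1 = s_e / math.sqrt(s_xx)

# 95% confidence interval for the slope, t with 5 df
t_crit = 2.571
ci_b1 = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
```

Since the interval for b1 lies entirely above zero, the slope is significantly different from zero at the 5% level.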
Purposes:
Prediction
Explanation
Theory building
6.3.1. Introduction
In the last chapter we began our study of regression and correlation analysis. However, the
methods presented considered only the relationship between one dependent variable and one
independent variable. The possible effect of other independent variables was ignored. For
example, we described how the repair cost of a car was related to the age of the car. Are there
other factors that affect the repair cost? Does the size of the engine or the number of miles driven
affect the repair cost? When several independent variables are used to estimate the value of the dependent variable, it is called multiple regression.
Definition
Multiple linear regression is a method of analysis for assessing the strength of the
relationship between each of a set of explanatory variables (sometimes known as
independent variables, although this is not recommended since the variables are often
correlated), and a single response (or dependent) variable.
The independent variables can be measured at any level (i.e., nominal, ordinal, interval,
or ratio). However, nominal or ordinal-level IVs that have more than two values or
categories (e.g., race) must be recoded prior to conducting the analysis because linear
regression procedures can only handle interval or ratio-level IVs, and nominal or
ordinal-level IVs with a maximum of two values (i.e., dichotomous). The dependent
variable MUST be measured at the interval- or ratio-level.
Goal
There is a total amount of variation in y (SSTO). We want to explain as much of this variation as
possible using a linear model and our multiple explanatory variables.
Design Requirements
One dependent variable (criterion)
Two or more independent variables (predictor variables).
Sample size: >= 50 (at least 10 times as many cases as independent variables)
A linear regression model with two predictor variables can be expressed with the following
equation:
Y = B0 + B1*X1 + B2*X2 + ε.
The variables in the model are:
Y, the response variable;
X1, the first predictor variable;
X2, the second predictor variable; and
ε, the residual error, which is an unmeasured variable.
With k explanatory variables there are k + 1 parameters. The parameters in the model are:
B0, the Y-intercept;
B1, B2, …, Bk, the regression coefficients. They indicate the change in the estimated
value of the dependent variable for a unit change in one of the independent variables,
when the other independent variables are held constant.
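A two-predictor fit can be sketched without a statistics package by solving the normal equations (X'X)b = X'y directly. The data below are hypothetical, constructed so that Y = 1 + 2·X1 + 3·X2 exactly, which makes the recovered coefficients easy to verify.

```python
# Least-squares fit of Y = B0 + B1*X1 + B2*X2 via the normal equations.
# Hypothetical data: Y is built exactly from B0=1, B1=2, B2=3.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

# Design matrix rows: [1, x1, x2]
X = [[1.0, a, b] for a, b in zip(x1, x2)]
p = 3

# Build X'X (3x3) and X'y (3-vector)
xtx = [[sum(X[i][r] * X[i][c] for i in range(len(X))) for c in range(p)]
       for r in range(p)]
xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(p)]

def solve(a, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    out = [0.0] * n
    for r in range(n - 1, -1, -1):
        out[r] = (m[r][n] - sum(m[r][c] * out[c]
                                for c in range(r + 1, n))) / m[r][r]
    return out

b0, b1, b2 = solve(xtx, xty)   # recovers 1, 2, 3
```

In practice a library routine would be used for the linear algebra; the point here is only that the fitted coefficients are the solution of the normal equations.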
Regression coefficients show the amount of change in the dependent (response) variable (in its measurement unit) when an independent (predictor) variable changes by one unit (in its measurement unit).
The degrees of freedom for regression are k, the number of independent variables. The degrees
of freedom associated with the error term are n − (k + 1). The SS in the middle of the top row of
the ANOVA table refers to the sum of squares, or the variation.
The column headed MS refers to the mean square and is obtained by dividing the SS term by the
df term. Thus, MSR, the mean square regression, is equal to SSR/k, and MSE equals SSE/ [n −
(k + 1)]. The general format of the ANOVA table is:
Analysis of Variance
Source      df            SS     MS                       F
Regression  k             SSR    MSR = SSR/k              MSR/MSE
Error       n − (k + 1)   SSE    MSE = SSE/[n − (k + 1)]
Total       n − 1         SST
Another measure of the effectiveness of the regression equation is the coefficient of multiple
determination, i.e., the proportion of the variation in the dependent variable, Y, that is explained
by the set of independent variables x1, x2, x3,…xk.
The coefficient of multiple determination, written R², may range from 0 to 1.0. It is the proportion of the variation explained by the regression. The ANOVA table is used to calculate it: R² is the sum of squares due to the regression divided by the total sum of squares:

R² = SSR / SST

R² must always be between 0 and 1.0, inclusive; that is, 0 ≤ R² ≤ 1. The closer R² is to 1.0, the stronger the association between Y and the set of independent variables X1, X2, …, Xk.
If there is a large discrepancy between R² and Adjusted R², extraneous variables should be
removed from the analysis and R² recomputed.
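Adjusted R² corrects R² for the number of predictors, which is what makes the comparison above meaningful. A minimal sketch, with hypothetical values n = 30 and k = 3:

```python
# Adjusted R-squared penalizes extra predictors:
#   R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
def adjusted_r2(r2, n, k):
    """n = number of cases, k = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: R2 = 0.80 with 30 cases and 3 predictors
r2_adj = adjusted_r2(0.80, 30, 3)   # slightly below 0.80
```

Adding a predictor always raises R², but raises adjusted R² only if the predictor explains more than chance alone would; a large gap between the two suggests extraneous predictors.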
Definition: Extraneous variables are any variables that you are not intentionally studying in your experiment or test. When you run an experiment, you are looking to see whether one variable (the independent variable) has an effect on another variable (the dependent variable). Any other, undesirable variables are called extraneous variables.
6.3.5. Global test
An overall test of the regression model. It investigates the possibility that all the regression
coefficients are equal to zero. It tests the overall ability of the set of independent variables to explain
differences in the dependent variable.
H0: β1 = β2 = … = βk = 0
Rejecting H0 and accepting H1 implies that one or more of the independent variables is useful in explaining differences in the dependent variable. A word of caution, however: the test does not suggest how many, or identify which, regression coefficients are not zero. Note also that βj denotes the population value of the slope coefficient, whereas bj, a point estimate of βj, is computed from sample data.
7. There should be no significant outliers, high leverage points or highly influential points.
Outliers, leverage and influential points are different terms used to represent observations
in your data set that are in some way unusual when you wish to perform a multiple
regression analysis. Detect outliers using "casewise diagnostics" and "studentized deleted
residuals"; check for influential points using Cook's Distance.
8. The degree to which outliers affect the regression solution depends upon where the
outlier is located relative to the other cases in the analysis. Outliers whose locations
have a large effect on the regression solution are called influential cases.
Whether or not a case is influential is measured by Cook’s distance.
Cook's distance is an index measure; it is compared to a critical value based on the
formula:
4 / (n – k – 1)
where n is the number of cases and k is the number of independent variables.
If a case has a Cook's distance greater than the critical value, it should be examined for exclusion.
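The rule of thumb above is simple arithmetic; a minimal sketch, with a hypothetical study of n = 100 cases, k = 3 predictors, and made-up Cook's distances:

```python
# Critical value for Cook's distance: 4 / (n - k - 1)
n, k = 100, 3                       # hypothetical study size
cooks_critical = 4 / (n - k - 1)

# Hypothetical Cook's distances for three cases; any case above the
# critical value should be examined for possible exclusion.
distances = [0.01, 0.03, 0.12]
to_examine = [d for d in distances if d > cooks_critical]
```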
9. Finally, you need to check that the residuals (errors) are approximately normally
distributed. Two common methods to check this assumption include using: (a) a
histogram (with a superimposed normal curve) and a Normal P-P Plot; or (b) a Normal
Q-Q Plot of the studentized residuals.
Exercise
1. Literacy rate is a reflection of the educational facilities and quality of education available in a
country, and mass communication plays a large part in the educational process. In an effort to
relate the literacy rate of a country to various mass communication outlets, a demographer
has proposed to relate literacy rate to the following variables: number of daily newspaper
copies (per 1000 population), number of radios (per 1000 population), and number of TV
sets(per 1000 population). Here are the data for a sample of 10 countries:
2. A general-practice-based study sought to find out if people's ears increase in size as they get older. Two hundred and six patients were studied, with ear size assessed by the length of the left external ear from the top to the lowest part. Measurements were made simply, using a transparent plastic ruler. The relation between the patient's age and ear length (see graph below) was examined by calculating a regression equation.
The mean age of the patients was 53.75 years (range 30 to 93) and the mean ear length was 67.5 mm (range 52.0 to 84.0 mm). The linear regression equation was
ear length = 55.9 + 0.22 × age,
with the 95% confidence interval for the b coefficient being 0.17 to 0.27. The author concluded that 'It seems therefore that as we get older our ears get bigger (on average by 0.22 mm a year)'.
a) What are the interpretations of the numbers 55.9 and 0.22 in the regression equation?
b) Are the assumptions about the data required for the regression analysis satisfied here?
c) Are the conclusions justified by the data?
3. The accompanying data is on y = profit margin of savings and loan companies in a given year, x1 = net revenues in that year, and x2 = number of savings and loan branch offices.
4. In a study of physical fitness and cardiovascular risk factors in children, blood pressure and
recovery index (post exercise recovery rate, an indicator of fitness) were measured (Hoffman
and Walter 1989). Multiple regression was used to look at the relationship between systolic
blood pressure and recovery index, adjusted for age, race, area of residence and ponderal
index (wt/ht2). For the boys, the adjusted regression coefficient of systolic blood pressure on
recovery index was given as follows:
b = –0.086, SE b = 0.039, 95% CI = –0.162 to –0.010.
a) What is meant by 'multiple regression analysis'?
b) What is meant by the terms 'b', 'SE b' and '95% CI'?
c) What assumptions about the variables are required for these analyses to be valid?
d) Why was the regression adjusted and what does this mean?
e) What would be the effect of adjusting for race if systolic blood pressure were
related to race and recovery index were not?
f) What would be the effects of adjusting for ponderal index if blood pressure and
recovery index were both related to ponderal index?
5. The growth of children from early childhood through adolescence generally follows a linear
pattern. Data on the heights of female Americans during childhood, from four to nine years
old, were compiled and the least squares regression line was obtained as ŷ = 32 + 2.4x where
ŷ is the predicted height in inches, and x is age in years.
1) Interpret the value of the estimated slope b1 = 2.4.
2) Would interpretation of the value of the estimated y-intercept, b0 = 32, make sense here?
3) What would you predict the height to be for a female American at 8 years old?
4) What would you predict the height to be for a female American at 25 years old?
6. A multiple regression analysis was used to model the relationship between body mass index
(dependent variable) and two independent variables (height and age) for 33 randomly
selected level II students of Douala University in 2012. The output from SPSS analysis is as
shown below:
Model            Unstandardized Coefficients    t    95% Confidence Interval for β
                 β         Std. Error                Lower Bound    Upper Bound
1  (Constant)    4.267     1.452
   Height (cm)   −0.316    0.231
   Age (years)   0.854     0.222
a) What is meant by 'multiple regression analysis'?
b) What is meant by the terms 'SE β' and '95% Confidence Interval'?
c) List 3 assumptions about the variables required for these analyses to be valid.
d) Complete the table with the values of t, Lower Bound, and Upper Bound.
e) Which of the variables are not significant predictors? Test at the 5% significance level.
f) Complete the ANOVA table and interpret your results.
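A sketch of how the missing table entries would be computed: t = β / SE, and each 95% CI is β ∓ tcrit × SE with df = n − k − 1 = 33 − 2 − 1 = 30, for which the two-sided 5% critical value is 2.042 (the coefficient values come from the table above; everything else is standard arithmetic).

```python
# Completing the SPSS-style table: t statistics and 95% CIs
# from the quoted coefficients and standard errors.
coef = {
    "Constant": (4.267, 1.452),
    "Height":   (-0.316, 0.231),
    "Age":      (0.854, 0.222),
}
t_crit = 2.042   # two-sided 5% critical value of t with 30 df

results = {}
for name, (beta, se) in coef.items():
    t = beta / se
    lo, hi = beta - t_crit * se, beta + t_crit * se
    results[name] = (round(t, 3), round(lo, 3), round(hi, 3))

# A predictor is not significant at the 5% level when its CI contains 0
# (equivalently, when |t| < t_crit).
```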
10. Match the statements below with the corresponding terms from the list.
A) R2 adjusted B) Residual plots C) R2 D) Residual E) Influential points F) outliers
___ Worst kind of outlier, can totally reverse the direction of association between x and y
____ Used to check the assumptions of the regression model.
____ Used when trying to decide between two models with different numbers of
predictors.
____Proportion of the variability in y explained by the regression model.
____ Is the observed value of y minus the predicted value of y for the observed x.
____ A point that lies far away from the rest.
11.In regression analysis, the variable that is used to explain the change in the outcome of an
experiment, or some natural process, is called
a. the x-variable
b. the independent variable
c. the predictor variable
d. the explanatory variable
e. all of the above (a-d) are correct
f. none are correct
12. In a regression and correlation analysis if r2 = 1, then
a. SSE = SST
b. SSE = 1
c. SSR = SSE
d. SSR = SST
13. In a regression analysis if SSE = 200 and SSR = 300, then the coefficient of determination is
a. 0.6667
b. 0.6000
c. 0.4000
d. 1.5000
14. If the correlation coefficient is 0.8, the percentage of variation in the response variable
explained by the variation in the explanatory variable is
a. 0.80% b. 80% c. 0.64% d. 64%
15. A residual plot:
a. displays residuals of the explanatory variable versus residuals of the response variable.
b. displays residuals of the explanatory variable versus the response variable.
c. displays explanatory variable versus residuals of the response variable.
d. displays the explanatory variable versus the response variable.
e. displays the explanatory variable on the x axis versus the response variable on the y
axis.