Lecture 4 - Correlation and Regression

By the end of this lecture you should be able to:

 Calculate and interpret the coefficient of correlation (measuring the strength of the
association)
 Carry out a test of hypothesis on r, the coefficient of correlation
 Carry out correlation analysis in SPSS and write a scientific report
6.1. The concept of Covariance and Correlation Analysis

6.1.1. Covariance and Correlation Coefficient


Suppose we have observations on n subjects consisting of a dependent or response variable Y and
an explanatory variable X. The observations are usually recorded as in Table 1.

Table 1: Notation for the data used in simple regression and correlation (observation i = 1, 2, …, n;
response values y1, y2, …, yn; predictor values x1, x2, …, xn)

We wish to measure both the direction and the strength of the relationship between Y and X.
Two related measures, known as the covariance and the correlation coefficient, are developed
below.

On the scatter plot of Y versus X, let us draw a vertical line at x̄ and a horizontal line at ȳ, as
shown in Figure 14,

Figure 14 A graphical illustration of the correlation coefficient


where ȳ and x̄ are the sample means of Y and X, respectively. The two lines divide the graph into
four quadrants. For each point i in the graph, compute the following quantities:
 (yi − ȳ), the deviation of each observation yi from the mean of the response variable,
 (xi − x̄), the deviation of each observation xi from the mean of the predictor variable, and
 the product of the above two quantities, (yi − ȳ)(xi − x̄).
If the linear relationship between Y and X is positive (as X increases Y also increases), then
there are more points in the first and third quadrants than in the second and fourth quadrants.
Conversely, if the relationship between Y and X is negative (as X increases Y decreases), then
there are more points in the second and fourth quadrants than in the first and third quadrants.
Therefore, the sign of the quantity

Cov(Y, X) = Σ (yi − ȳ)(xi − x̄) / (n − 1),

which is known as the covariance between Y and X, indicates the direction of the linear
relationship between Y and X.
 If Cov(Y, X) > 0, then there is a positive relationship between Y and X,
 If Cov(Y, X) < 0, then the relationship is negative.

Unfortunately, Cov(Y, X) does not tell us much about the strength of such a relationship because
it is affected by changes in the units of measurement. For example, we would get two different
values for Cov(Y, X) if we report Y and/or X in thousands of FCFA instead of
FCFA. To avoid this disadvantage of the covariance, we standardize the data before computing
the covariance.
To standardize the Y data, we first subtract the mean from each observation and then divide by
the standard deviation; that is, we compute:

zi = (yi − ȳ) / sy

where

sy = √[ Σ (yi − ȳ)² / (n − 1) ]

is the sample standard deviation of Y. The standardized variable has mean zero and standard
deviation one. We standardize X in a similar way, by subtracting the mean x̄ from each
observation xi and then dividing by the standard deviation sx.
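As an illustration, here is a minimal sketch (not from the lecture; the data values are hypothetical)
of how the covariance and the standardized variables can be computed with NumPy:

    import numpy as np

    x = np.array([1.0, 3.0, 4.0, 6.0, 8.0])   # hypothetical predictor values
    y = np.array([2.0, 5.0, 7.0, 8.0, 11.0])  # hypothetical response values
    n = len(x)

    # Sample covariance, following the formula above
    cov_xy = np.sum((y - y.mean()) * (x - x.mean())) / (n - 1)

    # Standardized variables: mean zero, standard deviation one
    z_y = (y - y.mean()) / y.std(ddof=1)
    z_x = (x - x.mean()) / x.std(ddof=1)

    print(cov_xy)                        # matches np.cov(y, x, ddof=1)[0, 1]
    print(z_y.mean(), z_y.std(ddof=1))   # 0.0 and 1.0 (up to rounding)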

Example:

To study the relationship between the length of a service call and the number of electronic
components in the computer that must be repaired or replaced, a sample of records on service
calls was taken. The data (See table below) consist of the length of service calls in minutes (the
response variable) and the number of components repaired (the predictor variable):

Required: Calculate Cov(Y, X) and Cor(Y, X), where Y denotes the length of service calls and X
the number of units repaired. Interpret your results.

Solution
6.1.2. Correlation Coefficient

A correlation coefficient is a statistic showing the degree of relation between two variables. There
are three main types (Figure 15).

Figure 15: Correlation in different forms: parametric (Pearson's r) versus non-parametric or
ranked correlations (Spearman's rho and Kendall's tau).


Non-parametric correlations use ordinal-level variables, while the parametric type uses ratio/scale
variables. In this lecture, we focus on parametric correlations.

6.1.3. Parametric Correlation: Pearson's r


For any two variables, x and y, the sample correlation coefficient is:

r = Σ (xi − x̄)(yi − ȳ) / [ √(Σ (xi − x̄)²) · √(Σ (yi − ȳ)²) ]

Equivalently, r = Cov(Y, X) / (sx sy).
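A minimal sketch (again with hypothetical data) computing r both from this formula and with
SciPy's pearsonr:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 3.0, 4.0, 6.0, 8.0])
    y = np.array([2.0, 5.0, 7.0, 8.0, 11.0])

    # Pearson's r from the definition
    r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
        np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

    r_scipy, p_value = stats.pearsonr(x, y)
    print(r_manual, r_scipy)  # identical up to rounding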

6.1.4. Sample Correlation Coefficient (r): Features

 It is also called Pearson's correlation or the product moment correlation coefficient.
 It measures the nature and strength of the relationship between two variables of the
quantitative type.
 The sign of r denotes the nature of the association, while the magnitude of r denotes the
strength of the association.
 If the sign is positive, the relation is direct (an increase in one variable is associated with
an increase in the other variable, and a decrease in one variable is associated with a
decrease in the other variable).
 If the sign is negative, the relationship is inverse or indirect (an increase in one variable
is associated with a decrease in the other).
 If r = 0, there is no association or correlation between the two variables.
 If 0 < |r| < 0.25, the correlation is weak.
 If 0.25 ≤ |r| < 0.75, the correlation is intermediate.
 If 0.75 ≤ |r| < 1, the correlation is strong.
 If |r| = 1, the correlation is perfect.
 The sample correlation coefficient r is an estimate of the population correlation ρ and is
used to measure the strength of the linear relationship in the sample observations.
6.1.5. t Test for Correlation

The population correlation coefficient ρ (rho) is used to measure the strength of the association
between the variables. The necessary steps of the test are as follows:

Hypotheses
H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation)

Test Statistic

t = r √[(n − 2) / (1 − r²)], which follows a t distribution with df = n − 2 degrees of freedom under H0.

A key question for correlation analysis is: "Is there any evidence of a linear relationship between
the dependent variable and the independent variable at a given level of significance?"

Example
You are given the data below concerning sales made by 7 stores in a certain locality

a) Calculate r, the correlation coefficient


b) Is there any evidence of linear relationship between annual sales of a store
and its square footage at .05 level of significance?
Solution
a) Verify that r = 0.9706. There is a strong positive association between annual sales and
square footage.
b) Hypotheses:
H0: ρ = 0 (no association)
H1: ρ ≠ 0 (association)
α = .05; df = 7 − 2 = 5

Calculation of the t statistic:

t = r √[(n − 2) / (1 − r²)] = 0.9706 √[5 / (1 − 0.9706²)] = 9.02

Critical value: t(0.025; 5) = 2.5706

Decision: Since t = 9.02 > 2.5706, reject H0. There is evidence of a linear relationship at the 5%
level of significance.
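A minimal sketch reproducing this test with SciPy, using r = 0.9706 and n = 7 from the example:

    import numpy as np
    from scipy import stats

    r, n = 0.9706, 7
    t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))   # test statistic
    t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # two-sided critical value, alpha = .05
    p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided P-value
    print(t_stat, t_crit, p_val)  # t ≈ 9.02 > 2.5706, so reject H0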
Exercises
1. The correlation coefficient is used to determine:
a. A specific value of the y-variable given a specific value of the x-variable
b. A specific value of the x-variable given a specific value of the y-variable
c. The strength of the relationship between the x and y variables
d. None of these
2. The birth weights of 1,333 fifty-year-old men from a certain locality were traced through birth
records. Adult height and birth weight were significantly correlated (r = 0.22, P<0.001).
a) What is meant by 'correlated' and 'r = 0.22'?
b) What assumptions are required for the calculation of the P value?
c) What can we conclude about the relationship between adult height and birth weight?
3. The likelihood that a statistic would be as extreme or more extreme than what was observed is called
A. statistically significant result
B. test statistic
C. significance level
D. p-value
4. Which of the following makes no sense?
a) p < .10
b) r = .5
c) p = - .05
d) r = - .95
5. The diagram below is an example of a
a) histogram illustrating a lack of correlation between tobacco and alcohol
b) scatterplot illustrating a perfect correlation between tobacco and alcohol
c) scatterplot illustrating a positive correlation between tobacco and alcohol
d) histogram illustrating a positive correlation between tobacco and alcohol
6.2. Regression Analysis

By the end of this lecture you should be able to understand the following:
 Types of Regression Models
 Determining the Simple Linear Regression Equation
 Measures of Variation
 Assumptions of Regression and Correlation
 Residual Analysis
 Measuring Autocorrelation
 Inferences about the Slope
 Pitfalls in Regression and Ethical Issues

6.2.1. The Concept of Regression

Regression tells us how to draw the straight line described by the correlation. The topics covered are:
 Definition of a Good Model
 Estimation of Model parameters
 Allocation of Variation
 Standard deviation of Errors
 Confidence Intervals for Regression Parameters
 Confidence Intervals for Predictions
 Visual Tests for verifying Regression Assumption

6.2.2. Simple Linear Regression Models

 Regression Model: Predicts a response for a given set of predictor variables.
 Response Variable: The estimated variable.
 Predictor Variables: Variables used to predict the response; also called predictors or factors.
 Linear Regression Models: The response is a linear function of the predictors.
 Simple Linear Regression Models: Only one predictor.
 For a given population, the model is:

Y = β0 + β1X + ε

where β0 is the intercept, β1 is the slope, and ε is a random error term.

6.2.3. Definition of a Good Model

 Regression models attempt to minimize the distance, measured vertically, between the
observation point and the model line (or curve).
 The length of this line segment is called the residual, modeling error, or simply error.
 The negative and positive errors should cancel out
⇒ zero overall error.
 Many lines will satisfy this criterion.
To have a good model, choose the line that minimizes the sum of squares of the errors:

ŷ = b0 + b1x

where ŷ is the predicted response when the predictor variable is x.


The parameters b0 and b1 are fixed regression parameters to be determined from the data.
 Given n observation pairs {(x1, y1), …, (xn, yn)}, the estimated response for the ith
observation is:

ŷi = b0 + b1xi

The error is:

ei = yi − ŷi

The best linear model minimizes the sum of squared errors (SSE):

SSE = Σ ei² = Σ (yi − b0 − b1xi)²

The sum of squared errors without regression, that is, the total sum of squares (SST), is:

SST = Σ (yi − ȳ)² = Σ yi² − nȳ²

It is a measure of y's variability and is called the variation of y.

The fraction of this variation that is explained determines the goodness of the regression model
and is called the coefficient of determination:

R² = SSR / SST = (SST − SSE) / SST

6.2.4. Estimation of Model Parameters: Least Squares Criterion

b0 and b1 are obtained by finding the values that minimize the sum of the squared residuals.
The minimization uses partial derivatives with respect to b0 and then with respect to b1. The
resulting regression parameters that give the minimum error variance are:

b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = (Σ xiyi − nx̄ȳ) / (Σ xi² − nx̄²)

b0 = ȳ − b1x̄

where x̄ and ȳ are the sample means of x and y.

By using the least squares method (a procedure that minimizes the vertical deviations of plotted
points surrounding a straight line) we are able to construct a best-fitting straight line to the scatter
diagram points and then formulate a regression equation in the form:

ŷ = a + bX

The sample regression line provides an estimate of the population regression line.

The individual error terms have a mean of zero, i.e., E(εi) = 0.

b0 is the estimated average value of y when the value of x is zero.
b1 is the estimated change in the average value of y as a result of a one-unit change in x.

Example:
The number of disk input/outputs (I/Os) and processor times of seven programs were measured
as:

number of disk I/Os (x): 14  16  27  42  39  50  83
processor times (y):      2   5   7   9  10  13  20

Required:
a) Scatter diagram
b) Regression line relating x and y
c) Interpretation of your results
d) Coefficient of determination, with an explanation of your results
Solution
a) Scatter diagram: [scatter plot of processor times (y), ranging from 0 to 25, against number of
disk I/Os (x), ranging from 0 to 100; the points lie close to a straight line with positive slope]

b) Equation of the regression line:

x      y      x²      y²     x·y
14     2      196     4      28
16     5      256     25     80
27     7      729     49     189
42     9      1764    81     378
39     10     1521    100    390
50     13     2500    169    650
83     20     6889    400    1660
---------------------------------
271    66     13855   828    3375

From the table,

x̄ = 271/7 = 38.71
ȳ = 66/7 = 9.43

b1 = (Σ xiyi − nx̄ȳ) / (Σ xi² − nx̄²)
   = (3375 − 7 × 38.71 × 9.43) / (13855 − 7 × 38.71²)
   = 0.2438

b0 = ȳ − b1x̄ = 9.43 − 0.2438 × 38.71 = −0.0083

Hence, the required model is: ŷ = −0.0083 + 0.2438x

c) Interpretation:
- The estimated average processor time is −0.0083 when there are no disk I/Os (x = 0).
- The estimated average processor time increases by 0.2438 units for each additional
disk I/O.

d) Coefficient of determination:

SSE = Σ y² − b0 Σ y − b1 Σ xy = 828 + 0.0083 × 66 − 0.2438 × 3375 = 5.87

SST = Σ y² − nȳ² = 828 − 7 × 9.43 × 9.43 = 205.71
SSR = SST − SSE = 205.71 − 5.87 = 199.84

R² = SSR / SST = 199.84 / 205.71 = 0.9715

The regression explains 97.15% of the variation in CPU time.
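As a check, here is a minimal NumPy sketch that reproduces the whole calculation from the raw
data:

    import numpy as np

    x = np.array([14, 16, 27, 42, 39, 50, 83], dtype=float)  # number of disk I/Os
    y = np.array([2, 5, 7, 9, 10, 13, 20], dtype=float)      # processor times

    # Least squares estimates
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    # Goodness of fit
    y_hat = b0 + b1 * x
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    print(b0, b1, r2)  # ≈ -0.0083, 0.2438, 0.9715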

NB: In the single-independent-variable case, the coefficient of determination is:

R² = r²

where
R² = coefficient of determination, and
r = simple correlation coefficient.

6.2.5. Standard Deviation of Errors

The standard deviation of the variation of the observations around the regression line is called
the standard error of estimate, denoted se. Since the errors are obtained after estimating two
regression parameters from the data, the errors have n − 2 degrees of freedom:

se = √[ Σ (yi − ŷi)² / (n − 2) ] = √[ SSE / (n − 2) ]

where

se² is called the mean squared error (MSE).
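A minimal sketch computing se for the disk I/O example (SSE = 5.87 and n = 7 from above):

    import math

    sse, n = 5.87, 7
    s_e = math.sqrt(sse / (n - 2))  # standard error of estimate
    print(s_e)  # ≈ 1.0835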

The ANOVA Table (one way)

Source        df           SS     MS                       F          Significance F
Regression    k            SSR    MSR = SSR/k              MSR/MSE    P-value (sig)
Error         n − k − 1    SSE    MSE = SSE/(n − k − 1)
Total         n − 1        SST

The standard error of the regression slope coefficient b1 is sb1, and that of b0 is sb0.

For the disk I/O and CPU data of the example above, we have n = 7 and se = √(5.87/5) = 1.0835, so:

sb0 = se √[ 1/n + x̄² / Σ (xi − x̄)² ] = 1.0835 √[ 1/7 + 38.71² / 3363.43 ] = 0.8311

Therefore the 90% confidence interval for b0 is:

b0 ± t(0.95; 5) sb0 = −0.0083 ± (2.015)(0.8311)
= (−1.6830, 1.6663)

For b1,

sb1 = se / √[ Σ (xi − x̄)² ] = 1.0835 / √3363.43 = 0.0187

α = level of significance, e.g. α = 5%, 1%, etc


Definition: A P-value (or probability value) is the probability of getting a value of the sample
test statistic that is at least as extreme as the one found from the sample data, assuming that the
null hypothesis is true. By extreme we mean: far from what we would expect to observe if the
null hypothesis is true. The lower the P-value, the more evidence there is in favor of rejecting the
null hypothesis.

6.2.6. Confidence Interval Estimate of the Slope

The 100(1 − α)% confidence intervals for b0 and b1 can be computed using t(1 − α/2; n − 2), the
1 − α/2 quantile of a t variate with n − 2 degrees of freedom. The confidence intervals are:

b0 ± t(1 − α/2; n − 2) sb0    and    b1 ± t(1 − α/2; n − 2) sb1

If a confidence interval includes zero, then the regression parameter cannot be considered
different from zero at the 100(1 − α)% confidence level.

From our previous example, the 90% confidence interval for b1 is:

CI: 0.2438 ± (2.015)(0.0187)
= (0.2061, 0.2815)

Interpretation:
For b1: We are 90% confident that the average CPU time increases by between 0.2061 and
0.2815 units for each additional disk I/O.
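A minimal sketch of this interval with SciPy, using the figures from the example:

    from scipy import stats

    b1, s_b1, n = 0.2438, 0.0187, 7
    t_q = stats.t.ppf(0.95, df=n - 2)        # 2.015 for a 90% two-sided CI
    print(b1 - t_q * s_b1, b1 + t_q * s_b1)  # ≈ (0.2061, 0.2815)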

6.2.7. Inference about the Slope: t Test

t test for a population slope: is there a linear relationship between x and y?
Hypotheses:
H0: β1 = 0 (no linear relationship exists)
H1: β1 ≠ 0 (a linear relationship exists)
Test statistic (df = n − 2):
For b0, t = b0 / sb0
For b1, t = b1 / sb1
where
b1 = sample regression slope coefficient, and
sb1 = standard error of the slope.
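As a sketch of how this inference is obtained in practice, the following fits the disk I/O data with
the statsmodels library (the values quoted in the comments are the ones derived above):

    import numpy as np
    import statsmodels.api as sm

    x = np.array([14, 16, 27, 42, 39, 50, 83], dtype=float)
    y = np.array([2, 5, 7, 9, 10, 13, 20], dtype=float)

    model = sm.OLS(y, sm.add_constant(x)).fit()
    print(model.params)                # b0 ≈ -0.0083, b1 ≈ 0.2438
    print(model.bse)                   # s_b0 ≈ 0.831, s_b1 ≈ 0.0187
    print(model.tvalues)               # t = b / s_b, with df = n - 2
    print(model.conf_int(alpha=0.10))  # 90% CIs for b0 and b1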


6.3. Multiple Regression Analysis

Purposes:
Prediction
Explanation
Theory building

6.3.1. Introduction

In the last chapter we began our study of regression and correlation analysis. However, the
methods presented considered only the relationship between one dependent variable and one
independent variable. The possible effect of other independent variables was ignored. For
example, we described how the repair cost of a car was related to the age of the car. Are there
other factors that affect the repair cost? Does the size of the engine or the number of miles driven
affect the repair cost? When several independent variables are used to estimate the value of the
dependent variable, it is called multiple regression.

Definition
 Multiple linear regression is a method of analysis for assessing the strength of the
relationship between each of a set of explanatory variables (sometimes known as
independent variables, although this is not recommended since the variables are often
correlated), and a single response (or dependent) variable.
 The independent variables can be measured at any level (i.e., nominal, ordinal, interval,
or ratio). However, nominal or ordinal-level IVs that have more than two values or
categories (e.g., race) must be recoded prior to conducting the analysis because linear
regression procedures can only handle interval or ratio-level IVs, and nominal or
ordinal-level IVs with a maximum of two values (i.e., dichotomous). The dependent
variable MUST be measured at the interval- or ratio-level.
Goal
There is a total amount of variation in y (SSTO). We want to explain as much of this variation as
possible using a linear model and our multiple explanatory variables.

Design Requirements
 One dependent variable (criterion)
 Two or more independent variables (predictor variables).
 Sample size: >= 50 (at least 10 times as many cases as independent variables)
A linear regression model with two predictor variables can be expressed with the following
equation:
Y = B0 + B1*X1 + B2*X2 + ε.
The variables in the model are:
Y, the response variable;
X1, the first predictor variable;
X2, the second predictor variable; and
ε, the residual error, which is an unmeasured variable.
In general, a model with k − 1 explanatory variables has k parameters. The parameters in the
model are:
B0, the Y-intercept;
B1, B2, …, the regression coefficients. They indicate the change in the estimated
value of the dependent variable for a unit change in one of the independent variables,
when the other independent variables are held constant.

Regression coefficients show the amount of change in the dependent (response) variable (in its
measurement unit) when an independent (predictor) variable changes by one unit (in its
measurement unit).

What is the meaning of the regression coefficients?


B0 is the estimated value of Y when every predictor variable is fixed at 0.
B1, associated with the independent variable X1, indicates that for each additional unit of X1, Y
increases by B1 units, if the other independent variables are held constant.
Can you now interpret the other coefficients?
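A minimal sketch of fitting such a two-predictor model with statsmodels; the data are simulated,
not from the lecture:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=60)
    x2 = rng.normal(size=60)
    # Simulate Y = B0 + B1*X1 + B2*X2 + e with B0 = 1.0, B1 = 2.0, B2 = -0.5
    y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=60)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()
    print(fit.params)  # estimates of B0, B1, B2 (close to 1.0, 2.0, -0.5)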

6.3.2. Multiple Standard Error of Estimate

It is likely that there is some error in the estimation. This can be measured by the multiple
standard error of estimate, which measures the error in the predicted value of the dependent
variable:

s(Y.12…k) = √[ Σ (Y − Ŷ)² / (n − (k + 1)) ]

where:
Y is the observation;
Ŷ is the value estimated from the regression equation;
n is the number of observations in the sample;
k is the number of independent variables;
s is the standard error of estimate, whose subscripts indicate the number of
independent variables being used to estimate the value of Y.

6.3.3. The ANOVA Table


A convenient means of showing the regression output is to use an ANOVA table. The variation in the
dependent variable is separated into two components:
a. that explained by the regression, that is, the independent variable and
b. the residual error or unexplained variation.
These two categories are identified in the source column of the following ANOVA table. The
column headed "df" refers to the degrees of freedom associated with each category. The total
degrees of freedom is (n − 1).

The degrees of freedom for regression are k, the number of independent variables. The degrees
of freedom associated with the error term are n − (k + 1). The SS in the middle of the top row of
the ANOVA table refers to the sum of squares, or the variation.

The column headed MS refers to the mean square and is obtained by dividing the SS term by the
df term. Thus MSR, the mean square regression, equals SSR/k, and MSE equals SSE/[n −
(k + 1)]. The general format of the ANOVA table is:

Analysis of Variance
Source        df             SS     MS                        F
Regression    k              SSR    MSR = SSR/k               MSR/MSE
Error         n − (k + 1)    SSE    MSE = SSE/[n − (k + 1)]
Total         n − 1          SST
Notice that the multiple standard error of the estimate can be readily computed from the
ANOVA table as s = √MSE.

Another measure of the effectiveness of the regression equation is the coefficient of multiple
determination, i.e., the proportion of the variation in the dependent variable, Y, that is explained
by the set of independent variables x1, x2, x3,…xk.

The coefficient of multiple determination, written R², may range from 0 to 1.0. It is
the percent of the variation explained by the regression. The ANOVA table is used to calculate
the coefficient of multiple determination: it is the sum of squares due to the regression divided
by the sum of squares total, R² = SSR/SST.

R² must always be between 0 and 1.0, inclusive; that is, 0 ≤ R² ≤ 1. The closer R² is to 1.0, the
stronger the association between Y and the set of independent variables X1, X2, …, Xk.

6.3.4. Adjusted Coefficient of Determination

As the number of independent variables in the regression model increases, the coefficient of
multiple determination, R², increases. Even if an additional independent variable is not a good
predictor, its inclusion in the model decreases SSE, which in turn increases SSR and R². Because
of this, another measure of the effectiveness of a multiple regression model, called the adjusted
R², should be considered.
Definition: The adjusted R² is the proportion of the variation in Y explained by X1, X2, …, Xk,
adjusted for the number of predictors in the model. Mathematically,

adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
If there is a large discrepancy between R² and Adjusted R², extraneous variables should be
removed from the analysis and R² recomputed.
Definition: Extraneous variables are any variables that you are not intentionally studying in your
experiment or test. When you run an experiment, you are looking to see whether one variable (the
independent variable) has an effect on another variable (the dependent variable). Such
undesirable variables are called extraneous variables.

6.3.5. Global Test

The global test is an overall test of the regression model. It investigates the possibility that all the
regression coefficients are equal to zero; that is, it tests the overall ability of the set of independent
variables to explain differences in the dependent variable. The test statistic is the F ratio,
F = MSR/MSE, from the ANOVA table.

H0: β1 = β2 = … = βk = 0

H1: Not all the βs = 0

Rejecting H0 and accepting H1 implies that one or more of the independent variables is useful in
explaining differences in the dependent variable. A word of caution, however: the test does not
say how many, or identify which, regression coefficients are not zero. Note also that βj denotes
the population value of the slope, whereas bj, a point estimate of βj, is computed from sample data.

6.3.6. Testing Individual Regression Coefficients


The procedure is the same as in simple linear regression. If the hypothesis test finds that the null
hypothesis cannot be rejected, then the variable should be dropped from the model. However, the
test only supports removing one variable at a time from the model. After a variable is removed, a
new regression model is constructed using the remaining variables, and a new t test can be
conducted for each of the remaining variables.

6.3.7. Conditions to Carry Out Multiple Regression Analysis


Conditions for multiple regression mirror those of simple regression:
1. Your dependent variable should be measured on a continuous scale (i.e., it is either
an interval or ratio variable). Examples of variables that meet this criterion include
revision time (measured in hours), intelligence (measured using IQ score), exam
performance (measured from 0 to 100), weight (measured in kg), and so forth.
2. You have two or more independent variables, which can be either continuous (i.e., an
interval or ratio variable) or categorical (i.e., an ordinal or nominal variable).
3. Successive residuals should be independent. This means that there is not a pattern to the
residuals, the residuals are not highly correlated, and that there are not long runs of all
positive or all negative residuals. When successive residuals are correlated we refer to
this condition as autocorrelation, that is, correlation of successive residuals.
Autocorrelation frequently occurs when data are collected over a period of time.
 The Durbin-Watson statistic is used to test for the presence of serial correlation
among the residuals.
 The value of the Durbin-Watson statistic ranges from 0 to 4.
 As a general rule of thumb, the residuals are not correlated if the Durbin-Watson
statistic is approximately 2; an acceptable range is 1.50 - 2.50 (see the sketch below).
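A minimal sketch of the Durbin-Watson check using the statsmodels helper, applied to
hypothetical residuals:

    import numpy as np
    from statsmodels.stats.stattools import durbin_watson

    resid = np.random.default_rng(1).normal(size=100)  # hypothetical residuals
    print(durbin_watson(resid))  # ≈ 2 when successive residuals are uncorrelated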
4. There needs to be a linear relationship between (a) the dependent variable and each of
your independent variables, and (b) the dependent variable and the independent variables
collectively. Creating scatterplots and partial regression plots using SPSS Statistics
helps you check this.
5. Your data needs to show homoscedasticity, which is where the variances along the line
of best fit remain similar as you move along the line. When you analyse your own data,
you will need to plot the studentized residuals against the unstandardized predicted
values.
6. Your data must not show multicollinearity, which occurs when you have two or more
independent variables that are highly correlated with each other. This leads to problems
with understanding which independent variable contributes to the variance explained in
the dependent variable, as well as technical issues in calculating a multiple regression
model. You should (a) detect multicollinearity through an inspection of correlation coefficients
and Tolerance/VIF values, and (b) interpret these correlation coefficients and
Tolerance/VIF values so that you can determine whether your data meet or violate this
assumption.
 Multicollinearity answers the question, “Is there any variable in the model
that is measuring the same relationship/quantity as is measured by another
variable or group of variables?”.
 The question is answered by using collinearity statistics:
 Check that neither of the predictor variables has a variance inflation factor
VIF = 1 / (1 − Rj²) greater than 10, where Rj² is obtained by regressing predictor j
on the other independent variables. VIF is a measure of the degree to which an
independent variable is correlated with the other independent variables in the
regression model.
 Check that in the regression analysis no two independent variables are highly
correlated, e.g. r = 0.90, or higher.
 Tolerance is the amount of variability in one independent variable that is not
explained by the other independent variables.
 Tolerance values less than 0.10 indicate collinearity.
 If we discover collinearity in the regression output, we should reject the interpretation
of the relationships as false until the issue is resolved.
 Multicollinearity can be resolved by combining the highly correlated variables through
principal component analysis, or by omitting a variable from the analysis. A minimal
sketch of the VIF/tolerance check follows.
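A minimal sketch computing VIF and tolerance with statsmodels; the two predictors are
simulated to be deliberately collinear:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=80)
    x2 = 0.9 * x1 + rng.normal(scale=0.2, size=80)  # strongly correlated with x1
    X = sm.add_constant(np.column_stack([x1, x2]))

    for j in (1, 2):  # column 0 is the constant
        vif = variance_inflation_factor(X, j)
        print(f"x{j}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")  # VIF > 10 flags collinearity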

7. There should be no significant outliers, high leverage points or highly influential points.
Outliers, leverage and influential points are different terms used to represent observations
in your data set that are in some way unusual when you wish to perform a multiple
regression analysis. Detect outliers using "casewise diagnostics" and "studentized deleted
residuals"; check for influential points using Cook's distance.
8. The degree to which outliers affect the regression solution depends upon where the
outlier is located relative to the other cases in the analysis. Outliers whose locations
have a large effect on the regression solution are called influential cases.
Whether or not a case is influential is measured by Cook's distance, an index measure
that is compared to a critical value based on the formula
4 / (n − k − 1),
where n is the number of cases and k is the number of independent variables.
If a case has a Cook's distance greater than the critical value, it should be examined
for exclusion (see the sketch below).
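A minimal sketch flagging influential cases with Cook's distance in statsmodels; the data are
simulated, with one case made deliberately unusual:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.normal(size=50)
    y = 2 + 3 * x + rng.normal(size=50)
    y[0] += 15  # make the first case unusual on purpose

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    cooks_d, _ = fit.get_influence().cooks_distance
    critical = 4 / (50 - 1 - 1)  # 4 / (n - k - 1), the rule of thumb above
    print(np.where(cooks_d > critical)[0])  # indices of cases to examine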

9. Finally, you need to check that the residuals (errors) are approximately normally
distributed. Two common methods to check this assumption include using: (a) a
histogram (with a superimposed normal curve) and a Normal P-P Plot; or (b) a Normal
Q-Q Plot of the studentized residuals.

6.3.8. Standardized Coefficients


 In statistics, standardized coefficients or beta coefficients are the estimates resulting
from a regression analysis that have been standardized so that the variances of dependent
and independent variables are 1.
 Therefore, standardized coefficients refer to how many standard deviations a dependent
variable will change, per standard deviation increase in the predictor variable.
 For univariate regression, the absolute value of the standardized coefficient equals the
correlation coefficient.
 Standardization of the coefficient is usually done to answer the question of which of the
independent variables have a greater effect on the dependent variable in a multiple
regression analysis, when the variables are measured in different units of measurement
(for example, income measured in dollars and family size measured in number of
individuals).
 The beta coefficients can be negative or positive, and have a t-value and significance of
the t-value associated with each.
 The t-test assesses whether the beta coefficient is significantly different from zero. If the
beta coefficient is not statistically significant (i.e., the t-value is not significant), the
variable does not significantly predict the outcome.
 If the beta coefficient is significant, examine the sign of the beta:
 If the beta coefficient is positive, the interpretation is that for every 1-unit increase in the
predictor variable, the outcome variable will increase by the beta coefficient value.
 If the beta coefficient is negative, the interpretation is that for every 1-unit increase in the
predictor variable, the outcome variable will decrease by the beta coefficient value.
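A minimal sketch obtaining beta coefficients by z-scoring all variables before fitting; the data are
simulated on deliberately different measurement scales:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x1 = rng.normal(50, 10, size=60)  # e.g. an income-like scale
    x2 = rng.normal(4, 1, size=60)    # e.g. a family-size-like scale
    y = 5 + 0.8 * x1 + 2.0 * x2 + rng.normal(size=60)

    def z(v):
        return (v - v.mean()) / v.std(ddof=1)  # standardize to mean 0, sd 1

    fit = sm.OLS(z(y), sm.add_constant(np.column_stack([z(x1), z(x2)]))).fit()
    print(fit.params[1:])  # betas: sd change in y per sd increase in each predictor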

Exercise
1. Literacy rate is a reflection of the educational facilities and quality of education available in a
country, and mass communication plays a large part in the educational process. In an effort to
relate the literacy rate of a country to various mass communication outlets, a demographer
has proposed to relate literacy rate to the following variables: number of daily newspaper
copies (per 1000 population), number of radios (per 1000 population), and number of TV
sets (per 1000 population). Here are the data for a sample of 10 countries:

1) What is the response variable? What are the explanatory variables?


2) Enter the data into SPSS
3) Use the SPSS output to make a statement concerning whether each of the following
assumptions in a multiple linear regression is satisfied:
1) the linearity assumption
2) the uniform variance (homoscedasticity) assumption
3) the normality assumption
4) Write the least-squares regression equation for this problem.
5) Explain what each term in the regression equation represents in terms of the problem
6) What are the degrees of freedom for the t* value in this problem?
7) Interpret the coefficient of multiple determination, R².
8) At the 5% significance level, state and justify if the model is useful for predicting the
response.
9) At the 5% significance level, does it appear that any of the predictor variables can be
removed from the full model as unnecessary?
10) Create scatterplots to check Assumption 1 as well as to identify potential outliers and
potential influential observations.
11) Write a summary report of this research using the APA reporting standard

2. A general practice based study sought to find out if people's ears increase in size as they get
older. Two hundred and six patients were studied, with ear size being assessed by the length
of the left external ear from the top to the lowest part. Measurements were made simply,
using a transparent plastic ruler. The relation between the patient's age and ear length (see
graph below) was examined by calculating a regression equation.


The mean age of the patients was 53.75 years (range 30 - 93) and the mean ear length was
67.5 mm (range 52.0 - 84.0 mm). The linear regression equation was

ear length = 55.9 + 0.22 × age,

with the 95% confidence interval for the b coefficient being 0.17 to 0.27. The author concluded
that "it seems therefore that as we get older our ears get bigger (on average by 0.22 mm a year)".
a) What are the interpretations of the numbers 55.9 and 0.22 in the regression equation?
b) Are the assumptions about the data required for the regression analysis satisfied here?
c) Are the conclusions justified by the data?

3. The accompanying data is on y = profit margin of savings and loan companies in a given
year, x1 = net revenues in that year, and x2 = number of savings and loan branch offices.

a. Determine the multiple regression equation for the data.


b. Compute and interpret the coefficient of multiple determination, R².
c. At the 5% significance level, determine if the model is useful for predicting the
response.
d. Create all partial plots to check Assumption 1 as well as to identify outliers and
potential influential observations.
e. Obtain the residuals and create residual plots. Decide whether or not it is
reasonable to consider that the assumptions for multiple regression analysis are
met by the variables in questions.
f. At the 5% significance level, does it appear that any of the predictor variables can
be removed from the full model as unnecessary?
g. Obtain and interpret 95% confidence intervals for the slopes, βi, of the population
regression line that relates net revenues and number of branches to profit margin.
h. Are there any multicollinearity problems (i.e., are net revenues and number of
branches collinear [estimating similar relationships/quantities])?
i. Obtain a point estimate for the mean profit margin with 3.5 net revenues and 6500
branches.
j. Test the alternative hypothesis that the mean profit margin with 3.5 net revenues
and 6500 branches is greater than 0.70. Test at the 5% significance level.
k. Determine a 95% confidence interval for the mean profit margin with 3.5 net
revenues and 6500 branches.
l. Find the predicted profit margin for a bank with 3.5 net revenues and 6500
branches.
m. Determine a 95% prediction interval for the profit margin for Dr. Street's bank
with 3.5 net revenues and 6500 branches.

4. In a study of physical fitness and cardiovascular risk factors in children, blood pressure and
recovery index (post exercise recovery rate, an indicator of fitness) were measured (Hoffman
and Walter 1989). Multiple regression was used to look at the relationship between systolic
blood pressure and recovery index, adjusted for age, race, area of residence and ponderal
index (wt/ht2). For the boys, the adjusted regression coefficient of systolic blood pressure on
recovery index was given as follows:
b = –0.086, SE b = 0.039, 95% CI = –0.162 to –0.010.
a) What is meant by 'multiple regression analysis'?
b) What is meant by the terms 'b', 'SE b' and '95% CI'?
c) What assumptions about the variables are required for these analyses to be valid?
d) Why was the regression adjusted and what does this mean?
e) What would be the effect of adjusting for race if systolic blood pressure were
related to race and recovery index were not?
f) What would be the effects of adjusting for ponderal index if blood pressure and
recovery index were both related to ponderal index?
5. The growth of children from early childhood through adolescence generally follows a linear
pattern. Data on the heights of female Americans during childhood, from four to nine years
old, were compiled and the least squares regression line was obtained as ŷ = 32 + 2.4x where
ŷ is the predicted height in inches, and x is age in years.
1) Interpret the value of the estimated slope b1 = 2.4.
2) Would interpretation of the value of the estimated y-intercept, b0 = 32, make sense here?
3) What would you predict the height to be for a female American at 8 years old?
4) What would you predict the height to be for a female American at 25 years old?
6. A multiple regression analysis was used to model the relationship between body mass index
(dependent variable) and two independent variables (height and age) for 33 randomly
selected level II students of Douala University in 2012. The output from SPSS analysis is as
shown below:
Model              Unstandardized Coefficients      t     95% Confidence Interval for β
                   β          Std. Error                  Lower Bound     Upper Bound
1   (Constant)     4.267      1.452
    Height (cm)    -0.316     0.231
    Age (years)    0.854      0.222
a) What is meant by 'multiple regression analysis'?
b) What is meant by the terms 'SE β' and '95% Confidence Interval'?
c) List 3 assumptions about the variables required for these analyses to be valid.
d) Complete the table for the values of t, Lower Bound, and Upper Bound.
e) Which of the variables are not significant predictors? Test at the 5% significance level.
f) Complete the ANOVA table and interpret your results.

7. Write down the letter corresponding to the correct answer.

1. If we discover collinearity in the regression output, we should:
A - Reject the interpretation of the relationships as false until the issue is resolved
B - Accept the results    C - Use a different statistic    D - Use a Chi-square test

8. Multicollinearity can be resolved by:


A - Combining the highly correlated variables through principal component analysis
B - Omitting a variable from the analysis    C - A and B are correct    D - All of the above

10. Match the statements below with the corresponding terms from the list.
A) R2 adjusted B) Residual plots C) R2 D) Residual E) Influential points F) outliers
___ Worst kind of outlier, can totally reverse the direction of association between x and y
____ Used to check the assumptions of the regression model.
____ Used when trying to decide between two models with different numbers of
predictors.
____Proportion of the variability in y explained by the regression model.
____ Is the observed value of y minus the predicted value of y for the observed x.
____ A point that lies far away from the rest.

11. In regression analysis, the variable that is used to explain the change in the outcome of an
experiment, or some natural process, is called
a. the x-variable
b. the independent variable
c. the predictor variable
d. the explanatory variable
e. all of the above (a-d) are correct
f. none are correct
12. In a regression and correlation analysis, if r² = 1, then
a. SSE = SST
b. SSE = 1
c. SSR = SSE
d. SSR = SST
13. In a regression analysis if SSE = 200 and SSR = 300, then the coefficient of determination is
a. 0.6667
b. 0.6000
c. 0.4000
d. 1.5000
14. If the correlation coefficient is 0.8, the percentage of variation in the response variable
explained by the variation in the explanatory variable is
a. 0.80% b. 80% c. 0.64% d. 64%
15. A residual plot:
a. displays residuals of the explanatory variable versus residuals of the response variable.
b. displays residuals of the explanatory variable versus the response variable.
c. displays explanatory variable versus residuals of the response variable.
d. displays the explanatory variable versus the response variable.
e. displays the explanatory variable on the x axis versus the response variable on the y
axis.
