Correlation and Regression
Introduction
In Part 1, we discussed a non-parametric test (the chi-square test) and parametric tests (the
t-test and ANOVA). Another area of inferential statistics involves determining whether a
relationship exists between two or more numerical or quantitative variables. For example, a
businessperson may want to know whether the volume of sales for a given month is related
to the amount of advertising the firm does that month. Educators are interested in
determining whether the number of hours a student studies is related to the student’s score
on a particular exam. Medical researchers are interested in questions such as: Is caffeine
related to heart damage? Is there a relationship between a person's age and his or her
blood pressure? These are only a few of the many questions that can be answered by using
the techniques of correlation and regression analysis.
There are two types of relationships: simple and multiple. In a simple relationship,
there are two variables — an independent variable, also called an explanatory variable or a
predictor variable, and a dependent variable, also called a response variable. A simple
relationship analysis is called simple regression, and there is one independent variable that
is used to predict the dependent variable. For example, a manager may wish to see whether
the number of years a salesperson has been working for the company has anything to do
with the amount of sales he or she makes. This type of study involves a simple relationship,
since there are only two variables: years of experience and amount of sales.
Predictions are made in all areas on a daily basis. Examples include weather forecasting,
stock market analyses, sales predictions, crop predictions, gasoline price predictions, and
sports predictions. Some predictions are more accurate than others, due to the strength of
the relationship. That is, the stronger the relationship is between variables, the more
accurate the prediction is.
Scatter Plots and Correlation
In simple correlation and regression studies, the researcher collects data on two
numerical or quantitative variables to see whether a relationship exists between them. As
stated previously, the two variables in such a study are called the independent and dependent
variables. The independent variable is the variable in regression that can be controlled or
manipulated, while the dependent variable is the variable in regression that cannot be
controlled or manipulated.
Scatter Plot
A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the
independent variable x and the dependent variable y.
Correlation
To measure the strength of the linear relationship between two variables, a correlation
coefficient is used. There are several types of correlation coefficients. The one that will be
explained in this part is called the Pearson product moment correlation coefficient (PPMC),
named after the statistician Karl Pearson.
The correlation coefficient computed from the sample data measures the strength
and direction of a linear relationship between two quantitative variables. The symbol for the
sample correlation coefficient is r; the symbol for the population correlation coefficient is ρ
(the Greek letter rho).
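As an illustration, r can be computed directly from its definition (the ratio of the sum of cross-products of deviations to the square root of the product of the sums of squared deviations). The sketch below uses plain Python and a small made-up data set (hours studied vs. exam score); the numbers are hypothetical, not taken from this module.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson product moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross-products of deviations
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 65, 72, 80]
r = pearson_r(hours, scores)  # close to +1: strong positive linear relationship
```

A value of r near +1 or −1 indicates a strong linear relationship; a value near 0 indicates a weak or nonexistent linear relationship.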
Assumptions for the Pearson correlation coefficient:
1. The sample is a random sample.
2. The data pairs fall approximately on a straight line and are measured at the interval
or ratio level.
3. The variables have a joint normal distribution. (This means that given any specific value
of x, the y values are normally distributed, and given any specific value of y, the x values
are normally distributed.)
Correlation and Causation. Researchers must understand the nature of the linear
relationship between the independent and dependent variables. When a hypothesis test
indicates that a significant linear relationship exists between the variables, researchers must
consider the possibilities outlined below.
When the null hypothesis has been rejected for a specific α value, any of the following
five possibilities can exist.
1. There is a direct cause-and-effect relationship between the variables. That is, x causes y.
For example, water causes plants to grow, poison causes death, and heat causes ice to
melt.
2. There is a reverse cause-and-effect relationship between the variables. That is, y causes
x. For example, suppose a researcher believes excessive coffee consumption causes
nervousness, but the research fails to consider that the reverse situation may occur. That
is, it may be that an extremely nervous person craves coffee to calm his or her nerves.
3. The relationship between the variables may be caused by a third variable. For example, if
a statistician correlated the number of deaths due to drowning and the number of cans
of soft drink consumed daily during the summer, he or she would probably find a
significant relationship. However, the soft drink is not necessarily responsible for the
deaths, since both variables may be related to heat and humidity.
4. There may be a complexity of interrelationships among many variables. For example, a
researcher may find a significant relationship between students' high school grades and
college grades. But there probably are many other variables involved, such as IQ, hours
of study, influence of parents, motivation, age, and instructors.
5. The relationship may be coincidental. For example, a researcher may be able to find a
significant relationship between “the increase in the number of people who are
exercising and the increase in the number of people who are committing crimes”, but
common sense dictates that any relationship between these two values must be due to
coincidence.
Regression
In studying relationships between two variables, collect the data and then construct a
scatter plot. The purpose of the scatter plot is to determine the nature of the relationship.
The possibilities include a positive linear relationship, a negative linear relationship, a
curvilinear relationship, or no discernible relationship. The next step is to compute the
value of the correlation coefficient and to test the significance of the relationship. If the value
of the correlation coefficient is significant, the next step is to determine the equation of the
regression line, which is the data’s line of best fit. (Note: determining the regression line
when r is not significant and then making predictions using the regression line are
meaningless.)
Assumptions for simple linear regression:
1. One independent and one dependent variable, each measured at the continuous level.
2. There is a linear relationship between the two variables.
3. There are no significant outliers.
4. The observations are independent of one another.
5. The variances along the line of best fit remain similar as you move along the line, known
as homoscedasticity.
6. The residuals (errors) of the regression line are approximately normally distributed.
Fitting Lines to Data
We can describe the pattern of the plot with a simple mathematical function (like a
straight line). Then we can characterize the relation by providing the formula for the function
y = ax + b, where y is the dependent variable, a is the slope, b is the intercept, and x is the
independent variable.
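The slope and intercept of the line of best fit are given by the usual least-squares formulas: a = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − a·x̄. The sketch below illustrates them with made-up speed and braking-distance pairs (the example's actual table is not reproduced in this text, so these numbers are purely illustrative).

```python
def fit_line(x, y):
    """Least-squares estimates of slope a and intercept b for y = ax + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b

# Hypothetical speed (km/h) vs. braking distance (m) pairs
speed = [20, 40, 60, 80, 100]
distance = [10, 22, 38, 60, 85]
a, b = fit_line(speed, distance)
pred_70 = a * 70 + b  # predicted braking distance at 70 km/h
```

With these made-up data the fitted line is roughly y = 0.94x − 13.4, so the predicted braking distance at 70 km/h is about 52.4 m. Remember that such predictions are meaningful only when r has been shown to be significant.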
Example 11: In a study on speed control, it was found that the main reasons for regulations
were to make traffic flow more efficient and to minimize the risk of danger. One area
focused on in the study was the distance required to completely stop a vehicle at various
speeds. Use the following table to answer the questions.
8. Is r significant at α = 0.05?
9. If a relationship exists, what is the regression line equation to predict the braking
distance?
Solution:
7. To answer items 7 and 8, let us test whether a relationship exists, whether it is
significant, and whether the relationship is positive or negative.
8. Since a significant relationship exists, let us compute the regression line equation using SPSS.
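For readers without SPSS, the significance of r can also be checked by hand using the standard test statistic t = r·√(n − 2) / √(1 − r²), which has n − 2 degrees of freedom under H0: ρ = 0. The sketch below uses illustrative values (r = 0.997, n = 5), not the example's actual results, and compares t against the tabled two-tailed critical value.

```python
from math import sqrt

def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Suppose a sample of n = 5 pairs gave r = 0.997 (hypothetical values)
r, n = 0.997, 5
t = t_statistic(r, n)

# Two-tailed critical value for alpha = 0.05 and df = n - 2 = 3 is 3.182
# (from a t table); reject H0 when |t| exceeds it.
significant = abs(t) > 3.182
```

Here t is far beyond the critical value, so H0 would be rejected and r declared significant at α = 0.05, which is the same conclusion SPSS reports through its p-value.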
Multiple Regression
The purpose of multiple regression is similar to simple regression, but with more
predictor variables. Multiple regression attempts to predict a normal (or continuous)
dependent variable from a combination of several scale and/or dichotomous independent/
predictor variables. For example, suppose a nursing instructor wishes to see whether there
is a relationship between a student's grade point average (GPA), age, and score on the
board examination. The two independent variables are GPA, denoted x1, and age, denoted
x2; the dependent variable is the board examination score. The instructor will collect the
data for all three variables for a sample of
nursing students. Rather than conduct two separate simple regression studies, one using
the GPA and board exam scores and another using ages and board exam scores, the
instructor can conduct one study using multiple regression analysis with two independent
variables — GPA and ages — and one dependent variable — board exam scores.
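A minimal sketch of such a multiple regression, assuming NumPy is available and using made-up GPA, age, and board-score records (the instructor's data set above is hypothetical):

```python
import numpy as np

# Hypothetical records: GPA (x1), age (x2), board exam score (y)
gpa = np.array([2.5, 3.0, 3.2, 3.6, 3.9, 2.8])
age = np.array([21, 22, 24, 21, 23, 25])
score = np.array([70, 78, 82, 86, 92, 75])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(gpa), gpa, age])

# Least-squares fit of y' = b0 + b1*x1 + b2*x2
b0, b1, b2 = np.linalg.lstsq(X, score, rcond=None)[0]

# Prediction for a student with GPA 3.4 and age 22
predicted = b0 + b1 * 3.4 + b2 * 22
```

The fitted coefficients b1 and b2 describe how the predicted board score changes with GPA and age respectively, holding the other predictor fixed; statistical packages such as SPSS report the same coefficients along with their significance tests.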
(Interpreting the coefficient: some would label the strength as "perfect" for much larger
than typical values, "strong" for large or larger than typical, "moderate" for medium or
typical, and "weak" for small or smaller than typical values of the coefficient. R is used for
the regression coefficient and r for the correlation coefficient.)
Exercises 7:
A researcher collects the following data and determines that there is a significant
relationship between the age of a copy machine and its monthly maintenance cost. Predict
the monthly maintenance cost of a copy machine given its age.
8. Is r significant at α = 0.05?
9. If a relationship exists, what is the regression line equation to predict the copy
machine's monthly maintenance cost?