Correlation and Regression
Objectives

Determine whether a relationship exists using a scatter plot

Test the hypothesis

Determine the equation of the regression line and compute the coefficient of determination
Introduction
In Part 1, we discussed a non-parametric test (chi-square) and parametric tests (t-test and
ANOVA). Another area of inferential statistics involves determining whether a
relationship exists between two or more numerical or quantitative variables. For example, a
businessperson may want to know whether the volume of sales for a given month is related
to the amount of advertising the firm does that month. Educators are interested in
determining whether the number of hours a student studies is related to the student’s score
on a particular exam. Medical researchers are interested in questions such as “Is caffeine
related to heart damage?” or “Is there a relationship between a person’s age and his or her
blood pressure?” These are only a few of the many questions that can be answered by using
the techniques of correlation and regression analysis.

Correlation is a statistical method used to determine whether a linear relationship
between variables exists. Regression is a statistical method used to describe the nature of
the relationship between variables, that is, positive or negative, linear or nonlinear.

There are two types of relationships: simple and multiple. In a simple relationship,
there are two variables: an independent variable, also called an explanatory variable or a
predictor variable, and a dependent variable, also called a response variable. A simple
relationship analysis is called simple regression, and there is one independent variable that
is used to predict the dependent variable. For example, a manager may wish to see whether
the number of years the salespeople have been working for the company has anything to do
with the amount of sales they make. This type of study involves a simple relationship, since
there are only two variables: years of experience and amount of sales. In a multiple
relationship, two or more independent variables are used to predict one dependent variable.

Simple relationships can also be positive or negative. A positive relationship exists
when both variables increase or decrease at the same time. For instance, a person’s height
and weight are related, and the relationship is positive, since the taller a person is,
generally, the more the person weighs. In a negative relationship, as one variable increases,
the other variable decreases, and vice versa. For example, if you measure the strength of
people over 60 years of age, you will find that as age increases, strength generally decreases.
The word “generally” is used here because there are exceptions.

Predictions are made daily in all areas. Examples include weather forecasting,
stock market analyses, sales predictions, crop predictions, gasoline price predictions, and
sports predictions. Some predictions are more accurate than others because of the strength of
the relationship. That is, the stronger the relationship between the variables, the more
accurate the prediction.
Scatter Plots and Correlation
In simple correlation and regression studies, the researcher collects data on two
numerical or quantitative variables to see whether a relationship exists between them. As
stated previously, the two variables for this study are called the independent and dependent
variables. The independent variable is the variable in regression that can be controlled or
manipulated, while the dependent variable is the variable in regression that cannot be
controlled or manipulated.

Scatter Plot

A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the
independent variable (x) and the dependent variable (y). The scatter plot is a visual way to
describe the nature of the relationship between the independent and dependent variables.
The accompanying figure is a scatter plot. We use a scatter plot to get a sense of whether or
not it is appropriate to use correlation coefficients and, later, regression.

Correlation

To measure the strength of the linear relationship between two variables, a correlation
coefficient is used. There are several types of correlation coefficients. The one that will be
explained in this part is called the Pearson product moment correlation coefficient (PPMC),
named after statistician Karl Pearson.

The correlation coefficient computed from the sample data measures the strength
and direction of a linear relationship between two quantitative variables. The symbol for the
sample correlation coefficient is r. The symbol for the population correlation coefficient is
ρ (Greek letter rho).

The range of the correlation coefficient is from -1 to +1. If there is a strong positive
linear relationship between the variables, the value of r will be close to +1. If there is a
strong negative linear relationship between the variables, the value of r will be close to -1.
When there is no linear relationship between the variables or only a weak relationship, the
value of r will be close to 0. The direction of the line is given by the sign of the correlation
coefficient: “+” means positively correlated (sloping upward) and “-” means negatively
correlated (sloping downward).
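
For reference, one common computational form of the sample correlation coefficient is sketched below. It is the standard textbook formula rather than something stated in this part; n is assumed to be the number of data pairs, and each sum runs over all pairs (x, y).

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```

The numerator carries the sign of the relationship, and the denominator scales the result so that it always lies between -1 and +1.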

Assumptions for the Correlation Coefficient

1. The sample is a random sample.

2. The data pairs fall approximately on a straight line and are measured at the interval
or ratio level.

3. The variables have a joint normal distribution. (This means that given any specific value
of x, the y values are normally distributed; and given any specific value of y, the x values are
normally distributed.)

Hypothesis testing for Correlation

H0 : ρ = 0 (the population correlation is equal to zero; there is no linear relationship).

H1 : ρ ≠ 0 (the population correlation is not equal to zero; there is a linear relationship).

Correlation and Causation. Researchers must understand the nature of the linear
relationship between the independent and dependent variables. When a hypothesis test
indicates that a significant linear relationship exists between the variables, researchers must
consider the possibilities outlined below.

Possible Relationships Between Variables

When the null hypothesis has been rejected for a specific α value, any of the following
five possibilities can exist.

1. There is a direct cause-and-effect relationship between the variables. That is, x causes y.
For example, water causes plants to grow, poison causes death, and heat causes ice to
melt.

2. There is a reverse cause-and-effect relationship between the variables. That is, y causes
x. For example, suppose a researcher believes excessive coffee consumption causes
nervousness, but the researcher fails to consider that the reverse situation may occur. That
is, it may be that an extremely nervous person craves coffee to calm his or her nerves.

3. The relationship between the variables may be caused by a third variable. For example, if
a statistician correlated the number of deaths due to drowning and the number of cans
of soft drink consumed daily during the summer, he or she would probably find a
significant relationship. However, the soft drink is not necessarily responsible for the
deaths, since both variables may be related to heat and humidity.

4. There may be a complexity of interrelationships among many variables. For example, a
researcher may find a significant relationship between students’ high school grades and
college grades. But there probably are many other variables involved, such as IQ, hours
of study, influence of parents, motivation, age, and instructors.

5. The relationship may be coincidental. For example, a researcher may be able to find a
significant relationship between “the increase in the number of people who are
exercising and the increase in the number of people who are committing crimes”, but
common sense dictates that any relationship between these two values must be due to
coincidence.

Regression
In studying relationships between two variables, collect the data and then construct a
scatter plot. The purpose of the scatter plot is to determine the nature of the relationship.
The possibilities include a positive linear relationship, a negative linear relationship, a
curvilinear relationship, or no discernible relationship. The next step is to compute the
value of the correlation coefficient and to test the significance of the relationship. If the value
of the correlation coefficient is significant, the next step is to determine the equation of the
regression line, which is the data’s line of best fit. (Note: determining the regression line
when r is not significant and then making predictions using the regression line are
meaningless.)

Line of Best Fit

The figure shows a scatter plot and several lines that can be drawn on the graph near the
points. Given the scatter plot, you must be able to draw the line of best fit. Best fit means
that the sum of the squares of the vertical distances from each point to the line is at a
minimum. The reason you need a line of best fit is that the values of y will be predicted from
the values of x; hence, the closer the points are to the line, the better the fit and the
prediction will be.

Assumptions of a Simple Regression (R)

1. There is one independent variable and one dependent variable, each measured at the
continuous level.

2. There should be a linear relationship between your dependent and independent variables.

3. There should be no autocorrelation; that is, the observations should be independent.

4. There should be no significant outliers.

5. The variances along the line of best fit remain similar as you move along the line, known
as homoscedasticity.

6. The residuals (errors) of the regression line are approximately normally distributed.
Fitting lines to Data

We can describe the pattern of the plot with a simple mathematical function (like a
straight line). Then we can characterize the relation by providing the formula for the function
y = ax + b, where y is the dependent variable, a is the slope, b is the intercept, and x is the
independent variable.
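
Under the least-squares criterion described in Line of Best Fit, the slope and intercept of the line have closed forms. The standard textbook expressions are sketched here, taking a as the slope and b as the intercept (the roles these letters play in the worked example's regression line y = ax + b). The symbol n for the number of data pairs and the bars for sample means are notation assumed for this sketch.

```latex
a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
b = \bar{y} - a\,\bar{x}
```

The second expression says that the fitted line always passes through the point of means, (x̄, ȳ).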

Example 11: In a study on speed control, it was found that the main reasons for regulations
were to make traffic flow more efficient and to minimize the risk of danger. An area that
was focused on in the study was the distance required to completely stop a vehicle at various
speeds. Use the following table to answer the questions.

MPH     Braking distance (feet)
20      20
30      45
40      81
50      133
60      205
80      411

Assume MPH is going to be used to predict stopping distance.

1. Which of the two variables is the independent variable?

2. Which is the dependent variable?

3. What type of variable is the independent variable?

4. What type of variable is the dependent variable?

5. Construct a scatter plot for the data.

6. Is there a linear relationship between the two variables?

7. Is the relationship positive or negative?

8. Is r significant at α = 0.05?

9. If a relationship exists, what is the regression line equation to predict the braking
distance?

Solution:

1. The independent variable is miles per hour (MPH).

2. The dependent variable is braking distance (feet).


3. Miles per hour is a continuous quantitative variable.

4. Braking distance is a continuous quantitative variable.

5. The scatter plot can be drawn in the positive quadrant of the Cartesian plane, with the
independent variable (MPH) on the x-axis and the dependent variable (braking distance)
on the y-axis.

6. Based on the scatter plot, there might be a linear relationship between the two variables,
but there is a bit of a curve in the data.
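
One quick way to see the curve without the plot is to compare the slope between each pair of consecutive points. The sketch below uses only the table values; the function name segment_slopes is an illustrative choice, not part of the original example.

```python
# Braking-distance data from the example table (speed in MPH, distance in feet).
mph = [20, 30, 40, 50, 60, 80]
feet = [20, 45, 81, 133, 205, 411]


def segment_slopes(xs, ys):
    """Slope (rise over run) between each pair of consecutive points."""
    return [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))
    ]


slopes = segment_slopes(mph, feet)
for speed, slope in zip(mph[1:], slopes):
    print(f"up to {speed} MPH: {slope:.1f} feet per MPH")

# On a perfectly straight line every segment would have the same slope;
# slopes that keep increasing indicate an upward curve.
is_curving_up = all(later > earlier for earlier, later in zip(slopes, slopes[1:]))
print("slopes keep increasing:", is_curving_up)
```

If the segment slopes were roughly constant, a straight line would describe the data well; a steady increase suggests the straight line is only an approximation.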

7-8. To answer items 7 and 8, let us test whether a relationship exists, whether it is
significant, and whether it is positive or negative.

Solution:

Step 1. State the hypotheses and identify the claim.

H0: ρ = 0 (the correlation is equal to zero).
H1: ρ ≠ 0 (the correlation is not equal to zero).
The level of significance is α = 0.05.

Step 2. Compute the test value (using the PPMC, or Pearson correlation).

The Correlations output shows that the correlation coefficient is r = .965.
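
The coefficient can also be checked by hand from the table. The following is a minimal sketch of the Pearson calculation in plain Python, not the SPSS procedure used in the example; the function name pearson_r is an illustrative choice. The result should agree with the reported value up to rounding.

```python
from math import sqrt

# Braking-distance data from the example table (speed in MPH, distance in feet).
mph = [20, 30, 40, 50, 60, 80]
feet = [20, 45, 81, 133, 205, 411]


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(mph, feet)
print(f"r = {r:.4f}")
```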

Step 3. Find the p-value.

The p-value is located in the Sig. (2-tailed) entry of the output, which is p = .002.

Step 4. Make a decision.

Since p = .002 is less than α = 0.05, we reject the null hypothesis.
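
The example reads the p-value from SPSS output. An equivalent check by hand, sketched here as an alternative and not as the method used in the example, converts r to a t statistic with n - 2 degrees of freedom and compares it with a tabled critical value. The critical value 2.776 (two-tailed, α = 0.05, 4 degrees of freedom) is taken from a standard t table and is not given in the original text.

```python
from math import sqrt


def t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


r = 0.965           # correlation coefficient reported in Step 2
n = 6               # number of (MPH, braking distance) pairs in the table
t_critical = 2.776  # assumed: two-tailed t-table value for alpha = 0.05, df = 4

t = t_statistic(r, n)
reject_h0 = abs(t) > t_critical
print(f"t = {t:.2f}, df = {n - 2}, reject H0: {reject_h0}")
```

Both routes lead to the same decision here: the statistic is well beyond the critical value, so the null hypothesis is rejected.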

Step 5. Summarize the result.

There is enough evidence that the correlation is not equal to zero. The correlation
coefficient, r = .965, is close to +1, which means that there is a strong positive linear
relationship between the two variables: higher speeds are associated with longer braking
distances. The strong relationship suggests that braking distance can be predicted well from
MPH, so we can continue with regression.

9. Since a relationship exists, let us compute the regression line equation using SPSS.

In the MODEL SUMMARY table, notice that in bivariate regression R is the same as the r value
of the correlation coefficient.

The regression coefficient, which is the slope of the best-fit line or regression line, can
be found in the COEFFICIENTS table in the Unstandardized B column. The slope or regression
coefficient (a) has a value of 6.45, and the constant, or intercept (b), in our regression
line formula y = ax + b has a value of −151.90.

It is therefore necessary to identify the IV and the DV among our variables. In our example
the DV is the braking distance and the IV is the MPH. The predicted braking distance is
y = 6.45x − 151.90. This equation means that braking distance can be predicted by multiplying
the MPH by 6.45 and then subtracting 151.90 feet.
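
The regression line can be recomputed directly from the table. This is a minimal least-squares sketch in plain Python, not the SPSS procedure; the function names and the 45 MPH query speed are illustrative assumptions. It also computes the coefficient of determination named in the objectives, using the usual definition of one minus the ratio of residual to total variation.

```python
# Braking-distance data from the example table (speed in MPH, distance in feet).
mph = [20, 30, 40, 50, 60, 80]
feet = [20, 45, 81, 133, 205, 411]


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: share of variation in y explained by the line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(mph, feet)
r2 = r_squared(mph, feet, slope, intercept)
print(f"y = {slope:.2f}x + ({intercept:.2f})")
print(f"R^2 = {r2:.3f}")

# Hypothetical query: a speed inside the observed 20-80 MPH range.
print(f"predicted at 45 MPH: {slope * 45 + intercept:.1f} feet")
```

Predictions are best kept inside the range of observed speeds. Because the intercept is negative, the fitted line returns negative distances at very low speeds, which have no physical meaning.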

Multiple Regression

The purpose of multiple regression is similar to that of simple regression, but with more
predictor variables. Multiple regression attempts to predict a normal (or continuous)
dependent variable from a combination of several scale and/or dichotomous independent
(predictor) variables. For example, suppose a nursing instructor wishes to see whether there
is a relationship between a student’s grade point average (GPA), age, and score on the
board examination. The two independent variables are GPA, denoted as x1, and age,
denoted as x2. The instructor will collect the data on all three variables for a sample of
nursing students. Rather than conduct two separate simple regression studies, one using
GPA and board exam scores and another using age and board exam scores, the
instructor can conduct one study using multiple regression analysis with two independent
variables (GPA and age) and one dependent variable (board exam score).

A multiple correlation coefficient R can also be computed to determine whether a significant
relationship exists between the independent variables and the dependent variable. Multiple
regression analysis is used when a statistician thinks there are several independent variables
contributing to the variation of the dependent variable. This analysis can then be used to
increase the accuracy of predictions for the dependent variable over one independent
variable alone.
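
To make the two-predictor case concrete, the sketch below fits score = b0 + b1·GPA + b2·age by solving the least-squares normal equations in plain Python. This is one standard way to obtain the coefficients, not necessarily what a statistics package does internally. The student records are made up for illustration, and all function and variable names are assumptions. The code assumes the predictors are not perfectly collinear, since otherwise the system has no unique solution.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * unknowns = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution, from the last unknown to the first.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_two_predictors(x1, x2, y):
    """Least-squares b0, b1, b2 for y = b0 + b1*x1 + b2*x2 via the normal equations."""
    rows = [(1.0, u, v) for u, v in zip(x1, x2)]  # leading 1.0 carries the intercept
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * target for r, target in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Made-up records for five nursing students (illustration only, not real data).
gpa = [3.2, 2.7, 2.5, 3.4, 2.2]
age = [22, 27, 24, 28, 23]
board_score = [550, 570, 525, 670, 490]

b0, b1, b2 = fit_two_predictors(gpa, age, board_score)
print(f"score = {b0:.1f} + {b1:.1f}*GPA + {b2:.1f}*age")
```

Each coefficient is read as the expected change in the board score for a one-unit change in that predictor while the other predictor is held fixed.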

(Interpreting the coefficient: others label the strength as “Perfect” for much larger than
typical values, “Strong” for large or larger than typical, “Moderate” for medium or typical,
and “Weak” for small or smaller than typical values of coefficients. R is used for the
regression coefficient and r for the correlation coefficient.)

Exercise 7:

A researcher collects the following data and determines that there is a significant
relationship between the age of a copy machine and its monthly maintenance cost. Predict
the monthly maintenance cost of a copy machine given its age.

Machine   Age   Monthly cost
A         1     3100
B         2     3900
C         3     3500
D         4     4500
E         4     4650
F         6     5150

Answer the following questions.

1. Which of the two variables is the independent variable?

2. Which is the dependent variable?

3. What type of variable is the independent variable?

4. What type of variable is the dependent variable?

5. Construct a scatter plot for the data.

6. Is there a linear relationship between the two variables?

7. Is the relationship positive or negative?

8. Is r significant at α = 0.05?

9. If a relationship exists, what is the regression line equation to predict the copy machine
monthly maintenance cost?
