Correlation and Linear
Regression Analysis
By : Girma M.
Outline
• Introduction
• Scatter Plots and Correlation
• Regression
• Coefficient of Determination and Standard Error of the Estimate
• Multiple Regression and Non-Linear Regression (Optional)
Introduction
• In the previous chapters, two areas of inferential statistics (confidence
intervals and hypothesis testing), along with the comparison of means,
were explained.
• Another area of inferential statistics involves determining whether a
relationship exists between two or more numerical or quantitative variables.
• For example:
o A businessperson may want to know whether the volume of sales for a given
month is related to the amount of advertising the firm does that month.
o Educators are interested in determining whether the number of hours a student
studies is related to the student’s score on a particular exam.
o Medical researchers are interested in questions such as, Is caffeine related to
heart damage? or Is there a relationship between a person’s age and
his or her blood pressure?
o A zoologist may want to know whether the birth weight of a certain animal is
related to its life span.
o A forester may want to develop an allometric equation to estimate the
biomass of a certain species.
• These are only a few of the many questions that can be answered by using
the techniques of correlation and regression analysis.
Cont’d
Correlation is a statistical method used to determine whether a linear
relationship between variables exists.
Regression is a statistical method used to describe the nature of the
relationship between variables, that is, positive or negative, linear or
nonlinear.
The purpose of this chapter is to answer these questions statistically:
1. Are two or more variables linearly related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?
• To answer the first two questions, statisticians use a numerical measure to
determine whether two or more variables are linearly related and to
determine the strength of the relationship between or among the variables.
This measure is called a correlation coefficient.
• For example, there are many variables that contribute to heart disease,
among them lack of exercise, smoking, heredity, age, stress, and diet. Of
these variables, some are more important than others; therefore, a physician
who wants to help a patient must know which factors are most important.
Cont’d
• To answer the third question, note that there are two types of relationships:
simple and multiple.
• In a simple relationship, there are two variables: an independent variable, also
called an explanatory variable or a predictor variable, and a dependent
variable, also called a response variable.
• A simple relationship analysis is called simple regression; one
independent variable is used to predict the dependent variable.
• For example: income and expenditure; demand and supply.
• Simple relationships can also be positive or negative.
• A positive relationship exists when both variables increase or decrease at the
same time.
• For example: a person’s height and weight are related; and the relationship is
positive, since the taller a person is, generally, the more the person weighs.
• In a negative relationship, as one variable increases, the other variable
decreases, and vice versa.
• For example, if you measure the strength of people over 60 years of age, you
will find that as age increases, strength generally decreases. The word generally
is used here because there are exceptions.
Cont’d
• In a multiple relationship, called multiple regression, two
or more independent variables are used to predict one
dependent variable.
• For example: an educator may wish to investigate the relationship
between a student’s success in college and factors such as the number
of hours devoted to studying, the student’s GPA, and the student’s
high school background. This type of study involves several variables.
• Finally, to answer the fourth question: predictions are made daily in all
areas.
• For example:
weather forecasting,
stock market analyses,
sales predictions,
crop predictions,
gasoline price predictions,
sports predictions,
election predictions, etc.
• Some predictions are more accurate than others, due to the strength of
the relationship. That is, the stronger the relationship is between
variables, the more accurate the prediction is.
Scatter Plots and Correlation
• In simple correlation and regression studies, the researcher collects data on two
numerical or quantitative variables to see whether a relationship exists between the
variables.
• A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the
independent variable x and the dependent variable y.
• For example, if a researcher wishes to see whether there is a relationship between
number of hours of study and test scores on an exam, she must select a random
sample of students, determine the hours each studied, and obtain their grades on the
exam. A table can be made for the data, as shown here.
[Figure: scatter plot of exam score versus hours of study]
Cont’d
• Correlation deals with measuring how closely two variables are related; it
is measured using a correlation coefficient.
• The correlation coefficient (r), computed from the sample data, measures the
strength and direction of a linear relationship between two quantitative variables.
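For reference, the usual computational form of the sample (Pearson product-moment) correlation coefficient is sketched below. The symbol n, the number of (x, y) data pairs, is an assumption introduced here; the formula is the standard textbook one.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}},
\qquad -1 \le r \le 1
```

Values of r near +1 or -1 indicate a strong positive or negative linear relationship; values near 0 indicate little or no linear relationship.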
Cont’d
• Graphical representation of the relationship between the
correlation coefficient and the scatter plot
Cont’d
Example: Absences and Final Grades
Compute the value of the correlation coefficient for the data obtained in the study of the
number of absences and the final grade of the seven students in the statistics class in the
given table:
Number of fires, x:        72  69  58  47  84  62  57  45
Number of acres burned, y: 62  42  19  26  51  15  30  15
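As a minimal sketch, the correlation coefficient for the eight (x, y) pairs tabulated above can be computed with the standard computational formula. The function name `pearson_r` and the variable names are illustrative assumptions, not part of the slides.

```python
import math

# Data copied from the table above (x = fires, y = acres burned).
fires = [72, 69, 58, 47, 84, 62, 57, 45]
acres = [62, 42, 19, 26, 51, 15, 30, 15]


def pearson_r(x, y):
    """Sample correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(fires, acres)
print(f"r = {r:.3f}")
```

For these data r comes out at roughly 0.77, which points to a fairly strong positive linear relationship.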
Regression
• In studying relationships between two variables:
First, construct a scatter plot; it helps to determine the
nature of the relationship.
Then, compute the value of the correlation coefficient and
test the significance of the relationship.
The next step is to determine the equation of the regression line,
which is the data’s line of best fit.
• The purpose of the regression line is to enable the researcher to see
the trend and make predictions on the basis of the data.
• The equation that describes how y is related to x and an error term
is called the regression model.
• The simple linear regression model is:
y = A + Bx +e
A and B are called parameters of the model.
e is a random variable called the error term.
Cont’d
• Line of Best Fit: Best fit means that the sum of the squares of the
vertical distances from each point to the line is at a minimum.
• The sample statistics a and b provide estimates of the parameters A and B.
• Estimated regression equation: $\hat{y} = a + bx$
Determination of the Regression Line Equation by OLS
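• There are several methods for finding the equation of the regression line; here
we use the ordinary least squares (OLS) method.
• Ordinary Least Squares Criterion:
$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
• where $y_i$ is the observed value of the dependent variable for the $i$th
observation and $\hat{y}_i$ is the corresponding value predicted by the regression
line. The intercept can be written as $a = \bar{y} - b\bar{x}$.
• Rounding Rule for the Intercept and Slope: Round the values of a and b to three
decimal places.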
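The least squares criterion leads to closed-form estimates. A minimal sketch follows, assuming the standard slope formula b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) together with a = ȳ − b·x̄. It is applied to the fires and acres-burned data tabulated in the correlation example; the function name `fit_line` is hypothetical.

```python
# Data copied from the table in the correlation example.
fires = [72, 69, 58, 47, 84, 62, 57, 45]   # x
acres = [62, 42, 19, 26, 51, 15, 30, 15]   # y


def fit_line(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # Slope from the standard computational formula.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept from a = y-bar - b * x-bar.
    a = sum_y / n - b * (sum_x / n)
    return a, b


a, b = fit_line(fires, acres)
# Apply the rounding rule: three decimal places.
print(f"y-hat = {a:.3f} + {b:.3f}x")
```

The slope is positive, which agrees in sign with the positive correlation in these data.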
Example: Absences and Final Grades
• Find the equation of the regression line for the number of absences and
final grade data.
Cont’d
Important points:
1. The sign of the correlation coefficient and the sign of the slope of the
regression line will always be the same. That is, if r is positive, then b will be
positive; if r is negative, then b will be negative.
2. The regression line can be used to make predictions for the dependent variable.
• Use the equation of the regression line to predict the score of a student who has
20 absences.
$\hat{y} = 102.493 - 3.622x$
$\hat{y} = 102.493 - 3.622(20)$
$\hat{y} = 30.053$
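The substitution can be checked with a few lines of code. This is only a sketch: the constants are the intercept and slope quoted on the slide, and the function name `predict_grade` is hypothetical.

```python
# Intercept and slope of the fitted line quoted on the slide.
INTERCEPT = 102.493
SLOPE = -3.622


def predict_grade(absences):
    """Predicted final grade from the fitted regression line."""
    return INTERCEPT + SLOPE * absences


predicted = predict_grade(20)
print(f"Predicted grade for 20 absences: {predicted:.3f}")
```

Since 3.622 × 20 = 72.44, the predicted grade is 102.493 − 72.44 = 30.053.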
Exercise
• Farm Acreage: Is there a relationship between the number of farms in a state and
the acreage per farm? A random selection of states across the country, both eastern
and western, produced the following results. Can a relationship between these two
variables be concluded?
Coefficient of Determination and Standard
Error of the Estimate
• Several other measures are associated with the correlation
and regression techniques.
• They include the coefficient of determination, the standard error
of the estimate, and the prediction interval.
• Before these concepts can be explained, the different types of variation
associated with the regression model must be defined.
Types of Variation for the Regression Model
• Consider the following hypothetical regression model.
The procedure for finding the three types of variation
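A minimal sketch of the procedure follows, assuming the three types are the usual total variation Σ(y − ȳ)², explained variation Σ(ŷ − ȳ)², and unexplained variation Σ(y − ŷ)². It uses the fires and acres-burned data from the correlation example; the function name `variations` is hypothetical.

```python
# Data copied from the table in the correlation example.
fires = [72, 69, 58, 47, 84, 62, 57, 45]   # x
acres = [62, 42, 19, 26, 51, 15, 30, 15]   # y


def variations(x, y):
    """Return (total, explained, unexplained) variation for the OLS line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Step 1: fit the least squares line.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Step 2: predicted value for every x.
    y_hat = [a + b * xi for xi in x]
    # Step 3: the three sums of squares.
    total = sum((yi - mean_y) ** 2 for yi in y)
    explained = sum((yh - mean_y) ** 2 for yh in y_hat)
    unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return total, explained, unexplained


total, explained, unexplained = variations(fires, acres)
print(f"total={total:.2f} explained={explained:.2f} unexplained={unexplained:.2f}")
print(f"explained / total = {explained / total:.3f}")
```

The total variation equals the explained plus the unexplained variation, and the ratio explained / total is the coefficient of determination.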
Residual Plots
• A residual is the difference between the actual value of y and the
predicted value $\hat{y}$ for a given x value (i.e., the values of $y - \hat{y}$ are
called residuals).
• These values can be plotted with the x values, and the plot, called a
residual plot, can be used to determine how well the regression line
can be used to make predictions.
• The x values are plotted using the horizontal axis, and the residuals are
plotted using the vertical axis. Since the mean of the residuals is
always zero, a horizontal line with a y coordinate of zero is placed on
the y axis.
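As a sketch, residuals can be computed and checked for a zero mean as follows. The example reuses the fires and acres-burned data from the correlation example; the function name `residuals_for` is hypothetical, and plotting is left out so the block depends only on the standard library.

```python
# Data copied from the table in the correlation example.
fires = [72, 69, 58, 47, 84, 62, 57, 45]   # x
acres = [62, 42, 19, 26, 51, 15, 30, 15]   # y


def residuals_for(x, y):
    """Residuals y - y_hat for the least squares line through (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


residuals = residuals_for(fires, acres)
# Each (x, residual) pair is one point of the residual plot.
for xi, ei in zip(fires, residuals):
    print(f"x = {xi:3d}   residual = {ei:8.3f}")
print(f"mean residual = {sum(residuals) / len(residuals):.6f}")
```

The mean residual is zero up to floating-point rounding, which is why the reference line of the plot sits at zero.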
Residual Plot
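• To interpret a residual plot, you need to determine whether the residuals
form a pattern. When the residuals are scattered about the zero line with no
pattern and a roughly constant spread, the regression line is suitable for making
predictions; the constant-spread condition is called the homoscedasticity assumption.
• Plot the x and residual values for the above example.
[Figure: examples of residual plots; only plot (a) is suitable]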
Coefficient of Determination
• For the example, $r^2 = \frac{78.4}{92.8} = 0.845$. The term $r^2$ is usually
expressed as a percentage. So in this case, 84.5% of the total variation is
explained by the regression line using the independent variable.
• Another way to arrive at the value for $r^2$ is to square the correlation
coefficient. In this case, $r = 0.919$ and $r^2 = 0.845$, which is the same value
found by using the variation ratio.
Standard Error of the Estimate
Prediction Interval
• The standard error of the estimate can be used for constructing a
prediction interval (similar to a confidence interval) about a $\hat{y}$ value.
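A minimal sketch of both quantities follows, assuming the standard textbook definitions: the standard error of the estimate s_est = sqrt(Σ(y − ŷ)² / (n − 2)), and the prediction interval ŷ ± t · s_est · sqrt(1 + 1/n + (x₀ − x̄)² / Σ(x − x̄)²) with n − 2 degrees of freedom. The data are the fires and acres-burned pairs from the correlation example. The value x₀ = 60 and the function name are hypothetical, and the critical value 2.447 (95%, 6 degrees of freedom) is taken from a t table rather than computed.

```python
import math

# Data copied from the table in the correlation example.
fires = [72, 69, 58, 47, 84, 62, 57, 45]   # x
acres = [62, 42, 19, 26, 51, 15, 30, 15]   # y

T_CRIT_95_DF6 = 2.447   # t table value for 95% confidence, d.f. = n - 2 = 6


def prediction_interval(x, y, x0, t_crit):
    """Return (s_est, y_hat, lower, upper) for a new observation at x0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    # Standard error of the estimate: based on the unexplained variation.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_est = math.sqrt(sse / (n - 2))
    # Point prediction and margin of error at x0.
    y_hat = a + b * x0
    margin = t_crit * s_est * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    return s_est, y_hat, y_hat - margin, y_hat + margin


s_est, y_hat, lower, upper = prediction_interval(fires, acres, 60, T_CRIT_95_DF6)
print(f"s_est = {s_est:.3f}")
print(f"95% prediction interval at x = 60: {lower:.2f} < y < {upper:.2f}")
```

The interval is wide here because only eight data pairs are available and the unexplained variation is large; a negative lower bound would in practice be truncated at zero for a quantity such as acres burned.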
Exercise
• Driver’s Age and Accidents: A study is conducted to determine the
relationship between a driver’s age and the number of accidents he or
she has over a 1-year period. The data are shown here.
a. Draw the scatter plot.
b. Compute the correlation coefficient and interpret its value.
c. Fit the regression line and predict the number of accidents for a driver
who is 28 years old.
d. Draw the residual plot.
e. Compute the coefficient of determination.
f. Construct a 95% prediction interval for the number of accidents.